As someone new to AWS Glue, Amazon’s serverless data integration service built on Apache Spark, I rely heavily on AWS’s excellent documentation.
It’s easy to see how insufficient documentation can be a problem, but what about too much documentation?
Like, when there’s so much of it that one part contradicts another?
Background
One feature of AWS Glue is that it infers a schema from the data it ingests. In many cases, this works exceptionally well and minimizes the manual configuration data engineers need to do in order to process the data.
However, there are times when the input data has values that make it impossible to reliably infer the correct schema.
For example, consider the following JSON input data:
{
"id": 1,
"amount": 4
}
This might result in a Parquet schema that looks like this:
root
|-- id: int
|-- amount: int
Great, it was able to infer that both values were integer types. But, watch what happens when we ingest this record next:
{
"id": 2,
"amount": 3.50
}
We get this schema:
root
|-- id: int
|-- amount: choice
| |-- double
| |-- int
See how the amount went from being an int type to a choice type, which acts as a union of both double (for the 3.50 value) and int (for the 4 value) seen in that column?
We need to help disambiguate that part of the schema, and that’s where the ResolveChoice transform comes in.
Using the ResolveChoice GlueTransform
AWS Glue provides a GlueTransform called ResolveChoice that maps those choice type columns within a DynamicFrame, using a list of tuples that instructs the transform on how to resolve the choice.
I first discovered this transform when I was reading through the documentation on the DynamicFrame class. On that page is a section for resolveChoice, which documents the function. It lists a specs parameter that accepts a list of tuples controlling how each element in the list should be resolved.
The possible actions are: cast:type, make_cols, make_struct, and project:type.
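This post focuses on cast:type and project:type, but for completeness, here’s roughly what the other two actions look like in use. This is just a sketch, with the resulting shapes described the way the AWS docs explain them, and it assumes frame is a DynamicFrame whose amount column is a choice of int and double, like ours:
# Assumes frame is a DynamicFrame with a choice-type "amount" column.

# make_struct resolves the choice by turning the column into a struct
# with one field per observed type, roughly:
#   |-- amount: struct
#   |    |-- double: double
#   |    |-- int: int
struct_frame = frame.resolveChoice(specs=[("amount", "make_struct")])
struct_frame.printSchema()

# make_cols resolves the choice by splitting each type into its own
# column, e.g. amount_double and amount_int.
cols_frame = frame.resolveChoice(specs=[("amount", "make_cols")])
cols_frame.printSchema()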
At first, I assumed I’d want to use cast:type, since it sounded like it would do type casting, which is what I wanted. The documentation describes the action like this:
cast:type – Attempts to cast all values to the specified type. For example, cast:int.
Bingo! But, let’s not be too hasty. There’s more documentation for a reason; we should read it to make sure we’re not missing some important details. Let’s look at the description for project:type:
project:type – Resolves a potential ambiguity by projecting all the data to one of the possible data types. For example, if data in a column could be an int or a string, using a project:string action produces a column in the resulting DynamicFrame where all the int values have been converted to strings.
Hey, that sounds like what we want: “all the [values of one type] have been converted to [another type].” Maybe cast in AWS Glue doesn’t mean type casting the way we know it, and instead we want type projection. After all, we want to convert all our int values to double, and project:type’s documentation suggests it performs conversion.
Let’s give it a try:
projected_frame = frame.resolveChoice(
    specs=[
        ("amount", "project:double")
    ])
projected_frame.printSchema()
projected_frame.toDF().show()
This is what we get:
root
|-- id: int
|-- amount: double
+---+------+
| id|amount|
+---+------+
| 1| null|
| 2| 3.5|
+---+------+
Okay, so the schema is transformed the way we expect, with amount no longer being a choice but just a double. But look at the actual resulting data: our int amount didn’t get converted to 4.0 but instead became null.
This isn’t what we wanted. Let’s try cast:type and see if it’s any better:
cast_frame = frame.resolveChoice(
    specs=[
        ("amount", "cast:double")
    ])
cast_frame.printSchema()
cast_frame.toDF().show()
This gives us:
root
|-- id: int
|-- amount: double
+---+------+
| id|amount|
+---+------+
| 1| 4.0|
| 2| 3.5|
+---+------+
The schema’s correct, and so is the resulting data! Perfect.
But why didn’t project:type work as the documentation advertised?
Turns out, there’s more documentation for the ResolveChoice transform.
On that page, the description for cast:type is similar to the earlier definition:
cast: Allows you to specify a type to cast to (for example, cast:int).
But, check out the description for project:type:
project: Resolves a potential ambiguity by retaining only values of a specified type in the resulting DynamicFrame. For example, if data in a ChoiceType column could be an int or a string, specifying a project:string action drops values from the resulting DynamicFrame which are not type string.
Oof.
This clearly states that project:type doesn’t actually do any converting, and instead “drops values from the resulting DynamicFrame”, which is exactly what we observed when we originally tried project:double.
As I wrote earlier, it can be challenging when you don’t have enough documentation to understand how something works or how it’s meant to be used. But here we have too much documentation, and at first glance we can’t know which part to trust until we verify it.
This is an example of Segal’s law:
“A man with a watch knows what time it is. A man with two watches is never sure.”
This is why it’s useful, as a programmer or even in general, to practice a healthy amount of skepticism and validate your assumptions when you can.
If you, like me, landed on the first set of documentation, trusted project:type to convert from one type to another as it claimed, and didn’t verify the actual results, you’d have incorrect output data and could end up spending a lot of time trying to figure out why.
Demonstrating all of this with working code
In the famous words of LeVar Burton, “but you don’t have to take my word for it.”
Here’s a Jupyter Notebook that illustrates all of the above in PySpark:
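In condensed form, the demonstration boils down to something like this. It’s a sketch: it assumes a Glue-enabled PySpark session (a Glue job, interactive session, or notebook) where the awsglue libraries are available, and the S3 path is a placeholder for wherever the two example JSON records live:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Assumes a Glue-enabled environment where awsglue is importable.
glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical S3 location holding the two example JSON records.
frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/amounts/"]},
    format="json",
)

# Ingesting mixed int/double values yields the choice type on "amount".
frame.printSchema()

# project:double keeps only the double values; the int row becomes null.
projected_frame = frame.resolveChoice(specs=[("amount", "project:double")])
projected_frame.printSchema()
projected_frame.toDF().show()

# cast:double converts the int value to 4.0, which is what we wanted.
cast_frame = frame.resolveChoice(specs=[("amount", "cast:double")])
cast_frame.printSchema()
cast_frame.toDF().show()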
If you found this useful, please share it with your Data Engineering colleagues who might benefit.
And, if you have any questions that I might be able to answer, let me know.
Thanks,
Dossy Shiobara