As someone new to AWS Glue, Amazon’s serverless data integration service built on Apache Spark, I rely heavily on AWS’s excellent documentation.
It’s easy to see how insufficient documentation can be a problem, but what about too much documentation?
Like, when there’s so much of it that one part contradicts another?
Background
One feature of AWS Glue is that it infers a schema from the data it ingests. In many cases, this works exceptionally well and minimizes the manual configuration data engineers need to do in order to process the data.
However, there are times when the input data has values that make it impossible to reliably infer the correct schema.
For example, consider the following JSON input data:
{
"id": 1,
"amount": 4
}
This might result in a Parquet schema that looks like this:
root
|-- id: int
|-- amount: int
Great, it was able to infer that both values were integer types. But, watch what happens when we ingest this record next:
{
"id": 2,
"amount": 3.50
}
We get this schema:
root
|-- id: int
|-- amount: choice
| |-- double
| |-- int
See how the amount went from being an int type to a choice type, which acts as a union of both double (for the 3.50 value) and int (for the 4 value) seen in that column?
We need to help disambiguate that part of the schema, and that’s where the ResolveChoice transform comes in.
Using the ResolveChoice GlueTransform
AWS Glue provides a GlueTransform called ResolveChoice that maps those choice type columns within a DynamicFrame, using a list of tuples that instructs the transform on how to resolve the choice.
I first discovered this transform when I was reading through the documentation on the DynamicFrame class. On that page is a section for resolveChoice, which documents the function. It lists a specs parameter that accepts a list of tuples controlling how each element in the list should be resolved.
The possible actions are: cast:type, make_cols, make_struct, and project:type.
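This post focuses on cast:type and project:type, but for completeness, here’s roughly what the other two actions look like in use. This is just a sketch, with the resulting shapes described the way the AWS docs explain them, and it assumes frame is a DynamicFrame whose amount column is a choice of int and double, like ours:
# Assumes frame is a DynamicFrame with a choice-type "amount" column.

# make_struct resolves the choice by turning the column into a struct
# with one field per observed type, roughly:
#   |-- amount: struct
#   |    |-- double: double
#   |    |-- int: int
struct_frame = frame.resolveChoice(specs=[("amount", "make_struct")])
struct_frame.printSchema()

# make_cols resolves the choice by splitting each type into its own
# column, e.g. amount_double and amount_int.
cols_frame = frame.resolveChoice(specs=[("amount", "make_cols")])
cols_frame.printSchema()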
At first, I assumed I’d want to use cast:type, since it sounded like it would do type casting, which is what I wanted. The documentation describes the action like this:
cast:type – Attempts to cast all values to the specified type. For example, cast:int.
Bingo! But, let’s not be too hasty. There’s more documentation for a reason; we should read it to make sure we’re not missing some important details. Let’s look at the description for project:type:
project:type – Resolves a potential ambiguity by projecting all the data to one of the possible data types. For example, if data in a column could be an int or a string, using a project:string action produces a column in the resulting DynamicFrame where all the int values have been converted to strings.
Hey, that sounds like what we want: “all the [values of one type] have been converted to [another type].” Maybe cast in AWS Glue doesn’t mean type casting the way we know it, and instead we want type projection. After all, we want to convert all our int values to double, and project:type’s documentation suggests it performs conversion.
Let’s give it a try:
projected_frame = frame.resolveChoice(
    specs=[
        ("amount", "project:double")
    ])
projected_frame.printSchema()
projected_frame.toDF().show()
This is what we get:
root
|-- id: int
|-- amount: double
+---+------+
| id|amount|
+---+------+
| 1| null|
| 2| 3.5|
+---+------+
Okay, so the schema is transformed the way we expect, with amount no longer being a choice but just a double. But look at the actual resulting data: our int amount didn’t get converted to 4.0 but instead became null.
This isn’t what we wanted. Let’s try cast:type and see if it’s any better:
cast_frame = frame.resolveChoice(
    specs=[
        ("amount", "cast:double")
    ])
cast_frame.printSchema()
cast_frame.toDF().show()
This gives us:
root
|-- id: int
|-- amount: double
+---+------+
| id|amount|
+---+------+
| 1| 4.0|
| 2| 3.5|
+---+------+
The schema’s correct, and so is the resulting data! Perfect.
But why didn’t project:type work as the documentation advertised?
Turns out, there’s more documentation for the ResolveChoice transform.
On that page, the description for cast:type is similar to the earlier definition:
cast: Allows you to specify a type to cast to (for example, cast:int).
But, check out the description for project:type:
project: Resolves a potential ambiguity by retaining only values of a specified type in the resulting DynamicFrame. For example, if data in a ChoiceType column could be an int or a string, specifying a project:string action drops values from the resulting DynamicFrame which are not type string.
Oof.
This clearly states that project:type doesn’t actually do any converting, and instead “drops values from the resulting DynamicFrame”, which is exactly what we observed when we originally tried project:double.
As I wrote earlier, it can be challenging when you don’t have enough documentation to understand how something works or how it’s meant to be used. But here we have too much documentation, and at first glance we can’t know which part to trust until we verify it.
This is an example of Segal’s law:
“A man with a watch knows what time it is. A man with two watches is never sure.”
This is why it’s useful, as a programmer or even in general, to practice a healthy amount of skepticism and validate your assumptions when you can.
If you, like me, landed on the first set of documentation, trusted project:type to convert from one type to another as it claimed, and didn’t verify the actual results, you’d have incorrect output data and could end up spending a lot of time trying to figure out why.
Demonstrating all of this with working code
In the famous words of LeVar Burton, “but you don’t have to take my word for it.”
Here’s a Jupyter Notebook that illustrates all of the above in PySpark:
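In condensed form, the demonstration boils down to something like this. It’s a sketch: it assumes a Glue-enabled PySpark session (a Glue job, interactive session, or notebook) where the awsglue libraries are available, and the S3 path is a placeholder for wherever the two example JSON records live:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Assumes a Glue-enabled environment where awsglue is importable.
glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical S3 location holding the two example JSON records.
frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/amounts/"]},
    format="json",
)

# Ingesting mixed int/double values yields the choice type on "amount".
frame.printSchema()

# project:double keeps only the double values; the int row becomes null.
projected_frame = frame.resolveChoice(specs=[("amount", "project:double")])
projected_frame.printSchema()
projected_frame.toDF().show()

# cast:double converts the int value to 4.0, which is what we wanted.
cast_frame = frame.resolveChoice(specs=[("amount", "cast:double")])
cast_frame.printSchema()
cast_frame.toDF().show()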
If you found this useful, please share it with your Data Engineering colleagues who might benefit.
And, if you have any questions that I might be able to answer, let me know.
Thanks,
Dossy Shiobara