What options can be passed to AWS Glue DynamicFrame.toDF()?


The documentation for the toDF() method says that an options parameter can be passed to it, but it does not specify what those options can be (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html). Does anyone know of further documentation on this? I am specifically interested in passing in a schema when creating a DataFrame from a DynamicFrame.


There are 2 answers

amsh On BEST ANSWER

Unfortunately, there is not much documentation available, but some research and a reading of the DynamicFrame source code suggest the following:

  • The options available in toDF have more to do with the ResolveOption class than with toDF itself, as the ResolveOption class gives meaning to the parameters (please read the code).
  • The ResolveOption class takes a ChoiceType as a parameter.
  • The option examples in the documentation are similar to the specs for ResolveChoice, which also mention ChoiceType.
  • The options are then converted to a sequence and passed to the toDF function of _jdf.

My understanding, after looking at the specs, the toDF implementation of DynamicFrame, and toDF from Spark, is that we can't pass a schema when creating a DataFrame from a DynamicFrame; only minor column manipulations are possible.

That said, a possible approach is to obtain a DataFrame from the DynamicFrame and then manipulate it to change its schema.
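As a rough sketch of that approach, the helper below builds Spark SQL cast expressions that can be handed to DataFrame.selectExpr after calling toDF(). The helper name build_cast_exprs, the column names, and the target types are all made up for illustration; only toDF() and selectExpr are real calls, and the Glue part is left as a comment because it needs a running Glue job.

```python
def build_cast_exprs(columns, target_types):
    """Return one Spark SQL expression per column.

    Columns named in target_types are cast to the given Spark SQL type;
    all other columns are passed through unchanged.
    """
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks guard against dots or spaces in names
        if name in target_types:
            exprs.append(f"CAST({quoted} AS {target_types[name]}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


# Hypothetical columns and types, purely for illustration.
exprs = build_cast_exprs(["id", "price", "name"], {"id": "bigint", "price": "double"})

# In a Glue job (not run here), assuming a DynamicFrame named dyf:
#   df = dyf.toDF()
#   df = df.selectExpr(*build_cast_exprs(df.columns, {"id": "bigint", "price": "double"}))
```

This only changes column types. Renaming or restructuring columns would need further DataFrame operations.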

Zigglzworth On

The documentation is very unclear. It states:

options – A list of options. Specify the target type if you choose the Project and Cast action type. Examples include the following.

toDF([ResolveOption("a.b.c", "KeepAsStruct")])
toDF([ResolveOption("a.b.c", "Project", DoubleType())])

However, 'Cast' is not an allowed action from what I can tell, and ResolveOption is just the name of a tuple that you are expected to define yourself, following their attribute structure.

So here is an example of what to pass to DynamicFrame.toDF() in Python:

from awsglue.dynamicframe import DynamicFrame
from awsglue.gluetypes import *
from collections import namedtuple
# any other imports you need

# Define a named tuple called ResolveOption with attributes 'path', 'action', and 'target'
ResolveOption = namedtuple('ResolveOption', ['path', 'action', 'target'])

# Create a list of ResolveOption tuples.
# Useful when converting to a DataFrame and you need to project the data types for your schema,
# so you don't end up with unresolved JSON values like {int: 111, double: null}.
# action must be either "KeepAsStruct" or "Project"
# target should be a type such as StringType(), DoubleType(), etc.
ResolveOptions = [
    ResolveOption(path="columnname", action="Project", target=StringType()),
    # ... more ResolveOption entries
]

# Assuming you created a DynamicFrame named YourDynamicFrame earlier
YourDataFrame = YourDynamicFrame.toDF(ResolveOptions)

Tested and works. Hope this helps
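To catch mistakes before the job reaches Glue, the rules described above can be checked locally. This is only a sketch: check_resolve_options is a made-up helper, the rule that "Project" needs a target is my reading of the documentation sentence quoted earlier, and the string "DoubleType()" is a stand-in for a real Glue type object so the snippet runs without awsglue installed.

```python
from collections import namedtuple

# Same shape as the tuple used in the answer above.
ResolveOption = namedtuple("ResolveOption", ["path", "action", "target"])

# Actions that toDF() appears to accept, per the answer above.
ALLOWED_ACTIONS = ("KeepAsStruct", "Project")


def check_resolve_options(options):
    """Return the options as a list, or raise ValueError on the first bad one."""
    for opt in options:
        if opt.action not in ALLOWED_ACTIONS:
            raise ValueError(f"{opt.path}: action must be one of {ALLOWED_ACTIONS}")
        # Assumption: "Project" needs a target type, "KeepAsStruct" does not.
        if opt.action == "Project" and opt.target is None:
            raise ValueError(f"{opt.path}: 'Project' requires a target type")
    return list(options)


# Placeholder target; in a Glue job this would be a type such as DoubleType().
checked = check_resolve_options([
    ResolveOption("a.b.c", "KeepAsStruct", None),
    ResolveOption("price", "Project", "DoubleType()"),
])
```

In a Glue job, the checked list would then be passed straight to toDF().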