How to avoid that AWS Glue DynamicFrame drops empty columns when read a CSV?

1.7k views Asked by At

If I have a CSV with (simple case) the header and one row of data, where some of the values are not there (null) like this:

name,surname,age
John,,32

and the relative catalog is like this:

MyDataTable:
  Type: AWS::Glue::Table
  DependsOn: CatalogDatabaseName
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: my_db
    TableInput:
      Name: my_data
      TableType: EXTERNAL_TABLE
      Parameters: {
        "skip.header.line.count": "1",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "true",
        "delimiter": ",",
        "typeOfData": "file",
        "header": "true",
        "inferSchema": false,
        "quote": "\"",
        "escape": "\""
      }
      StorageDescriptor:
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Columns:
          - Name: name
            Type: string
          - Name: surname
            Type: string
          - Name: age
            Type: int
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        Location: s3://somewhere/
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde

if I try to read the data through the catalog in this way (via Spark):

glueContext.getCatalogSource(
      database = "my_db",
      tableName = "my_data")
   .getDynamicFrame()
   .printSchema()

I can see that the column surname is gone because there is no data for that specific column:

root
|-- name: string
|-- age: int

How can I avoid that Glue/AWS drops that column and generally any empty column?

1

There are 1 answers

0
Denys On

We observed the same behaviour on our side. For the record—not all values should be empty in a column for the bug to appear—it is enough to have empty values in one partition.

This is not exactly a solution—rather a fix: I solved it by using Glue 1.0—not Glue 2.0.

The difference in Glue versions was the only difference between our Test and Prod environments and I could not grasp why the job was failing on prod, but running on test for a loooong time.