If I have a CSV with (simple case) the header and one row of data, where some of the values are not there (null) like this:
name,surname,age
John,,32
and the relative catalog is like this:
MyDataTable:
Type: AWS::Glue::Table
DependsOn: CatalogDatabaseName
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: my_db
TableInput:
Name: my_data
TableType: EXTERNAL_TABLE
Parameters: {
"skip.header.line.count": "1",
"compressionType": "none",
"classification": "csv",
"columnsOrdered": "true",
"areColumnsQuoted": "true",
"delimiter": ",",
"typeOfData": "file",
"header": "true",
"inferSchema": false,
"quote": "\"",
"escape": "\""
}
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: name
Type: string
- Name: surname
Type: string
- Name: age
Type: int
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://somewhere/
SerdeInfo:
SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde
if I try to read the data through the catalog in this way (via Spark):
glueContext.getCatalogSource(
database = "my_db",
tableName = "my_data")
.getDynamicFrame()
.printSchema()
I can see that the column surname
is gone because there is no data for that specific column:
root
|-- name: string
|-- age: int
How can I avoid that Glue/AWS drops that column and generally any empty column?
We observed the same behaviour on our side. For the record—not all values should be empty in a column for the bug to appear—it is enough to have empty values in one partition.
This is not exactly a solution—rather a fix: I solved it by using Glue 1.0—not Glue 2.0.
The difference in Glue versions was the only difference between our Test and Prod environments and I could not grasp why the job was failing on prod, but running on test for a loooong time.