Hi Currently I have created a table schema in AWS Athena as follow
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the filed (processingstate) value contain comma as "Europe, Middle East, & Africa" which displace columns order.
So what would be the best way to read this file. Thanks
This is a common messy CSV file situation where certain values contain commas. The solution in Athena for this is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the url may change so just search for 'OpenCSVSerDe for Processing']
Following is a basic create table example provided. Based on your data you would have to ensure that the data type is specified correctly (eg string)
CREATE EXTERNAL TABLE test1 ( f1 string, s2 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\") LOCATION 's3://user-test-region/dataset/test1/'