I have a snappy.parquet file with a schema like this:

{
    "type": "struct",
    "fields": [{
            "name": "MyTinyInt",
            "type": "byte",
            "nullable": true,
            "metadata": {}
        }
        ...
    ]
}

Update: parquet-tools reveals this:

############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8

When I try to run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase, I get the error:

11:16:21 Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')

The load into the external table works fine when every column is a varchar.

CREATE EXTERNAL TABLE [domain].[TempTable] 
    (
        ...
        MyTinyInt tinyint NULL,
        ...
        
    )
    WITH
    (
        LOCATION = ''' + @Location + ''',
        DATA_SOURCE = datalake,
        FILE_FORMAT = parquet_snappy
    )
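A workaround sometimes suggested for this kind of ClassCastException (an assumption on my part, not confirmed in this thread) is to declare the external column as int, matching the file's INT32 physical type, and only cast down to tinyint when moving rows into the warehouse table. The sketch below mirrors the DDL above; the warehouse table name is hypothetical:

```sql
CREATE EXTERNAL TABLE [domain].[TempTable]
    (
        ...
        MyTinyInt int NULL,  -- int matches the INT32 physical type in the file
        ...
    )
    WITH
    (
        LOCATION = ''' + @Location + ''',
        DATA_SOURCE = datalake,
        FILE_FORMAT = parquet_snappy
    )

-- Cast back down when loading from staging into the Synapse table
INSERT INTO [dbo].[WarehouseTable] (MyTinyInt)
SELECT CAST(MyTinyInt AS tinyint)
FROM [domain].[TempTable];
```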

The data will eventually be merged into a Synapse data warehouse table, where the column must be of type tinyint.

1 answer

atul:

I had the same issue and a good Azure support plan, so I got an answer from Microsoft:

There is a known bug in ADF for this particular scenario: the date type in Parquet should be mapped to data type date in SQL Server; however, ADF incorrectly converts this type to datetime2, which causes a conflict in PolyBase. I have confirmation from the core engineering team that this will be rectified with a fix by the end of November, published directly into the ADF product.

In the meantime, as a workaround:

  1. Create the target table with data type DATE as opposed to DATETIME2
  2. Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
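For step 1, a minimal sketch (table and column names are hypothetical, not from the original thread):

```sql
-- Hypothetical target table: use DATE, not DATETIME2, for Parquet date columns
CREATE TABLE [dbo].[StagingTarget]
(
    MyDate date NULL  -- declaring this as datetime2 triggers the PolyBase conflict
);
```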

However, even the Copy command doesn't work for me, so the only remaining workaround is Bulk insert; Bulk insert is extremely slow, though, which is a problem on big datasets.