I'm trying to copy data from an Oracle database to GCS using Airflow's OracleToGCSOperator:

copy_data = OracleToGCSOperator(
   task_id='copy_data_task',
   oracle_conn_id='my_conn',
   sql="SELECT * FROM MY_TABLE",
   bucket=MY_BUCKET,
   filename=FILEPATH.,
   export_format='PARQUET'
)

When executed I'm getting the error pyarrow.lib.ArrowTypeError: Expected bytes, got a 'LOB' object.

MY_TABLE has more than 800 columns and only 2 of them CLOB. I assume this is what GCS/parquet doesn't like.

Is there any way I convert the CLOB columns to strings at Operator level?

1

There are 1 answers

0
drake10k On BEST ANSWER

Easiest solution is to set the fetch_lobs parameter to false when creating the connection to oracle. This retrieves LOBs as strings. The limit is 1GB after which the LOBs will have to be streamed.

https://airflow.apache.org/docs/apache-airflow-providers-oracle/stable/connections/oracle.html https://python-oracledb.readthedocs.io/en/latest/api_manual/defaults.html#defaults.fetch_decimals