I'm trying to copy data from an Oracle database to GCS using Airflow's OracleToGCSOperator:
copy_data = OracleToGCSOperator(
task_id='copy_data_task',
oracle_conn_id='my_conn',
sql="SELECT * FROM MY_TABLE",
bucket=MY_BUCKET,
filename=FILEPATH.,
export_format='PARQUET'
)
When executed I'm getting the error pyarrow.lib.ArrowTypeError: Expected bytes, got a 'LOB' object.
MY_TABLE has more than 800 columns and only 2 of them CLOB. I assume this is what GCS/parquet doesn't like.
Is there any way I convert the CLOB columns to strings at Operator level?
Easiest solution is to set the fetch_lobs parameter to false when creating the connection to oracle. This retrieves LOBs as strings. The limit is 1GB after which the LOBs will have to be streamed.
https://airflow.apache.org/docs/apache-airflow-providers-oracle/stable/connections/oracle.html https://python-oracledb.readthedocs.io/en/latest/api_manual/defaults.html#defaults.fetch_decimals