I'm using AWS Glue as an ETL tool for building my data warehouse. In the example below I'm moving data from one MySQL database to another.
When I look at the target DB (SHOW PROCESSLIST), I notice that there is a separate INSERT statement for each individual record. How can I achieve some kind of bulk insert to speed up performance? Below is the relevant part of my Glue job (variables such as glue_catalog_connection, stg_database and stg_tablename are defined earlier in the script).
## Step 1: truncate the target staging table
```
# Pull url/user/password from the Glue catalog connection
source_jdbc_conf = glueContext.extract_jdbc_conf(glue_catalog_connection)

# Open a plain JDBC connection through the JVM (py4j) to run the truncate
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm, "java.sql.Connection")
java_import(sc._gateway.jvm, "java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm, "java.sql.DriverManager")
java_import(sc._gateway.jvm, "java.sql.SQLException")
conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url'), source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))

# qry_truncate is built earlier in the job (truncate of the staging table)
cstmt = conn.prepareCall(qry_truncate)
results = cstmt.execute()
conn.close()
```
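For completeness, qry_truncate is just the truncate statement for the staging table, built earlier in the job from the same variables; a rough reconstruction (the real string lives elsewhere in my script):

```
# Rough reconstruction; the actual statement is defined earlier in the job
qry_truncate = "TRUNCATE TABLE {}.{}".format(stg_database, stg_tablename)
```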
## Step 2: data flow
```
# Read the source view from the Glue Data Catalog
SRC_view_LZ = glueContext.create_dynamic_frame.from_catalog(
    database=glue_catalog_name,
    table_name=glue_catalog_lz_viewname,
    transformation_ctx="SRC_view_LZ",
)

# Add the load timestamp column (Glue Studio gs_now transform)
DC_InsertDate = SRC_view_LZ.gs_now(colName="etl_insertdate")

# Write to the MySQL staging table via the catalog connection
DST_STG = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DC_InsertDate,
    catalog_connection=glue_catalog_connection,
    connection_options={
        "dbtable": stg_tablename,
        "database": stg_database,
        "batchsize": "10000",
        "bulkSize": "1000"
    },
    transformation_ctx="DST_STG"
)
```
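The only workaround I can think of (untested so far) is to drop down to the plain Spark DataFrame JDBC writer and ask the MySQL driver to rewrite batched inserts into multi-row INSERTs via the rewriteBatchedStatements=true URL flag. A rough sketch, reusing source_jdbc_conf from step 1 (whether the database name needs to be appended to the URL depends on how the Glue connection is defined):

```
# Untested sketch: plain Spark JDBC write instead of write_dynamic_frame.
# rewriteBatchedStatements=true is a MySQL Connector/J flag that turns
# batched inserts into multi-row INSERT statements.
target_url = source_jdbc_conf.get('url') + "?rewriteBatchedStatements=true"  # append /<db> first if the URL lacks it

(DC_InsertDate.toDF()
    .write
    .format("jdbc")
    .option("url", target_url)
    .option("dbtable", stg_database + "." + stg_tablename)
    .option("user", source_jdbc_conf.get('user'))
    .option("password", source_jdbc_conf.get('password'))
    .option("batchsize", "10000")
    .mode("append")
    .save())
```

I would prefer to keep the DynamicFrame writer if it can be made to batch properly.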
I also tried using the Data Catalog table as the target instead of the JDBC connection options; that gives more or less the same performance.
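For reference, that attempt looked roughly like this (glue_catalog_stg_tablename is a placeholder for the staging table's catalog entry):

```
# Roughly what the Data Catalog target attempt looked like
DST_STG_catalog = glueContext.write_dynamic_frame.from_catalog(
    frame=DC_InsertDate,
    database=glue_catalog_name,
    table_name=glue_catalog_stg_tablename,
    transformation_ctx="DST_STG_catalog",
)
```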