I'm currently using BigQuery in a Django application for logging data such as events, and those are mostly request-based. Depending on the request, I insert the data into different BigQuery tables; for example, user login logs are kept in BigQuery.
I'm using bigquery.Client.insert_rows_json, but I only insert a single JSON row per request, and I want to optimize latency, cost and performance (and increase the rate limit).
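For context, my current code looks roughly like this (the table id and row contents are placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client()

def log_event(table_id: str, row: dict) -> None:
    # e.g. table_id = "my-project.logs.user_login" (placeholder), row is the event payload
    # One streaming-insert call per Django request, a single JSON row per call
    errors = bq_client.insert_rows_json(table_id, [row])
    if errors:
        # insert_rows_json returns a list of per-row error mappings
        raise RuntimeError(f"BigQuery insert failed: {errors}")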
I'm considering using the BigQuery Storage Write API with the default stream.
I wrote a working demo like this (the elided parts build stream_name, proto_schema and proto_rows for the target table):

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

...
write_client = bigquery_storage_v1.BigQueryWriteClient()
...
# Wrap the serialized protobuf rows and their schema into a single append request
request = types.AppendRowsRequest()
proto_data = types.AppendRowsRequest.ProtoData()
proto_data.rows = proto_rows
proto_data.writer_schema = proto_schema
request.proto_rows = proto_data
request.write_stream = stream_name  # the table's _default stream
requests = [request]
# append_rows takes an iterator of requests and returns a stream of responses
write_client.append_rows(iter(requests))
With this implementation, though, I'm creating a BigQueryWriteClient for every insert, and I'm also sending the writer_schema with every insert (because I insert into different tables depending on the request). I'm not sure this will be any better than insert_rows_json.
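One variation I've considered is keeping a single module-level client per Django process and only building the per-request parts inside the view; a rough sketch (default_stream and append_single_row are just illustrative names, and proto_rows / proto_schema come from the same elided code as in the demo above):

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

# One client per Django process instead of one per insert
write_client = bigquery_storage_v1.BigQueryWriteClient()

def default_stream(project: str, dataset: str, table: str) -> str:
    # Default stream name, as used in the Storage Write API Python samples
    return f"{write_client.table_path(project, dataset, table)}/_default"

def append_single_row(project, dataset, table, proto_rows, proto_schema):
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.rows = proto_rows
    # The writer_schema must accompany the first request on each connection,
    # so with one request per call it is still sent every time
    proto_data.writer_schema = proto_schema
    request = types.AppendRowsRequest(
        write_stream=default_stream(project, dataset, table),
        proto_rows=proto_data,
    )
    responses = write_client.append_rows(iter([request]))
    return next(iter(responses))  # wait for the single AppendRowsResponse

That removes the per-insert client construction, but it's still one AppendRows connection and one schema payload per row, so I'm not sure it changes the picture much.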
What's the most efficient way to insert many single rows, one by one, from multiple Django instances?
Note: up to 5 minutes of insertion latency is OK, but I want to avoid data loss and duplication.