I am trying to ingest a large amount of data into Elasticsearch using Python. For this purpose, I am using the bulk API helper, and I have written a function that looks something like this:
# imports used in this snippet
import logging
import tqdm
from elasticsearch.helpers import streaming_bulk

def __load(self, docs, index):
    try:
        # begin load
        logging.info("Begin indexing documents")
        progress = tqdm.tqdm(unit="docs", total=len(docs))
        successes = 0
        # load each document and update status
        for ok, action in streaming_bulk(
            client=self.es_client, index=index, actions=docs,
        ):
            progress.update(1)
            successes += ok
        logging.info("Indexed %d/%d documents" % (successes, len(docs)))
        logging.info("Data successfully loaded to " + index + " index")
        return "COMPLETED", len(docs)
    except:
        return "FAILED", 0
This is the part where the actual ingestion takes place:

for ok, action in streaming_bulk(
    client=self.es_client, index=index, actions=docs,
):
    progress.update(1)
    successes += ok
Now, each of my documents contains quite a large amount of data (I have a couple of fields that are big strings), and I have noticed that this ingestion process is quite slow. I am ingesting the data in chunks, and it takes a little more than a minute to index 10,000 documents.
Is there a more efficient way to do this? I am trying to make the process faster.
Please take a look at the Tune for indexing speed doc. An easy (though somewhat limited) way to parallelize might be to use parallel_bulk.
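As a sketch of what that could look like with your snippet (the es_client attribute, docs and index are taken from your code; thread_count and chunk_size are just starting points to tune, not recommended values), parallel_bulk can replace the streaming_bulk loop almost one for one:

from elasticsearch.helpers import parallel_bulk

successes = 0
# parallel_bulk is a lazy generator, so the loop below is what actually
# drives the indexing; each yielded item is an (ok, info) tuple per document
for ok, info in parallel_bulk(
    client=self.es_client,
    actions=docs,
    index=index,
    thread_count=4,   # worker threads sending bulk requests concurrently
    chunk_size=500,   # documents per bulk request
):
    if not ok:
        logging.warning("Failed to index a document: %s", info)
    successes += ok

Since your documents carry large string fields, chunk_size (or max_chunk_bytes) is usually the first knob worth experimenting with, so that each bulk request stays a reasonable size.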
If those measures show no effect, your indexing application itself can also be the bottleneck. If that's the case, you'll have to review your indexing pipeline architecture so that several indexing machines can run in parallel.
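Independent of the architecture question, one client-side measure from the indexing-speed doc that is cheap to try is temporarily relaxing the refresh interval and replica count for the duration of the load. A rough sketch, assuming the same self.es_client as above (recent client versions may want settings= instead of body=, and the restored values are only examples):

# relax refresh and replicas while bulk loading
self.es_client.indices.put_settings(
    index=index,
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

# ... run the streaming_bulk / parallel_bulk loop here ...

# restore the settings and force a refresh once the load is done
self.es_client.indices.put_settings(
    index=index,
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
self.es_client.indices.refresh(index=index)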