Efficient Bulk Loading Options for Elasticsearch in Python

I am trying to ingest a large amount of data into Elasticsearch using Python. For this purpose, I am using the bulk API helper, and I have developed a function that looks something like this:

import logging

import tqdm
from elasticsearch.helpers import streaming_bulk


def __load(self, docs, index):
    try:
        # begin load
        logging.info("Begin indexing documents")
        progress = tqdm.tqdm(unit="docs", total=len(docs))
        successes = 0

        # send each document through the streaming bulk helper
        # and update the progress bar as results come back
        for ok, action in streaming_bulk(
                client=self.es_client, index=index, actions=docs,
        ):
            progress.update(1)
            successes += ok
        logging.info("Indexed %d/%d documents", successes, len(docs))
        logging.info("Data successfully loaded to %s index", index)

        return "COMPLETED", len(docs)
    except Exception:
        logging.exception("Indexing failed")
        return "FAILED", 0

This is the part where the actual ingestion takes place:

    for ok, action in streaming_bulk(
            client=self.es_client, index=index, actions=docs,
    ):
        progress.update(1)
        successes += ok

Now, each of my documents contains quite a large amount of data (a couple of the fields are big strings), and I have noticed that the ingestion process is quite slow. I am ingesting the data in chunks, and it takes a little over a minute to index 10,000 documents.

Is there a more efficient way to do this? I am trying to make the process faster.

1 Answer

Answered by ilvar:

Please take a look at the Tune for indexing speed doc. An easy (though somewhat limited) way to parallelize on the client side is to use the parallel_bulk helper instead of streaming_bulk.
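For example, here is a minimal sketch of swapping streaming_bulk for parallel_bulk; the client URL, index name, thread count, and chunk size are illustrative assumptions, not values from your setup:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es_client = Elasticsearch("http://localhost:9200")  # placeholder URL

def load_parallel(docs, index, thread_count=4, chunk_size=500):
    successes = 0
    # parallel_bulk returns a lazy generator of (ok, item) tuples,
    # so it has to be iterated for any indexing to happen
    for ok, item in parallel_bulk(
            client=es_client,
            index=index,
            actions=docs,
            thread_count=thread_count,
            chunk_size=chunk_size,
    ):
        successes += ok
    return successes

Increasing thread_count and chunk_size can help, but only up to the point where the cluster itself becomes the limit.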

If those measures show no effect, your indexing application itself can also be the bottleneck. If that's the case, you'll have to review your indexing pipeline architecture so that a few indexing machines can run in parallel.
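As a very rough single-machine illustration of that idea, the sketch below splits the documents across worker processes, each with its own client and its own bulk load; the URL, index name, and worker count are assumptions, and in a real pipeline each worker could just as well be a separate machine:

import multiprocessing

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

INDEX = "my-index"  # placeholder index name

def index_partition(partition):
    # each worker builds its own client; clients should not be shared across processes
    es = Elasticsearch("http://localhost:9200")  # placeholder URL
    successes = 0
    for ok, item in streaming_bulk(client=es, index=INDEX, actions=partition):
        successes += ok
    return successes

def index_in_processes(docs, workers=4):
    # round-robin an in-memory list of documents into one partition per worker
    partitions = [docs[i::workers] for i in range(workers)]
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(index_partition, partitions))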