I am generating a large number of Elasticsearch documents with random content using Python and indexing them with elasticsearch-py.
Simplified working example (document with just one field):
    from elasticsearch import Elasticsearch
    from random import getrandbits

    es_client = Elasticsearch('https://elastic.host:9200')
    for i in range(1, 10000000):
        document = {'my_field': getrandbits(64)}
        es_client.index(index='my_index', document=document)
Since this makes one request per document, I tried to speed it up by sending chunks of 1000 documents each using the _bulk API. However, my attempts so far have been unsuccessful.
My understanding from the docs is that you can pass an iterable to bulk(), so I tried:
    from elasticsearch import Elasticsearch
    from random import getrandbits

    es_client = Elasticsearch('https://elastic.host:9200')
    document_list = []
    for i in range(1, 10000000):
        document = {'my_field': getrandbits(64)}
        document_list.append(document)
        if i % 1000 == 0:
            es_client.bulk(operations=document_list, index='my_index')
            document_list = []
but this results in the following error:
elasticsearch.BadRequestError: BadRequestError(400, 'illegal_argument_exception', 'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
OK, it seems I had mixed up two different functions: helpers.bulk() and Elasticsearch.bulk(). Either can be used to achieve what I intended, but they have slightly different signatures.
The helpers.bulk() function takes an Elasticsearch() object and an iterable containing the documents as parameters. The operation can be specified as an _op_type key in each document and can be one of index, create, delete, or update. Since _op_type defaults to index, we can omit it and simply pass the list of documents in this case. This works fine.
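A minimal sketch of what the helpers.bulk() variant described above might look like. The function names generate_documents and index_with_helpers are made up for illustration, and the import is placed inside the function only so that the document generator can be tried without the client installed. It assumes that chunk_size and index are forwarded by helpers.bulk() as keyword arguments, which is how I understand the helper to work:

```python
from random import getrandbits


def generate_documents(count):
    """Yield `count` documents, each with a single random 64-bit integer field."""
    for _ in range(count):
        # No _op_type key, so helpers.bulk() falls back to the default 'index'.
        yield {'my_field': getrandbits(64)}


def index_with_helpers(es_client, count, index='my_index', chunk_size=1000):
    """Index `count` random documents through helpers.bulk() (hypothetical wrapper)."""
    # Imported here so generate_documents() works without elasticsearch-py installed.
    from elasticsearch import helpers

    # helpers.bulk() consumes the iterable and splits it into requests of
    # `chunk_size` actions, so no manual chunking loop is needed.
    return helpers.bulk(es_client, generate_documents(count),
                        index=index, chunk_size=chunk_size)
```

With a client created as in the question, this would be called as index_with_helpers(es_client, 9999999). Passing a generator instead of a list avoids holding all documents in memory at once.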
The Elasticsearch.bulk() function can be used alternatively, but here the actions/operations are a mandatory part of the iterable and the syntax is slightly different. This means that instead of just a dict with the document contents, we need a dict specifying the action (in this case {"index": {}}) followed by a dict with the body, for each document. See also the _bulk documentation. This works fine as well.
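A sketch of the Elasticsearch.bulk() variant, assuming the operations list alternates one action dict and one document body per document. The helper names build_operations and index_with_client_bulk are hypothetical. Unlike the loop in the question, this version also sends the final, smaller chunk:

```python
from random import getrandbits


def build_operations(count):
    """Return a flat list alternating action dicts and document bodies."""
    operations = []
    for _ in range(count):
        operations.append({'index': {}})                  # action/metadata entry
        operations.append({'my_field': getrandbits(64)})  # document body
    return operations


def index_with_client_bulk(es_client, total, index='my_index', chunk_size=1000):
    """Send `total` random documents in requests of at most `chunk_size` documents."""
    remaining = total
    while remaining > 0:
        batch = min(chunk_size, remaining)
        # Each request carries 2 * batch list entries: one action plus one body per document.
        es_client.bulk(operations=build_operations(batch), index=index)
        remaining -= batch
```

Note that es_client only needs a bulk() method accepting operations and index here, so the chunking logic can be checked against a stub before pointing it at a real cluster.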
I assume that both of the above generate the same _bulk REST API statement internally, so they should be equivalent in the end.
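To illustrate what such a _bulk request body looks like, here is a rough sketch that renders a list of operations as newline-delimited JSON, the format the _bulk endpoint expects. The function to_ndjson is made up for this illustration and is not the client's actual serializer:

```python
import json


def to_ndjson(operations):
    """Render each operation as one JSON line, ending every line with a newline."""
    return ''.join(json.dumps(op) + '\n' for op in operations)


# One action entry followed by one document body, as in the _bulk format.
body = to_ndjson([{'index': {}}, {'my_field': 42}])
print(body)
```

Seen this way, the first line of the body has to be an action such as index, which would explain why a list containing only document bodies was rejected as a malformed action/metadata line.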