I need to index a large number of JSON files of various sizes (about 100k files, roughly 1.2 GB of text in total).
I am writing in Python and using the standard Elasticsearch client module. From it I use the helpers.bulk function in the following way:
from elasticsearch import Elasticsearch, helpers
import json
import os

es = Elasticsearch(ES_CLUSTER)

json_docs = []
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.json'):
        with open(filename) as open_file:
            json_docs.append(json.load(open_file))

helpers.bulk(es, json_docs, index=ES_INDEX, doc_type=ES_TYPE)
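By default helpers.bulk raises a BulkIndexError when documents are rejected, and its return value says how many actions actually went through. A first diagnostic step could be to collect the failures instead of losing them silently (a sketch reusing the es, json_docs, ES_INDEX and ES_TYPE names from above):

# Collect per-document errors instead of stopping at the first rejected chunk.
success, errors = helpers.bulk(es, json_docs,
                               index=ES_INDEX, doc_type=ES_TYPE,
                               raise_on_error=False)
print('indexed: %d, failed: %d' % (success, len(errors)))
for err in errors[:10]:  # print a few failures to see why they were rejected
    print(err)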
As a result, only 570 files end up indexed. Moreover, I noticed that the size of the index varies greatly between runs of the program, from 2-4 MB to 15 MB, even though the number of indexed files stays the same.
I check the size of the index with the query:
curl 'localhost:9200/_cat/indices?v'
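For a number that does not depend on how segments happen to be merged at that moment, the document count can also be checked directly (the index name below is just a placeholder):

curl 'localhost:9200/_cat/count/my_index?v'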
I clean it like this:
curl -XDELETE 'localhost:9200/_all/'
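This removes every index on the node; when only this one index needs to be cleared, the delete can be narrowed to it (index name again a placeholder):

curl -XDELETE 'localhost:9200/my_index'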
Apparently I am not using the best way to do the indexing.
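For reference, here is a sketch of the streaming variant I am considering instead: the files are read lazily by a generator and helpers.streaming_bulk sends them to Elasticsearch in chunks, so the whole 1.2 GB never has to sit in a single Python list. ES_CLUSTER, ES_INDEX and ES_TYPE are the same placeholders as above, and chunk_size=500 is an arbitrary choice.

import json
import os

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(ES_CLUSTER)

def generate_docs(path):
    # Yield one JSON document at a time instead of building a huge list.
    for filename in os.listdir(path):
        if filename.endswith('.json'):
            with open(os.path.join(path, filename)) as open_file:
                yield json.load(open_file)

ok_count = 0
failed = []
# streaming_bulk sends documents in chunks and yields an (ok, result)
# pair for every document, so nothing is lost silently.
for ok, result in helpers.streaming_bulk(es, generate_docs(os.getcwd()),
                                         index=ES_INDEX, doc_type=ES_TYPE,
                                         chunk_size=500,
                                         raise_on_error=False):
    if ok:
        ok_count += 1
    else:
        failed.append(result)

print('indexed: %d, failed: %d' % (ok_count, len(failed)))

Is this the right direction, or is there a more appropriate way to index this volume of data?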