I need to index a large number of JSON files of varying sizes: about 100k files, roughly 1.2 GB of text in total.

I am writing in Python and using the standard Python client for Elasticsearch. From it I use the helpers.bulk function as follows:

    import json
    import os

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(ES_CLUSTER)
    json_docs = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.json'):
            with open(filename) as open_file:
                json_docs.append(json.load(open_file))
    helpers.bulk(ES_INDEX, ES_TYPE, json_docs)
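For context, the documented call form of the helper in elasticsearch-py is helpers.bulk(client, actions), where each action is a dict carrying the target index, type, and document body. A minimal sketch of that form (the ES_CLUSTER value and the index/type names below are placeholders, not my actual configuration):

    import json
    import os

    from elasticsearch import Elasticsearch, helpers

    ES_CLUSTER = ["http://localhost:9200"]  # placeholder for the cluster address
    es = Elasticsearch(ES_CLUSTER)

    # Build one bulk action per JSON file; index/type names are placeholders.
    actions = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.json'):
            with open(filename) as open_file:
                actions.append({
                    "_index": "my_index",
                    "_type": "my_type",
                    "_source": json.load(open_file),
                })

    helpers.bulk(es, actions)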

As a result, only 570 files get indexed. I also noticed that the size of the index varies greatly between runs, from 2-4 MB to 15 MB, even though the number of indexed files stays the same.

I check the index size with:

 curl 'localhost:9200/_cat/indices?v' 

I clear the index like this:

 curl -XDELETE 'localhost:9200/_all/' 

Apparently I am not indexing in the best way.

    2 answers

    The default indexing queue in Elasticsearch is limited: you are trying to feed it too many records at once. It is a bulk API, but it works a little differently.

    You need to send batches of a few hundred records at a time (the exact size is chosen experimentally):

        from datetime import datetime
        import json
        import os

        from elasticsearch import Elasticsearch, helpers

        es = Elasticsearch(ES_CLUSTER)
        json_docs = []
        i = 0
        bulkSize = 500
        for filename in os.listdir(os.getcwd()):
            i += 1
            if filename.endswith('.json'):
                with open(filename) as open_file:
                    json_docs.append(json.load(open_file))
                if len(json_docs) >= bulkSize:
                    print(i, "current file:", datetime.now(), filename)
                    try:
                        helpers.bulk(ES_INDEX, ES_TYPE, json_docs)
                    except Exception as error:
                        print(error)
                    json_docs = []
        # do not forget to send the remaining documents
        helpers.bulk(ES_INDEX, ES_TYPE, json_docs)
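    As an alternative sketch (not part of the original answer): the elasticsearch-py helpers can do the chunking themselves. helpers.bulk accepts a chunk_size argument, and helpers.streaming_bulk yields a per-document result, so failures can be logged without aborting the whole run. Assuming placeholders standing in for ES_CLUSTER, ES_INDEX and ES_TYPE from the question:

        from datetime import datetime
        import json
        import os

        from elasticsearch import Elasticsearch, helpers

        ES_CLUSTER = ["http://localhost:9200"]  # placeholders for the question's settings
        ES_INDEX, ES_TYPE = "my_index", "my_type"

        es = Elasticsearch(ES_CLUSTER)

        def generate_actions():
            """Yield one bulk action per .json file in the current directory."""
            for filename in os.listdir(os.getcwd()):
                if filename.endswith('.json'):
                    with open(filename) as open_file:
                        yield {
                            "_index": ES_INDEX,
                            "_type": ES_TYPE,
                            "_source": json.load(open_file),
                        }

        # streaming_bulk sends the actions in chunks of chunk_size and reports each result.
        for ok, result in helpers.streaming_bulk(es, generate_actions(),
                                                 chunk_size=500, raise_on_error=False):
            if not ok:
                print(datetime.now(), "failed:", result)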

    In our project (not Python) we also index a large amount of data. Highlights:

    1. We do not index through a client API; we use curl from bash.
    2. Our script converts the data into the bulk format and writes it to plain text files in chunks of 1000 documents (a sketch of this conversion is at the end of the answer).
    3. After all the data has been processed, we run this console script:
        files=(${1}*.txt)
        total=${#files[@]}
        count=0
        pstr="[=======================================================================]"
        echo "Start export to ElasticSearch ${total} files with data"
        for i in ${1}*.txt; do
            curl -XPOST http://elastic.domain.conm/_bulk --data-binary @${i} &>/dev/null
            count=$(( $count + 1 ))
            pd=$(( $count * 73 / $total ))
            printf "\r%3d.%1d%% %.${pd}s" $(( $count * 100 / $total )) $(( ($count * 1000 / $total) % 10 )) $pstr
        done
        printf "\n"
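    For illustration of point 2 above: the files posted to /_bulk are NDJSON, i.e. an action metadata line followed by the document source line for every document, with a trailing newline at the end of the request body. A minimal Python sketch of such a converter (the file naming, index and type names are made up, not the project's actual script):

        import json
        import os

        CHUNK = 1000  # documents per output file, as in point 2

        def write_bulk_files(src_dir, out_prefix, index="my_index", doc_type="my_type"):
            """Convert *.json files in src_dir into numbered NDJSON files for the _bulk API."""
            chunk, part = [], 0
            for filename in sorted(os.listdir(src_dir)):
                if not filename.endswith(".json"):
                    continue
                with open(os.path.join(src_dir, filename)) as f:
                    doc = json.load(f)
                # action metadata line, then the document source line
                chunk.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
                chunk.append(json.dumps(doc))
                if len(chunk) >= 2 * CHUNK:
                    part += 1
                    with open("%s%05d.txt" % (out_prefix, part), "w") as out:
                        out.write("\n".join(chunk) + "\n")  # _bulk requires a trailing newline
                    chunk = []
            if chunk:
                part += 1
                with open("%s%05d.txt" % (out_prefix, part), "w") as out:
                    out.write("\n".join(chunk) + "\n")

        write_bulk_files(".", "bulk_")  # produces bulk_00001.txt, bulk_00002.txt, ...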