For subsequent data analysis I need to collect JSON API responses, each weighing ~2.7 MB when saved as text. There are 200-250 thousand of them. The question is how to store and read them - please suggest the approach you consider best.

My current solution is to use the gzip module (Python 3.6) to write one large compressed file that holds one JSON response per line. Each line is written and read iteratively (a line break "\n" is appended at the end), so the whole file never has to be loaded into RAM. Per response, 2.7 MB compresses down to about 150 KB this way. The question is: is there a more convenient solution in terms of read/write speed? The write side is shown below; a sketch of the reading side follows the code.

    import gzip
    import json

    from requests import get

    # url is defined elsewhere; each API response is appended as one compressed line
    with gzip.open('large_data.json.gz', 'wb') as outfile:
        while True:
            response = get(url).json()
            outfile.write(json.dumps(response).encode('utf-8') + b'\n')
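For completeness, a minimal sketch of the reading side, assuming the file was written by the code above (standard library only; the per-record processing is just a placeholder):

    import gzip
    import json

    # Iterate over the compressed file line by line, without loading it all into RAM
    with gzip.open('large_data.json.gz', 'rt', encoding='utf-8') as infile:
        for line in infile:
            record = json.loads(line)
            # ... process a single API response here ...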
  • You could store/process the data a bit more compactly as BSON. But if small size is the main concern, then of course compress. - Vladimir Gamalyan
  • "Optimal" means different things under different circumstances. The answer to "can it be done faster" is always yes - the question is how much time, effort and resources you are willing to spend on it. To make the question useful to someone else, try stating a more specific goal, for example: "here is the code (+ input data) that runs in X minutes on such-and-such hardware; how do I make it run in X/2 minutes on the same hardware (spinning disk, SSD) and software (Python version, OS)". Measurements showing where the bottleneck is - conversion to JSON, compression, or writing to disk - would not hurt either. - jfs
  • Mapping is not ...? - And
  • @And Try re-reading what you wrote: "Mapping is not ...?". Did you mean a transformation, or a topographic survey? What is anyone supposed to make of that? Did you want to bring up object-relational mapping, or do you just like the word and use it for something unclear? - Frank

1 answer

With your data volumes, it makes sense to think about using a Hadoop cluster and Apache Spark (usually part of a Hadoop cluster) for parallel data processing.

I would do the following:

  1. Convert the JSONs to Parquet on the fly (+ Snappy compression, which is very fast to decompress) and save them to HDFS (the distributed cluster file system). It is also worth combining files (by day, month or another key) - HDFS works much more efficiently with a smaller number of large files than with a huge number of small ones. A sketch of this step is shown after the list.
  2. To process the data in the Hadoop cluster you can use Apache Spark (it supports Scala, Python and Java) and/or Hive/Impala.
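A minimal sketch of step 1, assuming pandas and pyarrow are installed; the batch size, file names and the HDFS upload step are illustrative, not part of the original answer:

    import gzip
    import json

    import pandas as pd

    BATCH_SIZE = 1000  # illustrative: how many API responses go into one Parquet file

    def write_batch(records, batch_no):
        # Flatten a list of JSON dicts into a table and write it as Snappy-compressed Parquet
        df = pd.json_normalize(records)
        df.to_parquet(f'batch_{batch_no:05d}.parquet', compression='snappy')

    records, batch_no = [], 0
    with gzip.open('large_data.json.gz', 'rt', encoding='utf-8') as infile:
        for line in infile:
            records.append(json.loads(line))
            if len(records) >= BATCH_SIZE:
                write_batch(records, batch_no)
                records, batch_no = [], batch_no + 1
    if records:
        write_batch(records, batch_no)

The resulting files can then be copied to HDFS (for example with hdfs dfs -put) and read in parallel from Spark, e.g. spark.read.parquet('hdfs:///path/to/batches/'), which corresponds to step 2.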

A Hadoop cluster can be built from relatively cheap hardware - it scales horizontally better than vertically, i.e. more servers with fewer resources (RAM, CPU, IO) are better than a few very powerful servers. Servers with a lot of RAM (512 GiB+) are not recommended, because almost all Hadoop components are written in Java, and garbage collection on JVMs with very large heaps can cause peak load spikes.

P.S. gzip, while it gives a good compression ratio, is quite slow for both compression and decompression.
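If in doubt, the trade-off is easy to measure on your own data; a standard-library-only sketch (the payload below is made up purely for illustration):

    import gzip
    import json
    import time

    # Made-up payload standing in for a real ~2.7 MB API response
    payload = json.dumps(
        {'items': [{'id': i, 'value': 'x' * 100} for i in range(10000)]}
    ).encode('utf-8')

    for level in (1, 6, 9):
        t0 = time.perf_counter()
        compressed = gzip.compress(payload, compresslevel=level)
        t1 = time.perf_counter()
        gzip.decompress(compressed)
        t2 = time.perf_counter()
        print(f'level={level}: {len(compressed)} bytes, '
              f'compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s')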

UPDATE:

If you do not expect "explosive" data growth, you can start with something simpler, for example one of the following options:

  • With relatively small amounts of data that easily fit on a laptop (hundreds of gigabytes), things like Hadoop can actually slow the process down. Don't use Hadoop. - jfs
  • @jfs, probably true for the current volumes, but data tends to grow non-linearly and to accumulate. An example from my own professional experience: we started adopting Hadoop only when we had already run into the limits of the RDBMS (an Oracle RAC cluster), and if we had started a few years earlier we could have avoided a whole series of unpleasant problems... - MaxU
  • More often, novice professional programmers have the opposite problem: too many layers of abstraction are piled on (roughly on the principle "use everything I know", regardless of whether it makes sense for the task at hand). One should use the simplest solution that works and not multiply entities beyond necessity. - jfs