For subsequent data analysis I need to collect JSON API responses, each weighing ~2.7 MB when saved as text. There are 200-250 thousand of them. The question is how to store and read them - please suggest the approach you consider best.

My current solution is to use the gzip module (Python 3.6) to write one large compressed file that holds one JSON response per line. Each line is written and read iteratively (a line break "\n" is appended at the end), so the whole file never has to be loaded into RAM. Per response, 2.7 MB compresses down to about 150 KB this way. The question is: is there a more convenient solution in terms of read/write speed? The write side is shown below; a sketch of the reading side follows the code.

    import gzip
    import json

    from requests import get

    # url is defined elsewhere; each API response is appended as one compressed line
    with gzip.open('large_data.json.gz', 'wb') as outfile:
        while True:
            response = get(url).json()
            outfile.write(json.dumps(response).encode('utf-8') + b'\n')
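For completeness, a minimal sketch of the reading side, assuming the file was written by the code above (standard library only; the per-record processing is just a placeholder):

    import gzip
    import json

    # Iterate over the compressed file line by line, without loading it all into RAM
    with gzip.open('large_data.json.gz', 'rt', encoding='utf-8') as infile:
        for line in infile:
            record = json.loads(line)
            # ... process a single API response here ...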
  • You could store/process the data a bit more compactly as BSON. But if small size is the main concern, then of course compress. - Vladimir Gamalyan
  • "Optimal" means different things under different circumstances. The answer to "can it be done faster" is always yes - the question is how much time, effort and resources you are willing to spend on it. To make the question useful to someone else, try stating a more specific goal, for example: "here is the code (+ input data) that runs in X minutes on such-and-such hardware; how do I make it run in X/2 minutes on the same hardware (spinning disk, SSD) and software (Python version, OS)". Measurements showing where the bottleneck is - conversion to JSON, compression, or writing to disk - would not hurt either. - jfs
  • Mapping is not ...? - And
  • @And Try re-reading what you wrote: "Mapping is not ...?". Did you mean a transformation, or a topographic survey? What is anyone supposed to make of that? Did you want to bring up object-relational mapping, or do you just like the word and use it for something unclear? - Frank

1 answer

With your data volumes, it makes sense to think about using a Hadoop cluster and Apache Spark (usually part of a Hadoop cluster) for parallel data processing.

I would do the following:

  1. Convert the JSONs to Parquet on the fly (+ Snappy compression, which is very fast to decompress) and save them to HDFS (the distributed cluster file system). It is also worth combining files (by day, month or another key) - HDFS works much more efficiently with a smaller number of large files than with a huge number of small ones. A sketch of this step is shown after the list.
  2. To process the data in the Hadoop cluster you can use Apache Spark (it supports Scala, Python and Java) and/or Hive/Impala.
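A minimal sketch of step 1, assuming pandas and pyarrow are installed; the batch size, file names and the HDFS upload step are illustrative, not part of the original answer:

    import gzip
    import json

    import pandas as pd

    BATCH_SIZE = 1000  # illustrative: how many API responses go into one Parquet file

    def write_batch(records, batch_no):
        # Flatten a list of JSON dicts into a table and write it as Snappy-compressed Parquet
        df = pd.json_normalize(records)
        df.to_parquet(f'batch_{batch_no:05d}.parquet', compression='snappy')

    records, batch_no = [], 0
    with gzip.open('large_data.json.gz', 'rt', encoding='utf-8') as infile:
        for line in infile:
            records.append(json.loads(line))
            if len(records) >= BATCH_SIZE:
                write_batch(records, batch_no)
                records, batch_no = [], batch_no + 1
    if records:
        write_batch(records, batch_no)

The resulting files can then be copied to HDFS (for example with hdfs dfs -put) and read in parallel from Spark, e.g. spark.read.parquet('hdfs:///path/to/batches/'), which corresponds to step 2.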

A Hadoop cluster can be built from relatively cheap hardware - it scales horizontally better than vertically, i.e. more servers with fewer resources (RAM, CPU, IO) are better than a few very powerful servers. Servers with a lot of RAM (512 GiB+) are not recommended, because almost all Hadoop components are written in Java, and garbage collection on JVMs with very large heaps can cause peak load spikes.

P.S. gzip, while it gives a good compression ratio, is quite slow for both compression and decompression.
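If in doubt, the trade-off is easy to measure on your own data; a standard-library-only sketch (the payload below is made up purely for illustration):

    import gzip
    import json
    import time

    # Made-up payload standing in for a real ~2.7 MB API response
    payload = json.dumps(
        {'items': [{'id': i, 'value': 'x' * 100} for i in range(10000)]}
    ).encode('utf-8')

    for level in (1, 6, 9):
        t0 = time.perf_counter()
        compressed = gzip.compress(payload, compresslevel=level)
        t1 = time.perf_counter()
        gzip.decompress(compressed)
        t2 = time.perf_counter()
        print(f'level={level}: {len(compressed)} bytes, '
              f'compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s')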

UPDATE:

If you do not expect "explosive" data growth, you can start with something simpler, for example one of the following options:

  • With relatively small amounts of data that easily fit on a laptop (hundreds of gigabytes), things like Hadoop can actually slow the process down. Don't use Hadoop. - jfs
  • @jfs, probably true for the current volumes, but data tends to grow non-linearly and to accumulate. An example from my own professional experience: we started adopting Hadoop only when we had already run into the limits of the RDBMS (an Oracle RAC cluster), and if we had started a few years earlier we could have avoided a whole series of unpleasant problems... - MaxU
  • More often, novice professional programmers have the opposite problem: too many layers of abstraction are piled on (roughly on the principle "use everything I know", regardless of whether it makes sense for the task at hand). One should use the simplest solution that works and not multiply entities beyond necessity. - jfs