Please advise which format to choose for storing a large volume of text data. It should meet at least 3 of the following 5 criteria, listed in order of priority:

  • A clear view of the data structure (nesting can be of arbitrary depth)
  • Python library support (converting, parsing)
  • Fast and compact (parsing, loading into a database)
  • Easy to distribute (download, send over the network)
  • Human-readable (can be read and edited directly by a person)

I am considering CSV, XML, and JSON. I would be grateful for advice on choosing between these formats, or for other suggestions.


UPD. A few clarifications on why I am bothering to choose a format.

I have collected a large amount of data for my project (engineering and scientific data).

Now the task of structuring and storing it has come up; the information may turn out to be useful to someone else, in which case I will pass it on. Consequently, a human-readable form would be most welcome.

It is also possible that some values will change, and to avoid re-running the whole parsing pipeline, I want to be able to edit the file directly.

In addition, the data must be imported into a database, in my case PostgreSQL, and anyone who receives my text data should be able to do the same with whatever database is convenient for them.

  • 2
In descending order of preference: JSON, XML, CSV. JSON has less overhead (it is more compact); XML has been losing popularity among programmers for several years running. Both formats can be validated against a schema. DBMS support is also arriving, at least from Microsoft. - AK
  • 3
You won't find anything simpler than CSV with escaped delimiters. And you can edit it with any spreadsheet editor. That said, the question is a candidate for closing, being of the "advise me <...>" variety. - D-side
  • 1
    I added clarifications to the question explaining why I am bothering with the format search. - while1pass
  • 3
It may be worth structuring the data in a database right away (relational, columnar, graph, document-oriented... whatever fits the task), with the necessary relations and fast queries. It is doubtful that anyone will quickly make sense of the data if the relations are not described explicitly, even in a human-readable format. An SQL dump is also easy to edit in a text editor, and exporting from a database to CSV or JSON is easier than the other way around. - Igor
  • 1
    github.com/caesar0301/awesome-public-datasets — let's make this easier: look at the formats these public datasets are distributed in, and use one of those. - strangeqargo

1 answer

So, let's think it through.

The same data in CSV and in JSON:

CSV (with | as the delimiter instead of \t):

    country|city
    US|New York
    Russia|Moscow

JSON:

    [
      {"country": "US", "city": "New York"},
      {"country": "Russia", "city": "Moscow"}
    ]

Compare the sizes. Which one takes more characters? The JSON.

JSON/XML are convenient because they are structured and can describe a data schema. CSV is convenient because it is very compact and has minimal parsing cost.
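For illustration, a minimal sketch of parsing both with nothing but Python's standard library (the file names cities.csv and cities.json are placeholders):

    import csv
    import json

    # CSV: rows are flat; any structure is implied only by column position.
    with open("cities.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="|"))

    # JSON: nesting and field names are explicit in the file itself.
    with open("cities.json", encoding="utf-8") as f:
        records = json.load(f)

    # Both yield [{"country": "US", "city": "New York"}, ...] for the data above.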

  • Any non-binary format can be edited very easily. Some binary formats are simple enough to be edited in a hex editor, especially if you are used to it.

  • Officially, JSON supports ONLY UTF-8; CSV can be in any encoding.

  • If you have very complex, heavily interrelated data that is hard to represent as one or two tables, you may want to look at JSON/XML.

If you are simply loading texts into a database and exporting them back out, CSV will do the job.
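For example, here is a minimal sketch of bulk-loading such a CSV into PostgreSQL with psycopg2; the connection string and table definition are assumptions, adjust them for your setup:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS cities (country text, city text)")
        with open("cities.csv", encoding="utf-8") as f:
            # COPY is PostgreSQL's fast bulk-load path; DELIMITER matches the '|' above.
            cur.copy_expert(
                "COPY cities (country, city) FROM STDIN "
                "WITH (FORMAT csv, DELIMITER '|', HEADER true)",
                f,
            )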

In general, this is a matter of choosing an interchange format: a format for export or for transmission over the network to external systems (no one stores data in CSV/JSON/XML as their primary online storage).

If you have very large plain texts, store them as text files and in a database, and in the CSV/JSON/XML keep links to those files. The structure becomes more complicated, but it is also easier for a person to edit.
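A sketch of what such an index could look like in JSON (the paths, titles, and field names here are invented purely for illustration):

    [
      {"id": 1, "title": "atomic parameters, set A", "text_file": "texts/0001.txt"},
      {"id": 2, "title": "coordinates, run 2", "text_file": "texts/0002.txt"}
    ]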

At that point the difference between the formats evens out. In short, as always, it all depends on the architecture and the task.

  • Thanks for the detailed answer. As I understand it, deep nesting hurts CSV and makes it harder for a person to understand. - while1pass
  • 2
CSV doesn't do "deep nesting" at all; it just puts all the fields in a row, as two-dimensional, table-like data. Each line can be interpreted however you like. JSON/XML, on the other hand, always state what is where, but parsing them is more expensive and more complicated. - strangeqargo
  • I added clarifications to the question explaining why I am bothering with the format search. Yes, the main criterion, as you noticed, is distribution over the network, so that anyone can use the data for their own purposes. The structure is deeply nested (3-5 levels) and will probably get more complicated, so for clarity JSON/XML will probably be better. - while1pass
  • Then CSV: every SQL database can import CSV. - strangeqargo
  • 1
    Nooo. It's engineering and scientific data (physicochemical properties, data structures, coordinates, atomic parameters); it's a hodgepodge that I'm trying to bring into order. - while1pass