Good day! I need your help. I apologize in advance if this has already been asked somewhere; honestly, I searched and found nothing suitable. We are starting a new project (Java, Spring, Tomcat) and are faced with the problem of choosing a data store. The thing is, we have data like the following:

[{"id": 1, "name": "test", "relationships": [1, 2, 3, 4], "charges": [{"name": "charge1", "type": 1, "validFor": {"from": "", "to": ""}}, ...]}]

In short, there is a JSON document of up to 1 MB in which part of the data is denormalized and nested in the document itself, and part consists of identifiers of other entities. The bottom line is that we need to store such documents and work with them (search by parameters, editing, etc.). First we tried to put everything into PostgreSQL, since the data is relational, but the result was terrible: about 200 tables and unwieldy joins in the queries. We also ran into the problem that the document itself and the entities associated with it change often (fields and links to other entities are added and removed). From this we concluded that the data does not fit well into the relational model, started looking towards NoSQL, and stalled: we have no experience with it, and there are many options. We thought about denormalizing all the data to get rid of the relations, but then, whenever a shared entity changes, we would have to run through every document that contains it, which does not seem like a good idea. The first thing that caught our eye, and which suits us perfectly in terms of functionality, is Amazon SimpleDB, but it is paid and it is a third-party service; the customer will never agree to store their data with a third party, so all the servers must be ours.

Therefore, question number 1: what would you advise, in principle, for storing such data and working with it?

Also, unfortunately, there is the task of storing the history of changes for each document, including the entities associated with it. The user must be able to view a document's change history (what was changed, by whom, and when), roll changes back if desired, and also see the list of all documents and their changes within a project (roughly speaking, a named group of changes).

Hence, question number 2: what is the best way to implement this on top of the storage from question 1?

  • Elastic or MongoDB, as an option. If you have big data and HDFS, then Solr. How much data do you have, approximately? - Chubatiy
  • Honestly, this smells like a holy war. 200 tables is in itself neither good nor bad; in any banking ABS, 200 tables is just the beginning... and nothing, people live with relational databases there. As usual, the answer that suggests itself is "build on what you know". - Dmitriy
  • @Chubatiy, thanks for the answer. Not much data, about 200,000. We considered Mongo, but as far as I have studied it, there is no proper transaction support, and we need group operations on documents; to resolve relations by identifier we would have to make additional queries; and there is the 16 MB limit on document size, which we might exceed if we keep the change history inside the document itself. - DeadMeat
  • @Dmitriy, I agree with you: 200 tables is of course a lot, but one can live with that. The problem is not their number but supporting them. New fields and new entities with relations will appear often (telecom business requirements change rapidly), and adding, removing and moving columns and creating new tables in every migration will be very painful. That is why we are looking at stores without a strict schema. - DeadMeat
  • In Postgres you can combine both models. There are arrays and the ability to define your own composite types. Where you want to denormalize, define a record type T and get a column of type "array of records T"; whatever you plan to reference, put into a separate table. Have you considered these features, or have you looked at Postgres only as an RDBMS, without denormalization as such? - D-side

1 answer

The bottom line is that we need to store such documents and work with them (search by parameters, editing, etc.).

If the structure of the document does not change while the application is running, i.e. your process is that you agree on the structure of the document coming in with the request, and when that structure changes you release a new version of the product that works with the new version of the request, then an RDBMS suits you as well. If the data format may differ from request to request, an RDBMS is unlikely to be a good option. An RDBMS gives you more control over data integrity: it blocks attempts to write a value of the wrong type into a table cell, or incomplete data in which part of the fields is missing.

Among NoSQL DBMSs, the document-oriented type suits you best, since your request is essentially a document in JSON format, which you will most likely write to the database unchanged. A NoSQL store such as MongoDB works directly with data in JSON form (internally it uses BSON), which greatly simplifies the operation of writing data and lets you focus on developing the tools that analyze it, which is essentially what your project requires. Conversely, if you choose an RDBMS, you will have to spend additional resources on designing a relational representation of your data (a set of tables and the relations between them) and on code that transforms the data into that representation.
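To make this concrete, here is a minimal sketch using the official MongoDB Java driver (the connection string, database and collection names are placeholders I invented, not anything from your project). The point is that the incoming JSON goes into the database as-is, with no relational mapping layer in between:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class StoreAsIs {
        public static void main(String[] args) {
            // Placeholder connection string and names, for this sketch only.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> docs =
                        client.getDatabase("project").getCollection("documents");

                // The request body is written unchanged (stored as BSON).
                String json = "{\"id\": 1, \"name\": \"test\", \"relationships\": [1, 2, 3, 4]}";
                docs.insertOne(Document.parse(json));

                // Search by a parameter of the document.
                Document found = docs.find(eq("name", "test")).first();
                System.out.println(found.toJson());
            }
        }
    }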

but the result was terrible: about 200 tables and unwieldy joins in the queries

As regards the tables, I gave the comparison above: with a NoSQL database that stores data as JSON, writing the data requires no serious additional effort. As for the denormalized structure, with a document-oriented DBMS you have two ways to solve the problem:

  1. Full denormalization: each document stores all of its constituent data inside itself, so reading and changing a document is simple. But you will run into problems when updating entities shared by the whole set of documents, since each document holds its own copy of them.
  2. Store all shared entities in separate documents and refer to them by identifier. This mirrors the RDBMS approach and is essentially an attempt to normalize the data. The same difficulty with "unwieldy joins" arises here, except that it is solved not with SQL but by combining the relevant collections of documents directly in code.

In essence, a mix of these approaches will be optimal: you will see for yourself where it is easier to store data as aggregates and where to use links to shared entities. A sketch of such a mixed layout follows.
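As a rough illustration of the mix, again with the MongoDB Java driver: "charges" belong to one document only, so they are embedded, while "relationships" are shared entities, so only their identifiers are stored and resolving them is a second query in application code. The collection names and the embedded/referenced split are my assumptions based on your example document:

    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.List;

    import static com.mongodb.client.model.Filters.in;

    public class MixedModel {
        // "documents" holds the aggregates, "relations" the shared entities.
        static void storeAndResolve(MongoCollection<Document> documents,
                                    MongoCollection<Document> relations) {
            Document doc = new Document("name", "test")
                    // Embedded: charges belong to this document alone.
                    .append("charges", List.of(
                            new Document("name", "charge1").append("type", 1)))
                    // Referenced: ids of shared entities in another collection.
                    .append("relationships", List.of(1, 2, 3, 4));
            documents.insertOne(doc);

            // The "join" happens in code: a second query by the stored ids
            // (assuming shared entities use these ids as their _id).
            List<Integer> ids = doc.getList("relationships", Integer.class);
            for (Document rel : relations.find(in("_id", ids))) {
                System.out.println(rel.toJson());
            }
        }
    }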

We also ran into the problem that the document itself and the entities associated with it change often (fields and links to other entities are added and removed).

This problem, which is also related to the previous one, is easily addressed by graph databases, which were created precisely for managing many entities interconnected in complex ways. There are also mixed models that combine the graph and document-oriented approaches, such as OrientDB. In your case this can radically simplify the problem at hand. It is important to understand, though, that as your system acquires new functionality, this approach may turn out to be suboptimal for it, and then you should add another store of the appropriate type to the system; this is called polyglot persistence. I think OrientDB or a similar system combining the graph and document-oriented approaches will free you from many of the problems caused by the mismatch between your data and the structure of your DBMS.
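For illustration, a small sketch against the OrientDB 3.x object API (the database name, credentials and class names are invented). Each document becomes a schema-less vertex, and a link to a shared entity becomes an edge, so neither a new field nor a new relation requires a migration:

    import com.orientechnologies.orient.core.db.ODatabaseSession;
    import com.orientechnologies.orient.core.db.ODatabaseType;
    import com.orientechnologies.orient.core.db.OrientDB;
    import com.orientechnologies.orient.core.db.OrientDBConfig;
    import com.orientechnologies.orient.core.record.OVertex;

    public class GraphSketch {
        public static void main(String[] args) {
            try (OrientDB orient = new OrientDB("embedded:./databases/",
                                                OrientDBConfig.defaultConfig())) {
                orient.createIfNotExists("project", ODatabaseType.PLOCAL);
                try (ODatabaseSession db = orient.open("project", "admin", "admin")) {
                    db.createClassIfNotExist("Document", "V");
                    db.createClassIfNotExist("Relationship", "V");
                    db.createClassIfNotExist("references", "E");

                    // A schema-less vertex: new fields need no migration.
                    OVertex doc = db.newVertex("Document");
                    doc.setProperty("name", "test");
                    doc.save();

                    OVertex shared = db.newVertex("Relationship");
                    shared.setProperty("name", "shared entity");
                    shared.save();

                    // The edge replaces both the foreign key and the join.
                    doc.addEdge(shared, "references").save();
                }
            }
        }
    }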

Therefore, question number 1: what would you advise, in principle, for storing such data and working with it?

I would advise OrientDB or an equivalent. Yes, you will have to learn a new technology. But it is extremely important to understand that the biggest problems appear when the project is already launched and settled and needs to be actively developed. With an unsuccessful choice of model at the start, problems snowball as the project grows in complexity: adding, deleting or modifying a field or table takes an incredible amount of effort just to update the queries affected by the change. My conclusion: choose the storage system that is most natural for your task and focus on solving the actual problem, analyzing the data.

Also, unfortunately, there is the task of storing the history of changes for each document, including the entities associated with it. The user must be able to view a document's change history (what was changed, by whom, and when), roll changes back if desired, and also see the list of all documents and their changes within a project (roughly speaking, a named group of changes).

Some NoSQL DBMSs have a built-in auditing mechanism, MongoDB and OrientDB among them, although in both cases it is part of the commercial license. Alternatively, you can look for a community solution or write one yourself.
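If you do write it yourself, one workable shape is to append one record per change to a separate history collection, keeping full before/after snapshots so that a rollback simply restores the "before" snapshot. This is only a sketch; the field set, including the "changeSet" field for your named group changes, is my assumption:

    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.Date;

    public class ChangeHistory {
        // One audit record per change, in a separate "history" collection.
        static void recordChange(MongoCollection<Document> history,
                                 Object docId, String user, String changeSet,
                                 Document before, Document after) {
            history.insertOne(new Document("docId", docId)
                    .append("user", user)            // who changed it
                    .append("changeSet", changeSet)  // named group of changes
                    .append("at", new Date())        // when
                    .append("before", before)        // rollback = restore this
                    .append("after", after));        // what it became
        }
    }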

Summing up:

Based on the problem you described, a DBMS that mixes the graph and document-oriented models (optimally, one that stores data as JSON) suits you, because all your data is originally in JSON and the entities are heavily linked to one another. I gave an example of such a DBMS above; you can also look at alternatives.

Before choosing, I advise you to read a short book, "NoSQL Distilled" by Pramod Sadalage and Martin Fowler (published in Russian as "NoSQL: a new methodology for developing non-relational databases"). The Russian translation is somewhat rough, but it is enough to understand what a NoSQL DB can give you.

  • Thanks for the answer. We have indeed discovered graph stores and can see that they solve our problems. Right now we are considering Neo4j, but we will take your advice and look at OrientDB as well. - DeadMeat