There is an entity of the "conference" type, in which there can be several participants with different tags:

{ "participants": { "a": "79219998877", "b": "79219998878", "c": "79219998879" } } 

I need to store this data in a database (a specific engine is not important right now; it will most likely change after some time, so assume there are no joins and no indexing of array fields) and search over it: I need to find conferences that satisfy an arbitrary number of conditions like "a was 79219998877", "b was 79219998878" and so on, where a and b can be completely arbitrary tags. Is it possible to represent this data in some form (without array fields and joins) that allows such a search with a plain select, or is it impossible in principle?

  • Do I understand correctly that you always need to get the latest conference for two given participants? Or can the number be completely arbitrary? - KoVadim
  • @KoVadim arbitrary. At the moment it is only necessary to search by one or two, but I'm looking for a silver bullet. I can build a view / additional table for specific cases, but all this fuss is because the project has already reached refactoring, and I want to protect myself wherever possible in order to offer the end user arbitrary search options. - etki
  • Does only JSON suit you, or can the data be converted to XML? - Mirdin
  • @Mirdin that is just a description of the data structure. Storing it as an array (no matter in which container) will most likely not work for now. - etki
  • @Etki you don't need to store it as an array; store it either as pieces of JSON or XML, which modern databases can work with (for example, via XQuery in MSSQL): filter them, select the needed entities. - Mirdin

3 answers

Your problem statement is contradictory.

How can I store such data in order to quickly and smoothly search by an arbitrary number of participants?

But at the same time

I need to store this data in the database (a specific engine is not important now; it will most likely change after some time, so assume there are no joins and no indexing of array fields) and perform a search on it.

The answer will depend on the type of database you select. The most "painless" option is to keep everything in JSON and filter over it. But this is only superficially the easiest option, because there will be other data that is inconvenient to store in such a form and painful to work with.

When choosing NoSQL storage you have almost complete freedom over the data and the queries, but you have to watch data consistency yourself. You can use any filter that selects exactly the records you want. However, if later in the project you get an entity with a large number of relations, following them by hand can become torture.

In the case of a relational database everything is more complicated, because the expressiveness of the filter is constrained by relational algebra, but consistency and index-based selection are taken care of for you.

The example filter can be implemented with something like: http://sqlfiddle.com/#!9/b891f

  SELECT Conference.conf_id, COUNT(Conference.conf_id) AS cnt
  FROM Conference
  JOIN Participants
    ON Conference.conf_id = Participants.conf_id
   AND (Participants.key, Participants.val) IN ( ('a', '79219998878'), ('b', '79219998877') )
  GROUP BY Conference.conf_id
  HAVING cnt = 2

If you need something more complicated, the constraints of relational algebra will most likely not allow it. But there is a way out of this situation, as you yourself suggested: store the structure as JSON and put the condition on the select.

 SELECT * FROM Conf
 WHERE JSON_CONTAINS(json_field, '"79219998877"', "$.a")
   AND JSON_CONTAINS(json_field, '"79219998878"', "$.b")
 -- or via virtual columns:
 -- WHERE json_field->"$.a" = "79219998877"
 --   AND json_field->"$.b" = "79219998878"

virtual columns

This way you get a hybrid of both worlds, but it is only available in MySQL 5.7+. There you can even build indexes on such columns.
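The same hybrid can be sketched engine-agnostically. Below is a minimal Python sketch that uses sqlite's JSON1 function json_extract as a stand-in for MySQL's -> operator; the table and column names are illustrative, and a sqlite build with JSON1 is assumed (the default in recent Python releases):

```python
import json
import sqlite3

# In-memory database as a stand-in for the real engine; sqlite's JSON1
# function json_extract plays the role of MySQL's json_field->"$.a".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Conf (id INTEGER PRIMARY KEY, json_field TEXT)")
docs = [
    (1, {"a": "79219998877", "b": "79219998878"}),
    (2, {"a": "79219998878", "c": "79219998879"}),
]
conn.executemany(
    "INSERT INTO Conf VALUES (?, ?)",
    [(conf_id, json.dumps(participants)) for conf_id, participants in docs],
)

# Arbitrary tag/value conditions expressed as a plain WHERE clause.
found = conn.execute(
    "SELECT id FROM Conf "
    "WHERE json_extract(json_field, '$.a') = '79219998877' "
    "AND json_extract(json_field, '$.b') = '79219998878'"
).fetchall()
print(found)  # [(1,)]
```

The point is only that the filter stays a plain select over one table, with no joins and no array fields.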

UPD: I was reminded about Postgres, thanks. Yes, it has similar functionality with jsonb, and you can likewise build queries over such a field. See the @> and <@ operators.

UPD: after reading the comments, it became more or less clear to me what was meant:

At the moment the project sits on a relational database; in the future a move to a row-column storage is planned (NoSQL is not only JSON-like storage), which involves preparing the data in a "flat" form for simple queries, and in which I cannot search by part of an associative array (only by its full value). Hence the question described above, which boils down not to finding workarounds, but to whether all this can somehow be represented in a prepared form.

In this case, the "workaround" solution is its own implementation of the index on the array of documents. But the writing of "his" such an indexer is a very difficult task.

You cannot embed this indexer in your database. An implementation in PHP / Ruby / Python will most likely lose in performance even to a full scan inside the database, so you would have to write it in a "systems" language and communicate with it via IPC / sockets.

I see its usage like this: on document creation / deletion you pass the document to the indexer, and it updates the index accordingly. When you need a selection, you send it a query; it quickly returns the IDs of the elements that satisfy the query, and with those IDs you go to the database and fetch the records. But then why reinvent this bicycle if you can spin up some MongoDB and use it in exactly the same way?
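The scheme above (pass documents to the indexer on create / delete, then ask it for the IDs matching a set of conditions) can be sketched as a toy in-memory inverted index; this is only a model of the idea, not the out-of-process implementation discussed:

```python
from collections import defaultdict


class ParticipantIndex:
    """Toy inverted index: (tag, number) -> set of conference IDs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, conf_id, participants):
        # Called on document creation: register every (tag, number) pair.
        for tag, number in participants.items():
            self._index[(tag, number)].add(conf_id)

    def remove(self, conf_id, participants):
        # Called on document deletion.
        for tag, number in participants.items():
            self._index[(tag, number)].discard(conf_id)

    def find(self, conditions):
        # Intersect the ID sets of all requested (tag, number) pairs,
        # i.e. "a was X AND b was Y AND ..." for any number of conditions.
        sets = [self._index.get(pair, set()) for pair in conditions]
        return set.intersection(*sets) if sets else set()


idx = ParticipantIndex()
idx.add(1, {"a": "79219998877", "b": "79219998878"})
idx.add(2, {"a": "79219998878", "b": "79219998877"})
print(idx.find([("a", "79219998877"), ("b", "79219998878")]))  # {1}
```

The database then only needs to fetch records by the returned IDs.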

The following is an example for Mongo.

It is better to store the documents in the form:

 { "_id" : ObjectId("571923c7e4b08c60be5228a4"), "id" : 1, "participants" : [ { "key" : "a", "value" : "79219998878" }, { "key" : "b", "value" : "79219998877" }, { "key" : "c", "value" : "79219998879" } ] } { "_id" : ObjectId("571923f0e4b08c60be5228a9"), "id" : 2, "participants" : [ { "key" : "a", "value" : "79219998877" }, { "key" : "b", "value" : "79219998878" } ] } { "_id" : ObjectId("57193370e4b08c60be522acb"), "id" : 3, "participants" : [ { "key" : "a", "value" : "79219998877" }, { "key" : "c", "value" : "79219998879" } ] } { "_id" : ObjectId("571933c2e4b08c60be522ad4"), "id" : 4, "participants" : [ { "key" : "a", "value" : "79219998878" }, { "key" : "b", "value" : "79219998877" }, { "key" : "d", "value" : "79219998873" } ] } 

Make an index:

 db.participants.createIndex({ "participants.key" : 1 , "participants.value" : 1}) 

And search like this:

 db.participants.find(
     { "participants" : { "$all" : [
         { "$elemMatch" : { "key" : "a", "value" : "79219998878" } },
         { "$elemMatch" : { "key" : "b", "value" : "79219998877" } }
     ] } }
 ).pretty()

Output:

 { "_id" : ObjectId("571923c7e4b08c60be5228a4"), "id" : 1, "participants" : [ { "key" : "a", "value" : "79219998878" }, { "key" : "b", "value" : "79219998877" }, { "key" : "c", "value" : "79219998879" } ] } { "_id" : ObjectId("571933c2e4b08c60be522ad4"), "id" : 4, "participants" : [ { "key" : "a", "value" : "79219998878" }, { "key" : "b", "value" : "79219998877" }, { "key" : "d", "value" : "79219998873" } ] } 

If you run explain(), you will see that the index is used.

 "winning plan": { "inputStage" : { // INDEX SCAN!!! "stage" : "IXSCAN", "keyPattern" : { "participants.key" : 1, "participants.value" : 1 }, "indexName" : "participants.key_1_participants.value_1", "isMultiKey" : true, "direction" : "forward", "indexBounds" : { "participants.key" : [ "[\"a\", \"a\"]" ], "participants.value" : [ "[\"79219998878\", \"79219998878\"]" ] } } 

And just use it as an external indexer.

Yes, there is overhead: for the sake of this indexer you will have to run a whole Mongo instance. But you can also send it not the whole document, only the ID and the array of participants. I think other NoSQL databases also have indexers with the required functionality, and perhaps some of them are "lighter" than Mongo; you can use those instead.

If you really want to, you can dig into Mongo's source code, understand how such an indexer works, and rewrite it yourself. But in my opinion, the overhead of Mongo is cheaper than reinventing a bicycle on steroids.

  • "but it is only available in MySQL" - Not only. PostgreSQL can do that too. - Nofate
  • ... only there the type is called not json but jsonb, and it appeared in PostgreSQL 9.4+; json in Postgres is, for the most part, just text plus validation. - D-side
  • I keep getting solutions tied to a specific storage engine, while right now I am trying to get away from exactly that. At the moment the project sits on a relational database; in the future a move to a row-column storage is planned (NoSQL is not only JSON-like storage), which involves preparing the data in a "flat" form for simple queries, and in which I cannot search by part of an associative array (only by its full value). Hence the question described above, which boils down not to finding workarounds, but to whether all this can somehow be represented in a prepared form. - etki
  • > Hence the above-described question arises, which boils down not to finding workarounds, but to whether all this can somehow be represented in a prepared form. Then the question comes down to "I want an indexing algorithm for a previously unknown number of keys". - Darigaaz
  1. You can create a table with the fields a, b, conf_id, conf_date and select from it. Ideally you would also create an index on a and b. Without one it will be slow, but still faster than a direct scan of the conferences table. If no index is possible, see option 3.
  2. If there is no field indexing, but there is sorting and the ability to create tables quickly, then instead of an index you can create tables named simulated_index_{a}_{b}, essentially an index imitation. Incidentally, this can work almost as fast as the first option with an index, or even faster.
  3. If there is only sorting by an arbitrary field and no index, you can take option 1 but add a field ab to sort by: ab = a * 2^32 + b (then ab is an int64, 2^32 being the size of an int), or ab = md5(a) ^ b. In other words, you need to somehow combine the two fields into one and compute the desired value on the fly.
  4. If everything must be kept in the single table above, but columns can be added to it, then simply add the ab column from option 3.
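Options 3 and 4 boil down to deriving a single indexable value from two fields. Here is a small Python sketch of such a derivation; since both fields in this task are strings, it hashes the delimited pair rather than literally computing md5(a) ^ b, so the exact key layout is an illustrative assumption:

```python
import hashlib


def composite_key(tag, number):
    """Collapse a (tag, number) pair into one fixed-width column value.

    A string-friendly variation on option 3's md5(a) ^ b: the pair is
    joined with a delimiter and hashed, so arbitrary string tags and
    numbers map to a single indexable column.
    """
    return hashlib.md5(f"{tag}\x00{number}".encode()).hexdigest()


# Store composite_key(tag, number) in an extra `ab` column; the search
# then becomes a plain equality select with no joins.
key = composite_key("a", "79219998877")
print(key == composite_key("a", "79219998877"))  # True: deterministic
print(key == composite_key("a", "79219998878"))  # False: distinguishes values
```

Any deterministic, collision-resistant combination works; the hash is just one convenient choice.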

I am sure there are more options. Write if these do not fit.

  • I'm afraid I didn't describe the task accurately. A and B can be completely arbitrary tags; in fact I need to find conferences that satisfy some number of filters of the form "there is a participant with tag X whose number is Y", i.e. a and b are not known in advance. int32 definitely won't work; this is string data (extensions and international numbers are possible). - etki
  • Yeah, I get it. Then I see two options: either normalize the database, after which you can use indexes and make efficient queries, or build a decision tree (in effect, an imitation of indexes). In all other cases everything will be extremely slow, since XML / JSON is processed very inefficiently compared to columns and indexes. - Manushin Igor

Option 1.

If storing the data specifically as JSON is not critical, then in MSSQL you can use the XML data type. This is a very broad topic; you can read about it here or, for example, here. Unfortunately, MS promises support for JSON itself only in version 2016.
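As an engine-agnostic illustration of the same idea (filter documents by an XML predicate), here is a Python sketch with ElementTree standing in for the XQuery filter; the XML layout of a conference is an assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML representation of the conferences from the question.
docs = {
    1: "<participants><a>79219998877</a><b>79219998878</b></participants>",
    2: "<participants><a>79219998878</a></participants>",
}


def matches(xml_text, conditions):
    # The same predicate an XQuery filter would express: every requested
    # tag must be present with exactly the given number.
    root = ET.fromstring(xml_text)
    return all(root.findtext(tag) == number for tag, number in conditions)


hits = [conf_id for conf_id, xml in docs.items()
        if matches(xml, [("a", "79219998877"), ("b", "79219998878")])]
print(hits)  # [1]
```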

Option 2.

In MySQL, the JSON data type is already implemented (you can read about it here), which lets you simply store such a conference in a field and search over it. However, I haven't worked much with MySQL and am unlikely to, but this direction will probably be worth digging into for you.

  • There are no indexes on xml / json fields, so you will have to scan almost the entire table. Well, if the conferences are sorted by time you can shorten the search, but that does not change the main thing: if no such conference exists, you have to check every record. And that is inefficient. - Manushin Igor
  • @ManushinIgor, let's start from the beginning: what exactly are you going to index? There is no information in the question about the amount of data, and moreover its arbitrariness (denormalization) is explicitly stated. - Mirdin
  • If the site is already in production, then a data schema exists, and the search will run against that schema. Suppose the schema holds for only 60% of the data (the rest simply won't match). In that case, for those 60% of the data you can split the single JSON into a set of tables and join them (even if only in the application's memory). - Manushin Igor
  • @ManushinIgor, you are ready to lose 40% of the information without even knowing how many such records there will be. In my opinion, this is exactly what is called "premature optimization". - Mirdin
  • Assume there are a lot of records; otherwise it would be easier to keep everything in the application's memory. Once again: if the filter cannot find the required fields in those 40%, it simply returns false. And most importantly, I am not talking about replacing this table, but about creating additional ones. No information is lost. - Manushin Igor