There are two groups of single-column tables. All tables are identical in structure; an example:

    CREATE TABLE IF NOT EXISTS `member_lst_%1` (
        `user_id` BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (`user_id`)
    )

The name of each table contains a numeric identifier, written as %1 in the example. It is just a number, for example "666", but unique to each table.

The task is to count how many identical user_id values are present in both groups of tables.

The problem is that the tables are large (a million rows and up), and each group contains quite a few of them (several hundred and up).

An important note: duplicate user_id values matter not within a single group but between the two groups. That is, if a user_id is present in at least one table of a group, it counts as present in the whole group. This rules out simply chaining the tables together INNER JOIN after INNER JOIN. So I must first build the list of all user_id values of one group, then of the second group, and only then compare these two huge lists.
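Put as SQL, the brute-force formulation would look roughly like this (a sketch only; the scratch tables `grp1`/`grp2` and the concrete source table name are illustrative):

    CREATE TEMPORARY TABLE `grp1` (`user_id` BIGINT UNSIGNED NOT NULL, PRIMARY KEY(`user_id`));
    CREATE TEMPORARY TABLE `grp2` (`user_id` BIGINT UNSIGNED NOT NULL, PRIMARY KEY(`user_id`));

    -- collect the ids of group 1; IGNORE collapses duplicates within the group
    INSERT IGNORE INTO `grp1` (`user_id`) SELECT `user_id` FROM `member_lst_100475993`;
    -- ... one such INSERT per table of group 1, then the same for `grp2` ...

    -- the answer is the size of the intersection of the two lists
    SELECT COUNT(*) FROM `grp1` INNER JOIN `grp2` USING(`user_id`);

With hundreds of million-row tables per group, these scratch lists get huge, which is exactly the problem.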

Is it possible to solve this problem somehow?

Update

I checked how the identifiers are distributed in one of the tables:

    select min(user_id), max(user_id), count(1) from member_lst_100475993;

    min(user_id)  max(user_id)  count(1)
    4784          361583895     31746

    select count(1) from (
        select min(user_id) start, max(user_id) stop from (
            select user_id, @i := @i + 1, @grp := if(@i = user_id, @grp, @grp + 1) grp
            from member_lst_100475993, (select @i := 0, @grp := 0) A
            order by user_id
        ) G group by grp
    ) T;

    count(1)
    31746

The number of intervals equals the number of rows: the identifiers form no consecutive runs at all, so representing them as ranges would gain nothing.
  • Relational databases are among the most unsuitable tools for this kind of problem, all the more so with such a structure. I cannot even imagine how long MySQL will take to grind through a hundred tables of a million records each. Data represented well as intervals and/or bitmaps would work thousands of times faster. - Mike
  • The mice cried and pricked themselves, but kept eating the cactus :)) Perhaps the identifiers could somehow be stored as intervals in a separate table... something like: identifiers from such-and-such to such-and-such belong to such-and-such group of tables. Or indeed give up storing this in the database and move to something else. - alexis031182
  • @Mike, could you please give me the name of an algorithm that would solve this task? I just have a feeling that intervals won't work out with my data. - alexis031182
  • I don't know any names, I usually invent algorithms myself. But so far I can't say anything concrete: one has to understand not this one task but all the tasks facing such a system. Intervals are a good thing, especially when they are long, but working with them is hard, especially updating the data, merging or splitting intervals on insert/delete. It might be simpler to keep the IDs one by one but store a bit field next to each, marking which groups it belongs to; a BIGINT accommodates 64 groups. Such a field answers "which groups does this ID belong to", but not "which IDs belong to a group". - Mike
  • Or rather, "which IDs belong to a group" can also be answered with a bitmask, but indexes won't help there, only a full scan of the table. - Mike
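A minimal sketch of the bit-field idea from the last two comments (the table `member_groups`, its column names, and the group numbers are assumptions, not code from the discussion):

    CREATE TABLE `member_groups` (
        `user_id`  BIGINT UNSIGNED NOT NULL,
        `grp_mask` BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- one bit per group, up to 64 groups
        PRIMARY KEY (`user_id`)
    );

    -- mark that user 4784 belongs to group 3
    INSERT INTO `member_groups` (`user_id`, `grp_mask`) VALUES (4784, 1 << 3)
        ON DUPLICATE KEY UPDATE `grp_mask` = `grp_mask` | (1 << 3);

    -- "which groups does this ID belong to" is a primary-key lookup
    SELECT `grp_mask` FROM `member_groups` WHERE `user_id` = 4784;

    -- "which IDs are in both group 1 and group 2" needs a full scan, as noted
    SELECT COUNT(*) FROM `member_groups`
    WHERE `grp_mask` & (1 << 1) <> 0 AND `grp_mask` & (1 << 2) <> 0;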

1 answer

I decided to go the route whose idea @Mike voiced in his last comments on this question (many thanks to him for that).

Finding out which and how many rows coincide between two tables takes very little time (a few seconds at most, even in the hardest cases, when both tables are huge, several million rows each). But there are many tables, and merging all the matching rows into one in-memory table for counting wastes an unacceptable amount of resources; besides, such a query runs for an hour or more.

For this reason I decided to compare the groups so that each table of one group is compared with every table of the other group by a separate query. I created a temporary table and filled it, in several parallel threads, with simple queries like this:

    INSERT IGNORE INTO `member_tmp` (`user_id`)
    SELECT `user_id`
    FROM `member_lst_%1`
    INNER JOIN `member_lst_%2` USING (`user_id`)

The IGNORE keyword in the query weeds out rows that repeat across the comparisons of different table pairs, leaving only unique values. This cut the waiting time for the result to about ten minutes, with no excessive memory consumption on the machine (apart from disk space, which I don't count since there are no particular restrictions on it).
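The surrounding steps are not shown above; presumably they look roughly like this (the definition of `member_tmp` is an assumption; the sketch uses a plain table rather than TEMPORARY, since MySQL temporary tables are visible only to their own connection and the inserts ran in several threads, so the actual setup may differ):

    CREATE TABLE IF NOT EXISTS `member_tmp` (
        `user_id` BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (`user_id`)
    );

    -- one INSERT IGNORE (as above) per pair of tables from the two groups, then:
    SELECT COUNT(*) FROM `member_tmp`;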

In general I could have stopped there; it was hardly possible to improve things much further. Still, the time spent on the calculation was depressing. Obviously one could add caching so as not to recompute everything from scratch, but my task has a nuance I did not mention in the question: tables can, for one reason or another, change their group quite often. That fact negates all the possible profit from a cache. As a result everything had to be recounted constantly, and I constantly had to wait ten-odd minutes.

After trying a bunch of solutions in vain, I decided to approach from the other side. In reality the share of identical rows between any two tables is usually no more than one percent of their volume. So when the overlap is computed across the entire volume of the two groups of tables, the efficiency of the whole exercise is minimal: the machine toils madly, and the output, though useful, is next to nothing.

Then I decided to create a table holding all matching rows between tables, including pairs of tables within the same group. Then, if a table changes its group, the previously computed intersections it participates in mean nothing has to be recalculated.

The table is simple:

    CREATE TABLE IF NOT EXISTS `member_unite` (
        `group1_id` BIGINT UNSIGNED NOT NULL,
        `group2_id` BIGINT UNSIGNED NOT NULL,
        `user_id` BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (`group1_id`, `group2_id`, `user_id`)
    )
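Presumably it is filled pairwise, much like `member_tmp` above; a hedged sketch, assuming `group1_id`/`group2_id` store the numeric identifiers of the two compared tables (the second identifier here is made up):

    INSERT IGNORE INTO `member_unite` (`group1_id`, `group2_id`, `user_id`)
    SELECT 100475993, 100475994, `user_id`
    FROM `member_lst_100475993`
    INNER JOIN `member_lst_100475994` USING (`user_id`);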

I filled the table with test data of the same size as the real data. The result was excellent: even in the worst case a query completes in a couple of seconds. There really are relatively few intersections between the tables (the whole table came out to only about ten million rows), and these intersections grow very slowly over time (they may even shrink).
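With the intersections precomputed, answering the original question becomes a lookup over `member_unite`. A sketch, assuming a helper table `member_grp` (hypothetical, not shown in the answer) that maps each table identifier to its current group:

    SELECT COUNT(DISTINCT u.`user_id`)
    FROM `member_unite` u
    JOIN `member_grp` g1 ON g1.`tbl_id` = u.`group1_id`
    JOIN `member_grp` g2 ON g2.`tbl_id` = u.`group2_id`
    WHERE (g1.`grp` = 1 AND g2.`grp` = 2)
       OR (g1.`grp` = 2 AND g2.`grp` = 1);

When a table changes its group, only the mapping needs updating; the precomputed intersections stay valid.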

All that remains is to write the real data into this table and refresh it once a day. The same spider that, in fact, updates the data in the source tables took this on.