There is a table of users in which the history of changes in their state should be entered daily:

CREATE TABLE IF NOT EXISTS `users` ( `user_id` BIGINT UNSIGNED NOT NULL , `day` INTEGER UNSIGNED NOT NULL , `action` TINYINT(1) UNSIGNED NOT NULL , PRIMARY KEY(`user_id`,`day`) , INDEX(`day`) ) 

The day column is the day number, the action column is an event about whether the user is added (value, for example, "1") or has been deleted (value "0").

The problem is that there are a lot of users (from 1 million) and it is required to synchronize with the received data as quickly and economically as possible in terms of memory consumption.

The incoming data (from the network) by users can be accumulated in a separate table, and then synchronized. Actually, I did, but due to the fact that this is a history table, and not, say, just the current state, synchronization requests become very resource-intensive. For example, one has to take into account that each user can be deleted (the value in action is "0"), and a day later it is added again (value "1"). And such "swings" (added-deleted-added) can occur day after day.

It may make sense to create a separate table before synchronization, where to drop all users with their current state, i.e. create a snapshot of their current state. Synchronize incoming data with this temporary table and only then change users who have changed their state to overwrite in the target table. But I, unfortunately, have no confidence that this approach will work faster, because one way or another, you still need to read the target table completely in order to make a copy of the latter at the time of state synchronization.

Perhaps there is a solution more interesting, rather than what I was able to come up with. Please prompt.

Update

The data received for synchronization is a list of user IDs that are currently available.

Update 2

The list of users comes from day to day almost the same, but with changes for a certain number of users. Parts of them that were previously will not, but there are new ones. It is this difference in composition that needs to be determined by appending lines to the history table with changes for each user, if any. If the incoming data is the same as in the target table (the user lists are equal), then the target table does not change at all.

  • How strange the "day number" looks, why, for example, not two fields "date interval" so as not to fence a bunch of identical records in a row 1 1 1. And what is coming from the outside in what form, such as a list of user id's that exist today? - Mike
  • @Mike, instead of the day number, you can simply timestamp or something else, the essence of the role does not change. It just turns out that for each user you need to keep a date when something happened to him (added or deleted). If there were no changes, that is, for example, it was previously added, and the next day it also appears in the incoming data, then we do not change or add anything. And the incoming data, you are right, this is a list of user IDs that is currently available. - alexis031182
  • @Mike, I was sealed when I wrote a question. There are two columns in PRIMARY KEY : both user_id and day . That is, it is important to save all state changes and when they occurred. - alexis031182
  • one
    Well, in the database continue to store history. And the script loader after writing the next piece of data to the database writes the last state in its format to the disk. Without the packaging of 1 million records from 1 to 1 million takes 8 MB. On the next bootup, it picks up the last state from the disk and compares it with it without reading the database again - Mike
  • one
    I think so, it’s much easier to compare with her. If in sql through full outer join I pasted two tables - that's the whole difference. Further optimization - removal from the current state database for verification in RAM - Mike

1 answer 1

Easier to compare lists of IDs without additional information. We need to store in a separate table a list of user IDs that are currently considered to exist (your status is 1 in the history table). Store id not existing in the latest edition is not necessary. We load into the temporary table a new incoming list and compare it with a query like:

 insert into hystory(id, dt, state) select id, now(), 1 from NEW left join OLD on NEW.id=OLD.id where OLD.id is null union select id, now(), 0 from NEW right join OLD on NEW.id=OLD.id where NEW.id is null 

After that, we still need to delete / add entries to the last state table. If the database is not MySQL and supports FULL OUTER JOIN or MINUS operations, this will simplify the query 2 times.

Option 2: If we have many consecutive or closely spaced IDs in the database, we can try to store the current state as records containing an initial ID of a range of 64 values ​​and a bitint mask in the form of bitint in which the currently existing IDs are marked with single bits. A matching query builds incoming IDs in the same format, reattaches the tables, and finds the difference using bitwise operations.

Option 3: Compare the lists of old and new IDs not in the database, but in an external script. A list of 1 million values ​​is easily compared in RAM in a couple of seconds. After that, changes are written to the database in the history table. And the script saves the new current list of existing id to a file on disk in a format convenient for the script to quickly read in the future.

  • one
    Detailed answer with options, thanks a lot. - alexis031182