There are two tables: one stores entities, the other stores their versions. How do I select the latest version of each entity?

Expected output:

| account_actual_id | account_id | name     | added_date |
|-------------------|------------|----------|------------|
| 3                 | 2          | 'First'  | 2016-11-03 |
| 4                 | 3          | 'Second' | 2016-11-03 |

Schema and data for testing:

    CREATE TABLE `account` (
        `account_id` INT(11) NOT NULL AUTO_INCREMENT,
        `name` VARCHAR(255) NULL DEFAULT NULL,
        PRIMARY KEY (`account_id`)
    ) COLLATE='utf8_general_ci' ENGINE=InnoDB;

    CREATE TABLE `account_actual` (
        `account_actual_id` BIGINT(20) NOT NULL AUTO_INCREMENT,
        `account_id` INT(11) NOT NULL,
        `added_date` DATE NOT NULL,
        PRIMARY KEY (`account_actual_id`),
        UNIQUE INDEX `u` (`account_id`, `added_date`),
        INDEX `added_date` (`added_date`),
        INDEX `account_id` (`account_id`),
        CONSTRAINT `fk__account_actual_account` FOREIGN KEY (`account_id`)
            REFERENCES `account` (`account_id`) ON UPDATE CASCADE ON DELETE CASCADE
    ) COLLATE='utf8_general_ci' ENGINE=InnoDB;

    INSERT INTO `account` VALUES (2, 'First');
    INSERT INTO `account` VALUES (3, 'Second');

    INSERT INTO `account_actual` VALUES (1, 2, '2016-11-01');
    INSERT INTO `account_actual` VALUES (2, 3, '2016-11-01');
    INSERT INTO `account_actual` VALUES (3, 2, '2016-11-03');
    INSERT INTO `account_actual` VALUES (4, 3, '2016-11-03');
    INSERT INTO `account_actual` VALUES (5, 2, '2016-11-02');
    INSERT INTO `account_actual` VALUES (6, 3, '2016-11-02');

http://sqlfiddle.com/#!9/933f9c/1

Performance is extremely important: the tables hold millions of rows.

    3 answers

    Here is one option:

        select a.*, b.*
        from account a
        join account_actual b on b.account_id = a.account_id
        left join account_actual c on c.account_id = b.account_id
                                  and b.added_date < c.added_date
        where c.account_actual_id is null

    For each b, this joins every c that was added later. If no such c exists, then b is the latest version.

    If you add a prev_id field to account_actual that points to the previous version, it gets much better:

        select a.*, b.*
        from account a
        join account_actual b on b.account_id = a.account_id
        left join account_actual c on c.prev_id = b.account_actual_id
        where c.account_actual_id is null

    If no c refers to b as its previous version, then b is the latest version.
    Here, unlike the previous query, at most one row c is matched for each b.

    And of course you can look for the maximum added_date:

        select a.*, b.*
        from account a
        join account_actual b on b.account_id = a.account_id
        where b.added_date = (select max(added_date)
                              from account_actual c
                              where c.account_id = b.account_id)

    Or:

        select a.*, b.*
        from account a
        join account_actual b on b.account_id = a.account_id
        where b.account_actual_id = (select account_actual_id
                                     from account_actual c
                                     where c.account_id = b.account_id
                                     order by added_date desc  -- newest first
                                     limit 1)

    And if account itself stored the id of the latest version, that would be best of all: the query becomes trivial.
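As a rough illustration of that trivial case, here is a minimal sketch. It assumes a hypothetical column `latest_actual_id` on `account` (not part of the question's schema) and uses Python's built-in sqlite3 as a stand-in for MySQL, with the question's test data. The backfill shown is one possible way to fill the pointer; in practice the application would have to keep it up to date on every insert.

```python
import sqlite3

# SQLite stands in for MySQL; latest_actual_id is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        account_id       INTEGER PRIMARY KEY,
        name             TEXT,
        latest_actual_id INTEGER  -- hypothetical pointer to the newest version
    );
    CREATE TABLE account_actual (
        account_actual_id INTEGER PRIMARY KEY,
        account_id        INTEGER NOT NULL REFERENCES account (account_id),
        added_date        TEXT    NOT NULL,
        UNIQUE (account_id, added_date)
    );
    INSERT INTO account (account_id, name) VALUES (2, 'First'), (3, 'Second');
    INSERT INTO account_actual VALUES
        (1, 2, '2016-11-01'), (2, 3, '2016-11-01'),
        (3, 2, '2016-11-03'), (4, 3, '2016-11-03'),
        (5, 2, '2016-11-02'), (6, 3, '2016-11-02');

    -- One-off backfill of the pointer from the existing versions.
    UPDATE account
    SET latest_actual_id = (
        SELECT aa.account_actual_id
        FROM account_actual aa
        WHERE aa.account_id = account.account_id
        ORDER BY aa.added_date DESC
        LIMIT 1
    );
""")

# The "latest version" query is now a plain primary-key join.
latest = conn.execute("""
    SELECT aa.account_actual_id, a.account_id, a.name, aa.added_date
    FROM account a
    JOIN account_actual aa ON aa.account_actual_id = a.latest_actual_id
    ORDER BY a.account_id
""").fetchall()

for row in latest:
    print(row)
```

The price of this layout is an extra write whenever a newer version is inserted, which is only possible if the schema can be changed.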

    Do not hesitate to store extra information that reduces the number of rows the query has to examine; that is what makes it faster.

    Two such additions are described above, and I can offer a third. You can add a field to account_actual, for example deprecated_date. For the latest row it is NULL (or a date so far in the future that nobody will live to see it); for an older version it is set to the added_date of the next version.

        select a.*, b.*
        from account a
        join account_actual b on b.account_id = a.account_id
                             and b.deprecated_date is null

    • Thanks. Although the 4th variant does not work correctly :) Alas, the structure of the database cannot be changed: it is read-only. I will use option 1 or 3; practice will show which one is faster. - Ninazu 5:03

    Performance is extremely important: the tables hold millions of rows.

    In this case it is enough to set aside the usual paradigm and look for a solution starting from the task itself.

    You get maximum performance with the following setup:

    • One table stores all versions; its primary key is the entity identifier plus the entity version.
    • A second table stores only the current versions; it is absolutely identical, except that its primary key consists of the entity identifier alone.

    After that, you use the first table for analytics and auditing, and run all production queries against the second table. No joins or anything else are involved, so it is hard to come up with something faster: instead of optimizing the query, you optimize the storage scheme itself.
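A minimal sketch of that two-table layout, under stated assumptions: the table names `account_version` and `account_current`, the helper `save_version`, and the choice to keep `name` in the versioned row are all hypothetical, and Python's built-in sqlite3 stands in for MySQL. The write path appends to the history and refreshes the current row only when the incoming version is newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Full history: one row per (entity, version).
    CREATE TABLE account_version (
        account_id INTEGER NOT NULL,
        added_date TEXT    NOT NULL,
        name       TEXT,
        PRIMARY KEY (account_id, added_date)
    );
    -- Current state only: same columns, keyed by the entity alone.
    CREATE TABLE account_current (
        account_id INTEGER PRIMARY KEY,
        added_date TEXT NOT NULL,
        name       TEXT
    );
""")


def save_version(account_id, added_date, name):
    """Append to the history; refresh the current row if this version is newer."""
    with conn:  # both writes commit or roll back together
        conn.execute(
            "INSERT INTO account_version VALUES (?, ?, ?)",
            (account_id, added_date, name),
        )
        row = conn.execute(
            "SELECT added_date FROM account_current WHERE account_id = ?",
            (account_id,),
        ).fetchone()
        # ISO-formatted dates compare correctly as strings.
        if row is None or row[0] < added_date:
            conn.execute(
                "INSERT OR REPLACE INTO account_current VALUES (?, ?, ?)",
                (account_id, added_date, name),
            )


# Versions arrive out of date order, as in the question's test data.
for args in [
    (2, "2016-11-01", "First"), (3, "2016-11-01", "Second"),
    (2, "2016-11-03", "First"), (3, "2016-11-03", "Second"),
    (2, "2016-11-02", "First"), (3, "2016-11-02", "Second"),
]:
    save_version(*args)

# Production reads touch only the small table, with no join.
current = conn.execute(
    "SELECT account_id, name, added_date FROM account_current ORDER BY account_id"
).fetchall()
history_rows = conn.execute("SELECT COUNT(*) FROM account_version").fetchone()[0]
```

In MySQL the refresh step would more likely be written with `INSERT ... ON DUPLICATE KEY UPDATE`; the date check matters because versions are not guaranteed to arrive in chronological order.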

    • I agree, but I described only the tip of the iceberg. In practice the user can also select slices, for example the last three days, or some range in the middle. So most likely there is no getting around added_date in the conditions. - Ninazu
    • @Ninazu that still fits into one table - etki
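The slices mentioned in the comments can be sketched against the question's original read-only schema by bounding the date inside the max-date subquery. This is only one possible reading of the requirement; the helper name `latest_in_range` is hypothetical, and Python's built-in sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        name       TEXT
    );
    CREATE TABLE account_actual (
        account_actual_id INTEGER PRIMARY KEY,
        account_id        INTEGER NOT NULL REFERENCES account (account_id),
        added_date        TEXT    NOT NULL,
        UNIQUE (account_id, added_date)
    );
    INSERT INTO account VALUES (2, 'First'), (3, 'Second');
    INSERT INTO account_actual VALUES
        (1, 2, '2016-11-01'), (2, 3, '2016-11-01'),
        (3, 2, '2016-11-03'), (4, 3, '2016-11-03'),
        (5, 2, '2016-11-02'), (6, 3, '2016-11-02');
""")


def latest_in_range(date_from, date_to):
    """Latest version of each account among versions dated within the range."""
    return conn.execute("""
        SELECT b.account_actual_id, a.account_id, a.name, b.added_date
        FROM account a
        JOIN account_actual b ON b.account_id = a.account_id
        WHERE b.added_date = (
            SELECT MAX(c.added_date)
            FROM account_actual c
            WHERE c.account_id = b.account_id
              AND c.added_date BETWEEN ? AND ?
        )
        ORDER BY a.account_id
    """, (date_from, date_to)).fetchall()


middle = latest_in_range("2016-11-01", "2016-11-02")
whole_month = latest_in_range("2016-11-01", "2016-11-30")
```

Accounts with no version inside the range simply drop out, because the subquery yields NULL for them. The subquery filters on account_id and added_date together, which is the column order of the existing unique index `u`.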

    Then two joins are needed:

        select account_actual_id, t.*, name
        from (select account_id, max(added_date) added_date
              from `account_actual`
              group by 1) t
        join `account` using(account_id)
        join `account_actual` aa on aa.account_id = t.account_id
                                and aa.added_date = t.added_date;

    Other ways are described in this article.

    • Thanks for the article. As for going by account_actual_id: no. - Ninazu
    • Is the latest version the one with the larger account_actual_id or the later added_date? - retvizan
    • We need to go by added_date. - Ninazu
    • Then two joins; I have updated the answer. - retvizan
    • By the way, INDEX account_id (account_id) is redundant: the unique index u already starts with account_id. - retvizan