There is a large dataset (50 million rows) from which I need to fetch the 10 records with the most recent dates (and then load the next 10, and so on). The difficulty is that the data is divided into 10 sections (each holding from 2-3 to 10 million records), and only records from the requested section should be returned.

SELECT * FROM posts WHERE type=1 ORDER BY updated DESC LIMIT 10; 

For this query, EXPLAIN shows that MySQL first selects the several million rows of the given type and only then sorts them by date.

The table has a composite index (type, updated), but it does not help much: MySQL uses only its first part.
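For reference, a minimal sketch of such an index (the name idx_type_updated is my own; the question does not name it):

ALTER TABLE posts ADD INDEX idx_type_updated (type, updated);

With this index MySQL can, in principle, read the rows of a given type already ordered by updated, scanning the index backwards for the DESC order.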

The same happens when displaying subsequent pages:

 SELECT * FROM posts WHERE type=1 AND updated<'offset' ORDER BY updated DESC LIMIT 10; 

How to optimize this query?

  • Strange; MySQL can do top-N sorting by index. What version of the DBMS is it? Is it InnoDB? If you select not everything but only the primary key, does it behave the same way? - Small
  • Yes, InnoDB. I tried selecting only the primary key; same result. Actually, MySQL's behavior is quite logical: even when using the index, it first finds all rows of the desired type (several million records) and only then sorts them by date. - Dmitry Maslennikov
  • I suppose the right approach is one where the sort by updated happens first and the 10 needed records (of the appropriate type) are then taken from that ordering, i.e. ORDER BY applied before WHERE; but the question is how to implement it. - Dmitry Maslennikov
  • Knowledgeable people have suggested MySQL cursors; for now I am studying how to use them correctly. - Dmitry Maslennikov
  • Cursors have nothing to do with it; they just save a bit of memory when processing a query that returns many rows. Reading a constant prefix from a B-tree and then taking the last N records within it is an elementary operation; even MySQL should be able to do a backward index scan. dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html - Small

1 answer

In the end I settled on a solution at the application-logic level. The main table always holds the 0.5-1 million most recent records, while the rest are moved to an archive; an archiver script transfers old records to the archive once a day. Users are first served only the most recent data from the main table (which covers 99% of requests), and only if that is not enough is the archive searched.
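A minimal sketch of such an archiver, assuming an archive table posts_archive with the same structure as posts (the 30-day cutoff is illustrative; the answer only specifies keeping the 0.5-1 million most recent records in the main table):

-- compute the cutoff once so the copy and the delete agree exactly
SET @cutoff = NOW() - INTERVAL 30 DAY;
START TRANSACTION;
INSERT INTO posts_archive SELECT * FROM posts WHERE updated < @cutoff;
DELETE FROM posts WHERE updated < @cutoff;
COMMIT;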

PS In addition, once the user starts scrolling the feed, I fetch not 10 but 50 records per request. The client renders the first batch of 10 from the response right away, keeps the remaining 4 batches of 10 in an array, and outputs them as needed; when that client-side data runs out, a new request is made. I don't render all 50 at once so as not to slow down the browser (there is a lot of graphics). VK and a number of other large sites do something similar.
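The fetch-50 query has the same shape as the paging query from the question, just with a larger LIMIT (here ? stands for the updated value of the last record already shown):

SELECT * FROM posts WHERE type=1 AND updated < ? ORDER BY updated DESC LIMIT 50;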

PPS The final solution at the MySQL level is as follows. I manually tested queries against both a single-column index and a composite one, forcing each explicitly with USE INDEX. The best performance came from the composite index (for example, (type, updated)), whose use is hard-coded in the application for each specific kind of query (left to itself, MySQL sometimes picks a less efficient index). I was surprised by the discrepancy between EXPLAIN output and actual performance: the single-column index, for which EXPLAIN shows rows = 10, worked hundreds of times slower than the composite index, for which EXPLAIN reports several million rows.
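The forced-index variant of the original query might look like this (the index name idx_type_updated is again my own):

SELECT * FROM posts USE INDEX (idx_type_updated) WHERE type=1 ORDER BY updated DESC LIMIT 10;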

PPPS In short, the root problem was MySQL choosing the wrong index for the query: in some cases it used only a single-column index where the composite one was better, and in others it searched two indexes at once and merged the results. After specifying USE INDEX manually for each particular case, performance increased many times over.