There is an InnoDB table consisting of three columns.

    +------------+---------------------+------+-----+-------------------+-----------------------------+
    | Field      | Type                | Null | Key | Default           | Extra                       |
    +------------+---------------------+------+-----+-------------------+-----------------------------+
    | url_id     | bigint(20) unsigned | NO   | PRI | 0                 |                             |
    | visitor_id | bigint(20) unsigned | YES  | MUL | NULL              |                             |
    | visit_time | timestamp           | NO   | MUL | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
    +------------+---------------------+------+-----+-------------------+-----------------------------+

The table has over 2 million rows. I need to select the rows with visit_time > "2016-01-01". All columns are indexed.

The problem is that filtering by visit_time takes a very long time.

    mysql> SELECT COUNT(lu1.visitor_id)
        -> FROM log_url as lu1
        -> WHERE lu1.visit_time > "2016-01-01";
    +-----------------------+
    | COUNT(lu1.visitor_id) |
    +-----------------------+
    |                787719 |
    +-----------------------+
    1 row in set (2,71 sec)

Filtering by visitor_id instead is much faster.

    mysql> SELECT COUNT(lu1.visitor_id)
        -> FROM log_url as lu1
        -> WHERE lu1.visitor_id > 600000;
    +-----------------------+
    | COUNT(lu1.visitor_id) |
    +-----------------------+
    |                787719 |
    +-----------------------+
    1 row in set (0,25 sec)

Can someone tell me what the problem is?

UPD: I forgot to attach the EXPLAIN output at first. It turns out that the query filtering by date does not use an index, even though the index exists.

    mysql> EXPLAIN
        -> SELECT COUNT(lu1.visitor_id)
        -> FROM log_url as lu1
        -> WHERE lu1.visitor_id > 600000;
    +----+-------------+-------+-------+---------------+------------+---------+------+---------+--------------------------+
    | id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows    | Extra                    |
    +----+-------------+-------+-------+---------------+------------+---------+------+---------+--------------------------+
    |  1 | SIMPLE      | lu1   | range | visitor_id    | visitor_id | 9       | NULL | 1020864 | Using where; Using index |
    +----+-------------+-------+-------+---------------+------------+---------+------+---------+--------------------------+

visit_time:

    mysql> EXPLAIN
        -> SELECT COUNT(lu1.visitor_id)
        -> FROM log_url as lu1
        -> WHERE lu1.visit_time > "2016-01-01";
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
    |  1 | SIMPLE      | lu1   | ALL  | visit_time    | NULL | NULL    | NULL | 2041728 | Using where |
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------+

  • Run EXPLAIN first, then there is something to reason about. - D-side
  • @D-side, I've updated the question. - danil

1 answer

In both cases, whether you filter by visitor_id or by date, you ask for COUNT(visitor_id). The COUNT() function has to count all NOT NULL values in the column passed to it. So when you filter by visitor_id > n, the optimizer sees that it can do a range scan on the visitor_id index and take the values to feed into COUNT from that same index. It therefore decides to count the records from the index alone, without even touching the data blocks (this is what 'Using index' in the execution plan indicates).

When you filter by date (visit_time > n) while still asking for COUNT(visitor_id), the optimizer sees that the index on the date is not enough, because it does not contain visitor_id. To run the query through that index, it would have to find each matching entry in the index and then look into the data block where visitor_id is stored. The optimizer knows the table is large and estimates that the condition matches many rows, so it would have to go to the index and then to a data block many times over. Reading the data blocks for all 2 million records sequentially is much faster than walking 700k index entries and fetching a data block for each one. As a result, the optimizer makes the only right decision: a full table scan, without using the index.
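
One way to check this reasoning: if the date index also carried visitor_id, the count could be answered from the index alone. A minimal sketch, assuming you are free to add a composite index to log_url. The index name idx_visit_time_visitor is hypothetical (it is not in the question), and whether the optimizer actually picks it depends on the data and the MySQL version:

```sql
-- Hypothetical composite index (name is an assumption):
-- visit_time first for the range condition, visitor_id second for the counted column.
ALTER TABLE log_url
    ADD INDEX idx_visit_time_visitor (visit_time, visitor_id);

-- Re-check the plan for the slow query from the question.
EXPLAIN
SELECT COUNT(lu1.visitor_id)
FROM log_url AS lu1
WHERE lu1.visit_time > '2016-01-01';
-- If the new index is chosen, the Extra column should contain "Using index".
```

The extra index costs disk space and slows writes a little, so it is only worth keeping if this kind of query is frequent.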

When you only need the number of rows, you should always write count(1) rather than naming a specific column. Then you can be sure that the optimizer bases its plan only on the other, more important details of the query, and not on fetching values just to count them. count(column) should be used in only one case: when you really want to count not the number of records, but the number of records with a NOT NULL value in that column.
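
Applied to the query from the question, that advice might look like the sketch below. No output is shown, because the plan and timings depend on the data; the column aliases all_rows and rows_with_visitor are made up for illustration:

```sql
-- Count rows rather than non-NULL visitor_id values,
-- so the optimizer does not need visitor_id to answer the query.
EXPLAIN
SELECT COUNT(1)
FROM log_url AS lu1
WHERE lu1.visit_time > '2016-01-01';

-- The two counts are not interchangeable in general:
-- COUNT(1) counts every matching row, COUNT(column) skips rows where the column is NULL.
SELECT COUNT(1)               AS all_rows,
       COUNT(lu1.visitor_id)  AS rows_with_visitor
FROM log_url AS lu1
WHERE lu1.visit_time > '2016-01-01';
```

Since visitor_id is declared nullable in this table (Null = YES), the two numbers agree only when no matching row has a NULL visitor_id.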

  • Thanks for your reply. It explains this situation, but afterwards I started trying variations of this query, in particular with GROUP BY and SELECT *, and now I have many more questions :) - danil
  • @danil Keep playing with the condition and just look at EXPLAIN. The optimizer is fairly smart. If you take the same count(id) where dt > '2016-04-01', it may well use the index; if it does not, try a value even closer to the current date, or add a LIMIT. At some point it may well conclude that not many records are expected and use the index :) - Mike
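
A sketch of the experiment suggested in the last comment, written against the real column names. The narrower date is the one from the comment; the FORCE INDEX variant is an addition for comparison only and assumes the date index is named visit_time, as possible_keys in the EXPLAIN output suggests:

```sql
-- Narrower range: fewer matching rows, so an index range scan becomes more attractive.
EXPLAIN
SELECT COUNT(lu1.visitor_id)
FROM log_url AS lu1
WHERE lu1.visit_time > '2016-04-01';

-- For comparison only: force the date index on the original range and time it
-- against the full scan the optimizer chose on its own.
SELECT COUNT(lu1.visitor_id)
FROM log_url AS lu1 FORCE INDEX (visit_time)
WHERE lu1.visit_time > '2016-01-01';
```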