There is a table:

    create table log (
        id      serial not null constraint log_pkey primary key,
        date    timestamp with time zone default now() not null,
        user_id integer constraint log_user_id_fkey references "user" on delete set null,
        ...
    );

    create index log_date on log (date DESC);

When selecting from this table alone:

    SELECT * FROM "log" WHERE date > '1.1.2018' ORDER BY date DESC;

the sort is done via the index:

    Index Scan using log_date on log  (cost=0.29..2330.05 rows=35689 width=341) (actual time=0.024..16.422 rows=35782 loops=1)
      Index Cond: (date > '2018-01-01 00:00:00+03'::timestamp with time zone)
    Planning time: 0.589 ms
    Execution time: 18.088 ms

But with a join:

    SELECT * FROM "log" JOIN "user" ON "user".id = log.user_id WHERE date > '1.1.2018' ORDER BY log.date DESC;

the sort no longer uses the index:

    Sort  (cost=3805.74..3894.96 rows=35689 width=509) (actual time=22.159..24.455 rows=13064 loops=1)
      Sort Key: log.date DESC
      Sort Method: quicksort  Memory: 9573kB
      ->  Merge Join  (cost=0.43..1107.08 rows=35689 width=509) (actual time=0.012..9.642 rows=13064 loops=1)
            Merge Cond: (u.id = log.user_id)
            ->  Index Scan using user_pkey on "user" u  (cost=0.14..5.35 rows=59 width=168) (actual time=0.002..0.029 rows=61 loops=1)
            ->  Index Scan using log_user_id on log  (cost=0.29..2451.96 rows=35689 width=341) (actual time=0.006..6.190 rows=13065 loops=1)
                  Filter: (date > '2018-01-01 00:00:00+03'::timestamp with time zone)
    Planning time: 0.564 ms
    Execution time: 25.899 ms

Is this normal behavior? How can I persuade the planner to use the index for sorting?

I could not find any information on this, either in the documentation or on Google.
PostgreSQL 9.6.

UPD. As Mike noted in the comments, a LEFT JOIN solves the problem.
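The rewritten query itself is not shown in the update; judging by the plan that follows, it would presumably be the same query with JOIN replaced by LEFT JOIN:

    -- Presumed rewrite (not shown in the original update):
    -- the same query, with the plain JOIN replaced by a LEFT JOIN
    SELECT * FROM "log" LEFT JOIN "user" ON "user".id = log.user_id
    WHERE date > '1.1.2018' ORDER BY log.date DESC;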

    Sort  (cost=4816.65..4902.76 rows=34443 width=523) (actual time=48.332..55.717 rows=35000 loops=1)
      Sort Key: log.date DESC
      Sort Method: quicksort  Memory: 21961kB
      ->  Hash Left Join  (cost=3.33..2221.04 rows=34443 width=523) (actual time=0.063..19.772 rows=35000 loops=1)
            Hash Cond: (log.user_id = "user".id)
            ->  Seq Scan on log  (cost=0.00..1957.54 rows=34443 width=347) (actual time=0.008..9.118 rows=35000 loops=1)
                  Filter: (date > '2018-01-01 00:00:00+03'::timestamp with time zone)
            ->  Hash  (cost=2.59..2.59 rows=59 width=168) (actual time=0.030..0.030 rows=61 loops=1)
                  Buckets: 1024  Batches: 1  Memory Usage: 21kB
                  ->  Seq Scan on "user"  (cost=0.00..2.59 rows=59 width=168) (actual time=0.004..0.013 rows=61 loops=1)
    Planning time: 0.967 ms
    Execution time: 62.681 ms

In this particular query the optimizer turned out to be right, but on the production query the rewrite helped dramatically. Mike's remark that only one index per table can be used within a single query was also helpful.

  • From the plan it is clear that the optimizer considered it cheaper to access the log table via the log_user_id index. Two indexes on the same table cannot be used within the same query, so no index is left for the sort. Whether the optimizer is right here is hard to say. Try a LEFT JOIN; there is some chance the optimizer will then abandon the merge join. – Mike
  • @Mike, thanks, that helped. Where can I find information about only one index per table being usable in a query? And what should I do with the question now? Or write it up as an answer and I will close the question the normal way. – jonsbox

1 answer

The execution plans show that in the first case, with a single table, the optimizer applies the only sensible strategy here: fetch the rows by date using the matching index. And since that index is used anyway, the optimizer can skip the sort, because the rows already arrive in the required order.

In the second case the optimizer chooses a merge join. With this join type, PostgreSQL walks both tables in parallel over rows sorted in the same order and matches them against each other. To do that, it reads the log rows in user_id order right away, via the index on that column. Since two indexes on the same table cannot be used within the same scan, the optimizer has to re-sort the result into the required order, i.e. by date.
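As a side experiment (not part of the original answer), PostgreSQL lets you disable individual plan node types per session, which is a handy way to check whether the planner's preference for the merge join is actually justified:

    -- Experiment: temporarily forbid merge joins in this session,
    -- re-run the query under EXPLAIN ANALYZE, and compare timings.
    SET enable_mergejoin = off;
    EXPLAIN ANALYZE
    SELECT * FROM "log" JOIN "user" ON "user".id = log.user_id
    WHERE date > '2018-01-01' ORDER BY log.date DESC;
    RESET enable_mergejoin;

This is a debugging tool, not a production fix: the enable_* settings steer the whole session's planning, not a single join.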

Using LEFT JOIN instead of a plain JOIN hints to the optimizer that one of the tables may lack matching rows, while the rows from the "left" table are still needed. In that case the merge join is judged an unprofitable strategy, and the optimizer goes back to fetching rows by date from log and, for each one individually, looking up the match in the user table. Since the index on date is in use again, it can also stand in for the sort.

As the updated question shows, even though the plain join requires a final sort of the result set, the merge join itself saves so much time that it more than pays for that sort: the query runs in 25 ms instead of the 62 ms needed for the many individual lookups of the same rows in the LEFT JOIN case. So in this case the optimizer fully lives up to its name and really does choose the better way to solve the problem. When optimizing queries, never rely on rules of thumb like "sorting is always bad" and avoid it at any cost. Try several variants and pick the one with the best total execution time.
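That comparison can be done directly with EXPLAIN ANALYZE; a minimal sketch, reusing the tables from the question (run each variant several times so caching does not skew the numbers):

    -- Variant 1: plain join (merge join + final sort in the plans above)
    EXPLAIN ANALYZE
    SELECT * FROM "log" JOIN "user" ON "user".id = log.user_id
    WHERE date > '2018-01-01' ORDER BY log.date DESC;

    -- Variant 2: left join (different join strategy)
    EXPLAIN ANALYZE
    SELECT * FROM "log" LEFT JOIN "user" ON "user".id = log.user_id
    WHERE date > '2018-01-01' ORDER BY log.date DESC;

Compare the reported "Execution time" lines rather than guessing from the plan shape alone.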

P.S. In response to the comment "Where can I find information about only one index per table being usable in a query?":

That is harder. The documentation of some DBMS probably mentions it in passing; whether anything similar exists for PostgreSQL, I do not know. But all DBMSs operate on similar principles.

In such cases I prefer to imagine what I would do in the DBMS's place if I had to do the work by hand. Imagine we have two dictionaries (indexes) over the same word list: in one the words are ordered alphabetically (with each word's frequency noted next to it), in the other they are ordered by frequency. And there is a third dictionary, with a slightly different set of words, also alphabetical. We need to find the words that appear in both word lists and output them in order of frequency. We decide it is convenient to read the two alphabetical dictionaries simultaneously: having seen "watermelon" in one dictionary, we only need to glance at the other dictionary, already open at the letter "W", to see whether it is there (this is the merge join). We write out all the matched words with their frequencies on a sheet of paper and, once the dictionaries have been fully compared, sort the final set by hand. The dictionary whose words are already sorted by frequency cannot help us: to find "watermelon" in it, we would have to re-read the entire list, since the word might turn out to be somewhere near the end. So using the second index is practically impossible.

Suppose instead we start from the frequency-ordered dictionary and compare it against the third one. Then of course we get the words in the order we need from the start, and no sorting is required. But the lookups themselves become harder. We take the first word from the frequency dictionary, say "train", leaf through the alphabetical dictionary to the letter "T", and find it. The next word is "milk", so we leaf back to "M"... This is roughly what happens when we used the LEFT JOIN (well, a little faster; there is a table of contents :)).