SQL query optimization in Postgres

Question

there is a request to build routes from flights

WITH RECURSIVE tree AS ( SELECT f0.id::text AS flights, '/'||origin_id||'/'||destination_id||'/' AS route, '/'||airline||'/' AS airlines, '/'||invisible||'/' AS invisible, origin_id, destination_id, std AS departure, sta AS arrival, 0 as hops, (sta - std)::INTERVAL AS total_time, '00:00'::INTERVAL AS waiting_time, available AS available FROM flights f0 INNER JOIN airports a ON f0.origin_id = a.id AND NOT a.cargo_accepted = false WHERE origin_id = '13' AND airline IN ('LH', 'OK', 'QS', 'U6') AND std >= '2017-06-02 06:11:02 UTC' AND NOT f0.available = false UNION ALL SELECT s.flights || '-->' || f1.id::text AS flights, s.route||f1.destination_id||'/', s.airlines||f1.airline||'/', s.invisible||f1.invisible||'/', s.origin_id, f1.destination_id, s.departure AS departure, f1.sta AS arrival, s.hops + 1 AS hops, s.total_time + (f1.sta - f1.std)::INTERVAL AS total_time, s.waiting_time + (f1.std - s.arrival)::INTERVAL AS waiting_time, s.available FROM tree s JOIN flights f1 INNER JOIN airports a1 ON f1.origin_id = a1.id AND NOT a1.cargo_accepted = false AND NOT a1.transit = false ON f1.origin_id = s.destination_id AND f1.std >= s.arrival + '8 hour' AND f1.std <= s.arrival + '7 day' AND NOT s.route LIKE '%/'||f1.destination_id||'/%' AND s.hops < 4 AND NOT s.available = false AND NOT f1.available = false AND f1.airline IN ('LH', 'OK', 'QS', 'U6') ) SELECT * FROM tree INNER JOIN airports a3 ON destination_id = a3.id AND NOT a3.cargo_accepted = false WHERE destination_id = '112' AND waiting_time < '25 day' AND arrival <= '2017-06-27 06:00:28 UTC' AND airlines LIKE '%U6%' ORDER BY arrival, departure, waiting_time

the number of flights grows and the query starts to think longer and longer at the moment in the database of 64105 flights, as though this is trivial things for a DB, but the request takes about 12-16 seconds.

Help me figure out what is wrong?

did the analysis https://gist.github.com/sanyco86/26f07fbf567e09639cc75ac7b557e660

Well, actually, if someone wants to delve into the problem of test data http://beta.u6-cargo.com/assets/Postgtes_backup.sql

and by the way, would you be able to give some initial query conditions on which the results will be not only a direct flight but also with some sort of transfers
(the arrival date defaults to the departure date + 25 days) this is the maximum period, all the rest is the default values

Mike mike 38.7k one 25 62 · Accepted Answer · 2017-06-09T23:11:15

The main problem of this request is to quickly replicate the number of records processed during recursion. The main and practically the only way to deal with this is as much as possible limiting the records selected inside the recursion. And one extra entry in the first step of recursion can generate thousands of entries behind it to step 4 ...

The first thing that catches the eye: the query inside the recursive part selects all flights starting from a specific date, but without limiting the maximum possible date. In this case, after recursion, all flights with a landing date greater than a certain one are discarded. Adding conditions for the maximum date in both parts of the recursive CTE reduces the running time from 12 seconds to 2.5.

The second thing you can do is optimize your work with indexes. Many indices have been created on the table, but by single fields. And the use of several indexes at once on a single table in the query, although it is possible, but does not greatly increase the performance, and sometimes it reduces it. The whole question of the selectivity of an index. If according to the index condition too many records are selected, then work on it only slows down the selection. In order to increase the selectivity, it is worth making composite indices of several fields that select as few records as possible in the query. In your case, you must do the following:

 DROP INDEX "index_flights_on_available"; CREATE INDEX "index_dep_std" ON "flights" USING "btree" ("origin_id", "std", "sta", "destination_id", "available");

The index on the available binary field has a very low selectivity, selecting one half of the entire table from one key at a time. It makes no sense. And in this case, the optimizer used it all the same, combining the results with a choice on a different index, but this somewhat slowed down the query, the cost of the merger was higher than the filtering on it gave. The index being created, as much as possible, covers the needs of this request, since the search for flights begins with a specific departure date by date range, so we include these fields first, first checked for exact equality, then those for the range. Enabling the destination AP and available allows you to check almost all the conditions from the query as you move through the index tree. This operation reduces the request to 1 second.

And finally, one more sentence, the most difficult to implement ... At present, the request goes through all possible combinations of airports from the entire route of the network, but it is obvious that it is useless to check flights through some of the PA. If we need to get from Moscow to Sochi, then fly first to Mirny, which, to put it mildly, is not the other way around and the main thing, which has a very specific network of its flights, is obviously a losing option. However, the recursion considers such options and generates, after one flight to Mirny, a bunch of flights from it. To filter such situations, it is possible to calculate in advance the reachability of any AP when flying through others and in recursion not to go into the branches for which the required destination AP is obviously not reachable for the remaining N flights. Create a reachability table of the AP from others:

 create table approach( orig integer not null, -- Аэропорт вылета dest integer not null, -- Аэропорт назначения hop integer not null, -- Минимальное количество перелетов для достижения constraint approach_pkey primary key(dest,orig) );

Now it needs to be filled, this is the most interesting and difficult part. The fact is that the contents of this table will always need to be kept up to date, as soon as there is a flight connecting airports, between which there was no direct communication between them - we need to recalculate the entire network. Change the fields cargo_accepted and transit in the directory of airports - again recalculate. And so that the table would respond to the current situation, in addition it is necessary to recalculate it, for example, once a day. Recalculation is performed by the following request:

 WITH RECURSIVE Q(o,d,transit) as( select origin_id o, destination_id d, a.transit -- Список пар АП из рейсов from flights f, airports a where std >= localtimestamp - interval '1 day' -- and std <= localtimestamp + interval '60 day' and a.id=f.origin_id and NOT a.cargo_accepted = false group by origin_id, destination_id, a.transit ), Tree(orig, dest, hop, transit, route) as( select o, d, 1, transit, -- Затравка рекурсивной части '/'||o||'/'||d||'/' AS route from Q union all -- Рекурсия select T.orig, Qd, T.hop+1, T.transit, T.route||Qd||'/' from Tree T, Q where Qo=T.dest and not Q.transit=false and T.route not like '%/'||Qd||'/%' and T.hop<4 ), New(orig,dest,hop) as( -- Таким должно быть содержимое таблицы select orig, dest, min(hop) hop from Tree group by orig, dest union select distinct o,o,0 -- Записи из АП в него самого за 0 перелетов нужны поиску from Q ), Upd(orig,dest) as( -- Обновляем существующие записи update approach a set hop=n.hop from New n where a.orig=n.orig and a.dest=n.dest returning a.orig, a.dest ), Ins(orig, dest) as( -- Вставляем недостающие insert into approach(orig,dest,hop) select orig, dest, hop from New n where (n.orig,n.dest) not in(select orig,dest from Upd) returning orig,dest ) -- И удаляем устаревшие delete from approach where (orig,dest) not in(select orig,dest from New)

Pay special attention to the conditions at the very beginning. In this case, all flights are taken starting from yesterday, so the table can be used only when searching for future flights. If you need to look for what happened in the past (I hope you don’t need it yet), you will have to either change this condition, which will lower the accuracy at the current moment, or not use it when searching in history. Also, if for example the system never needs to search for flights that take off in two months, you can uncover the limit on the maximum date.

Once again I remind you that the table should be kept up to date. To do this, I recommend to make a stored procedure with this request and call it once a day, in the trigger on the airports directory and in the trigger on adding flights (no matter how frequent it is, call it only if there is no record in the approach table with a couple of airports from flight with hop field equal to 1 (direct flight)). In principle, it runs 80 ms on the current flight network.

Finally, the final version of the search query:

 WITH RECURSIVE tree AS ( SELECT f0.id::text AS flights, '/'||origin_id||'/'||destination_id||'/' AS route, '/'||airline||'/' AS airlines, '/'||invisible||'/' AS invisible, origin_id, destination_id, std AS departure, sta AS arrival, 0 as hops, (sta - std)::INTERVAL AS total_time, '00:00'::INTERVAL AS waiting_time, available AS available FROM flights f0 INNER JOIN airports a ON f0.origin_id = a.id AND NOT a.cargo_accepted = false WHERE origin_id = 13 AND airline IN ('LH', 'OK', 'QS', 'U6') AND std >= '2017-06-02 06:11:02 UTC' and sta <= '2017-06-27 06:00:28 UTC' AND NOT f0.available = false UNION ALL SELECT s.flights || '-->' || f1.id::text AS flights, s.route||f1.destination_id||'/', s.airlines||f1.airline||'/', s.invisible||f1.invisible||'/', s.origin_id, f1.destination_id, s.departure AS departure, f1.sta AS arrival, s.hops + 1 AS hops, s.total_time + (f1.sta - f1.std)::INTERVAL AS total_time, s.waiting_time + (f1.std - s.arrival)::INTERVAL AS waiting_time, s.available FROM tree s JOIN airports a1 ON s.destination_id = a1.id AND NOT a1.cargo_accepted = false AND NOT a1.transit = false JOIN flights f1 ON f1.origin_id = s.destination_id AND f1.std >= s.arrival + '8 hour' AND f1.std <= s.arrival + '7 day' AND NOT f1.available = false JOIN approach app ON app.dest=49 -- <-- Заменить на требуемый АП назначения and app.hop<4-1-s.hops -- <-- 4 - та же, что в условии s.hops<4 and f1.destination_id=app.orig WHERE NOT s.route LIKE '%/'||f1.destination_id||'/%' AND s.hops < 4 AND NOT s.available = false AND f1.airline IN ('LH', 'OK', 'QS', 'U6') and f1.sta <= '2017-06-27 06:00:28 UTC' ) SELECT * FROM tree INNER JOIN airports a3 ON destination_id = a3.id AND NOT a3.cargo_accepted = false WHERE destination_id = 49 AND waiting_time < '25 day' AND arrival <= '2017-06-27 06:00:28 UTC' AND airlines LIKE '%U6%' ORDER BY arrival, departure, waiting_time

Unfortunately, on routes that really require several flights, the speed decreased only to 500-600ms (like 13-49 in the example). But on the flight 13-112, the runtime was reduced to 11 ms (i.e., 1000 times as compared with the variant from the question). I recommend to compare just in case the results after the execution of real requests without using approach and with it. In principle, I did not notice the difference in the number of records in several test directions.

Ruslan Masgutov Ruslan Masgutov 714 2 eight · Answer 2 · 2017-06-06T08:59:33

I recommend taking a look at the postgrestpro performance documentation.

According to your query plan, it is clear that it is not optimized at all, too many FILTER operations (it is very expensive) indicates that either there are no necessary indices or not correctly constructed (there are several indices on separate columns, although a multi-column index is needed).

If this data is only inserted (DELETE or UPDATE operations are rare) and they are needed for simple display, I recommend making an additional flat table (without joins) where data will be written when inserted into the main one - this will work much faster.

SQL query optimization in Postgres

2 answers 2

More articles: