There is a table of the form:

    CREATE SEQUENCE history_id_seq;

    CREATE TABLE history (
        id    INT8 DEFAULT nextval('history_id_seq') NOT NULL
              CONSTRAINT history_key PRIMARY KEY,
        objid TEXT NOT NULL,
        descr TEXT,
        value FLOAT4,
        time  TIMESTAMP
    );

    ALTER TABLE public.history OWNER TO postgres;

    CREATE INDEX idx_history_objid ON history USING BTREE (objid);
    CREATE INDEX idx_history_time  ON history USING BTREE (time);

    INSERT INTO history (objid, descr, value, time) VALUES
        ('a', 'a_descr1', 0, '2016-06-23 00:00:00'),
        ('b', 'b_descr1', 1, '2016-06-23 00:00:01'),
        ('c', 'c_descr1', 2, '2016-06-23 00:00:02'),
        ('a', 'a_descr2', 3, '2016-06-23 00:00:03');

I need to select the unique identifiers (objid) together with the most recent description (descr) for each, since descriptions may change over time:

     objid |  descr
    -------+----------
     a     | a_descr2
     b     | b_descr1
     c     | c_descr1

The required result is produced by this query:

    SELECT DISTINCT ON (objid) objid, descr, time
    FROM history
    ORDER BY objid, time DESC;
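(A side note: a composite index matching the sort order lets PostgreSQL read the rows already ordered for this DISTINCT ON, although it still has to visit every row, since PostgreSQL has no index skip scan, so at this scale the query stays slow. A sketch, with a made-up index name:

    -- matches ORDER BY objid, time DESC; verify the effect with EXPLAIN ANALYZE
    CREATE INDEX idx_history_objid_time ON history (objid, time DESC);

)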

But with millions of rows this takes very long. Quickly selecting just the unique objid values turned out to be possible with a recursive CTE:

    WITH RECURSIVE t(i) AS (
        SELECT min(objid) FROM history
        UNION
        SELECT (
            SELECT objid
            FROM history
            WHERE objid > i
            ORDER BY objid
            LIMIT 1
        )
        FROM t
        WHERE i IS NOT NULL
    )
    SELECT * FROM t;

but it is not clear how to pull in the descriptions as well. I would be grateful for any ideas.
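One direction that suggests itself (a sketch of the same loose-index-scan idea, not verified on the real data): replace the scalar subquery with a LATERAL join, so that each recursion step returns the whole latest row for the next objid rather than just the key. It assumes the composite index on (objid, time DESC) from above and uses UNION ALL, since the steps cannot produce duplicates:

    WITH RECURSIVE t AS (
        (   -- latest row of the smallest objid
            SELECT objid, descr, time
            FROM history
            ORDER BY objid, time DESC
            LIMIT 1
        )
        UNION ALL
        SELECT n.objid, n.descr, n.time
        FROM t
        CROSS JOIN LATERAL (
            -- latest row of the next objid after the current one;
            -- recursion stops when no such objid exists
            SELECT objid, descr, time
            FROM history
            WHERE objid > t.objid
            ORDER BY objid, time DESC
            LIMIT 1
        ) n
    )
    SELECT objid, descr FROM t;

With the composite index each step is a single index descent, so the cost should grow with the number of distinct objid values rather than the number of rows.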

  • On average, how many unique objid values are there in the whole set, in percentage terms? - Viktorov
  • In practice, a minimal, almost constant number, tending to 0% (300 out of 60 million at the moment, plus about 4 million rows per day). - Andrey M
  • Have you considered keeping a separate dictionary table at all? You could also try getting the unique set separately and then fetching the description for each unique value by index in separate subqueries. - Viktorov
  • There was an idea of a dictionary table maintained by triggers (see the sketch after these comments), but the write speed would drop, which is critical. The second option you propose is implemented (it will be needed rarely enough), I forgot to mention it, but I wanted a more elegant solution. - Andrey M
  • In a nutshell: on large volumes a query like SELECT h1.objid, h1.descr FROM history AS h1 JOIN (SELECT objid, MAX(time) AS time FROM history GROUP BY objid) AS h2 ON h2.objid = h1.objid AND h2.time = h1.time ORDER BY h1.objid, h1.time DESC runs 3-4 times faster than your original one. But it is still slow, yes. - Yaant
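For illustration, the trigger-maintained dictionary mentioned in the comments above could look roughly like this; a sketch with hypothetical names (history_latest, history_latest_upsert), relying on ON CONFLICT, which is available from PostgreSQL 9.5. As noted, it taxes every INSERT into history:

    CREATE TABLE history_latest (
        objid TEXT PRIMARY KEY,
        descr TEXT,
        time  TIMESTAMP
    );

    CREATE FUNCTION history_latest_upsert() RETURNS trigger AS $$
    BEGIN
        -- keep only the newest row per objid
        INSERT INTO history_latest (objid, descr, time)
        VALUES (NEW.objid, NEW.descr, NEW.time)
        ON CONFLICT (objid) DO UPDATE
            SET descr = EXCLUDED.descr, time = EXCLUDED.time
            WHERE history_latest.time < EXCLUDED.time;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_history_latest
    AFTER INSERT ON history
    FOR EACH ROW EXECUTE PROCEDURE history_latest_upsert();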

1 answer

Try this:

    SELECT *
    FROM history h
    JOIN (
        SELECT objid, max(time) AS _time
        FROM history
        GROUP BY objid
    ) t ON h.objid = t.objid
       AND h.time = t._time
    ORDER BY h.objid;
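One caveat worth adding: if two rows ever share the same (objid, time), the join above returns both. A window-function variant avoids that and is a common alternative (a sketch; relative performance depends on the indexes and the data):

    SELECT objid, descr
    FROM (
        SELECT objid, descr,
               row_number() OVER (PARTITION BY objid ORDER BY time DESC) AS rn
        FROM history
    ) s
    WHERE rn = 1;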

UPD: In general, of course, a history table like this is not meant for regularly extracting a full snapshot as the main working data set, given the total volume compared to the rate of change. A history table is meant for looking at the changes of a single object, or at changes over a short period, such as the last hour...

  • You are right about the history table: the snapshot will only occasionally be needed to refresh the list, and will then be saved separately to the file system. Your example gave acceptable results on a test base of 2 million rows with 10 unique objid values; later I will test it on something bigger. Thanks. - Andrey M