SQL - how to remove duplicate records in the database?

Question

There is such a status plate of elephants, that is, an elephant can have several statuses of its location.

status: id int auto_increment pk slon_id int status varchar date datetime

When a software bug (or rather a flaw) appeared duplicate records of the form

 1 1 Розовый слон в Африке 18.11.2016... 2 1 Розовый слон в Африке 18.11.2016... 3 1 розовый слон уже в Зимбабве 19.11.2016... 4 2 Зеленый слон в Европе 15.11.2016...

Interested in SQL solution. Is it possible to somehow remove duplicate records? Is it possible to remove duplicate records leaving one by one in order not to lose history? (That is, if there are several entries about Africa and the Pink Elephant - delete all but one).

duplicates on the status and identifier of the elephant, in this case, lines 1 and 2.

Accepted Answer · 2016-11-23T19:13:39

Algorithm of actions:

understand the criteria by which records should be considered identical
write a query to search for duplicate data sets (you can not rows, only data sets)
understand how to decide which of the entries should be left
accordingly, correct the request so that it displays all duplicate records, except the one that should be left
write the query delete where id in (subquery) .

The result is something like this, if we take the uniqfield field, which should be unique, leave a string with a minimum id :

 delete from tablename where id in ( select id from ( select id from tablename join ( select min(id) as firstdup, uniqfield from tablename group by uniqfield ) duplicates using(uniqfield) where id != firstdup ) subqueryhack )

For a combination of two fields so that it can be seen that changes

 delete from tablename where id in ( select id from ( select id from tablename join ( select min(id) as firstdup, uniqfield1, uniqfield2 from tablename group by uniqfield1, uniqfield2 ) duplicates using(uniqfield1, uniqfield2) where id != firstdup ) subqueryhack )

delete in mysql will not allow you to directly delete from the table that the subquery reads, but it costs, if necessary, another subquery.

If it does not matter which lines to leave, and which ones to remove and allow the version of the DBMS, then before mysql 5.7.4 you could just hang a unique index indicating ignore

 ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);

That will remove all duplicates on these fields, except for a single line. In current versions, the ignore behavior is removed and causes an error. He didn’t understand why he was deleted, most likely due to unobvious behavior, which line will be deleted and the general rate of returning to the SQL standard.

Thanks, everything works on the test table, but I couldn’t put it into practice (as I understood later), since different elephants can be in the same place at the same time, and the name of the elephant is not in status, that is, the status is слон в Африке 18.11.2016 may be different.
There is an elephant id, I have not yet mastered requests of this kind.
That's all the difference, I added an example of a query with a search for duplicates in a pair of fields

progmb progmb 700 8 silver marks 33 bronze marks · Answer 2 · 2016-11-24T08:37:05

This option should work. TESTTEST needs to be replaced with your own table name. This request will leave one entry for each slon_id with a maximum date.

 DELETE FROM TESTTEST WHERE id NOT IN ( SELECT MAX(id) FROM (SELECT t.slon_id, MAX(t.[date]) AS [date] FROM TESTTEST t GROUP BY t.slon_id) t JOIN TESTTEST t2 ON t.slon_id = t2.slon_id AND t.[date] = t2.[date] GROUP BY t2.id )

there is no error, it simply finds no duplicate, but simply the maximum ID from the table.
It deletes all entries for this elephant, except for entries with a maximum date and a maximum id.
for one date there can be several entries for the same elephant.

SQL - how to remove duplicate records in the database?

2 answers 2

More articles: