Sql query optimization

Question

Hello.

There is a table of persons.

Structure:

- number - INT (первичный) - fio - TEXT - phone - TEXT - phone_2 - TEXT - active - INT

It is necessary for all non-empty phone_2 fields to check for duplicates with the same fio field and active equal to 1. If there are none, then output number.

In the database there are more than 120.000 records.

I do this:

 SELECT number FROM (SELECT * FROM persons WHERE phone_2 != "") AS a WHERE NOT EXISTS (SELECT * FROM (SELECT * FROM persons WHERE active = 1) AS b WHERE a.fio = b.fio)

The request works 5 hours !!

UPD:

Option 2:

 SELECT number FROM persons AS a WHERE phone_2 != "" AND NOT EXISTS (SELECT * FROM persons AS b WHERE active = 1 AND a.fio = b.fio)

Which option will work faster?

Write the entire structure (including field types, indices, pk).
Is it such a normal execution time, or as they say, they laugh at such a request?) Is the number of subqueries normal?
the number of subqueries is completely abnormal> It is necessary for all non-empty phone_2 fields to check for duplicates with the same fio field and active equal to 1. If they are not there, then print number and if there are, then what should we output?
So if there is, then they will not be displayed, since NOT EXISTS will return TRUE only if there is no match.
@temagr, please describe in detail and thoroughly what you want to receive.
But there is a feeling that you can significantly simplify your query using GROUP BY and aggregate functions, without all this pornography in the form of a heap of nested queries

DreamChild DreamChild 34k 2 thirty 76 · Answer 1 · 2014-04-01T11:11:11

Your table suffers from a bad design. It is a violation of normalization. In particular, storing the name in one column is a violation of the first normal form. In an amicable manner, the full name should be broken down into atomic entities - Surname, First Name and Patronymic. Judging by your words “check for duplicates with the same fio field” , you may have a situation where several records correspond to one person (despite the presence of the primary key number) or you want to identify the person not by the primary key, but by full name, which in itself is bad and wrong. However, for this case it is not so important.

The second problem area is the availability of telephone numbers in the persons table, or rather, a situation where one person can have one phone or two (and theoretically three, four and even eleven, which cannot be displayed in your structure). So, in an amicable way, you should create a Phones table, in which three fields should be stored - a key, PersonId - a foreign key for the persons table, identifying the person and PhoneNumber - a phone number. Thereby you would be able to achieve flexibility in terms of the number of telephone numbers, allowing one person to have any number of numbers - from zero to infinity (at the same time you, besides everything else, are also forced to write NULL or an empty string in the phone_2 field? not)

However, this is all - the subjunctive mood. In the current implementation, the query that you brought suffers from an excessive (no, even glaring) number of subqueries. Of course, this greatly affects productivity (say, to indicate the usual condition phone_2! = "" It is not necessary to do a separate subquery for this, which will go through all 120,000 records again. And so on 120,000 iterations for each of 120,000 records. In short, the number of iterations goes to billions)

In principle, such a query should be reduced to using GROUP BY and hAVING and represent something like this:

 SELECT fio, phone_2 FROM persons WHERE phone_2 != '' AND ACTIVE = 1 GROUP BY fio, phone_2 HAVING COUNT(*) = 1

however, your case is specifically complicated by the fact that you need to get a field that is not in the GROUP BY clause , so an additional query will be required. For example:

 SELECT p.number FROM persons p JOIN ( SELECT fio, phone_2 FROM persons WHERE phone_2 != '' AND ACTIVE = 1 GROUP BY fio, phone_2 HAVING COUNT(*) = 1 ) p2 ON p2.fio = p.fio

I will not argue that it will run in the blink of an eye (to be sure of something, it would be worth trying it on your data), but it should take substantially less than five hours

Answer 2 · 2014-04-01T11:31:37

Concerning normalization it agree. I dig in the old project. By the way, why is there a check in the table p2 for the emptiness of the number? After all, the second phone may not be empty, and the active field does not matter 1. Please see, in this case, whether I have compiled correctly:

 SELECT number FROM persons AS a JOIN ( SELECT fio FROM persons WHERE active = 1 GROUP BY fio HAVING COUNT(*) = 0 ) AS b ON a.fio = b.fio WHERE phone_2 != ''

Just active = 1 per line with non-empty phone_2 exception is very rare. You can simply add a request for a particular case.

In the internal subquery, we find all not the same fio and phone_2, for which ACTIVE = 1. Next, through JOIN, we combine all fields and again indicate that ACTIVE = 1.> SELECT a.number FROM persons a> JOIN (> SELECT fio, phone_2> FROM persons> WHERE phone_2! = '' AND ACTIVE = 1> GROUP BY fio, phone_2> HAVING COUNT (*) = 1>) b ON a.fio = b.fio AND a.phone_2 = b.phone_2> WHERE a. ACTIVE = 1

Sql query optimization

2 answers 2

More articles: