I wrote a parser that pulls ads from a real estate forum every 30 minutes and writes them to a MySQL database. The database has now accumulated many duplicate records. So I created a table that stores a uniq field for each ad, holding a unique value such as showthread.php?s=4ad705f976ce73fb739b76820a3a573f&t=1485914 (the last 7 digits are unique to each ad!). What is the best way to check ads for uniqueness now? Thanks in advance for any help.
4 answers
For a start, read the tutorials on UNIQUE KEY and INSERT IGNORE ...
Look into ON DUPLICATE KEY.
And what's stopping you from simply storing the ad ID from the URL?
showthread.php?s=4ad705f976ce73fb739b76820a3a573f&t=1485914
Here t= is most likely the ad's ID, which is unique. Check against that.
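A minimal sketch of pulling that ID out of a link, assuming the parser is free to use Python's standard library. The function name extract_thread_id is made up for illustration, and the assumption that t is always numeric comes only from the example link above.

```python
from urllib.parse import urlsplit, parse_qs


def extract_thread_id(link: str) -> int:
    """Return the numeric value of the t query parameter of a forum link."""
    # Links copied straight out of HTML may still contain "&amp;" instead of "&".
    cleaned = link.replace("&amp;", "&")
    values = parse_qs(urlsplit(cleaned).query).get("t")
    if not values or not values[0].isdigit():
        raise ValueError(f"no numeric t parameter in {link!r}")
    return int(values[0])


link = "showthread.php?s=4ad705f976ce73fb739b76820a3a573f&t=1485914"
print(extract_thread_id(link))
```

The s= part looks like a session hash that changes between visits, so storing only t avoids treating the same ad as new on every run.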
- And run two queries, select + insert, instead of a single insert? Of course, some optimization is needed here anyway, if only to avoid downloading more than necessary, but that belongs elsewhere :) - user6550
- Not really. If the field is indexed, a select on a numeric id like this runs so fast that its cost can be ignored. - FlashXXX
- It depends. We don't know the full scale of the tragedy :) A simple example: you can first get the list of all t values, exclude those already in the database, and download and add only the rest. Naturally, the exclusion is done once for the whole list rather than once per t. - user6550
- I didn't quite follow. "Get a list of all t": from where? If you need to check whether a set of values is present in the database, SQL has the IN operator for that, for example: select * from notes where id IN (23,43,54,55,123,344). In general, I'd advise studying SQL syntax and the features of the DBMS you work with more closely; a lot will then become clearer and easier. - FlashXXX
- I have no idea where from. But I assume the forum pages with links to ads are parsed first, and that's where strings like "showthread.php?s=4ad705f976ce73fb739b76820a3a573f&amp;t=148591" come from. As for the advice, I'm not the one who needs it :) - user6550
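The batch check discussed in these comments could be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server, and the table ads, the column thread_id, and the function find_new_ids are assumed names, not the asker's schema. The IN query itself is standard SQL.

```python
import sqlite3


def find_new_ids(conn, candidate_ids):
    """Return the candidate ids that are not stored yet, using one IN query."""
    candidates = list(dict.fromkeys(candidate_ids))  # drop repeats, keep order
    if not candidates:
        return []
    # One "?" placeholder per id, so the values are never pasted into the SQL text.
    placeholders = ",".join("?" * len(candidates))
    rows = conn.execute(
        f"SELECT thread_id FROM ads WHERE thread_id IN ({placeholders})",
        candidates,
    )
    known = {row[0] for row in rows}
    return [t for t in candidates if t not in known]


# In-memory stand-in database with two ads already saved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (thread_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO ads VALUES (?, ?)",
    [(1485914, "old ad"), (1485920, "old ad")],
)

# Ids collected from the forum's listing pages; only the unknown ones remain.
print(find_new_ids(conn, [1485914, 1485921, 1485920, 1485930]))
```

Only the ids that come back need their ad pages downloaded, which is the saving user6550 describes.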
It is best to let the DBMS perform this check. Declare a primary key (a field or set of fields that fully identifies a record and cannot be NULL) or a uniqueness constraint (non-NULL values cannot repeat, but NULLs can :)). But first you need to remove the duplicates.
You can add a constraint like this: ALTER TABLE myTable ADD CONSTRAINT constraintName UNIQUE (mycolumn);
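The whole sequence (remove duplicates, add the constraint, then insert without checking) might look like the sketch below. To stay runnable without a server it uses Python's built-in sqlite3 rather than MySQL, so two spellings differ: SQLite has no ALTER TABLE ... ADD CONSTRAINT and uses CREATE UNIQUE INDEX instead, and MySQL's INSERT IGNORE is written INSERT OR IGNORE. The table ads, its columns, and save_ad are assumed names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, thread_id INTEGER, body TEXT)"
)
# Simulate the current state: the same ad was saved twice.
conn.executemany(
    "INSERT INTO ads (thread_id, body) VALUES (?, ?)",
    [(1485914, "first copy"), (1485914, "second copy"), (1485920, "other ad")],
)

# Step 1: remove existing duplicates, keeping the oldest row per thread_id.
# Note: MySQL rejects a subquery on the table being deleted from in this
# form, so there the delete is usually rewritten as a self-join.
conn.execute(
    "DELETE FROM ads WHERE id NOT IN "
    "(SELECT MIN(id) FROM ads GROUP BY thread_id)"
)

# Step 2: from now on the database itself enforces uniqueness.
conn.execute("CREATE UNIQUE INDEX ads_thread_id_uniq ON ads (thread_id)")


def save_ad(thread_id, body):
    """Insert an ad; return True if it was new, False if it was skipped."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO ads (thread_id, body) VALUES (?, ?)",
        (thread_id, body),
    )
    return cur.rowcount == 1


print(save_ad(1485914, "seen again"))  # thread_id already stored
print(save_ad(1485999, "brand new"))   # thread_id not stored yet
```

With the constraint in place, the parser no longer needs a select before every insert; a repeated ad is simply skipped by the database.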