It is given: a table with the fields id and txt , where txt is a small article, post or comment containing HTML, the approximate record size is 2-10 kb, total records are 100k-1 million.

It is necessary to extract groups of records in which the text is approximately similar, i.e. match percent to 80-90, because identically equal rows in the database there.

Do existing database mechanisms allow such a sampling, and how?

UPD: (very close) is there an analogue of php-shny similar_text() in mysql? The speed of the query does not matter.

2 answers 2

As mentioned above, you can try the Levenshtein distance. Another implementation example for mysql is here: http://www.artfulsoftware.com/infotree/qrytip.php?id=552

You can make an adapted version of this example:

 CREATE FUNCTION levenshtein( s1 text, s2 text ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; DECLARE s1_char CHAR; -- max strlen=255 DECLARE cv0, cv1 VARBINARY(10240); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len DO SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = c_temp; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; IF c > c_temp THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; END IF; RETURN c; END; CREATE FUNCTION levenshtein_ratio( s1 text, s2 text ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, max_len INT; SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF; RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); END; 

Select pairwise similar lines like this:

 select t1.id, t1.txt, t2.id, t2.txt from table1 t1 join table2 t2 on t2.id <> t1.id and levenshtein_ratio(t1.txt, t2.txt) > 80 

How to break the result of the sample into groups will have to think.

According to the work, if briefly, the algorithm counts the number of substitutions / additions of characters in the text to get completely similar entries. But the main problem is that the algorithm performs a symbolic search, which is rather expensive. And now if you consider that the text length is 2-10 kb and up to 1 million records, the database will simply die from such an “analytics” (and do not forget that the main task of the database is to store and access the data, but not to implement complex computational algorithms).

Therefore, I would recommend, before it is too late, to think about taking the implementation out of the base. You can find a lot of "fast" implementation of the calculation of Levenshtein distance in almost all languages. When implementing, I would suggest doing some kind of preliminary filtering, for example, along the text, i.e. if the length does not coincide by 80%, then the comparison is not even carried out, because even on productive machines, the comparison of such volumes of text will already be engaged in significant time.

    You can solve this problem by using stored procedures.

    For example:

     CREATE PROCEDURE dbo.spSimil_FirstNameLastName @str1 nvarchar(max), @threshold float AS SET NOCOUNT ON SELECT * FROM (SELECT dbo.fnSimil(@str1, Person.Person.FirstName + N' ' + Person.Person.LastName) AS Simil, * FROM Person.Person) AS T WHERE T.Simil >= @threshold ORDER BY T.Simil DESC; 

    The procedure is invoked as follows:

     EXEC dbo.spSimil_FirstNameLastName N'John Adams', 0.75 

    More details can be found here: http://www.accessmvp.com/tomvanstiphout/simil.htm

    To solve your problem, you can try to resort to the Levenshtein distance: link to the implementation for mysql: https://github.com/ifsnop/damlev

    • Thank. I will read - kanaris
    • one
      Bezarius, in question we are talking about mysql , and not about ms / sql . - aleksandr barakin
    • When the question was asked, the specific bdshka was not specified, and we are talking about the mechanism. - Mstislav Pavlov