String hashes comparison

Question

Is it permissible to draw conclusions about the equality of the contents of strings based on the equality of their hashes?

:) If the hashes match, it does not mean that the strings are equal.
Well, look, Angelina, how can you compare strings by calling the strcmp function for example, right?
strcmp compares both strings byte-byte and gives us a certain result (equal, not equal, the first is less than the second, etc.).
And we can hash these lines, get a short hash, and since if the used hash function generates quite unique hashes (low collision percentage), the hash will be shorter than the line (which can be very long), it is faster to draw conclusions about the equality of the hashes based on!
And everything would be fine, except that the program written using this algorithm shamelessly buggy and gave out words that are completely absent.

stanislav stanislav 29.5k 20 74 191 · Accepted Answer · 2012-05-03T18:00:07

In general, hashing is not a one-to-one mapping, that is, one cannot argue that two different strings will give two different hashes. Take for example MD5 hash. The input is a string of arbitrary length. The output is a 128-bit hash. Thus, the input is infinite, and the output is finite. Obviously, in an infinite set there are an infinite number of lines that will give the same hash.

stanislav

29.5k 20 74 191

That is, the use of this method is strictly speaking unreliable - PaulD
one
Strictly speaking, no, it is not reliable. But in practice, the probability of a collision is extremely small. That is, you can assume that with equal hashes, the source strings with a high degree of probability are the same. - stanislav
Strictly speaking, yes, in the general case of infinite sets a reliable result will give only a complete search. - karmadro4
If we are not talking about cryptographic hashes, but about hashes like hashcode() , then the statement about the exceptional smallness of the probability of a collision is, of course, incorrect. Take, for example, in Java lines that have 256 identical first characters and N different characters later - they will all have the same default hash. (lied :) - Costantino Rupert
four
@ Kotik_hokhet_kusat, strange, the documentation does not confirm this . The hash of a string is calculated as a polynomial of the entire string, and not just the first 256 characters. - northerner

|

Answer 2 · 2012-05-03T17:56:19

No, unequivocally on the basis of comparing caches, one can speak of the inequality of objects with the inequality of caches. The coincidence of caches speaks about the likely equality of cached objects, so you need to check their equality directly.

In other words, the hash equality of strings is a necessary (but not sufficient) condition of equality of the strings themselves, and a direct comparison is necessary and sufficient.

Answer 3 · 2012-05-03T17:57:27

Of course. Equality of hashes means binary equivalence of the contents of strings. Naturally, for lexicographic comparison hashes will not work.

Considering comments on collisions, I note that the hash function must be carefully selected.

Mm, well, if binary compatible, then surely and lexicographically, too, right?

String hashes comparison

3 answers 3

More articles: