Is it permissible to draw conclusions about the equality of the contents of strings based on the equality of their hashes?

  • one
    I do not understand the question. Probably because of the words "equality" and "content." :) If the hashes match, it does not mean that the strings are equal. - Angelina_Jo
  • Well, it happens: D you read again, think about it ..) - PaulD
  • So I try, fail. :))) - Angelina_Jo
  • Well, look, Angelina, how can you compare strings by calling the strcmp function for example, right? strcmp compares both strings byte-byte and gives us a certain result (equal, not equal, the first is less than the second, etc.). And we can hash these lines, get a short hash, and since if the used hash function generates quite unique hashes (low collision percentage), the hash will be shorter than the line (which can be very long), it is faster to draw conclusions about the equality of the hashes based on! - PaulD
  • And everything would be fine, except that the program written using this algorithm shamelessly buggy and gave out words that are completely absent. - PaulD

3 answers 3

In general, hashing is not a one-to-one mapping, that is, one cannot argue that two different strings will give two different hashes. Take for example MD5 hash. The input is a string of arbitrary length. The output is a 128-bit hash. Thus, the input is infinite, and the output is finite. Obviously, in an infinite set there are an infinite number of lines that will give the same hash.

  • That is, the use of this method is strictly speaking unreliable - PaulD
  • one
    Strictly speaking, no, it is not reliable. But in practice, the probability of a collision is extremely small. That is, you can assume that with equal hashes, the source strings with a high degree of probability are the same. - stanislav
  • Strictly speaking, yes, in the general case of infinite sets a reliable result will give only a complete search. - karmadro4
  • If we are not talking about cryptographic hashes, but about hashes like hashcode() , then the statement about the exceptional smallness of the probability of a collision is, of course, incorrect. Take, for example, in Java lines that have 256 identical first characters and N different characters later - they will all have the same default hash. (lied :) - Costantino Rupert
  • four
    @ Kotik_hokhet_kusat, strange, the documentation does not confirm this . The hash of a string is calculated as a polynomial of the entire string, and not just the first 256 characters. - northerner

No, unequivocally on the basis of comparing caches, one can speak of the inequality of objects with the inequality of caches. The coincidence of caches speaks about the likely equality of cached objects, so you need to check their equality directly.

  • In other words, the hash equality of strings is a necessary (but not sufficient) condition of equality of the strings themselves, and a direct comparison is necessary and sufficient. - insolor

Of course. Equality of hashes means binary equivalence of the contents of strings. Naturally, for lexicographic comparison hashes will not work.

Considering comments on collisions, I note that the hash function must be carefully selected.

  • Mm, well, if binary compatible, then surely and lexicographically, too, right? - PaulD