There is a base with a huge number of lines. All strings are unique. They contain English, Russian letters and numbers, may contain special characters. In the line from 1 to N words.
Question: how can you make a unique hash for everyone?
Using long possible, but you need to fit into the int box.
The algorithm will be described on c #, in theory, you can use asm to speed it up, but we’re thinking about it.
2 answers
Suppose we have a random hash function, M its possible values and N elements. Then the probability that the hashes of the elements will be different is 1 × (1 - 1 / M) × (1 - 2 / M) × ... × (1 - (N-1) / M) = M! / ((MN)! × M N ) . When M = 2 32 and N = 8 000 000 it will be equal to 1.75 × 10 -3238 .
For comparison, the likelihood of a meteorite falling tomorrow is more than a probability of no collisions in such conditions.
This value is more than small enough to recognize brute force hash functions in search of a suitable impossible.
If the lines are unknown to you in advance - this can be done. Any function you choose will almost certainly collide on an unknown set of strings.
But if the lines are known in advance, then an ideal hashing algorithm can be constructed for them. This is done like this:
- Select the number of bits for the "older" part of the hash. The number of bits h will give H = 2 h different values.
- We take any hash function, and divide the source strings by H baskets according to it. On average, we get N1 = N / H rows in the basket.
- For each basket, we select (automatically, of course) a separate ideal hash function for the lower bits of the hash. This will be M1 = M / H of different values.
For step 3 to last for an adequate time, the value of M1 2 must be comparable with N1 . We get - M 2 / H 2 ∽ N / H - that is, we must take H of order M 2 / N.
We get H = 16,384 (h = 14), M1 = 288, N1 = 262,144.
In total, the algorithm for obtaining hash functions will be as follows:
- We consider the hash function 1 modulo 16 384, stored in the variable h1.
- Go to the table, calculated in advance, at the index h1. Get the parameters for the hash function 2.
- We consider the hash function 2 modulo 262 144, memorized in the variable h2.
- We consider
(h1 << 18) + h2- this is the unique hash.
As a hash function 2, you need to choose something parametric, so that there is where to choose random parameters. That is, a polynomial hash would do well here.
As a hash function 1, you can choose any good hash function.
The algorithm is quite fast (asm is not needed) - but it will require a constant kilobyte table of 32 or 64.
- Thank you so much, it remains to digest everything to think and do. Strings are known at the beginning. - ParanoidPanda
Pay attention to the comment @Harry !!!
- Guaranteed uniqueness can be provided only by a dictionary.
- Any hash functions - provide only a high probability of uniqueness.
Many experts involuntarily confuse the domain of using probabilistic values. If we work with a probabilistic value, then we must, despite the magnitude of the probability, handle the two outcomes "worked" vs "did not work." Otherwise, we must indicate that the subsystem being created does not give a 100% guarantee of failure. A simple example of the correct use of probabilistic quantities is prediction compression algorithms. Found often repeating sequences - well, not found - processed anyway. Therefore, it is not necessary to “buy” on the amount, even so impressive, of probability. Next comic example ...
"The brick fell on the head"
A little reflection on the probabilities of a simple, one might say ordinary event, which, alas, happens occasionally. And the story is. It was a man on business near the construction team. A brick fell on its head. There is no man. What was the chance to live a little man a little more?
Postulates:
- surface area of our planet is 510,072,000 km²
- 1 km² = 1000000m²
- half the land refers to about 3 to 7
- 1854288 bricks are required for the construction of a residential 12-storey brick house (considered relative to the calculated one-story building, took 6 apartments per floor and 12 floors, plus amended by 2/3, such as savings)
- in the year 365 * 24 * 60 * 60 = 31536000 seconds
- with a day of 24 * 60 * 60 = 86400 seconds
- an ordinary person will have no more than 500 life situations that will induce him to do this or that action
- a person will go on foot if his path is no more than 2 km, otherwise he will use transport
- a person older than 10 years will go on his own, let him live for 70 years, then 60 years = 1892160000 seconds
A man goes to the right place, at the right time, in the right case, the right brick breaks down, and at the right moment completely unnecessarily gets into his head. So, all events take place independently:
Probability multiplication theorem for independent events
P (AB) = P (A) * P (B) - the probability of the simultaneous occurrence of two independent events is equal to the product of the probabilities of these events.
Further calculation on wonderful Ruby:
v1 = 510072000000000*3/7 # площадь суши в кв.метрах v2 = 1854288 # количество кирпичей для постройки 12-этажного дома v3 = 31536000 # количество секунд в году v4 = 86400 # количество секунд в сутках v5 = 500 # жизненных ситуаций v6 = 2000 # путь в метрах v7 = 1892160000 # "самостоятельная жизнь в секундах" puts "Сколько было вариантов у мужика выжить: #{v1*v2*v3*v4*v5*v6*v7-1}" Answer:
Сколько вариантов было у мужика выжить: 2089825832201191353090774690693119999999999999999 Well, or the approximate probability of such a hit was: 10⁻⁴⁹
The truth is almost unreal?
Ps. Of course, all the calculations are extremely extremely approximate, the goal was to realize the order of smallness of the probability value.
- This is not the answer to the question. And Harry did not give an answer. - Pavel Mayorov
- Every second a brick the size of a meter per meter jumps out of an arbitrary place of the only house in the world and falls into the same place. - Qwertiy ♦
- He advised to build on a dictionary. I quote: "Well, if you so need guaranteed uniqueness - then make a dictionary, renumbering all the lines, and then the hash will be just the line number. Uniqueness is guaranteed :)" - Majestio
- Qwertiy, not "every" but "into one of" - Majestio
int- well, I would think about calculating CRC32 ... - Harrygperf(a generator of ideal hash functions), but this is possible only if you know all the expected lines in advance. - D-side