The task is to compare two sets of unique positive integers, and find those present in both at once. Everything is exactly in the range from 1 to 200 million. Usually in each of the two sets from 0 to 5 million numbers.
I still do head-on: I bring both sets into MySQL temporary tables. Two single-column tables, where the numbers are primary keys. The comparison is quick if the sets are small and fit in the engine=MEMORY
. Slowly when the tables are large and you have to create them on disk. When it is necessary to perform such comparisons in large quantities and often - brakes.
What if to reproduce the indexed MySQL columns in own code? One of the sets to keep in memory, and each element of the second to check for the presence in the first.
Do not store each of the dialing numbers (32bit, 2.5 million on average = 80Mb), but work with the bit mask of all possible values. 200 million is, with a margin of 2 ^ 28 = 268,435,456 bits = 32Mb. Installed - the number is in the set, 0 - no. Compare the set bits.
For each set, storing the whole set of bits for each set is inefficient. Surely, such data can be great to compress. Most of the bits will be 0, which means that their sequences can be encoded by their number in a row, for example.
There will be one question for such a compressed array: is there another required number in the set, or not?
Simplify for example. Let there be a total of 32 values: 0..31. Our array will consist of 32 zeros / units. There are only two values in the set: 17 and 22. 16 zeros, one, 4 zeros, 1. And we write them as 16.4: 10000100
. Only 8 bits instead of 2 * 6. compression saved 25%. But this is my very clumsy idea of a possible method of compression, without separators, units in a row, etc.
Need to know about the number 19, is there a set? We pass on our 8 bits: 16 is still less than 19, another 4 is already overkill, the answer is “not in the set”.
Do you think there is any sense in such a bike at all, can it speed up the comparison of two sets, compared to MySQL?
Upd. Simply formulate the question. Compression is sought for data when its parameters and constraints are known: only natural numbers from .. to .., not consecutive, not sorted, without repetitions, order is not important. And even without the need for unpacking: you just need to be able to answer the question “is there such a number in the set, or not?”.