The task is to compare two sets of unique positive integers and find the elements present in both. All values lie in the range from 1 to 200 million, and each of the two sets typically holds from 0 to 5 million numbers.

So far I do it head-on: I load both sets into MySQL temporary tables, two single-column tables with the numbers as primary keys. The comparison is fast when the sets are small and fit in ENGINE=MEMORY, and slow when the tables are large and have to be created on disk. When such comparisons need to be run in bulk and often, everything bogs down.

What if I reproduce MySQL's indexed columns in my own code: keep one of the sets in memory and check each element of the second for membership in the first?

Instead of storing each number of a set individually (32 bits each, 2.5 million numbers on average = 80 Mbit = 10 MB), work with a bit mask over all possible values. 200 million fits, with a margin, into 2^28 = 268,435,456 bits = 32 MB. A set bit means the number is in the set, 0 means it is not. To intersect, compare the set bits.
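
A C# sketch of this bitmask idea: the 2^28-bit vector and the word-by-word AND follow the description above, while the class and method names are my own illustration.

    using System.Collections.Generic;

    // One bit per possible value; two such masks are intersected by ANDing
    // their 64-bit words.
    class BitMaskSet
    {
        const int Range = 1 << 28;                        // 268,435,456 values
        readonly ulong[] words = new ulong[Range / 64];   // 32 MB

        public void Add(int n) => words[n >> 6] |= 1UL << (n & 63);

        public bool Contains(int n) => (words[n >> 6] & (1UL << (n & 63))) != 0;

        // Numbers present in both sets.
        public List<int> IntersectWith(BitMaskSet other)
        {
            var result = new List<int>();
            for (int w = 0; w < words.Length; w++)
            {
                ulong common = words[w] & other.words[w];  // the word-wise AND
                for (int b = 0; common != 0; b++, common >>= 1)
                    if ((common & 1UL) != 0)
                        result.Add(w * 64 + b);
            }
            return result;
        }
    }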

Storing the full bit vector for every set is inefficient, though. Surely such data can compress well: most of the bits will be 0, so runs of zeros can be encoded by their length, for example.

Only one question will ever be asked of such a compressed array: is a given number in the set or not?

A simplified example. Let the full range be 32 values, 1..32, so the mask is 32 bits of zeros/ones. The set contains just two values, 17 and 22: that is 16 zeros, a one, 4 zeros, a one. Write the zero-run lengths 16 and 4 in binary: 10000100. Only 8 bits instead of 2 * 6 = 12 for storing the values themselves; compression saved a third. But this is my very clumsy idea of a possible encoding, without separators, runs of ones, etc.

Need to know whether 19 is in the set? Walk our 8 bits: after 16 zeros the first set bit lands on 17, still less than 19; the next run of 4 zeros carries us to 22, already past 19, so the answer is "not in the set".
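
In code, that lookup over the zero-run lengths could look like the following C# sketch (assuming, as in the toy example, that the compressed form is just a list of zero-run lengths, each followed by an implicit set bit; the function name is illustrative):

    // runs[k] = number of zero bits before the (k+1)-th set bit.
    // For the set {17, 22} over 1..32 the runs are {16, 4}.
    static bool ContainsNumber(int[] runs, int target)
    {
        int position = 0;                 // bit position reached so far
        foreach (var zeros in runs)
        {
            position += zeros + 1;        // skip the zero run, land on a set bit
            if (position == target) return true;   // hit the set bit exactly
            if (position > target) return false;   // overshot: target is a zero
        }
        return false;                     // past the last set bit
    }

For 19 this reproduces the walk-through above: the first run lands on 17 (less than 19), the second lands on 22 (past 19), so 19 is reported absent.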

Do you think there is any point in reinventing this wheel at all? Can it beat MySQL at comparing two sets?

Upd. To put the question simply: I am looking for a compression scheme for data whose parameters and constraints are known in advance: only natural numbers from .. to .., not consecutive, unsorted, without repetitions, order unimportant. Decompression is not even needed: the only required operation is answering the question "is this number in the set or not?".

  • How do you connect bit fields and binary trees here? - IronVbif
  • That is the question. As I vaguely imagine it, the tree would consist only of 0s and 1s, with only the branches leading to set bits "drawn out" in full and the zero branches discarded. Perhaps something close to Huffman, but without statistical analysis of the resulting set. - Sergiks
  • Read up on how DBMS indexes work for numeric data types. Most likely you need an analogue of a binary-tree-based index for all of this, but you can also combine several different ones. - jmu
  • @sergiks, where did you get 8 GB? The bit string for all 32-bit integers is 512 MB. Still a lot. You write that the maximum set size is 5 million; then a hash table at 50% fill takes 10 million cells, i.e. 40 MB (without any compression). Think about it. - avp
  • @sergiks, in fact, if a bit vector of 2^28 bits (you write 268,435,456 bits) contains only 5,000,000 ones, it should compress well: on average one set bit per 53 zero bits. You can compress in the most trivial way, with a repetition counter (counting zero bytes rather than zero bits, probably). I.e. each set is stored as a compressed string of 2^28 bits, and while reading it you expand it in memory and at the same time (!!!) perform the AND. Moreover, IMHO it can be written so that this unpacking plus AND needs a minimum of memory, literally a few words (apart from the I/O buffer, of course). - avp
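
For a sense of what @avp's hash-table estimate means in code, here is a minimal open-addressing hash set: 10 million int cells = 40 MB at 50% fill for up to 5 million numbers. The class is my own illustration; values are assumed positive so that 0 can mark an empty cell.

    // Open-addressing hash set with linear probing, sized at ~50% fill.
    class IntHashSet
    {
        readonly int[] cells;   // 0 marks an empty cell (values are positive)

        public IntHashSet(int capacity) { cells = new int[capacity]; }

        int Slot(int n) => (int)((uint)n % (uint)cells.Length);

        public void Add(int n)
        {
            int i = Slot(n);
            while (cells[i] != 0 && cells[i] != n)
                i = (i + 1) % cells.Length;   // probe the next cell
            cells[i] = n;
        }

        public bool Contains(int n)
        {
            for (int i = Slot(n); cells[i] != 0; i = (i + 1) % cells.Length)
                if (cells[i] == n) return true;
            return false;
        }
    }

Load the smaller set into such a table (e.g. new IntHashSet(10_000_000)), then probe it with every element of the other set.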
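
And one possible reading of @avp's unpack-and-AND idea, sketched under an assumed toy format: a zero byte followed by a count byte encodes a run of zero bytes (longer runs repeat the pair), any other byte is literal. The format and all names here are my own illustration, not something @avp specified.

    using System.Collections.Generic;
    using System.IO;

    static class CompressedIntersect
    {
        // Lazily decode one compressed bit vector, byte by byte.
        static IEnumerable<byte> Decode(Stream s)
        {
            int b;
            while ((b = s.ReadByte()) != -1)
            {
                if (b != 0) { yield return (byte)b; continue; }
                int count = s.ReadByte();            // zero-run length, 1..255
                for (int i = 0; i < count; i++)
                    yield return 0;
            }
        }

        // Yields bit positions present in both sets; the full 32 MB vectors
        // are never materialized, only a few words of state are kept.
        public static IEnumerable<int> Intersect(Stream s1, Stream s2)
        {
            int byteIndex = 0;
            using (var e2 = Decode(s2).GetEnumerator())
            {
                foreach (var b1 in Decode(s1))
                {
                    if (!e2.MoveNext()) yield break;
                    int common = b1 & e2.Current;    // the AND, on the fly
                    for (int bit = 0; bit < 8; bit++)
                        if ((common & (1 << bit)) != 0)
                            yield return byteIndex * 8 + bit;
                    byteIndex++;
                }
            }
        }
    }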

2 answers

1) A database is not the tool for this kind of problem.

2) Use either sorted lists or hash tables.

3) I would use a sorted list or array and compare them with a pairwise merge: start with both pointers at the first elements, then

  • take the current element of the first list,
  • compare it with the current element of the second list,
  • if the values are equal, put the element into the result list, advance both pointers, and go back to step 1,
  • if the first value is less than the second, advance the first list's pointer,
  • otherwise advance the second.

All in all, the algorithms used are two qsorts and a pairwise merge.

If the lists are very large ... (20 million fits in memory comfortably, but 200 million may already be too much), they can be split into parts and compared part by part: first part 1, then part 2.

The algorithm is roughly this: there are lists 1: A+B+C+D... and 2: A+B+C+D..., and we start by comparing part 1A with part 2A:

  • if part 1A runs out first, compare the remainder of part 2A with part 1B,
  • otherwise compare the remainder of part 1A with part 2B,
  • whenever the current part of either list (1 or 2) ends, the next part of that list takes its place,
  • and so on; see the sketch after this list.
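
A C# sketch of this part-by-part idea: if each list is exposed as a lazy stream of sorted values (the enumerators can read their parts from disk), the same pairwise merge runs unchanged over lists of any size. The names here are illustrative.

    using System.Collections.Generic;

    // Merge-intersection of two sorted streams; neither list needs to be
    // fully in memory if the enumerables read their parts from disk.
    static IEnumerable<int> IntersectSorted(IEnumerable<int> list1,
                                            IEnumerable<int> list2)
    {
        using (var e1 = list1.GetEnumerator())
        using (var e2 = list2.GetEnumerator())
        {
            bool has1 = e1.MoveNext(), has2 = e2.MoveNext();
            while (has1 && has2)
            {
                if (e1.Current == e2.Current)
                {
                    yield return e1.Current;     // present in both lists
                    has1 = e1.MoveNext();
                    has2 = e2.MoveNext();
                }
                else if (e1.Current < e2.Current)
                    has1 = e1.MoveNext();        // advance the smaller side
                else
                    has2 = e2.MoveNext();
            }
        }
    }
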
  • Thanks for the pairwise merge algorithm. I think what's needed is exactly this, paired with the compression from @avp's comment. Memory is very tight: the project still lives on shared hosting, hence the forced offloading to the database server, which gets more memory than the scripts do. - Sergiks
  • Write it in C++ or Java; maybe Python would do as well, but I don't know exactly how arrays are organized there. You cannot write such things in PHP, it will not fit in memory. - akalend
  • @akalend I will try to write a module for nginx, which is already there.. :) - Sergiks
  • Why take on the extra pain? Write an SCGI service with libscgi and wire it up in the nginx config: location /numpers { scgi_pass localhost:8080; } - akalend
  • The database cannot be dropped entirely, because all these sets have to be stored long-term; in compressed binary form from now on, probably. I will still try to implement it in PHP or Perl, since I don't know C, and so far I have used Java only in the context of Processing.org. - Sergiks

I would start with simple solutions and run-time measurements on different data sets. Can you always store the data in a pre-prepared form (for example, already in a binary tree or a sorted array)? Then you can find the common elements like this:

    // For sorted arrays. Complexity: O(a.Length + b.Length).
    public static List<Int32> GetSameNumber(Int32[] a, Int32[] b)
    {
        var i = 0;
        var j = 0;
        var ans = new List<Int32>();
        while (i < a.Length && j < b.Length)
        {
            if (a[i] == b[j])
            {
                ans.Add(a[i]);
                i++;
                j++;
            }
            else if (a[i] > b[j])
                j++;
            else
                i++;
        }
        return ans;
    }

    // For binary trees. Complexity: O(a.Count * log(b.Count)), since
    // SortedSet<Int32>.Contains is O(log n).
    public static List<Int32> GetSameNumber(SortedSet<Int32> a, SortedSet<Int32> b)
    {
        var ans = new List<Int32>();
        foreach (var i in a)
        {
            if (b.Contains(i))
                ans.Add(i);
        }
        return ans;
        //var ansset = new SortedSet<Int32>(a);
        //ansset.IntersectWith(b);
        //return ansset.ToList(); // or simply return ansset
    }

If the data cannot be stored in the required form, convert it on every call; building the binary tree or sorting the array adds O(n log n).

A big tree over 8 GB of bits will probably hit you with cache-miss and paging penalties.

  • Just wanted to post the same code. @sergiks: binary trees are unnecessary, and so are the tables; your data fits in memory just fine. Sort it and run through both arrays once, as in the first example in the answer. - VladD
  • @IronVbif: still, the complexity of the first variant is O(n log n), where n = max(len(a), len(b)), because the sorting is still needed. - VladD
  • @VladD: Well, above I suggest always storing the items in sorted form when possible, and below I write about the n log n for the case where sorting is needed. - IronVbif
  • @VladD: data will be sorted in advance. - Sergiks