The problem is as follows. There is a database with approximately 50 billion unsigned bigint rows. For further processing they need to be clustered by a given Hamming distance threshold (i.e. the number of clusters is not known in advance). And, of course, there is nowhere near enough RAM to process this volume in one go, so the algorithm has to be incremental, adding new data progressively. The question, then: does such an algorithm exist at all? Preferably implemented in some common language (because I don't understand pure math). Offhand, googling got me nowhere :/
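For clarity, the Hamming distance between two unsigned 64-bit values is just the popcount of their XOR. A minimal Python illustration; the threshold value `max_dist` is made up for the example and not part of the original question:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two 64-bit unsigned integers:
    the number of differing bits, i.e. the popcount of a XOR b."""
    return bin(a ^ b).count("1")

# Two values belong to the same cluster when their distance is within
# the given threshold; the threshold here is purely illustrative.
max_dist = 3
print(hamming(0b1011, 0b1000))               # -> 2
print(hamming(0b1011, 0b1000) <= max_dist)   # -> True
```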

The task is not hypothetical, the project is commercial :) Thank you in advance.

  • The second result on Google: Counting Hamming distance on a large data set - Yura Ivanov
  • Of course, I've seen that post :) HEngine doesn't fit: it keeps the whole data set in RAM, and the server already falls over at 50 million records. If you rewrite it on top of SQL, then firstly all the speed is lost, and secondly sorting and binary search are no longer possible. So that option is out :/ - anagamin

1 answer

So if the distance between the numbers is 1, should the whole data set be pushed into one cluster?

Or another situation: right now we have two records with a distance of 2 between them, so we put them into two clusters. Tomorrow a record arrives whose distance to both the first and the second record is 1. Where does it go? Or should the clusters be merged?

The first case is easy to produce if the numbers simply go in order, and the second case follows from it.
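If clusters are indeed supposed to be merged when a new record links two of them, that behaviour is exactly what a disjoint-set (union-find) structure gives. A minimal sketch, purely to illustrate the merging question above; the records and the threshold are made-up examples, not data from the question:

```python
class DSU:
    """Disjoint-set union: clusters merge whenever a new record
    falls within the threshold of members of several clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def hamming(a, b):
    return bin(a ^ b).count("1")

# The scenario from the answer: two records at distance 2 sit in two
# clusters; a third record at distance 1 from both links those clusters.
records = [0b0011, 0b0000]           # distance 2 -> two clusters
dsu = DSU()
for r in records:
    dsu.find(r)

new = 0b0001                         # distance 1 to both existing records
threshold = 1                        # illustrative threshold
dsu.find(new)
for r in records:
    if hamming(new, r) <= threshold:
        dsu.union(new, r)            # the two old clusters end up merged

print(dsu.find(records[0]) == dsu.find(records[1]))  # -> True
```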

  • That wasn't me) Dear well-wishers: when I want to post an answer, I can post one myself) For now I just wanted to clarify the problem, which is why I wrote a comment. Besides, @Yura Ivanov's post looks much more like an actual answer. - BOPOH
  • This, I believe, can be handled by a partially random placement of centroids and keeping the distribution across clusters more or less uniform (i.e. checking their sizes regularly). Besides, one cluster may contain several hashes, and at this volume a certain percentage of false positives is not a problem. - anagamin
  • @anagamin, well, if an approximate algorithm is good enough for you, then why not invent one yourself? Besides, I doubt that approximate algorithms for this purpose are common on the Internet; most likely exact ones are used, and exact ones, as we found out, don't suit you. One "approximate" option is to go from the other end: fix the number of clusters and pick "reference" values for each of them, then for every record determine which cluster it belongs to (or assign it to the one with the minimal distance). Since the reference values are fixed, the running time is linear (see the first sketch after these comments). - BOPOH
  • @anagamin, or another option: take the first record and generate all its possible neighbors (i.e. values whose distance from it does not exceed the given threshold). For each such neighbor that is also present in the original set, find its neighbors in turn, and so on up to some nesting depth, and place all the records found into one cluster. Then take the next unprocessed record, carry out the same steps and put it into the next cluster, and so on. Since the computation is limited to a fixed number of expansion levels, not all the "reachable" records will end up in one cluster, but you don't need that anyway. With a little optimization there are no memory problems (see the second sketch after these comments). - BOPOH
  • Yes, it looks like that's the only solution. I had just hoped that some genius of mathematics could get around the linear complexity and speed up the whole process by orders of magnitude. Alas, it seems not :) - anagamin
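A sketch of the first approach BOPOH describes above: fix the number of clusters, pick a "reference" value for each, and assign every record to the reference with the minimal Hamming distance in a single linear pass. The reference values, the sample data and the streaming style are assumptions for illustration, not a known library or the author's actual code:

```python
from typing import Iterable, List, Tuple

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def assign_to_references(records: Iterable[int],
                         references: List[int]) -> Iterable[Tuple[int, int]]:
    """Yield (record, cluster_index) pairs, assigning each record to the
    reference value with the minimal Hamming distance.  One pass over the
    data, O(len(references)) work per record, nothing accumulated in RAM."""
    for rec in records:
        best = min(range(len(references)),
                   key=lambda i: hamming(rec, references[i]))
        yield rec, best

# Usage sketch: references would be chosen up front (randomly, or from a
# sample of the data); records can be streamed from disk chunk by chunk.
references = [0x0000_0000_0000_0000, 0xFFFF_FFFF_FFFF_FFFF]  # made up
sample = [3, 7, 2**64 - 1, 2**63]
for rec, cluster in assign_to_references(sample, references):
    print(hex(rec), "->", cluster)
```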
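And a sketch of the second approach: expand each record's Hamming neighborhood to a limited depth and pull everything found into one cluster. The bit width, the expansion depth and the in-memory `present` set are stand-ins for whatever on-disk index would really be used at this scale; all of them are assumptions for the example:

```python
from collections import deque

BITS = 64  # assumed word width

def neighbors(value: int):
    """All values at Hamming distance exactly 1 (flip each bit in turn)."""
    for i in range(BITS):
        yield value ^ (1 << i)

def cluster_from(seed: int, present: set, unassigned: set,
                 max_depth: int) -> set:
    """BFS over distance-1 neighbors, at most max_depth levels deep,
    collecting only values that actually occur in the data set."""
    cluster = {seed}
    unassigned.discard(seed)
    frontier = deque([(seed, 0)])
    while frontier:
        value, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for n in neighbors(value):
            if n in present and n in unassigned:
                unassigned.discard(n)
                cluster.add(n)
                frontier.append((n, depth + 1))
    return cluster

# Usage sketch: in reality `present` would be a disk-backed index
# (a sorted file, a key-value store, ...), not an in-memory set.
data = {0b0000, 0b0001, 0b0011, 0b1111, 0b1000_0000}
present, unassigned = set(data), set(data)
clusters = []
while unassigned:
    seed = next(iter(unassigned))
    clusters.append(cluster_from(seed, present, unassigned, max_depth=2))
print(clusters)
```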