Binary string clustering

Question

We will consider the distance between two bit lines equal to the number of positions in which the bit values of these lines are different. For example, the distance between lines 10001 and 11011 is two, and between lines 10001 and 01010 - four. It is required to split a given array of strings into the maximum possible number of clusters so that the distance between rows from different clusters is at least three. Input file format: on the first line - the number of bit lines n and the number of bits in line m separated by a space; on each of the next n lines there is a bit string specified by a sequence of m zeros and ones (see the examples on the links above). The algorithm should calculate the number of clusters, the clusters themselves should not be returned.

Tell me the approach to solving the problem? How to effectively solve?

Answer 1 · 2016-10-17T20:11:40

Since you have set intercluster distance, and the specific number of clusters is not important here, you can use the algorithm for selecting connected components.

Its essence is as follows:

Baseline data are presented as a graph. Vertices are original objects, edge weights are distances between objects. Naturally, each vertex is connected to each.
All edges whose weights are greater than a given value (in your case more than 2) are removed from the graph.
The graph is falling apart into several connected components , each of which represents a separate cluster.

Such an algorithm will work well if your source data is clearly beating into clusters.

andreycha

23.7k 3 gold marks 39 silver marks 77 bronze marks

In general, it is not clear if a problem has a solution in the general case ... For example, for simplicity, let's take, say, all 16 numbers from 0000 to 1111. Remove all edges> = 3. All the same, all the numbers will be one connected component (see Gray code). - Harry
one
@Harry This method well highlights clusters such as thickenings and ribbons. The sequence is just a tape, so you get one cluster. In this case, this is a disadvantage, of course. The data are usually not random and have some features. Based on these features, the necessary algorithm and its parameters are selected. One of the sources for making decisions about the algorithm is a histogram of pairwise distances. In the case of tape, it will be flat. And for the algorithm for extracting connected components, data with two peaks are best suited (one is intracluster distances, the second is intercluster). - andreycha
@Harry solve the same problem in the general case for random data is quite difficult: you have to dynamically select algorithms and not the fact that you get a good result. What is your task? Is there a feature in the data? - andreycha
I have? nothing, this is not my task. I was just interested in your method, nothing more. I just wanted to push a small set of data manually - and then I did not succeed. So I thought, and what will happen in the general case. It turns out that if there is a chain, the solution does not exist in principle ... - Harry
@Harry oh, sorry, didn’t pay attention to the nickname, thought you were the author of the question :). If there is one chain in the data, then you need to either change the metric to get a new form of clusters, or use k-means algorithms, but be prepared for the fact that they just break the tape into N clusters, and you will always find a number of objects from different clusters, in which the distance is not (strongly) greater than intracluster. - andreycha

|

Binary string clustering

1 answer 1

More articles: