Find "complemented" bit vectors clusters

205 views Asked by At

I have a huge list of bit vectors (BV) that I want to group in clusters.

The idea behind this clusters is to be able to choose later BVs from each cluster and combine them for generate a BV with (almost) all-ones (which must be maximized).

For example, imagine the 1 means an app is Up and 0 is down in node X in a specific moment in time. We want to find the min list of nodes for having the app Up:

    App BV for node X in cluster 1:  1 0 0 1 0 0

    App BV for node Y in cluster 2:  0 1 1 0 1 0

    Combined BV for App (X+Y):       1 1 1 1 1 0 

I have been checking the different cluster algorithms but I did found one that takes into account this "complemental" behavior because in this case each column of the BV is not referred to a feature (only means up or down in an specific timeframe).

Regarding other algorithms like k-means or hierarchical clustering, I do not have clear if I can include in the clustering algorithm this consideration for the later grouping.

Finally, I am using the hamming distance to determine the intra-cluster and the inter-cluster distances given that it seems to be the most appropiated metric for binary data but results show me that clusters are not closely grouped and separated among them so I wonder if I am applying the most suitable group/approximation method or even if I should filter the input data previously grouping.

Any clue or idea regarding grouping/clustering method or filtering data is welcomed.

1

There are 1 answers

1
Has QUIT--Anony-Mousse On

This does not at all sound like a clustering problem.

None of these algorithms will help you.

Instead, I would rather call this a match making algorithm. But I'd assume it is at least NP-hard (it resembles set cover) to find the true optimum, so you'll need to come up with a fast approximation. Best something specific to your use case.

Also you haven't specified (you wrote + but that likely isn't what you want) how to combine two 1s. Is it xor or or? Nor if it is possible to combine more than two, and what the cost is when doing so. A strategy would be to find the nearest neighbor of the inverse bitvector for each and always combine the best pair.