How to cluster a set of strings?


My dataset looks something like this

['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '', '', '', 'FF', 'F', 'FF', 'F', 'F', 'FFFFFFIFF', '', 'FFFFFFF', 'F', '', '', 'F', '', '', '', '', '', '', '', 'F', '', '', 'ABB', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FFEIE', 'FF', 'ABABCDIIJCCFG', '', 'FABACFFF', 'FEGGIHJCABAGGFEFGGFEECA', '', 'FF', 'FFGEFGGFFG', 'F', 'FFF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'I', '', '', 'ABIIII', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'AAAAA', 'AFGFE', 'FGFEEFGFEFGFEFGJJGFEACHJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'JFEFFFFFFF', '', 'AAIIJFFGEFGCABAGG', '', '', '', '', '', '', '', '', '', 'F', 'JFJFJFJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FGFEFGFE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '']

This is just a sample; the real dataset will have many more strings. How do I cluster them so that each cluster shares some pattern?

1 Answer

Answer by Aditya (accepted):

The idea I'm suggesting originates in text processing, NLP, and information retrieval, and is widely used wherever you have sequences of characters, such as genetic data. Because the sequence must be preserved, we can use n-grams. I use bigrams in the following example, though you can generalize to higher-order grams; n-grams help preserve the sequential pattern in the data. Don't worry about borrowing ideas from other fields of science: even edit distance and dynamic programming were not originally computer-science concepts.

There are many possible approaches to this problem, each with its own trade-offs, and no single right one; at least, there's no research establishing which is best. Here's my take on it.

The goal is to build a Bag-of-Words-style vector from each of your strings; these vectors can then be fed to any machine-learning tool or library for clustering. A quick summary of the steps:

  1. Collect bigrams (and unigrams etc)
  2. Create a dictionary to get Bag Of Words (Code attached)
  3. Create functionality to get vector from a string

Let's get started

import numpy
from sklearn.cluster import KMeans

def getStringBigrams(string):
    if not string:
        return []
    if len(string) == 1:
        return [string]  # A single-character string yields its unigram
    return [string[i]+string[i+1] for i in range(len(string)-1)]

def getDataBigrams(strings):
    return [getStringBigrams(x) for x in strings]

These functions convert a given string into a list of two-character grams (or a single-character gram if the string has length one). You can modify them to capture 3-grams, or even the complete set of all possible unigrams, bigrams, and trigrams. Feel free to experiment.
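If you want to experiment with higher orders, a generalized extractor might look like the sketch below. `getStringNGrams` is my own name, not part of the answer's code; it falls back to the whole string when the string is shorter than `n`, mirroring the unigram handling above.

```python
def getStringNGrams(string, n=2):
    # Return all character n-grams of the string; fall back to the
    # whole string when it is shorter than n (so 'H' still yields ['H']).
    if not string:
        return []
    if len(string) < n:
        return [string]
    return [string[i:i+n] for i in range(len(string) - n + 1)]
```

With `n=2` this behaves like `getStringBigrams`; with `n=3` it produces trigrams such as `['ABC', 'BCD', 'CDH']` for `'ABCDH'`.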

Now, how do we convert strings to vectors? We define a function that maps a string to a vector counting how many times each n-gram appears. This is the Bag-of-Words idea, applied here to bags of n-grams. The following two functions do that:

def generateDictionary(data):
    '''
    Identify the unique n-grams in the data and map each one to an index.
    '''
    vocab = set()
    for line in data:
        for item in line:
            vocab.add(item)
    return {item: i for i, item in enumerate(vocab)}

def doc2Bow(bigramData, dictionary):
    '''
    Take a single document in bigram format and return a count vector.
    '''
    vect = [0]*len(dictionary)  # Initialize vector to zero
    for gram in bigramData:
        vect[dictionary[gram]] += 1
    return numpy.asarray(vect)  # Convert to a numpy vector
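To sanity-check the pipeline before running it on real data, here is a tiny worked example with made-up strings (the helper is repeated inline so the snippet runs on its own):

```python
def getStringBigrams(string):
    # Same bigram helper as above, repeated so this snippet is self-contained.
    if not string:
        return []
    if len(string) == 1:
        return [string]
    return [string[i]+string[i+1] for i in range(len(string)-1)]

toy = ['ABAB', 'AB', 'BA']                  # made-up strings for illustration
grams = [getStringBigrams(s) for s in toy]  # [['AB', 'BA', 'AB'], ['AB'], ['BA']]

# Sorted here only to make the index assignment deterministic.
dictionary = {g: i for i, g in enumerate(sorted({g for line in grams for g in line}))}
# dictionary == {'AB': 0, 'BA': 1}

vects = []
for line in grams:
    vect = [0] * len(dictionary)
    for g in line:
        vect[dictionary[g]] += 1
    vects.append(vect)
# 'ABAB' -> [2, 1], 'AB' -> [1, 0], 'BA' -> [0, 1]
```

Each string becomes a fixed-length count vector, which is exactly what a clustering algorithm expects.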

Voila! We're done. Now feed the vectors to any k-means implementation of your choice. I used scikit-learn.

strings = ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '', '', '', 'FF', 'F', 'FF', 'F', 'F', 'FFFFFFIFF', '', 'FFFFFFF', 'F', '', '', 'F', '', '', '', '', '', '', '', 'F', '', '', 'ABB', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FFEIE', 'FF', 'ABABCDIIJCCFG', '', 'FABACFFF', 'FEGGIHJCABAGGFEFGGFEECA', '', 'FF', 'FFGEFGGFFG', 'F', 'FFF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'I', '', '', 'ABIIII', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'AAAAA', 'AFGFE', 'FGFEEFGFEFGFEFGJJGFEACHJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'JFEFFFFFFF', '', 'AAIIJFFGEFGCABAGG', '', '', '', '', '', '', '', '', '', 'F', 'JFJFJFJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FGFEFGFE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '']

In practice, you would read the strings from a file rather than hard-coding them.
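A minimal sketch of that, assuming one string per line in a hypothetical file named `strings.txt` (the function name `readStrings` is mine, not from the answer):

```python
def readStrings(path):
    # Read one string per line, stripping only the trailing newline
    # so that empty lines come through as empty strings.
    with open(path) as f:
        return [line.rstrip('\n') for line in f]

# Hypothetical usage:
# strings = readStrings('strings.txt')
```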

strings = [x for x in strings if len(x) > 0] # Drop empty strings for this experiment
nGramData = getDataBigrams(strings)
dictionary = generateDictionary(nGramData)
data = [doc2Bow(nGramData[i], dictionary) for i in range(len(nGramData))]

K = 10
km = KMeans(init='k-means++', n_clusters=K, n_init=10)

km.fit(data)

Finally, I inspect the clusters using the km.labels_ attribute of the KMeans object.
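For a quick look at the result, you can group the strings by their assigned label. This is my own sketch; with the code above you would pass `km.labels_` as the second argument, but the toy labels below just illustrate the shape of the output:

```python
from collections import defaultdict

def groupByCluster(strings, labels):
    # Map each cluster label to the list of strings assigned to it.
    clusters = defaultdict(list)
    for s, label in zip(strings, labels):
        clusters[label].append(s)
    return dict(clusters)

clusters = groupByCluster(['FF', 'ABCDH', 'F'], [0, 1, 0])
# clusters == {0: ['FF', 'F'], 1: ['ABCDH']}
```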

There are ten clusters. (The original answer included a screenshot of the console output showing them here.)

Now you can modify the feature generation in the code above and see how your modifications perform. Rather than just bigrams, extract all possible unigrams, bigrams, and trigrams and use them to build the Bag of Words; that should make a significant difference. You can also use the length of the string as a feature, and experiment with other algorithms, including hierarchical clustering. Be sure to share your results after trying these modifications.
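One way to act on those suggestions (my own sketch, not code from the answer; `getAllNGrams` and `buildFeatures` are hypothetical names): concatenate unigram, bigram, and trigram counts, and append the string length as one extra feature column.

```python
import numpy

def getAllNGrams(string, orders=(1, 2, 3)):
    # Collect all character n-grams for each requested order.
    grams = []
    for n in orders:
        grams.extend(string[i:i+n] for i in range(len(string) - n + 1))
    return grams

def buildFeatures(strings):
    # Build a (num_strings x (vocab_size + 1)) matrix: n-gram counts
    # plus the string length as the final column.
    gramLists = [getAllNGrams(s) for s in strings]
    vocab = sorted({g for line in gramLists for g in line})
    index = {g: i for i, g in enumerate(vocab)}
    data = numpy.zeros((len(strings), len(vocab) + 1))
    for row, (s, line) in enumerate(zip(strings, gramLists)):
        for g in line:
            data[row, index[g]] += 1
        data[row, -1] = len(s)  # string length as an extra feature
    return data, vocab
```

The resulting matrix can be passed to `km.fit(...)` exactly like the bigram-only vectors; you may want to scale the length column so it does not dominate the n-gram counts.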

Enjoy!