My dataset looks something like this
['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '', '', '', 'FF', 'F', 'FF', 'F', 'F', 'FFFFFFIFF', '', 'FFFFFFF', 'F', '', '', 'F', '', '', '', '', '', '', '', 'F', '', '', 'ABB', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FFEIE', 'FF', 'ABABCDIIJCCFG', '', 'FABACFFF', 'FEGGIHJCABAGGFEFGGFEECA', '', 'FF', 'FFGEFGGFFG', 'F', 'FFF', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'I', '', '', 'ABIIII', '', '', '', '', 'I', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'AAAAA', 'AFGFE', 'FGFEEFGFEFGFEFGJJGFEACHJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'JFEFFFFFFF', '', 'AAIIJFFGEFGCABAGG', '', '', '', '', '', '', '', '', '', 'F', 'JFJFJFJ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', 'F', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'F', 'FGFEFGFE', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
This is just a sample, but i will have a LOT more strings. How do i cluster them such that each cluster has some pattern
The idea I'm suggesting is originated in text-processing, NLP and information retrieval and very widely used in situations where you have sequences of characters/information like Genetic information. Because you have to preserve the sequence, we can use the concepts of n-grams. I'm using bi-grams in the following example, though you can generalize for higher order grams. N-grams help in preserving the sequential pattern in the data. Don't worry - we keep borrowing ideas from other fields of science - even edit distance and dynamic programming were not originally computer science concepts.
There are many possible approaches to tackle this problem - each one unique, and there's no right one - atleast there's no sufficient research that proves which one is right. Here's my take on it.
So the goal is to create a Bag-Of-Words like vector out of your data strings - and these vectors can be easily fed to any machine learning tool or library for clustering. A quick summary of steps:-
Let's get started
So here These functions would convert the given string to a set of two characters (and single character if there's only one). You can modify them to capture 3-grams or even complete set of all possible uni-, bi- and tri-grams. Feel free to experiment.
Now how to convert strings to vectors? We'll define a function that will convert string to vectors taking care of how many times a particular n-gram appears. This is called BAG OF WORDS. Here, these are BAGS OF SCREENS. Following two functions help you with that:
Voila! We're done. Now feed your data vectors to any K-Means implementation of your choice. I used SKLearn.
You should rather choose to read strings from a file
And then I finally see what my clusters look like using km.labels_ property of KMeans class.
Here're your clusters. Look at the console (bottom) window - There are ten clusters.
Now you can modify feature generation I wrote in the code and see how your modifications performs. Rather than just bigrams, extract all the possible unigrams, bigrams and trigrams and use them to create BOW. There will be significant difference. You can also use length of the string as a feature. You can also experiment with other algorithms including hierarchical clustering. Be sure to send me updated results after your modifications.
Enjoy!