I'm trying to get the most frequently repeated words from a string with this code.
import Foundation

let text = """
aa bb aa bb aa bb cc dd dd cc zz zz cc dd zz
"""
let words = text.unicodeScalars.split(omittingEmptySubsequences: true, whereSeparator: { !CharacterSet.alphanumerics.contains($0) })
.map { String($0) }
let wordSet = NSCountedSet(array: words)
let sorted = wordSet.sorted { wordSet.count(for: $0) > wordSet.count(for: $1) }
print(sorted.prefix(3))
The result is:
[cc, dd, aa]
Currently, it counts every word, even if it is a single character.
What I want to do is:
- only add a word to the NSCountedSet if it has more than one character.
- if words in the NSCountedSet have the same count, sort them alphabetically (the desired result is aa, cc, dd).
And, if possible:
- omit certain parts of speech from the string, such as 'and', 'a', 'how', 'of', 'to', 'it', 'in', 'on', 'who', etc.
Let's consider this string:
You could use a linguistic tagger:
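A minimal sketch of creating such a tagger, assuming Foundation's NSLinguisticTagger with the lexical-class scheme is what is meant. It is only available on Apple platforms, and Apple has since deprecated it in favour of NLTagger from the NaturalLanguage framework:

```swift
import Foundation

// Assumption: tagging by lexical class (noun, verb, preposition, ...)
// is enough to decide which words to keep.
let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
```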
To count the multiplicity of each word I'll be using a dictionary:
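A minimal sketch of that dictionary, keyed by the word itself. The sampleWords array is hypothetical and is only there to show how a count gets incremented:

```swift
// Maps each word to the number of times it has been seen.
var dict = [String: Int]()

// Hypothetical sample input, just to show the counting idiom.
let sampleWords = ["aa", "bb", "aa"]
for word in sampleWords {
    // The default value lets the first occurrence start from 0.
    dict[word, default: 0] += 1
}
// dict["aa"] is now 2 and dict["bb"] is 1.
```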
Let's define the accepted linguistic tags (you can change these to your liking):
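One possible choice, as a sketch. The exact selection below is an assumption, not a recommendation; the point is that leaving out classes such as .preposition, .determiner, .conjunction and .pronoun is what filters out words like 'and', 'of' or 'it':

```swift
import Foundation

// Assumed set of lexical classes to keep; adjust to taste.
let acceptedTags: Set<NSLinguisticTag> = [.noun, .verb, .adjective, .adverb]
```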
Now let's parse the string using the linguistic tagger:
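The parsing step could look roughly like this, assuming NSLinguisticTagger on an Apple platform. The sample text and the accepted tag set are hypothetical placeholders, and the tags a given word receives depend on the system's language model, so no particular output is claimed:

```swift
import Foundation

// Hypothetical sample text; substitute the string you want to analyse.
let text = "The quick brown fox jumps over the lazy dog and the quick cat"

// Assumed set of lexical classes to keep.
let acceptedTags: Set<NSLinguisticTag> = [.noun, .verb, .adjective, .adverb]

var dict = [String: Int]()

let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
tagger.string = text

// NSLinguisticTagger works with NSRange, so measure the text in UTF-16 units.
let range = NSRange(location: 0, length: text.utf16.count)
let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]

tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange, _ in
    // Skip tokens whose lexical class is not in the accepted set.
    guard let tag = tag, acceptedTags.contains(tag) else { return }
    let word = (text as NSString).substring(with: tokenRange)
    // Also skip single-character words, as asked in the question.
    guard word.count > 1 else { return }
    dict[word, default: 0] += 1
}
```

Filtering on word.count here covers the "more than one character" requirement in the same pass as the part-of-speech filter.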
Now dict has the desired words with their multiplicity.
As you can see, a Dictionary is an unordered collection. Now let's introduce some law and order:
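A sketch of the ordering step: sort by descending count and break ties alphabetically. The counts below are hypothetical and stand in for dict:

```swift
// Hypothetical counts, standing in for dict.
let dict = ["dd": 3, "aa": 3, "zz": 2, "cc": 3, "bb": 1]

// Sorting a Dictionary yields an array of (key, value) pairs.
let sortedEntries = dict.sorted { lhs, rhs in
    if lhs.value != rhs.value {
        return lhs.value > rhs.value   // higher count first
    }
    return lhs.key < rhs.key           // ties broken alphabetically
}
// Order of keys: aa, cc, dd, zz, bb
```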
Now let's get the keys only and print the three most frequent words:
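As a sketch, with hypothetical entries that are already sorted by descending count and then alphabetically:

```swift
// Hypothetical, already sorted (word, count) pairs.
let sortedEntries: [(key: String, value: Int)] = [
    (key: "aa", value: 3), (key: "cc", value: 3), (key: "dd", value: 3),
    (key: "zz", value: 2), (key: "bb", value: 1)
]

// Keep the words and drop the counts.
let keys = sortedEntries.map { $0.key }

// prefix(3) is safe even when there are fewer than three words.
let topThree = Array(keys.prefix(3))
print(topThree)  // prints ["aa", "cc", "dd"]
```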
To get only the most frequent words, it would be more efficient to use a heap (or a trie) data structure instead of hashing every word, sorting them all by frequency, and then taking a prefix. It should be a fun exercise.
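One way to approach the heap part of that exercise is sketched below. It is only a sketch under some assumptions: the words are still counted with a dictionary first, and the heap only replaces the full sort, keeping at most k candidates at a time. The Swift standard library does not ship a heap, so a small hand-rolled one is included; Heap, Entry, ranksAhead and topWords are hypothetical names.

```swift
/// A minimal binary heap; `areInOrder(a, b)` is true when `a` should sit above `b`.
struct Heap<Element> {
    private var items: [Element] = []
    private let areInOrder: (Element, Element) -> Bool

    init(areInOrder: @escaping (Element, Element) -> Bool) {
        self.areInOrder = areInOrder
    }

    var count: Int { items.count }
    var peek: Element? { items.first }

    mutating func push(_ element: Element) {
        items.append(element)
        var child = items.count - 1
        // Sift the new element up until its parent is in order with it.
        while child > 0 {
            let parent = (child - 1) / 2
            if areInOrder(items[parent], items[child]) { break }
            items.swapAt(parent, child)
            child = parent
        }
    }

    mutating func pop() -> Element? {
        guard !items.isEmpty else { return nil }
        items.swapAt(0, items.count - 1)
        let top = items.removeLast()
        var parent = 0
        // Sift the moved element down to restore the heap property.
        while true {
            let left = 2 * parent + 1, right = left + 1
            var best = parent
            if left < items.count && areInOrder(items[left], items[best]) { best = left }
            if right < items.count && areInOrder(items[right], items[best]) { best = right }
            if best == parent { break }
            items.swapAt(parent, best)
            parent = best
        }
        return top
    }
}

typealias Entry = (word: String, count: Int)

/// True when `a` should be listed before `b`: higher count first, then alphabetical.
func ranksAhead(_ a: Entry, _ b: Entry) -> Bool {
    a.count != b.count ? a.count > b.count : a.word < b.word
}

/// Returns the `k` best-ranked words while holding at most `k` entries in the heap.
func topWords(_ counts: [String: Int], limit k: Int) -> [String] {
    guard k > 0 else { return [] }
    // The root is the worst-ranked entry kept so far, so it is cheap to evict.
    var heap = Heap<Entry>(areInOrder: { ranksAhead($1, $0) })
    for (word, count) in counts {
        let entry: Entry = (word: word, count: count)
        if heap.count < k {
            heap.push(entry)
        } else if let worst = heap.peek, ranksAhead(entry, worst) {
            _ = heap.pop()
            heap.push(entry)
        }
    }
    // Entries pop worst-first, so reverse to get best-first order.
    var result: [String] = []
    while let entry = heap.pop() { result.append(entry.word) }
    return result.reversed()
}

// Hypothetical counts for illustration.
let top = topWords(["dd": 3, "aa": 3, "zz": 2, "cc": 3, "bb": 1], limit: 3)
print(top)  // prints ["aa", "cc", "dd"]
```

With n distinct words this selection step costs O(n log k) rather than the O(n log n) of a full sort, which only matters when k is much smaller than n.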