How to add and sort words in an NSCountedSet in Swift?


I'm trying to get the most frequently repeated words from a string with this code:

let text = """
  aa bb aa bb aa bb cc dd dd cc zz zz cc dd zz
  """
let words = text.unicodeScalars
    .split(omittingEmptySubsequences: true, whereSeparator: { !CharacterSet.alphanumerics.contains($0) })
    .map { String($0) }
let wordSet = NSCountedSet(array: words)
let sorted = wordSet.sorted { wordSet.count(for: $0) > wordSet.count(for: $1) }
print(sorted.prefix(3))

The result is:

[cc, dd, aa]

Currently, it counts every word, even one that is only a single character.

What I want to do is:

  1. Only add a word to the NSCountedSet if it has more than one character.
  2. If words in the NSCountedSet have the same count, sort them alphabetically (desired result: aa, cc, dd).

And, if possible:

  1. Omit certain parts of speech from the string, such as 'and', 'a', 'how', 'of', 'to', 'it', 'in', 'on', 'who', etc.

There is 1 answer

Answered by ielyamani

Let's consider this string:

let text = """
      She was young the way an actual young person is young.
      """

You could use a linguistic tagger:

import Foundation  // NSLinguisticTagger is declared in Foundation

let options = NSLinguisticTagger.Options.omitWhitespace.rawValue
let tagger = NSLinguisticTagger(tagSchemes: NSLinguisticTagger.availableTagSchemes(forLanguage: "en"), options: Int(options))

To count the multiplicity of each word I'll be using a dictionary:

var dict = [String : Int]()

Let's define the accepted linguistic tags (you can change these to your liking):

let acceptedtags: Set = ["Verb", "Noun", "Adjective"]

Now let's parse the string using the linguistic tagger:

let range = NSRange(location: 0, length: text.utf16.count)
tagger.string = text

tagger.enumerateTags(
    in: range,
    scheme: .nameTypeOrLexicalClass,
    options: NSLinguisticTagger.Options(rawValue: options),
    using: { tag, tokenRange, sentenceRange, stop in
        guard let range = Range(tokenRange, in: text)
            else { return }

        let token = String(text[range]).lowercased()

        if let tagValue = tag?.rawValue,
            acceptedtags.contains(tagValue)
        {
            dict[token, default: 0] += 1
        }

        // print(String(describing: tag) + ": \(token)")
})

Now the dict holds the desired words along with their multiplicity:

print("dict =", dict)

As you can see, a Dictionary is an unordered collection. Now let's introduce some law and order:

// Sort by count (descending); break ties alphabetically by key (ascending).
let ordered = dict.sorted {
    ($0.value, $1.key) > ($1.value, $0.key)
}
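
The tuple comparison above can be a little hard to read at first, so here is a minimal sketch of it in isolation. The `sampleCounts` dictionary is made-up data (not taken from the strings above), chosen so that three words tie on frequency:

```swift
// Hypothetical counts: "aa", "cc" and "dd" all occur three times.
let sampleCounts = ["dd": 3, "aa": 3, "zz": 5, "cc": 3, "bb": 1]

// Swapping the keys inside the tuples is what makes the tie-break ascending:
// a higher count wins first; on equal counts the smaller key wins.
let ranked = sampleCounts
    .sorted { ($0.value, $1.key) > ($1.value, $0.key) }
    .map { $0.key }

print(ranked)  // ["zz", "aa", "cc", "dd", "bb"]
```

Because dictionary keys are unique, two entries never compare as equal here, so the result does not depend on the dictionary's iteration order.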

Now let's get the keys only:

let mostFrequent = ordered.map { $0.key }

and print the three most frequent words:

print("top three =", mostFrequent.prefix(3))
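
If you would rather stay with the NSCountedSet from the question, the same two rules (skip one-character words, break ties alphabetically) could be applied roughly like this. It is only a sketch: `sample`, `counted` and `ranked` are illustrative names, the input string is invented, and splitting on non-letter, non-digit characters is an assumption about what counts as a word:

```swift
import Foundation

// Invented sample input containing a few one-character words.
let sample = "aa bb aa bb aa b cc dd dd cc zz cc dd a"

// Split on anything that is not a letter or digit, then drop one-character words.
let words = sample
    .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
    .map { String($0) }
    .filter { $0.count > 1 }

let counted = NSCountedSet(array: words)

// NSCountedSet yields `Any`, so cast back to String before comparing.
let ranked = counted
    .compactMap { $0 as? String }
    .sorted { (counted.count(for: $0), $1) > (counted.count(for: $1), $0) }

print(ranked.prefix(3))  // ["aa", "cc", "dd"]
```

This does not filter by part of speech; for that, the tagger-based approach above is still needed.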

To get only the few most frequent words, it would be more efficient to use a Heap (or a Trie) data structure instead of hashing every word, sorting them all by frequency, and then taking a prefix. It should be a fun exercise.
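
As a starting point for that exercise, here is a rough sketch that avoids sorting the whole dictionary. Note that it is not a real heap: it keeps a small sorted buffer of at most k entries, which costs O(n·k) instead of the O(n log k) a binary heap would give, but it is short and good enough for small k. `topWords` is a hypothetical helper, not a library function, and the sample counts are made up:

```swift
// Returns the k most frequent words: higher count first, ties broken alphabetically.
func topWords(_ counts: [String: Int], k: Int) -> [String] {
    guard k > 0 else { return [] }

    // True when `a` should be ranked ahead of `b`.
    func ranksBefore(_ a: (key: String, value: Int), _ b: (key: String, value: Int)) -> Bool {
        return (a.value, b.key) > (b.value, a.key)
    }

    var best: [(key: String, value: Int)] = []  // kept sorted, never more than k entries

    for entry in counts {
        // Skip entries that cannot beat the current k-th best.
        if best.count == k, let last = best.last, !ranksBefore(entry, last) { continue }

        // Insert before the first buffered entry that `entry` outranks.
        let index = best.firstIndex { ranksBefore(entry, $0) } ?? best.endIndex
        best.insert(entry, at: index)
        if best.count > k { best.removeLast() }
    }
    return best.map { $0.key }
}

// Made-up counts for illustration.
let sampleCounts = ["young": 3, "person": 1, "way": 1, "actual": 1]
print(topWords(sampleCounts, k: 2))  // ["young", "actual"]
```

Replacing the sorted buffer with a min-heap keyed on the same ordering would keep the interface unchanged while lowering the per-insert cost.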