Identifying the natural language of text and assigning codes like "en", "fr", "tr"

Is there an R package for identifying which language a text is written in? I have many rows of text in different languages, such as "en", "es", "fr", "ja" and so on. Is it possible to get a result with a language column, like the one below?

id text                 language
1  "I am a musician"    en 
2  "я инженер"          ru 
3  "Je suis un poète"   fr

Or is there any other way to determine the natural language of a text?

There is 1 answer

Best answer, by AkselA

Your best bet is probably cldr, which uses Chrome's language detection library.

library(devtools)
install_github("aykutfirat/cldr")

library(cldr)

docs1 <- c(
  "Detects the language of a set of documents with possible input hints. Returns the top 3 candidate languages and their probabilities as well.",
  "Som nevnt på møte forrige uke er det ulike ting som skjer denne og neste uke.",
  "Ganz besonders wollen wir, dass forthin allenthalben in unseren Städten, Märkten und auf dem Lande zu keinem Bier mehr Stücke als allein Gersten, Hopfen und Wasser verwendet und gebraucht werden sollen.",
  "Роман Гёте «Вильгельм Майстер» заложил основы воспитательного романа эпохи Просвещения.")

detectLanguage(docs1)$detectedLanguage
# [1] "ENGLISH" "NORWEGIAN" "GERMAN" "RUSSIAN"

However, your examples seem to be a bit too short.

docs2 <- c("I am a musician", "я инженер", "Je suis un poète")

detectLanguage(docs2)$detectedLanguage
# [1] "Unknown" "Unknown" "Unknown"

As noted by Ben, textcat seems to perform better on the short examples given by gulnerman, but unlike cldr it doesn't indicate how reliable the matches are. This makes it difficult to say how much you can trust the results, even though two out of three were correct in this case.

library(textcat)
textcat(docs2)
# [1] "latin" "russian-iso8859_5" "french"
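
To get the two-letter codes and the language column asked for in the question, the labels can be mapped through a lookup table. This is only a rough sketch: iso_codes is a hypothetical, hand-written and deliberately incomplete table (not something textcat provides), and it assumes the part of a textcat label after the hyphen is just an encoding suffix. The detected vector is hard-coded from the output above so the snippet runs without textcat installed.

```r
# Labels as printed by textcat(docs2) above, hard-coded for illustration.
detected <- c("latin", "russian-iso8859_5", "french")

# Hypothetical, incomplete lookup table from language name to two-letter code;
# extend it with whatever languages you expect in your data.
iso_codes <- c(english = "en", french = "fr", german = "de", latin = "la",
               norwegian = "no", russian = "ru", spanish = "es")

# Assumption: anything after the first hyphen is an encoding suffix
# (e.g. "-iso8859_5"), so strip it before the lookup.
base_name <- sub("-.*$", "", detected)

docs <- data.frame(
  id = seq_along(detected),
  text = c("I am a musician", "я инженер", "Je suis un poète"),
  # Labels missing from iso_codes come out as NA rather than raising an error.
  language = unname(iso_codes[base_name]),
  stringsAsFactors = FALSE
)

print(docs)
```

Note that the first row ends up as "la" because textcat misidentified the English sentence as Latin; the mapping only translates labels, it does not make the detection any more reliable.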