I have a table in PostgreSQL in which one column holds text. For testing purposes, I need a library or tool that can identify the language of each text.
The solution does not need to be PostgreSQL code, because I'm having problems installing languages there; anything in any language that can connect to the database, retrieve the texts, and identify their language is welcome.
I used Lingua::Identify, suggested in the answers, directly in a Perl script. It worked, but the results are not precise.
The texts I want to identify come from the web and most are in Portuguese, but Lingua::Identify classifies many of them as French, Italian, or Spanish, which are similar languages.
I need something more precise.
I added the java and r tags because those are the languages I'm using in the system, so a solution using them would be easy to implement, but solutions in any language are welcome.
Try these:
This blog post shares some tests comparing the two libraries (along with a third, the Language Identification module of Apache Tika, which is really a complete toolkit for text analysis).
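Whichever library you pick, it can help to have a crude baseline to compare it against on the Portuguese/Spanish/French/Italian confusion described in the question. The following is only a minimal sketch of a stopword-counting heuristic restricted to those four candidate languages. It is not the method used by Lingua::Identify, Apache Tika, or any other library; the class name, the `guess` method, and the tiny word lists are all hypothetical and hand-picked for illustration. In a real setup the strings would come from the PostgreSQL column (for example via JDBC) rather than being hard-coded.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative baseline only: counts common function words per language.
// The word lists are assumptions chosen for this sketch, not a vetted resource.
class StopwordLanguageGuess {

    // LinkedHashMap keeps insertion order, so ties resolve to the first language listed.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();

    static {
        STOPWORDS.put("pt", Set.of("uma", "um", "com", "muito", "isso", "mas",
                "os", "do", "da", "em", "ao", "nos"));
        STOPWORDS.put("es", Set.of("el", "los", "las", "una", "con", "muy",
                "pero", "y", "del", "en", "es", "esto"));
        STOPWORDS.put("fr", Set.of("le", "les", "une", "des", "avec", "pour",
                "mais", "et", "est", "dans", "ce", "pas"));
        STOPWORDS.put("it", Set.of("il", "gli", "una", "con", "per", "molto",
                "ma", "non", "che", "di", "sono", "questo"));
    }

    // Returns the language code with the most stopword hits, or "unknown" if none match.
    public static String guess(String text) {
        // Lowercase and split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    score++;
                }
            }
            // Strictly greater: an earlier language wins a tie.
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Eu gosto muito do livro, mas os filmes da mesma autora",
            "El libro es muy bueno, pero los personajes del cuento",
            "Le livre est dans le sac avec les cahiers",
            "Questo libro non ha molto di nuovo per gli studenti"
        };
        for (String sample : samples) {
            System.out.println(guess(sample) + " <- " + sample);
        }
    }
}
```

Restricting the candidates to the languages you actually expect is the main design choice here: a detector that only has to choose among four languages has fewer ways to go wrong than one choosing among dozens. Very short texts will still produce few or no hits, so treat "unknown" results as a signal to fall back on a proper library.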