Language detection with data in PostgreSQL

I have a table in PostgreSQL where a column is a text. I need a library or tool that can identify the language of each text for a test purpose.

The solution does not need to run inside PostgreSQL, because I'm having problems installing procedural languages there. Anything, in any language, that can connect to the database, retrieve the texts, and identify their language is welcome.

I used Lingua::Identify, suggested in the answers, directly in a Perl script. It worked, but the results are not precise.

The texts I want to identify come from the web and most are in Portuguese, but Lingua::Identify classifies many of them as French, Italian, or Spanish, which are similar languages.

I need something more precise.

I added the java and r tags because those are the languages I'm using in the system, so a solution using them will be easy to implement, but solutions in any language are welcome.

There are 6 answers

1
Gaurav On BEST ANSWER

Try these:

This blog post shares some tests to compare the 2 libraries (along with a 3rd - the Language Identification module of Apache Tika, which really is a complete toolkit for Text Analysis).

1
Savino Sguera On

Naive Bayes classifiers are very good at language identification. You can find implementations in all the major languages, or you can implement one yourself; it's not extremely hard. The Wikipedia entry is interesting too: https://en.wikipedia.org/wiki/Naive_Bayes_classifier.

3
filiprem On

You can use PL/Perl (CREATE FUNCTION langof(text) LANGUAGE plperlu AS ...) with the Lingua::Identify CPAN module.

Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Identify qw(langof);

local $/;                          # slurp mode: read all input at once
my $textstring = <>;               # warning: loads the whole file into memory
my $lang = langof($textstring);    # scalar context: the most probable language
print "$lang\n";

And the function:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
    use Lingua::Identify qw(langof);
    return langof( shift );
$perlcode$;

Works for me:

filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
 langof
--------
 pl
(1 row)

Time: 1.801 ms

PL/Perl on Windows

The PL/Perl language library (plperl.dll) comes preinstalled with the latest Windows installer for PostgreSQL.

But to use PL/Perl, you need the Perl interpreter itself, specifically Perl 5.14 (at the time of this writing). The most common installer is ActiveState, but it's not free. A free one comes from Strawberry Perl. Make sure you have PERL514.DLL in place.

After installing Perl, log in to your PostgreSQL database and try to run

CREATE LANGUAGE plperlu;

Language identification library

If quality is your concern, you have some options: You can improve Lingua::Identify yourself (it's open source) or you could try another library. I found this one, which is commercial but looks promising.

0
LiKao On

The problem with language detection is that it will never be fully precise. My browser quite often misidentifies the language, and it was built by Google, who probably put a lot of great minds on that task.

However, here are some points to consider:

I am not sure what Perl's Lingua::Identify module really uses, but most often these tasks are handled by Naive Bayesian models, as somebody pointed out in another answer. Bayesian models use probability to classify into a number of categories; in your case these would be the different languages. These probabilities are both conditional probabilities, i.e. how often a certain feature appears within each category, and independent (prior) probabilities, i.e. how often each category appears overall.

Because both kinds of information are used, you are very likely to get low prediction quality when the priors are wrong. I guess Lingua::Identify has mostly been trained on a corpus of online documents, so the highest prior will most likely be English. This means that Lingua::Identify will most likely classify your documents as English unless it has strong reasons to believe otherwise (in your case it most likely does have strong reasons, because you say your documents are misclassified as Italian, French, and Spanish).
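To make the effect of the priors concrete, here is a minimal Python sketch. It does not reflect how Lingua::Identify works internally; the log-likelihoods and prior values are made-up numbers, and the function names `posterior` and `best` are hypothetical. It only shows that the same evidence can yield a different answer when the priors change.

```python
import math

def posterior(log_likelihoods, priors):
    """Combine per-language log-likelihoods with priors using Bayes' rule."""
    scores = {lang: ll + math.log(priors[lang])
              for lang, ll in log_likelihoods.items()}
    # Normalise in a numerically stable way (log-sum-exp).
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {lang: math.exp(s - top) / total for lang, s in scores.items()}

def best(post):
    """Return the language with the highest posterior probability."""
    return max(post, key=post.get)

# Made-up evidence: the text fits Portuguese slightly better than Spanish.
log_likelihoods = {"pt": -100.0, "es": -101.0}

# Hypothetical priors: one where Spanish dominates the training corpus,
# one that matches a collection that is mostly Portuguese.
generic_priors = {"pt": 0.1, "es": 0.9}
matching_priors = {"pt": 0.9, "es": 0.1}

print(best(posterior(log_likelihoods, generic_priors)))   # es
print(best(posterior(log_likelihoods, matching_priors)))  # pt
```

With the mismatched prior, the small advantage in the evidence for Portuguese is outweighed; with a prior that matches the data, it is not.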

This means you should try to re-train your model, if possible. There might be some methods within Lingua::Identify to help you with this. If not, I would suggest you write your own Naive Bayes classifier (it's quite simple actually).

If you write your own Naive Bayes classifier, you have to decide on a set of features. The frequencies of letters are often very characteristic of each language, so this would be a first guess. Just try to train your classifier on these frequencies first. Naive Bayes classifiers are used in spam filters, so you can train yours like one of those: run it on a sample set, and whenever you get a misclassification, update the classifier with the correct classification. After a while it will be wrong less and less often.
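The training loop described above could be sketched as follows in Python. This is an illustration under assumptions, not a tuned implementation: the class name `LetterNaiveBayes`, its method names, the add-one smoothing, and the toy training sentences are all choices made for this example.

```python
import math
from collections import Counter, defaultdict

class LetterNaiveBayes:
    """Multinomial Naive Bayes over single letters, with add-one smoothing."""

    def __init__(self):
        self.letter_counts = defaultdict(Counter)  # language -> letter -> count
        self.doc_counts = Counter()                # language -> training texts seen
        self.alphabet = set()                      # every letter seen in training

    @staticmethod
    def features(text):
        # Features are the individual letters, lowercased; digits,
        # punctuation, and whitespace are ignored.
        return [ch for ch in text.lower() if ch.isalpha()]

    def train(self, text, lang):
        letters = self.features(text)
        self.letter_counts[lang].update(letters)
        self.doc_counts[lang] += 1
        self.alphabet.update(letters)

    def classify(self, text):
        letters = self.features(text)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.letter_counts.items():
            score = math.log(self.doc_counts[lang] / total_docs)  # log prior
            denom = sum(counts.values()) + len(self.alphabet)
            for ch in letters:
                score += math.log((counts[ch] + 1) / denom)       # log likelihood
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    def correct(self, text, true_lang):
        """Spam-filter style feedback: retrain only when the guess was wrong."""
        if self.classify(text) != true_lang:
            self.train(text, true_lang)

clf = LetterNaiveBayes()
clf.train("o rato roeu a roupa do rei de Roma e a rainha não gostou", "pt")
clf.train("the quick brown fox jumps over the lazy dog", "en")
print(clf.classify("a rainha gostou da roupa"))

# A Spanish text cannot be recognised yet, so the correction step adds it.
clf.correct("el perro come", "es")
```

Real training data would need to be far larger than one sentence per language; the point here is only the train, classify, correct cycle.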

If single-letter frequencies do not give you good enough results, you could try using n-grams instead (but be aware of the combinatorial explosion this will introduce). I would not suggest ever trying anything longer than 3-grams. If this still does not give you good results, try manually identifying unique frequent words in each language and adding those to your feature set. I am sure that once you start experimenting with this you will get more ideas for features to try out.
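As a rough illustration of n-gram features and why they grow so quickly, here is a small Python sketch. The helper names `char_ngrams` and `ngram_features` are hypothetical, and the 27-symbol alphabet (26 letters plus a space) is a simplifying assumption; accented letters such as those in Portuguese would make the numbers larger.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return overlapping character n-grams; spaces mark word boundaries."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_features(text, max_n=3):
    """Count all n-grams for n = 1..max_n (3 by default, per the advice above)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(char_ngrams(text, n))
    return counts

print(char_ngrams("não", 3))

# Upper bound on the number of distinct n-grams for a 27-symbol alphabet.
alphabet_size = 27
for n in (1, 2, 3, 4):
    print(n, alphabet_size ** n)
```

Going from 3-grams to 4-grams raises the upper bound from 19,683 to 531,441 possible features, which is the explosion the answer warns about; most of those never occur, but the training data needed to estimate the ones that do grows accordingly.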

Another nice thing about the approach using Bayesian classifiers is that you can always add new information when more documents come in that do not match the training data. In that case you can just reclassify a few of the new documents by hand and, as with a spam filter, the classifier will adapt to the changing environment.

0
Gary - Stand with Ukraine On

I found a library called TextCat, which is available under the LGPL. I can't say what the quality of its identification is, but it has an online demo form, so maybe you can throw some text at it before deciding if it's worth downloading.

It's also written in Perl, so if you do want to use it, the approach in filiprem's answer would be a good starting point.

0
Laurynas On

There is also a language detection web service that offers both free and premium plans at http://detectlanguage.com

It has Ruby and PHP clients, but it can be accessed from any language with a simple web request. Output is in JSON.