Can you use perplexity to guess the language of a document?


I'm building five bigram models, each trained on a different language's training set. I also have one mystery file (I can't view it, but my program can read it) whose language I have to guess. It's one of the five languages, and the guess has to be based on the perplexity scores that the five n-gram LMs assign to this file.
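To make the setup concrete, here is a minimal sketch of what is being described. It assumes word-level bigrams, add-one smoothing, a shared vocabulary with an `<unk>` token, and tiny made-up training sentences; none of these details come from the actual assignment, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(tokens, vocab):
    """Count contexts and bigrams, mapping out-of-vocabulary tokens to <unk>."""
    toks = ["<s>"] + [t if t in vocab else "<unk>" for t in tokens]
    context_counts = Counter(toks[:-1])
    bigram_counts = Counter(zip(toks[:-1], toks[1:]))
    return context_counts, bigram_counts

def perplexity(model, tokens, vocab):
    """Perplexity of `tokens` under an add-one-smoothed bigram model."""
    context_counts, bigram_counts = model
    toks = ["<s>"] + [t if t in vocab else "<unk>" for t in tokens]
    v = len(vocab) + 1  # +1 for <unk>; assumes <unk> is not already in vocab
    log_prob = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log2(p)
    cross_entropy = -log_prob / (len(toks) - 1)  # bits per token
    return 2 ** cross_entropy

def guess_language(models, tokens, vocab):
    """Return the language whose model is least perplexed, plus all scores."""
    scores = {lang: perplexity(m, tokens, vocab) for lang, m in models.items()}
    return min(scores, key=scores.get), scores

# Toy data (made up for illustration only).
german = "der hund ist gross und der hund ist alt".split()
spanish = "el perro es grande y el perro es viejo".split()
vocab = set(german) | set(spanish)
models = {"de": train_bigram(german, vocab), "es": train_bigram(spanish, vocab)}

best, scores = guess_language(models, "der hund ist alt".split(), vocab)
print(best, scores)
```

Every model here is scored against the same vocabulary and the same tokenization of the mystery text, so the perplexities are directly comparable.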

Assuming all five models have the same vocabulary size, how robust is it to rely solely on the perplexity and cross-entropy of the five LMs to make an educated guess about the mystery language?
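For reference, these are the standard definitions assumed here for a bigram model evaluated on a test sequence $W = w_1 \dots w_N$ (the notation is mine, not from the assignment):

```latex
H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-1}),
\qquad
\mathrm{PP}(W) = 2^{H(W)}
```

Since perplexity is a monotonically increasing function of cross-entropy, ranking the five models by either quantity gives the same ordering.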

Is perplexity used as a reliable intrinsic metric in applications that require language identification (for example, when you type a sentence into Google Translate and it guesses the language you're typing in)?

I know the mystery language is German, but the perplexity scores from the Spanish and French LMs are the lowest, whereas the German LM has one of the highest perplexities. How could this be? Shouldn't the German LM be less perplexed when it encounters text in its own language?
