Can you use perplexity to guess the language of a document?

78 views Asked by karak87rt0 At 14 October 2023 at 22:45

I'm building five bigram models, one from each of five languages' training sets. I have one mystery file (I can't see its contents, but my program can read it) whose language I have to guess (it's one of the five) based on the perplexity scores the five n-gram LMs assign to it.
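A minimal sketch of the procedure described above, for concreteness. It assumes character-level bigrams, add-one smoothing, and a vocabulary shared across all models; none of these choices is stated in the question, and the function names (`train_bigram`, `perplexity`, `guess_language`) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character bigrams and the contexts (first characters) they start from."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts


def perplexity(model, text, vocab_size):
    """Perplexity of `text` (at least 2 characters) under an add-one smoothed bigram model."""
    bigrams, contexts = model
    pairs = list(zip(text, text[1:]))
    log_prob = 0.0
    for a, b in pairs:
        # Add-one smoothing: unseen bigrams still get a non-zero probability.
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        log_prob += math.log2(p)
    cross_entropy = -log_prob / len(pairs)  # bits per bigram
    return 2 ** cross_entropy


def guess_language(training_texts, mystery_text):
    """Return the language whose model gives the lowest perplexity, plus all scores."""
    # Use one shared vocabulary so the scores are comparable across models.
    vocab = set(mystery_text)
    for text in training_texts.values():
        vocab |= set(text)
    vocab_size = len(vocab)
    scores = {
        lang: perplexity(train_bigram(text), mystery_text, vocab_size)
        for lang, text in training_texts.items()
    }
    return min(scores, key=scores.get), scores
```

Here `training_texts` would be a dict mapping each language name to its training text as one string. If the models instead use different tokenization, smoothing, or vocabularies, their perplexities are not directly comparable.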

Assuming the models have the same vocabulary size, how robust is it to rely solely on the perplexity and cross-entropy of the five LMs to make an educated guess about the mystery language?
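For reference, the standard definitions for a bigram model evaluated on a test sequence $w_1, \dots, w_N$ are given below. The notation ($H$, $\mathrm{PP}$, $w_i$) is assumed here and does not come from the question.

```latex
H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-1}),
\qquad
\mathrm{PP}(W) = 2^{H(W)}
```

Because perplexity is a monotonically increasing function of cross-entropy, ranking the five LMs by either quantity gives the same ordering, so the two do not provide independent evidence.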

Is perplexity a reliable intrinsic metric in applications that require language classification (like when you type a sentence into Google Translate and it guesses the language you're typing in)?

I know the mystery language is German, but the perplexity scores from the Spanish and French LMs are the lowest, whereas the German LM has one of the higher perplexities. How could this be? Shouldn't the German LM be less perplexed when it encounters text in its own language?


There are 0 answers
