Detect languages in a column, but ignore ambiguous values. Why am I getting an error?

Question

348 views Asked by Demi At 26 April 2022 at 15:07

Here is a sample dataset:

ID	Details
1	Here Are the Details on Facebook's Global Part...
2	Aktien New York Schluss: Moderate Verluste nac...
3	Clôture de Wall Street : Trump plombe la tend...
4	''
5	NaN

I need to add a 'Language' column that holds the language used in the 'Details' column, so that the result looks like this:

ID	Details	Language
1	Here Are the Details on Facebook's Global Part...	en
2	Aktien New York Schluss: Moderate Verluste nac...	de
3	Clôture de Wall Street : Trump plombe la tend...	fr
4	''	NaN
5	NaN	NaN

I tried this code:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)

It failed, I guess because of rows like the one with ID 4. So I tried this:

!pip install langdetect
import numpy as np
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2 = df.dropna(subset=['Details'])
df2['Language'] = df2['Details'].apply(lambda x: detect(x) if len(x) > 1 else np.nan)

However, I still got an error:

LangDetectException: No features in text.

There is 1 answer

**pho** · Accepted Answer · 2022-04-26T15:14:39+00:00

You can catch the error and return NaN from the function you apply. Note that .apply() accepts any callable that takes one input and returns one output; it doesn't have to be a lambda.

import numpy as np
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_lang(x):
    if len(x) <= 1:
        return np.nan
    try:
        lang = detect(x)
        if lang:
            return lang  # Return lang if it is not empty
    except LangDetectException:
        pass  # Swallow the error and fall through to the final return
    return np.nan  # Reached if lang was empty or detect() raised

df2['Language'] = df2['Details'].apply(detect_lang)

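I'm not sure why you had if len(x)>1 in there: it only returns NaN when the string has zero or one characters. A longer string can still contain nothing langdetect can work with (only quotes, digits, or whitespace, for example), in which case detect() raises the "No features in text" error you saw. I kept the check in my detect_lang function so that it behaves the same as your lambda.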
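
As a variation on the same idea, the length check could be swapped for a "does this contain any letters?" check, with the detector passed in as an argument. This is only a sketch: safe_detect and its errors and default parameters are hypothetical names introduced here, not part of langdetect or pandas, and the letter test is an assumption about what counts as usable text.

```python
import math


def safe_detect(text, detector, errors=(Exception,), default=math.nan):
    """Return detector(text), or `default` when no language can be determined.

    `safe_detect`, `errors`, and `default` are illustrative names, not a
    library API. `detector` is any callable that maps a string to a label.
    """
    # Missing values (None, float NaN) and other non-strings are not text.
    if not isinstance(text, str):
        return default
    # A string without a single letter gives a detector nothing to score,
    # even when it is longer than one character (for example "''" or "123").
    if not any(ch.isalpha() for ch in text):
        return default
    try:
        result = detector(text)
    except errors:
        # The detector gave up on this input; treat it as "unknown".
        return default
    return result if result else default


# Assumed usage with langdetect and pandas (both installed separately):
#   from langdetect import detect
#   from langdetect.lang_detect_exception import LangDetectException
#   df['Language'] = df['Details'].apply(
#       safe_detect, detector=detect, errors=(LangDetectException,)
#   )
```

Because non-string values are handled inside the function, this version would not need the dropna step first. Passing errors=(LangDetectException,) keeps the except clause narrow, so unrelated bugs still surface.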
