Here is a sample dataset:
ID | Details |
---|---|
1 | Here Are the Details on Facebook's Global Part... |
2 | Aktien New York Schluss: Moderate Verluste nac... |
3 | Clôture de Wall Street : Trump plombe la tend... |
4 | '' |
5 | NaN |
I need to add a 'Language' column that indicates which language is used in the 'Details' column, so that the result looks like this:
ID | Details | Language |
---|---|---|
1 | Here Are the Details on Facebook's Global Part... | en |
2 | Aktien New York Schluss: Moderate Verluste nac... | de |
3 | Clôture de Wall Street : Trump plombe la tend... | fr |
4 | '' | NaN |
5 | NaN | NaN |
I tried this code:
!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)
It failed, I guess because of rows like the one with 'ID'=4, where 'Details' is an empty string. So I tried this:
!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)
However, I still got an error:
LangDetectException: No features in text.
You can catch the error and return NaN from the function you apply. Note that you can give any callable that takes one input and returns one output as the argument to .apply(); it doesn't have to be a lambda.
I'm not sure why you had if len(x)>1 in there: that would only return NaN when the original string has zero or one characters, but I included it in my detect_lang function to keep the functionality consistent with your lambda.
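The detect_lang function mentioned above is not reproduced in this post. The following is a minimal sketch of what such a function might look like, not the original code. The parameter names detector, errors and min_len are assumptions made for illustration. The detector is passed in as an argument so the helper can be tried without langdetect installed. The length check mirrors the if len(x)>1 condition from the question, and the try/except covers strings that are long enough but still raise, which appears to be the case for text with no letters in it.

```python
import math


def detect_lang(text, detector, errors=(Exception,), min_len=2):
    """Return detector(text), or NaN when the text is unusable or detection fails.

    detector : any callable taking a string and returning a language code
               (for example langdetect.detect).
    errors   : exception types that should be turned into NaN.
    min_len  : strings shorter than this are skipped, matching len(x) > 1.
    """
    # Non-strings (such as NaN) and very short strings are not passed on.
    if not isinstance(text, str) or len(text) < min_len:
        return math.nan
    try:
        return detector(text)
    except errors:
        # e.g. LangDetectException: No features in text.
        return math.nan


# Possible usage with langdetect and pandas (assumes a DataFrame df
# with a 'Details' column, as in the question):
#
# from langdetect import detect, DetectorFactory
# from langdetect.lang_detect_exception import LangDetectException
# DetectorFactory.seed = 0
# df['Language'] = df['Details'].apply(
#     detect_lang, detector=detect, errors=(LangDetectException,)
# )
```

Because the function handles NaN itself, the dropna step is no longer needed and the result can be assigned straight back to df, which keeps the rows with 'ID' 4 and 5 in the output with NaN in 'Language'.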