quanteda - stopwords not working in French

For some reason, stop word removal is not working for my corpus, which is entirely in French. I have been trying repeatedly over the past few days, but many words that should have been filtered out simply are not. Has anyone else had a similar issue? I read somewhere that it could be because of the accents. I tried stringi::stri_trans_general(x, "Latin-ASCII"), but I am not certain I did this correctly. I also notice that the French stop word list is sometimes referred to as "french" and sometimes as "fr".

This is one example of the code I tried; I would be extremely grateful for any advice. I also installed quanteda manually because I had difficulties downloading it, so the problem could be linked to that.

text_corp <- quanteda::corpus(data,
   text_field="text")

head(stopwords("french"))

summary(text_corp)

my_dfm <- dfm(text_corp)
myStemMat <- dfm(text_corp, remove = stopwords("french"), stem = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE)

myStemMat[, 1:5]

topfeatures(myStemMat, 20)

In this last step, the output still contains words like "etre" (to be), "plus" (more), "comme" (like), "avant" (before), and "avoir" (to have).

I also tried to filter stop words in a different way, through token creation:

tokens <- tokens(
  text_corp,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE
)

mydfm <- dfm(tokens,
  tolower = TRUE,
  stem = TRUE,
  remove = stopwords("french")
)

topfeatures(mydfm, 20)

Answer by Ken Benoit:

The stopwords are working just fine; the default Snowball list of French stopwords simply does not include the words you wish to remove.

You can see that by inspecting the vector of stopwords returned by stopwords("fr"):

library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
  stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE

This is the full list of words:

sort(stopwords("fr"))
##   [1] "à"        "ai"       "aie"      "aient"    "aies"     "ait"     
##   [7] "as"       "au"       "aura"     "aurai"    "auraient" "aurais"  
##  [13] "aurait"   "auras"    "aurez"    "auriez"   "aurions"  "aurons"  
##  [19] "auront"   "aux"      "avaient"  "avais"    "avait"    "avec"    
##  [25] "avez"     "aviez"    "avions"   "avons"    "ayant"    "ayez"    
##  [31] "ayons"    "c"        "ce"       "ceci"     "cela"     "celà"    
##  [37] "ces"      "cet"      "cette"    "d"        "dans"     "de"      
##  [43] "des"      "du"       "elle"     "en"       "es"       "est"     
##  [49] "et"       "étaient"  "étais"    "était"    "étant"    "été"     
##  [55] "étée"     "étées"    "étés"     "êtes"     "étiez"    "étions"  
##  [61] "eu"       "eue"      "eues"     "eûmes"    "eurent"   "eus"     
##  [67] "eusse"    "eussent"  "eusses"   "eussiez"  "eussions" "eut"     
##  [73] "eût"      "eûtes"    "eux"      "fûmes"    "furent"   "fus"     
##  [79] "fusse"    "fussent"  "fusses"   "fussiez"  "fussions" "fut"     
##  [85] "fût"      "fûtes"    "ici"      "il"       "ils"      "j"       
##  [91] "je"       "l"        "la"       "le"       "les"      "leur"    
##  [97] "leurs"    "lui"      "m"        "ma"       "mais"     "me"      
## [103] "même"     "mes"      "moi"      "mon"      "n"        "ne"      
## [109] "nos"      "notre"    "nous"     "on"       "ont"      "ou"      
## [115] "par"      "pas"      "pour"     "qu"       "que"      "quel"    
## [121] "quelle"   "quelles"  "quels"    "qui"      "s"        "sa"      
## [127] "sans"     "se"       "sera"     "serai"    "seraient" "serais"  
## [133] "serait"   "seras"    "serez"    "seriez"   "serions"  "serons"  
## [139] "seront"   "ses"      "soi"      "soient"   "sois"     "soit"    
## [145] "sommes"   "son"      "sont"     "soyez"    "soyons"   "suis"    
## [151] "sur"      "t"        "ta"       "te"       "tes"      "toi"     
## [157] "ton"      "tu"       "un"       "une"      "vos"      "votre"   
## [163] "vous"     "y"

That's why they are not removed. We can see this with an example I created, using many of your words:

toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
  remove_punct = TRUE
)

tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux"    "avoir"   "glace"   "être"    "heureux" "comme"   "enfant" 
## [8] "avant"   "dîner"

How to remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.

mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")

tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux"    "glace"   "heureux" "enfant"  "dîner"

You could also use one of the other stopword sources, such as "stopwords-iso", which does contain all of the words you wish to remove:

c("comme", "avoir", "plus", "avant", "être") %in%
  stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
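
To connect this back to the pipeline in the question, here is a minimal sketch that removes the "stopwords-iso" list at the token stage, then stems, then builds the dfm. It assumes a reasonably recent quanteda in which tokens_remove() and tokens_wordstem() are available; as far as we know, from version 3 onward the remove and stem arguments of dfm() are deprecated in favour of this kind of step-by-step approach. The object names (toks_nostop, toks_stem, dfmat) are only illustrative, and the toy sentence stands in for your corpus.

```r
library(quanteda)

# Toy sentence reused from the example above; in practice, pass your corpus.
toks <- tokens(
  "Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
  remove_punct = TRUE
)

# Lowercase so that tokens and the (lowercase) stopword list are comparable.
toks <- tokens_tolower(toks)

# Remove stopwords BEFORE stemming, so the list is matched against whole words.
toks_nostop <- tokens_remove(toks, stopwords("fr", source = "stopwords-iso"))

# Stem the remaining tokens with the French Snowball stemmer.
toks_stem <- tokens_wordstem(toks_nostop, language = "french")

# Build the document-feature matrix from the cleaned tokens.
dfmat <- dfm(toks_stem)
topfeatures(dfmat, 20)
```

Removing stopwords before stemming matters because the lists contain full word forms; a stemmed token may no longer match its entry in the list.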

With regard to the language question, see the help for ?stopwords::stopwords, which states:

The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
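
A quick way to convince yourself that the two spellings point at the same Snowball list is to compare them directly. This is only a sketch: it assumes the default source (Snowball), and the variable names sw_code and sw_name are made up for illustration. Whether the full English name is accepted for other sources is not something the help text quoted above promises.

```r
library(quanteda)

# Two-letter ISO 639-1 code (the recommended form).
sw_code <- stopwords("fr")

# Full English name, kept for backwards compatibility (deprecated per the help text).
sw_name <- stopwords("french")

# Compare the two vectors element by element.
identical(sw_code, sw_name)
```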

With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"), this would only help you if you wanted to remove "etre" and your stopword list contained only "être". In the example below, the stopword vector containing the accented character is concatenated with a version of itself in which the accents have been removed.

sw <- "être"
tokens("etre être heureux") %>%
  tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre"    "heureux"

tokens("etre être heureux") %>%
  tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"

c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"
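
The same idea can be applied to a whole stopword list rather than a single word. The sketch below assumes your texts mix accented and unaccented spellings; the names sw_fr, sw_fr_ascii and sw_both are hypothetical. It appends an accent-stripped copy of the Snowball list to the original and removes both variants.

```r
library(quanteda)

# Original French Snowball list (accented forms).
sw_fr <- stopwords("fr")

# Accent-stripped copy; stri_trans_general() is vectorised over its input.
sw_fr_ascii <- stringi::stri_trans_general(sw_fr, "Latin-ASCII")

# Combine both variants and drop the entries that did not change.
sw_both <- unique(c(sw_fr, sw_fr_ascii))

toks_mixed <- tokens("elle etait ici et il était content")
res <- tokens_remove(toks_mixed, sw_both)
res  # only "content" should remain
```

Note that stripping accents can make distinct words collide (for instance "à" becomes "a"), so it is worth checking the combined list before applying it to a real corpus.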