Regex error: pattern exceeds limits on size or complexity


I have a dataframe of ~20,0000 observations, and I am focusing on one column that contains abstracts from scientific journals. I am trying to pull plant species names out of these abstracts, so I wrote this function to do so...

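library(stringr)  # provides str_match_all() and str_trim()

find.all.matches <- function(search.col, pat) {
  # One set of matches per element of search.col
  captured <- str_match_all(search.col, pattern = pat)
  t <- lapply(captured, str_trim)
  # Note: this strips every non-letter, including spaces inside matched names
  t2 <- lapply(t, function(x) gsub("[^a-z]", "", x))
  # lapply() rather than sapply(), so the result is always a list
  t3 <- lapply(t2, unique)
  t4 <- lapply(t3, toString)
  found.col <- unlist(t4)
  return(found.col)
}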

I also have a dataframe of all recognized plant species, which is 1496575 obs. of 1 variable.

I created a pattern for this dataframe...

WFO_list <- WFO_keywords_l
WFO_list[length(WFO_list)] <- paste0(WFO_list[length(WFO_list)],"[^a-z]")
WFO_list[1] <- paste0("[^a-z]",WFO_list[1])
WFO_pat <- paste(WFO_list,collapse="[^a-z]|[^a-z]")

I then ran this line to get the desired result...

WFO_capture <- find.all.matches(search.col = all_data$title_l,
                                pat = WFO_pat)

I received an error...

Error in stri_match_all_regex(string, pattern, omit_no_match = TRUE, opts_regex = opts(pattern)) :
Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG, context=`[^a-z]schoenoxiphium ecklonii var. ecklonii[^a-z]|[^a-z]cyperus violifolia[^a-z]|[^a-z]carex viridula var. viridula[^a-z]|[^a-z]mariscus phleoides[^a-z]|[^a-z]tetraria compar[^a-z]|[^a-z]fimbristylis schulzii[^a-z]|[^a-z]scirpus orbicephala[^a-z]|[^a-z]trichophorum bracteatum[^a-z]|[^a-z]scirpus uniflorum[^a-z]|[^a-z]blysmopsis exilis[^a-z]|[^a-z]carex arcatica f. taldycola

I have used this function before with much smaller datasets, so I think the sheer size of the species list is what trips it up. Is there any way to overcome this? Any help is greatly appreciated!
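
One direction that might avoid the size limit, as a minimal sketch and not something tested on data of this size: split the species list into chunks, build one smaller pattern per chunk, and combine the matches per abstract. The helper name `find_matches_chunked` and the default `chunk_size = 1000` are hypothetical, and the sketch swaps in base R (`gregexpr()` with `perl = TRUE`) for stringr. It assumes both the text and the species names are already lowercase. It also escapes regex metacharacters, so the "." in "var." matches a literal period, and it uses lookarounds instead of `[^a-z]` on each side, so names at the very start or end of a string can match too.

```r
# Hypothetical helper: match a long keyword list in chunks so that no single
# regex becomes too large. chunk_size = 1000 is an arbitrary starting value.
find_matches_chunked <- function(search_col, keywords, chunk_size = 1000) {
  # Escape regex metacharacters so each name is matched literally
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", keywords)
  # Split the escaped names into groups of at most chunk_size
  chunks <- split(escaped, ceiling(seq_along(escaped) / chunk_size))
  # One (initially empty) slot of matches per element of search_col
  found <- vector("list", length(search_col))
  for (chunk in chunks) {
    # A name only counts if it is not directly surrounded by other letters
    pat <- paste0("(?<![a-z])(?:", paste(chunk, collapse = "|"), ")(?![a-z])")
    hits <- regmatches(search_col, gregexpr(pat, search_col, perl = TRUE))
    # Append this chunk's matches to what earlier chunks found
    found <- Map(c, found, hits)
  }
  # Collapse the unique matches per element into one comma-separated string
  vapply(found, function(x) toString(unique(x)), character(1))
}
```

With the objects above, the call would look something like `find_matches_chunked(all_data$title_l, WFO_keywords_l)`. Matches come back grouped by chunk rather than in the order they appear in the text, and with ~1.5 million names this means many passes over the column, so it may be slow.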

For reference, here are the first rows of the species table:

> head(WFOspecies)
                          scientificName
1: Schoenoxiphium ecklonii var. ecklonii
2:                    Cyperus violifolia
3:          Carex viridula var. viridula
4:                    Mariscus phleoides
5:                       Tetraria compar
6:                 Fimbristylis schulzii