Using dplyr to filter rows which contain partial string of column

Question

12.2k views Asked by Karsten Sender At 15 September 2017 at 12:10

Assuming I have a data frame like

term        cnt
apple        10
apples        5
a apple on    3
blue pears    3
pears         1

How can I filter out every row whose term contains another term from the same column, so that the result is

term     cnt
apple     10
pears      1

I don't want to specify the terms to keep (apple|pears) by hand. The filter should be self-referencing: each term is checked against the whole column, and any term that contains another term as a partial match is removed. Neither the number of tokens nor the shape of the strings is restricted (e.g. "mapples" would be matched by "apple"). In effect this would be an inverted, generalized, dplyr-based version of

d[grep("^apple$|^pears$", d$term), ]

Additionally, it would be interesting to use this departialisation to get a cumulative sum per remaining term, e.g.

term     cnt
apple     18
pears      4

I couldn't get it to work with contains() or grep().

Thanks

There are 3 answers

Aramis7d On 15 September 2017 at 12:49

You can try the tidyverse, something like this:

1. Define the list of terms:

    library(tidyverse)

    # dft is assumed to hold the question's data frame (columns term and cnt)
    k <- dft %>%
      select(term) %>%
      unlist() %>%
      unique()

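2. Operate on the data:

    dft %>%
      # split each term into its first two words; any further words are dropped
      separate(term, c('t1', 't2'), extra = 'drop', fill = 'right') %>%
      rowwise() %>%
      # g is 1 when the first word is itself one of the terms in k
      mutate(g = sum(t1 %in% k)) %>%
      filter(g > 0) %>%
      select(t1, cnt)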

which gives:

      t1   cnt
   <chr> <int>
1  apple    10
2 apples     5
3  pears     1

This still doesn't distinguish apple from apples, though. I will keep trying on that.
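One possible way to close that gap, sketched here with base `grepl()` inside `dplyr::filter()`: keep a row only if its term contains no other distinct term from the column as a literal substring. The helper name `contains_other` is made up for illustration, and the data frame is rebuilt from the question so that the block runs on its own.

```r
library(dplyr)

# Data from the question, rebuilt here so the sketch is self-contained.
dft <- tibble(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1)
)

# Hypothetical helper (not part of dplyr): TRUE when `x` contains some
# other distinct element of `terms` as a literal (non-regex) substring.
contains_other <- function(x, terms) {
  others <- setdiff(terms, x)
  any(vapply(others, grepl, logical(1), x = x, fixed = TRUE))
}

# Keep only the terms that contain no other term.
roots <- dft %>%
  filter(!vapply(term, contains_other, logical(1),
                 terms = term, USE.NAMES = FALSE))

roots
# term  cnt
# apple  10
# pears   1
```

Because the match is a plain substring test, a term such as "mapples" would also be dropped in favour of "apple", which is what the question asks for.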

tushaR On 15 September 2017 at 14:24

Try this:

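library(dplyr)

df <- data.frame(term = c('apple', 'apples', 'a apple on', 'blue pears', 'pears'),
                 cnt = c(10, 5, 3, 3, 1), stringsAsFactors = FALSE)

# matches[i, j] is TRUE when term j occurs (as a regex) in term i
matches <- sapply(df$term, function(t, terms) grepl(pattern = t, x = terms), df$term)

# relabel with term t every row whose term contains t and at least one other term
sapply(1:ncol(matches), function(t, mat) {
  tempmat <- mat[, t] & mat[, -t]
  indices <- unlist(apply(tempmat, MARGIN = 2, which))
  df$term[indices] <<- df$term[t]
}, matches)

df %>% group_by(term) %>% summarize(cnt = sum(cnt))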

 # A tibble: 2 x 2
 #  term   cnt
 #  <chr> <dbl>
 #1 apple    18
 #2 pears     4
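The same totals can probably be reached without `<<-`. The sketch below assumes that the root of a term is the shortest term in the column that occurs inside it as a literal substring; `shortest_contained` is a hypothetical helper, not a dplyr function. If a term contained two equally short roots (say "apple pears"), this sketch would simply take the first one.

```r
library(dplyr)

df <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: among all `terms` that occur literally inside `s`
# (this always includes `s` itself), return the shortest one.
shortest_contained <- function(s, terms) {
  hits <- terms[vapply(terms, grepl, logical(1), x = s, fixed = TRUE)]
  hits[which.min(nchar(hits))]
}

# Map every term to its root, then sum the counts per root.
totals <- df %>%
  mutate(root = vapply(term, shortest_contained, character(1),
                       terms = term, USE.NAMES = FALSE)) %>%
  group_by(root) %>%
  summarise(cnt = sum(cnt))

totals
# root  cnt
# apple  18
# pears   4
```

The shortest contained term is always a root: if it contained yet another term, that term would be shorter and would also occur inside the original string.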

amrrs On 15 September 2017 at 12:56 (accepted answer)

Hopefully the complete answer. It is not very idiomatic (not "Pythonic", as Pythonistas would say), but someone may be able to suggest improvements:

> library(dplyr)
> ssss <- data.frame(c('apple','red apple','apples','pears','blue pears'),c(15,3,10,4,3))
> names(ssss) <- c('Fruit','Count')
> ssss
       Fruit Count
1      apple    15
2  red apple     3
3     apples    10
4      pears     4
5 blue pears     3

> # a root is a term that matches more than one entry (itself plus at least one other)
> root_list <- as.vector(ssss$Fruit[unlist(lapply(ssss$Fruit,function(x){length(grep(x,ssss$Fruit))>1}))])

> ssss %>% filter(Fruit %in% root_list)
  Fruit Count
1 apple    15
2 pears     4

> # one column per root: the root where it occurs in Fruit, '' otherwise
> data <- data.frame(lapply(root_list, function(x){y <- stringr::str_extract(ssss$Fruit,x); ifelse(is.na(y),'',y)}))
> cols <- colnames(data)
> #data$x <- do.call(paste0, c(data[cols]))
> #for (co in cols) data[co] <- NULL

> # paste the columns together so that every row carries its root
> ssss$Fruit <- do.call(paste0, c(data[cols]))
> ssss %>% group_by(Fruit) %>% summarise(val = sum(Count))
# A tibble: 2 x 2
  Fruit   val
  <chr> <dbl>
1 apple    28
2 pears     7
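One caveat worth keeping in mind: `grep(x, ssss$Fruit)` treats each term as a regular expression, so terms containing characters such as `.` can match more than intended. Passing `fixed = TRUE` to `grep()` or `grepl()` makes the match literal. The vector below is made-up illustration data, not taken from the question.

```r
# Made-up terms for illustration only.
terms <- c("a.c", "abc", "xyz")

# As a regex, "." matches any single character, so "a.c" also matches "abc".
regex_hits <- grepl("a.c", terms)
regex_hits
# TRUE TRUE FALSE

# With fixed = TRUE the pattern is matched literally.
literal_hits <- grepl("a.c", terms, fixed = TRUE)
literal_hits
# TRUE FALSE FALSE
```

For plain words like the fruit names above both variants give the same result, so this only matters once the terms may contain regex metacharacters.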
