Looking for a more efficient way to implement this R code if possible


I have the following R code that works, but it's quite slow. I want to create a new column based on the values of an existing column in an R dataframe. But there's a catch/complication: I need to access and change a global environment variable that holds a comparison value across observations. I do this by using apply() with a function on the rows of the dataframe; the function can read and write the external variable. This works but is slow. Is there any way to speed up the process?

The value of drug in the first observation is the starting BASE_VALUE. What I'm trying to do is flag the observations whose drug value is different from the BASE_VALUE and is not a substring of the BASE_VALUE. The drug value of such an observation then becomes the current BASE_VALUE and the process continues. In the example below, the second "applesauce" value, which is in the 3rd row of the dataset, should NOT be flagged, as it already appeared in the 1st row. This illustrates that I need to somehow store the value of the 1st row and be able to compare it with the 3rd row. That's why using the lagged value doesn't work and why I have the BASE_VALUE variable; indeed, I originally tried to use the lagged value until I realised this.

Reproducible code is below:

base_value <- ""
char_vector <- c("applesauce", "apple", "applesauce", "orange", "orange", "banana", "applepie") 

# change char_vector to a dataframe; the lastone column isn't completely necessary
library(dplyr)
df <- data.frame(drug = char_vector) %>% 
  mutate(lastone = lag(drug))

test_func <- function(row, output){
  if (is.na(row[2])){
    # this is the first observation - set its drug value as the BASE_VALUE
    base_value <<- row[1]
    return("Y")
  } else if (!is.na(row[2]) & row[1] != base_value & !grepl(row[1], base_value, fixed = TRUE)) {
    # the drug differs from BASE_VALUE and isn't a substring of it: flag it and reset BASE_VALUE
    base_value <<- row[1]
    return("Y")
  } else {
    return("N")
  }
}


switches <- apply(df, 1, test_func)
cbind(df, switches = switches)

I tried the above and it works, but I would like to speed it up.


There are 3 answers

Zé Loff (BEST ANSWER)

You can use accumulate from the purrr package to keep track of the base_value, and then check for changes in it:

library(purrr)
library(dplyr)
char_vector <- c("applesauce", "apple", "applesauce", "orange", "orange", "banana", "applepie")
    
new_base <- function(old_base, value) {
  if (grepl(value, old_base, fixed = TRUE)) {
    return(old_base)
  } else {
    return(value)
  }
}

tibble(X = char_vector) %>%
  mutate(base_value = accumulate(X, new_base),
         changed = ifelse(base_value != lag(base_value, default = ""),
                          "Y", "N"))
X          base_value   changed
applesauce applesauce   Y
apple      applesauce   N
applesauce applesauce   N
orange     orange       Y
orange     orange       N
banana     banana       Y
applepie   applepie     Y

With accumulate, the value returned by the function is fed into the next iteration as the first argument and is, well, accumulated into the final result, which is returned at the end, much like cumsum does.
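For instance, a minimal illustration of that behaviour (using the new_base() helper defined above):

library(purrr)

# accumulate() keeps every intermediate result, just like cumsum()
accumulate(1:5, `+`)
#> [1]  1  3  6 10 15
cumsum(1:5)
#> [1]  1  3  6 10 15

# with new_base(), the running value is the current base drug name
accumulate(c("applesauce", "apple", "orange"), new_base)
#> [1] "applesauce" "applesauce" "orange"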

Limey

[This answer was posted before OP edited their test data.]

You're making life too complicated. A judicious combination of lag and str_detect can give you what you want without looping.

Get some test data by modifying the last line of your example:

original <- cbind(df, switches = switches)

Solve the problem

library(tidyverse)

suggested <- df %>% 
  mutate(
    switches = ifelse(
      row_number() == 1, 
      "Y", 
      ifelse(
        str_detect(
          lag(drug), 
          drug
        ), 
        "N", 
        "Y"
      )
    )
  )
identical(suggested$switches, original$switches)
[1] TRUE

Your example isn't big enough to make benchmarking sensible, but this is likely to be quicker than looping. A base R solution is likely to be quicker still. (Though, IMHO, at the expense of readability.) You should test on real life data sets to determine which is likely to be best in production.

For benchmarking, consider microbenchmark.
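For instance, a rough sketch of what that comparison could look like, assuming char_vector, test_func and new_base from above are still in scope (the 10,000-fold repetition is only there to make the timings measurable):

library(microbenchmark)
library(dplyr)
library(purrr)

# a larger test set: the example vector repeated many times
big_df <- data.frame(drug = rep(char_vector, 10000)) %>%
  mutate(lastone = lag(drug))

microbenchmark(
  apply_global = apply(big_df, 1, test_func),
  accumulate   = big_df %>%
    mutate(base_value = accumulate(drug, new_base),
           switches = ifelse(base_value != lag(base_value, default = ""),
                             "Y", "N")),
  times = 10
)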

jay.sf

Perhaps we don't need a lagged vector at all. We can put the grepl logic in an ifelse call and feed it into an accumulating Reduce. When we then make a factor out of the result, we can exploit the underlying integer structure and calculate differences: where the base doesn't change we get FALSE, where it does we get TRUE. Adding 1 to that lets us neatly subset the vector c('N', 'Y').

> base <- function(x, y) ifelse(grepl(y, x, fixed=TRUE), x, y)
> char_vector |> 
+   as.data.frame() |> 
+   transform(
+     switches=
+       c('N', 'Y')[
+         1L + (c(1, diff(as.factor(Reduce(base, char_vector, accumulate=TRUE)))) != 0)]
+   )
  char_vector switches
1  applesauce        Y
2       apple        N
3  applesauce        N
4      orange        Y
5      orange        N
6      banana        Y
7    applepie        Y
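To see what each piece contributes, the pipeline can be unrolled step by step (a sketch using the base() helper defined above; as.integer() just makes the factor-to-code coercion explicit):

# running base value, accumulated left to right
bases <- Reduce(base, char_vector, accumulate = TRUE)
bases
#> [1] "applesauce" "applesauce" "applesauce" "orange" "orange" "banana" "applepie"

# integer codes of the factor; diff() is non-zero wherever the base changes
codes   <- as.integer(as.factor(bases))
changed <- c(1, diff(codes)) != 0   # the prepended 1 forces the first element to TRUE
changed
#> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE

# 1L + changed gives 1 for FALSE and 2 for TRUE, indexing into c("N", "Y")
c("N", "Y")[1L + changed]
#> [1] "Y" "N" "N" "Y" "N" "Y" "Y"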

Data:

> dput(char_vector)
c("applesauce", "apple", "applesauce", "orange", "orange", "banana", 
"applepie")