I am trying to filter by multiple variables for a series of conditions using vectors on a SparkR dataframe.
Doing this with tidyverse on a regular dataframe is straightforward. For example:
library("tidyverse")
library("magrittr")
# filter vectors
filter_var1 <- c('111', '222')
filter_var2 <- c('a', 'c')
#create df
df <- data.frame(var1 = c("111", "222", "333", "333"), var2 = c("a", "a", "b", "c"))
df
# filter
df_filtered <- df %>% dplyr::filter( (var1 %in% filter_var1) | (var2 %in% filter_var2))
df_filtered
Doing something similar in SparkR for a single-variable filter is a little more tedious but doable (from: sparkR example 1, sparkR example 2):
library(SparkR)
# Initialize a Spark session
sparkR.session(appName = "Create DataFrame Example")
# Create a data frame in SparkR
spark_df <- createDataFrame(data.frame(var1 = c("111", "222", "333", "333"), var2 = c("a", "a", "b", "c")))
# Show the data frame
showDF(spark_df)
spark_df_filtered <- spark_df %>% SparkR::filter(., (
paste("var1 in ('", paste(filter_var1, collapse = "','"), "')", sep = "")
))
But I am having difficulty combining conditions with AND (&&) / OR (||) in the same filter statement. I did find an example using SparkR::subset here, but I ran into issues when I tried to use vectors for the filtering.
You could have used the %in% S4 method to match values in a column, with | for OR and & for AND.
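As a minimal sketch of that approach, reusing the data and filter vectors from the question (the appName value and the names cond_sql and spark_df_filtered_sql are only illustrative): SparkR defines %in%, | and & for Column objects, so the condition can be written much like the dplyr version. As far as I know, the scalar operators && and || are not defined for Columns, which would explain the trouble with them. Calling SparkR::filter explicitly avoids picking up dplyr::filter when tidyverse is also attached.

```r
library(SparkR)

# Start (or reuse) a Spark session; the app name is arbitrary
sparkR.session(appName = "Filter With Vectors Example")

# Filter vectors, as in the question
filter_var1 <- c("111", "222")
filter_var2 <- c("a", "c")

# Same data as in the question
spark_df <- createDataFrame(data.frame(
  var1 = c("111", "222", "333", "333"),
  var2 = c("a", "a", "b", "c"),
  stringsAsFactors = FALSE
))

# Column-based condition: %in% on each column, combined with | (OR).
# Swap | for & to require both conditions instead.
spark_df_filtered <- SparkR::filter(
  spark_df,
  (spark_df$var1 %in% filter_var1) | (spark_df$var2 %in% filter_var2)
)
showDF(spark_df_filtered)

# Alternative: build one SQL string, extending the paste() approach
# from the question with OR between the two IN clauses.
cond_sql <- paste0(
  "var1 in ('", paste(filter_var1, collapse = "','"), "')",
  " OR ",
  "var2 in ('", paste(filter_var2, collapse = "','"), "')"
)
spark_df_filtered_sql <- SparkR::filter(spark_df, cond_sql)
showDF(spark_df_filtered_sql)
```

Both variants should keep the same rows as the dplyr example in the question: every row except ("333", "b"). The Column-based form is usually easier to extend, since the values never have to be quoted by hand.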