Data manipulation makes lapply not work

EDIT: OK, it has something to do with the type of data.train.filtered.

The filtered data is created from data.train.raw, which works fine with any of the lapply calls below. The weird thing is that I can't figure out how the two differ...
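One way to look for such a difference is to compare the class and dimensions of a raw and a filtered element side by side. The sketch below uses hypothetical toy objects (`raw.one` and `filtered.one` are not from the original code) and assumes, purely for illustration, that the filter received a single logical value instead of one value per column:

```r
# Hypothetical stand-ins for one raw and one filtered dataset.
raw.one <- data.frame(id = 1:3, a = c(0, 1, 2), tp = c(1, 0, 1))

# Assumption for illustration: a lone FALSE is used as the column selector,
# so no columns are kept at all.
filtered.one <- raw.one[, FALSE]

class(raw.one)        # "data.frame"
class(filtered.one)   # "data.frame" as well, so class alone hides the problem
dim(raw.one)          # 3 3
dim(filtered.one)     # 3 0
str(filtered.one)     # prints the structure of the filtered element
```

For a whole list, the same checks can be applied with `lapply(data.train.filtered, dim)` and compared against `lapply(data.train.raw, dim)`.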

data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)

data.train.filtered <- lapply(seq(1, 8), FUN = function(i) sf.filterFeatures(data.train.raw[[i]], data.selectedFeatures[[i]]))

st.testFeature <- function(featureVector, treshold) {
  if (!is.numeric(featureVector)) { return(TRUE) }

  numberOfNonZero <- sum(featureVector > 0)
  numberOfZero <- length(featureVector) - numberOfNonZero

  return(min(numberOfNonZero, numberOfZero) >= treshold)
}

sf.getGoodFeaturesVector <- function(data, treshold) {

  selectedFeatures <- sapply(data, FUN = st.testFeature, treshold = treshold)
  whitelistedFeatures <- names(data) %in% c("id", "tp")

  return(selectedFeatures | whitelistedFeatures)

}

sf.filterFeatures <- function(data, selectedFeatures) {
  return(data[, selectedFeatures])
}

Any idea what I am doing wrong when manipulating the data that causes the subsequent lapply to fail?

Original post:

I have a list of datasets called data.train.filtered and want to get a list of models (for predicting a feature called tp) trained on them with rpart. The easiest solution I could think of was lapply, but it doesn't work for some reason.

lapply(data.train.filtered, function(dta) rpart(tp ~ ., data = dta))

Error in terms.formula(formula, data = data) : 
  '.' in formula and no 'data' argument 

The problem is probably not in the data, since running it on any single dataset works fine:

rpart(tp ~ ., data = data.train.filtered[[1]])

Even though accessing a single dataset by index works fine (as shown above), using lapply over the indexes fails in the same way as the first example.

lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]])) 

Error in terms.formula(formula, data = data) : 
  '.' in formula and no 'data' argument 

The traceback for the index version is as follows:

10 terms.formula(formula, data = data) 
9 terms(formula, data = data) 
8 model.frame.default(formula = tp ~ ., data = data.train.filtered[[i]], 
    na.action = function (x) 
    {
        Terms <- attr(x, "terms") ... 
7 stats::model.frame(formula = tp ~ ., data = data.train.filtered[[i]], 
    na.action = function (x) 
    {
        Terms <- attr(x, "terms") ... 
6 eval(expr, envir, enclos) 
5 eval(expr, p) 
4 eval.parent(temp) 
3 rpart(tp ~ ., data = data.train.filtered[[i]]) 
2 FUN(X[[i]], ...) 
1 lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]])) 

I'm quite sure I'm missing something extremely trivial here, but being quite new to R, I just can't find the problem.

PS: I know that I could iterate through all the datasets with a for loop, but that feels really dirty and I'd prefer an idiomatic R solution.

There are 3 answers

Petrroll On BEST ANSWER

OK, I finally managed to find the answer. The problem was that data.train.filtered was actually not what I thought it was. I had an error in the filtering process which corrupted (silently, thanks R) everything.

The fix was to use:

data.selectedFeatures <- lapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)

instead of

data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
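A minimal sketch of why the two calls differ, using toy data rather than the original datasets (`toy.raw` and `keep.columns` are hypothetical names). When every call returns a vector of the same length, sapply simplifies the result to a matrix, so `[[i]]` picks out a single value instead of the per-column vector for dataset i; lapply always returns a list:

```r
# Toy stand-in for data.train.raw: three data frames with the same columns.
toy.raw <- list(
  data.frame(id = 1:4, a = c(0, 1, 0, 2), tp = c(1, 0, 1, 0)),
  data.frame(id = 1:4, a = c(3, 0, 0, 1), tp = c(0, 0, 1, 1)),
  data.frame(id = 1:4, a = c(0, 0, 5, 1), tp = c(1, 1, 0, 0))
)

# Hypothetical selector returning one logical per column,
# in the spirit of sf.getGoodFeaturesVector.
keep.columns <- function(df) names(df) %in% c("id", "tp")

selected.s <- sapply(toy.raw, keep.columns)  # simplified to a 3 x 3 logical matrix
selected.l <- lapply(toy.raw, keep.columns)  # stays a list of 3 logical vectors

is.matrix(selected.s)   # TRUE
selected.s[[1]]         # a single logical value, not a per-column vector
selected.l[[1]]         # TRUE FALSE TRUE, the full vector for the first data frame

# Indexing columns with the list element keeps only the intended columns.
names(toy.raw[[1]][, selected.l[[1]]])   # "id" "tp"
```

If the datasets had different numbers of columns, sapply would fall back to returning a list, which may be why the problem can go unnoticed. Passing `simplify = FALSE` to sapply also prevents the simplification.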

Thanks for all the other answers, though.

Peter Ellis On

The trick is to use lapply() on the original list, not on an index vector. For example:

# Toy data: a list of 10 data frames of varying length.
data.train.filtered <- list()
for (i in 1:10) {
  n <- rpois(1, 15)
  x <- rnorm(n)
  data.train.filtered[[i]] <- data.frame(x = x, tp = 3 + 2 * x + rnorm(n))
}

library(rpart)
lapply(data.train.filtered, function(dta) rpart(tp ~ ., data = dta))
Nate On

Using data(iris) and purrr::map:

library(rpart)
datas <- split(iris, rep(sample(c(1, 2, 3)), length.out = nrow(iris)))
models <- purrr::map(datas, ~ rpart(Species ~ ., data = .x)) # a better syntax