Grow a ffdf data frame on disk gradually


From documentation of save.ffdf:

Using ‘save.ffdf’ automagically sets the ‘finalizer’s of the ‘ff’ vectors to ‘"close"’. This means that the data will be preserved on disk when the object is removed or the R sessions is closed. Data can be deleted either using ‘delete’ or by removing the directory where the object were saved (‘dir’).

I want to start with a small ffdf data frame, add a little new data at a time, and grow it on disk. So I ran a small experiment:

# in R
library(ffbase)  # also attaches ff
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
rm(ffiris)

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

It turns out that removing ffiris does not automatically update the ff data on disk. What about saving it manually?

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
save.ffdf(ffiris, "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

Hmm, still no luck. Why?

What about removing the folder before saving?

# in R
ffiris = as.ffdf(iris)
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)

# in bash
ls ~/Desktop/iris/
# ls: /Users/ky/Desktop/iris/: No such file or directory

Even stranger. And even if all of this worked, it would still be terribly inefficient. I am looking for something like:

updateOnDisk(ffiris)

Could anyone help?

There is 1 answer

edwindj On BEST ANSWER

ff and ffbase offer out-of-memory R vectors, but they introduce reference semantics, which can cause problems with common R idioms.

R is a functional programming language, meaning that functions do not change their arguments but return modified copies. In ffbase we implement functions the R way, i.e. transform returns a copy of the original ffdf data frame. This can be seen by looking at the filenames:

ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")
filename(ffiris) # show contents of ~/Desktop/iris

ffiris = transform(ffiris, new1 = 99) # this creates a copy of the whole data frame!
filename(ffiris)  

ffiris$new2 <- ff(rep(99, nrow(iris)))  # this creates a new column, but not yet in the right directory
filename(ffiris)

save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE) # this fixes that.

transform is currently an inefficient way to add a new column, because it copies the whole data frame (that is R semantics). It does so because the result of transform might be temporary, and you don't want to change the original data.
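To avoid the full copy, the column-assignment route from the snippet above can be used on its own. The following is only a minimal sketch of that pattern, not a definitive recipe: the directory `db_dir` is a hypothetical temporary location chosen for illustration, and the column name `new2` is taken from the example above.

```r
library(ffbase)  # also attaches ff

# Hypothetical target directory, used only for this illustration.
db_dir <- file.path(tempdir(), "iris_ffdb")

ffiris <- as.ffdf(iris)

# `dir` follows `...` in save.ffdf's signature, so it is passed by name.
save.ffdf(ffiris, dir = db_dir, overwrite = TRUE)

# Add one column by assignment instead of transform(); the existing
# columns are not copied, only a new ff vector is created.
ffiris$new2 <- ff(rep(99, nrow(ffiris)))

# Save again so the new column's file is moved into db_dir as well.
save.ffdf(ffiris, dir = db_dir, overwrite = TRUE)

# Inspect the backing files of the columns and the directory contents.
print(filename(ffiris))
print(list.files(db_dir))
```

Repeating the last two steps (assign a column, then call `save.ffdf` with `overwrite = TRUE`) grows the data set on disk one column at a time; in a later session, `load.ffdf(db_dir)` should bring `ffiris` back.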

In ffbase2 we are fixing this issue.