R Bigmemory matrix Kmeans converting from data frame


I am new to k-means clustering in R. As a test, I clustered the contents of two files together, and the following code worked. However, the real files are much larger than these test files, and this code uses too much RAM, so I don't think the approach will scale to them.

file1 <- read.csv("//tmp//file1.txt", sep="\t", header=TRUE)
file1[is.na(file1)]<-0
file2 <- read.csv("//tmp//file2.txt", sep="\t", header=TRUE)
file2[is.na(file2)]<-0
file1_new <- cbind(file1, file_number = 1)
file2_new <- cbind(file2, file_number = 2)
total_input <- rbind(file1_new, file2_new)
myvars <- data.frame(col1 = total_input$col1, file_number = total_input$file_number)
myvars_k_means <- kmeans(myvars, 6)
myvars_k_clustered <- cbind(myvars$col1, myvars$file_number, myvars_k_means$cluster)

I came across the bigmemory and biganalytics packages and the bigkmeans function, but I am struggling to translate the code above to use a big.matrix. Here is what I have so far:

file1 <- read.big.matrix("//tmp//bigfile1.txt", sep="\t", header=FALSE)
file2 <- read.big.matrix("//tmp//bigfile2.txt", sep="\t", header=FALSE)
file1[is.finite(file1)] <-0
file1[is.finite(file2)] <-0
total_input <- list(file1, file2)
myvars <- cbind(total_input[,1], total_input[,2])
myvars_k_means <- bigkmeans(myvars,6)
myvars_k_clustered <- cbind(total_input[,1], total_input[,2], myvars_k_means$cluster)

Replacing NA with zero does not work, and if I skip that step, the NAs cause bigkmeans to produce a single cluster. The cbind calls and column additions do not work either. I suspect I am missing an easier way, but I could not work it out from the bigmemory documentation. Can you please help?
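
For reference, here is a minimal sketch of one way the steps above might be expressed with big.matrix objects. It is not a confirmed solution. Two things in the attempt stand out: `is.finite` selects the non-missing entries, so it is the opposite of the `is.na` mask used in the data-frame version, and a `list` cannot be indexed with `[, 1]`. The sketch replaces NAs one column at a time and copies both inputs into a preallocated big.matrix instead of using `cbind`/`rbind`. The small in-memory matrices are hypothetical stand-ins for the results of `read.big.matrix`, and the helper name `zero_na` is invented for illustration.

```r
library(bigmemory)     # big.matrix, as.big.matrix
library(biganalytics)  # bigkmeans

set.seed(1)

# Hypothetical stand-ins for the matrices returned by read.big.matrix();
# each has one data column containing a few NAs.
file1 <- as.big.matrix(matrix(c(rnorm(98, mean = 0), NA, NA), ncol = 1))
file2 <- as.big.matrix(matrix(c(rnorm(98, mean = 5), NA, NA), ncol = 1))

# Replace NAs with 0, pulling only one column at a time into RAM.
# A big.matrix is modified in place, so no copy is returned.
zero_na <- function(x) {
  for (j in seq_len(ncol(x))) {
    column <- x[, j]
    column[is.na(column)] <- 0
    x[, j] <- column
  }
  invisible(x)
}
zero_na(file1)
zero_na(file2)

# Preallocate the combined matrix: column 1 holds the data,
# column 2 holds the file number (1 or 2).
n1 <- nrow(file1)
n2 <- nrow(file2)
myvars <- big.matrix(nrow = n1 + n2, ncol = 2, type = "double")
myvars[1:n1, 1] <- file1[, 1]
myvars[1:n1, 2] <- rep(1, n1)
myvars[(n1 + 1):(n1 + n2), 1] <- file2[, 1]
myvars[(n1 + 1):(n1 + n2), 2] <- rep(2, n2)

# Cluster into 6 groups, as in the data-frame version.
myvars_k_means <- bigkmeans(myvars, centers = 6)

# One cluster label per row of myvars, in the same row order.
cluster_labels <- myvars_k_means$cluster
```

For files that do not fit in RAM, `read.big.matrix` accepts `backingfile` and `descriptorfile` arguments to create a file-backed matrix. Whether the column-at-a-time replacement is fast enough at that size is something to measure, not assume.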
