Smile Scala API: create a DataFrame from an Array


I am trying to integrate Smile into my Scala code base. In particular, I would like to train a random forest classifier. The FAQ says:

Most Smile algorithms take simple double[] as input. So you can use your favorite methods or library to import the data as long as the samples are in double arrays.

But this does not seem to hold for RandomForest: all of its fit methods appear to take a Formula and a DataFrame as input. In my case I have two Array[Array[Double]] values containing examples of two different classes; the first should be labelled 0 and the second 1. The first array has shape (n_samples_0, n_features) and the second has shape (n_samples_1, n_features).

To the best of my knowledge, the only way to train a Smile random forest on this data is to first convert the two arrays into a single Smile DataFrame with n_features + 1 columns (one per feature plus one for the label) and n_samples_0 + n_samples_1 rows. And then:

val formula: Formula = "class" ~
val rf = randomForest(formula, df)

Hence my question: is there a way to create a DataFrame from an array in the Scala API? I can only find ways to create a DataFrame by reading various file formats.

There is 1 answer

Best answer, by Damien Lancry

I solved my issue with the DataFrame.of method in Smile.

Here is a minimal example. X1 and X0 are lists of Array[Double] holding the features, and each inner array has 600 elements. X1 holds the features of the positive-class examples and X0 those of the negative-class examples.

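// Imports assumed for Smile 2.x; adjust to the version you use.
import scala.language.postfixOps   // allows the postfix form "class" ~
import smile.classification._      // randomForest
import smile.data.DataFrame
import smile.data.formula._        // Formula and the ~ DSL

val X1: List[Array[Double]] = ???  // positive-class features
val X0: List[Array[Double]] = ???  // negative-class features
// One single-element Int row per example, so the labels form an n x 1 integer column.
val y1 = X1.map(_ => Array(1))
val y0 = X0.map(_ => Array(0))
val X = (X1 ++ X0).toArray
val y = (y1 ++ y0).toArray
val dfX = DataFrame.of(X)           // feature columns, default names
val dfy = DataFrame.of(y, "class")  // label column named "class"
val df = dfX.merge(dfy)             // column-wise merge: features plus label
val formula: Formula = "class" ~
val rf = randomForest(formula, df)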
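
If your data starts out as two Array[Array[Double]] values, as in the question, the stacking and labelling step can be separated out and checked without Smile. The sketch below is plain Scala; LabelledData and combine are hypothetical names used for illustration and are not part of the Smile API.

```scala
// Hypothetical helper (not part of Smile): stacks two per-class feature
// matrices and builds the matching label column.
object LabelledData {
  // Returns (features, labels). Rows of x0 come first and are labelled 0,
  // rows of x1 follow and are labelled 1. Each label is wrapped in its own
  // single-element array so the result is an n x 1 integer matrix.
  def combine(
      x0: Array[Array[Double]],
      x1: Array[Array[Double]]
  ): (Array[Array[Double]], Array[Array[Int]]) = {
    val features = x0 ++ x1
    val labels =
      Array.fill(x0.length)(Array(0)) ++ Array.fill(x1.length)(Array(1))
    (features, labels)
  }
}
```

The two returned arrays should then be usable in place of X and y in the example above, that is, passed to DataFrame.of(X) and DataFrame.of(y, "class") before merging.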