Spark scaling the data for both features and the label

I have a problem in Spark (Scala). I've created a simple artificial dataset using the following rule:

y_3 = 2*x1 + 3*x2 + 0

So a sample data point would be:

(y_3, [x1, x2 ]) (4302.84233756448,[513.470030229239,1091.967425702])

Before passing the data to a linear regression, I scale it as follows:

    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))
    (scaler, data.map(x => LabeledPoint(x.label, scaler.transform(x.features))))

But after this scaling, my data looks something like this:

(y_3, [x1, x2 ]) (1350.80994484728,[-1.9520275434722287,-1.1671844333252521])

Now the coefficients are no longer [2, 3], and the intercept changes too, because the scaling is applied only to the features and not to y_3.

My question is: how can I scale both the features and the target variable?

I tried changing my scaling code to the following:

    val scalerFeatures = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))
    val scalerLabel = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => Vectors.dense(x.label)))
    (scalerFeatures, data.map(x => LabeledPoint(scalerLabel.transform(x.label), scalerFeatures.transform(x.features))))

But LabeledPoint doesn't accept "scalerLabel.transform(x.label)" as its label (it needs a Double).

So how can I do that?

Another question: when the model predicts the scaled target variable, how can I transform the prediction back to the actual value of the target variable?

Thanks in advance.

There is 1 answer

Dr VComas On

This is odd: what do you want to accomplish by scaling the target variable? You created x1 and x2 and then defined the dependent variable as y_3 = 2*x1 + 3*x2 + 0. If you apply any transformation to x1 and/or x2 (other than multiplying them by 1), that relationship no longer holds with the same coefficients. And you generally do not want to scale the target variable.

This might be more of a Cross Validated discussion, but suppose you have two features x_1, x_2 and a target variable y. The best fit from a linear regression will look like:

y = a*x_1 + b*x_2 + c

You can transform x_1 and x_2 (preferably with a linear transformation, such as standardization), and when you train a new linear regression without changing y, you get different a, b, c values. When you want to predict new cases, first apply the same transformation to x_1 and x_2, and then use the model for prediction.
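To make the change in coefficients concrete, here is a short derivation. It is only a sketch: the symbols \mu_i and \sigma_i (the mean and standard deviation of feature x_i) and z_i (the standardized feature) are notation introduced here, not taken from the question.

```latex
\begin{aligned}
z_i &= \frac{x_i - \mu_i}{\sigma_i}
  \quad\Longrightarrow\quad x_i = \sigma_i z_i + \mu_i, \qquad i = 1, 2 \\
y &= a x_1 + b x_2 + c \\
  &= (a\sigma_1)\, z_1 + (b\sigma_2)\, z_2 + (c + a\mu_1 + b\mu_2)
\end{aligned}
```

With a = 2, b = 3, c = 0 as in the question, a regression on the standardized features would be expected to recover coefficients 2\sigma_1 and 3\sigma_2 and an intercept of 2\mu_1 + 3\mu_2. The model is the same one, expressed in different units.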

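To answer the specific question of how to scale the label: you only need to change what you pass to transform(). It expects a Vector and you are giving it x.label, which is a Double. It also returns a Vector, so take element 0 of the result before building the LabeledPoint. Code along these lines should work:

    // Wrap the label in a one-element vector so transform() accepts it
    val scaleddata = data.map(x => (scalerLabel.transform(Vectors.dense(x.label)), scalerFeatures.transform(x.features)))
    // transform() returns a Vector, but LabeledPoint needs a Double label, so extract element 0
    val scaleddataLast = scaleddata.map(x => LabeledPoint(x._1(0), x._2))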
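On the second part of the question (getting a scaled prediction back to the original units), standardization can be undone by multiplying by the label's standard deviation and adding its mean. Below is a minimal plain-Scala sketch of that arithmetic with no Spark dependency. The object `LabelScaling` and its method names are hypothetical, and using the n - 1 denominator for the standard deviation is an assumption made for this illustration.

```scala
// Hypothetical helper, plain Scala only: standardize a label and undo it again.
object LabelScaling {
  // Mean and sample standard deviation (n - 1 denominator, an assumption here).
  def stats(ys: Seq[Double]): (Double, Double) = {
    val mean = ys.sum / ys.size
    val variance = ys.map(y => math.pow(y - mean, 2)).sum / (ys.size - 1)
    (mean, math.sqrt(variance))
  }

  // Forward transform: what a scaler with mean and std enabled does to one value.
  def scale(y: Double, mean: Double, std: Double): Double = (y - mean) / std

  // Inverse transform: map a scaled prediction back to the original units.
  def unscale(z: Double, mean: Double, std: Double): Double = z * std + mean
}
```

In the Spark code above, the mean and standard deviation would come from the fitted label scaler rather than being recomputed. The same two numbers must be used for scaling the training labels and for un-scaling the predictions.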