Regression using liblinear and Matlab


Here is my code:

    function testRegression()
    load carsmall
    x1 = Weight;
    x2 = Horsepower;    % Contains NaN data
    y = MPG;
    X = [ones(size(x1)) x1 x2 x1.*x2];
    X(isnan(X)) = 0;
    y(isnan(y)) = 0;

    for i = 2:size(X,2)
        X(:,i) = (X(:,i) - min(X(:,i))) / (max(X(:,i)) - min(X(:,i)));
    end
    y = (y - min(y)) / (max(y) - min(y));

    model = train(y,sparse(X),'s 0');
    [a,b,c] = predict(y, sparse(X), model);
    end

I always get 0 for the prediction. What is the problem with my code? When I don't normalize y I get some output, but when I normalize it the prediction is always 0.


There are 2 answers

Answer from rayryeng

You shouldn't be normalizing the output values. The point of normalizing is to do it only for the input features: it reduces their dynamic range, which makes the model easier to train. The output values need to stay the same, because those are the true values you are trying to predict. By normalizing the output values you are shrinking the dynamic range of the expected outputs, so that small variations in your input features have a disproportionately large effect on the predicted output.

tl;dr: You never normalize the expected output values.
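
For concreteness, here is a minimal sketch of that fix applied to the code in the question: only the input features are scaled, and y stays in its original MPG units. The '-s 11' flag (liblinear's L2-regularized L2-loss SVR solver) is my assumption, since the question is about regression; substitute whichever solver you actually intend to use.

    % Scale the input features only; leave the target y untouched.
    % '-s 11' (L2-regularized L2-loss SVR) is an assumed solver choice.
    for i = 2:size(X,2)
        X(:,i) = (X(:,i) - min(X(:,i))) / (max(X(:,i)) - min(X(:,i)));
    end
    model = train(y, sparse(X), '-s 11');   % y stays in original MPG units
    pred  = predict(y, sparse(X), model);   % predictions are on the MPG scale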

Answer from Chris

Here are some issues I see in your code:

1) By doing:

    X(isnan(X)) = 0;
    y(isnan(y)) = 0;

you are actually introducing bias into your model (subjective information that is not present in the given data). In short, NaN is not equal to 0 (0 is a number). I would rather remove the rows of X that contain at least one NaN value, along with the corresponding rows of y, as in the sketch below.
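
A minimal sketch of what I mean, dropping the offending observations instead of zero-filling them:

    % Drop every observation that contains a NaN instead of setting it to 0.
    bad = any(isnan(X), 2) | isnan(y);   % rows with at least one NaN
    X(bad, :) = [];
    y(bad)    = [];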

2) If you are building an SVR model rather than a linear one, then instead of:

    X = [ones(size(x1)) x1 x2 x1.*x2];

you can just use

    X = [x1 x2];

An SVR model includes a constant term by design, and interactions such as x1*x2 are captured well by standard kernels (e.g. RBF, polynomial); see the sketch below.
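
As a hedged sketch (assuming you switch to LIBSVM's MATLAB interface, since liblinear itself only fits linear models), epsilon-SVR with an RBF kernel could look like this:

    % Assumes LIBSVM's MATLAB interface is on the path and that
    % NaN rows have already been removed from x1, x2 and y.
    % '-s 3' selects epsilon-SVR, '-t 2' selects the RBF kernel.
    X = [x1 x2];                      % no intercept column, no manual x1.*x2 term
    model = svmtrain(y, X, '-s 3 -t 2');
    pred  = svmpredict(y, X, model);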

3) Scaling y the way you did is not used in practice. To my knowledge, the only case where scaling the output might help is when its possible values span several orders of magnitude, e.g. y in the range [0.1, 10^5]. In such cases you typically use log(y) instead; see the sketch below.
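
A minimal sketch of that log transform, assuming y is strictly positive and again assuming liblinear's '-s 11' SVR solver:

    % Train on log(y) when y spans several orders of magnitude,
    % then map the predictions back with exp(). Assumes y > 0.
    yLog  = log(y);
    model = train(yLog, sparse(X), '-s 11');
    pred  = predict(yLog, sparse(X), model);
    yPred = exp(pred);                % back on the original scale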

4) I would also be cautious with the scaling you applied to X. This kind of min-max scaling tends to "smooth out" any small variability in X as (max(X(:,i)) - min(X(:,i))) grows; see the illustration below.
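
A small illustration of that effect, using a made-up feature column with one outlier:

    % A single outlier inflates (max - min) and squeezes the remaining
    % values of the column into a narrow band near 0.
    x = [1 2 3 4 1000]';
    xScaled = (x - min(x)) / (max(x) - min(x));
    % xScaled is approximately [0 0.001 0.002 0.003 1], so the first
    % four points become nearly indistinguishable after scaling.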

Closing note: the nice thing about such problems, in my opinion, is that you can empirically evaluate any claim (like the ones I make above) yourself. One way to do so is to split your data, use one part for training and the rest for validation, and repeat over more than one split to get a better picture. Improvements such as the ones proposed above should show up in the error of your model on the validation set. The error on the training set is not very informative, because you might simply have over-fitted your data. A sketch of such a hold-out evaluation is given below.
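
A minimal sketch of that hold-out evaluation (the 80/20 split is an arbitrary choice, '-s 11' is again an assumed liblinear SVR solver, and X and y are assumed to be NaN-free already):

    % Random 80/20 hold-out split: train on one part, validate on the other.
    n        = size(X, 1);
    idx      = randperm(n);
    nTrain   = round(0.8 * n);
    trainIdx = idx(1:nTrain);
    valIdx   = idx(nTrain+1:end);

    model  = train(y(trainIdx), sparse(X(trainIdx, :)), '-s 11');
    pred   = predict(y(valIdx), sparse(X(valIdx, :)), model);
    valMSE = mean((pred - y(valIdx)).^2);   % error on data the model has not seen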