10-fold cross validation for polynomial regressions


I want to use 10-fold cross validation to test which polynomial form (first, second, or third order) gives the better fit. The plan is to divide my data set into 10 subsets, remove one subset, derive a regression model without it, predict the output values for the held-out subset with that model, and compute the residuals. Finally, I would repeat this for each subset and sum the squares of the resulting residuals. I have already written the following in Matlab 2013b, which samples the data and fits the regression on the training data. I am stuck on how to repeat this for every subset and how to compare which polynomial form gives the better fit.

% Sample the data
parm = [AT];
n = length(parm);
k = 10;                 % how many parts to use
allix = randperm(n);    % all data indices, randomly ordered
numineach = ceil(n/k);  % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
for p=1:k
testix = allix(p,:);            % indices to use for testing
testix(isnan(testix)) = [];     % remove NaNs if necessary
trainix = setdiff(1:n,testix);  % indices to use for training
%train = parm(trainix); %gives the training data
%test = parm(testix);  %gives the testing data
end 

% Derive regression on the training data 
Sal = Salinity(trainix);
Temp = Temperature(trainix);
At = parm(trainix);

xyz =[Sal Temp At];
% Fit a Polynomial Surface
% surffit holds the fitted equation; gof contains rsquare and rmse
[surffit, gof] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');

1 Answer

Kostya (accepted answer):

To run the regression for every subset, put the fit inside the loop and store the results from each fold, e.g.:

% Sample the data
parm = [AT];
n = length(parm);
k = 10;                 % how many parts to use
allix = randperm(n);    % all data indices, randomly ordered
numineach = ceil(n/k);  % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);

coeffAll = [];   % fitted coefficients, one row per fold
gofAll   = [];   % goodness-of-fit structs (contain rsquare and rmse)

for p=1:k
    testix = allix(p,:);            % indices to use for testing
    testix(isnan(testix)) = [];     % remove NaNs if necessary
    trainix = setdiff(1:n,testix);  % indices to use for training
    %train = parm(trainix); %gives the training data
    %test = parm(testix);  %gives the testing data

    % Derive regression on the training data 
    Sal = Salinity(trainix);
    Temp = Temperature(trainix);
    At = parm(trainix);

    xyz =[Sal Temp At];
    % Fit a polynomial surface; gof contains rsquare and rmse
    [surffit, gof] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');

    coeffAll = [coeffAll; coeffvalues(surffit)];   % coefficients of this fold
    gofAll   = [gofAll; gof];                      % goodness-of-fit of this fold
end 
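
With the goodness-of-fit structs collected in gofAll, the per-fold rmse values can then be pulled out in one line, for example (a small sketch assuming the loop above has run):

rmseAll = arrayfun(@(g) g.rmse, gofAll);   % rmse of each fold's training fit
mean(rmseAll)                              % average rmse across folds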

Regarding which polynomial form gives the better fit, you can probably repeat the same procedure for each form ('poly11', 'poly22', 'poly33') and pick the one with the lowest rmse.
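
As a minimal sketch of that comparison (assuming the Curve Fitting Toolbox is available and that Salinity, Temperature, and AT are column vectors of the same length), you could loop over the three model names, fit each one on the training subset, and accumulate the sum of squared residuals on each held-out subset, as described in the question, then pick the form with the smallest total:

% Compare polynomial forms by their cross-validated sum of squared residuals
models = {'poly11', 'poly22', 'poly33'};   % 1st-, 2nd-, 3rd-order surfaces
n = length(AT);
k = 10;
allix = randperm(n);
numineach = ceil(n/k);
allix = reshape([allix NaN(1,k*numineach-n)], k, numineach);

cvSSE = zeros(1, numel(models));           % cross-validated SSE per model

for m = 1:numel(models)
    for p = 1:k
        testix = allix(p,:);
        testix(isnan(testix)) = [];
        trainix = setdiff(1:n, testix);

        % Fit on the training subset
        surffit = fit([Salinity(trainix), Temperature(trainix)], ...
                      AT(trainix), models{m});

        % Predict the held-out subset and accumulate squared residuals
        pred = surffit(Salinity(testix), Temperature(testix));
        cvSSE(m) = cvSSE(m) + sum((AT(testix) - pred).^2);
    end
end

[~, best] = min(cvSSE);
fprintf('Best polynomial form: %s (CV SSE = %g)\n', models{best}, cvSSE(best));

Using the same fold assignment (allix) for all three forms keeps the comparison between them fair, and the held-out SSE measures predictive fit rather than training fit.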