I have two sets of data, one raw and one cleaned. The independent variable is damage, which can take any value between 0 and 100, inclusive. The dependent variable is a user rating, which can take any whole number value between 1 and 5, inclusive. Since the IV is continuous and the DV ordinal, I am using a logistic regression to predict future responses - specifically, I am using MATLAB's built-in mnrfit function, specifying the model type as ordinal.
This worked well for the raw data, but I am getting an error when attempting the regression on the cleaned data. Specifically,
In mnrfit>ordinalFit (line 349)
In mnrfit (line 206)
Warning: Matrix is singular, close to singular or badly scaled. Results may be inaccurate. RCOND = NaN.
Error using linsolve
Matrix must be positive definite.
Error in mnrfit (line 248)
bcov = linsolve(hess,eye(size(hess)),struct('SYM',true,'POSDEF',true));
There are not huge differences between the two datasets, simply a few more zeros in the cleaned responses. The condition number of the raw response matrix is about 57, and that of the cleaned response matrix about 58. The rank of both the raw response and the cleaned response matrix is 5, which is equal to the number of columns in each matrix. Therefore, I am not sure why the regression should work with the raw data but not the cleaned data.
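One thing the rank and condition number do not reveal is whether any row of the response matrix has a total of zero, i.e. a predictor value with no responses at all. The sketch below uses a small made-up count matrix (yToy is hypothetical, not my experimental data) to show that such a matrix can still have full column rank and a finite condition number. Whether zero-total rows are what actually triggers the error in mnrfit is an assumption I have not confirmed.

```matlab
% Toy count matrix (hypothetical data, not from the experiment):
% rows = unique predictor values, columns = rating categories 1..5
yToy = [2 0 0 0 0;
        1 1 0 0 0;
        0 2 3 0 0;
        0 0 1 4 0;
        0 0 0 2 5;
        0 0 0 0 0];   % last row: no responses at all

% Rank and condition number of the count matrix
r = rank(yToy);
c = cond(yToy);

% Per-row totals (sample size per predictor value) and per-column totals
rowTotals   = sum(yToy, 2);
emptyRows   = find(rowTotals == 0);      % predictor values with no responses
hasEmptyCat = any(sum(yToy, 1) == 0);    % true if a rating was never chosen

fprintf('rank = %d, cond = %.2f\n', r, c);
fprintf('rows with zero responses: %s\n', mat2str(emptyRows));
fprintf('any empty rating category: %d\n', hasEmptyCat);
```

The same two sums could be run on y and y_clean to compare them directly, for example find(sum(y_clean,2) == 0).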
Could someone suggest how to proceed with troubleshooting/debugging? Or why the regression might not be working? I've included the relevant code snippets and data samples at the end of this post.
Thank you in advance!
This is the code for the original logit model, based on the raw data:
% fit logit model to all data - responses as a function of mean CDF
% expect a positive trend - i.e. that responses get larger as mean CDF
% increases
% 1: predictor variable is mean CDF; response variable is the user
% damage rating - this is the way our experiment was designed
% get unique values of mean CDF
[x,ia,ic] = unique(crowddata(:,3));
y = zeros(length(x),5);   % preallocate response counts
for i = 1:length(x)
    y(i,1:5) = crowddata(ia(i),5:9);
    beq(i,1) = buckets(ia(i),3);
    buneq(i,1) = buckets(ia(i),4);
end
% get sample sizes (total number of responses) for each mean CDF
ssize = sum(y,2);
% fit an ordinal logistic regression model
[b,dev,stat] = mnrfit(x,y,'model','ordinal');
b2 = [b(1:4)';repmat(b(5:end),1,4)];
xx = (1:1:276)';
% get probability that a mean CDF is in a category, based on our
% logit model that includes all user responses
pihat = mnrval(b,xx,'model','ordinal','interactions','off');
Below is the code for the regression model based on the cleaned data.
% new regression model using only responses cleaned based on time
% 2: predictor variable is mean CDF; response variable is the user
% damage rating - this is the way our experiment was designed
% get unique values of mean CDF
[x_clean,ia,ic] = unique(crowddata_clean(:,3));
y_clean = zeros(length(x_clean),5);   % preallocate response counts
for i = 1:length(x_clean)
    y_clean(i,1:5) = crowddata_clean(ia(i),5:9);
    beq(i,1) = buckets(ia(i),3);
    buneq(i,1) = buckets(ia(i),4);
end
% get sample sizes (total number of responses) for each mean CDF
ssize_clean = sum(y_clean,2);
% fit an ordinal logistic regression (linear) model
[bclean,dev,stat] = mnrfit(x_clean,y_clean,'model','ordinal');
% use the coefficients of the cleaned model (bclean), not those of the raw model (b)
b3 = [bclean(1:4)';repmat(bclean(5:end),1,4)];
UPDATED: Both x and x_clean are 276 x 1 vectors with values ranging from ~0.3 to ~77.
Both y and y_clean are 276 x 5 matrices with condition numbers of about 14. Here are the first 11 rows of y:
2 0 0 0 0
2 0 0 0 0
1 1 0 0 0
10 1 0 0 0
2 1 0 0 0
8 1 1 0 0
0 1 0 0 0
4 1 0 0 0
5 0 0 0 0
2 2 0 1 0
0 2 3 0 0