The question asks: perform a linear discriminant analysis (LDA) on the training data to construct a classification rule (discriminant function) for tree type based on all available continuous measurement variables. Use cross-validation (CV=TRUE), and calculate the misclassification rate (MCR) for this model using a contingency table (i.e., the table function). Summarize the linear discriminant functions.
The starter script I was given is:
# A4: Classification based on Roosevelt Forest Trees dataset
library(MASS)
### change the following line to point to your CSV file:
filename<-"trees_sample.csv"
# read the data and pre-process, set TreeNumber as row name
trees=read.csv(filename,row.names = 1)
# or, if the file was read without row.names = 1:
# rownames(trees) <- trees$TreeNumber
# trees[,1] = NULL
# check dimension
dim(trees)
# scale the data (numeric variables only)
trees[,1:9]=scale(trees[,1:9])
#
# 1. Divide data into training (80%) and test (20%) by doing random sample without replacement
set.seed(10101)
# Now Selecting 80% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(trees), size = floor(.80*nrow(trees)), replace = F)
trees_train <- trees[sample, ]
trees_test <- trees[-sample, ] # rows not drawn into the training sample form the test set
# 2. Build LDA model on scaled training data
# first use all numeric predictors (i.e. not the factor Area)
# test accuracy via the misclassification rate (MCR)
# chi-sq test for overall significance of predicted classes
# use MANOVA to get Wilks test result:
# and summary.aov() to get individual contributions ?
# fit the model again without CV for prediction later
# summarise model
# 3. Model specification and testing:
#
# determine which LD components are important using barplot
#
# Prediction of test data:
# apply full model to test data and get MCR:
# Clustering: find out how many distinct tree types we really have...
#
# tree diagram (work on a random sample of n=1000 to speed things up):
sam=sample(seq(1,80000,1),size=1000)
hc = hclust(dist(trees_train[sam,1:10]))
hcd=as.dendrogram(hc)
plot(hcd)
# very simple dendrogram, cut at h=10
plot(cut(hcd, h = 10)$upper, main = "Upper tree of cut at h=10")
# use EH Ch 9 method for determining how many clusters based on iterative within groups sum of squares
#
# k-means fit with k = ?
# Centroid Plot against 1st 2 discriminant functions (explain 95%+ variations)
library(fpc)
plotcluster(....)
And the code additions I made are:
lda_model <- lda(Type ~ ., data = trees_train, CV = TRUE)
# Test accuracy via the misclassification rate (MCR)
lda_train_predicted <- predict(lda_model)$class
conf_matrix_train <- table(Actual = trees_train$Type, Predicted = lda_train_predicted)
mcr_train <- 1 - sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
The question says to use CV = TRUE; however, when I do that, lda_model comes back as a plain list rather than an lda object. The predict function expects an lda object, so passing lda_model to it gives the following error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
How can I solve this?
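One possible way around the error, as a minimal sketch: with CV = TRUE, lda() in MASS returns a list whose class component already holds the leave-one-out predicted classes (posterior holds the posterior probabilities), so the table can be built from that component without calling predict(). The sketch below uses the built-in iris data as a stand-in for trees_train, which is an assumption made only so the example runs on its own; the names iris_train, lda_cv, and lda_fit are illustrative.

```r
library(MASS)

# Stand-in data (assumption): built-in iris replaces the trees data
set.seed(10101)
train_idx  <- sample.int(nrow(iris), size = floor(0.8 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# With CV = TRUE, lda() returns a plain list, not an object of class "lda"
lda_cv <- lda(Species ~ ., data = iris_train, CV = TRUE)

# The leave-one-out predictions are already stored in $class,
# so no predict() call is needed for the cross-validated MCR
conf_matrix_cv <- table(Actual = iris_train$Species, Predicted = lda_cv$class)
mcr_cv <- 1 - sum(diag(conf_matrix_cv)) / sum(conf_matrix_cv)

# Refit without CV to get a real "lda" object for summaries and prediction
lda_fit <- lda(Species ~ ., data = iris_train)
lda_fit$scaling   # coefficients of the linear discriminant functions

# predict() works on the refitted object, e.g. for the held-out test rows
test_pred <- predict(lda_fit, newdata = iris_test)$class
mcr_test  <- mean(test_pred != iris_test$Species)
```

For the trees data the same pattern would presumably apply with Type as the response and trees_train in place of iris_train. Note that a formula of the form Type ~ . would also pull in the factor Area if that column is present, whereas the starter script asks for numeric predictors only.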