Kaggle Titanic: Machine Learning From Disaster Decision Tree for Cabin Prediction


One of the variables, 'Cabin', has a large number of NAs. I am trying to use a decision tree (rpart) to predict the cabin deck of passengers whose Cabin is missing.

Currently, this is the structure of my data table, which is an rbind of the training and test sets.

 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
 $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
 $ FamilyID   : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
 $ FamilyID2  : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
 $ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
 $ Surname    : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
 $ Cabin2     : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...

Please note that I used strsplit to create 'Cabin2', which holds the leading letter of the 'Cabin' variable; to my understanding, that letter corresponds to the deck on the Titanic. This reduced the number of levels I was dealing with from 187 for 'Cabin' to 8 for 'Cabin2'.
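For reference, the deck extraction can be sketched as follows. This is a minimal illustration, not the exact code used above: it takes the first character with substr instead of strsplit, and the sample cabin values are made up for the example. It also assumes that empty strings should count as missing, since 'Cabin' has a "" level.

```r
# Hypothetical sample values; the real column would be combi$Cabin.
cabin <- c(NA, "C85", NA, "C123", "E46", "", "B57 B59 B63 B66")

# Treat empty strings as missing so "" does not become a deck level.
cabin[!is.na(cabin) & cabin == ""] <- NA

# The first character of the cabin string is the deck letter.
# substr() keeps NA as NA, and factor() does not create a level for NA.
deck <- factor(substr(cabin, 1, 1))

print(deck)
```

A passenger with several cabins listed (such as "B57 B59 B63 B66") gets the deck of the first one under this approach.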

I am trying to use the following code to predict the cabin deck:

cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,

combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit, combi[is.na(combi$Cabin2), ])

R produces the following output:

 Warning messages:
 1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  invalid factor level, NA generated
 2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  number of items to replace is not a multiple of replacement length

I am trying hard to make sense of this as I keep working with these data, but I cannot see why this bit of code doesn't do the trick for me.
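In general, these two warnings mean that the values being assigned are not among the factor's levels and that the replacement does not have one value per missing entry. One possible explanation, which cannot be confirmed because the rpart call above is cut off, is the prediction type: for a classification tree, predict on an rpart object returns a matrix of class probabilities by default, and type = "class" is needed to get one label per row. Below is a minimal sketch of that pattern on a made-up data frame. The names toy, Deck and known, and the rule that generates the decks, are assumptions for illustration only and are not the Titanic data.

```r
library(rpart)

# Toy stand-in for combi; all columns and values here are made up.
set.seed(1)
n <- 200
toy <- data.frame(
  Fare = round(runif(n, 5, 100), 2),
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Made-up rule so the tree has something to learn: deck depends on fare.
toy$Deck <- cut(toy$Fare, breaks = c(0, 30, 60, Inf), labels = c("E", "C", "A"))

# Hide 50 labels to mimic passengers with a missing cabin.
toy$Deck[sample(n, 50)] <- NA
known <- !is.na(toy$Deck)

# Fit a classification tree only on the rows whose deck is known.
fit <- rpart(Deck ~ Fare + Sex, data = toy[known, ], method = "class")

# type = "class" returns a factor with one predicted label per row,
# rather than a matrix with one probability column per class.
pred <- predict(fit, newdata = toy[!known, ], type = "class")

# Assign labels (not integer codes) so they match the existing levels.
toy$Deck[!known] <- as.character(pred)
```

After this runs, the replacement has exactly one value per missing row and every value is an existing level, so neither warning should appear. Whether this is what goes wrong in the original code depends on the arguments that are not shown in the rpart call.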
