R Error : some group is too small for 'qda'


I used MASS::qda() to fit a classifier for my data, and it always reported

some group is too small for 'qda'

Is it due to the size of the test data I used for the model? I increased the test sample size from 30 to 100, and it reported the same error. Help!

set.seed(1345)
AllMono <- AllData[AllData$type == "monocot",]
MonoSample <- sample (1:nrow(AllMono), size = 100, replace = F)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot",]
EudiSample <- sample (1:nrow(AllEudi), size = 100, replace = F)
testData <- rbind (AllMono[MonoSample,],AllEudi[EudiSample,])
plot (testData$mono_score, testData$eudi_score, col = as.numeric(testData$type), xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda (type~mono_score+eudi_score, data = testData)

Here is my data example

> head (testData)
                                sequence mono_score eudi_score    type
PhHe_4822_404_76         DTRPTAPGHSPGAGH    51.4930   39.55000 monocot
SoBi_10_265860_58        QTESTTPGHSPSIGH    33.1408    2.23333 monocot
EuGr_5_187924_158        AFRPTSPGHSPGAGH    27.0000   54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH    20.6901   50.21670 eudicot
PoTr_Chr07_112594_90     DFRPTAPGHSPGVGH    43.8732   56.66670 eudicot
OrSa.JA_3_261556_75      GVRPTNPGHSPGIGH    55.0986   45.08330 monocot
PaVi_contig16368_21_57   QTDSTTPGHSPSIGH    25.8169    2.50000 monocot

>testData$type <- as.factor (testData$type)

> dim (testData)
[1] 200   4

> levels (testData$type)
[1] "eudicot" "monocot" "other" 

> table (testData$type)
eudicot monocot   other 
    100     100       0

> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils

My R version is R 3.0.2.


There are 3 answers

Ben Bolker (BEST ANSWER)

tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.

Here's a way to make up a data set that looks like yours:

set.seed(101)
mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
                 mono_score=runif(100,0,100),
                 dicot_score=runif(100,0,100),
                 stringsAsFactors=TRUE)  ## needed in R >= 4.0 so that 'type' is a factor

Some useful diagnostics:

str(mytest)
## 'data.frame':    200 obs. of  3 variables:
##  $ type       : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mono_score : num  37.22 4.38 70.97 65.77 24.99 ...
##  $ dicot_score: num  12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
##       type       mono_score      dicot_score     
##  dicot  :100   Min.   : 1.019   Min.   : 0.8594  
##  monocot:100   1st Qu.:24.741   1st Qu.:26.7358  
##                Median :57.578   Median :50.6275  
##                Mean   :52.502   Mean   :52.2376  
##                3rd Qu.:77.783   3rd Qu.:78.2199  
##                Max.   :99.341   Max.   :99.9288  
## 
with(mytest,table(type))
## type
##   dicot monocot 
##    100     100 

Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third test (table()) is actually the important one in this case, since the problem was a spurious extra level; the droplevels() function should take care of this problem ...

This made-up example seems to work fine, so there must be something you're not showing us about your data set ...

library(MASS)
qda(type~mono_score+dicot_score,data=mytest)

Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them which would then make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...

bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) : 
##    some group is too small for 'qda'
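
If the predictors did turn out to be factors, checking the column classes and converting back to numeric could look roughly like this. It is a minimal sketch on made-up data similar to the example above; the names `bad` and `fixed` are only illustrative.

```r
library(MASS)

set.seed(101)
mytest <- data.frame(type = factor(rep(c("monocot", "dicot"), each = 100)),
                     mono_score = runif(200, 0, 100),
                     dicot_score = runif(200, 0, 100))

# Simulate the accident: a numeric column stored as a factor
bad <- transform(mytest, mono_score = factor(mono_score))

# Spot the problem: list the class of every column
sapply(bad, class)

# Convert via character; calling as.numeric() directly on a factor
# would return the internal integer codes, not the original scores
fixed <- transform(bad, mono_score = as.numeric(as.character(mono_score)))

fit <- qda(type ~ mono_score + dicot_score, data = fixed)
```

Going through `as.character()` first is the important step; skipping it silently produces the wrong numbers rather than an error.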

cmoreno

Your grouping variable has 3 levels, including 'other', which has no cases. qda() needs more cases in every group than there are predictor variables (2 here: mono_score and eudi_score). The group sizes are 100, 100 and 0 for eudicot, monocot and other respectively, so the analysis cannot be performed. One way to get rid of unnecessary group levels is to redefine the grouping variable as a factor after converting it to character:

testData$type <- as.factor(as.character(testData$type))

Another alternative is to define the levels of the grouping variable explicitly:

testData$type <- factor(testData$type, levels = c("eudicot", "monocot"))

If your dataset were unbalanced and had, for example, only 2 cases of 'other', it would probably make sense to exclude them from the analysis.

This message can still appear whenever some group has too few cases relative to the number of predictor variables. Since you have 100 cases in both remaining groups (eudicot and monocot) and only two predictors (mono_score and eudi_score), this should no longer be a problem.
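
To make the empty-level case concrete, here is a small sketch with simulated data (the data frame `d` is made up for illustration and is not the asker's data). It reproduces the error with an unused "other" level, compares group counts with the number of predictors, and then drops the unused level. The `p + 1` threshold reflects my reading of the check inside `qda.default`, so treat it as an assumption.

```r
library(MASS)

set.seed(1)
d <- data.frame(
  type = factor(rep(c("eudicot", "monocot"), each = 100),
                levels = c("eudicot", "monocot", "other")),
  mono_score = runif(200, 0, 100),
  eudi_score = runif(200, 0, 100)
)

table(d$type)  # "other" is a level with zero rows

# qda() fails because of the empty group; keep the message instead of stopping
res <- tryCatch(qda(type ~ mono_score + eudi_score, data = d),
                error = function(e) conditionMessage(e))

# Compare each group's size with the number of predictors
p <- 2
table(d$type) < p + 1  # TRUE flags a group that is too small

# Drop unused levels, then refit
d$type <- droplevels(d$type)
fit <- qda(type ~ mono_score + eudi_score, data = d)
```

Running `table()` on the grouping variable before fitting is a cheap way to catch this before `qda()` does.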

Hielke Walinga

I had this error as well, so I'll explain what went wrong on my side for anyone stumbling upon this in the future.

The variable you want to predict might be a factor. Every level of this factor must have enough observations. If some group doesn't have enough observations, you will get this error.

In my case, I had removed all rows for one level, but the level itself was still present in the factor.

To remove it, re-create the factor:

df$var %<>% factor

NB. %<>% requires magrittr

However, even after I did this, it still failed. When I debugged further, it turned out that a subset of a data frame keeps the factor levels of the original, so you have to re-create the factor (or drop unused levels) again after subsetting.
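
This is standard behaviour for R factors: subsetting keeps every original level, whether or not any rows remain. A base-R sketch with a toy data frame (all names are illustrative) showing what happens and two ways to clean up without magrittr:

```r
df <- data.frame(var = factor(c("a", "a", "b", "c")),
                 x = 1:4)

sub <- df[df$var != "c", ]
levels(sub$var)   # still includes "c"
table(sub$var)    # "c" has a count of 0

# Option 1: re-create the factor (same effect as sub$var %<>% factor)
sub$var <- factor(sub$var)

# Option 2: drop unused levels from every factor column at once
sub2 <- droplevels(df[df$var != "c", ])

levels(sub$var)
levels(sub2$var)
```

Either option has to be applied after the subset, since the subset itself is what reintroduces the empty level.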