Counting majority vote in R


I have a dataframe that contains the predictions for each observation by 7 machine learning algorithms. I want to choose the prediction that occurs most frequently. How do I tell R to choose the factor variable that occurs most frequently in each row?

For example:

  A, A, A, B, B
  B, B, C, C, C

In the first row, I want R to choose A, and in the second row I want R to choose C. There are only 3 levels of factors: A, B and C. How do I go about asking R to find the majority vote?

Ken Benoit On

Modal winner

What you want is to determine the modal (most frequently predicted) outcome for each observation. R's built-in mode() function does not compute the statistical mode, i.e. the most frequently occurring value; it returns the "storage mode" of an object instead.
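To illustrate the point about mode(), here is a small sketch using made-up vote vectors (the names votes and votesFactor are only for illustration). It shows that the built-in function describes how the values are stored, not which value is most common:

```r
# Hypothetical vote vectors, not taken from the question's data
votes <- c("A", "A", "B")
votesFactor <- factor(votes)

# mode() reports the storage mode, not the most frequent value
storageOfCharacter <- mode(votes)        # "character"
storageOfFactor    <- mode(votesFactor)  # "numeric", since factors are stored as integer codes
```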

Here is a simple function, written in deliberately explicit steps for clarity. Calling apply() with the Mode() function linked in the comments also works:

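# each column holds the predictions made for one observation
results <- data.frame(model1 = c("A", "A", "A", "B", "B"),
                      model2 = c("B", "B", "C", "C", "C"))
chooseBestModel <- function(x) {
    # count how often each label occurs
    tabulatedOutcomes <- table(x)
    # put the most frequent label first
    sortedOutcomes <- sort(tabulatedOutcomes, decreasing=TRUE)
    # keep only the name of that first label
    mostCommonLabel <- names(sortedOutcomes)[1]
    mostCommonLabel
}
# MARGIN = 2 applies the function to every column
apply(results, 2, chooseBestModel)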
## model1 model2 
##    "A"    "C" 
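If your predictions are laid out as in the question, with one observation per row and one algorithm per column, the same idea should work with MARGIN = 1. The following is only a sketch under that assumption: the data frame preds, its column names alg1 to alg5, and the helper rowWinner are made up for illustration, and which.max() is swapped in for the sort() step above (it picks the first label with the highest count).

```r
# Hypothetical layout: one row per observation, one column per algorithm
preds <- data.frame(alg1 = c("A", "B"),
                    alg2 = c("A", "B"),
                    alg3 = c("A", "C"),
                    alg4 = c("B", "C"),
                    alg5 = c("B", "C"))

rowWinner <- function(x) {
    counts <- table(x)                 # tally the labels in this row
    names(counts)[which.max(counts)]   # first label with the highest count
}

# MARGIN = 1 applies the function to every row;
# apply() converts the data frame (factor or character columns) to a character matrix first
winners <- apply(preds, 1, rowWinner)
# the first row yields "A" and the second row yields "C"
```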

Data Structure

Note that I have stored each observation's set of predictions as a variable (a column), in "wide" format, since a data.frame is supposed to record variables in columns, not rows. An alternative would be a "long" data.frame of just two columns, where one is the observation number and the other is the predicted label, but this would require a different treatment than the solution above.
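For the two-column alternative, one possible treatment is to group by observation with tapply(). This is a rough sketch, assuming a long data.frame whose column names (observation, prediction) are invented here:

```r
# Hypothetical "long" layout: one row per (observation, prediction) pair
long <- data.frame(observation = rep(c(1, 2), each = 5),
                   prediction  = c("A", "A", "A", "B", "B",
                                   "B", "B", "C", "C", "C"))

# Split the predictions by observation and pick the most frequent label in each group
longWinners <- tapply(long$prediction, long$observation, function(x) {
    counts <- table(x)
    names(counts)[which.max(counts)]
})
# observation 1 yields "A" and observation 2 yields "C"
```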

Plurality v. Majority, and Ties

Note that this is not a majority outcome, but a plurality outcome. If you had A, B, C, C, D, E then C is the plurality outcome but not the majority outcome (because it occurs only 1/3 of the time, and majority implies > 1/2). I used plurality because your question suggested you wanted some sort of winner to be declared for each observation. It also means that ties are decided rather arbitrarily, based on the ordering of the labels. For instance, A is declared the winner of the three-three tie in model1 below:

results2 <- data.frame(model1 = c("A", "A", "A", "B", "B", "B"),
                       model2 = c("A", "B", "C", "C", "D", "E"))
apply(results2, 2, chooseBestModel)
## model1 model2 
##    "A"    "C"
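If you would rather see ties and true majorities explicitly instead of having them resolved silently, something like the following might help. It is a sketch only: voteSummary is a hypothetical helper, and it assumes the votes contain no missing values (table() drops NA by default while length() counts them).

```r
voteSummary <- function(x) {
    counts <- table(x)
    top <- max(counts)
    list(
        # every label that reaches the top count, so ties show up as length > 1
        winners  = names(counts)[as.vector(counts) == top],
        # TRUE only if the top label has more than half of all votes
        majority = top > length(x) / 2
    )
}

tied      <- voteSummary(c("A", "A", "A", "B", "B", "B"))  # winners "A" and "B", no majority
plurality <- voteSummary(c("A", "B", "C", "C", "D", "E"))  # winner "C", no majority
clear     <- voteSummary(c("A", "A", "A", "B", "B"))       # winner "A", with a majority
```

You could then decide for yourself how to handle rows where length(winners) is greater than 1, for example by falling back on the algorithm you trust most.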