Replace NA values in a full dataset using R


I am working on a dataset with a few missing values marked as "?". I have to replace them with the most common value (the mode) of their column, and I want to write code that does this for the whole dataset at once.

This is what I have so far:

df <- read.csv("mushroom.txt", na.strings = "?",header=FALSE)

Now I am trying to replace all the NA values in the data frame with the mode of their column. Please help.


There are 5 answers

0
cr1msonB1ade On BEST ANSWER

Since you want to replace by the mode of each column, you want to operate column-wise via apply and use is.na to identify the entries that you want to replace.

apply(df, 2, function(x){ 
    x[is.na(x)] <- names(which.max(table(x)))
    return(x) })

Note that apply coerces the data frame to a matrix and returns a matrix, so if you want a data.frame you need to convert the result with as.data.frame.
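If you would rather keep a data.frame throughout, one possible sketch is to loop over the columns with lapply instead of apply. The toy data frame and the helper name fill_mode below are made up for illustration and are not from the question. Because names() returns a character string, a numeric column filled this way would become character.

```r
# Toy data standing in for the mushroom file (hypothetical values)
df <- data.frame(
  V1 = c("a", "b", NA, "a"),
  V2 = c("x", NA, "y", "y"),
  stringsAsFactors = FALSE
)

# Replace the NAs in one column with that column's most frequent value
fill_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))  # table() drops NA by default
  x
}

# Assigning into df[] keeps the data.frame structure instead of returning a matrix
df[] <- lapply(df, fill_mode)
```

After this, df is still a data.frame and no longer contains NA.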

0
Frank P. On
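replaceQuestions <- function(vector) {
  # Most frequent value, ignoring the "?" placeholder and NA
  mostCommon <- names(sort(table(vector, exclude = c("?", NA)), decreasing = TRUE))[1]
  # Matches "?" if the file was read as-is, and NA if it was read with na.strings = "?"
  vector[is.na(vector) | vector == "?"] <- mostCommon
  vector
}

df <- apply(df, 2, replaceQuestions)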

Your example isn't reproducible, so I'm not sure this is what you were looking for, but it solves the problem as I've interpreted it.

0
PavoDive On

As you have it in your question, na.strings = "?" turns every "?" into NA while reading the csv. If you instead read the file without na.strings, so that the "?" strings stay in the data, I think this could help:

apply(df,2,function(x) gsub("\\?",names(sort(-table(x,exclude="?")))[1],x))

The exclude part avoids selecting "?" itself, should it be the most frequent value. The \\ escapes the ?, which is a special character in the regular expression passed to gsub.

====== EDIT TO ADD ======

gsub converts everything to text, so if your columns are numeric you'll need to convert them back to numeric again:

a<-apply(df,2,function(x) gsub("\\?",names(sort(-table(x,exclude="?")))[1],x))
new_df<-as.data.frame(apply(a,2,as.numeric))

The last line produces a new data frame.
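Note that as.numeric turns any non-numeric text into NA, which would wipe out columns of letters. As an alternative sketch, type.convert picks a type per column; the toy matrix a below is made up for illustration and stands in for the result of apply.

```r
# Character matrix such as the one apply() returns (toy values)
a <- cbind(V1 = c("1", "2", "2"), V2 = c("x", "y", "y"))

new_df <- as.data.frame(a, stringsAsFactors = FALSE)

# Convert each column to the simplest type that fits; as.is = TRUE keeps
# non-numeric columns as character instead of turning them into factors
new_df[] <- lapply(new_df, type.convert, as.is = TRUE)
```

Here V1 ends up as an integer column and V2 stays character.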

0
Pierre L On

Or:

apply(df, 2, function(x) {
  x[is.na(x)] <- Mode(x[complete.cases(x)])
  x})

This uses the well-known Mode function from the SO question "Is there a built-in function for finding the mode?":

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
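
A short sketch of how this Mode function behaves on toy vectors (the inputs are made up for illustration). It treats NA as a value like any other, which is why the NAs are dropped before calling it above, and on a tie it returns the value that appears first.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(c(3, 1, 1, NA))          # 1 is the most frequent value
Mode(c("b", "a", "b", "a"))   # tie: "b" wins because it appears first
Mode(c(NA, NA, 1))            # NA counts as a value, so the result is NA
```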
0
Ajay Ohri On

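Use a loop over the columns:

# Mode() is the statistical-mode helper from Pierre L's answer;
# base R's mode() returns the storage type, not the most common value
for (i in seq_len(ncol(dataframename))) {
  col <- dataframename[[i]]
  col[is.na(col)] <- Mode(col[!is.na(col)])
  dataframename[[i]] <- col
}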