Specifying colClasses in read.table using the class function


Is there a way to use read.table() to read all or part of the file in, use the class function to get the column types, modify the column types, and then re-read the file?

Basically, I have columns of zero-padded integers that I'd like to treat as strings. If I let read.table() do its thing, it of course assumes these are numbers, strips off the leading zeros, and makes the column type integer. The thing is, I have a fair number of columns, so while I could create a character vector specifying each one, I only want to change a couple from R's best guess. What I'd like to do is read the first few lines:

myTable <- read.table("//myFile.txt", sep="\t", quote="\"", header=TRUE, stringsAsFactors=FALSE, nrows = 5)

Then get the column classes:

colTypes <- sapply(myTable, class)

Change a couple of column types, e.g.:

colTypes[1] <- "character"

And then re-read the file in using the modified column types:

myTable <- read.table("//myFile.txt", sep="\t", quote="\"", colClasses=colTypes, header=TRUE, stringsAsFactors=FALSE, nrows = 5)

While this seems like an infinitely reasonable thing to do, and colTypes = c("character") works fine, when I actually try it I get this error:

scan() expected 'an integer', got '"000001"'

class(colTypes) and class(c("character")) both return "character", so what's the problem?


There is 1 answer

Kevin

You can use read.table's colClasses argument to specify which columns should be read as character. For example:

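txt <- 
"var1, var2, var3
 0001, 0002, 1
 0003, 0004, 2"

## read every column as character
## strip.white = TRUE drops the space after each comma, which character fields would otherwise keep
df <- read.table(
    text = txt,
    sep = ",",
    header = TRUE,
    strip.white = TRUE,
    colClasses = "character")
df

## read the first two columns as character and the third as numeric
df2 <- read.table(
    text = txt,
    sep = ",",
    header = TRUE,
    strip.white = TRUE,
    colClasses = c("character", "character", "numeric"))
df2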
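
A quick check on the same toy data can confirm that the padding survived. This is only a sketch: `strip.white = TRUE` is an assumption added here because the sample data has a space after each comma, and "numeric" is used for the third column because it is one of the atomic class names that `colClasses` documents.

```r
txt <- "var1, var2, var3
0001, 0002, 1
0003, 0004, 2"

df2 <- read.table(
    text = txt,
    sep = ",",
    header = TRUE,
    strip.white = TRUE,  # assumption: drop the space that follows each comma
    colClasses = c("character", "character", "numeric"))

sapply(df2, class)  # class of each column after the read
df2$var1            # zero-padded values, kept as strings
```

If the first two classes come back as "character" and `df2$var1` still shows its leading zeros, the `colClasses` vector did what was intended.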

Update: alternatively, you could capture the guessed classes in a vector, modify it, and pass it back as colClasses:

## first pass: let read.table guess the classes
df <- read.table(
    text = txt,
    sep = ",",
    header = TRUE)
df

## they're all currently read as integer
myColClasses <- sapply(df, class)

## create a vector of column names for the zero-padded variables
zero_padded <- c("var1", "var2")

## if a name is in zero_padded, return "character", else leave it be
myColClasses <- ifelse(names(myColClasses) %in% zero_padded,
                       "character",
                       myColClasses)

## read in again with colClasses set to myColClasses
## strip.white = TRUE drops the space after each comma in the character columns
df2 <- read.table(
    text = txt,
    sep = ",",
    strip.white = TRUE,
    colClasses = myColClasses,
    header = TRUE)
df2
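
For a file like the one in the question (tab-separated, with quoted fields), one more detail may matter. The read.table documentation notes that quoting is only considered for columns read as character, so a quoted column that is passed back as "integer" could produce exactly the scan() error shown above. A minimal sketch of a workaround, assuming that is the cause: set only the zero-padded columns to "character" and leave every other entry as NA, so read.table keeps guessing those itself. The temporary file, its column names, and the choice of `id` as the padded column are all hypothetical.

```r
# Hypothetical stand-in for the real file: tab-separated, quoted zero-padded fields
path <- tempfile(fileext = ".txt")
writeLines(c(
    "id\tcode\tcount",
    "\"000001\"\t\"000007\"\t3",
    "\"000002\"\t\"000008\"\t4"),
    path)

# Step 1: read a few rows only to learn the column names and guessed classes
guess <- read.table(path, sep = "\t", quote = "\"", header = TRUE,
                    stringsAsFactors = FALSE, nrows = 5)
sapply(guess, class)

# Step 2: start from "no opinion" (NA) for every column, then override by name
colTypes <- rep(NA_character_, ncol(guess))
names(colTypes) <- names(guess)
zero_padded <- "id"                  # hypothetical: the column to keep as text
colTypes[zero_padded] <- "character"

# Step 3: re-read the whole file; NA columns are still guessed by read.table
full <- read.table(path, sep = "\t", quote = "\"", header = TRUE,
                   stringsAsFactors = FALSE, colClasses = colTypes)
full$id
sapply(full, class)

unlink(path)  # remove the temporary file
```

Indexing by name (`colTypes[zero_padded] <- "character"`) is a shorter alternative to the ifelse() call and keeps the vector's names. The read.table documentation also describes matching a named colClasses vector against the column names, with unspecified columns treated as NA, which may make the first pass unnecessary.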