Issues Reading the table in R


I'm trying to read a text file with read.table in R. My data (a .txt file) looks like this:

a b c d e
Australia 1 2 4 3 2
United States 1 2 4 2 2

The problems with reading this table are that:

1) Line 1 has only 5 elements (a to e), whereas every row below it has 6. The first column is supposed to have a name such as "Country", so that a corresponds to the first number (1), b to 2, ..., and e to 2 (in the case of Australia). How do I add a name for the first column so that R doesn't raise the error "line 1 did not have 6 elements"?

2) "United States" is two words rather than one, so when R reads the data it puts "States" into the second column instead of treating "United States" as a single value.

(A friend advised me to use rownames. Does anyone know how to go about using them?)

How can I fix these issues and correctly read my data?

Thank you very much!!


There are 2 answers

Rich Scriven

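Here's another possibility. This one wraps the first pair of adjacent words (each at least two letters long) on each line in quotes, so that read.table treats the pair as a single field: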

x <- readLines("your.txt")
x[1] <- paste("Country", x[1])
read.table(text=sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x), header=TRUE)
#         Country a b c d e
# 1     Australia 1 2 4 3 2
# 2 United States 1 2 4 2 2

With regard to @akrun's comment about countries containing more than two words, I think this will work:

x[4] <- 'Papua New Guinea 3 4 3 2 5'
xx <- sub("([A-Za-z]{2,}(\\s[A-Za-z]{2,})+)", "'\\1'", x)
read.table(text = xx, header = TRUE)
#            Country a b c d e
# 1        Australia 1 2 4 3 2
# 2    United States 1 2 4 2 2
# 3 Papua New Guinea 3 4 3 2 5
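
One thing to watch for: the patterns above are not anchored, so they would also quote part of the header if the column names had two or more letters. Here is a minimal sketch of a variation that quotes everything before the first number, on the data lines only. It assumes the numeric fields always start with a digit, and it uses an inline vector as a stand-in for readLines("your.txt"):

```r
# Inline stand-in for readLines("your.txt"); the third data row is made up
x <- c("a b c d e",
       "Australia 1 2 4 3 2",
       "United States 1 2 4 2 2",
       "Papua New Guinea 3 4 3 2 5")

# Add the missing name for the first column
x[1] <- paste("Country", x[1])

# On the data lines only, quote everything before the first " <digit>".
# The lazy .*? stops at the first whitespace run that is followed by a digit.
x[-1] <- sub("^(.*?)(\\s+\\d)", "'\\1'\\2", x[-1], perl = TRUE)

dat <- read.table(text = x, header = TRUE, stringsAsFactors = FALSE)
dat
```

Because the header line is skipped, this should behave the same whatever the column names are.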

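It also occurred to me that the country names might be meant as the row names of the data frame. If that's the case, you don't need to add "Country" at all: when the header line has one fewer field than the data lines, read.table uses the first column as the row names. So you could do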

x <- readLines("your.txt")
read.table(text = sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x))
#               a b c d e
# Australia     1 2 4 3 2
# United States 1 2 4 2 2
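
Since the question asks how to use rownames, here is a short sketch of working with the result. It uses an inline vector in place of readLines("your.txt"), and the names dat and dat2 are just for illustration:

```r
# Inline stand-in for readLines("your.txt")
x <- c("a b c d e",
       "Australia 1 2 4 3 2",
       "United States 1 2 4 2 2")

# Header has 5 fields, data rows have 6, so the first column becomes row names
dat <- read.table(text = sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x))

rownames(dat)              # the country names
dat["United States", "c"]  # look up a single value by row name and column name

# If a regular "Country" column turns out to be more convenient after all,
# the row names can be moved into one
dat2 <- data.frame(Country = rownames(dat), dat,
                   row.names = NULL, stringsAsFactors = FALSE)
```

Row names must be unique, so this only fits if each country appears once in the file.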

akrun

Assuming that the example data mimics the content in the file, we could read it using readLines and then use regex to separate the country names from the rest. The separated country names can be added as a new column.

lines <- readLines('Betty2.txt')
lines
#[1] "a b c d e"               "Australia 1 2 4 3 2"    
#[3] "United States 1 2 4 2 2"

dat <- read.table(text = c(lines[1], gsub('[A-Za-z]+\\s+', '', lines[-1])),
                  header = TRUE)

In the above code, we replace every run of letters followed by whitespace (i.e. the country names) with '', which leaves only the numbers:

 gsub('[A-Za-z]+\\s+', '',  lines[-1])
 #[1] "1 2 4 3 2" "1 2 4 2 2"

 dat1 <- data.frame(Country= gsub(" \\d+.*", '', lines[-1]),
                               dat, stringsAsFactors=FALSE)

Similarly, here we replace the first space followed by one or more digits (\\d+), together with everything after it (.* matches zero or more characters), with ''. That leaves only the country name:

 gsub(" \\d+.*", '', lines[-1])
 #[1] "Australia"     "United States"


dat1
#        Country a b c d e
#1     Australia 1 2 4 3 2
#2 United States 1 2 4 2 2
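
The two steps can also be wrapped into one helper. This is only a sketch: read_country_table is a made-up name, the "Guinea-Bissau" row is invented to show a name containing a non-letter character, and it assumes that the first whitespace followed by a digit marks the end of the country name.

```r
# Hypothetical helper: split each data line at the first " <digit>"
read_country_table <- function(lines) {
  body <- lines[-1]
  # Country = everything before the first whitespace run followed by a digit
  country <- sub("\\s+\\d.*$", "", body)
  # Numbers = the rest of the line once the country prefix is dropped
  numbers <- trimws(substring(body, nchar(country) + 1))
  dat <- read.table(text = c(lines[1], numbers), header = TRUE)
  data.frame(Country = country, dat, stringsAsFactors = FALSE)
}

# Inline stand-in for readLines('Betty2.txt'); the last row is made up
lines <- c("a b c d e",
           "Australia 1 2 4 3 2",
           "United States 1 2 4 2 2",
           "Guinea-Bissau 3 4 3 2 5")

dat1 <- read_country_table(lines)
dat1
```

Splitting on the first digit rather than on letters means hyphens or apostrophes in a name are kept as part of the name.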