Detecting and removing multibyte strings in R


I have a multibyte string "UCA1\xa6\xc1" in a large vector of RNA names, which prints as UCA1�� when passed to cat(). I am trying to screen the vector for such strings and either rename them or, if all else fails, remove them from the vector, because functions like toupper() fail on them.

I'm not sure which encoding the bytes '\xa6' and '\xc1' belong to, so I don't know how to screen for them with a regex. Could anybody help me with this?


There are 2 answers

Antreas Stefopoulos On BEST ANSWER

This is probably an encoding issue, so try setting the encoding when you load the file. Something like this:

# fileEncoding re-encodes the input as it is read; try other encodings if this one does not fit
df <- read.csv(file_path,
               fileEncoding = "ISO-8859-1",
               header = TRUE,
               stringsAsFactors = FALSE)
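
If the names are already in memory, the same idea can be tried without re-reading the file. This is only a minimal sketch: it assumes the stray bytes really are Latin-1, and the vector `rna` is a made-up stand-in for the real data.

```r
# Hypothetical example vector; the second element carries two raw non-ASCII bytes.
rna <- c("malat1", "UCA1\xa6\xc1")

# Treat the bytes as Latin-1 and convert them to UTF-8.
# 0xA6 and 0xC1 are both defined in Latin-1, so the conversion succeeds for this input.
rna_utf8 <- iconv(rna, from = "latin1", to = "UTF-8")

# The converted strings are valid UTF-8, so toupper() no longer stops with an error.
toupper(rna_utf8)
```

Whether Latin-1 is the right guess depends on where the file came from; if the converted names look wrong, another source encoding may be a better fit.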
ShortOrders On

Thanks Antreas, that makes sense, and it works exactly as you said!

As I was using fread(), I had to pass "Latin-1" as the encoding (Latin-1 is another name for the ISO-8859-1 encoding that was suggested) when reading in the file, like this:

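library(data.table)  # provides fread()

basepath <- file.path(getwd(), 'RNA Databases')
file_list <- dir(basepath)

# Build the path from basepath, since file_list holds names relative to it
db2 <- fread(file.path(basepath, file_list[4]), encoding = 'Latin-1')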

This reads "UCA1\xa6\xc1" in as "UCA1¦Á". Not a very readable string, but good enough for functions like toupper() to accept it!
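
For the original question of screening the vector directly, one possible approach is sketched below. It assumes the names are expected to be UTF-8 (or plain ASCII), so anything that fails `validUTF8()` is treated as suspect. The vector `rna` and the variable names are invented for illustration.

```r
# Hypothetical example vector with one malformed name.
rna <- c("malat1", "UCA1\xa6\xc1", "hotair")

# validUTF8() returns FALSE for elements whose bytes are not well-formed UTF-8.
bad <- !validUTF8(rna)

# Option 1: drop the offending names.
kept <- rna[!bad]

# Option 2: rename them by deleting every byte outside printable ASCII.
# useBytes = TRUE makes the regex work byte by byte, so the invalid
# sequence does not trip up the matching.
renamed <- rna
renamed[bad] <- gsub("[^\\x20-\\x7E]", "", rna[bad], perl = TRUE, useBytes = TRUE)

toupper(kept)
toupper(renamed)
```

Dropping bytes loses whatever character they stood for, so re-reading the file with the right encoding is usually preferable when that is possible.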