Detecting and removing multibyte strings in R

Question

381 views Asked by ShortOrders At 08 August 2022 at 04:29

I have a multibyte string, "UCA1\xa6\xc1", in a large vector of RNA names; cat() prints it as UCA1��. I am trying to screen the vector for such strings and either rename them or, failing that, remove them from the vector, because I cannot capitalize them with functions like toupper().

I'm not sure what kind of data '\xa6' and '\xc1' encode, so I don't know how to screen for them with a regex. Could anybody help me with this?

There are 2 answers

ShortOrders On 10 August 2022 at 01:20

Thanks Antreas, that makes sense and it works exactly as you said!

As I was using fread(), I had to use "Latin-1" encoding instead (presumably the same as the "iso-8859-1" that was suggested) to read in the file first, like this:

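library(data.table)

basepath <- file.path(getwd(), 'RNA Databases')
file_list <- dir(basepath)

# Build the path from basepath, since that is the directory file_list was read from
db2 <- fread(file.path(basepath, file_list[4]), encoding = 'Latin-1')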

This rendered "UCA1\xa6\xc1" as "UCA1¦Á" instead. Not a very comprehensible string, but more than sufficient for functions like toupper() to accept it!
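
To see why the Latin-1 reading gives "UCA1¦Á", here is a minimal sketch in base R. It rebuilds the string from its raw bytes (the variable names x and y are just for illustration), declares the bytes as Latin-1, and re-encodes to UTF-8. In Latin-1, byte 0xA6 is the broken bar and 0xC1 is a capital A with acute accent, which is what shows up.

```r
# Rebuild the problem string from raw bytes: "UCA1" followed by 0xA6 0xC1
x <- rawToChar(c(charToRaw("UCA1"), as.raw(c(0xa6, 0xc1))))

# As it stands, the string is not valid UTF-8
validUTF8(x)

# Declare that the bytes should be read as Latin-1, then re-encode to UTF-8
Encoding(x) <- "latin1"
y <- enc2utf8(x)

# Code points of the result: U, C, A, 1, U+00A6, U+00C1
utf8ToInt(y)

# The re-encoded string is valid UTF-8, so toupper() accepts it
toupper(y)
```

Whether Latin-1 is the encoding the file was really written in is a separate question; this only shows what that interpretation does to the two bytes.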

**Antreas Stefopoulos** · Accepted Answer · 2022-08-08T06:39:17+00:00

Antreas Stefopoulos On 08 August 2022 at 06:39 BEST ANSWER

This is probably an encoding issue, so try changing the encoding when you load the file. Something like this:

df <- read.csv(file_path,
               fileEncoding = "ISO-8859-1",  # declares the file's encoding; try others if the text still looks wrong
               header = TRUE,
               stringsAsFactors = FALSE)
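
If re-reading the file is not an option, the screening the question asks about can probably be done without a regex. The sketch below assumes the goal is to find entries that are not valid UTF-8; the vector rna and the "invalid_" placeholder names are made up for illustration. It uses base R's validUTF8() to flag the entries, then either drops or renames them.

```r
# Hypothetical stand-in for the vector of RNA names
bad_name <- rawToChar(c(charToRaw("UCA1"), as.raw(c(0xa6, 0xc1))))
rna <- c(bad_name, "malat1", "Xist")

# validUTF8() is FALSE for elements whose bytes are not valid UTF-8
bad <- !validUTF8(rna)
which(bad)

# Option 1: remove the offending entries
rna_clean <- rna[!bad]

# Option 2: rename them to a placeholder (naming scheme is an assumption)
rna_renamed <- rna
rna_renamed[bad] <- paste0("invalid_", seq_len(sum(bad)))

# Both results can now be passed to toupper()
toupper(rna_clean)
toupper(rna_renamed)
```

Renaming keeps the vector the same length as the original, which matters if the names line up with rows of another table.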
