Count comma delimited values and replicate a value equal times in R


Given the following example data ...

id                               Proteins
522     Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7
523                                Q9UHV7
524                       Q9Y6T7-2;Q9Y6T7
525                       Q9Y6T7-2;Q9Y6T7

... I would like to create a third column that repeats each id once for every semicolon-delimited value in its row, with the copies joined by semicolons. More specifically, something like this:

id                               Proteins     newCol
522     Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7    522;522;522;522
523                                Q9UHV7    523
524                       Q9Y6T7-2;Q9Y6T7    524;524
525                       Q9Y6T7-2;Q9Y6T7    525;525

I have tried dt$newCol <- rep(dt$id, lengths(str_split(dt$Proteins, ";"))), but it doesn't work because rep() returns one element per repetition, so the result is longer than the number of rows.
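A minimal reproduction of the length mismatch may help here. The data frame is rebuilt by hand from the example above; the object name df and the use of base strsplit() in place of str_split() are assumptions made for this sketch.

```r
# Rebuild the example data (the object name `df` is an assumption)
df <- data.frame(
  id = c(522, 523, 524, 525),
  Proteins = c("Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7", "Q9UHV7",
               "Q9Y6T7-2;Q9Y6T7", "Q9Y6T7-2;Q9Y6T7"),
  stringsAsFactors = FALSE
)

# Number of semicolon-delimited values in each row
n <- lengths(strsplit(df$Proteins, ";", fixed = TRUE))

# rep() yields one element per repetition, not one element per row
flat <- rep(df$id, n)
length(flat)  # 9, while the data frame has only 4 rows
```

The repeated ids still have to be collapsed back into one string per row before they can be assigned as a column.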


There are 4 answers

ekoam On BEST ANSWER

Something like this?

library(stringr)
df$newCol <- str_replace_all(df$Proteins, "[^;]+", as.character(df$id))

Output

> df
   id                          Proteins          newCol
1 522 Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7 522;522;522;522
2 523                            Q9UHV7             523
3 524                   Q9Y6T7-2;Q9Y6T7         524;524
4 525                   Q9Y6T7-2;Q9Y6T7         525;525

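Another base R solution, suggested by @markus (wrapped in unlist() here so that the new column is a character vector rather than a list):

df1$new <- unlist(Map(gsub, pattern = "[^;]+", replacement = df1$id, x = df1$Proteins), use.names = FALSE)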
det On
library(tidyverse)

df %>%
  mutate(newCol = map2_chr(id, str_count(Proteins, ";") + 1, ~str_c(rep(.x, .y), collapse = ";")))
Ronak Shah On

A base R solution would be to count the number of times ";" occurs in each row, add 1, and repeat each id that many times. Then paste the ids together with tapply to create newCol.

x <- rep(df$id, lengths(regmatches(df$Proteins, gregexpr(";", df$Proteins))) + 1)
df$newCol <- tapply(x, x, paste0, collapse = ';')
df

#   id                          Proteins          newCol
#1 522 Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7 522;522;522;522
#2 523                            Q9UHV7             523
#3 524                   Q9Y6T7-2;Q9Y6T7         524;524
#4 525                   Q9Y6T7-2;Q9Y6T7         525;525
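Because tapply(x, x, ...) groups by the id value itself, the approach above relies on the ids being unique and already sorted, as they are in the example. A minimal sketch of a variant that groups by row position instead; the data frame here is hypothetical (unsorted ids, one of them repeated) and the helper name row is my own:

```r
# Hypothetical data: ids are unsorted and 524 appears twice
df <- data.frame(
  id = c(525, 522, 524, 524),
  Proteins = c("Q9Y6T7-2;Q9Y6T7", "Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7",
               "Q9UHV7", "Q9Y6T7-2;Q9Y6T7"),
  stringsAsFactors = FALSE
)

# Count the ";" separators in each row and add 1 to get the number of values
n <- lengths(regmatches(df$Proteins, gregexpr(";", df$Proteins))) + 1

# Group by row index rather than by id, so duplicates and ordering are safe
row <- rep(seq_len(nrow(df)), n)
df$newCol <- as.vector(tapply(rep(df$id, n), row, paste0, collapse = ";"))
df$newCol
```

as.vector() drops the array attributes that tapply() returns, leaving a plain character column.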
akrun On

We can use a for loop with gsub

for(i in seq_len(nrow(df1))) df1$newCol[i] <- gsub("([[:alnum:]-]+)", df1$id[i], df1$Proteins[i])
df1
#  id                          Proteins          newCol
#1 522 Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7 522;522;522;522
#2 523                            Q9UHV7             523
#3 524                   Q9Y6T7-2;Q9Y6T7         524;524
#4 525                   Q9Y6T7-2;Q9Y6T7         525;525
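The loop is needed because gsub() takes a single replacement string per call. As a loop-free sketch, base strrep() (available since R 3.3.0) is vectorized over both the string and the repeat count; the data frame below is rebuilt by hand from the question, and the helper name n is my own:

```r
# Example data rebuilt from the question (construction is an assumption)
df1 <- data.frame(
  id = c(522, 523, 524, 525),
  Proteins = c("Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7", "Q9UHV7",
               "Q9Y6T7-2;Q9Y6T7", "Q9Y6T7-2;Q9Y6T7"),
  stringsAsFactors = FALSE
)

# Number of semicolon-delimited values per row
n <- lengths(strsplit(df1$Proteins, ";", fixed = TRUE))

# Repeat "id;" n times per row, then drop the trailing semicolon
df1$newCol <- sub(";$", "", strrep(paste0(df1$id, ";"), n))
df1$newCol
```

This counts values rather than pattern-matching them, so it does not depend on which characters the protein identifiers contain.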