I am fairly new with R and I wrote a piece of code to iterate over a data frame (cell by cell) to perform a 'simple' check by if...else statements but it turns out it is very slow.
here the code
for(m in 1:(ncol(df) - 1)){
listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
for(n in 1:nrow(df)){
if (is.na(df[n, m]) == TRUE){
df[n, m] <- NA
} else {
if (sum(as.numeric(unlist(strsplit(as.character(df[n, 33]), split = ";"))) %in% listmatch, na.rm = TRUE) > 0){
df[n, m] <- 1
} else {
df[n, m] <- NA
}}
next}
next}
Just to give some dimensions, df
is 80000 obs of 33 variable, listmatch
is on average a vector of 5000 values, while the column 33 of df
contains information such as
1;2;3;4;5;6;7;8;9;10;11;12;13...
on average 150 values separated by semicolumn. I do not understand why it is very slow, neither how to optimize it and make it faster. Is it more the nested for loops or the nested if...else statements that make it slow? Any suggestions?
Here an example of my data df
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,33]
[1,] NA NA NA NA NA 1 NA 1;8;9;6
[2,] 1 NA NA NA NA NA NA 5
[3,] 1 NA NA 1 NA NA NA 6
[4,] 1 NA 1 NA 1 1 1 88;98;125
[5,] NA NA NA NA NA NA NA 2;3
[6,] 1 NA NA NA NA NA NA 4;5
[7,] NA NA NA NA NA NA 1 44
[8,] NA NA 1 1 1 NA NA 46
[9,] NA 1 1 NA NA NA NA 78
[10,] 11 NA NA NA NA NA NA 147;411
[11,] NA 1 NA NA NA NA NA 400
[12,] NA 1 1 NA NA NA NA 658
[13,] NA 1 NA 1 NA NA NA 87;988;1226
[14,] NA NA NA NA 1 1 NA 121;232;321
[15,] NA NA 1 NA NA 1 NA 5
while dt looks like this
Text Experiment Number id
[1,] df 5 3 1
[2,] fd 8 8 8
[3,] reh 7 4 6
[4,] egfdgsd 4 8 8
[5,] fbdsgryj 5 3 44
[6,] rtryyukfn 7 2 411
[7,] fgsfde 5 6 400
[8,] wthrthmgh 2 7 1274
[9,] vdfd 2 8 658
[10,] vfr 8 6 147
[11,] brtey 3 2 1226
[12,] tyirt 7 5 988
[13,] fyi 5 4 87
[14,] ywrtj 7 5 46
[15,] kyou 8 6 4
[16,] pkjh 1 7 8
[17,] retuyk 5 4 9
[18,] re 7 8 11
[19,] eryre 6 3 12
[20,] fdhd 6 1 18
I forgot to mention that I would like to keep the code
listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
as it is, at the moment I am filtering based on the value of m
but I would like in the future to apply manually different filtering, i.e. ...Experiment == m*2 & Number == m*5)...