How to do a full join in a loop using a regular expression

I'm trying to perform a full join between a large number of data tables that are stored in two lists.

average_list2

AverageLD_1.txt
AverageLD_2.txt
AverageLD_3.txt
.
.
AverageLD_70.txt

full_list2

fullLD_1.txt
fullLD_2.txt
fullLD_3.txt
.
.
fullLD_70.txt

The full join must be performed between pairs of files that share the same numeric suffix (1 to 70): AverageLD_1 with FullLD_1, AverageLD_2 with FullLD_2, and so on.

"sec" and "r2" are the columns to join on.

For this:

  1. Standardize the column names in both lists:
average_list <- lapply(Sys.glob("ld_interval/averageLD_*.txt"), fread)
new_colnames <- c("sec", "r2")
average_list2 <- lapply(average_list, set_names, new_colnames)

full_list <- lapply(Sys.glob("ld_interval/fullLD*.txt"), fread)
new_colnames2 <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "sec", "r2")
full_list2 <- lapply(full_list, set_names, new_colnames2)
  2. I tried:
full_join <- list()

for (i in 1:length(average_list2)) {
  for (j in 1:length(full_list2)) {
    if (i == j) {
      full_joina <- dplyr::full_join(by = "sec")
      break
    }
  }
}
write.table(full_joina,  file = paste0("ld_interval/", "fulljoina_",i,".txt"), quote = FALSE, sep ="\t" ,row.names = F, col.names = F)

I would appreciate any suggestions.

There are 2 answers

richarddmorey On

Why not something like this? I can't test it without your data.

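library(data.table)  # provides fread()
library(dplyr)

# dir() takes a regular expression as its pattern, not a shell glob
dir("ld_interval/", pattern = "^averageLD_[0-9]+\\.txt$", full.names = TRUE) |>
  purrr::map_df(\(fn) {
    # Extract the shared suffix (e.g. "LD_1") from the file name
    s <- stringr::str_match(fn, pattern = "(LD_[0-9]+)\\.txt$")[2]
    x <- fread(fn)
    colnames(x) <- c("sec", "r2")
    # Build the name of the matching full file from the same suffix
    y <- fread(paste0("ld_interval/", "full", s, ".txt"))
    colnames(y) <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "sec", "r2")
    # r2 exists in both tables, so the result carries r2.x and r2.y
    dplyr::full_join(x, y, by = "sec") |>
      mutate(source = s)
  })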
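
One detail worth checking when listing files this way: the `pattern` argument of `dir()` (an alias of `list.files()`) is a regular expression, whereas `Sys.glob()` takes a shell-style wildcard. The sketch below uses base R only, with made-up file names, to show how the same string behaves under each interpretation. `glob2rx()` from utils converts a wildcard into a regular expression.

```r
# Hypothetical file names, used only for illustration.
files <- c("averageLD_1.txt", "averageLD_10.txt", "fullLD_1.txt", "notes.txt")

# Wildcard converted to a regular expression with utils::glob2rx().
glob_rx <- glob2rx("averageLD_*.txt")
matched_glob <- grep(glob_rx, files, value = TRUE)

# Hand-written regular expression that only accepts numeric suffixes.
matched_rx <- grep("^averageLD_[0-9]+\\.txt$", files, value = TRUE)

# The wildcard string used directly as a regular expression:
# "_*" means "zero or more underscores" and "." means "any one character".
matched_raw <- grep("averageLD_*.txt", files, value = TRUE)

print(matched_glob)
print(matched_rx)
print(matched_raw)
```

With these names, the wildcard string used as a raw regular expression matches none of the files, while the other two patterns select both average files.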

Johanna Ramirez On

This one worked for me:

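for (i in seq_along(average_list2)) {
  full_joina <- NULL
  cat(paste0('file ', i, '\n'))
  cat(paste0('Record in: ', nrow(full_list2[[i]]), '\n'))
  # Join the i-th pair; both lists must be in the same file order.
  # The key must match the column name set earlier ("sec" in the question).
  full_joina <- dplyr::full_join(average_list2[[i]], full_list2[[i]], by = "sec")
  cat(paste0('Record out: ', nrow(full_joina), '\n'))
  # pop_ld is a vector of output labels, one per file pair, defined elsewhere
  write.table(full_joina, file = paste0("ld_interval_kmeans/", "fulljoina_kmeans_", pop_ld[i], ".txt"), quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}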

The lines below are not necessary for the merge, but they are very useful for tracking the reading and writing of the files:

cat(paste0('file ',i,'\n'))
cat(paste0('Record in: ', nrow(full_list2[[i]]),'\n'))
cat(paste0('Record out: ', nrow(full_joina),'\n'))
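
Joining by position works only if both lists are in the same file order. As a hedged alternative, the sketch below pairs tables by the numeric suffix taken from the file names, so the order of the two lists no longer matters. The data frames, file names, and the `file_id()` helper are all hypothetical, and base R `merge(..., all = TRUE)` stands in for `dplyr::full_join()` so that the example runs without extra packages.

```r
# Hypothetical in-memory stand-ins for the tables read from disk.
average_list2 <- list(
  data.frame(sec = c(1, 2), r2 = c(0.10, 0.20)),  # from averageLD_1.txt
  data.frame(sec = c(1, 3), r2 = c(0.30, 0.40))   # from averageLD_2.txt
)
full_list2 <- list(
  data.frame(V1 = c("a", "b"), sec = c(2, 5), r2 = c(0.50, 0.60)),  # fullLD_2.txt
  data.frame(V1 = c("c", "d"), sec = c(1, 3), r2 = c(0.70, 0.80))   # fullLD_1.txt
)

# Hypothetical file names; in practice these would come from Sys.glob().
# The full files are deliberately listed in a different order.
average_files <- c("ld_interval/averageLD_1.txt", "ld_interval/averageLD_2.txt")
full_files <- c("ld_interval/fullLD_2.txt", "ld_interval/fullLD_1.txt")

# Extract the numeric suffix from a file name, e.g. "1" from "fullLD_1.txt".
file_id <- function(path) sub("^.*_([0-9]+)\\.txt$", "\\1", basename(path))

# Name each list element by its suffix so pairs can be looked up by name.
names(average_list2) <- file_id(average_files)
names(full_list2) <- file_id(full_files)

# Full join each pair that exists in both lists, keyed on "sec".
ids <- intersect(names(average_list2), names(full_list2))
joined <- lapply(ids, function(id) {
  merge(average_list2[[id]], full_list2[[id]], by = "sec", all = TRUE,
        suffixes = c("_average", "_full"))
})
names(joined) <- ids

print(joined[["1"]])
```

Each element of `joined` could then be written out with `write.table()`, using its name to build the output file name.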