Using a specific column's values to subtract from other rows but the column index moves between files in R

124 views Asked by At

Data I also have the total number of cancer patients (case_totals) and non-cancer patients(control_totals) which in this case is 100 and 1000 respectively.

Variant  Cancer IBD AKI CKD CCF IHD
A1         0    5   4   0   0   4
A2         0    8   5   9   0   7
A3         20   9   6   7   0   3
B5         7    2   0   6   5   4
K7         9    1   8   4   2   5
L9         0    0   6   3   3   1

Desired outcome - two tables: Table1:

 Variant     case_total not_seen_in_cases_total control_total not_seen_in_control_total
    A1             0           100                    13                  987  
    A2             0           100                    25                  975 
    A3             20          80                     25                  975
    B5             7           93                     17                  983
    K7             9           91                     20                  980
    L9             0           100                    13                  987

Table2:

case_total_in_gene  not_seen_in_gene_cases      control_total_in_gene control_total_not_in_gene
36                         64                            113                 887

I will then run a fishers across both tables to get a per variant and per gene p.value which I can do.

My issue is that I have multiple such datasets and in each the order of the columns of the input is different. At present I have been using:

ncol(dt) #to get the total number of columns as in reality the table is very large
which(colnames(dt)=='Cancer') #get the index column 
dt$control_total <- (rowSums(dt[,2:7])) - rowSums(dt[,2]) #get a control totals per row column 

And then subsetting dt and just adding in the other columns using subtraction e.g. dt$not_seen_in_control_total <- 1000 - dt$control_total

This won't work with shifting column indices and I want to run this across hundreds of files ideally using a commandArgs.

Ultimately how do I reference a column which will always have the same name but will be in different places in a function like RowSums etc?

Many thanks

1

There are 1 answers

6
Ronak Shah On BEST ANSWER

You can select column names by position or pattern in names or by specifying range of columns. It depends on how your data is structured.

library(dplyr)

table1 <- df %>%
  mutate(control_total = rowSums(select(., setdiff(2:ncol(.), 
                                 match('Cancer', names(.)))))) %>%
  transmute(Variant, Cancer, 
            not_seen_in_cases_total = 100 - Cancer, 
            control_total, 
            not_seen_in_control_total = 1000 - control_total)
table1

#  Variant Cancer not_seen_in_cases_total control_total not_seen_in_control_total
#1      A1      0                     100            13                       987
#2      A2      0                     100            29                       971
#3      A3     20                      80            25                       975
#4      B5      7                      93            17                       983
#5      K7      9                      91            20                       980
#6      L9      0                     100            13                       987

table2 <- table1 %>%
  summarise(case_total_in_gene = sum(Cancer), 
            not_seen_in_gene_cases = 100 - case_total_in_gene, 
            control_total_in_gene = sum(control_total), 
            control_total_not_in_gene = 1000 - control_total_in_gene)

table2
# case_total_in_gene not_seen_in_gene_cases control_total_in_gene control_total_not_in_gene
#1                 36                     64                   117                       883

data

df <- structure(list(Variant = c("A1", "A2", "A3", "B5", "K7", "L9"
), Cancer = c(0L, 0L, 20L, 7L, 9L, 0L), IBD = c(5L, 8L, 9L, 2L, 
1L, 0L), AKI = c(4L, 5L, 6L, 0L, 8L, 6L), CKD = c(0L, 9L, 7L, 
6L, 4L, 3L), CCF = c(0L, 0L, 0L, 5L, 2L, 3L), IHD = c(4L, 7L, 
3L, 4L, 5L, 1L)), class = "data.frame", row.names = c(NA, -6L))