Data I also have the total number of cancer patients (case_totals) and non-cancer patients(control_totals) which in this case is 100 and 1000 respectively.
Variant Cancer IBD AKI CKD CCF IHD
A1 0 5 4 0 0 4
A2 0 8 5 9 0 7
A3 20 9 6 7 0 3
B5 7 2 0 6 5 4
K7 9 1 8 4 2 5
L9 0 0 6 3 3 1
Desired outcome - two tables: Table1:
Variant case_total not_seen_in_cases_total control_total not_seen_in_control_total
A1 0 100 13 987
A2 0 100 25 975
A3 20 80 25 975
B5 7 93 17 983
K7 9 91 20 980
L9 0 100 13 987
Table2:
case_total_in_gene not_seen_in_gene_cases control_total_in_gene control_total_not_in_gene
36 64 113 887
I will then run a fishers across both tables to get a per variant and per gene p.value which I can do.
My issue is that I have multiple such datasets and in each the order of the columns of the input is different. At present I have been using:
ncol(dt) #to get the total number of columns as in reality the table is very large
which(colnames(dt)=='Cancer') #get the index column
dt$control_total <- (rowSums(dt[,2:7])) - rowSums(dt[,2]) #get a control totals per row column
And then subsetting dt and just adding in the other columns using subtraction e.g. dt$not_seen_in_control_total <- 1000 - dt$control_total
This won't work with shifting column indices and I want to run this across hundreds of files ideally using a commandArgs.
Ultimately how do I reference a column which will always have the same name but will be in different places in a function like RowSums etc?
Many thanks
You can select column names by position or pattern in names or by specifying range of columns. It depends on how your data is structured.
data