I need to convert a genotype dosage file into an allelic dosage file.
Input looks like this:
#snp a1 a2 i1 j1 i2 j2 i3 j3
chr6_24000211_D D I3 0 0 0 0 0 0
rs78244999 A G 1 0 1 0 1 0
rs1511479 T C 0 1 1 0 0 1
rs34425199 A C 0 0 0 0 0 0
rs181892770 A G 1 0 1 0 1 0
rs501871 A G 0 1 0.997 0.003 0 1
chr6_24000836_D D I4 0 0 0 0 0 0
chr6_24000891_I I2 D 0 0 0 0 0 1
rs16888446 A C 0 0 0 0 0 0
Columns 1-3 are identifiers. No operations should be performed on them; they just need to be copied as-is into the output file. The remaining columns need to be treated as pairs (column i followed by column j), with the following operation performed on each pair: 2*i + j
Pseudocode
write first three columns of input file to output
for all i and j in the file, write 2*i + j to output
Desired output looks like this:
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
I will be performing this on a number of files with different total columns, so I want the loop to run for (total number of columns - 3)/2 iterations, i.e. until it reaches the last column of the file.
Input files are ~9 million rows by ~10,000 columns, so reading them into a program such as R is very slow. I am not sure which tool would be most efficient for this (awk? perl? python?), and as a novice coder I am unsure where to begin with the syntax for the solution.
Here's the awk implementation of your posted algorithm, enhanced just slightly to produce the first row you show in your expected output:
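A minimal sketch of such a script, assuming whitespace-separated fields and exactly one header line at the top of each file. The wrapper name `to_allelic_dosage` is hypothetical, chosen only for illustration:

```sh
# Hypothetical wrapper; reads from the named files, or from stdin if none are given.
to_allelic_dosage() {
  awk '
  {
    # Copy the three identifier columns through unchanged.
    printf "%s %s %s", $1, $2, $3

    # Walk the remaining columns as (i, j) pairs: fields 4-5, 6-7, ...
    n = 0
    for (i = 4; i < NF; i += 2) {
      n++
      if (FNR == 1)
        printf " %d", n                    # header line: number the pairs 1, 2, 3, ...
      else
        printf " %s", 2 * $i + $(i + 1)    # data line: allelic dosage 2*i + j
    }
    print ""
  }' "$@"
}
```

Usage would look something like `to_allelic_dosage input.txt > output.txt` (file names are placeholders). Because the loop bound is `NF`, it adapts to however many columns each file has, and awk processes one line at a time, so the whole file never has to fit in memory.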