Loop to multiply column i by 2, add column j, and repeat for all pairs of columns in a file

102 views Asked by At

I need to convert a genotype dosage file into an allelic dosage file.

Input looks like this:

    #snp a1 a2 i1 j1 i2 j2 i3 j3
    chr6_24000211_D D I3 0 0 0 0 0 0
    rs78244999 A G 1 0 1 0 1 0
    rs1511479 T C 0 1 1 0 0 1
    rs34425199 A C 0 0 0 0 0 0
    rs181892770 A G 1 0 1 0 1 0
    rs501871 A G 0 1 0.997 0.003 0 1
    chr6_24000836_D D I4 0 0 0 0 0 0
    chr6_24000891_I I2 D 0 0 0 0 0 1
    rs16888446 A C 0 0 0 0 0 0

Columns 1-3 are identifiers. No operations should be performed on these, they need to just be copied as is into the output file. For the remaining columns, they need to be considered as a pair of column i and column j and the following operation needs to be performed: 2*i + j

Pseudocode

write first three columns of input file to output

for all i and j in the file, write 2*i + j to output

Desired output looks like this:

#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0 
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0

I will be performing this on a number of files with different total columns, so I want the loop to run for (total number of columns - 3)/2 iterations, i.e. until it reaches the last column of the file.

Input files are ~9 million rows by ~10,000 columns, so reading the files into a program such as R is very slow. I am not sure the most efficient tool to use to implement this (awk? perl? python?), and as a novice coder I unsure of where to begin re: syntax for the solution.

3

There are 3 answers

0
Ed Morton On BEST ANSWER

Here's the awk implementation of your posted algorithm, enhanced just slightly to produce the first row you show in your expected output:

$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c=0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}

.

$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
0
Borodin On

This will do as you ask. It expects the input file as a parameter on the command line and will send the output to STDOUT, which you can redirect to a file if you wish.

use strict;
use warnings;

while (<>) {

  my @fields = split;
  my @probs = splice @fields, 3;

  if (/^#/) {
    push @fields, 1 .. @probs / 2;
  }
  else {
    while (@probs >= 2) {
      my ($i, $j) = splice @probs, 0, 2;
      push @fields, $i + $i + $j;
    }
  }

  print "@fields\n";
}

output

#SNP A1 A2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
0
snakehiss On

Python version

#!/usr/bin/env python

from itertools import izip_longest, chain

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1,2,3,4], 2)) => [(1,2),(3,4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times/sources to do grouping
    return izip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row in csv_reader:
        # collect the pairs and process them
        processed_pairs = (2*float(i)+float(j) for i, j in chunk(row[3:]))
        # yield back the first 3 element and the processed pairs
        yield list(i for j in (row[0:3], processed_pairs) for i in j)

if __name__ == '__main__':
    import csv, sys
    with open(sys.argv[1], 'rb') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for line in source:
            print line