Executing Concatenation for all rows


I'm working with GWAS data.

Using the PLINK command, I was able to generate three files: SNPslist, SNPs.map, and SNPs.ped.

Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):

$ cat SNPs.map 
0   rs6923761   0   0
0   rs7903146   0   0

$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C

Bash commands I used:

echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv

Output:

IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC

With only 2 SNPs I could work out the column positions by hand and hard-code them in the command above. But now I have 2000 SNP IDs and their values. I need a bash command that can process all 2000 SNPs in the same way.


There are 2 answers

markp-fuso On

One awk idea that replaces all of the current code:

awk '
BEGIN   { printf "IID" }

# process 1st file:

FNR==NR { printf ",%s", $2; next }

# process 2nd file:

FNR==1  { print "" }                       # terminate 1st line of output
        { printf "%s", $1                  # print 1st column
          for (i=7;i<=NF;i=i+2)            # loop through columns 7-NF, incrementing index +2 on each pass
              printf ",%s%s", $i, $(i+1)   # print (i)th and (i+1)th columns
          print ""                         # terminate line
        }
' SNPs.map SNPs.ped

NOTE: remove comments to declutter code

This generates:

IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
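
With 2000 SNPs it is hard to spot a misaligned row by eye, so a column-count check may be worth adding. The following is only a sketch built on the same awk idea: it recreates the two sample files from the question, counts the SNPs in SNPs.map (the `nsnps` and `bad` variables are names made up for this example), and reports any SNPs.ped row that does not have 6 leading columns plus 2 allele columns per SNP. It assumes the .map file has no blank lines and is not empty.

```shell
# Recreate the sample files from the question (stand-ins for the real data).
cat > SNPs.map <<'EOF'
0   rs6923761   0   0
0   rs7903146   0   0
EOF

cat > SNPs.ped <<'EOF'
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
EOF

awk '
BEGIN   { printf "IID" }

# 1st file (SNPs.map): append each SNP ID to the header and count the SNPs
FNR==NR { printf ",%s", $2; nsnps++; next }

# 2nd file (SNPs.ped): finish the header line before the first data row
FNR==1  { print "" }

# skip and report rows whose field count is not 6 + 2 per SNP
NF != 6 + 2*nsnps {
    printf("%s line %d: expected %d fields, found %d\n",
           FILENAME, FNR, 6 + 2*nsnps, NF) > "/dev/stderr"
    bad = 1
    next
}

# well-formed row: 1st column, then each allele pair joined together
{
    printf "%s", $1
    for (i = 7; i <= NF; i += 2)
        printf ",%s%s", $i, $(i+1)
    print ""
}

# non-zero exit status if any row was rejected
END     { exit bad }
' SNPs.map SNPs.ped > SNPs.csv

cat SNPs.csv
```

Because the loop runs from column 7 to NF, the same command should work unchanged for 2000 SNPs; only the input files differ.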
DSTO On

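You can use the --recodeA flag in PLINK (written as --recode A in PLINK 1.9) to get a file with one row per individual (IID) and one column per SNP. Note that this output encodes each genotype as an allele count (0, 1, or 2) rather than as a letter pair such as GG.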