How to replace 00 with Na excluding first row & first column using bash in comma separated file

331 views Asked by At

I'm working with GWAS data, My data looks like this:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,00,AG,GT,AK,00
32,AG,GG,AA,00,AT
100,TT,AA,00,AG,AA       
3,GG,AG,00,GT,GG

Desired Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG

Here I'm trying to replace "00" with "N/A", but since I have 00 in the first row & First Column (IID), the command I used:

sed '1!s~00~N/A~g' allSNIPsFinaldata.csv 

The above command excludes the first row but not the first column as a result I got IID Values 100, 200, and 300 as 1N/A, 2N/A, and 3N/A. Can anyone please help "how to exclude the first row & First Column as well and perform the above operation. please help

6

There are 6 answers

1
Ed Morton On BEST ANSWER

Using any awk in any shell on every Unix box:

$ awk '{$0=$0","; gsub(/,00,/,",N/A,"); sub(/,$/,"")} 1' file
IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA
3,GG,AG,N/A,GT,GG

The above assumes none of the column names on the first row will be the exact string 00. If they can be then just tweak the above to:

awk 'NR>1{$0=$0","; gsub(/,00,/,",N/A,"); sub(/,$/,"")} 1' file
0
Egor Lipchinskiy On

If you want to repolace only 00 in other columns you have to add a delimiter (I am assuming space in my command) in your pattern: sed -i 's~ 00 ~ N/A ~g' allSNIPsFinaldata.csv

0
potong On

This might work for you (GNU sed):

sed -E '1!{s/,00(,|$)/,N\/A\1/g;s//,N\/A\1/g}' file

If not the first line and , followed by 00 followed by , or end-of-line, replace the 00 by N/A and other parts of the match remain unchanged.

This substitution is global but needs to be implemented twice because the patterns may overlap.

0
tshiono On

Assuming the columns are separated by space characters such as whitespace or tab character, would you please try:

sed -E '1!s~([[:space:]])00([[:space:]]|$)~\1N/A\2~g' allSNIPsFinaldata.csv
  • The address 1! skips the 1st row.
  • The regex ([[:space:]])00([[:space:]]|$) matches the 00 string preceded by a space character (it prevents to match the 1st column) and followed by a space character or the end of the line.
2
James Brown On

An awk:

$ awk '
BEGIN {
    FS=OFS=","          # set field delimiters to a comma
}
FNR>1 {                 # process records after the first
    for(i=1;i<=NF;i++)  # iterate all fields (maybe start from 2nd?)
        if($i=="00")    # if field is 00
            $i="N/A"    # replace
}1' file                # output

Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG
0
RavinderSingh13 On

With your shown samples in GNU awk using its gensub function, please try following awk program.

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print
  next
}
{
  secondPart=gensub(/^[^,]*,(.*)/,"\\1","g")
  sub(/^00,/,"N/A,",secondPart)
  gsub(/,00,/,",N/A,",secondPart)
  sub(/,00$/,",N/A",secondPart)
  print $1 OFS secondPart
}
'  Input_file