Parse file and use some of the fields as variables using the header as name in bash

Question

97 views Asked by biojl At 01 December 2014 at 11:46

I have a file whose first line contains a series of tab-separated (\t) fields. I'm trying to walk through the lines and use some of the fields as variables for a programme. The code I have so far is the following:

    {
    A=$(head -1 id_table.txt)
    read;
    while IFS='\t' read $A; 
    do
        echo 'downloading '$SRA_Sample_s
        echo $tissue_s
    #out_dir=`echo $tissue_s | sed 's/ /./g'` #Replacing spaces by dots
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir $out_dir --ncbi_error_report $SRA_Sample_s 
    done 
    } <./id_table.txt

Output (Wrong):

    downloading _s Inser
    downloading  provided> <no
    downloading  provided> <no
    downloading  provided> <no

It fails because the fields are not being read correctly. Perhaps the <> characters are causing confusion? Different files have the columns in a different order, and some columns are missing in some files. I'm stuck here.

The file looks like this:

BioSample_s MBases_l    MBytes_l    Run_s   SRA_Sample_s    Sample_Name_s   age_s   breed_s sex_s   Assay_Type_s    AssemblyName_s  BioProject_s    BioSampleModel_s    Center_Name_s   Consent_s   InsertSize_l    Library_Name_s  Platform_s  SRA_Study_s biomaterial_provider_s  g1k_analysis_group_s    g1k_pop_code_s  source_s    tissue_s
SAMN02777951    4698    3249    SRR1287653  SRS607026   SL01    19  SL01    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777952    4451    3063    SRR1287654  SRS607028   XB01    12  XB01    male    RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777953    4553    3139    SRR1287655  SRS607025   XB02    6   XB02    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood

There are 3 answers

NeronLeVelu On 01 December 2014 at 14:05

Try this (based on your style of development):

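{
   # Read the header line verbatim; its tab-separated names become the variable names
   IFS= read -r Header

   # Unquoted ${Header} splits into one name per column; IFS=$'\t' keeps fields that contain spaces intact
   while IFS=$'\t' read -r ${Header}
    do
      echo "Downloading ${SRA_Sample_s}"
      echo "${tissue_s}"
    done
} < id_table.txt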

Etan Reisner On 01 December 2014 at 15:15

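IFS='\t' doesn't work the way you wanted. Inside single quotes, \t is a literal backslash followed by the letter t, so read splits on those two characters rather than on tabs. Use IFS=$'\t' to split on tabs.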

This is why you are getting _s Inser, etc. (notice it starts and cuts off at the letter t).
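
The difference is easy to check on a single line. This is a minimal sketch, assuming bash (both `$'...'` quoting and the `<<<` here-string are used) and a made-up two-field sample string; the variable names are only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sample line: two tab-separated fields, the second one
# containing a space and the letter t.
line=$'SRS607026\t<not provided>'

# Single quotes: IFS holds the two characters '\' and 't',
# so read splits at the first letter t it finds.
IFS='\t' read -r a1 b1 <<< "$line"
printf 'single quotes:  [%s] [%s]\n' "$a1" "$b1"

# ANSI-C quoting: IFS holds one real tab character,
# so each field arrives whole.
IFS=$'\t' read -r a2 b2 <<< "$line"
printf 'ANSI-C quoting: [%s] [%s]\n' "$a2" "$b2"
```

With the single-quoted form, the second variable ends up holding ` provided>`, the same fragment that shows up in the question's output.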

That being said, I fully agree with Ed Morton that using awk for this is likely a better idea. Still, I believe that with careful quoting, and on the assumption that tabs never appear inside a field, you can likely do this safely with just the shell. (Ed has shown me the error of my initial thoughts on more than one occasion, though, so he may very well be thinking of things I am not.)
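
As a rough illustration of the shell-only route, the sketch below avoids both eval and using header names as variable names. It reads the header into an array, records each column's position in an associative array, and then looks fields up by name. It assumes bash 4 or later, that no field is empty (tab is IFS whitespace, so read collapses consecutive tabs), and a small made-up sample file. The names `col`, `header`, `row`, and `samples` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for id_table.txt (three tab-separated columns).
table=$(mktemp)
printf 'Run_s\tSRA_Sample_s\ttissue_s\n' > "$table"
printf 'SRR1287653\tSRS607026\twhole blood\n' >> "$table"
printf 'SRR1287654\tSRS607028\tblood\n' >> "$table"

declare -A col      # column name -> zero-based position
samples=()          # collected "sample:out_dir" pairs, kept for inspection

{
    # Header line: split on tabs into an array of column names.
    IFS=$'\t' read -r -a header
    for i in "${!header[@]}"; do
        col[${header[i]}]=$i
    done

    # Data lines: split each one on tabs and pick fields by column name.
    while IFS=$'\t' read -r -a row; do
        sample=${row[${col[SRA_Sample_s]}]}
        out_dir=${row[${col[tissue_s]}]// /.}   # replace spaces with dots
        echo "downloading $sample into $out_dir"
        samples+=("$sample:$out_dir")
    done
} < "$table"

rm -f "$table"
```

Because columns are looked up by name, the same loop works when files list their columns in a different order.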

**Ed Morton** · Accepted Answer · 2014-12-01T14:04:16+00:00

You may find an awk script more robust and less cumbersome to use than a shell loop:

$ cat tst.awk
BEGIN { FS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    print "downloading", $(f["SRA_Sample_s"])
    out_dir = $(f["tissue_s"])
    gsub(/ /,".",out_dir)
    cmd = sprintf( "/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir %s --ncbi_error_report %s", out_dir, $(f["SRA_Sample_s"]) )
    print cmd
    #system(cmd); close(cmd)
}

$ awk -f tst.awk file
downloading SRR1287653
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287653
downloading SRR1287654
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287654
downloading SRR1287655
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287655

I'd say you should DEFINITELY avoid the shell loop, were it not for the fact that you are calling an external command and so doing more than just text processing.

Alternatively, consider using awk for the text processing and then piping to a shell loop for the external command execution:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    gsub(/ /,".",$(f["tissue_s"]))
    print $(f["tissue_s"]), $(f["SRA_Sample_s"])
}

$ awk -f tst.awk file |
while IFS=$'\t' read -r out_dir SRA_Sample_s
do
    printf 'downloading %s\n' "$SRA_Sample_s"
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir "$out_dir" --ncbi_error_report "$SRA_Sample_s"
done
downloading SRR1287653
downloading SRR1287654
downloading SRR1287655
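
Since the question mentions that some files lack some columns, it may be worth checking the header before processing any rows. The following is only a sketch under assumptions: `need` is a hypothetical comma-separated list of required column names, and the sample file is made up so that `tissue_s` is absent.

```bash
#!/usr/bin/env bash
# Hypothetical sample file that lacks the tissue_s column.
table=$(mktemp)
printf 'Run_s\tSRA_Sample_s\n' > "$table"
printf 'SRR1287653\tSRS607026\n' >> "$table"

# "need" is a hypothetical comma-separated list of the columns the script requires.
missing=$(awk -v need='SRA_Sample_s,tissue_s' '
BEGIN { FS = "\t"; n = split(need, want, ",") }
NR == 1 {
    # Map each header name to its position, as in the scripts above.
    for (i = 1; i <= NF; i++) f[$i] = i
    # Print every required name that is absent from the header.
    for (j = 1; j <= n; j++)
        if (!(want[j] in f)) print want[j]
    exit
}' "$table")

rm -f "$table"

if [ -n "$missing" ]; then
    echo "missing column(s): $missing" >&2
else
    echo "all required columns present"
fi
```

Without such a check, a lookup like `$(f["tissue_s"])` on a file that lacks the column evaluates to `$0`, the whole line, which is unlikely to be what you want.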
