Searching through a text file in the terminal


Hi, this might be a basic question for many, but it has managed to eat a couple of hours of my time.

I have a large data file produced by running a script. The file contains around 15 columns and around 100,000 rows. I want to search the file and check columns 4, 5, 6, 7 and 8 for specific values (and strings). I know I can cut the columns out separately and view them, or use forward search ("/") in less. The problem is that the second and third columns also contain the value I am searching for (in almost every other line). I only need the matches in columns 4, 5, 6, 7 and 8 for interpreting the results, and I also need to see the adjacent columns. How can I accomplish this? I do not want to use external languages such as R, Python or Perl; I am looking for solutions using command-line tools.

I use the following command to view the file:

bzcat myfile.tsv.bz2 | column -t | less -S 

Any input will be appreciated.

Example of what the data looks like (it is biological data within specific intervals):

col1  strt  end   Sample1  Sample2  Sample3  Sample4  Sample5  p.val1   p.val2   ID
ABC   1100  1200  2        2        2        2        3        NA       0.27403  PLD4
BCD   1200  1300  4        3        4        4        2        0.88831  0.37662  CYP46A1
CDE   1300  1400  2        1        4        2        1        0.77922  0.00519  CEBPE
DEF   1400  1500  6        4        4        4        4        0.88182  NA       BRCA
EFG   1500  1600  2        6        8        10       3        0.00779  0.01558  BRCA

Say I want to view the whole file but restrict my search to columns 4, 5, 6, 7 and 8. ~M


There are 2 answers

2
Ed Morton On BEST ANSWER

Until you edit your question to provide more info: is this what you want?

$ awk '$4==1 && $6==4' file
BCD  2    4  1     1    4    2

The above was run against your posted sample input file:

$ cat file
col1 srt end col4 col5 col6 col7
ABC  1    2  1     1    5    2
BCD  2    4  1     1    4    2
CDE  4    6  6     5    2    5
DEF  6    8  4     4    4    4
EFG  8   10  4     4    3    4

Given your comment below, is this what you want?

$ awk '{print $0 ($4==1 && $6==4 ? " <--- HERE I AM!" : "")}' file
col1 srt end col4 col5 col6 col7
ABC  1    2  1     1    5    2
BCD  2    4  1     1    4    2 <--- HERE I AM!
CDE  4    6  6     5    2    5
DEF  6    8  4     4    4    4
EFG  8   10  4     4    3    4
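
The same idea can be extended to the columns named in the question. The following is only a sketch, assuming the file is tab-separated and that an exact match on one value is what is wanted. The function name `mark_cols`, the variable `val` and the `<<<` marker are made up here for illustration. It prints every line unchanged and appends the marker when any of columns 4 to 8 equals the value, so matches in columns 2 and 3 are ignored.

```shell
# mark_cols VALUE: hypothetical helper, not part of any tool.
# Reads tab-separated text on stdin, prints every line, and appends
# a "<<<" marker to lines where any of columns 4-8 equals VALUE.
mark_cols() {
    awk -F'\t' -v val="$1" '
        {
            hit = 0
            for (i = 4; i <= 8; i++)        # only look at columns 4..8
                if ($i == val) { hit = 1; break }
            print $0 (hit ? "\t<<<" : "")   # keep the whole line either way
        }'
}

# Possible usage with the viewing command from the question:
# bzcat myfile.tsv.bz2 | mark_cols 6 | column -t | less -S
```

Inside less, searching for the marker with `/<<<` then jumps between the marked lines while the adjacent columns stay visible.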
1
Sobrique On

OK, so I'm going to assume that tsv means tab-separated values.

I would use Perl for this:

#!/usr/bin/perl

use strict;
use warnings;

my $search_term = "some_term"; 
my @columns_to_check =  ( 4,5,6,7,8 ); 

while ( <> ) {
    my @cols = split;
    for my $colnum ( @columns_to_check ) {
       # column numbers are 1-based, Perl arrays are 0-based
       if ( defined $cols[ $colnum - 1 ] && $cols[ $colnum - 1 ] =~ m/$search_term/ ) {
            print; 
            last; 
       }
    }
}

Note: $search_term is treated as a regular expression. Also, Perl arrays start at zero, so column 1 of your file is $cols[0].
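
Since the question asks for command-line tools rather than Perl, the same loop can also be written in awk. This is a rough sketch under a few assumptions: the input is tab-separated, the first line is a header worth keeping, and the names `filter_cols` and `re` are invented for this example. awk numbers fields from 1, so no index adjustment is needed.

```shell
# filter_cols REGEX: hypothetical helper, not part of any tool.
# Reads tab-separated text on stdin and prints the header plus every
# line where at least one of columns 4-8 matches REGEX.
filter_cols() {
    awk -F'\t' -v re="$1" '
        NR == 1 { print; next }          # assumption: line 1 is a header
        {
            for (i = 4; i <= 8; i++)     # awk fields are 1-based
                if ($i ~ re) { print; break }
        }'
}

# Possible usage with the viewing command from the question:
# bzcat myfile.tsv.bz2 | filter_cols '^10$' | column -t | less -S
```

Anchoring the pattern with `^` and `$` makes it match the whole field, so `^10$` will not also match 100 or 0.10. awk processes backslash escapes in values passed with `-v`, so a pattern containing backslashes may need them doubled.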