I have a very large file (40m rows x 400 columns).
It is structured like this:
chr pos snp
1 1 rs500
2 4 rs501
2 6 rs502
17 6 rs503
The file is named myfile.gz.
To search the 3rd column for a given value, the following works:
zcat myfile | grep rs500$
However, to search on two criteria, say chr = 17 and pos = 6, I was trying the following, but I can't get it to return any values.
zcat myfile | awk '{ if ($1 == 17 && $2 == 6) print }'
There is no error, but nothing is returned. I've done this kind of filtering in the past without issue when the file wasn't .gz compressed, such as with this command on a different, much larger file, which filters two columns on criteria and then retrieves the results:
awk '{ if (NR == 1 || ($39 >= 0.03 && $36 <= 1e-04)) print }' myfile.notgzcompressed
But I can't seem to combine that syntax with zcat, which I need because I don't want to have to unzip my huge archive.
EDIT to add information based on comments
zcat myfile.gz | head -2 | od -c
0000000 c h r \t p o s \t r e f \t a l t \t
0000020 c h r _ h g 1 9 \t p o s _ h g 1
0000040 9 \t r e f _ h g 1 9 \t a l t _ h
0000060 g 1 9 \t V E P _ e n s e m b l _
0000100 s u m m a r y \t r s _ d b S N P
0000120 1 5 1 \n 1 \t 1 0 1 8 0 \t T \t C \t
0000140 1 \t 1 0 1 8 0 \t T \t C \t W A S H
0000160 7 P ( 1 ) : d o w n s t r e a m
0000200 _ g e n e _ v a r i a n t ( 1 )
0000220 | D D X 1 1 L 1 ( 2 ) : u p s t
0000240 r e a m _ g e n e _ v a r i a n
0000260 t ( 2 ) \t r s 2 0 1 6 9 4 9 0 1
0000300 \n
For more info, I am using R and fread() to pass commands like this, so that unix does the parsing before the data is loaded into the R environment. The chr and pos lookup values have already been assigned.
fread(cmd = paste0("zcat ", myfile, " | awk ","'{ if ($1 == ", chr ," && $2 == ",pos,") print }'")) -> h2
I suspect that when piping zcat into awk with a humongous myfile, the problem might arise at the pipe (|). Namely, | has a limited, machine-dependent capacity (further reading: The Pipe Buffer Capacity in Linux); if your awk does not read quickly enough, | might become jammed with data.
If your data never has leading zeros and has fields separated by a single TAB character, and you are interested in the 1st field being equal to one value and the 2nd field being equal to another, then you might use GNU grep for that task. The 1st field holding 17 and the 2nd field holding 6 can be expressed as a single pattern: given some command which produces TAB-separated output, piping that output through grep with such a pattern gives only the matching lines as output.
Explanation: I instruct GNU grep to use a Perl-flavor regular expression, not to contaminate the output with escape sequences, and to look for lines starting with (^) 17, followed by a TAB character, followed by 6 spanning to a word boundary (\b), in order to prevent grabbing lines where the 2nd column starts with 6 but is not 6 (observe the last line of the command output). (Tested in GNU grep 3.7.)
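The original example commands did not survive in this copy, so here is a minimal sketch of what the explanation describes, not necessarily the exact command that was posted. It assumes that the Perl-flavor option is -P and that suppressing escape sequences means --color=never (both are GNU grep options). The function sample_rows is a hypothetical stand-in for "command", and its last row (rs504) is made up to show the word-boundary case; the other rows come from the question.

```shell
# Hypothetical stand-in for "command": prints TAB-separated rows.
sample_rows() {
    printf '1\t1\trs500\n'
    printf '2\t4\trs501\n'
    printf '2\t6\trs502\n'
    printf '17\t6\trs503\n'
    printf '17\t60\trs504\n'   # made-up row: 2nd column starts with 6 but is not 6
}

# -P             use a Perl-flavor regular expression (GNU grep only)
# --color=never  keep colour escape sequences out of the output
# ^17            line starts with 17 (1st field)
# \t             the TAB separator
# 6\b            6 followed by a word boundary, so 60 does not match
matches=$(sample_rows | grep --color=never -P '^17\t6\b')
printf '%s\n' "$matches"
```

Only the rs503 row is printed; the made-up 17/60 row is rejected by \b. Under the same assumptions, the question's file could be filtered with zcat myfile.gz | grep --color=never -P '^17\t6\b'.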