I am trying to unzip fastq.gz files and then analyze the sequencing data within them. However, later analysis depends on line order being preserved in the unzipped files (line 1 of the zipped file must be line 1 of the unzipped file).
When I manually look at the files, it seems to me that line order is being preserved when using gunzip to unzip the fastq.gz files (and I wouldn't expect anything else). However, downstream analysis fails because order has not been preserved from the original file. Am I missing something about the unzipping process?
It appears that something like the following is happening.
Sequencer writes data to fastq.txt:
line1
line2
line3
line4
Then zips it into fastq.gz. I then unzip using gunzip and appear to get something like the following, where line order is disrupted:
line2
line1
line4
line3
A gzip/gunzip cycle should not (and we can be reasonably confident that it does not) modify the contents of a file. Moreover, data corruption and algorithmic bugs in this case normally show up as a whole bunch of garbage, not as neatly reordered text lines.
A few alternatives:
Your sequencer does not actually output those lines properly ordered in the first place.
If multiple uncompressed files are involved, it may be that your sequencer does the equivalent of
gzip -c file* > fastq.gz
with the input files being named file1 file2 ... file9 file10. When file* is expanded in alphabetical order for such files, file10 will be processed before file2, thus messing up the order in the output.
If multiple compressed files are involved, then the same mistake may be happening when decompressing.
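The file-ordering effect can be reproduced in a small Python sketch. It is only an illustration under assumptions: the ten in-memory "files" and their one-line contents are made up, `concat_gzip` and `natural_key` are hypothetical helper names, and plain string sorting stands in for how the shell would typically expand file*. It joins one gzip member per input, which is what gzip -c does with several inputs, and then decompresses the result.

```python
import gzip
import re

# Hypothetical per-file contents: file1 holds "line1", ..., file10 holds "line10".
files = {f"file{i}": f"line{i}\n".encode() for i in range(1, 11)}

def concat_gzip(names):
    """Mimic `gzip -c <names> > fastq.gz`: one gzip member per input, in the given order."""
    return b"".join(gzip.compress(files[name]) for name in names)

def natural_key(name):
    """Sort key that compares embedded digit runs as integers, so file2 < file10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

glob_order = sorted(files)                      # plain string order, standing in for file*
natural_order = sorted(files, key=natural_key)  # numeric-aware order

# Decompression returns each member's bytes unchanged, in the order they were joined.
lines_glob = gzip.decompress(concat_gzip(glob_order)).decode().splitlines()
lines_natural = gzip.decompress(concat_gzip(natural_order)).decode().splitlines()

print(glob_order[:3])     # ['file1', 'file10', 'file2']
print(lines_glob[:3])     # ['line1', 'line10', 'line2']
print(lines_natural[:3])  # ['line1', 'line2', 'line3']
```

In both runs the compression step returns every byte intact; only the order in which the inputs were listed differs. If something like this is the cause, listing the files in an explicit numeric order (or using zero-padded names such as file01) would avoid it.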