Unzip .gz file when line order is important


I am trying to unzip fastq.gz files and then analyze the sequencing data in them. However, the later analysis depends on line order being preserved in the unzipped files (line 1 of the zipped file must be line 1 of the unzipped file).

When I look at the files manually, line order seems to be preserved when I use gunzip to unzip the fastq.gz files (and I wouldn't expect anything else). However, downstream analysis fails because the order has not been preserved from the original file. Am I missing something about the unzipping process?

It appears that something like the following is happening.

Sequencer writes data to fastq.txt:

line1
line2
line3
line4

Then zips it into fastq.gz. I then unzip using gunzip and appear to get something like the following, where line order is disrupted:

line2
line1
line4
line3

There is 1 answer

thkala On

A gzip/gunzip cycle should not - and we can be reasonably confident that it does not - modify the contents of a file. Moreover, data corruption and algorithmic bugs in this case normally show up as a whole bunch of garbage, not as neatly reordered text lines.
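One way to convince yourself of this is to round-trip a few lines through gzip and compare what comes back. The following is only a minimal sketch using Python's standard gzip module; the sample lines and the helper name roundtrip_lines are made up for illustration and have nothing to do with your sequencer's actual output.

```python
import gzip


def roundtrip_lines(lines):
    """Compress text lines with gzip, decompress them, and return the lines read back."""
    # Join the lines into one newline-terminated byte string, as a text file would store them.
    raw = "".join(line + "\n" for line in lines).encode("ascii")
    compressed = gzip.compress(raw)
    restored = gzip.decompress(compressed)
    return restored.decode("ascii").splitlines()


original = ["line1", "line2", "line3", "line4"]  # illustrative data only
restored = roundtrip_lines(original)
print(restored == original)
```

For a real file, the same comparison can be done by reading the compressed copy with gzip.open(path, "rt") and checking it line by line against the uncompressed copy.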

A few alternatives:

  • Your sequencer does not actually output those lines properly ordered in the first place.

  • If multiple uncompressed files are involved, your sequencer may be doing the equivalent of gzip -c file* > fastq.gz, with the input files named file1 file2 ... file9 file10. The shell expands file* in lexicographic order, so file10 is processed before file2, which scrambles the order of the output.

  • If multiple compressed files are involved then the same mistake may be happening when decompressing.
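The ordering problem in the last two points is easy to reproduce. Here is a small sketch, assuming files named file1 ... file10 as in the example above (the names and the helper natural_key are hypothetical), that compares plain lexicographic sorting with a number-aware sort:

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that file2 sorts before file10."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


names = ["file1", "file10", "file2", "file9"]  # hypothetical file names

lexicographic = sorted(names)              # the order a plain sort of the names gives
natural = sorted(names, key=natural_key)   # number-aware order

print(lexicographic)  # ['file1', 'file10', 'file2', 'file9']
print(natural)        # ['file1', 'file2', 'file9', 'file10']
```

If this turns out to be the cause, processing the files in the number-aware order, or zero-padding the names (file01 ... file10) so that both orders agree, should keep the concatenated output in sequence.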