Seeking on a gz connection is unpredictable

I'm having trouble seeking around gzfiles in R. Here's an example:

set.seed(123)
m=data.frame(z=runif(10000),x=rnorm(10000))
write.csv(m,"m.csv")
system("gzip m.csv")
file.info("m.csv.gz")$size
[1] 195975

That creates m.csv.gz, which R says it can seek on, and the help for seek seems to agree:

gzf=gzfile("m.csv.gz")
open(gzf,"rb")
isSeekable(gzf)
[1] TRUE

Now small jumps, back and forth, seem to work, but if I try a big jump, I get an error:

seek(gzf,10)
[1] 10
seek(gzf,20)
[1] 10
seek(gzf,10)
[1] 20
seek(gzf,1000)
[1] 100
Warning message:
In seek.connection(gzf, 1000) :
  seek on a gzfile connection returned an internal error

However if I reset the connection and start again, I can get to 1000 if I do it in 100-byte steps:

for(i in seq(100,1000,by=100)){seek(gzf,i)}
seek(gzf,NA)
[1] 1000

The R documentation has some harsh words on using seek on Windows: "Use of ‘seek’ on Windows is discouraged." But this is on a Linux box (R 3.1.1, 32-bit). Similar code in Python using the gzip module works fine, seeking all over.

R 3.2.0 is slightly more informative:

Warning messages:
1: In seek.connection(gzf, 1000) : invalid or incomplete compressed data
2: In seek.connection(gzf, 1000) :
  seek on a gzfile connection returned an internal error

Ideas? I've submitted this as a bug report now.

There are 2 answers

davide

This is just an educated guess: small jumps are handled within the decoded buffer, but when you seek further than the buffer size, R seems to perform a raw seek on the compressed file and then tries to decode gzip data from the middle of a block, which leads to a decoding error. This is most likely a bug in R's gzip handling. I suggest skipping (reading and discarding data) instead of seeking: the underlying library cannot do anything smarter for a forward jump, so this should not cost you any performance.
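A minimal sketch of the skip approach, assuming you only need to move forward and that you track the current position yourself. The helper name skip_to and its arguments are made up for illustration; only gzfile, readBin and writeBin are real R functions here.

```r
# Hypothetical helper: advance a gzfile connection to an absolute
# uncompressed offset by reading and discarding bytes instead of seek().
# 'pos' is the caller's record of the current position (0 after opening).
skip_to <- function(con, offset, pos = 0, chunk = 65536L) {
  stopifnot(offset >= pos)
  remaining <- offset - pos
  while (remaining > 0) {
    got <- length(readBin(con, what = "raw", n = min(chunk, remaining)))
    if (got == 0L) stop("reached end of stream before offset ", offset)
    remaining <- remaining - got
  }
  invisible(offset)
}

# Self-contained demonstration on a temporary gzipped file.
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wb")
writeBin(as.raw(rep(0:255, times = 40)), out)  # 10240 bytes with known values
close(out)

con <- gzfile(path, "rb")
skip_to(con, 1000)
bytes <- readBin(con, what = "raw", n = 4L)    # bytes at offsets 1000..1003
close(con)
```

To go backwards with this approach you would close and reopen the connection and skip forward again from 0.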

I just checked RFC 1952 and RFC 1951. In gzip you can learn the total uncompressed size without extracting the data by summing the ISIZE fields of all members (ISIZE is stored in each member's trailer, modulo 2^32). However, you cannot know how big a deflated block will be without decoding it (the length of each symbol's code is defined by the block's code tables), so you simply cannot truly seek in an ordinary gzip stream.
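As a rough illustration of the ISIZE point, here is a sketch that reads the last four bytes of a gzip file. It assumes the file contains a single member and that the uncompressed size is below 2^31 bytes, since R reads the field as a signed 32-bit integer.

```r
# Write 10240 known bytes to a temporary gzipped file.
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wb")
writeBin(as.raw(rep(0:255, times = 40)), out)
close(out)

# Read the compressed file as plain bytes (no decompression involved).
compressed <- readBin(path, what = "raw", n = file.info(path)$size)

# ISIZE is the last 4 bytes of the member, stored little-endian.
trailer <- compressed[(length(compressed) - 3):length(compressed)]
isize <- readBin(trailer, what = "integer", n = 1L, size = 4L, endian = "little")
```

This gives the size of the data, not the location of any particular byte inside the compressed stream, which is why it does not help with seeking.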

If you truly want to seek a gzip you must index it beforehand.

dictzip is a tool that compresses the data in chunks and stores a chunk index in the gzip header's extra field, which allows seeking.
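The indexing idea can be sketched in R by compressing fixed-size chunks separately, so that a read only has to decompress the chunks it touches. This is an in-memory illustration of the principle, not the dictzip file format; chunk_size and read_at are names invented for this example.

```r
chunk_size <- 1024L
data <- as.raw(rep(0:255, times = 40))   # 10240 bytes with known values

# Compress each chunk on its own; the list position acts as the index.
starts <- seq(1L, length(data), by = chunk_size)
chunks <- lapply(starts, function(s) {
  memCompress(data[s:min(s + chunk_size - 1L, length(data))], type = "gzip")
})

# Hypothetical reader: return n bytes starting at a 0-based offset,
# decompressing only the chunks that cover the requested range.
read_at <- function(chunks, chunk_size, offset, n) {
  first <- offset %/% chunk_size + 1
  last <- (offset + n - 1) %/% chunk_size + 1
  buf <- do.call(c, lapply(chunks[first:last], memDecompress, type = "gzip"))
  from <- offset %% chunk_size + 1
  buf[from:(from + n - 1)]
}

inside <- read_at(chunks, chunk_size, offset = 1000, n = 4)  # within one chunk
across <- read_at(chunks, chunk_size, offset = 1022, n = 4)  # spans two chunks
```

Smaller chunks make random reads cheaper but usually compress worse, so the chunk size is a trade-off.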

jan-glx

If you, like me, just came here to read a gzipped file with readLines, you could try readLines(gzcon(file("your_file.txt.gz", "rb"))) instead of readLines(gzfile("your_file.txt.gz", "r")).