Seeking on a gz connection is unpredictable

1.2k views Asked by Spacedman At 14 June 2015 at 21:56

I'm having trouble seeking around in gzfile connections in R. Here's an example:

set.seed(123)
m=data.frame(z=runif(10000),x=rnorm(10000))
write.csv(m,"m.csv")
system("gzip m.csv")
file.info("m.csv.gz")$size
[1] 195975

That creates m.csv.gz, which R says it can seek on, and the help for seek seems to agree:

gzf=gzfile("m.csv.gz")
open(gzf,"rb")
isSeekable(gzf)
[1] TRUE

Now small jumps, back and forth, seem to work, but if I try a big jump, I get an error:

seek(gzf,10)
[1] 10
seek(gzf,20)
[1] 10
seek(gzf,10)
[1] 20
seek(gzf,1000)
[1] 100
Warning message:
In seek.connection(gzf, 1000) :
  seek on a gzfile connection returned an internal error

However if I reset the connection and start again, I can get to 1000 if I do it in 100-byte steps:

for(i in seq(100,1000,by=100)){seek(gzf,i)}
seek(gzf,NA)
[1] 1000

R has some harsh words on using seek in Windows: "Use of ‘seek’ on Windows is discouraged", but this is on a Linux box (R 3.1.1, 32 bit). Similar code in Python using the gzip module works fine, seeking all over.

R 3.2.0 is slightly more informative:

Warning messages:
1: In seek.connection(gzf, 1000) : invalid or incomplete compressed data
2: In seek.connection(gzf, 1000) :
  seek on a gzfile connection returned an internal error

Ideas? I've submitted this as a bug report now.

There are 2 answers

**davide** · Answer 1 · 2015-07-18T12:43:18+00:00

This is just an educated guess: small jumps are handled within the already-decoded buffer, but when you seek further than the buffer size, a raw seek is performed on the compressed file and decoding then starts in the middle of a compressed block, which leads to a decoding error. This is most likely a bug in R's gzfile handling. I suggest skipping forward (reading and discarding data) instead of seeking: the underlying library cannot do anything more efficient anyway, so there should be no performance cost.
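A minimal sketch of the "skip instead of seek" idea, assuming base R only. The helper name `skip_to` and the chunk size are hypothetical choices for illustration, not part of R; the helper opens a fresh connection and reads and discards bytes until the requested uncompressed offset is reached.

```r
# Hypothetical helper (not part of R): reach an absolute uncompressed offset
# by reading forward and discarding bytes instead of calling seek().
skip_to <- function(path, offset, chunk = 65536L) {
  con <- gzfile(path, "rb")
  remaining <- offset
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(chunk, remaining))
    if (length(got) == 0L) break  # end of stream reached before the offset
    remaining <- remaining - length(got)
  }
  con  # caller is responsible for closing the connection
}

# Build a compressed CSV similar to the one in the question.
tmp <- tempfile(fileext = ".csv.gz")
out <- gzfile(tmp, "w")
set.seed(123)
write.csv(data.frame(z = runif(10000), x = rnorm(10000)), out)
close(out)

# Skip 1000 uncompressed bytes, then read the next 20.
con <- skip_to(tmp, 1000)
after_skip <- readBin(con, what = "raw", n = 20)
close(con)

# Cross-check against a full decompression of the same file.
ref <- gzfile(tmp, "rb")
all_bytes <- readBin(ref, what = "raw", n = 1e7)
close(ref)
matches <- identical(after_skip, all_bytes[1001:1020])
```

Going backwards with this approach means reopening the connection and skipping again from the start, which is the price of not having an index.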

I just checked RFC 1952 and RFC 1951. In gzip you can work out the total uncompressed size without extracting, by reading the ISIZE field of every member (it is stored at the end of each member) and summing them. However, you cannot know how big a deflated block will be without decoding it (the size of each symbol depends on the dictionary), so you simply cannot truly seek in an ordinary gzip stream.
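As a rough illustration of the ISIZE point, the sketch below reads the last four bytes of a gzip file and interprets them as a little-endian integer. It assumes a single-member file under 4 GiB (ISIZE is the uncompressed size modulo 2^32) that was written by R's own `gzfile`; the payload is made up for the example.

```r
# Write a known payload into a single-member gzip file.
tmp <- tempfile(fileext = ".gz")
payload <- charToRaw(paste(rep("hello gzip\n", 1000), collapse = ""))
con <- gzfile(tmp, "wb")
writeBin(payload, con)
close(con)

# Read the compressed file as raw bytes, without decompressing it.
raw_bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
n <- length(raw_bytes)

# ISIZE is the final 4 bytes of the member, least significant byte first.
isize <- sum(as.integer(raw_bytes[(n - 3):n]) * 256^(0:3))

# For this file, isize should equal the uncompressed payload length.
isize_matches <- isize == length(payload)
```

This gives the total size only; it says nothing about where a given uncompressed offset lives inside the compressed data, which is why seeking still needs an index.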

If you truly want to seek in a gzip file, you must index it beforehand.

dictzip is a gzip-compatible tool and format that stores extra header data to allow seeking.

**jan-glx** · Answer 2 · 2018-03-02T15:48:40+00:00

If you, like me, just came here to read a gzipped file with readLines, you could try to use readLines(gzcon(file("your_file.txt.gz", "rb"))) instead of readLines(gzfile("your_file.txt.gz", "r")).
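A small self-contained sketch of that suggestion, using a temporary file with made-up contents in place of "your_file.txt.gz":

```r
# Create a small gzipped text file to read back.
tmp <- tempfile(fileext = ".txt.gz")
out <- gzfile(tmp, "w")
writeLines(c("first line", "second line", "third line"), out)
close(out)

# Wrap a plain binary file connection in gzcon() so decompression happens
# on the fly while readLines() consumes the stream sequentially.
con <- gzcon(file(tmp, "rb"))
lines <- readLines(con)
close(con)

print(lines)
```

Because readLines only ever moves forward through the stream, this route never needs seek at all.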
