I'm having trouble seeking around gzfiles in R. Here's an example:
set.seed(123)
m=data.frame(z=runif(10000),x=rnorm(10000))
write.csv(m,"m.csv")
system("gzip m.csv")
file.info("m.csv.gz")$size
[1] 195975
That creates m.csv.gz, which R says it can seek on, and the help for seek seems to agree:
gzf=gzfile("m.csv.gz")
open(gzf,"rb")
isSeekable(gzf)
[1] TRUE
Now small jumps, back and forth, seem to work, but if I try a big jump, I get an error:
seek(gzf,10)
[1] 10
seek(gzf,20)
[1] 10
seek(gzf,10)
[1] 20
seek(gzf,1000)
[1] 100
Warning message:
In seek.connection(gzf, 1000) :
seek on a gzfile connection returned an internal error
However if I reset the connection and start again, I can get to 1000 if I do it in 100-byte steps:
for(i in seq(100,1000,by=100)){seek(gzf,i)}
seek(gzf,NA)
[1] 1000
R has some harsh words on using seek in Windows: "Use of ‘seek’ on Windows is discouraged." but this is on a Linux box (R 3.1.1, 32 bit). Similar code in Python using the gz library works fine, seeking all over.
R 3.2.0 is slightly more informative:
Warning messages:
1: In seek.connection(gzf, 1000) : invalid or incomplete compressed data
2: In seek.connection(gzf, 1000) :
seek on a gzfile connection returned an internal error
Ideas? I've submitted this as a bug report now.
This is just an educated guess: small jumps are handled within the decoded buffer, but when you seek further than the buffer size, R performs a raw seek on the compressed file and then tries to decode gzip data from the middle of a block, which leads to the decoding error. This is most likely a bug in R's gzfile handling. I suggest skipping (reading and discarding bytes) instead of seeking: the underlying library cannot do anything smarter than that anyway, so it should not cost you any performance.
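A minimal sketch of the skip approach, assuming you are happy to reopen the file for every jump. The helper name gz_skip_to and the chunk size are made up for illustration; only gzfile, readBin and close are real R functions. It reads and discards uncompressed bytes until the requested offset is reached, then hands back the open connection:

```r
# Hypothetical helper: position a gzfile connection at an uncompressed
# offset by reading and discarding bytes instead of calling seek().
gz_skip_to <- function(path, offset, chunk = 65536L) {
  con <- gzfile(path, "rb")
  remaining <- offset
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(chunk, remaining))
    if (length(got) == 0L) {
      close(con)
      stop("offset is beyond the end of the uncompressed data")
    }
    remaining <- remaining - length(got)
  }
  con  # the caller is responsible for closing it
}

# Small self-contained demo file: 25600 bytes cycling through 0..255.
demo_path <- tempfile(fileext = ".gz")
out <- gzfile(demo_path, "wb")
writeBin(as.raw(rep(0:255, 100)), out)
close(out)

con <- gz_skip_to(demo_path, 1000)
byte_at_1000 <- readBin(con, what = "raw", n = 1)
close(con)
```

Jumping backwards means calling the helper again from the start, which is what any gzip reader has to do internally for a plain gzip stream.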
I just checked RFC 1952 and RFC 1951. In gzip you can know the complete uncompressed size of the file before extracting it by summing the ISIZE fields of all its members (ISIZE is stored at the end of each member). However, you cannot know how big a deflated block will be without decoding it, because the bit length of each symbol is defined by the block's own Huffman code. So you simply cannot truly seek in an ordinary gzip stream.
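As a rough illustration of the ISIZE point, here is a sketch that reads the last four bytes of a gzip file and interprets them as a little-endian integer. The function name gz_isize is invented for this example. It assumes the file holds a single member and that the uncompressed data is smaller than 4 GiB, since RFC 1952 defines ISIZE as the size modulo 2^32:

```r
# Hypothetical helper: read ISIZE (last 4 bytes, little-endian) of a
# single-member gzip file. Reads the whole compressed file for simplicity.
gz_isize <- function(path) {
  size <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)
  last4 <- as.integer(bytes[(size - 3):size])
  sum(last4 * 256^(0:3))
}

# Demo file with a known uncompressed size of 25600 bytes.
isize_path <- tempfile(fileext = ".gz")
out <- gzfile(isize_path, "wb")
writeBin(as.raw(rep(0:255, 100)), out)
close(out)

uncompressed_size <- gz_isize(isize_path)
```

This tells you how far you could seek, but not where any given uncompressed offset lives in the compressed file, which is the part that needs an index.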
If you truly want to seek a gzip you must index it beforehand.
dictzip is a gzip-compatible tool and format that compresses the data in chunks and stores a chunk table in the gzip header's extra field to allow seeking.
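To make the indexing idea concrete, here is a toy sketch of chunked compression with an offset table. It is not the dictzip format and it does not index an existing .gz file: it compresses the data as independent chunks up front, which is the same general idea. The names chunked_compress and read_at are made up; memCompress and memDecompress are base R. For brevity, a read that runs past the end of a chunk is truncated at the chunk boundary:

```r
# Compress a raw vector as independent chunks and remember where each
# compressed chunk starts, so any chunk can be decoded on its own.
chunked_compress <- function(data, chunk_size = 4096L) {
  starts <- seq(1, length(data), by = chunk_size)
  parts <- lapply(starts, function(s) {
    memCompress(data[s:min(s + chunk_size - 1, length(data))], type = "gzip")
  })
  lens <- vapply(parts, length, integer(1))
  list(
    blob = do.call(c, parts),                 # all compressed chunks, concatenated
    offsets = cumsum(c(0, head(lens, -1))),   # 0-based start of each chunk in blob
    lengths = lens,
    chunk_size = chunk_size
  )
}

# Read up to n bytes starting at a 0-based uncompressed offset,
# decoding only the chunk that contains it.
read_at <- function(idx, offset, n) {
  k <- offset %/% idx$chunk_size + 1
  from <- idx$offsets[k] + 1
  chunk <- memDecompress(idx$blob[from:(from + idx$lengths[k] - 1)], type = "gzip")
  within <- offset %% idx$chunk_size
  chunk[(within + 1):min(within + n, length(chunk))]
}

data <- as.raw(rep(0:255, 100))
idx <- chunked_compress(data)
three_bytes <- read_at(idx, 10000, 3)
```

The trade-off is a slightly worse compression ratio, because each chunk starts with no history, in exchange for random access at the cost of decoding one chunk.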