Is there a way to skip first x lines of a bz2 file in Python without calling next()?

I'm trying to read the latest Wikidata dump while skipping the first, say, 100 lines.

Is there a better way to do this than calling next() repeatedly?

import bz2

WIKIDATA_JSON_DUMP = bz2.open('latest-all.json.bz2', 'rt')

# Discard the first 100 lines one at a time
for _ in range(100):
    next(WIKIDATA_JSON_DUMP)

Alternatively, is there a way to split up the file in bash by, say, using bzcat to pipe select chunks to smaller files?

There are 2 answers

1
Tom Morris On

If the file had been compressed with something like bgzip, you could skip whole blocks, but each block would contain a variable number of lines, depending on the compression ratio. For a plain bzip2 file that is a single stream, I don't think you have any choice but to read and throw away the lines to be skipped, due to the nature of the compression format.
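A minimal sketch of the "read and throw away" approach, assuming the dump is UTF-8 text with one record per line. It uses itertools.islice instead of an explicit next() loop; the skipped lines are still decompressed and read, only the loop goes away. The function name iter_lines_after_skip is made up for illustration.

```python
import bz2
from itertools import islice


def iter_lines_after_skip(path, skip):
    """Yield text lines from a bz2 file, discarding the first `skip` lines."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        # islice consumes and drops the first `skip` lines, then yields the rest.
        # The skipped data is still decompressed; nothing is seeked over.
        yield from islice(fh, skip, None)
```

It could be used as: for line in iter_lines_after_skip('latest-all.json.bz2', 100): ... Because it is a generator, the file is opened on the first iteration and closed once the generator is exhausted or closed.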

0
Pineapples On

You can try the following in bash, to skip the first 10 lines for example:

bzcat /tmp/myfile.bz2 | tail -n +11

Note that tail -n +K starts output at line K, so to skip the first N lines you pass N+1 (here, +11 skips 10 lines).
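For the other half of the question (splitting the dump into smaller files), the same streaming idea can be sketched in Python rather than bash. This is only an illustration under assumptions: the function name split_bz2_by_lines, the output naming scheme, and the choice to recompress each chunk as bz2 are all made up here, and each chunk is held in memory before it is written.

```python
import bz2
from itertools import islice


def split_bz2_by_lines(path, lines_per_chunk, out_prefix):
    """Split a bz2 text file into smaller bz2 files of `lines_per_chunk` lines.

    Returns the list of output paths, named <out_prefix>0000.bz2, 0001.bz2, ...
    """
    out_paths = []
    with bz2.open(path, "rt", encoding="utf-8") as src:
        while True:
            # Read the next run of lines; an empty list means end of file.
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            out_path = f"{out_prefix}{len(out_paths):04d}.bz2"
            with bz2.open(out_path, "wt", encoding="utf-8") as dst:
                dst.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

The source is still decompressed front to back exactly once, but each resulting piece can then be opened directly without skipping anything.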