I'm currently working on a project that involves extracting article titles from the Wikipedia dump. The downloadable file is in .bz2 format and contains a single XML file that would be about 80 GB in size if I were to unpack it.
I can open the file and read the first few lines with Python, but my script stops reading after 43 lines, which is exactly where the first article page would start. I'm assuming that there's an EOF marker between the pages.
Is there any way to ignore it and continue reading? I don't really want to decompress the file or modify it externally.
My code looks similar to this:
import bz2

# path points to the downloaded .bz2 dump
dump = bz2.BZ2File(path, "r")
i = 0
for line in dump:
    print(type(line))
    print(line)
    # stop after roughly the first 1000 lines
    if i <= 1000:
        i += 1
    else:
        break
dump.close()
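One possible explanation, which I have not confirmed, is that the dump is a multi-stream bz2 file, meaning several compressed streams concatenated one after another, so a reader that only understands a single stream stops at the end of the first one. As far as I understand, bz2.BZ2File handles multi-stream input from Python 3.3 onwards. Below is a minimal sketch of doing the same thing by hand with bz2.BZ2Decompressor, assuming Python 3.3+ and UTF-8 content; the function name iter_multistream_lines and the chunk_size parameter are made up for illustration.

```python
import bz2


def iter_multistream_lines(path, chunk_size=1024 * 1024):
    """Yield decoded lines from a bz2 file that may contain several streams."""
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # decompressed bytes not yet terminated by a newline
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            while chunk:
                pending += decompressor.decompress(chunk)
                chunk = b""
                if decompressor.eof:
                    # End of one stream: hand any leftover compressed bytes
                    # to a fresh decompressor for the next stream.
                    chunk = decompressor.unused_data
                    decompressor = bz2.BZ2Decompressor()
                # Emit every complete line, keep the unterminated tail.
                *lines, pending = pending.split(b"\n")
                for line in lines:
                    yield line.decode("utf-8")
    # Emit a final line that has no trailing newline, if any.
    if pending:
        yield pending.decode("utf-8")
```

With something like this, the loop above could iterate over iter_multistream_lines(path) instead of the BZ2File object, without unpacking the dump to disk.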