Python: Ignore EOF in XML file


I'm currently working on a project that involves extracting article titles from the Wikipedia dump. The downloadable file is in .bz2 format and contains an XML file that would be about 80 GB in size if I were to unpack it.

I can open the file and read the first few lines with Python, but my script stops reading after 43 lines, which is exactly where the first article page starts. I'm assuming that there's an EOF between the pages.

Is there any way to ignore it and continue reading? I don't really want to decompress the file or change it externally.

My code looks similar to this:

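import bz2

# `path` is the location of the downloaded .bz2 dump (defined elsewhere)
with bz2.BZ2File(path, "r") as dump:
    # Print the type and content of roughly the first 1000 lines, then stop
    for i, line in enumerate(dump):
        print(type(line))
        print(line)
        if i > 1000:
            break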
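
A minimal sketch of one way to keep reading past an early end-of-stream. It rests on an assumption the question does not confirm: that the archive is made of several bz2 streams concatenated together, and that the reader stops at the end of the first one. The helper name `iter_multistream_lines` and the `chunk_size` parameter are hypothetical and chosen only for illustration. The sketch uses `bz2.BZ2Decompressor` from the standard library and starts a fresh decompressor whenever `eof` is reported, feeding it the leftover bytes from `unused_data`.

```python
import bz2


def iter_multistream_lines(path, chunk_size=1024 * 1024):
    """Yield decoded lines from a file holding one or more concatenated bz2 streams.

    Hypothetical helper: assumes UTF-8 text and newline-separated lines.
    """
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # decompressed bytes that do not yet form a complete line
    with open(path, "rb") as raw:
        data = raw.read(chunk_size)
        while data:
            pending += decompressor.decompress(data)
            if decompressor.eof:
                # One stream has ended; leftover bytes belong to the next stream.
                data = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
                if not data:
                    data = raw.read(chunk_size)
            else:
                data = raw.read(chunk_size)
            # Everything before the last newline is a complete line.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                yield line.decode("utf-8")
    if pending:
        # Final line without a trailing newline.
        yield pending.decode("utf-8")
```

The generator never holds the whole decompressed file in memory, so it can be iterated like the original `for line in dump:` loop. On Python 3.3 and later, `bz2.BZ2File` and `bz2.open` are documented to read multi-stream files themselves, so if the assumption above holds, running the original loop on a newer interpreter may already be enough.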

There are 0 answers