Python: Ignore EOF in XML file

144 views Asked by wustus At 20 October 2020 at 11:49

I'm currently working on a project that involves getting article-titles from the Wikipedia dump. The downloadable file is in .bz2 format and contains an XML file that would be about 80GB in size if I were to unpack it.

I can open and read the first few lines with Python, but my script stops reading after 43 lines, which is right where the first article page starts. I'm assuming that there's an EOF between the pages.

Is there any way to ignore it and continue reading? I don't really want to decompress the file or change it externally.

My code looks similar to this:

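import bz2

path = "dump.xml.bz2"  # placeholder: point this at the downloaded .bz2 dump

# Read the compressed file line by line without unpacking it to disk.
with bz2.BZ2File(path, "r") as dump:
    for i, line in enumerate(dump):
        print(type(line))  # bytes, because BZ2File reads in binary mode
        print(line)
        if i > 1000:  # stop after roughly the first 1000 lines
            break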
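
One possible explanation, not confirmed in this thread, is that the dump is a multistream archive: several independent bz2 streams concatenated into one file, with the XML header in the first stream. A reader that stops at the end of the first stream would then see only the header lines. Since Python 3.3, `bz2.BZ2File` and `bz2.open` read such files transparently, so this mostly matters on older interpreters or other readers. The sketch below, under that assumption, walks the streams by hand with `bz2.BZ2Decompressor`, using its `eof` and `unused_data` attributes. The helper name `iter_multistream_lines` and the `chunk_size` parameter are made up for illustration.

```python
import bz2


def iter_multistream_lines(fileobj, chunk_size=1024 * 1024):
    """Yield lines (as bytes) from a binary file object that holds
    one or more concatenated bz2 streams."""
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # decompressed bytes that do not yet end in a newline
    data = fileobj.read(chunk_size)
    while data:
        pending += decompressor.decompress(data)
        data = b""
        if decompressor.eof:
            # End of one stream: any leftover bytes belong to the next
            # stream, so hand them to a fresh decompressor.
            data = decompressor.unused_data
            decompressor = bz2.BZ2Decompressor()
        # Emit every complete line; keep the unfinished tail for later.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line + b"\n"
        if not data:
            data = fileobj.read(chunk_size)
    if pending:
        yield pending  # last line without a trailing newline
```

It would be used on the raw file, for example `with open(path, "rb") as raw:` followed by `for line in iter_multistream_lines(raw): ...`, which keeps memory use bounded by the chunk size and the longest line rather than by the 80GB of XML.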

There are 0 answers
