I'm using pdf2txt (the command-line tool from the Python pdfminer package) on Ubuntu to extract text from some Norwegian PDFs.
The tool works perfectly with some of the PDFs, and I get the extracted text in a .txt file, but roughly half of them throw this error:
Traceback (most recent call last):
  File "/usr/bin/pdf2txt", line 101, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/bin/pdf2txt", line 95, in main
    caching=caching, check_extractable=True)
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_pdf
    interpreter.process_page(page)
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 757, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 770, in render_contents
    self.execute(list_value(streams))
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 795, in execute
    func(*args)
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 605, in do_BDC
    self.device.begin_tag(tag, props)
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfdevice.py", line 160, in begin_tag
    in sorted(props.iteritems()) )
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdfdevice.py", line 159, in <genexpr>
    s = ''.join( ' %s="%s"' % (enc(k), enc(str(v))) for (k,v)
  File "/usr/lib/python2.7/dist-packages/pdfminer/utils.py", line 164, in enc
    return x.encode(codec, 'xmlcharrefreplace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
I would understand it if none of them worked, or if all of them did, but they are all in Norwegian, so they use the same characters. Why would some work and others not?
There are even some PDFs that throw this error when I extract the text from page 1 only, yet work fine when I extract the text from page 2.
Here is the command I'm using:
pdf2txt -t tag -p 4 -A -o out/route/tag.txt in/route/405448.pdf
Here are three examples of the PDFs I'm using.
This one works perfectly for me: http://54.171.169.37/tilbud/pdf/magazines/404707/404707.pdf
This one doesn't work on any page: http://54.171.169.37/tilbud/pdf/magazines/404635/404635.pdf
And this one works only on some pages: http://54.171.169.37/tilbud/pdf/magazines/401944/401944.pdf
Any idea what is happening? Thanks in advance.
EDIT: I've realized that if I extract the text in the default mode instead of tag mode (pdf2txt -t tag), it works on the pages that failed before. So the problem is specific to the "tag" type of extraction.
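One possible reading of the traceback, offered as a guess rather than a confirmed diagnosis: in Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, and byte 0xfe at position 0 looks like the start of a UTF-16BE byte-order mark (0xFE 0xFF), which PDF text strings may carry. If a tag property on the failing pages holds such a string, enc() would fail exactly as shown, and only in tag mode, since that is where the properties get serialized. The sketch below reproduces the mechanism in Python 3. The sample bytes and the helper names try_ascii and decode_pdf_text are made up for illustration and are not part of pdfminer.

```python
# Hypothetical tag property value: the word "Brød" stored as UTF-16BE,
# prefixed with the byte-order mark 0xFE 0xFF. Not taken from the real PDFs.
raw = b"\xfe\xff\x00B\x00r\x00\xf8\x00d"


def try_ascii(data):
    """Return the error message from decoding data as ASCII, or None if it succeeds.

    Python 2 performs this decode implicitly when .encode() is called
    on a byte string.
    """
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


def decode_pdf_text(data):
    """Decode a PDF-style text string (illustrative helper).

    Strings starting with the UTF-16BE byte-order mark are decoded as UTF-16BE.
    Everything else is decoded as Latin-1, which is only a rough stand-in
    for PDFDocEncoding (an assumption made to keep the sketch short).
    """
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    return data.decode("latin-1")


if __name__ == "__main__":
    print(try_ascii(raw))
    print(decode_pdf_text(raw))
```

If this is indeed the cause, it would also explain why pages whose tag properties contain only plain ASCII values extract without errors while others in the same document fail.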