pdfminer mixes order of lines


I'm extracting text from a PDF using pdfminer.six. The source text looks like this:

[image: the question as it appears in the PDF]

After parsing it, my result is as below:

Nr 48. Promująco na rozwój chorób alergicznych i wystąpienie objawów alergii 
działa zwiększenie aktywności/ilości: 

1) limfocytów Th1; 
2) limfocytów Th2; 
3) limfocytów Th17; 
Prawidłowa odpowiedź to: 
B. tylko 2. 
A. 1,4. 

4) IL-5. 
5) IL-12. 

C. 1,3. 

D. 2,4. 

E. 3,5. 

The order of the lines is mixed up. Is there a way to prevent this, for example by forcing pdfminer to read the file line by line? I have tried converting the PDF to HTML, but the result is a mess of separate span tags, one for each word.


There is 1 answer

mik.ro

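OK, I found a solution: increase char_margin to 20 in LAParams. char_margin sets how close two characters must be (relative to character width) to be treated as part of the same line; the default is 2.0, so widely spaced fragments of one row were presumably being split into separate lines.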

from pdfminer.layout import LAParams
laparams = LAParams(char_margin=20)
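
For completeness, here is a minimal sketch of how that laparams object might be passed to the extractor. It assumes the text is extracted with pdfminer.six's high-level extract_text function (the question does not say which entry point was used), and the helper name extract_text_wide_lines is made up for illustration.

```python
def extract_text_wide_lines(pdf_path, char_margin=20.0):
    """Extract text from a PDF, letting widely spaced fragments join one line.

    Hypothetical helper: the name and the default of 20 follow the answer
    above and are not part of pdfminer.six itself.
    """
    # Imported inside the function so the helper can be defined even on a
    # machine where pdfminer.six is not installed yet.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # char_margin is relative to character width; pdfminer.six defaults to 2.0.
    laparams = LAParams(char_margin=char_margin)

    # extract_text runs layout analysis with the given parameters and
    # returns the text of the whole document as one string.
    return extract_text(pdf_path, laparams=laparams)
```

Calling extract_text_wide_lines("exam.pdf") (the file name is a placeholder) should return the text with each row kept together, though the right char_margin value likely depends on how far apart the columns are in a given PDF.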