I want to extract text from some PDF files (programmatically, with a utility, or even by copy/paste), but some characters come out garbled. Although I specify UTF-8 encoding when extracting the text, characters like "ș, ț, ă" come out as "„ ˛" instead of the displayed characters (or at least plain "s, t, a").
The text is displayed correctly, but when I try to copy it, for example, those characters come out wrong.
Is there some way to extract the text correctly (with Java, C, Python, etc., or a Windows/Linux utility), or are those PDF files corrupted in some way?
Correctly extracting text from a PDF (UTF-8)
1.7k views Asked by Andrei F At
There is 1 answer
Can you extract the text correctly in Acrobat from the PDF?
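As a complementary check, it can help to look at exactly which code points the extraction produced. The following is a minimal Python sketch using only the standard library; the function name `describe_non_ascii` is hypothetical, and the sample strings are copied from the question. It does not fix the extraction, it only reports what came out.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each distinct
    non-ASCII character in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in seen
    ]


if __name__ == "__main__":
    # Sample strings taken from the question: what is displayed in the
    # PDF versus what copy/paste or a text extractor returned.
    displayed = "ș, ț, ă"
    extracted = "„ ˛"

    print("displayed:")
    for row in describe_non_ascii(displayed):
        print("  ", row)

    print("extracted:")
    for row in describe_non_ascii(extracted):
        print("  ", row)
```

For the sample in the question, the extracted characters are reported as U+201E (DOUBLE LOW-9 QUOTATION MARK) and U+02DB (OGONEK). Both are valid Unicode characters, which suggests the UTF-8 output step itself is probably working and that the wrong characters are chosen earlier, when the PDF's font glyphs are mapped back to text. If that is the case, changing the output encoding would not be expected to help, and checking whether Acrobat copies the text correctly is a reasonable next step.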