I am using Oracle Outside In to extract the text content of a PDF document.
I pass the parameters below to the main function of the CASample.c
file from the Content Access download at https://www.oracle.com/middleware/technologies/outside-in-technology-downloads.html#
C:\adobe-acrobat.pdf -u C:\adobe-acrobat.txt
This gives me output in the following format:
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 8, Character Set = 0x00030100.
Outside
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 3, Character Set = 0x00030100.
In
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 8, Character Set = 0x00030100.
Unlocks
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 9, Character Set = 0x00030100.
Business
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 10, Character Set = 0x00030100.
Documents
SCCCA_TEXT: dwSubType = 0x08020001, Number of Characters = 4, Character Set = 0x00030100.
for
SCCCA_TEXT: dwSubType = 0x08020002, Number of Characters = 1, Character Set = 0x00030100.
How do I get only the text, without the metadata? Instead of the entire output above, I only need: Outside In Unlocks Business Documents for
Or do I have to write my own parser to extract that data?
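For reference, the "own parser" option could be sketched as below. This is only a minimal sketch, assuming that every metadata line in the dump starts with "SCCCA_" (as in the sample output above) and that every other non-empty line is document text. The function name strip_ca_headers is hypothetical and not part of Outside In; a text line that itself began with "SCCCA_" would be dropped by mistake.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Outside In: copies the text lines of a
 * CASample-style dump into out, skipping every line that starts with
 * "SCCCA_". Text lines are trimmed and joined with single spaces.
 * Returns the number of characters written, excluding the final NUL. */
size_t strip_ca_headers(const char *dump, char *out, size_t out_size)
{
    static const char prefix[] = "SCCCA_";
    const size_t prefix_len = sizeof prefix - 1;
    size_t written = 0;

    if (out_size == 0)
        return 0;

    while (*dump != '\0') {
        /* Locate the end of the current line and the start of the next. */
        size_t len = strcspn(dump, "\n");
        const char *next = dump + len;
        if (*next == '\n')
            next++;

        /* Trim leading and trailing blanks (including '\r' from CRLF files). */
        while (len > 0 && (*dump == ' ' || *dump == '\t' || *dump == '\r')) {
            dump++;
            len--;
        }
        while (len > 0 && (dump[len - 1] == ' ' || dump[len - 1] == '\t' ||
                           dump[len - 1] == '\r'))
            len--;

        /* Assumption: a line is metadata exactly when it begins with "SCCCA_". */
        int is_header = len >= prefix_len && strncmp(dump, prefix, prefix_len) == 0;

        if (len > 0 && !is_header) {
            /* Separate consecutive text runs with one space. */
            if (written > 0 && written + 1 < out_size)
                out[written++] = ' ';

            /* Copy as much of the line as still fits in the buffer. */
            size_t room = out_size - 1 - written;
            size_t n = len < room ? len : room;
            memcpy(out + written, dump, n);
            written += n;
        }
        dump = next;
    }

    out[written] = '\0';
    return written;
}
```

To use it, one would read the generated .txt file into a buffer and pass that buffer as dump; for the sample above the result should be the plain words separated by spaces. Output that does not fit in out is silently truncated.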
The downloaded files also include a tademo.vcxproj project, which extracts the text. It is a desktop application that you can convert to a library. Download: https://www.oracle.com/middleware/technologies/outside-in-technology-downloads.html#
After converting it to a library, I created the following function in the tademo.c file. It takes the input file and exports a text file as output.