How to get text from a PDF while preserving the original formatting (with CTX_DOC)?

I use this code to filter text from a PDF file:

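create or replace directory pdf_dir as '&1';

create or replace directory l_curr_dir as '&3';

declare
  ll_clob     CLOB;
  l_bfile     BFILE;
  l_filename  VARCHAR2(200) := '&2';
begin
  -- Drop leftovers from a previous run; each drop is guarded separately
  -- so that a missing policy does not prevent dropping the preference.
  begin
    ctx_ddl.drop_policy('test_policy1');
  exception when others then
    null;
  end;
  begin
    ctx_ddl.drop_preference('testfilter');
  exception when others then
    null;
  end;

  -- The same policy name must be used for create, filter, and drop.
  ctx_ddl.create_preference('testfilter', 'AUTO_FILTER');
  ctx_ddl.create_policy('test_policy1', 'testfilter');

  l_bfile := bfilename('PDF_DIR', l_filename);

  dbms_lob.fileopen(l_bfile);

  -- Filter the PDF to plain text into ll_clob.
  ctx_doc.policy_filter(
      policy_name => 'test_policy1'
    , document    => l_bfile
    , restab      => ll_clob
    , plaintext   => true
    , charset     => 'US7ASCII'
  );

  dbms_lob.fileclose(l_bfile);

  -- Write the filtered text to the output directory.
  dbms_xslprocessor.clob2file(ll_clob, 'L_CURR_DIR', '&4');
end;
/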

This works for me, but is there any way to keep tabular data intact? Right now the text is filtered phrase by phrase, with each phrase on its own line.

For example, if the PDF contains values like:

Name:            Amount  
Pradeep          100 USD  

I want the output as is, but the current setup produces:

Name:
Amount
Pradeep
100 USD

Is there any way to keep the original layout of the text in the PDF?

Is it possible to change the filter?
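
One variation I am considering, as an untested sketch: CTX_DOC.POLICY_FILTER accepts plaintext => false, which asks AUTO_FILTER for an HTML version of the document instead of plain text. I have not verified whether that HTML keeps the two columns on the same row. The sketch assumes a policy named test_policy1 already exists with the AUTO_FILTER preference, and the output file name filtered.html is just a placeholder I made up.

```sql
declare
  l_html   CLOB;
  l_bfile  BFILE := bfilename('PDF_DIR', '&2');
begin
  dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);

  -- Same call as before, but requesting HTML output instead of plain text.
  -- Assumption: policy test_policy1 exists and uses AUTO_FILTER.
  ctx_doc.policy_filter(
      policy_name => 'test_policy1'
    , document    => l_bfile
    , restab      => l_html
    , plaintext   => false
  );

  dbms_lob.fileclose(l_bfile);

  -- 'filtered.html' is a hypothetical output name.
  dbms_xslprocessor.clob2file(l_html, 'L_CURR_DIR', 'filtered.html');
end;
/
```

If the HTML output does carry table or positioning markup, the layout could presumably be rebuilt from it, but that is an assumption on my part.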
