Extract text from document in memory using docsplit

Question

Asked by fotanus on 29 April 2013 at 18:54

With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:

 Docsplit.extract_text('doc.pdf')

I can get the text content of a PDF file.

I'm currently using Rails, and the PDF arrives in a request and lives only in memory. Looking through the API and the source code, I couldn't find a way to extract the text from memory, only from a file.

Is there a way to get the text of this PDF without creating a temporary file?

I'm using attachment_fu, if it matters.

There are 2 answers

**the Tin Man** · Answer 1 · 2013-04-29T22:54:32+00:00

If you have the content in a string, use StringIO to create a file-like object that code expecting an IO can read. StringIO doesn't care whether the content is text or binary; it treats both the same way.

Look at either of:

new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.

open(string=""[, mode]) {|strio| ...}
Equivalent to ::new, except that when called with a block it yields the new instance, closes it afterwards, and returns the block's result.
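
As a rough illustration of the two forms above, the sketch below wraps a binary string in a StringIO and reads from it as if it were a file. The `pdf_bytes` string is a made-up stand-in for the uploaded content, not real PDF data. Whether docsplit itself accepts such an object is a separate matter; the question reports finding only file-based entry points.

```ruby
require 'stringio'

# Hypothetical stand-in for the uploaded bytes; in a Rails app this would
# come from the request rather than a literal.
pdf_bytes = "%PDF-1.4\nfake body\n".b

# Wrap the string in a file-like object.
io = StringIO.new(pdf_bytes)

header = io.read(8)    # read the first 8 bytes, as with File#read
io.rewind              # seek back to the start
first_line = io.gets   # line-oriented reads work as well

# Block form: yields the instance, closes it afterwards, and returns
# the value of the block.
size = StringIO.open(pdf_bytes) { |strio| strio.read.bytesize }
```

Any method that only calls `read`, `gets`, `rewind` and similar IO methods on its argument should work with `io` in place of a File.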

**barbolo** · Answer 2 · 2015-01-06T12:08:30+00:00

Use a temporary directory:

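require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

# Extracts the text of a PDF on disk and returns it as a string.
def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)

  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)

  # Read the result back, then clean up the generated file.
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)

  extracted_text
end

pdf_to_text('doc.pdf')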
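
Since `pdf_to_text` takes a filename, content that lives only in memory has to reach the disk first. A minimal sketch of that step, assuming the upload is available as a binary string: `with_temp_document` is a hypothetical helper name, not part of docsplit, and it uses only Ruby's standard `Tempfile`. It does not avoid the temporary file the question asks about; it only confines the file to a block and removes it afterwards.

```ruby
require 'tempfile'

# Hypothetical helper (not part of docsplit): writes in-memory bytes to a
# temporary file, yields its path, and removes the file when the block ends.
def with_temp_document(bytes, extension = '.pdf')
  Tempfile.create(['upload', extension]) do |file|
    file.binmode
    file.write(bytes)
    file.flush # make sure the bytes are on disk before the path is used
    yield file.path
  end
end

# Usage with made-up content: read the file back to show the round trip.
seen_path = nil
round_trip = with_temp_document("%PDF-1.4 fake".b) do |path|
  seen_path = path
  File.binread(path)
end
```

With a real upload, the block could call the method above instead, for example `with_temp_document(bytes) { |path| pdf_to_text(path) }`.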
