Extract text from document in memory using docsplit


With the docsplit gem I can extract the text from a PDF or any other supported file type. For example, with the line:

 Docsplit.extract_text('doc.pdf')

I can get the text content of a PDF file.

I'm currently using Rails, and the PDF arrives in a request and lives in memory. Looking through the API and the source code, I couldn't find a way to extract the text from data in memory, only from a file on disk.

Is there a way to get the text of this PDF without creating a temporary file?

I'm using attachment_fu if it matters.

2

There are 2 answers

0
barbolo

Use a temporary directory:

require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

# Extracts the text of a PDF on disk and returns it as a string.
def pdf_to_text(pdf_filename)
  # docsplit writes <basename>.txt into the output directory
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)

  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)

  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)

  extracted_text
end

pdf_to_text('doc.pdf')
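
Since the PDF in the question lives in memory rather than on disk, one way to bridge the gap is to write the bytes to a short-lived temporary file and hand its path to pdf_to_text. This does not avoid the temporary file, it only keeps it confined and cleaned up. A minimal sketch using only the standard library; with_temp_document is a hypothetical helper name, not part of docsplit:

```ruby
require 'tempfile'

# Hypothetical helper (not part of docsplit): writes in-memory bytes to a
# temporary file, yields its path, and deletes the file when the block returns.
def with_temp_document(bytes, extension = '.pdf')
  Tempfile.create(['upload', extension]) do |file|
    file.binmode      # keep binary content byte-for-byte
    file.write(bytes)
    file.flush        # make sure the bytes are written before the path is read
    yield file.path   # the block's value becomes the method's return value
  end
end

# Usage sketch, assuming docsplit and pdf_to_text from above are loaded and
# uploaded_bytes (an assumed name) holds the PDF content from the request:
#   text = with_temp_document(uploaded_bytes) { |path| pdf_to_text(path) }
```

Passing the extractor as a block keeps the file's lifetime tied to a single call, so nothing is left behind if extraction raises.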

2
the Tin Man

If you have the content in a string, use StringIO to wrap it in a file-like object that code expecting an IO can read. StringIO doesn't care whether the content is text or binary; it handles both the same way.

Look at either of:

new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.

open(string=""[, mode]) {|strio| ...}
Equivalent to ::new, except that when it is called with a block, it yields the new instance, closes it afterwards, and returns the block's result.
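
The two forms can be sketched as follows. The bytes here are a stand-in, not a real PDF, and the question already notes that docsplit only accepts files, so this shows StringIO on its own rather than a docsplit call:

```ruby
require 'stringio'

# Stand-in bytes; in the question they would come from the uploaded attachment.
pdf_bytes = "%PDF-1.4\n\x00\xFF binary payload".b

# StringIO.new wraps the string in an IO-like object.
io = StringIO.new(pdf_bytes)
header = io.read(8)   # read the first eight bytes
io.rewind             # return to the start before reading again

# StringIO.open with a block yields the instance, closes it afterwards,
# and returns the block's result.
size = StringIO.open(pdf_bytes) { |strio| strio.read.bytesize }
```

Anything that reads from an IO-like object through read, gets, or each_line can consume io in place of an opened File.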