Ruby on Rails Monday, January 31, 2011

pdftk, pdfbox (java), pdfkit

Garrett Lancaster



Walter Lee Davis
January 31, 2011 11:32 AM


On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:


I did this using Paperclip and defining a processor for Paperclip as follows:

#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

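    attr_accessor :whiny

    # Paperclip::Processor does not set @basename or @whiny itself, so set them here
    def initialize(file, options = {}, attachment = nil)
      super
      @whiny    = options[:whiny].nil? ? true : options[:whiny]
      @basename = File.basename(@file.path, File.extname(@file.path))
    end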

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
        Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

#app/models/document.rb
class Document < ActiveRecord::Base
  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]
  after_post_process :extract_text

  private

  # Reads the extracted text file and stores an ASCII-only copy on the record
  def extract_text
    plain_text = ""
    File.open(pdf.queued_for_write[:text].path, "r") do |file|
      while (line = file.gets)
        plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
      end
    end
    self.plain_text = plain_text # text column to hold the extracted text for searching
  end
end
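
The Iconv call above drops anything that cannot be represented in ASCII. As a minimal sketch, assuming Ruby 1.9 or later (where String#encode is available), the same clean-up could be done without Iconv. The helper name ascii_only is hypothetical and not part of the original recipe:

```ruby
# Hypothetical helper (an assumption, not from the original post):
# converts UTF-8 text to ASCII, silently dropping any byte sequence
# that is invalid or has no ASCII equivalent.
def ascii_only(text)
  text.encode("ASCII", "UTF-8",
              :invalid => :replace,  # drop malformed byte sequences
              :undef   => :replace,  # drop characters with no ASCII form
              :replace => "")        # replace them with nothing
end
```

Inside extract_text, this would stand in for the Iconv.conv call on each line.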

I had to find and install the creaky-old pdftotext utility on my server (happily, there was an apt-get package for it) and configure the path correctly. When Paperclip accepts a PDF upload, it creates a text extraction of that file and saves it in system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf extension, the file itself is actually just the plain text extracted from the original PDF.

After quite a lot of googling and begging my local Ruby group, I got the recipe for ripping open that text file and reading it into a variable to store on the record. The text you get out of pdftotext will vary wildly in quality and comprehensiveness, but since all I needed was a way to feed a simple search system, it works fine for my needs. I never show this text to anyone; I just use it as the "keywords" for search. You may want or need to present an editing field for the administrator to clean up these extracted texts.
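
To check the pdftotext install and path outside of Paperclip, something like the following rough sketch may help. It only uses the Ruby standard library. The helper names (find_executable, pdftotext_command, extract_pdf_text) are made up for illustration, and the default binary path is simply the one used in the processor above:

```ruby
require "open3"

# Hypothetical helper: looks for an executable by name in the PATH directories.
# Returns the full path, or nil if nothing was found.
def find_executable(name, path = ENV["PATH"].to_s)
  path.split(File::PATH_SEPARATOR)
      .map { |dir| File.join(dir, name) }
      .find { |candidate| File.executable?(candidate) && !File.directory?(candidate) }
end

# Hypothetical helper: builds the same command the processor runs,
# as an argument list so that paths with spaces need no quoting.
def pdftotext_command(src_path, dst_path, binary = "/usr/bin/pdftotext")
  [binary, "-nopgbrk", File.expand_path(src_path), File.expand_path(dst_path)]
end

# Runs pdftotext and returns the extracted text.
# Raises if the command exits with a non-zero status.
def extract_pdf_text(src_path, dst_path, binary = "/usr/bin/pdftotext")
  _stdout, stderr, status = Open3.capture3(*pdftotext_command(src_path, dst_path, binary))
  raise "pdftotext failed: #{stderr}" unless status.success?
  File.read(dst_path)
end
```

Calling find_executable("pdftotext") first shows where the binary actually lives on a given server, which is the path that has to match the one hard-coded in the processor.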

Walter



Tushar Gandhi
January 31, 2011 11:12 AM

Hi,
In my upcoming application we will be uploading PDF files.
After uploading a PDF file, I have to extract the text from it and
display it to the user.
Can anyone tell me how to extract text from a PDF file?
Is there a plugin or gem for this?
Thanks,
Tushar
