Re: [Rails] Extract text from PDF file

Ruby on Rails Monday, January 31, 2011

On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:

> Hi,
> In my upcoming application we are uploading the pdf files.
> After uploading the pdf file I have to extract the text from pdf and
> display it to user.
> can anyone tell me how to extract text from pdf file?
> Is there any plugin or gem present for this?
> Thanks,
> Tushar
>

I did this using Paperclip and defining a processor for Paperclip as
follows:

#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      # Quoted source and destination paths, passed to pdftotext as arguments
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

#app/models/document.rb
has_attached_file :pdf,
                  :styles => { :text => { :fake => 'variable' } },
                  :processors => [:text]
after_post_process :extract_text

private

# Reads the text file written by the :text processor and stores it on the record
def extract_text
  plain_text = ""
  File.open(pdf.queued_for_write[:text].path, "r") do |file|
    while (line = file.gets)
      # Drop any characters that cannot be represented in ASCII
      plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
    end
  end
  self.plain_text = plain_text # text column to hold the extracted text for searching
end
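
A note for anyone reading this on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, so the Iconv.conv call above will not load there without the separate iconv gem. A minimal sketch of the same "drop anything that isn't ASCII" step using String#encode follows. The helper name to_ascii is hypothetical and not part of the recipe above.

```ruby
# Hypothetical helper: a stand-in for Iconv.conv('ASCII//IGNORE', 'UTF8', line).
# Treats the input as UTF-8 and silently drops every character (or invalid
# byte sequence) that has no ASCII representation.
def to_ascii(text)
  text.encode("ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "")
end

to_ascii("café")  # => "caf"
```

Inside extract_text, the line in the loop would then read plain_text << to_ascii(line), assuming the extracted file really is UTF-8.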

I had to find and install the creaky-old pdftotext utility on my
server (happily, there was an apt-get package for it) and configure the
path correctly. When Paperclip accepts a PDF upload, it creates a text
extraction of that file and saves it in
system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf
extension, the file itself is actually just the plain text extracted
from the original PDF. After quite a lot of googling and begging my
local Ruby group, I got the recipe for ripping open that text file and
reading it into a variable to store on the record.

The text you get out of pdftotext will vary wildly in quality and
completeness, but since all I needed was a way to feed a simple search
system, it works fine for my needs. I never show this text to anyone;
I just use it as the "keywords" for search. You may want or need to
present an editing field for the administrator to clean up these
extracted texts.
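
To check that pdftotext is installed and on the path before wiring it into Paperclip, it can help to call it directly from plain Ruby. This is only a sketch under a few assumptions: the helper names pdftotext_command and extract_pdf_text are made up for illustration, the binary is assumed to be reachable as "pdftotext" unless another path is passed in, and Tempfile.create needs a reasonably recent Ruby. The only flag used is -nopgbrk, the same one as in the processor.

```ruby
require "open3"
require "tempfile"

# Hypothetical helper: builds the argument list for pdftotext.
# Passing arguments as an array avoids shell quoting problems with odd filenames.
def pdftotext_command(src_path, dst_path, binary = "pdftotext")
  [binary, "-nopgbrk", File.expand_path(src_path), File.expand_path(dst_path)]
end

# Hypothetical helper: runs pdftotext and returns the extracted text as a String.
# Raises if the command exits with a non-zero status.
def extract_pdf_text(src_path, binary = "pdftotext")
  Tempfile.create(["extract", ".txt"]) do |dst|
    _out, err, status = Open3.capture3(*pdftotext_command(src_path, dst.path, binary))
    raise "pdftotext failed: #{err.strip}" unless status.success?
    File.read(dst.path)
  end
end
```

Something like puts extract_pdf_text("sample.pdf", "/usr/bin/pdftotext") from irb (with a real PDF in place of the made-up sample.pdf) should then print the same text the processor would store.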

Walter
