From: Jambunathan K
Subject: Re: orgmode and pdf
Date: Tue, 24 Jul 2012 17:53:22 +0530
Message-ID: <81mx2p73gl.fsf@gmail.com>
In-Reply-To: <87vchdd02l.fsf@cica.cica> (x. piter's message of "Tue, 24 Jul 2012 10:40:02 +0200")
To: x.piter@gmail.com
Cc: emacs-orgmode@gnu.org

x.piter@gmail.com writes:

> Hi list.
> I am trying to build a workflow for mining data from PDFs into Org mode.
> I prefer to read in Emacs, since I have fast dictionary lookup there,
> among many other things.
> There are two tools I think are useful for converting PDFs into text:
> cuneiform to extract text, and pdfimages to extract images.
> Cuneiform is better than the other text extractors I have tried at
> handling two-column PDFs.

PdfEdit seems interesting as well.

http://sourceforge.net/projects/pdfedit
http://www.cs.unb.ca/~bremner/blog/posts/pdf2text/

ps: I have no experience with PdfEdit and do not know how it fares with
images and captions.

> The PDF is split into pages and each page is processed separately.
> Using these two programs and some scripting, I believe it is possible to
> convert a PDF into an Org file. However, there are two issues I would
> like to solve.
> 1) Is there any way to extract figure captions from a PDF?
> 2) I have no solution for formulas and Greek letters. The only way to
> handle them would be to consult an image of the page.
> Any suggestions? Has anybody tried something similar?
> Thanks.
> Petro.
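
The page-by-page workflow described in the quoted message could be sketched roughly as follows. This is only an illustration under several assumptions, not something taken from the thread: the helper names (`page_commands`, `org_page`, `org_document`) and the `out/page-NNN` file layout are hypothetical, the `pdftoppm` rasterizing step is an added assumption (cuneiform reads images rather than PDFs), and the cuneiform flags follow its usage text as I recall it, so check them against your installed version. The functions only build command lines and Org text; nothing is executed.

```python
def page_commands(pdf, page, workdir="out"):
    """Return the argv lists for one page; the caller decides how to run them."""
    root = f"{workdir}/page-{page:03d}"  # hypothetical naming scheme
    p = str(page)
    return [
        # Assumed extra step: render the page to <root>.png for the OCR tool.
        ["pdftoppm", "-f", p, "-l", p, "-r", "300", "-png", "-singlefile", pdf, root],
        # OCR the page image to plain text (flags assumed, verify locally).
        ["cuneiform", "-l", "eng", "-f", "text", "-o", f"{root}.txt", f"{root}.png"],
        # Extract the images embedded in that page as <root>-img-NNN files.
        ["pdfimages", "-f", p, "-l", p, pdf, f"{root}-img"],
    ]


def org_page(page, text, images):
    """Format one page as an Org subtree: a heading, the text, then image links."""
    lines = [f"* Page {page}", ""]
    body = text.strip()
    if body:
        lines += [body, ""]
    for img in images:
        lines.append(f"[[file:{img}]]")
    return "\n".join(lines).rstrip() + "\n"


def org_document(title, pages):
    """Join (page_number, text, image_paths) tuples into a single Org file body."""
    head = f"#+TITLE: {title}\n\n"
    return head + "\n".join(org_page(n, text, imgs) for n, text, imgs in pages)


if __name__ == "__main__":
    for argv in page_commands("paper.pdf", 1):
        print(" ".join(argv))
    print(org_document("Paper", [(1, "Some extracted text.", ["out/page-001-img-000.ppm"])]))
```

Keeping one Org heading per page means the rendered page image can be linked from the same subtree, which fits the fallback mentioned in point 2: consulting an image of the page wherever formulas or Greek letters did not survive OCR.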