From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jambunathan K <kjambunathan@gmail.com>
Subject: Re: orgmode and pdf
Date: Tue, 24 Jul 2012 17:53:22 +0530
Message-ID: <81mx2p73gl.fsf@gmail.com>
References: <87vchdd02l.fsf@cica.cica>
Mime-Version: 1.0
Content-Type: text/plain
In-Reply-To: <87vchdd02l.fsf@cica.cica> (x. piter's message of "Tue, 24 Jul
	2012 10:40:02 +0200")
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
To: x.piter@gmail.com
Cc: emacs-orgmode@gnu.org

x.piter@gmail.com writes:

> Hi list.
> I am trying to build a workflow to mine data from PDFs into Org mode.
> I prefer to read in Emacs, since I have fast dictionary lookup in it and
> many other things.
> There are two tools I think are useful for converting PDFs into text:
> cuneiform to extract text, and pdfimages for image extraction.
> Cuneiform is better than the other text extractors I have tried at
> handling two-column PDFs.

PdfEdit seems interesting as well.

http://sourceforge.net/projects/pdfedit
http://www.cs.unb.ca/~bremner/blog/posts/pdf2text/

PS: I have no experience with PdfEdit and do not know how it fares with
images and captions.

> A PDF is split into pages and each of them is processed separately.
> Using these two programs and some scripting, I believe it is possible to
> convert a PDF into an Org file. However, there are two issues I would like
> to solve.
> 1) Is there any way to extract figure captions from a PDF?
> 2) I have no solution for formulas and Greek letters. The only way to
> handle them would be to consult an image of the page.
> Any suggestions about it? Has anybody tried something similar?
> Thanks.
> Petro.
>
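
For what it is worth, the per-page pipeline you describe could be scripted
roughly as below. This is only a sketch under assumptions, not something I
have run on your files: it assumes pdftoppm and pdfimages (poppler-utils)
and cuneiform are on PATH, that your cuneiform build can read PNG input
(some builds only read BMP), and that one Org heading per page is the layout
you want. The function names, file-name prefixes and heading layout are my
own invention. It does nothing about captions, formulas or Greek letters.

```python
from pathlib import Path
import subprocess


def render_cmd(pdf, page, out_root, dpi=300):
    # pdftoppm (poppler-utils): rasterize one page to PNG for OCR.
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), "-png", str(pdf), str(out_root)]


def images_cmd(pdf, page, out_root):
    # pdfimages (poppler-utils): extract the images embedded in one page.
    return ["pdfimages", "-f", str(page), "-l", str(page),
            str(pdf), str(out_root)]


def ocr_cmd(image, out_txt, lang="eng"):
    # cuneiform: OCR one page image into a plain-text file.
    return ["cuneiform", "-l", lang, "-f", "text",
            "-o", str(out_txt), str(image)]


def org_section(page, text, images):
    # One Org heading per page (assumed layout): OCR text, then image links.
    lines = [f"* Page {page}", "", text.strip(), ""]
    lines += [f"[[file:{img}]]" for img in images]
    return "\n".join(lines).rstrip() + "\n"


def convert(pdf, pages, workdir):
    # Run the external tools page by page and return the Org text.
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    sections = []
    for page in pages:
        subprocess.run(render_cmd(pdf, page, workdir / f"page{page:04d}"),
                       check=True)
        subprocess.run(images_cmd(pdf, page, workdir / f"img{page:04d}"),
                       check=True)
        # pdftoppm appends the page number to the root; find it by glob.
        page_png = sorted(workdir.glob(f"page{page:04d}-*.png"))[0]
        out_txt = workdir / f"page{page:04d}.txt"
        subprocess.run(ocr_cmd(page_png, out_txt), check=True)
        images = sorted(p.name for p in workdir.glob(f"img{page:04d}-*"))
        text = out_txt.read_text(errors="replace")
        sections.append(org_section(page, text, images))
    return "\n".join(sections)
```

Something like `Path("paper.org").write_text(convert("paper.pdf", range(1, 11), "work"))`
would then write the first ten pages into one Org file next to the `work`
directory, so the image links would need that directory prefixed or the Org
file saved inside it.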

--