From: x.piter@gmail.com
Subject: orgmode and pdf
Date: Tue, 24 Jul 2012 10:40:02 +0200
Message-ID: <87vchdd02l.fsf@cica.cica>
To: emacs-orgmode@gnu.org

Hi list.

I am trying to set up a workflow for mining data from PDFs into Org mode. I prefer to read in Emacs, since it gives me fast dictionary lookup and many other things.

There are two tools I think are useful for converting PDFs into text: cuneiform to extract the text, and pdfimages to extract the images. Cuneiform handles two-column PDFs better than the other text extractors I have tried.
The PDF is split into pages, and each page is processed separately. Using these two programs and some scripting, I believe it is possible to convert a PDF into an Org file. However, there are two issues I would like to solve:

1) Is there any way to extract figure captions from a PDF?
2) I have no solution for formulas and Greek letters. The only way to handle them would be to consult an image of the page.

Any suggestions? Has anybody tried something similar?

Thanks.
Petro.
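
The per-page workflow described above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the original message: cuneiform works on images, so each page is first rasterized with pdftoppm (from the same poppler-utils package as pdfimages); the local cuneiform build is assumed to read PNG (otherwise convert to BMP first); and the helper names `page_commands`, `page_to_org`, and `convert`, the `pages` directory, and the `page-NNN` file naming are all hypothetical. Each Org heading also links to the rendered page image, so formulas and Greek letters can be checked by eye.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, workdir, lang="eng"):
    """Build (but do not run) the three command lines for one page."""
    stem = Path(workdir) / f"page-{page:03d}"  # hypothetical naming scheme
    p = str(page)
    return [
        # Assumption: rasterize the page first, since cuneiform reads images.
        ["pdftoppm", "-f", p, "-l", p, "-r", "300", "-png", "-singlefile",
         str(pdf), str(stem)],
        # OCR the page image to plain text.
        ["cuneiform", "-l", lang, "-f", "text", "-o", f"{stem}.txt",
         f"{stem}.png"],
        # Extract the embedded images of this page only.
        ["pdfimages", "-f", p, "-l", p, "-j", str(pdf), f"{stem}-img"],
    ]


def page_to_org(page, text, page_image, images):
    """Format one page as an Org subtree with links to its images."""
    lines = [f"* Page {page}", "", f"[[file:{page_image}][page image]]", "",
             text.strip()]
    for img in images:
        lines += ["", f"[[file:{img}]]"]
    return "\n".join(lines) + "\n"


def convert(pdf, n_pages, workdir="pages"):
    """Run the tools page by page and return the text of the Org file."""
    Path(workdir).mkdir(exist_ok=True)
    parts = []
    for page in range(1, n_pages + 1):
        for cmd in page_commands(pdf, page, workdir):
            subprocess.run(cmd, check=True)
        stem = Path(workdir) / f"page-{page:03d}"
        text = Path(f"{stem}.txt").read_text(errors="replace")
        images = sorted(str(f) for f in Path(workdir).glob(f"{stem.name}-img-*"))
        parts.append(page_to_org(page, text, f"{stem}.png", images))
    return "\n".join(parts)
```

Something like `Path("paper.org").write_text(convert("paper.pdf", 12))` would then produce the Org file, with the page count supplied by hand or read from pdfinfo. The sketch does nothing about figure captions, and OCR lines that happen to start with `*` would be read by Org as headings, so the output still needs a manual pass.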