emacs-orgmode@gnu.org archives
* orgmode and pdf
@ 2012-07-24  8:40 x.piter
  2012-07-24 12:23 ` Jambunathan K
  0 siblings, 1 reply; 2+ messages in thread
From: x.piter @ 2012-07-24  8:40 UTC (permalink / raw)
  To: emacs-orgmode

Hi list.
I am trying to build a workflow for mining data from PDFs into Org mode.
I prefer to read in Emacs, since it gives me fast dictionary lookup and
many other things.
There are two tools I think are useful for converting PDFs to text:
cuneiform for text extraction and pdfimages for image extraction.
Cuneiform handles two-column PDFs better than the other text extractors
I have tried.
The PDF is split into pages, and each page is processed separately.
Using these two programs and some scripting, I believe it is possible to
convert a PDF into an Org file. However, there are two issues I would
like to solve.
1) Is there any way to extract figure captions from a PDF?
2) I have no solution for formulas and Greek letters. The only way to
handle them would be to consult an image of the page.
Any suggestions? Has anybody tried something similar?
Thanks.
Petro.
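
The per-page workflow described above could be sketched roughly as follows.
This is only a minimal sketch under stated assumptions: it assumes that
pdftoppm (from poppler-utils, the same package as pdfimages) is used to
rasterize each page, since cuneiform reads images rather than PDFs, and that
the installed poppler is recent enough to support -png. The names
page_commands and org_section, the "out" directory and the file naming are
hypothetical. The functions only build the command lines and the Org text;
nothing is executed.

```python
from pathlib import Path


def page_commands(pdf, page, workdir="out"):
    """Return the external commands for one page (nothing is run here)."""
    # Hypothetical naming scheme: out/page-001.png, out/page-001.txt, ...
    stem = str(Path(workdir) / f"page-{page:03d}")
    p = str(page)
    return [
        # Rasterize the page to stem.png (assumed step: cuneiform needs an image).
        ["pdftoppm", "-f", p, "-l", p, "-r", "300", "-png", "-singlefile", pdf, stem],
        # OCR the page image to plain text.
        ["cuneiform", "-l", "eng", "-f", "text", "-o", stem + ".txt", stem + ".png"],
        # Extract the images embedded in the same page.
        ["pdfimages", "-f", p, "-l", p, "-png", pdf, stem + "-img"],
    ]


def org_section(page, text, images):
    """Format one page as an Org heading followed by its text and image links."""
    lines = [f"* Page {page}", text.strip(), ""]
    # Linking the page image lets the reader check formulas and Greek
    # letters by eye, as suggested in the message.
    lines += [f"[[file:{img}]]" for img in images]
    return "\n".join(lines) + "\n"
```

Each command list could then be passed to subprocess.run(cmd, check=True),
and the resulting sections concatenated into a single .org file.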


* Re: orgmode and pdf
  2012-07-24  8:40 orgmode and pdf x.piter
@ 2012-07-24 12:23 ` Jambunathan K
  0 siblings, 0 replies; 2+ messages in thread
From: Jambunathan K @ 2012-07-24 12:23 UTC (permalink / raw)
  To: x.piter; +Cc: emacs-orgmode

x.piter@gmail.com writes:

> Hi list.
> I am trying to build a workflow for mining data from PDFs into Org mode.
> I prefer to read in Emacs, since it gives me fast dictionary lookup and
> many other things.
> There are two tools I think are useful for converting PDFs to text:
> cuneiform for text extraction and pdfimages for image extraction.
> Cuneiform handles two-column PDFs better than the other text extractors
> I have tried.

PdfEdit seems interesting as well.

http://sourceforge.net/projects/pdfedit
http://www.cs.unb.ca/~bremner/blog/posts/pdf2text/

PS: I have no experience with PdfEdit and do not know how it fares with
images and captions.
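
On the caption question, one simple heuristic is to scan the extracted text
for paragraphs that start with "Fig." or "Figure" followed by a number. The
sketch below is not a feature of PdfEdit or cuneiform; it is a plain regular
expression over the OCR output, and the names CAPTION_RE and find_captions
are hypothetical. It assumes that captions start at the beginning of a line
and end at the next blank line, which will not hold for every layout, and it
will also pick up body sentences that happen to begin with "Figure 3 ...".

```python
import re

# Matches e.g. "Fig. 2. Spectrum ...", "Figure 10: Setup", "FIG 3 ..."
CAPTION_RE = re.compile(r"^\s*(Fig(?:ure)?\.?)\s*(\d+)[.:]?\s*(.*)$", re.IGNORECASE)


def find_captions(text):
    """Return (figure_number, caption_text) pairs found in extracted text."""
    captions = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        m = CAPTION_RE.match(lines[i])
        if not m:
            i += 1
            continue
        number = int(m.group(2))
        parts = [m.group(3).strip()]
        i += 1
        # Assumption: the caption continues until the next blank line.
        while i < len(lines) and lines[i].strip():
            parts.append(lines[i].strip())
            i += 1
        captions.append((number, " ".join(p for p in parts if p)))
    return captions
```

The returned pairs could then be written next to the corresponding image
links in the Org file, for instance as #+CAPTION: lines.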

> The PDF is split into pages, and each page is processed separately.
> Using these two programs and some scripting, I believe it is possible to
> convert a PDF into an Org file. However, there are two issues I would
> like to solve.
> 1) Is there any way to extract figure captions from a PDF?
> 2) I have no solution for formulas and Greek letters. The only way to
> handle them would be to consult an image of the page.
> Any suggestions? Has anybody tried something similar?
> Thanks.
> Petro.

-- 
