From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matt Price
Subject: Re: Org Mode and PDF Notes!
Date: Thu, 12 Nov 2015 17:52:45 -0500
Message-ID:
References: <877floffyq.fsf@gmail.com> <87wptnqucw.fsf@gmail.com> <87k2pn70mb.fsf@fastmail.fm>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11c3c6be21cb7305245fcdcb
Return-path:
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58495)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1Zx0jY-0003bD-0Q for emacs-orgmode@gnu.org; Thu, 12 Nov 2015 17:52:49 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1Zx0jW-0006q4-JO
 for emacs-orgmode@gnu.org; Thu, 12 Nov 2015 17:52:47 -0500
Received: from mail-ig0-x235.google.com ([2607:f8b0:4001:c05::235]:37759)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1Zx0jW-0006pw-D2 for emacs-orgmode@gnu.org; Thu, 12 Nov 2015 17:52:46 -0500
Received: by igbhv6 with SMTP id hv6so3921015igb.0 for ;
 Thu, 12 Nov 2015 14:52:45 -0800 (PST)
In-Reply-To: <87k2pn70mb.fsf@fastmail.fm>
List-Id: "General discussions about Org-mode."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Cc: Org Mode

--001a11c3c6be21cb7305245fcdcb
Content-Type: text/plain; charset=UTF-8

On Thu, Nov 12, 2015 at 9:28 AM, Matt Lundin wrote:

> Ramon Diaz-Uriarte writes:
> >
> > I'll do. In the meantime, I think this is a limitation coming from
> > poppler. Other people have mentioned similar things (e.g.,
> > http://coda.caseykuhlman.com/entries/2014/pdf-extract.html) and using other
> > tools that depend on poppler (such as Leela:
> > https://github.com/TrilbyWhite/Leela) also will not give us the text
> > itself.
>
> I don't think this is a limitation of poppler so much as the way that
> pdf annotations work.
> Typically, the subject/text field is not populated
> by the text of the highlighted region. Rather, a highlight annotation
> specifies bounds, color, style, etc. Basically what Repligo does (I
> wouldn't recommend using it, as it is closed source and severely out of
> date) is to grab the text *at the time of highlighting* and add it to
> the notes field. I don't know of any other annotation tool that does the
> same thing. Applications built on poppler could do it, though they
> currently do not.
>
> For extracting the text of highlighted regions *after the fact*, I've
> had good luck with this script that relies on the pdf-reader gem for
> ruby:
>
> https://gist.github.com/danlucraft/5277732
>

This looks interesting. It looks for a file "./markup_receiver" but
doesn't provide that file, and the file does not appear to be a gem.
Any hints?

With politza's help I am getting close to being able to extract annotation
text from within pdf-tools, but am not quite there yet.

> Matt
>
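To make the point about highlight annotations concrete: since the annotation
stores only bounds and not the text, recovering the text after the fact means
matching those bounds against the positions of the words on the page. Below is
a minimal sketch of that idea in plain Python. The Box and Word types, the
highlighted_text function, and the page data are all hypothetical and made up
for illustration; this is not the API of poppler, pdf-reader, or pdf-tools, and
a real tool would also have to deal with multi-line highlights and partially
covered words.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains_point(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Word:
    """A word plus its bounding box, as a text extractor might report it."""
    text: str
    box: Box


def highlighted_text(highlight: Box, words: List[Word]) -> str:
    """Join the words whose centre point lies inside the highlight bounds.

    Assumes `words` is already in reading order.
    """
    selected = []
    for word in words:
        cx = (word.box.x0 + word.box.x1) / 2
        cy = (word.box.y0 + word.box.y1) / 2
        if highlight.contains_point(cx, cy):
            selected.append(word.text)
    return " ".join(selected)


# Made-up page content: two lines of text, with y growing upward.
page_words = [
    Word("Annotations", Box(72, 700, 140, 712)),
    Word("store", Box(144, 700, 172, 712)),
    Word("bounds,", Box(176, 700, 218, 712)),
    Word("not", Box(72, 684, 90, 696)),
    Word("text.", Box(94, 684, 120, 696)),
]

# A highlight whose bounds cover the last two words of the first line.
highlight = Box(142, 698, 220, 714)
print(highlighted_text(highlight, page_words))
```

Testing the centre of each word rather than requiring full containment is just
one possible choice; it tolerates highlight bounds that clip a word's edges
slightly.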

--001a11c3c6be21cb7305245fcdcb--