On Thu, Nov 12, 2015 at 9:28 AM, Matt Lundin wrote:

> Ramon Diaz-Uriarte writes:
>
> > I'll do. In the meantime, I think this is a limitation coming from
> > poppler. Other people have mentioned similar things (e.g.,
> > http://coda.caseykuhlman.com/entries/2014/pdf-extract.html) and using
> > other tools that depend on poppler (such as Leela:
> > https://github.com/TrilbyWhite/Leela) also will not give us the text
> > itself.
>
> I don't think this is a limitation of poppler so much as the way that
> pdf annotations work. Typically, the subject/text field is not populated
> by the text of the highlighted region. Rather, a highlight annotation
> specifies bounds, color, style, etc. Basically what Repligo does (I
> wouldn't recommend using it, as it is closed source and severely out of
> date) is to grab the text *at the time of highlighting* and add it to
> the notes field. I don't know of any other annotation tool that does the
> same thing. Applications built on poppler could do it, though they
> currently do not.
>
> For extracting the text of highlighted regions *after the fact*, I've
> had good luck with this script that relies on the pdf-reader gem for
> ruby:
>
> https://gist.github.com/danlucraft/5277732

This looks interesting. It searches for a file "./markup_receiver", but
doesn't provide that file, which does not appear to be a gem. Any hints?

With politza's help I am getting close to being able to extract annotation
text from within pdf-tools, but I am not quite there yet.

> Matt
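
For reference, here is a minimal sketch of the "after the fact" idea discussed above: since a highlight annotation only stores bounds, the text has to be recovered by matching those bounds against the positions of the words on the page. It is written in Python, uses no PDF library, and is not taken from the gist or from pdf-tools. The function names and the sample word boxes are hypothetical; in practice the word boxes would come from whatever text extractor is in use. The one thing taken from the PDF format itself is that a highlight's QuadPoints entry is a flat array with 8 numbers (four x/y corners) per highlighted quadrilateral.

```python
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in page coordinates
Box = Tuple[float, float, float, float]


def quads_to_boxes(quad_points: Sequence[float]) -> List[Box]:
    """Turn a flat QuadPoints array into one bounding box per quadrilateral.

    Using the bounding box of each quad sidesteps the question of which
    corner order a given annotation tool wrote.
    """
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    boxes = []
    for i in range(0, len(quad_points), 8):
        xs = quad_points[i:i + 8:2]       # x1, x2, x3, x4
        ys = quad_points[i + 1:i + 8:2]   # y1, y2, y3, y4
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def highlighted_text(quad_points: Sequence[float],
                     words: Sequence[Tuple[str, Box]]) -> str:
    """Join the words whose centre point lies inside any highlighted box.

    `words` is assumed to be in reading order already.
    """
    boxes = quads_to_boxes(quad_points)
    picked = []
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(bx0 <= cx <= bx1 and by0 <= cy <= by1
               for bx0, by0, bx1, by1 in boxes):
            picked.append(text)
    return " ".join(picked)


# Hypothetical data: four words on a page and one highlight quad.
sample_words = [
    ("Typically,", (72, 700, 120, 712)),
    ("the", (124, 700, 140, 712)),
    ("subject", (144, 700, 180, 712)),
    ("field", (72, 686, 96, 698)),
]
sample_quad = [122, 712, 182, 712, 122, 700, 182, 700]

print(highlighted_text(sample_quad, sample_words))  # prints "the subject"
```

Testing the word's centre rather than requiring full containment is a deliberate choice: highlights drawn by hand rarely line up exactly with word boundaries, so a strict containment test tends to drop the first and last word.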