emacs-orgmode@gnu.org archives
* Searching inside of attachments (pdf, odt)?
@ 2009-10-12 13:40 Karl Maihofer
  2009-10-12 22:59 ` Samuel Wales
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Maihofer @ 2009-10-12 13:40 UTC
  To: emacs-orgmode

Hi,

Does anyone use something like Lucene[*] with Org mode to search inside
attachments such as PDF and ODT files? At the moment I use Org for
time planning and a stand-alone Confluence wiki for knowledge
management (which uses Lucene to index attachments). My "knowledge
management" consists mainly of a large number of PDF files. If I could
search inside attachments with Org, I could perhaps switch to an
Emacs-only solution. That would be awesome.

Kind regards,
Karl

[*] http://en.wikipedia.org/wiki/Lucene


* Re: Searching inside of attachments (pdf, odt)?
  2009-10-12 13:40 Searching inside of attachments (pdf, odt)? Karl Maihofer
@ 2009-10-12 22:59 ` Samuel Wales
  2009-10-13  8:09   ` Karl Maihofer
  0 siblings, 1 reply; 6+ messages in thread
From: Samuel Wales @ 2009-10-12 22:59 UTC
  To: Karl Maihofer; +Cc: emacs-orgmode

Hi Karl,

I have been thinking about this recently also, but in a
different direction.  I agree that searching inside
attachments is important.

On Mon, Oct 12, 2009 at 06:40, Karl Maihofer <ignoramus@gmx.de> wrote:
> does anyone use something like Lucene[*] with orgmode to search inside
> attachments like pdf- and odt-files? At the moment I use org for

My idea is to use ordinary agenda search like this:

  1) agenda search displays the headline that has the
     attachment.
  2) org uses an alist to determine the correct textifier
     according to extension.  e.g. '((".pdf" . "pdf2text")).
  3) agenda searches normally (as if the contents of the
     attachment were body text).
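
A minimal sketch of step 2, written in Python purely for illustration
(inside Org this would be an Emacs Lisp alist, as in the example above).
The table contents are assumptions: "pdf2text" is the placeholder name
used in this message, "odt2txt" is a hypothetical counterpart for ODT
files, and the function names are made up for this sketch.

```python
import os

# Hypothetical table: file extension -> external command that prints plain text.
# "pdf2text" is the placeholder from the message; "odt2txt" is an assumption.
TEXTIFIERS = {
    ".pdf": ["pdf2text"],
    ".odt": ["odt2txt"],
}


def textifier_for(path):
    """Return the command list for PATH's extension, or None if unknown."""
    ext = os.path.splitext(path)[1].lower()
    return TEXTIFIERS.get(ext)


def textify_command(path):
    """Build the full argument list that would turn PATH into text, or None."""
    cmd = textifier_for(path)
    return None if cmd is None else cmd + [path]
```

The sketch only builds the command; running it and feeding the output to
the search is left out, since that part depends on the tools installed.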

Note that we are searching only attachments that the agenda
would search.  Thus, "<" in the agenda will work
properly.[1]

Also, note that archived tasks still point to their
attachments.  With the above solution, if you search agenda
files, the results won't be polluted with archived
attachments.  If you use an external solution, you would
have to find a way to exclude the archived attachments.

IR software could still be integrated.  At the very least,
you might choose Lucene as a back-end textifier for all
extensions.

It's not as fancy as integrating IR with all of the IR
features, but it might be a simple solution.


Samuel


[1] This raises another, much more general idea.  Is
there a feature to restrict agenda commands (including
search) to the currently displayed (or even marked) agenda
results?  i.e. you run an agenda search, filter however you
like, then search (or run any custom agenda command) within
those results.  It would allow fast switching among multiple
user-defined sorting strategies (kind of like filtering with
"/"), which is something I've wanted.  But I just thought of
it now, and don't know if it's a good idea.


* Re: Searching inside of attachments (pdf, odt)?
  2009-10-12 22:59 ` Samuel Wales
@ 2009-10-13  8:09   ` Karl Maihofer
  2009-10-13 14:31     ` Tim O'Callaghan
  2009-10-13 17:09     ` Samuel Wales
  0 siblings, 2 replies; 6+ messages in thread
From: Karl Maihofer @ 2009-10-13  8:09 UTC
  To: emacs-orgmode

Hi Samuel,

Samuel Wales <samologist@gmail.com> wrote:
> My idea is to use ordinary agenda search like this:
>   1) agenda search displays the headline that has the
>      attachment.
>   2) org uses an alist to determine the correct textifier
>      according to extension.  e.g. '((".pdf" . "pdf2text")).
>   3) agenda searches normally (as if the contents of the
>      attachment were body text).

Correct me if I'm wrong, but your approach is to search inside
attachments that have already been identified?

I'd like to find attachments by searching across the whole set of
attachments. I have many articles (PDF files) to deal with. When I
write a report on a particular topic, I have to find the articles that
are relevant to it.

If we use the standard textifiers, the procedure will probably get very
slow when there are many attachments. I think using an index would be a
good idea.
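
To make the index idea concrete, here is a small sketch in Python of an
inverted index over text that has already been extracted from the
attachments. It is only an illustration of the principle, not how Lucene
works internally; the function names and the word-splitting rule are
assumptions of this sketch.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lower-cased word to the set of attachments containing it.

    TEXTS is a dict of attachment name -> already extracted plain text.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of attachments containing every word in QUERY."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

The expensive step (text extraction) happens once when the index is
built; each query afterwards is only a few set lookups.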

To describe what I'm looking for:
My first step is to create an entry for each article, define tags
(describing the content) and add some notes.

* Title of the article                           :tag:tag:tag:
   :PROPERTIES:
   :Attachments: article.pdf
   :ID: 387HJGJD78-758GZFHF87-JKHKJ57dfd9
   :END:
   - Very good explanation of X.
   - New view on Y.

But it would be much more powerful to be able to find an entry not
only by searching for tags but also by searching inside the attachments.

I'm not a programmer, so sorry if my ideas are stupid. ;-) But I think
the following questions have to be answered:

1) Is there a tool like Lucene that can index PDF files as they are
    stored by Org mode (directory structure)?
2) Is it possible to send a query to this tool from within Emacs?
3) Is it possible to "import" the tool's answer into Emacs and
    combine it with Org mode so that the result looks something like this:
    "Search string 'XX' found in file 'article.pdf' attached to task
    'Title of the article'". A click on the name of the attachment
    should open the PDF file in the PDF reader; a click on the task
    name should show the task in the Org buffer.

Karl


* Re: Searching inside of attachments (pdf, odt)?
  2009-10-13  8:09   ` Karl Maihofer
@ 2009-10-13 14:31     ` Tim O'Callaghan
  2009-10-13 17:09     ` Samuel Wales
  1 sibling, 0 replies; 6+ messages in thread
From: Tim O'Callaghan @ 2009-10-13 14:31 UTC
  To: Karl Maihofer; +Cc: emacs-orgmode

FWIW

I think this might be handled more easily if all that happened were a
grep on the attachments, or directories.

The usual grep interface can be used, and then it becomes a fast
general-purpose data-mining extension.

I can see it being used to search a codebase or website for a text string.

I guess it could be further refined with some kind of dispatcher,
like the file dispatcher that invokes a specific tool to view an
attachment, except that it uses an attachment-specific search or defaults
to grep if it's not an Emacs-editable file.

Possibly an extension of the current "file:<text-file>::<in buffer
search >", but one that uses this grep or whatever if it comes up against
something un-Emacs-editable.

An added bonus of a search-dispatcher approach: it would give
users the chance to extend the search into whatever tool(s)/file
format(s) they are using without it having to become core to Org.
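
The dispatcher idea can be sketched roughly as follows, in Python for
illustration only. The registry, `register_searcher` and
`search_attachment` are hypothetical names invented for this sketch, and
the default searcher is a plain line-by-line regexp scan standing in for
grep rather than a call to the real grep program.

```python
import os
import re


def grep_file(path, pattern):
    """Default searcher: return (line_number, line) pairs matching PATTERN."""
    regex = re.compile(pattern)
    matches = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                matches.append((number, line.rstrip("\n")))
    return matches


# Hypothetical registry: users add format-specific searchers here.
SEARCHERS = {}


def register_searcher(extension, function):
    """Register FUNCTION(path, pattern) as the searcher for EXTENSION."""
    SEARCHERS[extension.lower()] = function


def search_attachment(path, pattern):
    """Use the searcher registered for PATH's extension, else grep_file."""
    ext = os.path.splitext(path)[1].lower()
    return SEARCHERS.get(ext, grep_file)(path, pattern)
```

A user with PDF attachments would register their own function for
".pdf"; everything without a registered searcher falls through to the
grep-like default, so nothing format-specific has to live in the core.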

Just my 2eurocents worth:

Tim.


* Re: Searching inside of attachments (pdf, odt)?
  2009-10-13  8:09   ` Karl Maihofer
  2009-10-13 14:31     ` Tim O'Callaghan
@ 2009-10-13 17:09     ` Samuel Wales
  2009-10-14 16:47       ` Karl Maihofer
  1 sibling, 1 reply; 6+ messages in thread
From: Samuel Wales @ 2009-10-13 17:09 UTC
  To: Karl Maihofer; +Cc: emacs-orgmode

Hi,

My idea is to keep it simple at first.  Everybody will come
up with great ways to integrate with his favorite IR tool.

Here I want to focus on the org interface.

The org interface can be the same as any other agenda
search, with all the same controls.  The back end can use
special-purpose textifiers like pdf2text (or whatever) or
general-purpose textifiers from IR tools.  Doesn't matter.

Later, the mechanism can get more fancy if desired.  But
first, we should implement existing behavior.  I often move
things to attachments merely because they are large.  I
don't want search to work differently just because I did
that.  Search should IMO work the same as it does for
outline bodies.

This includes regexp syntax.  If we use anything other than
Emacs, we risk one regexp syntax for attachments and another
for outline bodies.  That makes me shudder.

Later, we can use the fancier IR tools, or use reverse
indexes.  But not everybody has IR tools installed, and
reverse indexes might be premature optimization.

If you're worried about speed, this is a perfect, simple
application for caching.  I'd try it before concluding that
it is too slow.  If it is, we have a good foundation into
which we can hook your favorite IR.
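
As a rough illustration of the caching idea (in Python, not Emacs Lisp),
the sketch below keeps the extracted text of each attachment in memory
and re-runs the textifier only when the file's modification time
changes. The names `cached_text` and `extract` are hypothetical, and a
real implementation would probably want the cache to survive between
sessions.

```python
import os

_cache = {}  # path -> (mtime, extracted text)


def cached_text(path, extract):
    """Return extract(path), recomputing only when the file's mtime changes."""
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]
    text = extract(path)
    _cache[path] = (mtime, text)
    return text
```

With this in place the slow step is paid once per attachment, and later
searches only re-textify files that have actually changed.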

I don't think there's a downside to achieving compatibility
and full agenda integration first, then only after that
doing the fancy stuff.

Have you tried the agenda search feature yet?  If not, perhaps trying
it first will help ground the discussion.


* Re: Searching inside of attachments (pdf, odt)?
  2009-10-13 17:09     ` Samuel Wales
@ 2009-10-14 16:47       ` Karl Maihofer
  0 siblings, 0 replies; 6+ messages in thread
From: Karl Maihofer @ 2009-10-14 16:47 UTC
  To: emacs-orgmode

Hi,

On 13.10.09 19:09, Samuel Wales wrote:
> Have you tried the agenda search feature yet?  If not, perhaps trying
> it first will help ground the discussion.

OK, I had another look at the org agenda search feature and I agree that 
it would be much smarter to use the already implemented org features - 
to go "the org-way".

But I must confess I do not know how to push the attachments to pdf2txt, 
make the org agenda search the text-files and link back to the 
corresponding org task in the org-file.
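
One possible way to do the "link back" step, sketched in Python under a
stated assumption: that attachments live under an attachment root in a
directory named after the entry's ID, split into the first two
characters and the rest (for example data/38/7HJGJD78/article.pdf). If
your setup stores attachments differently, the first function does not
apply. Both function names are hypothetical.

```python
import os
import re


def id_from_attachment(path, attach_root):
    """Recover the entry ID from ATTACH_ROOT/<2 chars>/<rest of ID>/<file>."""
    parts = os.path.relpath(path, attach_root).split(os.sep)
    if len(parts) < 3 or parts[0] == os.pardir:
        raise ValueError("not inside an ID directory: " + path)
    return parts[0] + parts[1]


def heading_for_id(org_text, entry_id):
    """Return the nearest heading above the ':ID:' line for ENTRY_ID, or None."""
    heading = None
    for line in org_text.splitlines():
        if re.match(r"\*+ ", line):
            heading = line
        elif line.split() == [":ID:", entry_id]:
            return heading
    return None
```

Given a hit in some extracted text file, the first function yields the
ID and the second finds the task heading in the Org file, which is
enough to print a result line that names both the attachment and the
task.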

Any ideas?

Karl

