emacs-orgmode@gnu.org archives
* Extract document structure from Org file
@ 2015-07-03  8:39 Oleg Sivokon
  2015-07-03 11:58 ` Rasmus
  2015-07-03 14:20 ` John Kitchin
  0 siblings, 2 replies; 4+ messages in thread
From: Oleg Sivokon @ 2015-07-03  8:39 UTC (permalink / raw)
  To: emacs-orgmode

Hello list!

Suppose I wanted to extract the structure from an Org document, where,
what's important for me would be to have it categorically divided into
headers, paragraphs of text, technical information and inclusion of
other documents (code snippets).  How would I do it?

The reason I'm asking is that I've a small project I work on, where I'm
trying to enhance the search in documents by using indexing combined
with queries based on things like distance between words, frequency of a
word appearing in a document and so on.  (I'm using Sphinx for it.)
I've tried to do this with Info pages, and I liked the results, however,
in order to do this more intelligently, I'd like to index the documents
with better granularity (i.e. so that later on I could search assigning
different weights to words appearing in headers and words appearing in
comments).
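
A minimal sketch of the weighting idea described above, in Python rather than in Sphinx itself. The field names (`header`, `paragraph`, `comment`), the weight values and the `score` helper are all hypothetical, chosen only to illustrate how a word in a header could count for more than the same word in a comment:

```python
from collections import Counter

# Hypothetical per-field weights; the numbers are illustrative only.
FIELD_WEIGHTS = {"header": 3.0, "paragraph": 1.0, "comment": 0.5}


def score(document, query_word, weights=FIELD_WEIGHTS):
    """Sum the weighted occurrence counts of query_word across fields.

    `document` maps a field name to the text found in that field.
    Fields without an explicit weight default to 1.0.
    """
    word = query_word.lower()
    total = 0.0
    for field, text in document.items():
        # Naive whitespace tokenisation; a real indexer would do more.
        counts = Counter(text.lower().split())
        total += weights.get(field, 1.0) * counts[word]
    return total


doc = {
    "header": "Parsing Org files",
    "paragraph": "Parsing is done with a parser that walks the buffer",
    "comment": "TODO revisit parsing later",
}
print(score(doc, "parsing"))
```

A search engine such as Sphinx would apply its own ranking on top of per-field weights; this only shows why the document has to be split into fields before indexing.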

Best.

Oleg

* Re: Extract document structure from Org file
  2015-07-03  8:39 Extract document structure from Org file Oleg Sivokon
@ 2015-07-03 11:58 ` Rasmus
  2015-07-03 14:20 ` John Kitchin
  1 sibling, 0 replies; 4+ messages in thread
From: Rasmus @ 2015-07-03 11:58 UTC (permalink / raw)
  To: emacs-orgmode

Hi Oleg,

Oleg Sivokon <olegsivokon@gmail.com> writes:

> Suppose I wanted to extract the structure from an Org document, where,
> what's important for me would be to have it categorically divided into
> headers, paragraphs of text, technical information and inclusion of
> other documents (code snippets).  How would I do it?

You would use org-element.  Try org-element-parse-buffer and
org-element-map and maybe org-element-interpret-data.  There are also a
bunch of regexps for identifying/finding particular types of elements.
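
For indexing outside Emacs, the regexp-based route can be sketched roughly as follows. This is not org-element and not Org's own regexps: the patterns and the `classify` helper are simplified assumptions that only recognise headlines, `#+KEYWORD:` lines, source blocks and plain paragraphs, so org-element remains the robust option.

```python
import re

# Simplified, assumed patterns; Org's real syntax has many more cases.
HEADLINE = re.compile(r"^(\*+)\s+(.*)$")
SRC_BEGIN = re.compile(r"^\s*#\+BEGIN_SRC(?:\s+(\S+))?", re.IGNORECASE)
SRC_END = re.compile(r"^\s*#\+END_SRC", re.IGNORECASE)
KEYWORD = re.compile(r"^\s*#\+\w+:")


def classify(text):
    """Split Org text into (kind, ...) tuples in document order."""
    items = []
    paragraph = []
    src = None  # (language, lines) while inside a source block

    def flush():
        # Emit the paragraph collected so far, if any.
        if paragraph:
            items.append(("paragraph", " ".join(paragraph)))
            paragraph.clear()

    for line in text.splitlines():
        if src is not None:
            if SRC_END.match(line):
                items.append(("src-block", src[0], "\n".join(src[1])))
                src = None
            else:
                src[1].append(line)
            continue
        m = SRC_BEGIN.match(line)
        if m:
            flush()
            src = (m.group(1) or "", [])
            continue
        m = HEADLINE.match(line)
        if m:
            flush()
            items.append(("headline", len(m.group(1)), m.group(2)))
            continue
        if KEYWORD.match(line):
            flush()
            items.append(("keyword", line.strip()))
            continue
        if line.strip():
            paragraph.append(line.strip())
        else:
            flush()
    flush()  # an unterminated source block is silently dropped
    return items


sample = """#+TITLE: Demo
* Intro
Some text
continues here.

** Code
#+BEGIN_SRC python
print("hi")
#+END_SRC
"""

for item in classify(sample):
    print(item)
```

Each tuple could then be written to a separate field of the index, so that headlines, prose and code are searchable independently.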

Cheers,
Rasmus

-- 
To err is human. To screw up 10⁶ times per second, you need a computer

* Re: Extract document structure from Org file
  2015-07-03  8:39 Extract document structure from Org file Oleg Sivokon
  2015-07-03 11:58 ` Rasmus
@ 2015-07-03 14:20 ` John Kitchin
       [not found]   ` <87a8vdfacd.fsf@gmail.com>
  1 sibling, 1 reply; 4+ messages in thread
From: John Kitchin @ 2015-07-03 14:20 UTC (permalink / raw)
  To: Oleg Sivokon; +Cc: emacs-orgmode

That sounds really cool. I recently hacked together a swish-e index of my
org files (there might have been 3000+!):
http://kitchingroup.cheme.cmu.edu/blog/2015/06/25/Integrating-swish-e-and-Emacs/

I just updated it to index the html version of an org-file so that I
can take advantage of the structure in the search:
http://kitchingroup.cheme.cmu.edu/blog/2015/07/03/Using-swish-e-to-index-org-files-as-html/

It would be cool to have more granular searching though.

Is your info project visible anywhere? I can imagine a close-file hook
function that updates the database automatically.

--
Professor John Kitchin
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
@johnkitchin
http://kitchingroup.cheme.cmu.edu

* Re: Extract document structure from Org file
       [not found]   ` <87a8vdfacd.fsf@gmail.com>
@ 2015-07-04 15:54     ` John Kitchin
  0 siblings, 0 replies; 4+ messages in thread
From: John Kitchin @ 2015-07-04 15:54 UTC (permalink / raw)
  To: Oleg Sivokon, emacs-orgmode@gnu.org

I worked out a new version of the swish-e org indexer that indexes
custom xml representing the org file that you may find interesting for
your project.

http://kitchingroup.cheme.cmu.edu/blog/2015/07/04/An-xml-representation-of-an-org-document-for-indexing-with-swish-e/

It enables a search like this:

swish-e -f index-org2xml.swish-e -w src-block.language=python -w src-block=diffusion

to find org files with a python source block containing the word
diffusion.

I think swish-e supports ranking
(http://swish-e.org/docs/swish-faq.html#how_is_ranking_calculated_) too,
but I have not tried it.

It is pretty interesting overall!



Oleg Sivokon writes:

> John Kitchin <jkitchin@andrew.cmu.edu> writes:
>
>> You would use org-element.  Try org-element-parse-buffer and
>> org-element-map and maybe org-element-interpret-data.  There's also a
>> bunch of regexp for identifying/finding particular types of elements.
>
> Thanks! I'm already looking into it.
>
>> That sounds really cool. I recently hacked a swish-e index of my org
>> files (there might have been 3000+!)
>> http://kitchingroup.cheme.cmu.edu/blog/2015/06/25/Integrating-swish-e-and-Emacs/.
>> and
>>
>> I just updated it to index the html version of an org-file so that I
>> take advantage of the structure in the
>> search. http://kitchingroup.cheme.cmu.edu/blog/2015/07/03/Using-swish-e-to-index-org-files-as-html/. It
>> would be cool to have more granular searching though.
>>
>> Is your info project visible
>> anywhere? i can imagine a close-file hook function that updates the
>> database automatically.
>
> Whoa, that's a lot of Org files :) What I wrote so far is on Github, but
> it's in a very early stage, so it's not something you could just drop
> into your Emacs directory and start using right away.
> https://github.com/wvxvw/sphinx-mode
> I've also looked into Swish some time ago.  I also thought about using
> Nepomuk, but, in the latter case, I've to admit, I didn't make it through
> the documentation.
>
> The difference in using Sphinx is that it has ranking, and it has a
> relatively terse way of specifying searching criteria.  For example, you
> could ask to search for "some words in this phrase"/3 and it would look
> for occurrences of 3 of 5 words given between the quotes.  Or, you could
> ask it to search for @node "R" @contents "printf" "format", and this
> would search for node titles mentioning "R" and having contents with
> words "printf" and "format".
> I've to admit I didn't master it fully (there are far more options and
> settings) but it does something that seems reasonable (if I compare it
> to M-x info-apropos).
>
> I'm also still trying to learn what's the best way to do indexing, so
> the project is still very raw, but I'll get there one day :)
>
> The ultimate goal is also to write a more human-friendly interface to
> Sphinx, where one could ask questions in a subset of natural language :)
> (but that's a very long way into the future!)
>
> PS. I see that many posts on this list are titled with [O].  What does
> it mean, should I do that too?
>
> Best.
>
> Oleg
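
The "some words in this phrase"/3 query quoted above can be read as a quorum match: a document qualifies when at least 3 of the 5 listed words occur in it. A rough Python sketch of that reading follows. The `quorum_match` helper is hypothetical and says nothing about how Sphinx implements or ranks such queries; it also ignores punctuation and stemming.

```python
def quorum_match(text, phrase, threshold):
    """Return True if at least `threshold` distinct words of `phrase`
    occur in `text` (case-insensitive, whitespace tokenisation)."""
    wanted = set(phrase.lower().split())
    present = wanted & set(text.lower().split())
    return len(present) >= threshold


query = "some words in this phrase"
print(quorum_match("this phrase has some content", query, 3))
print(quorum_match("only this here", query, 3))
```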
