Re: Org Syntax Specification

From: Ihor Radchenko <yantar92@gmail.com>
To: Tom Gillespie <tgbugs@gmail.com>
Cc: org-mode-email <emacs-orgmode@gnu.org>,
	Nicolas Goaziou <mail@nicolasgoaziou.fr>,
	Timothy <tecosaur@gmail.com>
Subject: Re: Org Syntax Specification
Date: Wed, 19 Jan 2022 19:58:59 +0800
Message-ID: <87bl08kkcc.fsf@localhost>
In-Reply-To: <CA+G3_PMybCd+xd8RkRbC707uMLMDHf38LUWkEyYV10vZv8L6Sw@mail.gmail.com>

Tom Gillespie <tgbugs@gmail.com> writes:

> 3. When I say grammar in this context I mean specifically an eBNF that
>    generates a LALR(1) or LR(1) parser. This is narrower than the
>    definition used in the document, which includes things that have to
>    be implemented in the tokenizer, or in a pass after the grammar has
>    been applied, or are related to some other aspect beyond the pure
>    surface syntax.

I feel that we should not be trying to fit into LR at the expense of
complicating the document. When looking at earlier versions of the
grammar, I mostly had GLR in mind.

> In my thinking I separate the context sensitive nature of parsing from
> the nesting structure of the resulting sexpressions, org elements,
> etc. The most obvious example of this is that the sexpression
> representation for headings nests based on the level of the heading,
> but heading level cannot be determined by the grammar so it must be
> reconstructed from a flat sequence of headings that have varying level.

1. I think that the resulting sexpression is important to describe. We
   eventually plan to provide a reference test set to verify external
   parsers against org-element.el [1]. It is important to describe the
   nesting with this consideration.

2. You actually can determine the end of a heading if you are allowed to
   do lookaheads (which are necessary anyway to parse
   #+begin_blah..#+end_blah). The end of the current heading is
   "eof|^\*{1,N} ", where N is the level of the current heading.

[1] https://list.orgmode.org/spmq6a$2s5$1@ciao.gmane.io/T/#t
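
To make point 2 concrete, here is a minimal Python sketch of that
lookahead. It is only an illustration, not org-element code: the function
name heading_end is made up, the regexp requires at least one star, and
inlinetasks are ignored.

```python
import re


def heading_end(text: str, start: int) -> int:
    """Return the offset where the heading beginning at `start` ends.

    The heading ends at the next line that opens a heading of the same
    or a shallower level, or at end of file.
    """
    stars = re.match(r"(\*+) ", text[start:])
    if stars is None:
        raise ValueError("no heading at the given offset")
    level = len(stars.group(1))

    # Skip the heading line itself before looking ahead.
    line_end = text.find("\n", start)
    if line_end == -1:
        return len(text)

    # "^\*{1,N} " with N = level of the current heading.
    terminator = re.compile(r"^\*{1,%d} " % level, re.MULTILINE)
    found = terminator.search(text, line_end + 1)
    return found.start() if found else len(text)
```

For "* A\ntext\n** B\nmore\n* C\n", the heading at offset 0 ends where
"* C" starts, so the "** B" subtree stays inside it.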

> ... I think the
> other issue I was having here is that the spec for tables is spread
> all over the place, and it would be much easier to understand and
> implement if it were all in one place.

That sounds fine to me. Though your next suggestion appears to be
exactly the opposite:

> I think your version is quite a bit more readable.  Can we list the
> set of all the elements that can be ended by a new line as well as
> those that cannot (iirc they are elements such as footnotes that can
> only be ended by a double blank line or a heading)?

The intention behind listing the exceptions for table cells was exactly
what you are thinking of for open-ended elements.

>> I am not sure here. Inline tasks are special because a one-line inline
>> task must not contain any text below, cannot have planning or
>> properties.
>
> Then they are no longer inline tasks, but instead parse as headings, correct?

They are still inline tasks. Consider the below example:

* Normal heading

Paragraph
************************************************** Inline task
SCHEDULED: <2022-01-19> <- this is an ordinary paragraph, not a part of inline task
Continuing "SCHEDULED" paragraph, not a part of inline task

* Next heading

The parsed sexp will be
(heading
  (paragraph)
  (inlinetask)
  (paragraph))
(heading)
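
The same shape can be reproduced with a small Python sketch. This is not
how org-element works internally; it assumes the default
org-inlinetask-min-level of 15, flattens all ordinary headings to one
level, and ignores multi-line inlinetasks closed by an END line. The name
outline is hypothetical.

```python
import re

# Assumption: default value of org-inlinetask-min-level.
INLINETASK_MIN_LEVEL = 15


def outline(text):
    """Return a list of ["heading", child, ...] lists for `text`."""
    tree = []
    in_paragraph = False
    for line in text.splitlines():
        stars = re.match(r"(\*+) ", line)
        if stars and len(stars.group(1)) < INLINETASK_MIN_LEVEL:
            # Ordinary heading: starts a new entry.
            tree.append(["heading"])
            in_paragraph = False
        elif not tree:
            # Zeroth section is skipped in this sketch.
            continue
        elif stars:
            # One-line inlinetask: it owns this line only.
            tree[-1].append("inlinetask")
            in_paragraph = False
        elif not line.strip():
            # A blank line closes the current paragraph.
            in_paragraph = False
        elif not in_paragraph:
            # Any other text, including "SCHEDULED: ...", opens a paragraph.
            tree[-1].append("paragraph")
            in_paragraph = True
    return tree
```

The important branch is the inlinetask one: it resets in_paragraph, so
the SCHEDULED line that follows starts a fresh paragraph under the
enclosing heading instead of attaching to the inlinetask.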

>> If we mention this, we also need to elaborate what kind of element
>> #+todo: is, where it can be located, and how to parse multiple instances of
>> #+todo in the document.
>
> Yes. What I have written for laundry is that only #+todo: declarations
> that appear in the zeroth section will be applied (this is true for
> all document level configuration keywords). There is also a
> possibility that we might be able to support including #+todo:
> keywords (and #+link: definitions or similar) in further sections, but
> that they would only apply to headings that occur after that line in
> the file. Such behavior is likely to be confusing to users so probably
> best to only guarantee correct behavior if they are put in the zeroth
> section.
>
> The reason it is confusing/problematic is that there could be
> a #+todo: buried half way down a file, the buffer configuration is
> updated, and then a user can use keywords up the file in the elisp
> implementation. Another implementation that parses a file
> incrementally would not encounter the buried #+todo: keyword until
> after they have already emitted a heading, changing how a heading is
> parsed. There is a similar issue with the #+link: keyword.

That's why it was initially not included in the syntax document. If we
fall into this rabbit hole, we also need to describe things like
CATEGORY, PROPERTY, OPTIONS, PRIORITIES, SEQ_TODO, STARTUP,
TYP_TODO, etc.
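
For illustration, the "zeroth section only" rule that Tom describes could
be sketched as follows in Python. The function name is hypothetical, only
#+todo: is handled (not SEQ_TODO or TYP_TODO), and stripping fast-access
keys such as "(t)" is a simplifying assumption.

```python
import re


def zeroth_section_todo_keywords(text):
    """Collect #+todo: keywords that appear before the first heading."""
    keywords = []
    for line in text.splitlines():
        if re.match(r"\*+ ", line):
            # The first heading ends the zeroth section; later
            # #+todo: lines are deliberately not applied.
            break
        found = re.match(r"#\+todo:\s*(.*)", line, re.IGNORECASE)
        if found:
            for word in found.group(1).split():
                if word != "|":
                    # Drop a trailing "(t)"-style selector, if any.
                    keywords.append(re.sub(r"\(.*\)$", "", word))
    return keywords
```

An incremental parser can run this before emitting any heading, which is
exactly why a #+todo: buried further down the file is the problematic
case.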

>> > +All content following a heading that appears before the next heading
>> > +(regardless of the level of that next heading) is a section.
>>
>> Note that it is not true for one-line inline tasks.
>
> I'm not quite sure which part you are referring to here.

I only left the relevant part this time. Also, see the example above.
The inline task consists of a single line only. Nothing below is a part
of it.

> Let's look into how much work it will be and how disruptive it might
> be?  We are already changing to heading in the elisp so maybe now
> would be a good time to also change from section to segment?
> Alternatively we could start by updating the documentation and include
> a note that segments are currently called sections by org element?

Let's continue this in the new thread dedicated to renaming
section->segment.

> I've since come around on this. I think that we can make it consistent
> by thinking of the zeroth section as an invisible heading with zero
> asterisks at the start of a file. This is extremely useful for making
> org-transclusion work transparently with whole files. The only
> modification that I might suggest in the context of org-transclusion
> would be to disallow empty lines before the property drawer. This
> allows files to represent single sections (segments) which might be
> very useful for implementations that want to store sections in a
> database or something like that.

Again, let's move this to a separate thread.

>> I generally support this idea. Handling keywords in org-element is not
>> pretty. Having them in the parse tree would make things easier. However,
>> we again need to consider back-compatibility. I can imagine third-party
>> ox-* packages breaking if we make this change - we should double check
>> if we decide to change this.
>
> I'm happy to put in the time to submit code fixes for consumers of the
> API so we can make this change. I have usually limited my thinking
> about compatibility concerns to the document syntax and semantics but
> this made me realize that in terms of actual labor the API consumers
> are likely to be affected as well.

This is not as easy as just submitting patches... Anyway, let's move this
to a separate thread.

>> Yes, it is saner. However, our syntax document is supposed to be a
>> human-readable description of what org-element does. We cannot introduce
>> differences between the grammar document and the de-facto parser implementation.
>> This would defeat the purpose of providing a reference syntax - we will get
>> inconsistency between Emacs Org mode and external parsers.
>
> To achieve this can we have an implementation note for org element
> specifically? There shouldn't be any divergence between
> implementations if we get the abstract variant of this specified
> correctly, where correctly means "exactly matches org-element
> behavior."

If you refer to restructuring the syntax document without introducing
divergence with org-element, I am fine with such improvements. We
already tried something somewhat similar by referring to Elisp variables
in some cases.

> Another note that I think this difference is arising because I'm using
> a narrower definition for what counts as syntax while still wanting to
> specify that the resulting transformed ast should be the same.
>
> I think it could make the document more useful if we have examples of
> how to get to the same endpoint with slightly different decisions
> about surface syntax.

Sounds reasonable. The only thing I fear is making the document
too long. Of course, we can always put things in appendices if
necessary.

> One final note here is that part of my objective in this was to
> simplify the org-element implementation while opening the possibility
> for user defined keyword behavior.

I am not sure what you refer to.

>> Both :END: and :end: are supported by Org parser. What do you mean by
>> legacy?
>
> I seem to recall a statement that things like #+BEGIN_SRC and friends
> are being retained for legacy support. This is also related to a
> standardization conversation which we aren't quite ready to have,
> which is that for things like :end: and :END: the lowercase version is
> the "canonical" representation when normalizing a document (related to
> being able to specify levels of conformance for an org parser, namely
> that there is a level that would only accept fully normalized
> documents that i.e. use :end: and not :END:). The elisp implementation
> of course supports :END:, but I don't recall whether it falls into the
> same category as #+BEGIN_SRC being on legacy support and #+begin_src
> being the preferred version.

AFAIK, org-element is case-insensitive by default. The majority of
discussions related to this topic revolve around the case of
auto-inserted Org elements.

>> I disagree. Inlinetasks are a part of syntax de facto and they can be
>> encountered in Org documents in the wild. If you treat inlinetasks as
>> ordinary headings, things may be broken unpredictably during parsing.
>
> This comment in particular was about whether we talk about things
> beyond the surface syntax in this document and/or whether we move them
> to a section on semantics and transformations that are deeper than the
> surface syntax. I'm fine to keep this section in the document, but we
> should make it clear that it is not part of the surface syntax (this
> is also related to my question about property drawers and planning
> following an inline task being parsed as a heading above).

I am afraid that I cannot clearly understand what you are referring to
when saying surface syntax vs. semantics.

However, inlinetasks are different from headlines, despite being
sufficiently similar to create confusion. Probably Org is too good at
supporting inlinetasks and headings as if they are the same.

> I'm using the term syntax very narrowly here to refer specifically to
> the pure surface syntax. Inline tasks don't introduce any novel
> restrictions on syntax so they don't have to be implemented as part of
> the surface syntax, they are a reinterpretation of headings and
> otherwise follow all the usual rules such as not allowing new headings
> inside them etc.

As I mentioned earlier, inlinetasks do not always include everything
until the next heading/inlinetask as their section.

> The reason I bring this up is because when implementing an org parser
> we would like to communicate to developers which parts of this
> document should be implemented directly in the parser and which ones
> should be deferred to a later step. Inlinetasks are a good example of
> this because they are entirely consistent with regular old org syntax
> for headings, and can be implemented as a transformation on the ast
> for headings that have a level that is deeper than the inlinetask min
> level.

I am not sure what later step you are referring to.

> Said another way, we want to communicate that trying to introduce a
> node in an eBNF grammar for inline tasks is not a good idea because it
> makes org syntax extremely non-regular and breaks countless use cases
> that need nesting of headings beyond the inlinetask min level.

Do you mean that you imagine the first parsing step to be an eBNF grammar?
Why so?

>> Could you elaborate why grammars cannot track the indentation level?
>> AFAIU, if it were the case, Python would not be parseable.
>
> Python maintains a separate stack for handling leading whitespace.
> https://docs.python.org/3/reference/lexical_analysis.html#indentation
> Thus it is effectively tracked as part of the tokenizer which goes on
> to emit the indent and dedent tokens. However Org cannot take this
> approach because it allows much more permissive use of leading
> whitespace and in plain lists deals with a minimum dedent relative
> to the bullet which may itself be arbitrarily indented. I think I
> might be able to implement a stack that could track dedents like that
> in the tokenizer but I'm not 100% sure.
>
> Regardless, my (perhaps overly technical) point is that it is not
> something that can be done in the grammar, it must be done in the
> tokenizer, and the tokenizer would have to emit a control token that
> maps to the space between two characters in order for the dedent to
> be usable by the grammar.

AFAIK, the tokenizer is just a part of the parser. It may or may not be
separate from the grammar. AFAIU, lookahead grammars can be imagined as
using a tokenizer under the hood.
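
For reference, the Python-style approach Tom mentions, where the
tokenizer keeps a stack of indentation widths and emits INDENT/DEDENT
tokens, can be sketched like this. It is a simplified illustration
(spaces only, no error on inconsistent dedents), and it does not attempt
Org's plain-list rules; all names are made up.

```python
def indent_tokens(lines):
    """Turn leading whitespace into INDENT/DEDENT tokens using a stack."""
    stack = [0]  # indentation widths of the currently open blocks
    tokens = []
    for line in lines:
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the innermost block: open a new one.
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            # Shallower: close blocks until the widths agree.
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
    # Close whatever is still open at end of input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens
```

The grammar then only sees INDENT and DEDENT as ordinary tokens; the
context-sensitive bookkeeping stays in the stack.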

>> Yet, it is exactly what happens in Org. Malformed property drawers will
>> become ordinary drawers.
>
> Yes, but ideally a property drawer would only be defined by its
> location in a document and the use of :properties: to start the drawer
> rather than also be defined by the well-formedness of its
> contents. This would mean that we would have regular drawers, property
> drawers, and malformed property drawers that were recognizable by the
> parser. I have a sense that org-lint may already be doing this?

Org syntax is permissive. It can always be parsed without errors.
org-lint is merely catching common unwanted mistakes. I view org-lint as
an addition to the grammar. Making the linter a part of the grammar will
complicate things even more than what we have now.

>> How would you define entities object then? First/second pass is an
>> implementation detail. Our current description follows how org-element
>> handles entities.
>
> At the level of the syntax there is no pure entity object. At the
> level of semantics (deeper pass) there is. My objective here is to
> create a syntax that is invariant to a long and changeable list of
> entities. Imagine that a user wants to add a new custom entity, they
> need to be able to do that without changing org syntax and in the
> laundry case having to recompile the whole parser.
>
> One way that I think about the distinction is that the syntax is the
> subset of things that you cannot change at runtime. Of course in emacs
> you can change almost everything at runtime so by convention we have
> to pick which things we declare to be part of an immutable concrete
> syntax.
>
> With that context, the way I would define entities is as
> entity-fragment objects where the name is contained in the entities
> list. Note that this could lead to a slight change of interpretation
> for something like \alpha[] which needs to be explored. I did some
> experiments with it but don't remember the results.

AFAIK, the current version of the syntax document is trying hard to
restrict itself to a fixed grammar that does not change at runtime. That's
why we provide default values of runtime-customizable variables.
Generalising the entity syntax will require changes in the org-element
parser and would be better discussed in a separate thread.

>> I am not sure if it is needed. We can already do \vert
>
> This should be a side thread, likely started by a working
> implementation. Some immediate thoughts are recorded here.
>
> \vert breaks cases where you want the table to also be data, for
> example I wanted to create a table that had various syntactic elements
> such as =|= in cells and rows and I wanted to be able to ctrl-f for
> =|= in the table. \vert breaks this case and it is quite confusing if
> you need the exact character for clarity in developer
> documentation. Here is an example of the table and me trying with
> macros to work around the issue
> https://github.com/tgbugs/sxpyr/blob/master/docs/sexp.org#reading-behavior
>
> There is an additional point here which is that the restriction on =|=
> has nothing to do with surface syntax at all in the elisp
> implementation due to the order in which macros are resolved relative
> to table elements. Clarifying how macros interact (or hopefully do not
> interact) with other parts of syntax should probably be included at
> some point.

Sounds reasonable and it is also not covered by our escaping mechanisms
in Org. So, let's discuss it in a separate thread.

>> That's not accurate. You cannot nest, say, bold inside bold. You cannot
>> put code inside any other markup freely: consider *bold =asd*asd= not bold*
>
> I think it is accurate. I've tested this fairly extensively for my
> laundry implementation to match the org export behavior. Arbitrary
> nesting of those 4 is supported and the other 2 can be at the bottom
> of any level.
>
> I see *bold =asd*asd= bold* for ox-html/ox-latex and for font locking.

Sorry, my example was wrong. I was referring to
 *bold =asd* asd= bold*

> You can also have ******bold****** and it renders the same as *bold*.

Yes. The key word is "renders". The actual bold object has all the
inner * chars.

> Consider these monstrosities as well:
>  *b /i _u +s =v /*_+lol+_*/= ~c /*_+lol+_*/~ s+ u_ i/ b*
>  */_+bius+_ _+bius+_ bi/*

To clarify, Org does support emphasis nesting as long as the emphasis
spans do not intersect and as long as the same type of emphasis is not
nested inside itself.
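
Stated as a check over already-recognised emphasis spans, the rule could
be sketched as below. This is a hypothetical helper, not part of
org-element; spans are assumed to be (marker, start, end) tuples with an
exclusive end.

```python
def valid_emphasis_nesting(spans):
    """Check that spans never intersect and never nest in the same type."""
    for i, (marker1, start1, end1) in enumerate(spans):
        for marker2, start2, end2 in spans[i + 1:]:
            if end1 <= start2 or end2 <= start1:
                continue  # disjoint spans are always fine
            nested = (start1 <= start2 and end2 <= end1) or (
                start2 <= start1 and end1 <= end2
            )
            if not nested:
                return False  # partial overlap: the spans intersect
            if marker1 == marker2:
                return False  # e.g. bold directly or indirectly inside bold
    return True
```

Italic inside bold passes, while bold inside bold or a bold span that
ends in the middle of a verbatim span is rejected.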

Best,
Ihor
