Re: Org Syntax Specification

emacs-orgmode@gnu.org archives
 help / color / mirror / code / Atom feed

From: Tom Gillespie <tgbugs@gmail.com>
To: Ihor Radchenko <yantar92@gmail.com>
Cc: org-mode-email <emacs-orgmode@gnu.org>,
	Nicolas Goaziou <mail@nicolasgoaziou.fr>,
	Timothy <tecosaur@gmail.com>
Subject: Re: Org Syntax Specification
Date: Tue, 18 Jan 2022 20:22:31 -0500	[thread overview]
Message-ID: <CA+G3_PMybCd+xd8RkRbC707uMLMDHf38LUWkEyYV10vZv8L6Sw@mail.gmail.com> (raw)
In-Reply-To: <87r195nt2g.fsf@localhost>

Hi Ihor,
  Thank you very much for the detailed responses. Let me start with
some context.

1. A number of the comments that I made fall into the brainstorming
   category, so they don't need to make their way into the document at
   this time. I agree that it is critical for this document to capture
   how org is parsed right now and that we should not put the
   pie-in-the-sky changes in until the behavior of org-element matches
   (if such a change is made at all).
2. Though I haven't been hacking on it, I fully intend to contribute
   test cases and exploratory work on org-element in the future, so
   please don't interpret some of what I am writing as requests for
   other people to write code (unless they want to :)
3. When I say grammar in this context I mean specifically an eBNF that
   generates a LALR(1) or LR(1) parser. This is narrower than the
   definition used in the document, which includes things that have to
   be implemented in the tokenizer, or in a pass after the grammar has
   been applied, or are related to some other aspect beyond the pure
   surface syntax.
4. A number of my comments are about the structure of the document
   more than the structure of the syntax or the implementation. I
   think that most of them are trying to ask whether we want to
   clearly delineate pure surface syntax from semantics to make the
   document easier to understand.

More replies in line.
Best!
Tom

> As for your other comments, you seem to be suggesting a number of
> changes to the existing Org syntax. Some of them looks fine, some are
> not. However, please keep in mind that we have to deal with back
> compatibility, third party compatibility, and not breaking existing Org
> documents unless we have a very strong justification. I suggest to
> branch a number of new threads from here for each concrete suggestion
> where you want to make changes to Org syntax, as opposed to just
> document wording. Otherwise, this discussion will become a total mess.

Agreed. I put many of these in here as notes from my experiences, I
will branch those off into separate discussions so that we don't
pollute this thread.

> Nope. Sections are actually elements. See =org-element-all-elements=.

I realized this at a slightly later date but missed cleaning up this
comment.  See my response on section vs segment below.

> I disagree. Nesting rules are the important part of syntax. We have
> restrictions on what elements can be inside other element. The same
> patterns are not recognised in Org depending on their nesting. For
> example, links that you put into property drawers are not considered
> link objects.

When I wrote this comment I was still confused about sections.I think
discussion of nesting in most contexts is ok, but there are some case
where nesting cannot be determined from the grammar, and there I think
we need to make a distinction.

In my thinking I separate the context sensitive nature of parsing from
the nesting structure of the resulting sexpressions, org elements,
etc.The most obvious example of this is that the sexpression
representation for headings nests based on the level of the heading,
but heading level cannot be determined by the grammar so it must be
reconstructed from a flat sequence of headings that have varying level.

> Again I disagree. While your idea about table cells is reasonable
> (similar for citation-references inside citations), I am against
> decoupling Org syntax from org-element implementation. In
> org-element.el, table-cells are just yet another object. If we make
> things in org-element and syntax document out of sync, confusion and
> errors will follow during future maintenance.

Org element treats all elements and objects as a single homogenous
type.  This is fine. However, to help people understand the syntax it
seems easier to define things in a positive way so that we don't say
"all except these two."  Therefore, despite the fact that the
implementation of org-element treats table rows and cells no different
from any other node in the parse tree, we don't need to burden the
reader with that information at this point in time, and could provide
that information as an implementation note for cells.  I think the
other issue I was having here is that the spec for tables is spread
allover the place, and it would be much easier to understand and
implement ifit were all in one place.

> This actually reads slightly confusing. "Blank lines separate paragraphs
> and other elements" sounds like blank lines are only relevant
> before/after paragraphs. However, there are also footnote references and
> lists. Maybe we can try something like:
>
> Blank lines can be used to indicate end of some elements.
>
> "can" because a single blank line usually does not separate anything.

I think your version is quite a bit more readable.  Can we list the
set of all the elements that can be ended by a new lineas well as
those that cannot (iirc they are elements such as footnotes that can
only be ended by a double blank line or a heading)?

> Then where can we put it? This is one of the tricky conventions we use
> in the parser.

After discussing with Timothy, I realized that I totally missed the
greater/lesser parallelism between blocks and elements. I'll see if I
can come up with some wording that will avoid similar confusion for
other readers.

> I am not sure here. Inline tasks are special because a one-line inline
> task must not contain any text below, cannot have planning or
> properties.

Then they are no longer inline tasks, but instead parse as headings, correct?

> > +  contains =TODO= and =DONE=, however org-todo-keywords-1 is a buffer local
> > +  variable and can be set by users in an org file using =#+todo:=.].
>
> If we mention this, we also need to elaborate kind of element is
> #+todo:, where it can be located, and how to parse multiple instances of
> #+todo in the document.

Yes. What I have written for laundry is that only #+todo: declarations
that appear in the zeroth section will be applied (this is true for
all document level configuration keywords). There is also a
possibility that we might be able to support including #+todo:
keywords (and #+link: definitions or similar) in further sections, but
that they would only apply to headings that occur after that line in
the file. Such behavior is likely to be confusing to users so probably
best to only guarantee correct behavior if they are put in the zeroth
section.

The reason it is confusing/problematic is that there could be
a #+todo: buried half way down a file, the buffer configuration is
updated, and then a user can use keywords up the file in the elisp
implementation. Another implementation that parses a file
incrementally would not encounter the buried #+todo: keyword until
after they have already emitted a heading,changing how a heading is
parsed. There is a similar issue with the #+link: keyword.

> > -A heading contains directly one section (optionally), followed by
> > -any number of deeper level headings.
> > +The level of a heading can be used to construct a nested structure.
> > +All content following a heading that appears before the next heading
> > +(regardless of the level of that next heading) is a section. In addition,
> > +text before the first heading in an org document is also a section.
>
> Note that it is not true for one-line inline tasks.

I'm not quite sure which part you are referring to here.

> Sounds reasonable. However, we may also need to make this change in
> Elisp level, which is tricky when you think about
> backward-compatibility.

Let's look into how much work it will be and how disruptive it might
be?  We are already changing to heading in the elisp so maybe now
would be a good time to also change from section to segment?
Alternatively we could start by updating the documentation and include
a note that segments are currently called sections by org element?

> The statement about property drawers in first section (that how we refer
> to it in org-element) is correct. First section and its property drawer
> location is special.
>
> I agree that it's inconsistent with normal property drawers. However, we
> cannot change it without breaking existing Org files. It we decide to
> change syntax in this area, we should think carefully about possible
> consequences.

I've since come around on this. I think that we can make it consistent
by thinking of the zeroth section as an invisible heading with zero
asterisks at the start of a file. This is extremely useful for making
org-transclusion work transparently with whole files. The only
modification that I might suggest in the context of org-transclusion
would be to disallow empty lines before the property drawer. This
allows files to represent single sections (segments) which might be
very useful for implementations that want to store sections in a
database or something like that.

> I generally support this idea. Handling keywords in org-element is not
> pretty. Having them in the parse tree would make things easier. However,
> we again need to consider back-compatibility. I can imagine third-party
> ox-* packages breaking if we make this change - we should double check
> if we decide to change this.

I'm happy to put in the time to submit code fixes for consumers of the
API so we can make this change. I have usually limited my thinking
about compatibility concerts to the document syntax and semantics but
this made me realize that in terms of actual labor the API consumers
are likely to be affected as well.

> Yes, it is saner. However, our syntax document is supposed to be
> human-readable description of what org-element does. We cannot introduce
> differences between grammar document and de-facto parser implementation.
> This will defeat the purpose to providing reference syntax - we will get
> inconsistency between Emacs Org mode and external parsers.

To achieve this can we have an implementation note for org element
specifically? There shouldn't be any divergence between
implementations if we get the abstract variant of this specified
correctly, where correctly means "exactly matches org-element
behavior."

Another note that I think this difference is arising because I'm using
a narrower definition for what counts as syntax while still wanting to
specify that the resulting transformed ast should be the same.

I think it could make the document more useful if we have examples of
how to get to the same endpoint with slightly different decisions
about surface syntax.

One final note here is that part of my objective in this was to
simplify the org-element implementation while opening the possibility
for user defined keyword behavior. You of course are the expert on
org-element so my thinking may very well be misguided on this
point. This is another area where I would be happy to contribute when
the time comes.

> Both :END: and :end: are supported by Org parser. What do you mean by
> legacy?

I seem to recall a statement that things like #+BEGIN_SRC and friends
being retained for legacy support. This is also related to a
standardization conversation which we aren't quite ready to have,
which is that for things like :end: and :END: the lowercase version is
the "canonical" representation when normalizing a document (related to
being able to specify levels of conformance for an org parser, namely
that there is a level that would only accept fully normalized
documents that i.e. use :end: and not :END:). The elisp implementation
of course supports :END:, but I don't recall whether it falls into the
same category as #+BEGIN_SRC being on legacy support and #+begin_src
being the preferred version.

> I disagree. inilinetasks are a part of syntax de facto and they can be
> encountered in Org documents in the wild. If you treat inlinetasks as
> ordinary headings, things may be broken unpredictably during parsing.

This comment in particular was about whether we talk about things
beyond the surface syntax in this document and/or whether we move them
to a section on semantics and transformations that are deeper than the
surface syntax. I'm fine to keep this section in the document, but we
should make it clear that it is not part of the surface syntax (this
is also related to my question about property drawers and planning
following an inline task being parsed as a heading above).

I'm using the term syntax very narrowly here to refer specifically to
the pure surface syntax. Inline tasks don't introduce any novel
restrictions on syntax so they don't have to be implemented as part of
the surface syntax, they are a reinterpretation of a headings and
otherwise follow all the usual rules such as not allowing new headings
inside them etc.

The reason I bring this up is because when implementing an org parser
we would like to communicate to developers which parts of this
document should be implemented directly in the parser and which ones
should be deferred to a later step. Inlinetasks are a good example of
this because they are entirely consistent with regular old org syntax
for headings, and can be implemented as a transformation on the ast
for headings that have a level that is deeper than the inlinetask min
level.

Said another way, we want to communicate that trying to introduce a
node in an eBNF grammar for inline tasks is not a good idea because it
makes org syntax extremely non-regular and breaks countless use cases
that need nesting of headings beyond the inlinetask min level.

> Instead, we may consider making inlinetask level constant.

I don't think this is necessary, or at least is orthogonal to my
concerns.

> Could you elaborate why grammars cannot track the indentation level?
> AFAIU, If it were the case, python would not be parseable.

Python maintains a separate stack for handling leading whitespace.
https://docs.python.org/3/reference/lexical_analysis.html#indentation
Thus it is effectively tracked as part of the tokenizer which goes on
to emit the indent and dedent tokens. However Org cannot take this
approach because it allows much more permissive use of leading
whitespace and in plain lists deals with a minimum deindent relative
to the bullet which may itself be arbitrarily indented. I think I
might be able to implement a stack that could track deinents like that
in the tokenizer but I'm not 100% sure.

Regardless, my (perhaps overly technical point) is that it is not
something that can be done in the grammar, it must be done in the
tokenizer, and the tokenizer would have to emit a control token that
maps to the space between two characters in order for the deinent to
be usable by the grammar.

Somehow this reminds me that I need to check on the behavior of spaces
vs tabs for plain lists (joy).

> Yet, it is exactly what happens in Org. malformed property drawers will
> become ordinary drawers.

Yes, but ideally a property drawer would only be defined by its
location in a document and the use of :properties: to start the drawer
rather than also be defined by the well-formedness of its
contents. This would mean that we would have regular drawers, property
drawers, and malformed property drawers that were recognizable by the
parser. I have a sense that org-lint may already be doing this?

> > +PLANNING must directly follow HEADING without any blank lines in between.
> > +
> > + [fn::Need a spec for how to handle multiple instances of the same keyword with different values.]
>
> The last one wins (as in org-element-planning-parser)

Perfect.

> How would you define entities object then? First/second pass is an
> implementation detail. Our current description follows how org-element
> handles entities.

At the level of the syntax there is no pure entity object. At the
level of semantics (deeper pass) there is. My objective here is to
create a syntax that is invariant to a long and changeable list of
entities. Imagine that a user wants to add a new custom entity, they
need to be able to do that without changing org syntax and in the
laundry case having to recompile the whole parser.

One way that I think about the distinction is that the syntax is the
subset of things that you cannot change at runtime. Of course in emacs
you can change almost everything at runtime so by convention we have
to pick which things we declare to be part of an immutable concrete
syntax.

With that context, the way I would define entities is as
entity-fragment objects where the name is contained in the entities
list. Note that this could lead to a slight change of interpretation
for something like \alpha[] which needs to be explored. I did some
experiments with it but don't remember the results.

> While I am not opposing the idea, your principle is not followed by
> org-element parser. We may consider changing it, but it is again a whole
> separate discussion where we need to consider pros and cons.

I agree this is one of the deeper discussions that we need to have in
a separate place (consolidating some of my earlier points from the
thread on intra-word markup). I'm happy to work on the changes to org
element to make this possible.

> Do not look at font-locking. You can safely consider that fontification
> is wrong in all non-trivial cases. Always check org-element-at-point and
> org-element-context.

We are in agreement here. This was more of a note for me to check back
in on the behavior because my brain thought that [fn::asdf] could not
start a line but that may not be correct.

> I am not sure if it is needed. We can already to \vert

This should be a side thread, likely started by a working
implementation.Some immediate thoughts are recorded here.

\vert breaks cases where you want the table to also be data, for
example I wanted to create a table that had various syntactic elements
such as =|= in cells and rows and I wanted to be able to ctrl-f for
=|= in the table. \vert breaks this case and it is quite confusing if
you need the exact character for clarity in developer
documentation. Here is an example of the table and me trying with
macros to work around the issue
https://github.com/tgbugs/sxpyr/blob/master/docs/sexp.org#reading-behavior

There is an additional point here which is that the restriction on =|=
has nothing to do with surface syntax at all in the elisp
implementation due to the order in which macros are resolved relative
to table elements. Clarifying how macros interact (or hopefully do not
interact) with other parts of syntax should probably be included at
some point.

> That would be welcome, but someone™ should implement timezone support in
> Elisp level. We have several discussions about this in the past.

Definitely on my list. I have the proposed extensions implemented in
laundry that I can use as a guide.

> That's not accurate. you cannot nest, say, bold inside bold. You cannot
> put code inside any other markup freely: consider *bold =asd*asd= not bold*

I think it is accurate. I've tested this fairly extensively for my
laundry implementation to match the org export behavior. Arbitrary
nesting of those 4 is supported and the other 2 can be at the bottom
of any level.

I see *bold =asd*asd= bold* for ox-html/ox-latex and for font locking.

You can also have ******bold****** and it renders the same as *bold*.

Consider these monstrosities as well:
 *b /i _u +s =v /*_+lol+_*/= ~c /*_+lol+_*/~ s+ u_ i/ b*
 */_+bius+_ _+bius+_ bi/*

next prev parent reply	other threads:[~2022-01-19  1:33 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-09 18:02 Org Syntax Specification Timothy
2022-01-15 12:40 ` Sébastien Miquel
2022-01-15 16:36   ` Depreciating TeX-style LaTeX fragments (was: Org Syntax Specification) Timothy
2022-01-16  8:08     ` Sébastien Miquel
2022-01-16  9:23       ` Depreciating TeX-style LaTeX fragments Martin Steffen
2022-01-16  9:46       ` Colin Baxter 😺
2022-01-16 11:11         ` Tim Cross
2022-01-16 13:26         ` Juan Manuel Macías
2022-01-16 14:43           ` Colin Baxter 😺
2022-01-16 15:16             ` Greg Minshall
2022-01-16 17:45         ` Rudolf Adamkovič
2022-01-16 12:10     ` Eric S Fraga
2022-01-16 14:30       ` Anthony Cowley
2022-01-18  0:54 ` Org Syntax Specification Tom Gillespie
2022-01-18 12:09   ` Ihor Radchenko
2022-01-19  1:22     ` Tom Gillespie [this message]
2022-01-19 11:58       ` Ihor Radchenko
2022-09-25  9:09 ` Bastien
2022-09-25 21:28   ` Rohit Patnaik
2022-11-26  2:41   ` Ihor Radchenko
2022-11-26  6:24     ` Bastien
2022-11-26  6:05   ` Ihor Radchenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.orgmode.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+G3_PMybCd+xd8RkRbC707uMLMDHf38LUWkEyYV10vZv8L6Sw@mail.gmail.com \
    --to=tgbugs@gmail.com \
    --cc=emacs-orgmode@gnu.org \
    --cc=mail@nicolasgoaziou.fr \
    --cc=tecosaur@gmail.com \
    --cc=yantar92@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).