From: Tom Gillespie
Date: Tue, 18 Jan 2022 20:22:31 -0500
Subject: Re: Org Syntax Specification
To: Ihor Radchenko
Cc: org-mode-email, Nicolas Goaziou, Timothy
References: <871r1g936z.fsf@gmail.com> <87r195nt2g.fsf@localhost>
In-Reply-To: <87r195nt2g.fsf@localhost>

Hi Ihor,

Thank you very much for the detailed responses.
Let me start with some context.

1. A number of the comments that I made fall into the brainstorming
   category, so they don't need to make their way into the document
   at this time. I agree that it is critical for this document to
   capture how org is parsed right now and that we should not put the
   pie-in-the-sky changes in until the behavior of org-element matches
   (if such a change is made at all).
2. Though I haven't been hacking on it, I fully intend to contribute
   test cases and exploratory work on org-element in the future, so
   please don't interpret some of what I am writing as requests for
   other people to write code (unless they want to :)
3. When I say grammar in this context I mean specifically an eBNF
   that generates a LALR(1) or LR(1) parser. This is narrower than
   the definition used in the document, which includes things that
   have to be implemented in the tokenizer, or in a pass after the
   grammar has been applied, or are related to some other aspect
   beyond the pure surface syntax.
4. A number of my comments are about the structure of the document
   more than the structure of the syntax or the implementation.
   I think that most of them are trying to ask whether we want to
   clearly delineate pure surface syntax from semantics to make the
   document easier to understand.

More replies in line. Best!
Tom

> As for your other comments, you seem to be suggesting a number of
> changes to the existing Org syntax. Some of them looks fine, some are
> not. However, please keep in mind that we have to deal with back
> compatibility, third party compatibility, and not breaking existing Org
> documents unless we have a very strong justification. I suggest to
> branch a number of new threads from here for each concrete suggestion
> where you want to make changes to Org syntax, as opposed to just
> document wording. Otherwise, this discussion will become a total mess.

Agreed. I put many of these in here as notes from my experiences; I
will branch those off into separate discussions so that we don't
pollute this thread.

> Nope. Sections are actually elements. See =org-element-all-elements=.

I realized this at a slightly later date but missed cleaning up this
comment. See my response on section vs segment below.

> I disagree. Nesting rules are the important part of syntax. We have
> restrictions on what elements can be inside other element. The same
> patterns are not recognised in Org depending on their nesting. For
> example, links that you put into property drawers are not considered
> link objects.

When I wrote this comment I was still confused about sections. I think
discussion of nesting in most contexts is ok, but there are some cases
where nesting cannot be determined from the grammar, and there I think
we need to make a distinction.
In my thinking I separate the context sensitive nature of parsing from
the nesting structure of the resulting sexpressions, org elements, etc.
The most obvious example of this is that the sexpression representation
for headings nests based on the level of the heading, but heading level
cannot be determined by the grammar so it must be reconstructed from
a flat sequence of headings that have varying level.

> Again I disagree. While your idea about table cells is reasonable
> (similar for citation-references inside citations), I am against
> decoupling Org syntax from org-element implementation. In
> org-element.el, table-cells are just yet another object. If we make
> things in org-element and syntax document out of sync, confusion and
> errors will follow during future maintenance.

Org element treats all elements and objects as a single homogeneous
type. This is fine. However, to help people understand the syntax
it seems easier to define things in a positive way so that we don't
say "all except these two." Therefore, despite the fact that the
implementation of org-element treats table rows and cells no
differently from any other node in the parse tree, we don't need to
burden the reader with that information at this point in time, and
could provide that information as an implementation note for cells.

I think the other issue I was having here is that the spec for
tables is spread all over the place, and it would be much easier to
understand and implement if it were all in one place.

> This actually reads slightly confusing. "Blank lines separate paragraphs
> and other elements" sounds like blank lines are only relevant
> before/after paragraphs. However, there are also footnote references and
> lists. Maybe we can try something like:
>
> Blank lines can be used to indicate end of some elements.
>
> "can" because a single blank line usually does not separate anything.

I think your version is quite a bit more readable.
Can we list the set of all the elements that can be ended by a new line
as well as those that cannot (iirc they are elements such as footnotes
that can only be ended by a double blank line or a heading)?

> Then where can we put it? This is one of the tricky conventions we use
> in the parser.

After discussing with Timothy, I realized that I totally missed the
greater/lesser parallelism between blocks and elements. I'll see
if I can come up with some wording that will avoid similar confusion
for other readers.

> I am not sure here. Inline tasks are special because a one-line inline
> task must not contain any text below, cannot have planning or
> properties.

Then they are no longer inline tasks, but instead parse as headings,
correct?

> > + contains =TODO= and =DONE=, however org-todo-keywords-1 is a buffer local
> > + variable and can be set by users in an org file using =#+todo:=.].
>
> If we mention this, we also need to elaborate kind of element is
> #+todo:, where it can be located, and how to parse multiple instances of
> #+todo in the document.

Yes. What I have written for laundry is that only #+todo: declarations
that appear in the zeroth section will be applied (this is true for all
document level configuration keywords). There is also a possibility that
we might be able to support including #+todo: keywords (and #+link:
definitions or similar) in further sections, but that they would only
apply to headings that occur after that line in the file. Such behavior
is likely to be confusing to users, so it is probably best to only
guarantee correct behavior if they are put in the zeroth section.

The reason it is confusing/problematic is that there could be a #+todo:
buried half way down a file, the buffer configuration is updated, and
then a user can use those keywords further up the file in the elisp
implementation.
Another implementation that parses a file incrementally would not
encounter the buried #+todo: keyword until after it has already
emitted a heading, changing how a heading is parsed. There is a similar
issue with the #+link: keyword.

> > -A heading contains directly one section (optionally), followed by
> > -any number of deeper level headings.
> > +The level of a heading can be used to construct a nested structure.
> > +All content following a heading that appears before the next heading
> > +(regardless of the level of that next heading) is a section. In addition,
> > +text before the first heading in an org document is also a section.
>
> Note that it is not true for one-line inline tasks.

I'm not quite sure which part you are referring to here.

> Sounds reasonable. However, we may also need to make this change in
> Elisp level, which is tricky when you think about
> backward-compatibility.

Let's look into how much work it will be and how disruptive it might be.
We are already changing to heading in the elisp so maybe now would be
a good time to also change from section to segment? Alternatively
we could start by updating the documentation and include a note that
segments are currently called sections by org element?

> The statement about property drawers in first section (that how we refer
> to it in org-element) is correct. First section and its property drawer
> location is special.
>
> I agree that it's inconsistent with normal property drawers. However, we
> cannot change it without breaking existing Org files. It we decide to
> change syntax in this area, we should think carefully about possible
> consequences.

I've since come around on this. I think that we can make it consistent
by thinking of the zeroth section as an invisible heading with zero
asterisks at the start of a file. This is extremely useful for making
org-transclusion work transparently with whole files.
The only modification that I might suggest in the context of
org-transclusion would be to disallow empty lines before the property
drawer. This allows files to represent single sections (segments)
which might be very useful for implementations that want to store
sections in a database or something like that.

> I generally support this idea. Handling keywords in org-element is not
> pretty. Having them in the parse tree would make things easier. However,
> we again need to consider back-compatibility. I can imagine third-party
> ox-* packages breaking if we make this change - we should double check
> if we decide to change this.

I'm happy to put in the time to submit code fixes for consumers of
the API so we can make this change. I have usually limited my thinking
about compatibility concerns to the document syntax and semantics
but this made me realize that in terms of actual labor the API
consumers are likely to be affected as well.

> Yes, it is saner. However, our syntax document is supposed to be
> human-readable description of what org-element does. We cannot introduce
> differences between grammar document and de-facto parser implementation.
> This will defeat the purpose to providing reference syntax - we will get
> inconsistency between Emacs Org mode and external parsers.

To achieve this can we have an implementation note for org element
specifically? There shouldn't be any divergence between implementations
if we get the abstract variant of this specified correctly, where
correctly means "exactly matches org-element behavior."

Another note: I think this difference is arising because I'm using
a narrower definition for what counts as syntax while still wanting to
specify that the resulting transformed ast should be the same. I think
it could make the document more useful if we have examples of how to
get to the same endpoint with slightly different decisions about
surface syntax.
One final note here is that part of my objective in this was to simplify
the org-element implementation while opening the possibility for
user defined keyword behavior. You of course are the expert on
org-element so my thinking may very well be misguided on this
point. This is another area where I would be happy to contribute
when the time comes.

> Both :END: and :end: are supported by Org parser. What do you mean by
> legacy?

I seem to recall a statement that things like #+BEGIN_SRC and friends
are being retained for legacy support. This is also related to a
standardization conversation which we aren't quite ready to have, which
is that for things like :end: and :END: the lowercase version is the
"canonical" representation when normalizing a document (related to
being able to specify levels of conformance for an org parser, namely
that there is a level that would only accept fully normalized documents
that, i.e., use :end: and not :END:). The elisp implementation of
course supports :END:, but I don't recall whether it falls into the
same category as #+BEGIN_SRC being on legacy support and #+begin_src
being the preferred version.

> I disagree. inilinetasks are a part of syntax de facto and they can be
> encountered in Org documents in the wild. If you treat inlinetasks as
> ordinary headings, things may be broken unpredictably during parsing.

This comment in particular was about whether we talk about things
beyond the surface syntax in this document and/or whether we move
them to a section on semantics and transformations that are deeper
than the surface syntax. I'm fine to keep this section in the document,
but we should make it clear that it is not part of the surface syntax
(this is also related to my question about property drawers and planning
following an inline task being parsed as a heading above). I'm using
the term syntax very narrowly here to refer specifically to the
pure surface syntax.
Inline tasks don't introduce any novel restrictions on syntax so they
don't have to be implemented as part of the surface syntax; they are
a reinterpretation of a heading and otherwise follow all the usual
rules such as not allowing new headings inside them etc.

The reason I bring this up is because when implementing an org parser
we would like to communicate to developers which parts of this document
should be implemented directly in the parser and which ones should be
deferred to a later step. Inlinetasks are a good example of this because
they are entirely consistent with regular old org syntax for headings,
and can be implemented as a transformation on the ast for headings that
have a level that is deeper than the inlinetask min level.

Said another way, we want to communicate that trying to introduce a node
in an eBNF grammar for inline tasks is not a good idea because it makes
org syntax extremely non-regular and breaks countless use cases that
need nesting of headings beyond the inlinetask min level.

> Instead, we may consider making inlinetask level constant.

I don't think this is necessary, or at least it is orthogonal to my
concerns.

> Could you elaborate why grammars cannot track the indentation level?
> AFAIU, If it were the case, python would not be parseable.

Python maintains a separate stack for handling leading whitespace.
https://docs.python.org/3/reference/lexical_analysis.html#indentation
Thus it is effectively tracked as part of the tokenizer which goes on
to emit the indent and dedent tokens. However Org cannot take this
approach because it allows much more permissive use of leading
whitespace and in plain lists deals with a minimum deindent relative
to the bullet which may itself be arbitrarily indented. I think I
might be able to implement a stack that could track deindents like
that in the tokenizer but I'm not 100% sure.
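
For reference, the stack-based scheme from the Python documentation
linked above can be sketched roughly as follows. This is only an
illustration under simplifying assumptions (spaces only, no tabs, no
line continuations). The function name tokenize_indentation and the
token names are made up for this example and are not taken from
CPython, org-element, or laundry. It shows that the scheme needs every
dedent to land exactly on a width that was seen before, which, as noted
above, Org's more permissive whitespace handling does not guarantee.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]


def tokenize_indentation(lines: List[str]) -> Iterator[Token]:
    """Yield INDENT/DEDENT/LINE tokens using a stack of indentation widths.

    Hypothetical helper for illustration; assumes space-only indentation.
    """
    stack = [0]  # widths of the currently open indentation levels
    for raw in lines:
        text = raw.lstrip(" ")
        if not text:
            continue  # blank lines do not affect indentation
        width = len(raw) - len(text)
        if width > stack[-1]:
            # Deeper than the current level: open a new level.
            stack.append(width)
            yield ("INDENT", "")
        else:
            # Shallower or equal: close levels until the widths match.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                # The new width was never pushed, so it matches no open level.
                raise ValueError(f"inconsistent dedent: {raw!r}")
        yield ("LINE", text)
    # Close any levels still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

For the input lines "a", "  b", "    c", "d" this yields LINE, INDENT,
LINE, INDENT, LINE, DEDENT, DEDENT, LINE. A dedent to a width that was
never pushed raises an error, and that is exactly the case a plain list
tokenizer would have to handle differently.
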
Regardless, my (perhaps overly technical) point is that it is not
something that can be done in the grammar, it must be done in the
tokenizer, and the tokenizer would have to emit a control token
that maps to the space between two characters in order for the
deindent to be usable by the grammar.

Somehow this reminds me that I need to check on the behavior of
spaces vs tabs for plain lists (joy).

> Yet, it is exactly what happens in Org. malformed property drawers will
> become ordinary drawers.

Yes, but ideally a property drawer would only be defined by its
location in a document and the use of :properties: to start the
drawer rather than also be defined by the well-formedness of its
contents. This would mean that we would have regular drawers,
property drawers, and malformed property drawers that were
recognizable by the parser. I have a sense that org-lint may
already be doing this?

> > +PLANNING must directly follow HEADING without any blank lines in between.
> > +
> > + [fn::Need a spec for how to handle multiple instances of the same keyword with different values.]
>
> The last one wins (as in org-element-planning-parser)

Perfect.

> How would you define entities object then? First/second pass is an
> implementation detail. Our current description follows how org-element
> handles entities.

At the level of the syntax there is no pure entity object. At the
level of semantics (deeper pass) there is. My objective here is to
create a syntax that is invariant to a long and changeable list of
entities. Imagine that a user wants to add a new custom entity: they
need to be able to do that without changing org syntax and, in the
laundry case, without having to recompile the whole parser.

One way that I think about the distinction is that the syntax is the
subset of things that you cannot change at runtime. Of course in
emacs you can change almost everything at runtime so by convention
we have to pick which things we declare to be part of an immutable
concrete syntax.
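
A rough sketch of that split, with hypothetical names (parse_fragments,
resolve_entities) and a deliberately simplified pattern that ignores
the real rules for what may follow an entity name. The first pass only
recognizes the backslash-plus-name shape, and the entity table is
consulted in a later pass, so the table can change at runtime without
touching the first pass. This is not how org-element or laundry are
implemented.

```python
import re
from typing import Dict, List, Tuple

Node = Tuple[str, str]

# Syntax-level pattern: a backslash followed by letters. It does not
# depend on which entity names are currently defined. (Simplified
# assumption; the real syntax has extra conditions on what follows.)
FRAGMENT_RE = re.compile(r"\\([A-Za-z]+)")


def parse_fragments(text: str) -> List[Node]:
    """First pass: split text into ("text", s) and ("fragment", name) nodes."""
    nodes: List[Node] = []
    pos = 0
    for m in FRAGMENT_RE.finditer(text):
        if m.start() > pos:
            nodes.append(("text", text[pos:m.start()]))
        nodes.append(("fragment", m.group(1)))
        pos = m.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes


def resolve_entities(nodes: List[Node], entities: Dict[str, str]) -> List[Node]:
    """Second pass: fragments whose name is in the table become entities."""
    out: List[Node] = []
    for kind, value in nodes:
        if kind == "fragment" and value in entities:
            out.append(("entity", entities[value]))
        else:
            out.append((kind, value))
    return out
```

Adding a custom entity then only means adding a key to the table passed
to the second pass; the output of the first pass for a given text stays
the same.
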
With that context, the way I would define entities is as entity-fragment
objects where the name is contained in the entities list. Note that
this could lead to a slight change of interpretation for something
like \alpha[] which needs to be explored. I did some experiments
with it but don't remember the results.

> While I am not opposing the idea, your principle is not followed by
> org-element parser. We may consider changing it, but it is again a whole
> separate discussion where we need to consider pros and cons.

I agree this is one of the deeper discussions that we need to have
in a separate place (consolidating some of my earlier points from
the thread on intra-word markup). I'm happy to work on the changes
to org element to make this possible.

> Do not look at font-locking. You can safely consider that fontification
> is wrong in all non-trivial cases. Always check org-element-at-point and
> org-element-context.

We are in agreement here. This was more of a note for me to check
back in on the behavior because my brain thought that [fn::asdf]
could not start a line but that may not be correct.

> I am not sure if it is needed. We can already to \vert

This should be a side thread, likely started by a working
implementation. Some immediate thoughts are recorded here.

\vert breaks cases where you want the table to also be data. For
example, I wanted to create a table that had various syntactic
elements such as =|= in cells and rows and I wanted to be able to
ctrl-f for =|= in the table. \vert breaks this case and it is quite
confusing if you need the exact character for clarity in developer
documentation.
Here is an example of the table and me trying with macros to work
around the issue:
https://github.com/tgbugs/sxpyr/blob/master/docs/sexp.org#reading-behavior

There is an additional point here which is that the restriction on
=|= has nothing to do with surface syntax at all in the elisp
implementation due to the order in which macros are resolved relative
to table elements. Clarifying how macros interact (or hopefully do not
interact) with other parts of syntax should probably be included at
some point.

> That would be welcome, but someone™ should implement timezone support in
> Elisp level. We have several discussions about this in the past.

Definitely on my list. I have the proposed extensions implemented
in laundry that I can use as a guide.

> That's not accurate. you cannot nest, say, bold inside bold. You cannot
> put code inside any other markup freely: consider *bold =asd*asd= not bold*

I think it is accurate. I've tested this fairly extensively for my
laundry implementation to match the org export behavior. Arbitrary
nesting of those 4 is supported and the other 2 can be at the bottom
of any level.

I see *bold =asd*asd= bold* for ox-html/ox-latex and for font locking.
You can also have ******bold****** and it renders the same as *bold*.
Consider these monstrosities as well:

*b /i _u +s =v /*_+lol+_*/= ~c /*_+lol+_*/~ s+ u_ i/ b*
*/_+bius+_ _+bius+_ bi/*