From: Ihor Radchenko
To: Tom Gillespie
Cc: org-mode-email, Nicolas Goaziou, Timothy
Subject: Re: Org Syntax Specification
Date: Wed, 19 Jan 2022 19:58:59 +0800
Message-ID: <87bl08kkcc.fsf@localhost>
References: <871r1g936z.fsf@gmail.com> <87r195nt2g.fsf@localhost>
List-Id: "General discussions about Org-mode."

Tom Gillespie writes:

> 3. When I say grammar in this context I mean specifically an eBNF that
> generates a LALR(1) or LR(1) parser. This is narrower than the
> definition used in the document, which includes things that have to
> be implemented in the tokenizer, or in a pass after the grammar has
> been applied, or are related to some other aspect beyond the pure
> surface syntax.

I feel that we should not be trying to fit into LR at the expense of
complicating the document. When looking at earlier versions of the
grammar, I mostly had GLR in mind.

> In my thinking I separate the context sensitive nature of parsing from
> the nesting structure of the resulting sexpressions, org elements,
> etc. The most obvious example of this is that the sexpression
> representation for headings nests based on the level of the heading,
> but heading level cannot be determined by the grammar so it must be
> reconstructed from a flat sequence of headings that have varying level.

1. I think that the resulting sexpression is important to describe. We
   eventually plan to provide a reference test set to verify external
   parsers against org-element.el [1]. It is important to describe the
   nesting with this consideration.

2. You can actually determine the end of a heading if you are allowed
   to do lookaheads (which are necessary anyway to parse
   #+begin_blah..#+end_blah). The end of the current heading is
   "eof|^\*{,N-current-heading} "

[2] https://list.orgmode.org/spmq6a$2s5$1@ciao.gmane.io/T/#t

> ... I think the
> other issue I was having here is that the spec for tables is spread
> all over the place, and it would be much easier to understand and
> implement if it were all in one place.

That sounds fine to me. Though your next suggestion appears to be the
exact opposite:

> I think your version is quite a bit more readable. Can we list the
> set of all the elements that can be ended by a new line as well as
> those that cannot (iirc they are elements such as footnotes that can
> only be ended by a double blank line or a heading)?

The intention behind listing the exceptions for table cells was exactly
the same as your thinking about open-ended elements.

>> I am not sure here. Inline tasks are special because a one-line inline
>> task must not contain any text below, cannot have planning or
>> properties.
>
> Then they are no longer inline tasks, but instead parse as headings, correct?

They are still inline tasks. Consider the example below:

* Normal heading
Paragraph
************************************************** Inline task
SCHEDULED: <2022-01-19> <- this is an ordinary paragraph, not a part of inline task
Continuing "SCHEDULED" paragraph, not a part of inline task
* Next heading

The parsed sexp will be

(heading (paragraph) (inlinetask) (paragraph)) (heading)

>> If we mention this, we also need to elaborate what kind of element
>> #+todo: is, where it can be located, and how to parse multiple
>> instances of #+todo in the document.
>
> Yes. What I have written for laundry is that only #+todo: declarations
> that appear in the zeroth section will be applied (this is true for
> all document level configuration keywords). There is also a
> possibility that we might be able to support including #+todo:
> keywords (and #+link: definitions or similar) in further sections, but
> that they would only apply to headings that occur after that line in
> the file. Such behavior is likely to be confusing to users so probably
> best to only guarantee correct behavior if they are put in the zeroth
> section.
>
> The reason it is confusing/problematic is that there could be
> a #+todo: buried half way down a file, the buffer configuration is
> updated, and then a user can use keywords up the file in the elisp
> implementation. Another implementation that parses a file
> incrementally would not encounter the buried #+todo: keyword until
> after they have already emitted a heading, changing how a heading is
> parsed. There is a similar issue with the #+link: keyword.

That's why it was initially not included in the syntax document. If we
fall into this rabbit hole, we also need to describe things like
CATEGORY, PROPERTY, OPTIONS, PRIORITIES, SEQ_TODO, STARTUP, TYP_TODO,
etc.

>> > +All content following a heading that appears before the next heading
>> > +(regardless of the level of that next heading) is a section.
>>
>> Note that it is not true for one-line inline tasks.
>
> I'm not quite sure which part you are referring to here.

I only left the relevant part this time. Also, see the example above.
The inline task consists of a single line only. Nothing below it is a
part of it.

> Let's look into how much work it will be and how disruptive it might
> be? We are already changing to heading in the elisp so maybe now
> would be a good time to also change from section to segment?
> Alternatively we could start by updating the documentation and include
> a note that segments are currently called sections by org element?

Let's continue this in the new thread dedicated to renaming
section->segment.

> I've since come around on this. I think that we can make it consistent
> by thinking of the zeroth section as an invisible heading with zero
> asterisks at the start of a file. This is extremely useful for making
> org-transclusion work transparently with whole files. The only
> modification that I might suggest in the context of org-transclusion
> would be to disallow empty lines before the property drawer. This
> allows files to represent single sections (segments) which might be
> very useful for implementations that want to store sections in a
> database or something like that.

Again, let's move this to a separate thread.

>> I generally support this idea. Handling keywords in org-element is not
>> pretty. Having them in the parse tree would make things easier. However,
>> we again need to consider back-compatibility. I can imagine third-party
>> ox-* packages breaking if we make this change - we should double check
>> if we decide to change this.
>
> I'm happy to put in the time to submit code fixes for consumers of the
> API so we can make this change. I have usually limited my thinking
> about compatibility concerns to the document syntax and semantics but
> this made me realize that in terms of actual labor the API consumers
> are likely to be affected as well.

This is not as easy as just submitting patches... Anyway, let's move
this to a separate thread.

>> Yes, it is saner. However, our syntax document is supposed to be a
>> human-readable description of what org-element does. We cannot introduce
>> differences between grammar document and de-facto parser implementation.
>> This will defeat the purpose of providing reference syntax - we will get
>> inconsistency between Emacs Org mode and external parsers.
>
> To achieve this can we have an implementation note for org element
> specifically? There shouldn't be any divergence between
> implementations if we get the abstract variant of this specified
> correctly, where correctly means "exactly matches org-element
> behavior."

If you refer to restructuring the syntax document without introducing
divergence with org-element, I am fine with such improvements. We
already tried something somewhat similar by referring to Elisp
variables in some cases.

> Another note that I think this difference is arising because I'm using
> a narrower definition for what counts as syntax while still wanting to
> specify that the resulting transformed ast should be the same.
>
> I think it could make the document more useful if we have examples of
> how to get to the same endpoint with slightly different decisions
> about surface syntax.

Sounds reasonable. The only thing I fear is making the document too
long. Of course, we can always put things in appendices if necessary.

> One final note here is that part of my objective in this was to
> simplify the org-element implementation while opening the possibility
> for user defined keyword behavior.

I am not sure what you refer to.

>> Both :END: and :end: are supported by Org parser. What do you mean by
>> legacy?
>
> I seem to recall a statement that things like #+BEGIN_SRC and friends
> being retained for legacy support. This is also related to a
> standardization conversation which we aren't quite ready to have,
> which is that for things like :end: and :END: the lowercase version is
> the "canonical" representation when normalizing a document (related to
> being able to specify levels of conformance for an org parser, namely
> that there is a level that would only accept fully normalized
> documents that i.e. use :end: and not :END:). The elisp implementation
> of course supports :END:, but I don't recall whether it falls into the
> same category as #+BEGIN_SRC being on legacy support and #+begin_src
> being the preferred version.

AFAIK, org-element is case-insensitive by default. The majority of
discussions related to this topic revolve around the case of
auto-inserted Org elements.

>> I disagree. inlinetasks are a part of syntax de facto and they can be
>> encountered in Org documents in the wild. If you treat inlinetasks as
>> ordinary headings, things may be broken unpredictably during parsing.
>
> This comment in particular was about whether we talk about things
> beyond the surface syntax in this document and/or whether we move them
> to a section on semantics and transformations that are deeper than the
> surface syntax. I'm fine to keep this section in the document, but we
> should make it clear that it is not part of the surface syntax (this
> is also related to my question about property drawers and planning
> following an inline task being parsed as a heading above).

I am afraid that I cannot clearly understand what you refer to when
saying surface syntax vs. semantics. However, inlinetasks are different
from headlines, despite being sufficiently similar to create confusion.
Probably Org is too good at supporting inlinetasks and headings as if
they are the same.

> I'm using the term syntax very narrowly here to refer specifically to
> the pure surface syntax. Inline tasks don't introduce any novel
> restrictions on syntax so they don't have to be implemented as part of
> the surface syntax, they are a reinterpretation of headings and
> otherwise follow all the usual rules such as not allowing new headings
> inside them etc.

As I mentioned earlier, inlinetasks do not always include everything
until the next heading/inlinetask as their section.

> The reason I bring this up is because when implementing an org parser
> we would like to communicate to developers which parts of this
> document should be implemented directly in the parser and which ones
> should be deferred to a later step. Inlinetasks are a good example of
> this because they are entirely consistent with regular old org syntax
> for headings, and can be implemented as a transformation on the ast
> for headings that have a level that is deeper than the inlinetask min
> level.

I am not sure what later step you are referring to.

> Said another way, we want to communicate that trying to introduce a
> node in an eBNF grammar for inline tasks is not a good idea because it
> makes org syntax extremely non-regular and breaks countless use cases
> that need nesting of headings beyond the inlinetask min level.

Do you mean that you imagine the first parsing step to be an eBNF
grammar? Why so?

>> Could you elaborate why grammars cannot track the indentation level?
>> AFAIU, if it were the case, python would not be parseable.
>
> Python maintains a separate stack for handling leading whitespace.
> https://docs.python.org/3/reference/lexical_analysis.html#indentation
> Thus it is effectively tracked as part of the tokenizer which goes on
> to emit the indent and dedent tokens. However Org cannot take this
> approach because it allows much more permissive use of leading
> whitespace and in plain lists deals with a minimum deindent relative
> to the bullet which may itself be arbitrarily indented. I think I
> might be able to implement a stack that could track deindents like that
> in the tokenizer but I'm not 100% sure.
>
> Regardless, my (perhaps overly technical point) is that it is not
> something that can be done in the grammar, it must be done in the
> tokenizer, and the tokenizer would have to emit a control token that
> maps to the space between two characters in order for the deindent to
> be usable by the grammar.

AFAIK, the tokenizer is just a part of the parser. It may or may not be
separate from the grammar. AFAIU, lookahead grammars can be imagined as
using a tokenizer under the hood.

>> Yet, it is exactly what happens in Org. malformed property drawers will
>> become ordinary drawers.
>
> Yes, but ideally a property drawer would only be defined by its
> location in a document and the use of :properties: to start the drawer
> rather than also be defined by the well-formedness of its
> contents. This would mean that we would have regular drawers, property
> drawers, and malformed property drawers that were recognizable by the
> parser. I have a sense that org-lint may already be doing this?

Org syntax is permissive. It can always be parsed without errors.
org-lint merely catches common mistakes. I view org-lint as an addition
to the grammar. Making the linter a part of the grammar will complicate
things even more than what we have now.

>> How would you define entities object then? First/second pass is an
>> implementation detail. Our current description follows how org-element
>> handles entities.
>
> At the level of the syntax there is no pure entity object. At the
> level of semantics (deeper pass) there is. My objective here is to
> create a syntax that is invariant to a long and changeable list of
> entities. Imagine that a user wants to add a new custom entity, they
> need to be able to do that without changing org syntax and in the
> laundry case having to recompile the whole parser.
>
> One way that I think about the distinction is that the syntax is the
> subset of things that you cannot change at runtime. Of course in emacs
> you can change almost everything at runtime so by convention we have
> to pick which things we declare to be part of an immutable concrete
> syntax.
>
> With that context, the way I would define entities is as
> entity-fragment objects where the name is contained in the entities
> list. Note that this could lead to a slight change of interpretation
> for something like \alpha[] which needs to be explored. I did some
> experiments with it but don't remember the results.

AFAIK, the current version of the syntax document is trying hard to
restrict itself to a fixed grammar that does not change at runtime.
That's why we provide default values of runtime-customizable variables.
Generalising the entity syntax will require changes in the org-element
parser and would be better discussed in a separate thread.
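
For reference, the two-pass idea quoted above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not how
org-element or laundry work: the ENTITIES table is a tiny hypothetical
stand-in for the real entity list, the names tokenize and resolve are
made up, and the sketch ignores trailing {} and the other post-conditions
of real entities.

```python
import re

# Hypothetical, tiny stand-in for the (long, user-extensible) entity list.
ENTITIES = {"alpha": "α", "vert": "|"}

# First-pass pattern: any backslash followed by letters is a candidate.
FRAGMENT = re.compile(r"\\([A-Za-z]+)")


def tokenize(text):
    """First pass: emit ("text", s) and ("entity-fragment", name) tokens.

    This pass never consults the entity list, so it stays fixed even
    when the list changes.
    """
    tokens, pos = [], 0
    for match in FRAGMENT.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("entity-fragment", match.group(1)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def resolve(tokens, entities):
    """Second pass: promote fragments whose name is in the entity table."""
    resolved = []
    for kind, value in tokens:
        if kind == "entity-fragment" and value in entities:
            resolved.append(("entity", value))
        elif kind == "entity-fragment":
            # Unknown name: fall back to the literal source text.
            resolved.append(("text", "\\" + value))
        else:
            resolved.append((kind, value))
    return resolved
```

With this split, adding a custom entity only changes the table passed to
resolve; the first pass is untouched. Whether that is acceptable for
org-element is exactly the question for the separate thread.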

>> I am not sure if it is needed. We can already do \vert
>
> This should be a side thread, likely started by a working
> implementation. Some immediate thoughts are recorded here.
>
> \vert breaks cases where you want the table to also be data, for
> example I wanted to create a table that had various syntactic elements
> such as =|= in cells and rows and I wanted to be able to ctrl-f for
> =|= in the table. \vert breaks this case and it is quite confusing if
> you need the exact character for clarity in developer
> documentation. Here is an example of the table and me trying with
> macros to work around the issue
> https://github.com/tgbugs/sxpyr/blob/master/docs/sexp.org#reading-behavior
>
> There is an additional point here which is that the restriction on =|=
> has nothing to do with surface syntax at all in the elisp
> implementation due to the order in which macros are resolved relative
> to table elements. Clarifying how macros interact (or hopefully do not
> interact) with other parts of syntax should probably be included at
> some point.

Sounds reasonable, and it is also not covered by our escaping
mechanisms in Org. So, let's discuss it in a separate thread.

>> That's not accurate. you cannot nest, say, bold inside bold. You cannot
>> put code inside any other markup freely: consider *bold =asd*asd= not bold*
>
> I think it is accurate. I've tested this fairly extensively for my
> laundry implementation to match the org export behavior. Arbitrary
> nesting of those 4 is supported and the other 2 can be at the bottom
> of any level.
>
> I see *bold =asd*asd= bold* for ox-html/ox-latex and for font locking.

Sorry, my example was wrong. I was referring to
*bold =asd* asd= bold*

> You can also have ******bold****** and it renders the same as *bold*.

Yes. The key word is "renders". The actual bold object has all the
inner * chars.
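
The three examples above can be reproduced with a simplified model of
the PRE MARKER BODY MARKER POST shape of emphasis. This is a sketch
under assumptions, not the org-element code: the PRE and POST character
sets below are illustrative approximations, only bold is handled, and
bold_contents is a made-up helper name.

```python
import re

# Hypothetical, simplified PRE and POST character sets; the authoritative
# ones are in the syntax document and in org-element.el.
PRE = r"""(?:^|(?<=[-\s('"{]))"""
POST = r"""(?=[-\s.,:!?;'")}\\\[]|$)"""

# Opening "*", a body that starts and ends with non-whitespace, then the
# first closing "*" that is followed by a POST character or end of text.
BOLD = re.compile(PRE + r"\*(\S|\S.*?\S)\*" + POST)


def bold_contents(text):
    """Return the contents of the first bold-like match, or None."""
    match = BOLD.search(text)
    return match.group(1) if match else None


# The outer markers delimit the object; the inner stars are contents.
print(bold_contents("******bold******"))
# The "*" after "asd" is followed by a space, so bold ends there.
print(bold_contents("*bold =asd* asd= bold*"))
# Here the inner "*" is followed by "a", so bold runs to the end.
print(bold_contents("*bold =asd*asd= bold*"))
```

In this model the verbatim markers offer no protection: the closing
marker is simply the first "*" that satisfies the border conditions.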

> Consider these monstrosities as well:
> *b /i _u +s =v /*_+lol+_*/= ~c /*_+lol+_*/~ s+ u_ i/ b*
> */_+bius+_ _+bius+_ bi/*

To clarify, Org does support emphasis nesting as long as the emphasis
objects do not intersect and as long as the same type of emphasis is
not nested inside itself.

Best,
Ihor
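
P.S. The two nesting conditions stated above can be written down as a
small check over already-located emphasis spans. This is a minimal
sketch, assuming spans are given as hypothetical (kind, start, end)
tuples with an exclusive end; it says nothing about how the spans are
found in the first place, and violations is a made-up name.

```python
def violations(spans):
    """Check emphasis spans against the two nesting conditions.

    spans: iterable of (kind, start, end) tuples, end exclusive.
    Returns a list of human-readable problems (empty if none).
    """
    problems = []
    stack = []  # currently open enclosing spans, outermost first
    # Visit outer spans before the spans they contain.
    for kind, start, end in sorted(spans, key=lambda s: (s[1], -s[2])):
        # Drop enclosing spans that ended before this one starts.
        while stack and stack[-1][2] <= start:
            stack.pop()
        if stack and end > stack[-1][2]:
            # Starts inside the enclosing span but ends outside it.
            problems.append(f"{kind} intersects {stack[-1][0]}")
            continue
        if any(outer_kind == kind for outer_kind, _, _ in stack):
            problems.append(f"{kind} nested inside {kind}")
            continue
        stack.append((kind, start, end))
    return problems


# Properly nested, all of different kinds: no problems reported.
print(violations([("bold", 0, 16), ("italic", 3, 13), ("underline", 6, 9)]))
# Partial overlap between two spans.
print(violations([("bold", 0, 10), ("italic", 5, 15)]))
# Same kind nested inside itself.
print(violations([("bold", 0, 10), ("bold", 2, 8)]))
```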