emacs-orgmode@gnu.org archives
* Inconsistent text markup handling when double-nesting markers
@ 2023-10-09 23:02 Tom Alexander
  2023-10-10 12:07 ` Ihor Radchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Tom Alexander @ 2023-10-09 23:02 UTC (permalink / raw)
  To: emacs-orgmode

I used the following test document:
```
__foo__

**foo**
```

I'd expect the two to behave the same but the first one parses as:
```
(paragraph
  "_"
  (subscript "foo")
  "__"
  )
```

Whereas the second parses as:
```
(paragraph
  (bold
    (bold
      "foo"
      )
    )
  )
```

This pattern occurs in Worg at [2].

Looking at the description for text markup in the syntax document[1], I don't see any reason the first wouldn't be parsed as an underline:

1. PRE: valid because it is the beginning of a line
2. MARKER: valid underscore
3. CONTENTS: valid. A series of objects from the standard set includes both subscript and text markup, so regardless of how we parse the interior, it's valid. CONTENTS also cannot begin or end with whitespace, but there is no whitespace in the CONTENTS.
4. MARKER: valid underscore
5. POST: Only valid if we extend the underline to the 2nd underscore so that it ends at the end of the line. But the 2nd line of the test document shows us that having copies of the marker inside the CONTENTS is fine, so I see two possible expected parses of the CONTENTS:
    5a. (underline "foo")
    5b. ((subscript "foo") (plain-text "_"))

I also ran the following test document to further prove that having copies of the marker inside the CONTENTS is fine:
```
*foo*bar*
```
which parses as (bold "foo*bar")

So the only way the first line would fail to parse as an underline is if the parser matched the first closing underscore as the end of the underline. That would be invalid, because underscore is not a valid POST character, and invalid copies of the closing marker are ignored, as shown by both "**foo**" and "*foo*bar*".
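
The argument above can be sketched as a small check. This is a toy model, not Org's implementation: the function name `closing_candidates` is hypothetical, and the POST set is copied from the text markup section of the syntax document [1]. It lists every candidate closing marker on a line that opens with the marker and reports whether each one satisfies the CONTENTS and POST rules.

```python
# Toy check of the closing-marker rule; NOT Org's actual implementation.
# POST characters as listed for text markup in the Org syntax document.
POST = set(" \t\n-.,;:!?')}[\"\\")


def closing_candidates(line, marker):
    """Return (index, is_valid) for each possible closing marker.

    Assumes the line opens with `marker` at index 0, so the earliest
    closing marker sits at index 2 (CONTENTS must be non-empty).
    """
    found = []
    for j in range(2, len(line)):
        if line[j] != marker:
            continue
        # CONTENTS may not end with whitespace.
        contents_ok = not line[j - 1].isspace()
        # The marker must be followed by a POST character or end of line.
        post_ok = j + 1 == len(line) or line[j + 1] in POST
        found.append((j, contents_ok and post_ok))
    return found


print(closing_candidates("__foo__", "_"))    # [(5, False), (6, True)]
print(closing_candidates("**foo**", "*"))    # [(5, False), (6, True)]
print(closing_candidates("*foo*bar*", "*"))  # [(4, False), (8, True)]
```

Under these rules the underscore and asterisk cases come out the same: the first candidate is rejected and only the last marker on the line can close the markup.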


[1] https://orgmode.org/worg/org-syntax.html#Emphasis_Markers
[2] https://git.sr.ht/~bzg/worg/tree/ba6cda890f200d428a5d68e819eef15b5306055f/org-contrib/babel/intro.org#L117

--
Tom Alexander
pgp: https://fizz.buzz/pgp.asc



* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-09 23:02 Inconsistent text markup handling when double-nesting markers Tom Alexander
@ 2023-10-10 12:07 ` Ihor Radchenko
  2023-10-11  2:23   ` Max Nikulin
  0 siblings, 1 reply; 9+ messages in thread
From: Ihor Radchenko @ 2023-10-10 12:07 UTC (permalink / raw)
  To: Tom Alexander; +Cc: emacs-orgmode

"Tom Alexander" <tom@fizz.buzz> writes:

> I used the following test document:
> ```
> __foo__
>
> **foo**
> ```
>
> I'd expect the two to behave the same but the first one parses as:
> ```
> (paragraph
>   "_"
>   (subscript "foo")
>   "__"
>   )
> ```

Fixed, on main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=fe23bec60

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-10 12:07 ` Ihor Radchenko
@ 2023-10-11  2:23   ` Max Nikulin
  2023-10-11  9:15     ` Ihor Radchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Max Nikulin @ 2023-10-11  2:23 UTC (permalink / raw)
  To: Tom Alexander; +Cc: emacs-orgmode

On 10/10/2023 19:07, Ihor Radchenko wrote:
> "Tom Alexander" writes:
> 
>> I used the following test document:
>> ```
>> __foo__
>>
>> **foo**
>> ```
> 
> Fixed, on main.
> https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=fe23bec60

Isn't nested bold for "**bold**" a bug? Generally it is not allowed and

      *b1 *b2* b3*

is parsed as bold only for "b1 *b2".





* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-11  2:23   ` Max Nikulin
@ 2023-10-11  9:15     ` Ihor Radchenko
  2023-10-11 12:16       ` Max Nikulin
  0 siblings, 1 reply; 9+ messages in thread
From: Ihor Radchenko @ 2023-10-11  9:15 UTC (permalink / raw)
  To: Max Nikulin; +Cc: Tom Alexander, emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

> Isn't nested bold for "**bold**" a bug? Generally it is not allowed and
>
>       *b1 *b2* b3*
>
> is parsed as bold only for "b1 *b2".

No, **bold** is not a bug. The parser is recursive with inner markup
not "seeing" its parent. So, we first parse the outer bold and then
continue parsing the contents separately, as *bold*.

Were it otherwise, /*bold italic*/ would also not be allowed, as we
require bol, whitespace, -, (, {, ', or " before the markup:
https://orgmode.org/worg/org-syntax.html#Emphasis_Markers




* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-11  9:15     ` Ihor Radchenko
@ 2023-10-11 12:16       ` Max Nikulin
  2023-10-11 12:26         ` Ihor Radchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Max Nikulin @ 2023-10-11 12:16 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Tom Alexander, emacs-orgmode

On 11/10/2023 16:15, Ihor Radchenko wrote:
> Max Nikulin <manikulin@gmail.com> writes:
> 
>> Isn't nested bold for "**bold**" a bug? Generally it is not allowed and
>>
>>        *b1 *b2* b3*
>>
>> is parsed as bold only for "b1 *b2".
> 
> No, **bold** is not a bug. The parser is recursive with inner markup
> not "seeing" its parent. So, we first parse the outer bold and then
> continue parsing the contents separately, as *bold*.

I just find the following rather confusing:

(org-export-string-as "**bold**" 'html t)
"<p>\n<b><b>bold</b></b></p>\n"
(org-export-string-as "**inner* outer*" 'html t)
"<p>\n<b>*inner</b> outer*</p>\n"
(org-export-string-as "*outer *inner**" 'html t)
"<p>\n<b>outer <b>inner</b></b></p>\n"
(org-export-string-as "*begin *inner* end*" 'html t)
"<p>\n<b>begin *inner</b> end*</p>\n"

> Were it otherwise, /*bold italic*/ would also not be allowed, as we
> require bol, whitespace, -, (, {, ', or " before the markup:
> https://orgmode.org/worg/org-syntax.html#Emphasis_Markers

Certainly /*b*/ should work, but nested bold was a surprise to me. I 
believed that nesting is strictly prohibited. The case of underscores is 
even more tricky due to ambiguity of underline and subscript.

P.S. Juan Manuel discovered at some point that pandoc allows nesting
for *b1 *b2* b3*.





* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-11 12:16       ` Max Nikulin
@ 2023-10-11 12:26         ` Ihor Radchenko
  2023-10-11 14:40           ` Tom Alexander
  2023-10-12 10:23           ` Max Nikulin
  0 siblings, 2 replies; 9+ messages in thread
From: Ihor Radchenko @ 2023-10-11 12:26 UTC (permalink / raw)
  To: Max Nikulin; +Cc: Tom Alexander, emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

>> No, **bold** is not a bug. The parser is recursive with inner markup
>> not "seeing" its parent. So, we first parse the outer bold and then
>> continue parsing the contents separately, as *bold*.
>
> I just find the following rather confusing:
>
> (org-export-string-as "**bold**" 'html t)
> "<p>\n<b><b>bold</b></b></p>\n"
> (org-export-string-as "**inner* outer*" 'html t)
> "<p>\n<b>*inner</b> outer*</p>\n"
> (org-export-string-as "*outer *inner**" 'html t)
> "<p>\n<b>outer <b>inner</b></b></p>\n"
> (org-export-string-as "*begin *inner* end*" 'html t)
> "<p>\n<b>begin *inner</b> end*</p>\n"

Maybe. It is indeed one of the edge cases. But it follows the parser
logic, which is: (1) the first matching markup is parsed; (2) the
recursive parsing of its contents is isolated.
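
The two rules can be illustrated with a toy model. This is only a sketch, not the code in org-element.el: the names `find_close` and `parse` are hypothetical, only the `*` and `/` markers are handled, and the PRE and POST sets are taken from the syntax document. The first valid closing marker wins, and the contents are then parsed as a separate string that cannot see the surrounding text.

```python
# Toy model of the two rules above; NOT the real Org parser.
PRE = set(" \t\n-({'\"")            # allowed before an opening marker
POST = set(" \t\n-.,;:!?')}[\"\\")  # allowed after a closing marker
MARKERS = {"*": "bold", "/": "italic"}


def find_close(text, i):
    """Index of the first valid closing marker for the opener at i, or None."""
    marker = text[i]
    if i > 0 and text[i - 1] not in PRE:
        return None  # PRE rule; start of string counts as beginning of line
    if i + 1 >= len(text) or text[i + 1].isspace():
        return None  # CONTENTS may not be empty or begin with whitespace
    for j in range(i + 2, len(text)):
        at_end = j + 1 == len(text)
        if (text[j] == marker and not text[j - 1].isspace()
                and (at_end or text[j + 1] in POST)):
            return j  # rule (1): the first match wins
    return None


def parse(text):
    """Return a list of plain strings and (type, children) tuples."""
    nodes, plain_start, i = [], 0, 0
    while i < len(text):
        close = find_close(text, i) if text[i] in MARKERS else None
        if close is None:
            i += 1
            continue
        if plain_start < i:
            nodes.append(text[plain_start:i])
        # Rule (2): the contents are parsed in isolation from the parent.
        nodes.append((MARKERS[text[i]], parse(text[i + 1:close])))
        i = plain_start = close + 1
    if plain_start < len(text):
        nodes.append(text[plain_start:])
    return nodes


for sample in ("**bold**", "**inner* outer*",
               "*outer *inner**", "*begin *inner* end*"):
    print(repr(sample), "->", parse(sample))
```

Under these assumptions the sketch yields the same nesting as the four HTML exports quoted above; for example, "*begin *inner* end*" becomes a bold node containing "begin *inner" followed by the plain text " end*". Because each recursive call starts at the beginning of its own string, "/*bold italic*/" also parses as bold inside italic.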

>> Were it otherwise, /*bold italic*/ would also not be allowed, as we
>> require bol, whitespace, -, (, {, ', or " before the markup:
>> https://orgmode.org/worg/org-syntax.html#Emphasis_Markers
>
> Certainly /*b*/ should work, but nested bold was a surprise to me. I 
> believed that nesting is strictly prohibited. The case of underscores is 
> even more tricky due to ambiguity of underline and subscript.

Nesting is not prohibited on purpose. It is just a consequence of
how the parser works that nesting <end> constructs is almost impossible,
except certain edge cases like **b**.

> P.S. Juan Manuel discovered at some point that pandoc allows nesting
> for *b1 *b2* b3*.

Which is a bug in pandoc.

I think we discussed this topic a number of times in the past - our
markup is a compromise between simplicity for users and simplicity of
the parser. This works in many simple cases, but edge cases become
problematic.

Workarounds have been discussed as well. For example, creole markup and
generic inline markup constructs (your idea with direct AST and the idea
with inline special blocks).




* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-11 12:26         ` Ihor Radchenko
@ 2023-10-11 14:40           ` Tom Alexander
  2023-10-12 10:23           ` Max Nikulin
  1 sibling, 0 replies; 9+ messages in thread
From: Tom Alexander @ 2023-10-11 14:40 UTC (permalink / raw)
  To: Ihor Radchenko, Max Nikulin; +Cc: emacs-orgmode

> Fixed, on main.

Thanks!




* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-11 12:26         ` Ihor Radchenko
  2023-10-11 14:40           ` Tom Alexander
@ 2023-10-12 10:23           ` Max Nikulin
  2023-10-12 12:04             ` Ihor Radchenko
  1 sibling, 1 reply; 9+ messages in thread
From: Max Nikulin @ 2023-10-12 10:23 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Tom Alexander, emacs-orgmode

On 11/10/2023 19:26, Ihor Radchenko wrote:
> Max Nikulin writes:
> 
>> P.S. Juan Manuel discovered at some point that pandoc allows nesting
>> for *b1 *b2* b3*.
> 
> Which is a bug in pandoc.
> 
> I think we discussed this topic a number of times in the past - our
> markup is a compromise between simplicity for users and simplicity of
> the parser. This works in many simple cases, but edge cases become
> problematic.

I have no intention of reopening discussions about changing the patterns
that recognize the beginning and end of objects, or about extending the
syntax.

My guess is that pandoc may use a bottom-up rather than a top-down
approach. I admit my opinion may be biased by reading complaints about
unexpected behavior of the current implementation. Perhaps, besides its
advantages, the pandoc parser has downsides. I would not be surprised if
a bottom-up parser is unbearable to write without some tool that
generates code from the provided rules.

By the way, is it explicitly specified that, within an element, a
top-down strategy specifically must be used to recognize objects?



* Re: Inconsistent text markup handling when double-nesting markers
  2023-10-12 10:23           ` Max Nikulin
@ 2023-10-12 12:04             ` Ihor Radchenko
  0 siblings, 0 replies; 9+ messages in thread
From: Ihor Radchenko @ 2023-10-12 12:04 UTC (permalink / raw)
  To: Max Nikulin; +Cc: Tom Alexander, emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

> By the way, is it explicitly specified that, within an element, a
> top-down strategy specifically must be used to recognize objects?

https://orgmode.org/worg/org-syntax.html has it, I think.


