emacs-orgmode@gnu.org archives
From: Ihor Radchenko <yantar92@posteo.net>
To: Tom Gillespie <tgbugs@gmail.com>
Cc: Max Nikulin <manikulin@gmail.com>,
	emacs-orgmode@gnu.org, Timothy <orgmode@tec.tecosaur.net>,
	Bastien <bzg@gnu.org>
Subject: Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)
Date: Tue, 18 Jul 2023 05:07:19 +0000
Message-ID: <87ttu13j08.fsf@localhost>
In-Reply-To: <CA+G3_POeBL-nUUy2iZf6_+MJ7fcKAZZMBqM_nQFvrupr5mXD7g@mail.gmail.com>

Tom Gillespie <tgbugs@gmail.com> writes:

> The way I have implemented this is by maintaining an explicit list of
> characters that are safe for pre markup and another for post markup.
>
> It is not possible to use unicode punctuation for this because there
> are a variety of punctuation marks that cannot appear in that position
> and be considered markup, those include @, #, % to name just a few.

It is not that bad.
The Unicode standard defines the following general categories (I list
those that might be of use):

Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph

We currently use the following:
PRE  = <bol> <space> - ( ' " {
POST = <space> - . ; : ! ? ' " ) } \ [

At least ( and { already have a matching category:
(get-char-code-property ?{ 'general-category) ;=> Ps (punctuation, open)
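For readers without Emacs at hand, the same lookup can be sketched in
Python with the standard unicodedata module. This is only an
illustration, assuming Python's bundled Unicode database agrees with
Emacs for these ASCII characters; the names PRE_CHARS, POST_CHARS and
categories are made up for the sketch. It tabulates the general category
of every character in the current PRE and POST sets, plus the @ # %
mentioned in the quoted text:

```python
import unicodedata

# Characters from the current PRE and POST sets quoted in this message
# (<bol> is a position, not a character, so it is left out).
PRE_CHARS = " -('\"{"
POST_CHARS = " -.;:!?'\")}\\["


def categories(chars):
    """Map each character to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}


pre = categories(PRE_CHARS)
post = categories(POST_CHARS)
excluded = categories("@#%")  # characters named in the quoted objection

for label, table in (("PRE", pre), ("POST", post), ("excluded", excluded)):
    print(label, " ".join(f"{ch!r}={cat}" for ch, cat in table.items()))
```

The quotes and . ; : ! ? \ all report Po, the same category as @ # %,
so Po cannot simply be added wholesale. Note also that [ in POST
reports Ps rather than Pe.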

We could probably generalize to
PRE  = Zs Zl Pc Pd Ps Pi ' "
POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [

Though we need to take care to exclude zero-width spaces.
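
A minimal sketch of how the generalized sets could be tested, again in
Python with unicodedata rather than in Emacs Lisp. The helpers is_pre
and is_post are hypothetical names for this illustration, not Org's
implementation, and <bol> handling is left out:

```python
import unicodedata

# Category sets from the proposal above; the explicit ASCII characters
# are the ones that those categories do not cover.
PRE_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Ps", "Pi"}
PRE_EXTRA = set("'\"")
POST_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Pe", "Pf"}
POST_EXTRA = set(".;:!?'\"\\[")


def is_pre(ch):
    """True if CH may precede an opening marker (hypothetical helper)."""
    return ch in PRE_EXTRA or unicodedata.category(ch) in PRE_CATEGORIES


def is_post(ch):
    """True if CH may follow a closing marker (hypothetical helper)."""
    return ch in POST_EXTRA or unicodedata.category(ch) in POST_CATEGORIES


# A few probes: guillemets, CJK brackets, underscore, at-sign,
# ideographic full stop, zero-width space.
for ch in "\u00ab\u00bb\u300c\u300d_@\u3002\u200b":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_pre(ch), is_post(ch))
```

Under these sets the underscore (Pc) becomes a valid boundary, which is
the behavior change raised in the quoted text below. The ideographic
full stop U+3002 reports Po and is not matched, which is where
Terminal_Punctuation would help. In Python's database U+200B reports Cf,
so the category test alone does not match it, but other invisible
characters may still deserve a check.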

I found https://www.unicode.org/review/pr-23.html, which defines
terminal punctuation like .;:!?
It looks like it has been adopted, via dedicated properties:
https://www.unicode.org/reports/tr44/#STerm and
https://www.unicode.org/reports/tr44/#Terminal_Punctuation

Emacs does not support them though (yet?).

> Therefore, if we want to do this we commit to extending and then
> maintaining the lists of valid pre and post markup delimiters as
> special cases.

We certainly do not want to do this. It is out of scope for Org when
the Unicode categories can do the job.

> Note also this could produce changes from current behavior because
> things that previously tokenized as a series of words connected by
> e.g. underscores could become markup.

Indeed. And we should study the feedback.
However, most scenarios that will change involve non-standard
Unicode markup characters. The odds are low that users will use such
Unicode at a markup boundary and _also expect markup to be ignored_. In
the end, it is the current ASCII limitation plus a partially arbitrary
choice of boundaries that keeps some users confused (we are getting bug
reports about confusing markup from time to time).

Of course, we can, as usual, provide a linter to catch such scenarios
and announce the change in ORG-NEWS.

I do believe that better Unicode support will benefit many Org users
who use non-Latin scripts.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

