emacs-orgmode@gnu.org archives
* [BUG] Mark-up handling chokes on unicode whitespace
From: Tobias Getzner @ 2014-09-23 12:44 UTC
  To: emacs-orgmode

When mark-up such as =monospace=, /italic/, etc. is preceded by 
non-ASCII whitespace, e. g., «narrow no-break space» (U+202F) or «no-break 
space» (U+00A0), org-mode will not recognize the mark-up 
correctly; i. e., the content will not be syntax-highlighted, and 
the mark-up syntax will be exported verbatim by the exporter.

Best regards,
Tobias


* Re: [BUG] Mark-up handling chokes on unicode whitespace
From: Aaron Ecay @ 2014-09-23 17:03 UTC
  To: Tobias Getzner, emacs-orgmode

Hi Tobias,

On 23 September 2014, Tobias Getzner wrote:
> 
> When mark-up such as =monospace=, /italic/, etc. is preceded by 
> non-ASCII whitespace, e. g., «narrow no-break space» (U+202F) or «no-break 
> space» (U+00A0), org-mode will not recognize the mark-up 
> correctly; i. e., the content will not be syntax-highlighted, and 
> the mark-up syntax will be exported verbatim by the exporter.

You will need to change the variable org-emphasis-regexp-components; see
the documentation thereof.
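
For readers looking for a starting point, here is a minimal sketch of such a change. It assumes the layout described in the variable's docstring (a five-element list whose first two entries are the characters allowed before and after a mark-up span) and that org-set-emph-re is available to rebuild the derived regexps; both are worth checking against the installed Org version. The choice of U+00A0 and U+202F simply follows the report above.

```elisp
;; Sketch only: extend the characters Org accepts before and after emphasis
;; markers with two non-breaking spaces, leaving the other entries untouched.
;; Assumption: elements 0 and 1 are the "pre" and "post" character sets.
(with-eval-after-load 'org
  (let ((extra "\u00a0\u202f")          ; NO-BREAK SPACE, NARROW NO-BREAK SPACE
        (components (copy-sequence org-emphasis-regexp-components)))
    (setf (nth 0 components) (concat (nth 0 components) extra))
    (setf (nth 1 components) (concat (nth 1 components) extra))
    ;; Store the new value and recompute the emphasis regexps.
    (org-set-emph-re 'org-emphasis-regexp-components components)))
```

Org buffers that were already open may need org-mode re-enabled before the new regexps take effect.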

-- 
Aaron Ecay


* Re: [BUG] Mark-up handling chokes on unicode whitespace
From: Tobias Getzner @ 2014-09-23 17:44 UTC
  To: emacs-orgmode

Hello Aaron!

On Tue, 23 Sep 2014 13:03:06 -0400, Aaron Ecay wrote:

> On 23 September 2014, Tobias Getzner wrote:
>> 
>> When mark-up such as =monospace=, /italic/, etc. is preceded by
>> non-ASCII whitespace, e. g., «narrow no-break space» (U+202F) or
>> «no-break space» (U+00A0), org-mode will not recognize the mark-up
>> correctly
> 
> You will need to change the variable org-emphasis-regexp-components; see
> the documentation thereof.

Thank you very much! This seems to do it.

Might I suggest adding Unicode whitespace to the default? That variable 
seems a bit opaque, and I would probably never have discovered it on my 
own; it also appears that one has to ensure that it is set before 
org-mode is «required», and one cannot easily extend the default without 
also setting the rest. For type-setting purposes, at least the class of 
non-breaking whitespace is very useful.

At first I thought it might be easy to solve such problems cleanly by 
using the whitespace character class throughout, but to my chagrin it 
seems that at least «search-forward-regexp» will only match ASCII 
whitespace this way, so I suppose Emacs regex isn’t aware of non-ASCII 
whitespace? :'|

Best,
Tobias


* Re: [BUG] Mark-up handling chokes on unicode whitespace
From: Aaron Ecay @ 2014-09-23 18:15 UTC
  To: Tobias Getzner, emacs-orgmode

Hi Tobias,

On 23 September 2014, Tobias Getzner wrote:
> 
> Hello Aaron!
> 
> On Tue, 23 Sep 2014 13:03:06 -0400, Aaron Ecay wrote:
> 
>> You will need to change the variable org-emphasis-regexp-components; see
>> the documentation thereof.
> 
> Thank you very much! This seems to do it.
> 
> Might I suggest adding Unicode whitespace to the default? That variable 
> seems a bit opaque, and I would probably never have discovered it on my 
> own; it also appears that one has to ensure that it is set before 
> org-mode is «required», and one cannot easily extend the default without 
> also setting the rest. For type-setting purposes, at least the class of 
> non-breaking whitespace is very useful.

org-emphasis-regexp-components is known to be a wart.  You can search
for posts on the mailing list.  Some people are trying to figure out how
to get rid of it.  (You can search in particular for Nicolas Goaziou’s
posts...)  Here’s one thread where you can see the lay of the land:
<http://mid.gmane.org/87zjl6ktu2.fsf@gmail.com>.

All that to say, the longer-term solution is to figure out some radically
different approach.  In the meantime though, if you can provide a list of
characters (by unicode name and/or code point) that you think should be
added to that variable, someone might be able to add them.  (I probably
would not make such a change on my own, but would wait for feedback from
Nicolas, Bastien, or one of the other maintainer-esque figures on the
list).  On the other hand, they might say “making such a change in org’s
core is just restacking the deck chairs on the Titanic,” which would
also be a reasonable position for them to take IMO.

> 
> At first I thought it might be easy to solve such problems cleanly by 
> using the whitespace character class throughout, but to my chagrin it 
> seems that at least «search-forward-regexp» will only match ASCII 
> whitespace this way, so I suppose Emacs regex isn’t aware of non-ASCII 
> whitespace? :'|

I don’t really know anything about this...it’s unfortunate if true
though.

-- 
Aaron Ecay


* Re: [BUG] Mark-up handling chokes on Unicode white-space
From: Tobias Getzner @ 2014-09-24  7:34 UTC
  To: Aaron Ecay; +Cc: emacs-orgmode

Hi Aaron,

On Tue, 2014-09-23 at 14:15 -0400, Aaron Ecay wrote:
> org-emphasis-regexp-components is known to be a wart.  You can search
> for posts on the mailing list.  Some people are trying to figure out how
> to get rid of it.  (You can search in particular for Nicolas Goaziou’s
> posts...)  Here’s one thread where you can see the lay of the land:
> <http://mid.gmane.org/87zjl6ktu2.fsf@gmail.com>.

Thank you for the background info!

> All that to say, the longer-term solution is to figure out some radically
> different approach.  In the meantime though, if you can provide a list of
> characters (by unicode name and/or code point) that you think should be
> added to that variable, someone might be able to add them. 

I guess the straightforward way of defining white-space would be just
using the set of characters with the Unicode property WSpace=Y, and
this would be what «[:space:]», «\s», etc., should be expected to match
on Unicode-based locales. I’m supplying a list of code-points below,
for convenience.

I agree though that defining what counts as «white space» within the
confines of org-mode is putting the cart before the horse. I’ll try to
ascertain whether the Emacs implementation of «[:space:]» really only
matches ASCII spaces, and if so I’ll see whether I can poke someone on
the Emacs bug tracker about this.
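
One way to check this is sketched below. As far as the Emacs Lisp manual describes it, «[:space:]» matches characters that have whitespace syntax in the current syntax table, so the answer can differ between major modes. The snippet can be evaluated in the buffer of interest (e. g. with M-:); the sample code points are an arbitrary selection and no particular result is claimed here.

```elisp
;; Sketch: report, for a few non-ASCII spaces, whether the whitespace
;; character class and the whitespace syntax class match them in the
;; current buffer, and which syntax class the character has.
(dolist (cp '(#x00A0 #x202F #x2009 #x3000))
  (let ((s (string cp)))
    (message "U+%04X  [[:space:]]: %s  \\s-: %s  syntax class: %c"
             cp
             (if (string-match-p "[[:space:]]" s) "yes" "no")
             (if (string-match-p "\\s-" s) "yes" "no")
             (char-syntax cp))))
```

The output appears in the *Messages* buffer, one line per code point.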

Best regards,
T.


──────────────────────────────────────────────────────────────────────
List of Unicode white-space

Below is the list of characters with the property White_Space set,
taken from the Unicode 7.0.0 character database. This includes
line-breaking white-space such as «line feed». If these are not
relevant, one can use the subset of space separators (Zs; these do not
include control characters such as Tab) and control chars (Cc).

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
──────────────────────────────────────────────────────────────────────
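
If an explicit character set is wanted instead of relying on a character class, the list above can be turned into a regexp bracket expression. The following is only a sketch; the names prefixed with my/ are made up for illustration and are not part of Emacs or Org.

```elisp
;; Code points copied from the White_Space list above.
;; A cons cell (FROM . TO) stands for an inclusive range.
(defconst my/white-space-code-points
  '((#x0009 . #x000D) #x0020 #x0085 #x00A0 #x1680
    (#x2000 . #x200A) #x2028 #x2029 #x202F #x205F #x3000))

(defun my/white-space-char-class ()
  "Return a bracket expression matching one White_Space character."
  (concat "["
          (mapconcat (lambda (item)
                       (if (consp item)
                           ;; Ranges become FROM-TO inside the brackets.
                           (string (car item) ?- (cdr item))
                         (string item)))
                     my/white-space-code-points
                     "")
          "]"))

;; Non-nil when the string starts with a character from the list,
;; here NARROW NO-BREAK SPACE.
(string-match-p (my/white-space-char-class) "\u202f")
```

Dropping the line-breaking entries (U+0009..U+000D, U+0085, U+2028, U+2029) from the constant gives the narrower Zs-only variant mentioned above.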
