emacs-orgmode@gnu.org archives
* [RFC] Fixing link encoding once and for all
@ 2019-02-24  1:16 Nicolas Goaziou
  2019-02-24 23:04 ` Neil Jerram
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Nicolas Goaziou @ 2019-02-24  1:16 UTC
  To: Org Mode List

Hello,

Recently[1], issues about link escaping have resurfaced. I'd like to
solve this once and for all.

As a reminder, the initial issue is that bracket links, i.e., "[[path]]"
or "[[path][description]]", cannot contain square brackets, for obvious
reasons. Therefore, they need to be escaped somehow. For some historical
reason, the "somehow" settled, for the path part[2], on URL encoding.
Therefore [ and ] in a link must appear as, respectively, "%5B" and
"%5D". Of course, the initial link could already contain any of these
strings, so percent signs also need to be escaped, as "%25". Finally,
consecutive spaces are not handled very gracefully by the
`fill-paragraph' function, so it is also useful, but not mandatory, to
be able to escape white spaces, with "%20". It can sadly be confusing
when Org encoding is applied on top of an already encoded URI.

To sum it up, `org-link-escape', by default, URL encodes only square
brackets, percent signs and white spaces. Note that, however,
`org-link-unescape' is not its reciprocal function, despite its
docstring. It URL decodes every percent encoded combination.
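
The asymmetry described above can be illustrated with a minimal Python
sketch. It is not Org's Emacs Lisp code: `org_escape` and `full_unescape`
are hypothetical stand-ins for `org-link-escape' and `org-link-unescape',
modelled only on the behavior stated in this message.

```python
from urllib.parse import unquote

# Characters the message says `org-link-escape' encodes by default.
ORG_ESCAPES = {"%": "%25", "[": "%5B", "]": "%5D", " ": "%20"}

def org_escape(path):
    """Percent-encode only brackets, percent signs and spaces."""
    return "".join(ORG_ESCAPES.get(ch, ch) for ch in path)

def full_unescape(path):
    """Decode every percent-encoded combination, not just the four above."""
    return unquote(path)

browser_url = "https://en.wikipedia.org/wiki/Red%E2%80%93black_tree"
print(org_escape("notes [draft] 50%"))
# Decoding and re-encoding does not give the browser URL back.
print(org_escape(full_unescape(browser_url)) == browser_url)
```

Escaping then unescaping round-trips, but unescaping then escaping does
not, which is the sense in which the two functions are not reciprocal.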

Anyway, square brackets in a bracket link almost look like a solved
problem. Alas, if some links are inserted by helper functions, such as
`org-insert-link', others could have been typed right into the buffer.
Therefore, there is usually no way to know if a link is already
Org-encoded or not. Consequently, there is usually no way to know when
a link needs to be Org-decoded. This is the root of all evil, or at
least, all bugs encountered so far. Some links end up being encoded or
decoded one time too many.
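
A small Python sketch of that ambiguity, under the assumption that
encoding touches only "%", "[" and "]" (`org_escape` is a hypothetical
helper, not an Org function):

```python
from urllib.parse import unquote

def org_escape(path):
    # Encode "%" first so the bracket escapes are not re-encoded.
    return path.replace("%", "%25").replace("[", "%5B").replace("]", "%5D")

raw = "page%5B1%5D"        # a path that literally contains "%5B" and "%5D"
stored = org_escape(raw)   # what a helper would write into the buffer
once = unquote(stored)     # one decode: back to the raw path
twice = unquote(once)      # one decode too many: a different path

# The same buffer text is also what encoding "page[1]" produces, so the
# text alone does not say whether it still needs decoding.
print(org_escape("page[1]") == raw)
```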

To solve this, we must assume that every bracket link is properly
Org-encoded in a buffer. In other words, when typing, or yanking,
a bracket link right into a buffer, users are required to use %5B, %5D,
and %25 in the path part of the link, if necessary. I understand it will
bite some users, but using `org-insert-link' would mitigate the pain. It
is also limited to square brackets, which, I assume, is not the type of
link you usually yank.

With that assumption, the parser can safely Org-decode links
appropriately, and store paths in their decoded form. Consumers, like
export back-ends, need not call `org-link-unescape' anymore. In fact,
the only situation where `org-link-unescape' is still needed is when
extracting the path part of a bracket link from the buffer, e.g.,
through regexp matching.
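
As a rough sketch of "decode exactly once, in the parser", the following
Python function undoes only the Org-specific escapes in a single pass.
The set of escapes (including "%20") and the name `org_decode` are
assumptions made for illustration; the message does not specify them.

```python
import re

# Assumed set of Org-specific escapes; any other %XX is left untouched.
ORG_DECODE = {"%25": "%", "%5B": "[", "%5D": "]", "%20": " "}
_ORG_ESCAPE_RE = re.compile(r"%(?:25|5B|5D|20)")

def org_decode(path):
    """Undo Org encoding in one left-to-right pass."""
    return _ORG_ESCAPE_RE.sub(lambda m: ORG_DECODE[m.group(0)], path)

print(org_decode("file:notes%5B1%5D.org"))
print(org_decode("Red%25E2%2580%2593black_tree"))
```

Because the substitution is a single pass, "%255B" decodes to the literal
text "%5B" rather than to "[".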

Of course, the manual should mention this assumption, if we agree on it.

Thoughts?

Regards,

Footnotes: 

[1] E.g., <http://lists.gnu.org/r/emacs-orgmode/2019-02/msg00265.html>
or <http://lists.gnu.org/r/emacs-orgmode/2019-02/msg00292.html>.

[2] There is no clear mechanism for the description part.
`org-insert-link' will replace square brackets with curly ones. We could
also use entities, but none of them appears as a square bracket. Anyway,
I'll ignore this issue for the time being.

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-02-24  1:16 [RFC] Fixing link encoding once and for all Nicolas Goaziou
@ 2019-02-24 23:04 ` Neil Jerram
  2019-02-27 10:48   ` Nicolas Goaziou
  2019-02-25  8:54 ` stardiviner
  2019-02-27  8:07 ` Jens Lechtenboerger
  2 siblings, 1 reply; 21+ messages in thread
From: Neil Jerram @ 2019-02-24 23:04 UTC
  To: Org Mode List

I'm not sure how much freedom you have here, but I think it would be
both clearer - by avoiding confusion with URL-escaping - and easier to
type, to use an entirely different form of escaping in the Org syntax;
probably just this:

\[ and \] to include a square bracket in a link
\\ to include a backslash
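
A minimal Python sketch of this escaping scheme, assuming every
backslash and square bracket in the path gets escaped (the function
names are hypothetical):

```python
import re

def backslash_escape(path):
    """Prefix every backslash and square bracket with a backslash."""
    return re.sub(r"([\\\[\]])", r"\\\1", path)

def backslash_unescape(text):
    """Drop the backslash from every backslash-escaped character."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)

escaped = backslash_escape(r"c:\docs\file[1].txt")
print(escaped)
print(backslash_unescape(escaped))
```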

Regards,
    Neil


* Re: [RFC] Fixing link encoding once and for all
  2019-02-24  1:16 [RFC] Fixing link encoding once and for all Nicolas Goaziou
  2019-02-24 23:04 ` Neil Jerram
@ 2019-02-25  8:54 ` stardiviner
  2019-02-27  8:07 ` Jens Lechtenboerger
  2 siblings, 0 replies; 21+ messages in thread
From: stardiviner @ 2019-02-25  8:54 UTC
  To: emacs-orgmode


Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> To solve this, we must assume that every bracket link is properly
> Org-encoded in a buffer. In other words, when typing, or yanking,
> a bracket link right into a buffer, users are required to use %5B, %5D,
> and %25 in the path part of the link, if necessary. I understand it will
> bite some users, but using `org-insert-link' would mitigate the pain. It
> is also limited to square brackets, which, I assume, is not the type of
> link you usually yank.

I agree and upvote this. Using `org-insert-link' as the single entry point will
help unify behavior. The only inconvenience of inserting links literally arises
where the user cannot access `org-insert-link', e.g., on the web or in another
editor. But Org mode is mostly limited to Emacs anyway, so adding this
requirement does not matter much. Also, in the end, if other clients want to
support Org mode, they can insert links in encoded form and handle this properly.

WDYT?


-- 
[ stardiviner ]
       I try to make every word tell the meaning what I want to express.

       Blog: https://stardiviner.github.io/
       IRC(freenode): stardiviner, Matrix: stardiviner
       GPG: F09F650D7D674819892591401B5DF1C95AE89AC3
      

* Re: [RFC] Fixing link encoding once and for all
  2019-02-24  1:16 [RFC] Fixing link encoding once and for all Nicolas Goaziou
  2019-02-24 23:04 ` Neil Jerram
  2019-02-25  8:54 ` stardiviner
@ 2019-02-27  8:07 ` Jens Lechtenboerger
  2019-02-27 11:25   ` Nicolas Goaziou
  2 siblings, 1 reply; 21+ messages in thread
From: Jens Lechtenboerger @ 2019-02-27  8:07 UTC
  To: Org Mode List

On 2019-02-24, Nicolas Goaziou wrote:

> Recently[1], issues about link escaping have resurfaced. I'd like to
> solve this once and for all.

Good morning,

I updated to Org mode version 9.2.1 (9.2.1-33-g029cf6-elpa @
/home/user/.emacs.d/elpa/org-20190225/).

When exporting the following link to LaTeX, the decoding fails.

--8<---------------cut here---------------start------------->8---
[[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
--8<---------------cut here---------------end--------------->8---

The output is this:
--8<---------------cut here---------------start------------->8---
\href{https://en.wikipedia.org/wiki/Red\â\€\“black\_tree}{Red-black trees}
--8<---------------cut here---------------end--------------->8---

Previously, I got:
--8<---------------cut here---------------start------------->8---
\href{https://en.wikipedia.org/wiki/Red\%E2\%80\%93black\_tree}{Red-black trees}
--8<---------------cut here---------------end--------------->8---
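
The garbled characters in the new output are consistent with the three
bytes E2 80 93 being decoded one byte at a time instead of as a single
UTF-8 sequence. The Python sketch below reproduces that effect; the
choice of cp1252 is an assumption picked because it yields the same
characters, not something established in this thread.

```python
from urllib.parse import unquote

path = "Red%E2%80%93black_tree"

# Bytes E2 80 93 read together as UTF-8: one en dash.
as_utf8 = unquote(path)
# The same bytes read as three single-byte characters (assumed cp1252).
as_cp1252 = unquote(path, encoding="cp1252")

print(as_utf8)
print(as_cp1252)
```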

Best wishes
Jens

* Re: [RFC] Fixing link encoding once and for all
  2019-02-24 23:04 ` Neil Jerram
@ 2019-02-27 10:48   ` Nicolas Goaziou
  2019-02-28 10:24     ` Neil Jerram
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Goaziou @ 2019-02-27 10:48 UTC
  To: Neil Jerram; +Cc: Org Mode List

Hello,

Neil Jerram <neiljerram@gmail.com> writes:

> I'm not sure how much freedom you have here, but I think it would be
> both clearer - by avoiding confusion with URL-escaping - and easier to
> type, to use an entirely different form of escaping in the Org syntax;
> probably just this:
>
> \[ and \] to include a square bracket in a link
> \\ to include a backslash

Wouldn't that become problematic with file names in Windows?

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-02-27  8:07 ` Jens Lechtenboerger
@ 2019-02-27 11:25   ` Nicolas Goaziou
  2019-02-27 12:57     ` Jens Lechtenboerger
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Goaziou @ 2019-02-27 11:25 UTC
  To: Org Mode List

Hello,

Jens Lechtenboerger <lechten@wi.uni-muenster.de> writes:

> When exporting the following link to LaTeX, the decoding fails.
>
> --8<---------------cut here---------------start------------->8---
> [[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
> --8<---------------cut here---------------end--------------->8---

According to my suggestion in this thread, this link should be written

  [[https://en.wikipedia.org/wiki/Red%25E2%2580%2593black_tree][Red-black trees]]

i.e., either you wrote it by hand, or `org-insert-link' failed.

With the \-escape solution suggested by Neil, it would be correctly
processed without additional change. Of course, that would entail other
difficulties.
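
To make the comparison concrete, here is a small Python sketch of how
the two schemes treat a URL copied from a browser. Both helpers are
hypothetical and only model the escaping rules discussed in this thread.

```python
def percent_escape(path):
    # Percent scheme: "%" itself must be encoded, then the brackets.
    return path.replace("%", "%25").replace("[", "%5B").replace("]", "%5D")

def backslash_escape(path):
    # Backslash scheme: only backslashes and brackets are touched.
    return path.replace("\\", "\\\\").replace("[", "\\[").replace("]", "\\]")

url = "https://en.wikipedia.org/wiki/Red%E2%80%93black_tree"
print(percent_escape(url))
print(backslash_escape(url) == url)
```

Under the percent scheme the URL has to be rewritten before it goes into
the buffer; under the backslash scheme it can be pasted unchanged.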

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-02-27 11:25   ` Nicolas Goaziou
@ 2019-02-27 12:57     ` Jens Lechtenboerger
  2019-02-28 10:51       ` Nicolas Goaziou
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Lechtenboerger @ 2019-02-27 12:57 UTC
  To: Org Mode List

On 2019-02-27, Nicolas Goaziou wrote:

> Hello,
>
> Jens Lechtenboerger <lechten@wi.uni-muenster.de> writes:
>
>> When exporting the following link to LaTeX, the decoding fails.
>>
>> --8<---------------cut here---------------start------------->8---
>> [[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
>> --8<---------------cut here---------------end--------------->8---
>
> According to my suggestion in this thread, this link should be written
>
>   [[https://en.wikipedia.org/wiki/Red%25E2%2580%2593black_tree][Red-black trees]]
>
> i.e., either you wrote it by hand, or `org-insert-link' failed.

I copied that from the address bar of my browser, probably two years
ago.  Today, I was surprised by a compilation failure.

> With the \-escape solution suggested by Neil, it would be correctly
> processed without additional change. Of course, that would entail other
> difficulties.

You mentioned Windows file names.  I’m not affected by that.  URLs
in my Org files contain neither “[” nor “\” (but lots of “%”).  So
the suggestion by Neil would be fine for me.

Best wishes
Jens

* Re: [RFC] Fixing link encoding once and for all
  2019-02-27 10:48   ` Nicolas Goaziou
@ 2019-02-28 10:24     ` Neil Jerram
  2019-03-01  8:14       ` Nicolas Goaziou
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Jerram @ 2019-02-28 10:24 UTC
  To: Neil Jerram, Org Mode List

On Wed, 27 Feb 2019 at 10:49, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:
>
> Hello,
>
> Neil Jerram <neiljerram@gmail.com> writes:
>
> > I'm not sure how much freedom you have here, but I think it would be
> > both clearer - by avoiding confusion with URL-escaping - and easier to
> > type, to use an entirely different form of escaping in the Org syntax;
> > probably just this:
> >
> > \[ and \] to include a square bracket in a link
> > \\ to include a backslash
>
> Wouldn't that become problematic with file names in Windows?

Do you mean Windows file names in existing Org files?  I.e. the
back-compatibility concern?

If so, yes, I confess I didn't think at all about back-compatibility,
with my suggestion above.  So perhaps that rules my idea out.

If we were starting from scratch, however,
- I believe it would technically be fine; i.e. it's a complete and
unambiguous encoding
- it might be considered awkward for Windows users to have to write
c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
know how big a concern that would be.

Best wishes,
     Neil


> Regards,
>
> --
> Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-02-27 12:57     ` Jens Lechtenboerger
@ 2019-02-28 10:51       ` Nicolas Goaziou
  0 siblings, 0 replies; 21+ messages in thread
From: Nicolas Goaziou @ 2019-02-28 10:51 UTC
  To: Org Mode List

Hello,

Jens Lechtenboerger <lechten@wi.uni-muenster.de> writes:

> I copied that from the address bar of my browser, probably two years
> ago.  Today, I was surprised by a compilation failure.

Link syntax is currently unstable. We fix it on one side and it breaks
elsewhere. 

This thread is an attempt to make the link syntax stable. It will not
necessarily solve your example, tho.

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-02-28 10:24     ` Neil Jerram
@ 2019-03-01  8:14       ` Nicolas Goaziou
  2019-03-01  8:30         ` Nicolas Goaziou
                           ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Nicolas Goaziou @ 2019-03-01  8:14 UTC
  To: Neil Jerram; +Cc: Org Mode List

Hello,

Neil Jerram <neiljerram@gmail.com> writes:

> Do you mean Windows file names in existing Org files?  I.e. the
> back-compatibility concern?
>
> If so, yes, I confess I didn't think at all about back-compatibility,
> with my suggestion above.  So perhaps that rules my idea out.
>
> If we were starting from scratch, however,
> - I believe it would technically be fine; i.e. it's a complete and
> unambiguous encoding
> - it might be considered awkward for Windows users to have to write
> c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
> know how big a concern that would be.

Thinking a bit more about it, we don't need to escape /all/ square
brackets, only "]]" and "][" constructs. Therefore, we don't need to
escape every backslash either.

The regexp for bracket links could be, in its simple (!) form:

  \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

Most links would need no change.  I see one notable exception:
directories in Windows:

  [[c:\system32\\]] for "c:\system32\"

Some further notes:

1. Macros already use backslashes to escape commas in arguments, so it
   is at least consistent with this part of Org.
   
2. The description part of the link, like most parts of Org, does not
   use backslash escaping. If needed, we can implement an entity for
   a square bracket.

3. There will be some backward compatibility issues. We can add
   a checker in Org Lint to catch most of those. For example, we could
   look at URI where every percent is followed only by 25, 5B, and 5D.

WDYT?

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:14       ` Nicolas Goaziou
@ 2019-03-01  8:30         ` Nicolas Goaziou
  2019-03-01  8:40         ` Michael Brand
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Nicolas Goaziou @ 2019-03-01  8:30 UTC
  To: Neil Jerram; +Cc: Org Mode List

Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> The regexp for bracket links could be, in its simple (!) form:
>
>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

Small update, in its string form now:

  "\\[\\[\\([^\000]*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"

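For readers more at home with Python's regexp syntax, the string form
above can be transliterated roughly as follows. This is a sketch, not
Org's code: the group of backslash pairs is made non-capturing here, so
the description becomes group 2, and `parse_link` is a hypothetical
helper.

```python
import re

BRACKET_LINK_RE = re.compile(
    r"\[\["                        # opening [[
    r"([^\x00]*?[^\\](?:\\\\)*)"   # path: a non-backslash, then pairs of backslashes
    r"\]"                          # ] closing the path
    r"(?:\[([^\x00]+?)\])?"        # optional [description]
    r"\]"                          # final ]
)

def parse_link(text):
    """Return (path, description) for the first bracket link, or None."""
    m = BRACKET_LINK_RE.search(text)
    return (m.group(1), m.group(2)) if m else None

print(parse_link("[[a][b]]"))
print(parse_link(r"[[a\][b]]"))
print(parse_link(r"[[c:\system32\\]]"))
print(parse_link(r"[[c:\system32\]]"))
```

The last call finds no link, since a single trailing backslash escapes
the closing bracket; the doubled form in the third call is accepted.
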
* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:14       ` Nicolas Goaziou
  2019-03-01  8:30         ` Nicolas Goaziou
@ 2019-03-01  8:40         ` Michael Brand
  2019-03-01  8:41         ` Jens Lechtenboerger
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Michael Brand @ 2019-03-01  8:40 UTC
  To: Neil Jerram, Org Mode List

On Fri, Mar 1, 2019 at 9:15 AM Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:

> Thinking a bit more about it, we don't need to escape /all/ square
> brackets, only "]]" and "][" constructs. Therefore, we don't need to
> escape every backslash either.

Brilliant!

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:14       ` Nicolas Goaziou
  2019-03-01  8:30         ` Nicolas Goaziou
  2019-03-01  8:40         ` Michael Brand
@ 2019-03-01  8:41         ` Jens Lechtenboerger
  2019-03-01  8:56           ` Nicolas Goaziou
  2019-03-03  6:58         ` stardiviner
  2019-03-04 23:16         ` Neil Jerram
  4 siblings, 1 reply; 21+ messages in thread
From: Jens Lechtenboerger @ 2019-03-01  8:41 UTC
  To: Neil Jerram; +Cc: Org Mode List

Hi there,

I like this proposal.

On 2019-03-01, Nicolas Goaziou wrote:

> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look at URI where every percent is followed only by 25, 5B, and 5D.

I do not understand this point.  What is special about URIs where
*only* those occur?  Might compatibility issues not arise if those
occur at all (while others such as %28 and %29 for parentheses might
occur without problems as well)?

Best wishes
Jens

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:41         ` Jens Lechtenboerger
@ 2019-03-01  8:56           ` Nicolas Goaziou
  2019-03-01  9:40             ` Jens Lechtenboerger
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Goaziou @ 2019-03-01  8:56 UTC
  To: Neil Jerram; +Cc: Org Mode List

Hello,

Jens Lechtenboerger <lechten@wi.uni-muenster.de> writes:

> On 2019-03-01, Nicolas Goaziou wrote:
>
>> 3. There will be some backward compatibility issues. We can add
>>    a checker in Org Lint to catch most of those. For example, we could
>>    look at URI where every percent is followed only by 25, 5B, and 5D.
>
> I do not understand this point.  What is special about URIs where
> *only* those occur?  Might compatibility issues not arise if those
> occur at all (while others such as %28 and %29 for parentheses might
> occur without problems as well)?

If a URI seems percent encoded, but only uses %25, %5B and %5D as escape
combinations, there is a high chance that it is Org-encoded, and
therefore uses a deprecated syntax. We could send a warning to the user
in this case; they might want to clean the URI.

OTOH, if there is %28, or %29, we are sure it isn't Org-encoded, and
therefore, the percent-encoding was intended right from the start (like
in your Wikipedia link).
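
The heuristic can be sketched in a few lines of Python. This only
illustrates the rule as stated here (every percent sign followed by 25,
5B or 5D); it is not the Org Lint checker, and `looks_org_encoded` is a
hypothetical name.

```python
import re

ORG_ONLY = {"25", "5B", "5D"}

def looks_org_encoded(uri):
    """True if the URI has percent signs and each one starts %25, %5B or %5D."""
    # Grab up to two characters after every "%" and compare them to the set.
    followers = re.findall(r"%(.{0,2})", uri)
    return bool(followers) and all(f.upper() in ORG_ONLY for f in followers)

print(looks_org_encoded("file:notes%5B1%5D.org"))
print(looks_org_encoded("https://en.wikipedia.org/wiki/Red%E2%80%93black_tree"))
```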

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:56           ` Nicolas Goaziou
@ 2019-03-01  9:40             ` Jens Lechtenboerger
  0 siblings, 0 replies; 21+ messages in thread
From: Jens Lechtenboerger @ 2019-03-01  9:40 UTC
  To: Neil Jerram; +Cc: Org Mode List

On 2019-03-01, Nicolas Goaziou wrote:

> Jens Lechtenboerger <lechten@wi.uni-muenster.de> writes:
>
>> On 2019-03-01, Nicolas Goaziou wrote:
>>
>>> 3. There will be some backward compatibility issues. We can add
>>>    a checker in Org Lint to catch most of those. For example, we could
>>>    look at URI where every percent is followed only by 25, 5B, and 5D.
>>
>> I do not understand this point.  What is special about URIs where
>> *only* those occur?  Might compatibility issues not arise if those
>> occur at all (while others such as %28 and %29 for parentheses might
>> occur without problems as well)?
>
> If a URI seems percent encoded, but only uses %25, %5B and %5D as escape
> combinations, there is a high chance that it is Org-encoded, and
> therefore uses a deprecated syntax. We could send a warning to the user
> in this case; they might want to clean the URI.
>
> OTOH, if there is %28, or %29, we are sure it isn't Org-encoded, and
> therefore, the percent-encoding was intended right from the start (like
> in your Wikipedia link).

Thanks for the clarification.  Makes sense.

Best wishes
Jens

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:14       ` Nicolas Goaziou
                           ` (2 preceding siblings ...)
  2019-03-01  8:41         ` Jens Lechtenboerger
@ 2019-03-03  6:58         ` stardiviner
  2019-03-03  8:08           ` Nicolas Goaziou
  2019-03-04 23:16         ` Neil Jerram
  4 siblings, 1 reply; 21+ messages in thread
From: stardiviner @ 2019-03-03  6:58 UTC
  To: emacs-orgmode; +Cc: Neil Jerram


Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look at URI where every percent is followed only by 25, 5B, and 5D.
>

About this, I'm curious: would it be possible to let this checker search and
interactively query-replace, running recursively over all Org files in a
directory? When Org is updated, I hope my Org documents can be updated too.

* Re: [RFC] Fixing link encoding once and for all
  2019-03-03  6:58         ` stardiviner
@ 2019-03-03  8:08           ` Nicolas Goaziou
  0 siblings, 0 replies; 21+ messages in thread
From: Nicolas Goaziou @ 2019-03-03  8:08 UTC
  To: stardiviner; +Cc: Neil Jerram, emacs-orgmode

Hello,

stardiviner <numbchild@gmail.com> writes:

> Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

>> 3. There will be some backward compatibility issues. We can add
>>    a checker in Org Lint to catch most of those. For example, we could
>>    look at URI where every percent is followed only by 25, 5B, and 5D.
>
> About this, I'm curious: would it be possible to let this checker search and
> interactively query-replace, running recursively over all Org files in a
> directory? When Org is updated, I hope my Org documents can be updated too.

The linter is only effective on the current document, and does not offer
to change it.

Writing a function to replace such links would be great. It is not my
priority at the moment, tho. 
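
As a starting point for such a tool, here is a hedged Python sketch that
walks a directory tree and reports links whose path looks Org-encoded.
It only lists candidates and rewrites nothing. The link regexp is
deliberately simplified, and all names are hypothetical; nothing here is
part of Org or Org Lint.

```python
import re
from pathlib import Path

# Simplified: the path part of [[path]] or [[path][description]] on one line.
LINK_PATH_RE = re.compile(r"\[\[([^\]\n]+)\]")
ORG_ONLY = {"25", "5B", "5D"}

def looks_org_encoded(path):
    followers = re.findall(r"%(.{0,2})", path)
    return bool(followers) and all(f.upper() in ORG_ONLY for f in followers)

def find_candidates(root):
    """Yield (file, line number, link path) for suspicious links under root."""
    for org_file in sorted(Path(root).rglob("*.org")):
        text = org_file.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in LINK_PATH_RE.finditer(line):
                if looks_org_encoded(match.group(1)):
                    yield org_file, lineno, match.group(1)
```

A caller could loop over `find_candidates("~/org")` (after expanding the
path) and decide per link whether to rewrite it.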

Regards,

-- 
Nicolas Goaziou

* Re: [RFC] Fixing link encoding once and for all
  2019-03-01  8:14       ` Nicolas Goaziou
                           ` (3 preceding siblings ...)
  2019-03-03  6:58         ` stardiviner
@ 2019-03-04 23:16         ` Neil Jerram
  2019-03-05  0:23           ` Nicolas Goaziou
  4 siblings, 1 reply; 21+ messages in thread
From: Neil Jerram @ 2019-03-04 23:16 UTC
  To: Neil Jerram, Org Mode List

On Fri, 1 Mar 2019 at 08:14, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:
>
> Hello,
>
> Neil Jerram <neiljerram@gmail.com> writes:
>
> > Do you mean Windows file names in existing Org files?  I.e. the
> > back-compatibility concern?
> >
> > If so, yes, I confess I didn't think at all about back-compatibility,
> > with my suggestion above.  So perhaps that rules my idea out.
> >
> > If we were starting from scratch, however,
> > - I believe it would technically be fine; i.e. it's a complete and
> > unambiguous encoding
> > - it might be considered awkward for Windows users to have to write
> > c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
> > know how big a concern that would be.
>
> Thinking a bit more about it, we don't need to escape /all/ square
> brackets, only "]]" and "][" constructs. Therefore, we don't need to
> escape every backslash either.

Agreed.

> The regexp for bracket links could be, in its simple (!) form:
>
>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

[then a bit later]
> Small update, in its string form now:
>
>   "\\[\\[\\([^\000]*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"

Is [^\000] the only (or best) way of saying "any character, including
newlines"?  Could there be actual NUL characters in the document?

More generally I'm not sure I'm fully understanding the regex.  I
_think_ it breaks down like this:

\[\[      # literal [[
\(        # begin group 1
[^\000]*? # non-greedy any characters (0 or more)
[^\]      # something not a backslash
\(        # begin group 2
\\\\      # literal \\
\)*       # end group 2, and allow 0 or more of it
\)        # end group 1
\]        # literal ]
\(        # begin group 3
?         # don't understand
:\[       # literal :[
\(        # begin group 4
[^\000]+? # non-greedy any characters (1 or more)
\)        # end group 4
\]        # literal ]
\)?       # end group 3, and allow 0 or 1 of it
\]        # literal ]

but there's at least a ? that I don't understand, and I'm afraid I'm
not seeing how it's useful.

> Most links would need no change.  I see one notable exception:
> directories in Windows:
>
>   [[c:\system32\\]] for "c:\system32\"

But I guess it would be unusual to write a trailing backslash like that.

> Some further notes:
>
> 1. Macros already use backslashes to escape commas in arguments, so it
>    is at least consistent with this part of Org.
>
> 2. The description part of the link, like most parts of Org, does not
>    use backslash escaping. If needed, we can implement an entity for
>    a square bracket.
>
> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look for URIs where every percent sign is followed only by 25, 5B, or 5D.
>
> WDYT?

If you think it works, I'm happy to defer to your judgement on that!
Although I suggested the idea, I don't know Org nearly well enough to
be sure that I haven't missed problems; but I guess that you would
know that.
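
For what it's worth, the Org Lint heuristic in note 3 could be sketched
roughly as below. This is only an illustration: `my/only-org-escapes-p'
is a made-up name, not an existing Org function, and I am assuming the
check is case-sensitive because only the upper-case forms are mentioned
above.

```elisp
(defun my/only-org-escapes-p (uri)
  "Return non-nil if URI contains percent signs and each one starts
\"%25\", \"%5B\" or \"%5D\".  Hypothetical helper, not part of Org."
  (let ((case-fold-search nil))
    (and (string-match-p "%" uri)
         ;; Drop the three Org escapes; any percent sign left over
         ;; belongs to some other escape, or is a bare percent sign.
         (not (string-match-p
               "%"
               (replace-regexp-in-string
                "%\\(?:25\\|5[BD]\\)" "" uri t t))))))
```

A path such as "file:a%5Bb%5D.org" would be flagged as probably
Org-encoded, whereas "https://example.com/a%20b" would not.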

Best wishes,
      Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC] Fixing link encoding once and for all
  2019-03-04 23:16         ` Neil Jerram
@ 2019-03-05  0:23           ` Nicolas Goaziou
  2019-03-05 16:27             ` Neil Jerram
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Goaziou @ 2019-03-05  0:23 UTC (permalink / raw)
  To: Neil Jerram; +Cc: Org Mode List

Hello,

Neil Jerram <neiljerram@gmail.com> writes:

> On Fri, 1 Mar 2019 at 08:14, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:

>> The regexp for bracket links could be, in its simple (!) form:
>>
>>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]
>
> [then a bit later]
>> Small update, in its string form now:
>>
>>   "\\[\\[\\([^\000]*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"
>
> Is [^\000] the only (or best) way of saying "any character, including
> newlines"?

There is also "\(.\|\n\)", or "[[:ascii:][:nonascii:]]".

> Could there be actual NUL characters in the document?

Good question. I used [^\000] out of habit. You are right, "\(.\|\n\)"
is more robust.
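
To make the difference concrete, here is a small sketch comparing the
idioms on a string that contains both a newline and a NUL character.
The function name is made up for the example.

```elisp
(defun my/any-char-match-ends ()
  "Return where each \"any character\" idiom stops matching.
The test string is \"a\", newline, NUL, \"b\".  Hypothetical demo."
  (let ((text (string ?a ?\n 0 ?b)))
    (list
     ;; "." does not match a newline.
     (and (string-match "\\`.*" text) (match-end 0))
     ;; "[^\000]" crosses the newline but not the NUL character.
     (and (string-match "\\`[^\000]*" text) (match-end 0))
     ;; The last two idioms match every character.
     (and (string-match "\\`\\(?:.\\|\n\\)*" text) (match-end 0))
     (and (string-match "\\`[[:ascii:][:nonascii:]]*" text)
          (match-end 0)))))
```

Evaluating (my/any-char-match-ends) should return (1 2 4 4).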

So, the new challenger is:

    "\\[\\[\\(\\(?:.\\|\n\\)*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\(\\(?:.\\|\n\\)+?\\)\\]\\)?\\]"

Beautiful.
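
A minimal sketch for trying that string on a few links. The two names
below are made up for the example, they are not Org functions. Note
that the backslash pair is itself a capturing group, so with this exact
string the description ends up in match group 3.

```elisp
(defconst my/bracket-link-re
  "\\[\\[\\(\\(?:.\\|\n\\)*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\(\\(?:.\\|\n\\)+?\\)\\]\\)?\\]"
  "Candidate bracket link regexp, copied verbatim from above.")

(defun my/parse-bracket-link (string)
  "Return (PATH . DESCRIPTION) for the first bracket link in STRING.
Return nil when there is no match.  PATH is left escaped."
  (when (string-match my/bracket-link-re string)
    ;; Group 1 is the path, group 2 the last backslash pair, if any,
    ;; and group 3 the optional description.
    (cons (match-string 1 string) (match-string 3 string))))
```

With this, "[[https://orgmode.org][Org]]" gives the path
"https://orgmode.org" and the description "Org", [[c:\system32\\]]
matches with the still-escaped path c:\system32\\ and no description,
and [[a\]] does not match at all, since a single backslash before the
closing brackets escapes the first of them.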

The commented rx equivalent would be:

(seq "["
     ;; URI part: match group 1.
     "["
     (group
      (*? anything)
      ;; Allow an even number of backslashes before the closing bracket.
      (not (any "\\"))
      (zero-or-more (group "\\\\")))
     "]"
     ;; Description (optional): match group 3.
     (opt "[" (group (+? anything)) "]")
     "]")

> \(        # begin group 3
> ?         # don't understand
> :\[       # literal :[

[...]

> but there's at least a ? that I don't understand, and I'm afraid I'm
> not seeing how it's useful.

\(?: ... \) is a shy group.
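
That is, it groups like \( ... \) but is not numbered and records no
match data. A small sketch of the difference, using a made-up function
name:

```elisp
(defun my/shy-group-demo (s)
  "Match S with a capturing and then a shy outer group.
Return the contents of groups 1 and 2 after each match."
  (list
   ;; Capturing outer group: it is group 1, the inner one is group 2.
   (and (string-match "\\(\\[\\(.+?\\)\\]\\)" s)
        (list (match-string 1 s) (match-string 2 s)))
   ;; Shy outer group: the inner group becomes group 1.
   (and (string-match "\\(?:\\[\\(.+?\\)\\]\\)" s)
        (list (match-string 1 s) (match-string 2 s)))))
```

Here (my/shy-group-demo "[desc]") should return
(("[desc]" "desc") ("desc" nil)).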

> If you think it works, I'm happy to defer to your judgement on that!
> Although I suggested the idea, I don't know Org nearly well enough to
> be sure that I haven't missed problems;

We are solving the problem with a regexp. What bad things could happen? ;)

Regards,

-- 
Nicolas Goaziou

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC] Fixing link encoding once and for all
  2019-03-05  0:23           ` Nicolas Goaziou
@ 2019-03-05 16:27             ` Neil Jerram
  2019-03-05 16:36               ` Robert Pluim
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Jerram @ 2019-03-05 16:27 UTC (permalink / raw)
  To: Neil Jerram, Org Mode List

Hi Nicolas,

On Tue, 5 Mar 2019 at 00:23, Nicolas Goaziou <mail@nicolasgoaziou.fr> wrote:
[...]
> So, the new challenger is:
>
>     "\\[\\[\\(\\(?:.\\|\n\\)*?[^\\]\\(\\\\\\\\\\)*\\)\\]\\(?:\\[\\(\\(?:.\\|\n\\)+?\\)\\]\\)?\\]"
>
> Beautiful.
>
> The commented rx equivalent would be:
>
> (seq "["
>      ;; URI part: match group 1.
>      "["
>      (group
>       (*? anything)
>       ;; Allow an even number of backslashes before the closing bracket.
>       (not (any "\\"))
>       (zero-or-more (group "\\\\")))
>      "]"
>      ;; Description (optional): match group 3.
>      (opt "[" (group (+? anything)) "]")
>      "]")
>
> > \(        # begin group 3
> > ?         # don't understand
> > :\[       # literal :[
>
> [...]
>
> > but there's at least a ? that I don't understand, and I'm afraid I'm
> > not seeing how it's useful.
>
> \(?: ... \) is a shy group.

Thanks for explaining that.  It's not mentioned in the manual though
(https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html);
are you sure that it's supported in Emacs regexps?

> > If you think it works, I'm happy to defer to your judgement on that!
> > Although I suggested the idea, I don't know Org nearly well enough to
> > be sure that I haven't missed problems;
>
> We are solving the problem with a regexp. What bad things could happen? ;)

Well hopefully the fallout is limited to destroying all of the text in
one Org buffer. :-)

More seriously, though, I don't understand when and how the regexp is
used.  Presumably you loop through the buffer looking for matches, but
what do you do after each match?

Regards,
    Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC] Fixing link encoding once and for all
  2019-03-05 16:27             ` Neil Jerram
@ 2019-03-05 16:36               ` Robert Pluim
  0 siblings, 0 replies; 21+ messages in thread
From: Robert Pluim @ 2019-03-05 16:36 UTC (permalink / raw)
  To: Neil Jerram; +Cc: Org Mode List

Neil Jerram <neiljerram@gmail.com> writes:

> Thanks for explaining that.  It's not mentioned in the manual though
> (https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html);
> are you sure that it's supported in Emacs regexps?
>

Itʼs described in the next node:

<https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexp-Backslash.html>

Robert

^ permalink raw reply	[flat|nested] 21+ messages in thread
