emacs-orgmode@gnu.org archives
* [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
@ 2024-06-13 13:32 Morgan Willcock
  2024-06-14 14:04 ` Ihor Radchenko
  0 siblings, 1 reply; 6+ messages in thread
From: Morgan Willcock @ 2024-06-13 13:32 UTC (permalink / raw)
  To: emacs-orgmode




When web links are inserted into an org buffer, if the link ends in a
trailing dash this seems to be omitted from the link target.

i.e. Inserting "https://domain/test-" into the buffer will create a
clickable link for "https://domain/test".

These types of links will likely be encountered for sites where anchor
targets are automatically generated from documentation headings which
are questions.

e.g. https://learn.microsoft.com/en-us/entra/identity/hybrid/connect/how-to-connect-sso-faq#how-can-i-roll-over-the-kerberos-decryption-key-of-the--azureadsso--computer-account-

It seems straightforward to verify that the trailing dash is not
considered part of the link:

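  (require 'org)
  (require 'cl-lib)   ; for `cl-assert'

  (with-temp-buffer
    (org-mode)
    (insert "https://domain/test-")
    (goto-char (point-min))
    ;; The element at point should be a plain link.
    (let ((context (org-element-context)))
      (cl-assert (eq (org-element-type context) 'link))
      ;; Return the text the parser treats as the link; on 9.7.3 the
      ;; trailing dash is missing from it.
      (buffer-substring-no-properties
       (org-element-property :begin context)
       (org-element-property :end context))))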

Emacs  : GNU Emacs 29.3 (build 2, x86_64-pc-linux-gnu, X toolkit, cairo version 1.16.0, Xaw3d scroll bars)
 of 2024-03-25
Package: Org mode version 9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)



* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
  2024-06-13 13:32 [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)] Morgan Willcock
@ 2024-06-14 14:04 ` Ihor Radchenko
  2024-06-16 15:43   ` Max Nikulin
  0 siblings, 1 reply; 6+ messages in thread
From: Ihor Radchenko @ 2024-06-14 14:04 UTC (permalink / raw)
  To: Morgan Willcock; +Cc: emacs-orgmode

Morgan Willcock <morgan@ice9.digital> writes:

> When web links are inserted into an org buffer, if the link ends in a
> trailing dash this seems to be omitted from the link target.
>
> i.e. Inserting "https://domain/test-" into the buffer will create a
> clickable link for "https://domain/test".
>
> These types of links will likely be encountered for sites where anchor
> targets are automatically generated from documentation headings which
> are questions.

This makes sense.
I improved the heuristics we use to detect plain links.
Fixed, on main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=73da6beb5
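
For illustration only, here is a toy pattern showing why a rule that
requires the last character to be non-punctuation drops a trailing
dash, and how explicitly accepting "-" as a final character keeps it.
This is a minimal sketch, not the actual `org-link-plain-re' (whose
definition is more involved) and not the change from the commit above;
the names my/toy-link-re-strict, my/toy-link-re-dash and
my/toy-link-match are hypothetical.

```elisp
;; Toy patterns for illustration, NOT the real `org-link-plain-re'.
(defconst my/toy-link-re-strict
  "https?://[^ \t\n]*[^[:punct:] \t\n]"
  "Hypothetical pattern: the last character must not be punctuation.")

(defconst my/toy-link-re-dash
  "https?://[^ \t\n]*\\(?:[^[:punct:] \t\n]\\|-\\)"
  "Hypothetical pattern: as above, but a final \"-\" is also accepted.")

(defun my/toy-link-match (regexp text)
  "Return the part of TEXT matched by REGEXP, or nil if there is no match."
  (when (string-match regexp text)
    (match-string 0 text)))

(my/toy-link-match my/toy-link-re-strict "https://example.org/test-")
;; => "https://example.org/test"
(my/toy-link-match my/toy-link-re-dash "https://example.org/test-")
;; => "https://example.org/test-"
```

The greedy middle part consumes the whole URL first; the final
character class then forces the match to back off past the dash unless
the dash is listed as an allowed ending.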

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
  2024-06-14 14:04 ` Ihor Radchenko
@ 2024-06-16 15:43   ` Max Nikulin
  2024-06-16 15:59     ` Ihor Radchenko
  0 siblings, 1 reply; 6+ messages in thread
From: Max Nikulin @ 2024-06-16 15:43 UTC (permalink / raw)
  To: emacs-orgmode

On 14/06/2024 21:04, Ihor Radchenko wrote:
> Morgan Willcock writes:
> 
>> i.e. Inserting "https://domain/test-" into the buffer will create a
>> clickable link for "https://domain/test".
>>
> I improved the heuristics we use to detect plain links.
> Fixed, on main.
> https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=73da6beb5

> +++ b/etc/ORG-NEWS
[...]
> +*** Trailing =-= is now allowed in plain links

After a look into

7dcb1afb6 2021-03-24 21:27:24 +0800 Ihor Radchenko: Improve org-link-plain-re

I suspect it worked prior to v9.5. Without a unit test, it may be
accidentally broken again.

> +: https://domain/test-

example.org, example.net, example.com are domains reserved for usage in 
examples: 
<https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>

>                     (or (regexp "[^[:punct:] \t\n]")

I have realized that some Org regexps use the [:punct:] *regexp class*
while others use the *syntax class* (see the LaTeX math regexp). I am
not sure whether the discrepancy is intentional.

I have also noticed that the following change

09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re: Improve regexp heuristics

causes the input

     (link http://example.org/a<b)

to be exported as

     <p>
     (link <a href="http://example.org/a%3Cb)">http://example.org/a%3Cb)</a></p>

I expect that ")" should not be parsed as a part of the link. Balanced
brackets are tricky with regexps (and it is not possible to match
arbitrarily nested ones).

Perhaps "[^[:punct:] \t\n]" is too strict with respect to spaces. It does
not allow the recommended workaround with a zero width space:

(org-export-string-as
  "http://example.org\N{ZERO WIDTH SPACE}[fn::footnote]" 'html 'body)
"<p>
<a 
href=\"http://example.org​[fn::footnote]\">http://example.org​[fn::footnote]</a></p>
"

Actually, some kind of non-breaking space would be better in such cases:

(org-export-string-as
  "http://example.org\N{NO-BREAK SPACE}[fn::footnote]" 'html 'body)
"<p>
<a 
href=\"http://example.org [fn::footnote]\">http://example.org [fn::footnote]</a></p>
"

I would consider [:space:] or \s-.

As to the original bug report: while reading it, I noticed that
Thunderbird includes the dash in the recognized link for

   "https://domain/test-"

I decided to look into its implementation and, to my surprise, found:
``punctuation chars and "-" at the end are stripped off.'' I realized that
double quotes, along with angle brackets, are treated as a recommended way
to mark URLs in plain text. Thunderbird does not consider the dash a part
of the link for an undelimited URL such as http://example.org/t- . It
might be an attempt to keep open the possibility of reassembling URLs
wrapped across several lines with added hyphenation marks, but that has
not been implemented (RFC 2396 appendix E warns about accidentally added
hyphens).

https://www.bucksch.org/1/projects/mozilla/16507/
https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#line-243
mozTXTToHTMLConv::FindURLEnd

The implementation is tricky, and I have not noticed anything that could
be reused to improve the heuristics for Org. Nowadays it is probably
better to inspect the autolinking code of GitHub/GitLab or of widely used
Python packages.




* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
  2024-06-16 15:43   ` Max Nikulin
@ 2024-06-16 15:59     ` Ihor Radchenko
  2024-06-20 12:15       ` Max Nikulin
  0 siblings, 1 reply; 6+ messages in thread
From: Ihor Radchenko @ 2024-06-16 15:59 UTC (permalink / raw)
  To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

>> +*** Trailing =-= is now allowed in plain links
>
> After a look into
>
> 7dcb1afb6 2021-03-24 21:27:24 +0800 Ihor Radchenko: Improve 
> org-link-plain-re
>
> I suspect, it worked prior to v9.5. Without a unit test it may be 
> accidentally broken again.

No, it did not work.
If you can, please do not make such assertions without testing.

>> +: https://domain/test-
>
> example.org, example.net, example.com are domains reserved for usage in 
> examples: 
> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>

And so?

>>                     (or (regexp "[^[:punct:] \t\n]")
>
> I have realized that some Org regexps use [:punct:] *regexp class* and 
> others *syntax class*, see latex math regexp. I am in doubts if the 
> discrepancy is intentional.

It is not intentional, but using syntax classes can sometimes be
fragile.

> I have noticed that the following change
>
> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re: 
> Improve regexp heuristics
>
> that causes
>
>      (link http://example.org/a<b)
>
> input is exported as
>
>      <p>
>      (link <a 
> href="http://example.org/a%3Cb)">http://example.org/a%3Cb)</a></p>
>
> I expect that ")" should not be parsed as a part of the link. Balanced 
> brackets are tricky with regexps (and it is not possible to match 
> arbitrary nested ones).

It is heuristics. We cannot be 100% right. So, it is what it is.

> Perhaps "[^[:punct:] \t\n]" is too strict in respect to spaces. It does 
> not allow the recommended workaround with zero width space:

You don't need zero width space for links.
Just use <bracket link>.

> As to the original bug report, while reading it, I noticed that 
> thunderbird includes dash into the recognized link for
>
>    "https://domain/test-"
>
> I decided to look into its implementation and to my surprise I found: 
> ``punctation chars and "-" at the end are stipped off.'' I realized that 
> double quotes along with angle brackets are treated as a recommended way 
> to mark URLs in plain text. Thunderbird does not consider dash as a part 
> of links for e.g. http://example.org/t- It might be an attempt to 
> reserve possibility to assemble URLs wrapped into several lines with 
> added hyphenation marks, but it has not been implemented (RFC2396 
> appendix E warns about accidentally added hyphens).
>
> https://www.bucksch.org/1/projects/mozilla/16507/
> https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#line-243
> mozTXTToHTMLConv::FindURLEnd
>
> Implementation is tricky, I have not noticed anything that may be reused 
> to improve heuristics for Org. Nowadays it is likely better to inspect 
> autolinking code for GitHub/GitLab or widely used python packages.

If you have concrete proposals, please share them.

> I would consider [:space:] or \s-.

Do you mean "[^[:punct:][:space:]\t\n]"?




* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
  2024-06-16 15:59     ` Ihor Radchenko
@ 2024-06-20 12:15       ` Max Nikulin
  2024-06-22 13:41         ` Ihor Radchenko
  0 siblings, 1 reply; 6+ messages in thread
From: Max Nikulin @ 2024-06-20 12:15 UTC (permalink / raw)
  To: emacs-orgmode

On 16/06/2024 22:59, Ihor Radchenko wrote:
> Max Nikulin writes:
>>
>> I suspect, it worked prior to v9.5. Without a unit test it may be
>> accidentally broken again.
> 
> No, it did not work.
> If you can, please do not make such assertions without testing.

I am sorry, I had no intention of offending you. I missed that the removed
line with the explicit list of punctuation characters was commented out. I
have tried the regexp used before (a part of v6.34)

     facedba05 2009-12-09 15:13:50 +0100 Carsten Dominik: Use John Gruber's regular expression for URL's

and it seems a trailing dash was allowed.

>>> +: https://domain/test-
>>
>> example.org, example.net, example.com are domains reserved for usage in
>> examples:
>> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>
> 
> And so?

http://example.org/dash- may be a bit better for docs. (For IPv6 
addresses the difference should be more noticeable, but I do not 
remember what range is reserved for usage in examples there.)

>> I have realized that some Org regexps use [:punct:] *regexp class* and
>> others *syntax class*, see latex math regexp. I am in doubts if the
>> discrepancy is intentional.
> 
> It is not intentional, but using syntax classes can sometimes be
> fragile.

Do you mean that the result depends on the current buffer? I do not have
a strong opinion on which variant should be used. What I do not like is
that in the case of $n$-th the character after the second "$" is tested
against a syntax class, while a regexp class is used for links. This
subtle difference is almost certainly ignored in alternative
implementations of the parser. However, I am not sure which characters
besides the dash and apostrophe are affected and whether it depends on
the locale.

>> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
>> Improve regexp heuristics
[...]
>>       (link http://example.org/a<b)
[...]
> It is heuristics. We cannot be 100% right. So, it is what it is.

From my point of view, it is at least close to a regression. I do not
have any argument against http://example.org/a<b>, but the regexp should
not match the whole of "http://example.org/a<b)".

[...]
>> Nowadays it is likely better to inspect
>> autolinking code for GitHub/GitLab or widely used python packages.
> 
> If you have concrete proposals, please share them.

Not yet. I consider inspecting Mozilla's code a kind of negative
result from the point of view of usefulness for Org. Expanding the test
suite by gathering examples of failed heuristics from bug reports
requires enough reports. https://wpt.live/url/resources/urltestdata.json
(https://github.com/web-platform-tests/wpt) is too specific to browsers
and HTML/JS.

>> I would consider [:space:] or \s-.
> 
> Do you mean "[^[:punct:][:space:]\t\n]"?

I believe it might be an improvement ([:space:] includes \t).
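
To compare the two bracket expressions on the characters discussed
above, a small probe along these lines could be evaluated. It is only
a sketch: the labels and the names my/final-char-classes and
my/final-char-report are hypothetical, and no result is claimed for
the zero width and no-break spaces, since [:space:] follows the syntax
table in effect.

```elisp
(defconst my/final-char-classes
  '(("current"  . "[^[:punct:] \t\n]")
    ("proposed" . "[^[:punct:][:space:]\t\n]"))
  "Hypothetical labels for the two bracket expressions from this thread.")

(defun my/final-char-report (char)
  "Return an alist telling which class in `my/final-char-classes' accepts CHAR.
The check runs in a temporary buffer, so the default syntax table applies."
  (with-temp-buffer
    (mapcar (lambda (entry)
              (cons (car entry)
                    (and (string-match-p (cdr entry) (string char)) t)))
            my/final-char-classes)))

;; Probe ZERO WIDTH SPACE (#x200B), NO-BREAK SPACE (#xA0), a plain
;; space and a letter.  Results are deliberately not asserted here.
(mapcar (lambda (char) (cons char (my/final-char-report char)))
        (list #x200B #xA0 ?\s ?a))
```

Running the same probe inside an Org buffer instead of a temporary one
would show whether a customized syntax table changes the answer.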





* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
  2024-06-20 12:15       ` Max Nikulin
@ 2024-06-22 13:41         ` Ihor Radchenko
  0 siblings, 0 replies; 6+ messages in thread
From: Ihor Radchenko @ 2024-06-22 13:41 UTC (permalink / raw)
  To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

>> If you can, please do not make such assertions without testing.
>
> I am sorry, I had no intention to offend you. I missed that the removed 
> line with explicit list of punctuation characters was commented out. I 
> have tried the regexp used before (a part of v6.34)

>      facedba05 2009-12-09 15:13:50 +0100 Carsten Dominik: Use John 
> Gruber's regular expression for URL's
>
> and it seems trailing dash was allowed.

Hmm. That's a really long time ago, earlier than the built-in Org in Emacs
versions that are available in various distros. My reading of "prior to
v9.5" was more like "not too far before v9.5" (and I tested everything
down to the Org mode included in Emacs 26).

>>>> +: https://domain/test-
>>>
>>> example.org, example.net, example.com are domains reserved for usage in
>>> examples:
>>> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>
>> 
>> And so?
>
> http://example.org/dash- may be a bit better for docs. (For IPv6 
> addresses the difference should be more noticeable, but I do not 
> remember what range is reserved for usage in examples there.)

I see. I would not mind installing a patch, if you submit it.

>>> I have realized that some Org regexps use [:punct:] *regexp class* and
>>> others *syntax class*, see latex math regexp. I am in doubts if the
>>> discrepancy is intentional.
>> 
>> It is not intentional, but using syntax classes can sometimes be
>> fragile.
>
> Do you mean that result depends on current buffer? I do not have strong 
> opinion what variant should be used.

Not current buffer. Current syntax table, inherited from
outline-mode. And that syntax table is customized by some users, leading
to Org parser behaving unexpectedly in some scenarios.

Also, there is 'syntax-table text property, and I have managed to break
Org parser in the past by trying to apply 'syntax-table property to code
blocks in Org mode (I was trying to solve `forward-sexp' bug people
frequently report).

So, we should generally avoid using syntax tables, so that Org syntax
becomes independent of user customizations in that area. Or, at least,
we should not introduce more syntax class uses when possible.
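
A small sketch of the dependency described above, using a hypothetical
helper my/dash-matches-p: the syntax class \s. gives a different
answer for "-" depending on the syntax table in effect, while the
[:punct:] regexp class does not for this ASCII character. That last
point reflects my understanding of the current Emacs implementation;
for non-ASCII characters [:punct:] is documented to match anything
that is not a word constituent, so it is not fully independent of the
syntax table either.

```elisp
(defun my/dash-matches-p (regexp dash-syntax)
  "Return t if REGEXP matches \"-\" when \"-\" has syntax DASH-SYNTAX.
DASH-SYNTAX is a syntax descriptor such as \".\" (punctuation) or
\"w\" (word constituent).  Hypothetical helper for illustration."
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?- dash-syntax table)
    (with-syntax-table table
      (and (string-match-p regexp "-") t))))

;; Syntax class: the answer follows the active syntax table.
(my/dash-matches-p "\\s." ".")          ; => t
(my/dash-matches-p "\\s." "w")          ; => nil

;; Regexp class: for the ASCII dash the answer stays the same.
(my/dash-matches-p "[[:punct:]]" ".")   ; => t
(my/dash-matches-p "[[:punct:]]" "w")   ; => t
```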

> ... What I do not like is that in the 
> case of $n$-th the character after second "$" is tested against syntax 
> class, while regexp class is used for links. This subtle difference is 
> almost certainly ignored in alternative implementations of the parser. 
> However I am not sure what characters besides dash and apostrophe are 
> affected and whether it depends on locale.

These kinds of inconsistencies should be solved eventually. We should not
use locale, but UTF syntax classes; and document it in org-syntax
document.

>>> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
>>> Improve regexp heuristics
> [...]
>>>       (link http://example.org/a<b)
> [...]
>> It is heuristics. We cannot be 100% right. So, it is what it is.
>
>  From my point of view it is at least close to a regression. I do not 
> have any argument against http://example.org/a<b>, but the regexp should 
> not match whole "http://example.org/a<b)"

No bug reports, so your point is rather theoretical.

I do not mind improving the regexp, of course, but I am afraid that we
will need PEG or `org-element--parse-paired-brackets' to match paired
brackets accurately. And that kind of change will be breaking - we will
need to trash the regexp variable.

>>> I would consider [:space:] or \s-.
>> 
>> Do you mean "[^[:punct:][:space:]\t\n]"?
>
> I believe it might be an improvement ([:space:] includes \t).

https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6cada29c0



