From: Maxim Nikulin
To: emacs-orgmode@gnu.org
Subject: Re: URLs with brackets not recognised
Date: Thu, 13 May 2021 23:30:56 +0700

On 13/05/2021 03:06, Rudolf Adamkovič wrote:
> Maxim Nikulin writes:
>
>> I do not think it is a bug. Plain text links detection is a kind of
>> heuristics. It will be always possible to win competition with regexp.
>> Consider it as a limitation requiring some hints from an intelligent
>> user.
>
> I disagree.

Me too. I disagree with most of the statements in this thread, even with some arguments meant to support my opinion. The exception is Ihor's message. I hope a more liberal regexp will not interfere with the parsing of other constructs.

Actually, I think you do not realize that detecting URLs in arbitrary text is tricky. Maybe you have not noticed the corner cases before. False positives may be even more annoying. In the past, at least, "smart" detection of smileys and emoji in Skype transformed code snippets into an unreadable mess of "glasses of wine" and other "funny" stuff.

> URLs are well-specified. Per RFC 3986,

It describes an isolated URI, assuming some protocol that makes it possible to determine where the URI string begins and ends.
It is impossible to unambiguously extract URLs from text written in human languages. Tom pointed out that some character sequences in URLs can interfere with Org markup.

> the characters
> allowed in a URL are [A-Za-z0-9\-._~!$&'()*+,;=:@\/?].

1. The surrounding text may use the same characters. I do not think you would be happy if you got

- https://orgmode.org/,
- https://orgmode.org/worg/org-faq.html)

from "(see https://orgmode.org/, https://orgmode.org/worg/org-faq.html)" just because the "," and ")" characters are allowed in URIs. There are only heuristics that work more or less acceptably in common cases. Various implementations have their strong and weak sides.

2. Allowed characters are specified at the protocol level. Fortunately, most Unicode characters are allowed in the user interface. Certainly the following URLs are more portable and reliable:

https://el.wikipedia.org/wiki/%CE%9B%CE%AC%CE%BC%CE%B4%CE%B1
https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC
https://ru.wikipedia.org/wiki/%D0%A1%D1%82%D0%BE%D0%BB%D0%BB%D0%BC%D0%B0%D0%BD,_%D0%A0%D0%B8%D1%87%D0%B0%D1%80%D0%B4_%D0%9C%D1%8D%D1%82%D1%82%D1%8C%D1%8E#%D0%9A%D1%80%D0%B0%D1%82%D0%BA%D0%B0%D1%8F_%D0%B1%D0%B8%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D1%8F

However, the Unicode variants are more informative and readable for humans:

https://el.wikipedia.org/wiki/Λάμδα
https://ja.wikipedia.org/wiki/日本
https://ru.wikipedia.org/wiki/Столлман,_Ричард_Мэттью#Краткая_биография

The same applies to domain names. An extreme case: https://xn--i-7iq.ws/ - https://i❤️.ws/

Even space characters can be used in the query part. Modern applications are able to convert them to "+" or to "%20" for communication with HTTP servers.

> Org mode should
> implement proper URL detection, not asking its users "to give it some
> hints" and using "a kind of heuristics".

Some tools detect www.google.com as a valid URL, others (including Org) do not. Heuristics can evolve over time. The Org renderer on GitHub can differ from the original elisp code. Explicit markup is a way to avoid problems.
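
To illustrate points 1 and 2 above, here is a minimal Python sketch. It assumes a pattern built from the character class quoted from the thread plus an `https?://` prefix. This is not the regexp Org actually uses, and the names `NAIVE_URL` and `naive_urls` are made up for the example.

```python
import re
from urllib.parse import unquote

# Character class quoted in the thread (ASCII only), prefixed with a scheme.
# Illustrative assumption only: this is NOT the regexp Org uses.
NAIVE_URL = re.compile(r"https?://[A-Za-z0-9\-._~!$&'()*+,;=:@/?]+")


def naive_urls(text):
    """Return every substring the naive pattern accepts as a URL."""
    return NAIVE_URL.findall(text)


# Point 1: punctuation of the surrounding sentence is swallowed.
prose = "(see https://orgmode.org/, https://orgmode.org/worg/org-faq.html)"
print(naive_urls(prose))

# Point 2: a readable Unicode URL is cut at the first non-ASCII character,
# although it names the same resource as its percent-encoded form.
readable = "https://el.wikipedia.org/wiki/Λάμδα"
encoded = "https://el.wikipedia.org/wiki/%CE%9B%CE%AC%CE%BC%CE%B4%CE%B1"
print(naive_urls(readable))
print(unquote(encoded) == readable)
```

With this pattern the first call returns both links with the trailing "," and ")" still attached, and the second returns only the part of the URL before the first Greek letter.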

A more complicated regexp is harder to support. (Explaining to users that technologies have limitations is a kind of maintenance cost as well.) A long regexp will have a performance penalty and can still be fooled.

An example of a link that causes problems even with brackets:
https://lists.gnu.org/archive/html/emacs-orgmode/2020-12/msg00706.html
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~(view~'timeSeries~stacked~false~metrics~(~(~'CWAgent~'backup_time~'host~'desktop~'metric_type~'timing))~region~'us-east-1);query=~'*7bCWAgent*2chost*2cmetric_type*7d

On 12/05/2021 23:44, Colin Baxter wrote:
> It might be worthwhile to issue an warning each time a url is written in
> an org file without enclosing brackets < > or [[ ]].

Simple links work well. I am afraid that detecting whether a particular link is a corner case that needs brackets may require more complicated logic than the regexp that detects links.

On 13/05/2021 09:21, Tim Cross wrote:
> As this is defined and documented behaviour,

My impression is that the nuances of plain text link recognition are not documented. Even unit tests exist only in the proposed patch. Actually, I do not think such details are necessary in the manual. Fontification provides feedback. As soon as problems are noticed, explicit markup can be added.

On 13/05/2021 05:23, Tom Gillespie wrote:
> A quick fix is to percent encode the troublesome characters

org-lint does not like percent encoding in links. It is a legacy of the period when an *extra* pass of percent encoding was used to escape square brackets and spaces. The current recommendation is to escape only brackets and backslashes, leaving spaces as is (however, org-fill-paragraph believes that it has full rights to do something with spaces).

Personally, I do not see why adding angle or double square brackets is a problem. When approaching the limits, it is better to stay on the safe side.
The particular case that started this topic can be solved, but more complicated URLs will arise. Just admit that preparing documents requires some collaboration and assistance from users to make intentions more explicit.
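
As a rough illustration of what "heuristics" means here, the sketch below takes a generous match first and then drops trailing characters that more likely belong to the sentence. It is a hypothetical helper written for this message (`CANDIDATE`, `TRAILING` and `guess_urls` are invented names, and example.org is a placeholder domain), not what Org or any particular tool does.

```python
import re

# Hypothetical heuristic, not Org's implementation: match generously,
# then trim the tail that probably belongs to the surrounding prose.
CANDIDATE = re.compile(r"https?://[^\s<>]+")
TRAILING = ".,;:!?'\""


def guess_urls(text):
    """Guess plain-text URLs using a trim-the-tail heuristic."""
    found = []
    for match in CANDIDATE.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in TRAILING:
                url = url[:-1]
            elif last == ")" and url.count(")") > url.count("("):
                # An unbalanced ")" probably closes a parenthesis in the prose.
                url = url[:-1]
            else:
                break
        found.append(url)
    return found


# Works for the common case from point 1.
print(guess_urls("(see https://orgmode.org/, https://orgmode.org/worg/org-faq.html)"))

# Still a guess: if the intended URL really ends with ".", it is cut short.
print(guess_urls("See https://example.org/wiki/Inc. for details"))
```

The first call yields the two links without the "," and ")", while the second silently drops a final "." that may have been part of the address. Only explicit brackets remove the ambiguity.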