From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ihor Radchenko
To: Tom Gillespie
Cc: Max Nikulin, emacs-orgmode@gnu.org, Timothy, Bastien
Subject: Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)
References: <87o86mw86r.fsf@localhost> <87fsrxkahq.fsf@nicolasgoaziou.fr>
 <87fsrxa1j5.fsf@localhost> <878rxoa6lk.fsf@localhost>
 <87tug93b2a.fsf@localhost> <87y25l8wvs.fsf@nicolasgoaziou.fr>
 <87r1bd39ny.fsf@localhost> <8735nsv9qo.fsf@nicolasgoaziou.fr>
 <87mtm09xzf.fsf@localhost> <87zgq02ueq.fsf@nicolasgoaziou.fr>
 <87h7c89rqr.fsf@localhost> <874k86y997.fsf@nicolasgoaziou.fr>
 <87v90lzwkm.fsf@localhost> <874jm2kb7x.fsf@localhost>
Date: Tue, 18 Jul 2023 05:07:19 +0000
Message-ID: <87ttu13j08.fsf@localhost>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: "General discussions about Org-mode."

Tom Gillespie writes:

> The way I have implemented this is by maintaining an explicit list of
> characters that are safe for pre markup and another for post markup.
>
> It is not possible to use unicode punctuation for this because there
> are a variety of punctuation marks that cannot appear in that position
> and be considered markup, those include @, #, % to name just a few.

It is not that bad. The Unicode standard defines the following general
categories (I list only those that might be of use):

Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph

We currently use the following:

PRE  = - ( ' " {
POST = - . ; : ! ? ' " ) } \ [

At least ( and { already have a matching category:

  (get-char-code-property ?{ 'general-category) ;=> Ps (punctuation, open)

We could probably generalize to

PRE  = Zs Zl Pc Pd Ps Pi ' "
POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [

Though we need to take care to exclude zero-width spaces.

I can find https://www.unicode.org/review/pr-23.html, which defines
terminal punctuation like .;:!?
It looks like it has been adopted, via special properties:
https://www.unicode.org/reports/tr44/#STerm and
https://www.unicode.org/reports/tr44/#Terminal_Punctuation

Emacs does not support them though (yet?).

> Therefore, if we want to do this we commit to extending and then
> maintaining the lists of valid pre and post markup delimiters as
> special cases.

We certainly do not want to do this. Maintaining such lists is out of
scope for Org when Unicode can be of use.

> Note also this could produce changes from current behavior because
> things that previously tokenized as a series of words connected by
> e.g. underscores could become markup.

Indeed. And we should study the feedback. However, most scenarios that
will change will involve non-standard Unicode markup characters. The
odds are low that users will put such Unicode characters at a markup
boundary and _also expect markup to be ignored_.

In the end, it is the current ASCII limitation plus a partially
arbitrary choice of boundaries that keeps some users confused (we get
bug reports about confusing markup from time to time).

Of course, we can, as usual, provide a linter to catch such scenarios
and warn in ORG-NEWS.

I do believe that better Unicode support will benefit many Org users
who use non-Latin scripts.

-- 
Ihor Radchenko // yantar92,
Org mode contributor
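
As a rough illustration of the category-based PRE/POST sets proposed
above, here is a minimal sketch in Python. It is not Org code. It
assumes that Python's unicodedata.category returns the same Unicode
general-category values as the get-char-code-property call shown
earlier; all names (PRE_CATEGORIES, is_pre, and so on) are hypothetical.

```python
import unicodedata

# Assumption: these sets mirror the generalization proposed in the message.
# They are illustrative only and are not part of Org or Emacs.
PRE_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Ps", "Pi"}
POST_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Pe", "Pf"}

# ASCII characters kept as explicit extras because their general category
# (mostly Po) is not included in the category sets above.
PRE_EXTRA = set("'\"")
POST_EXTRA = set(".;:!?'\"\\[")


def is_pre(ch: str) -> bool:
    """Return True if CH may precede an opening markup marker."""
    return unicodedata.category(ch) in PRE_CATEGORIES or ch in PRE_EXTRA


def is_post(ch: str) -> bool:
    """Return True if CH may follow a closing markup marker."""
    return unicodedata.category(ch) in POST_CATEGORIES or ch in POST_EXTRA


if __name__ == "__main__":
    # Print the category and both predicates for a few sample characters.
    for ch in "({-@#%_\u00ab\u00bb\u300c\u300d\u200b":
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              is_pre(ch), is_post(ch))
```

Under these sets, @, # and % (category Po) are still rejected on both
sides, while the connector _ (category Pc) is accepted on both sides,
which is the kind of behavior change Tom warns about. The zero-width
space U+200B has category Cf, so this particular sketch does not accept
it.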