From: Maxim Nikulin
To: emacs-orgmode@gnu.org
Subject: Re: Bug: ODT export of Chinese text inserts spaces for line breaks
Date: Wed, 30 Jun 2021 00:01:00 +0700
References: <17a55e0b01d.11be78c6c72761.7557666657037565597@zoho.com> <557d5f5d.2eed.17a56147ed0.Coremail.tumashu@163.com>
In-Reply-To: <557d5f5d.2eed.17a56147ed0.Coremail.tumashu@163.com>
List-Id: "General discussions about Org-mode."

On 29/06/2021 10:47, James Harkins wrote:
> So, it would make sense to add a rule to the exporter: if one of the
> characters before or after a source-text line break is a Chinese,
> Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
> You can try the below config :-)
>     (let ((regexp "[[:multibyte:]]")
>           (string text))
>       (setq string
>             (replace-regexp-in-string
>              (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>              "\\1\\2" string))

Note that [[:multibyte:]] matches characters of almost any non-ASCII
script, e.g. Cyrillic:

    (let ((sample "abc абв def"))
      (and (string-match "[[:multibyte:]]+" sample)
           (match-string 0 sample)))
    => "абв"

So this regexp would also remove the space between two lines of, say,
Russian text.

It seems that `org-fill-paragraph' (M-q) is smart enough to avoid
inserting a space before or after a CJK character, so it is possible to
determine the correct way to join lines, even though e.g. the "Script"
Unicode property is not exposed to Elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(Anyway, maintaining an explicit list of scripts is not a
straightforward approach.)

P.S. JavaScript in browsers allows matching characters that belong to
a particular script:

    "abc абв def".match(/\p{Script=Cyrillic}+/u)
    Array [ "абв" ]

I have not found such a feature in the regular expressions available in
Emacs.
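
For illustration only, here is a minimal JavaScript sketch of the rule quoted at the top of this message, built on the same `\p{Script=...}` classes as in the P.S. It is not how Org or Emacs implements anything. The function name `joinLines` is hypothetical, and treating "CJK" as the Han, Hiragana, Katakana and Hangul scripts is an assumption. CJK punctuation such as "。" has Script=Common and is not covered. It needs a JavaScript engine that supports Unicode property escapes and lookbehind.

```javascript
// Assumption: "CJK" here means these four Unicode scripts.
const CJK = String.raw`[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]`;

// A line break (with optional surrounding spaces) that has a CJK
// character immediately before it or immediately after it.
const cjkBreak = new RegExp(String.raw`(?<=${CJK}) *\n *| *\n *(?=${CJK})`, "gu");

// Hypothetical helper: join the lines of `text` into one line.
function joinLines(text) {
  return text
    .replace(cjkBreak, "")   // next to CJK: drop the break, add no space
    .replace(/ *\n */g, " "); // everywhere else: break becomes one space
}

console.log(joinLines("中文\n测试"));   // 中文测试
console.log(joinLines("абв\nгде"));     // абв где
console.log(joinLines("abc\ndef"));     // abc def
```

Unlike the [[:multibyte:]] approach, Cyrillic lines keep their separating space here, because only the listed scripts suppress it.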