emacs-orgmode@gnu.org archives
* Bug: ODT export of Chinese text inserts spaces for line breaks
@ 2021-06-29  3:47 James Harkins
  2021-06-29  4:43 ` tumashu
  0 siblings, 1 reply; 7+ messages in thread
From: James Harkins @ 2021-06-29  3:47 UTC (permalink / raw)
  To: emacs-orgmode

Consider the following org document.

* Test
1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
要求办理离校手续,领取相关证书后离校;

This was produced by pasting in a single, long line, and then using alt-Q (a normal thing to do, and good for readability, because org-mode doesn't wrap lines by default).

Exporting to ODT produces the following (body text, omitting titles, headers and such).

1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;

Between 证 and 书, and between 关 and 要, there is a space. Chinese typography does not allow for spaces mid-sentence.

So, it would make sense to add a rule to the exporter: if one of the characters before or after a source-text line break is a Chinese, Japanese or Korean character, do not add a space. (The space is valid, of course, if the characters on either side of the line breaks are Roman or [I would guess] Cyrillic as well.)

(Side note: Exporting to a LaTeX buffer shows that the line breaks have been copied into the .tex document as is -- but, provided that you have `\usepackage{xeCJK}` in the preamble, LaTeX produces correct, space-free output. So -- Org "gets away with it" because of LaTeX's handling of CJK text. It seems that for ODT, Org needs to handle the spacing within its own logic.)

This is org 9.1.9... bit old, I know, but I'm gonna take a wild guess that this has not been a high-visibility issue.

hjh



* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2021-06-29  3:47 Bug: ODT export of Chinese text inserts spaces for line breaks James Harkins
@ 2021-06-29  4:43 ` tumashu
  2021-06-29 17:01   ` Bug: " Maxim Nikulin
  0 siblings, 1 reply; 7+ messages in thread
From: tumashu @ 2021-06-29  4:43 UTC (permalink / raw)
  To: James Harkins; +Cc: emacs-orgmode


You can try the below config :-)

(defun eh-org-wash-text (text backend _info)
  "When exporting an Org file, remove unnecessary spaces between Chinese characters."
  (when (or (org-export-derived-backend-p backend 'html)
            (org-export-derived-backend-p backend 'odt))
    (let ((regexp "[[:multibyte:]]")
          (string text))
      ;; By default Org converts a newline into a space, but Chinese
      ;; does not need that space, so delete it.
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
             "\\1\\2" string))
      ;; Remove the space after bold (and similar closing) markup.
      (dolist (str '("</b>" "</code>" "</del>" "</i>"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)\\(%s\\)[ ]+\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      ;; Remove the space before bold (and similar opening) markup.
      (dolist (str '("<b>" "<code>" "<del>" "<i>" "<span class=\"underline\">"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)[ ]+\\(%s\\)\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      string)))

(add-hook 'org-export-filter-headline-functions #'eh-org-wash-text)
(add-hook 'org-export-filter-paragraph-functions #'eh-org-wash-text)
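
A quick way to see what the first substitution does, as a minimal sketch. The helper name `my/demo-wash' and the sample strings are made up for illustration; only the pattern is taken from the config above. Because the pattern is built from [[:multibyte:]], the newline is dropped between any two non-ASCII characters, not only between Chinese ones.

```elisp
;; Hypothetical demo helper: the same pattern as in `eh-org-wash-text',
;; applied outside of any export filter.
(defun my/demo-wash (text)
  "Drop a newline (and surrounding spaces) between two multibyte chars in TEXT."
  (replace-regexp-in-string
   "\\([[:multibyte:]]\\) *\n *\\([[:multibyte:]]\\)" "\\1\\2" text))

(my/demo-wash "学位证\n书")   ; Han on both sides: the newline is removed
(my/demo-wash "абв\nгде")     ; Cyrillic also matches, so the word break is lost
(my/demo-wash "abc\ndef")     ; ASCII on both sides: left untouched
```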

* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2021-06-29  4:43 ` tumashu
@ 2021-06-29 17:01   ` Maxim Nikulin
  2021-06-29 18:19     ` Eric Abrahamsen
  0 siblings, 1 reply; 7+ messages in thread
From: Maxim Nikulin @ 2021-06-29 17:01 UTC (permalink / raw)
  To: emacs-orgmode

On 29/06/2021 10:47, James Harkins wrote:
> So, it would make sense to add a rule to the exporter: if one of the
> characters before or after a source-text line break is a Chinese,
> Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
> You can try the below config :-)
>      (let ((regexp "[[:multibyte:]]")
>            (string text))
>        (setq string
>              (replace-regexp-in-string
>               (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>               "\\1\\2" string))

Note that [[:multibyte:]] matches almost any non-ASCII script, e.g.
Cyrillic:

(let ((sample "abc абв def"))
   (and (string-match "[[:multibyte:]]+" sample)
        (match-string 0 sample)))
"абв"

It seems that `org-fill-paragraph' (M-q) is smart enough to avoid a space
before or after a CJK character, so it is possible to determine the correct
way to splice lines, even though e.g. the "Script" Unicode property is not
exposed to elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(Anyway, maintaining an explicit list of scripts is not a straightforward
approach.)

P.S.
JavaScript in browsers allows filtering characters that belong to a
particular script:

"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]

I have not found such a feature in the regular expressions available in Emacs.




* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2021-06-29 17:01   ` Bug: " Maxim Nikulin
@ 2021-06-29 18:19     ` Eric Abrahamsen
  2021-06-30 12:22       ` Maxim Nikulin
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Abrahamsen @ 2021-06-29 18:19 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-orgmode

Maxim Nikulin <manikulin@gmail.com> writes:

> On 29/06/2021 10:47, James Harkins wrote:
>> So, it would make sense to add a rule to the exporter: if one of the
>> characters before or after a source-text line break is a Chinese,
>> Japanese or Korean character, do not add a space.
>
> On 29/06/2021 11:43, tumashu wrote:
>> You can try the below config :-)
>>      (let ((regexp "[[:multibyte:]]")
>>            (string text))
>>        (setq string
>>              (replace-regexp-in-string
>>               (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>>               "\\1\\2" string))
>
> Notice that [[:multibyte:]] means almost any non-ASCII script, e.g.
> Cyrillic:
>
> (let ((sample "abc абв def"))
>   (and (string-match "[[:multibyte:]]\+" sample)
>        (match-string 0 sample)))
> "абв"
>
> It seems, `org-fill-paragraph' M-q is smart enough to avoid a space
> before or after a CJK character, so it is possible to determine
> correct way to splice lines, despite e.g. "Script" Unicode property is
> not exposed to elisp:
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
> (Anyway maintaining explicit list of scripts is not a straightforward
> approach.)

There are a few ways to approach this:

(aref char-script-table ?中) -> 'han

(string-match-p "\\cc" "中") -> 0

(aref (char-category-set ?中) ?|) -> t
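
Building on the category test, here is a minimal sketch of a line joiner. `my/cjk-join-lines' is a hypothetical name, and using category `|' on both sides of the newline as the criterion is an assumption made for illustration, not what Org does:

```elisp
(defun my/cjk-join-lines (text)
  "Return TEXT with newlines between two `|'-category characters removed.
Hypothetical helper for illustration; not part of Org."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (while (search-forward "\n" nil t)
      ;; Point is now just after the newline.
      (let ((prev (char-before (1- (point))))
            (next (char-after)))
        (when (and prev next
                   (aref (char-category-set prev) ?|)
                   (aref (char-category-set next) ?|))
          ;; Both neighbours allow a line break without a space,
          ;; so drop the newline instead of letting it become one.
          (delete-char -1))))
    (buffer-string)))

(my/cjk-join-lines "学历学位证\n书")  ; newline removed
(my/cjk-join-lines "abc\ndef")        ; newline kept
```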



* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2021-06-29 18:19     ` Eric Abrahamsen
@ 2021-06-30 12:22       ` Maxim Nikulin
  2022-10-08 13:14         ` Ihor Radchenko
  0 siblings, 1 reply; 7+ messages in thread
From: Maxim Nikulin @ 2021-06-30 12:22 UTC (permalink / raw)
  To: emacs-orgmode

On 29/06/2021 10:47, James Harkins wrote:
> * Test
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
> 要求办理离校手续,领取相关证书后离校;

> Exporting to ODT produces the following (body text, omitting titles,
> headers and such).
> 
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;

Confirmed: newlines are copied to the ODT document as is, and they appear as
spaces in LibreOffice. I did not try HTML since I am unsure whether
browsers should glue paragraphs containing newlines into a continuous string
without spaces. Maybe it is necessary to add some attributes for proper
representation (e.g. "lang"); however, "#+LANGUAGE: cn" does not help,
even though LibreOffice considers the paragraph to be Chinese.

On 30/06/2021 01:19, Eric Abrahamsen wrote:
> There are a few ways to approach this:
> 
> (aref char-script-table ?中) -> 'han
> (string-match-p "\\cc" "中") -> 0
> (aref (char-category-set ?中) ?|) -> t

Thank you. I had not noticed all the features hidden behind \c. I believe

     (rx (category can-break))

is more readable, and I am a bit surprised that there are no descriptive
aliases for char categories such as ?|. Just to add another example:

     (category-set-mnemonics (char-category-set ?ф)) -> ".LYchjy"

and `describe-categories' helps to decipher it.

As to splicing lines, I found `fill-delete-newlines', which uses
`fill-nospace-between-words-table' in addition to the ?| category to
determine whether a space should be suppressed while splicing lines. In
addition, there are some variables to tune the behavior.
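
The two lookups can be combined into a small predicate. This is only a sketch: `my/nospace-join-p' is a hypothetical name, and the way the two tests are combined here is my assumption, not a copy of the logic in fill.el.

```elisp
(defun my/nospace-join-p (prev next)
  "Return t if PREV and NEXT probably need no space when spliced.
Hypothetical helper; the combination of tests is an assumption."
  (and (or (aref (char-category-set prev) ?|)
           (aref (char-category-set next) ?|))
       (or (aref fill-nospace-between-words-table prev)
           (aref fill-nospace-between-words-table next))
       t))

(my/nospace-join-p ?证 ?书)  ; Han characters on both sides
(my/nospace-join-p ?a ?b)    ; ASCII letters
```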




* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2021-06-30 12:22       ` Maxim Nikulin
@ 2022-10-08 13:14         ` Ihor Radchenko
  2022-10-21  5:38           ` Ihor Radchenko
  0 siblings, 1 reply; 7+ messages in thread
From: Ihor Radchenko @ 2022-10-08 13:14 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-orgmode


Maxim Nikulin <manikulin@gmail.com> writes:

> On 29/06/2021 10:47, James Harkins wrote:
>> * Test
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
>> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
>> 要求办理离校手续,领取相关证书后离校;
>
>> Exporting to ODT produces the following (body text, omitting titles,
>> headers and such).
>> 
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;
>
> Confirmed: newlines are copied to ODT document as is and they appear as 
> spaces in libreoffice. I did not tried HTML since I am unsure if 
> browsers should glue paragraphs with newlines into continuous string 
> without spaces. Maybe it is necessary to add some attributes for proper 
> representation (e.g. "lang"), however "#+LANGUAGE: cn" does not help 
> even though libreoffice considers paragraph as Chinese.

Treating newlines as spaces is part of the ODT schema.

> As to splicing lines, I found `fill-delete-newlines' that uses 
> `fill-nospace-between-words-table' besides ?| category to determine 
> whether space should be suppressed while splicing lines. In addition 
> there are some variables to tune behavior.

I am attaching the fix that leverages `fill-region' to handle all the
complexities for us. It is the easiest way and I see no reason to look
deeper.
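
The approach can be tried outside Org. A minimal sketch, assuming default fill settings in a temporary buffer; `my/unfill-string' is a hypothetical name and this is not the patch itself. It uses `most-positive-fixnum' rather than a buffer position as `fill-column', on the assumption that double-width characters should never trigger a re-wrap:

```elisp
(defun my/unfill-string (text)
  "Return TEXT unfilled by `fill-region' in a temporary buffer.
Hypothetical helper for experimenting; not part of Org."
  (with-temp-buffer
    (insert text)
    ;; A huge `fill-column' makes `fill-region' join all lines of a
    ;; paragraph; the fill code decides whether a space is needed.
    (let ((fill-column most-positive-fixnum))
      (fill-region (point-min) (point-max)))
    (buffer-string)))

(my/unfill-string "学历学位证\n书")  ; lines joined without a space
(my/unfill-string "abc\ndef")        ; lines joined with a space
```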


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-ox-odt-Fix-newlines-replaced-by-spaces-in-Han-script.patch --]
[-- Type: text/x-patch, Size: 1833 bytes --]

From 614944ba1ac5502c7648747363674b8d45bfaaf7 Mon Sep 17 00:00:00 2001
Message-Id: <614944ba1ac5502c7648747363674b8d45bfaaf7.1665234699.git.yantar92@gmail.com>
From: Ihor Radchenko <yantar92@gmail.com>
Date: Sat, 8 Oct 2022 21:08:47 +0800
Subject: [PATCH] ox-odt: Fix newlines replaced by spaces in Han script

* lisp/ox-odt.el (org-odt-plain-text): Use `fill-region' to unfill the
paragraphs with newlines accounting for scripts without spaces between
words.

Reported-by: James Harkins <jamshark70@zoho.com>
Link: https://orgmode.org/list/sbhnlv$4t1$1@ciao.gmane.io
---
 lisp/ox-odt.el | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index 208a39d9d..c989d2014 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -2903,9 +2903,20 @@ (defun org-odt-plain-text (text info)
 	(setq output
 	      (replace-regexp-in-string (car pair) (cdr pair) output t nil))))
     ;; Handle break preservation if required.
-    (when (plist-get info :preserve-breaks)
-      (setq output (replace-regexp-in-string
-		    "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t)))
+    (if (plist-get info :preserve-breaks)
+        (setq output (replace-regexp-in-string
+		      "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t))
+      ;; OpenDocument schema recognizes newlines as spaces, which may
+      ;; not be desired in scripts that do not separate words with
+      ;; spaces (for example, Han script).  `fill-region' is able to
+      ;; handle such situations.
+      (setq output
+            (with-temp-buffer
+              (insert output)
+              ;; Unfill.
+              (let ((fill-column (point-max)))
+                (fill-region (point-min) (point-max)))
+              (buffer-string))))
     ;; Return value.
     output))
 
-- 
2.35.1




-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>


* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
  2022-10-08 13:14         ` Ihor Radchenko
@ 2022-10-21  5:38           ` Ihor Radchenko
  0 siblings, 0 replies; 7+ messages in thread
From: Ihor Radchenko @ 2022-10-21  5:38 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Maxim Nikulin, emacs-orgmode

Ihor Radchenko <yantar92@gmail.com> writes:

> I am attaching the fix that leverages `fill-region' to handle all the
> complexities for us. It is the easiest way and I see no reason to look
> deeper.

Applied onto main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=3502ce2dbb29b70cdbb978d144322d48cb00f26d


