* Bug: ODT export of Chinese text inserts spaces for line breaks
From: James Harkins @ 2021-06-29  3:47 UTC (permalink / raw)
  To: emacs-orgmode

Consider the following org document.

* Test
1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
要求办理离校手续,领取相关证书后离校;

This was produced by pasting in a single, long line and then filling it with M-q (Alt-Q). That is a normal thing to do, and good for readability, because org-mode doesn't wrap lines by default.

Exporting to ODT produces the following (body text, omitting titles, headers and such).

1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;

Between 证 and 书, and between 关 and 要, there is a space. Chinese typography does not allow for spaces mid-sentence.

So, it would make sense to add a rule to the exporter: if one of the characters before or after a source-text line break is a Chinese, Japanese or Korean character, do not add a space. (The space is valid, of course, if the characters on either side of the line breaks are Roman or [I would guess] Cyrillic as well.)
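
A minimal sketch of the proposed rule, not an actual exporter patch. The names `my/cjk-char-p' and `my/join-lines' are made up for illustration, and the script list (han, kana, hangul) is an assumption about what should count as "Chinese, Japanese or Korean". It is meant for a single paragraph of text.

```elisp
;; Sketch only: `my/cjk-char-p' and `my/join-lines' are hypothetical names.
(defun my/cjk-char-p (char)
  "Return non-nil if CHAR belongs to a script assumed here to be CJK.
`hangul' is listed because the proposal names Korean, although Korean
text does put spaces between words, so it may need to be dropped."
  (memq (aref char-script-table char) '(han kana hangul)))

(defun my/join-lines (text)
  "Join the lines of TEXT, omitting the space next to a CJK character."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (while (search-forward "\n" nil t)
      (let ((prev (char-before (match-beginning 0)))
            (next (char-after)))
        ;; Only touch newlines that sit between two characters.
        (when (and prev next)
          (replace-match
           (if (or (my/cjk-char-p prev) (my/cjk-char-p next)) "" " ")
           t t))))
    (buffer-string)))

(my/join-lines "学位证\n书")  ; => "学位证书"
(my/join-lines "foo\nbar")    ; => "foo bar"
```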

(Side note: Exporting to a LaTeX buffer shows that the line breaks have been copied into the .tex document as is -- but, provided that you have a `\usepackage{xeCJK}` in the preamble, LaTeX produces correct, space-free output. So -- Org "gets away with it" because of LaTeX's handling of CJK text. It seems that for ODT, Org needs to handle the spacing within its own logic.)

This is org 9.1.9... bit old, I know, but I'm gonna take a wild guess that this has not been a high-visibility issue.

hjh



* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
From: tumashu @ 2021-06-29  4:43 UTC (permalink / raw)
  To: James Harkins; +Cc: emacs-orgmode

You can try the config below :-)

(defun eh-org-wash-text (text backend _info)
  "When exporting an Org file, remove unnecessary spaces between Chinese characters."
  (when (or (org-export-derived-backend-p backend 'html)
            (org-export-derived-backend-p backend 'odt))
    (let ((regexp "[[:multibyte:]]")
          (string text))
      ;; By default Org turns a single newline into a space, but Chinese
      ;; text does not need that space, so remove it.
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
             "\\1\\2" string))
      ;; Remove spaces after bold text (and other closing inline tags).
      (dolist (str '("</b>" "</code>" "</del>" "</i>"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)\\(%s\\)[ ]+\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      ;; Remove spaces before bold text (and other opening inline tags).
      (dolist (str '("<b>" "<code>" "<del>" "<i>" "<span class=\"underline\">"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)[ ]+\\(%s\\)\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      string)))

(add-hook 'org-export-filter-headline-functions #'eh-org-wash-text)
(add-hook 'org-export-filter-paragraph-functions #'eh-org-wash-text)


* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
From: Maxim Nikulin @ 2021-06-29 17:01 UTC (permalink / raw)
  To: emacs-orgmode

On 29/06/2021 10:47, James Harkins wrote:
> So, it would make sense to add a rule to the exporter: if one of the
> characters before or after a source-text line break is a Chinese,
> Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
> You can try the below config :-)
>      (let ((regexp "[[:multibyte:]]")
>            (string text))
>        (setq string
>              (replace-regexp-in-string
>               (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>               "\\1\\2" string))

Note that [[:multibyte:]] matches almost any non-ASCII script, e.g.
Cyrillic:

(let ((sample "abc абв def"))
   (and (string-match "[[:multibyte:]]+" sample)
        (match-string 0 sample)))
;; => "абв"
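
To make the consequence concrete, here is the newline-removal step of the filter isolated as a small function (`my/splice' is a hypothetical name, the regexp is the one quoted above). With [[:multibyte:]] it also glues together text in scripts that do need the space:

```elisp
;; The newline-removal step from the quoted filter, isolated for testing.
;; `my/splice' is a hypothetical helper name.
(defun my/splice (text)
  "Remove newlines in TEXT that sit between two multibyte characters."
  (let ((regexp "[[:multibyte:]]"))
    (replace-regexp-in-string
     (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
     "\\1\\2" text)))

(my/splice "学位证\n书")  ; => "学位证书" (intended)
(my/splice "абв\nгде")    ; => "абвгде" (space lost between Cyrillic words)
(my/splice "abc\ndef")    ; => "abc\ndef" (ASCII is left alone)
```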

It seems that `org-fill-paragraph' (M-q) is smart enough to avoid a space
before or after a CJK character, so it is possible to determine the
correct way to splice lines, even though e.g. the "Script" Unicode
property is not exposed to elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(Anyway, maintaining an explicit list of scripts is not a straightforward
approach.)

P.S.
JavaScript in browsers allows filtering characters that belong to a
particular script:

"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]

I have not found such a feature in the regular expressions available in Emacs.




* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
From: Eric Abrahamsen @ 2021-06-29 18:19 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-orgmode

Maxim Nikulin <manikulin@gmail.com> writes:

> On 29/06/2021 10:47, James Harkins wrote:
>> So, it would make sense to add a rule to the exporter: if one of the
>> characters before or after a source-text line break is a Chinese,
>> Japanese or Korean character, do not add a space.
>
> On 29/06/2021 11:43, tumashu wrote:
>> You can try the below config :-)
>>      (let ((regexp "[[:multibyte:]]")
>>            (string text))
>>        (setq string
>>              (replace-regexp-in-string
>>               (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>>               "\\1\\2" string))
>
> Notice that [[:multibyte:]] means almost any non-ASCII script, e.g.
> Cyrillic:
>
> (let ((sample "abc абв def"))
>   (and (string-match "[[:multibyte:]]\+" sample)
>        (match-string 0 sample)))
> "абв"
>
> It seems, `org-fill-paragraph' M-q is smart enough to avoid a space
> before or after a CJK character, so it is possible to determine
> correct way to splice lines, despite e.g. "Script" Unicode property is
> not exposed to elisp:
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
> (Anyway maintaining explicit list of scripts is not a straightforward
> approach.)

There are a few ways to approach this:

(aref char-script-table ?中) -> 'han

(string-match-p "\\cc" "中") -> 0

(aref (char-category-set ?中) ?|) -> t
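
As a rough sketch of how the `|' category could replace [[:multibyte:]] in an export filter: the function names below are hypothetical, and restricting the filter to ODT-derived backends is an assumption. Note that the regexp consumes the character on each side of the newline, so a line consisting of a single character would only be joined on one side.

```elisp
(require 'ox)

;; Hypothetical helper: join lines only between two `|'-category characters.
(defun my/join-breakable-lines (text)
  "Remove newlines in TEXT between two characters of category `|'."
  (replace-regexp-in-string "\\(\\c|\\) *\n *\\(\\c|\\)" "\\1\\2" text))

;; Hypothetical paragraph filter wrapping the helper above.
;; Returning nil for other backends leaves the text untouched.
(defun my/org-export-join-breakable-lines (text backend _info)
  "Apply `my/join-breakable-lines' to TEXT for ODT-derived BACKEND."
  (when (org-export-derived-backend-p backend 'odt)
    (my/join-breakable-lines text)))

(add-to-list 'org-export-filter-paragraph-functions
             #'my/org-export-join-breakable-lines)

(my/join-breakable-lines "学位证\n书")  ; => "学位证书"
(my/join-breakable-lines "абв\nгде")    ; => "абв\nгде" (unchanged)
```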



* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
From: Maxim Nikulin @ 2021-06-30 12:22 UTC (permalink / raw)
  To: emacs-orgmode

On 29/06/2021 10:47, James Harkins wrote:
> * Test
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
> 要求办理离校手续,领取相关证书后离校;

> Exporting to ODT produces the following (body text, omitting titles,
> headers and such).
> 
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;

Confirmed: newlines are copied to the ODT document as is, and they appear as
spaces in LibreOffice. I did not try HTML since I am unsure whether
browsers should join a paragraph containing newlines into a continuous
string without spaces. Maybe some attributes (e.g. "lang") need to be
added for proper rendering; however, "#+LANGUAGE: cn" does not help,
even though LibreOffice treats the paragraph as Chinese.

On 30/06/2021 01:19, Eric Abrahamsen wrote:
> There are a few ways to approach this:
> 
> (aref char-script-table ?中) -> 'han
> (string-match-p "\\cc" "中") -> 0
> (aref (char-category-set ?中) ?|) -> t

Thank you. I had not noticed all the features hidden behind \c. I believe

     (rx (category can-break))

is more readable, and I am a bit surprised that there are no descriptive
aliases for char categories such as ?|. Just to add another example:

     (category-set-mnemonics (char-category-set ?ф)) -> ".LYchjy"

and `describe-categories' to decipher it.
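
A few forms one might evaluate to compare the spellings. The mnemonic string for ф contains `c' but not `|', so "\\cc" would also match Cyrillic while the can-break category does not. `category-docstring' is shown as a non-interactive way to look up one mnemonic; its output is omitted here.

```elisp
(require 'rx)

;; The rx form expands to the same regexp as the \c| spelling.
(rx (category can-break))                        ; => "\\c|"

;; Han characters are in the `|' (can-break) category ...
(string-match-p (rx (category can-break)) "中")  ; => 0

;; ... while Cyrillic is not, although it is in category `c'.
(string-match-p "\\c|" "ф")                      ; => nil
(string-match-p "\\cc" "ф")                      ; => 0
(aref (char-category-set ?ф) ?|)                 ; => nil

;; Description of a single mnemonic, without opening a help buffer.
(category-docstring ?|)
```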

As to splicing lines, I found `fill-delete-newlines', which uses
`fill-nospace-between-words-table' in addition to the ?| category to
determine whether the space should be suppressed when splicing lines.
There are also some variables to tune the behavior.
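
A loose sketch of that kind of check, written as a standalone predicate. `my/newline-droppable-p' is a hypothetical name, and the exact combination of the two conditions is an assumption; the real `fill-delete-newlines' also honours the tuning variables, which are ignored here.

```elisp
;; Hypothetical predicate: may a newline between PREV and NEXT be dropped?
;; Assumed rule: at least one neighbour is in the `|' category and at
;; least one neighbour is set in `fill-nospace-between-words-table'.
(defun my/newline-droppable-p (prev next)
  "Return non-nil if no space is needed between chars PREV and NEXT."
  (and prev next
       (or (aref (char-category-set prev) ?|)
           (aref (char-category-set next) ?|))
       (or (aref fill-nospace-between-words-table prev)
           (aref fill-nospace-between-words-table next))))

(my/newline-droppable-p ?证 ?书)  ; non-nil: both sides are Han characters
(my/newline-droppable-p ?c ?d)    ; => nil: Latin letters keep the space
```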


