emacs-orgmode@gnu.org archives
* Form feed characters break odt export
@ 2024-12-21  1:48 Joseph Turner via General discussions about Org-mode.
  2024-12-21  3:56 ` Max Nikulin
  2024-12-23 17:32 ` Ihor Radchenko
  0 siblings, 2 replies; 17+ messages in thread
From: Joseph Turner via General discussions about Org-mode. @ 2024-12-21  1:48 UTC
  To: Org Mode Mailing List; +Cc: Bohong Huang

Tested on
GNU Emacs 29.4 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.41, cairo version 1.18.0)
Org mode version 9.7.6 (9.7.6-7a4527 @ /home/joseph/.emacs.d/elpa/org-9.7.6/)

I can export the following Org content to a .odt file, but the exported
file cannot be opened ("Read Error. Format error discovered in the file
in sub-document content.xml at 368,2(row,col).")

--8<---------------cut here---------------start------------->8---
#+TITLE: Foo
* Bar
Baz
\f
--8<---------------cut here---------------end--------------->8---

First reported by bohonghuang:
https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871

Thanks!

Joseph



* Re: Form feed characters break odt export
  2024-12-21  1:48 Form feed characters break odt export Joseph Turner via General discussions about Org-mode.
@ 2024-12-21  3:56 ` Max Nikulin
  2024-12-21  6:52   ` Joseph Turner
  2024-12-24 16:23   ` Max Nikulin
  2024-12-23 17:32 ` Ihor Radchenko
  1 sibling, 2 replies; 17+ messages in thread
From: Max Nikulin @ 2024-12-21  3:56 UTC
  To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 08:48, Joseph Turner wrote:
> 
> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
[...]
> First reported by bohonghuang:
> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871

In this specific context a workaround should be

#+begin_comment
^L
#+end_comment

Or a commented out empty local variables block above.

I have already written that I do not like non-printable characters in Org
files.

I admit that special characters either should cause `org-lint' warnings 
or should be filtered out by exporters.
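
For illustration only, a check along these lines could scan a buffer for
control characters other than tab, newline and carriage return. This is a
sketch, not an existing `org-lint' checker: `my/org-find-control-chars' is
a hypothetical name, and the set of characters reported is an assumption.

```elisp
;; Hypothetical helper for illustration; not part of Org or org-lint.
(defun my/org-find-control-chars ()
  "Return a list of (POSITION . CHAR) for control characters in the buffer.
Tab, newline and carriage return are not reported."
  (save-excursion
    (goto-char (point-min))
    ;; Code points 1-8, 11, 12 (form feed) and 14-31.
    (let ((re (rx (any (1 . 8) 11 12 (14 . 31))))
          found)
      (while (re-search-forward re nil t)
        (push (cons (match-beginning 0) (char-before)) found))
      (nreverse found))))
```

In a buffer containing "Baz", a newline, a form feed and another newline,
this returns ((5 . 12)), i.e. character 12 (form feed) at position 5.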

As for ^L specifically, there was a request to treat it as a page break in
all exporters (I would prefer some entity or macro instead, so as not to
deviate from plain text markup).

Marvin Gülker. Feature request: export form feed as page break. Sat, 21 
Oct 2023 09:42:33 +0200.
<https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>

I have not had a close look at another proposed feature, but I suspect 
that it might make filtering special characters more tricky. (I would be 
happy to hear that I am wrong.)

Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed width 
elements. Wed, 05 Apr 2023 07:03:43 -0500.
<https://list.orgmode.org/874jpuijpc.fsf@gmail.com>



* Re: Form feed characters break odt export
  2024-12-21  3:56 ` Max Nikulin
@ 2024-12-21  6:52   ` Joseph Turner
  2024-12-21  7:23     ` Max Nikulin
  2024-12-24 16:23   ` Max Nikulin
  1 sibling, 1 reply; 17+ messages in thread
From: Joseph Turner @ 2024-12-21  6:52 UTC
  To: emacs-orgmode; +Cc: Bohong Huang, Max Nikulin

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 08:48, Joseph Turner wrote:
>> I can export the following Org content to a .odt file, but the
>> exported
>> file cannot be opened ("Read Error. Format error discovered in the file
>> in sub-document content.xml at 368,2(row,col).")
> [...]
>> First reported by bohonghuang:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>
> In this specific context a workaround should be
>
> #+begin_comment
> ^L
> #+end_comment

Thank you!  Or even simpler:

# ^L

> Or a commented out empty local variables block above.
>
> I have already written that I do not like non-printable characters in
> Org files.

I agree that they make Org files less portable outside Emacs, and they
complicate org-export.

> I admit that special characters either should cause `org-lint'
> warnings or should be filtered out by exporters.
>
> As for ^L specifically, there was a request to treat it as a page break in
> all exporters (I would prefer some entity or macro instead, so as not to
> deviate from plain text markup).
>
> Marvin Gülker. Feature request: export form feed as page break. Sat,
> 21 Oct 2023 09:42:33 +0200.
> <https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>
>
> I have not had a close look at another proposed feature, but I suspect
> that it might make filtering special characters more tricky. (I would
> be happy to hear that I am wrong.)

Yes.  Without digging into it, my gut feeling is also that handling one
non-printable character specially would open Pandora's box.

> Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed
> width elements. Wed, 05 Apr 2023 07:03:43 -0500.
> <https://list.orgmode.org/874jpuijpc.fsf@gmail.com>

Gratefully,

Joseph



* Re: Form feed characters break odt export
  2024-12-21  6:52   ` Joseph Turner
@ 2024-12-21  7:23     ` Max Nikulin
  2024-12-21 19:06       ` Joseph Turner
  0 siblings, 1 reply; 17+ messages in thread
From: Max Nikulin @ 2024-12-21  7:23 UTC
  To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 13:52, Joseph Turner wrote:
> Max Nikulin writes:
>>
>> #+begin_comment
>> ^L
>> #+end_comment

> Thank you!  Or even simpler:
> 
> # ^L

That was the first thing I tried, but Emacs 28.2 then asks whether the
Local Variables should be applied.

You may ask the Emacs developers for a *plain text* spell that stops the
processing of local variables (or makes Emacs take the *last* block found).

Note that the commit diff looks confusing.



* Re: Form feed characters break odt export
  2024-12-21  7:23     ` Max Nikulin
@ 2024-12-21 19:06       ` Joseph Turner
  0 siblings, 0 replies; 17+ messages in thread
From: Joseph Turner @ 2024-12-21 19:06 UTC
  To: emacs-orgmode; +Cc: Bohong Huang, Max Nikulin

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 13:52, Joseph Turner wrote:
>> Max Nikulin writes:
>>>
>>> #+begin_comment
>>> ^L
>>> #+end_comment
>
>> Thank you!  Or even simpler:
>> # ^L
>
> That was the first thing I tried, but Emacs 28.2 then asks whether the
> Local Variables should be applied.

Oops!  You're right.  The form feed needs to be at the beginning of the line.
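
As far as I understand the rule documented in (emacs)Specifying File
Variables, the local variables list has to start within the last 3000
characters of the file and on its last page, where a page begins at a form
feed at the beginning of a line. A rough sketch of that search boundary
follows; `my/local-variables-search-start' is a made-up name for
illustration, not a function Emacs provides.

```elisp
;; Hypothetical helper; it only mimics the documented rule and is not
;; an Emacs API.
(defun my/local-variables-search-start ()
  "Return the position after which a \"Local Variables:\" line would count.
This is the last form feed at the beginning of a line within the final
3000 characters, or the start of that 3000-character window."
  (save-excursion
    (goto-char (point-max))
    ;; \"\\n\\f\" only matches a form feed in column 0.  With \\='move,
    ;; point ends up at the search bound when nothing is found.
    (search-backward "\n\f" (max (- (point-max) 3000) (point-min)) 'move)
    (point)))
```

With "# ^L" the form feed is not in column 0, so the search falls through
to the start of the window and any earlier example block is still seen.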

> You may ask the Emacs developers for a *plain text* spell that stops the
> processing of local variables (or makes Emacs take the *last* block found).

Good idea:

https://yhetil.org/emacs-devel/87ttawewpx.fsf@breatheoutbreathe.in/T/#u

> Note that the commit diff looks confusing.

Joseph



* Re: Form feed characters break odt export
  2024-12-21  1:48 Form feed characters break odt export Joseph Turner via General discussions about Org-mode.
  2024-12-21  3:56 ` Max Nikulin
@ 2024-12-23 17:32 ` Ihor Radchenko
  2024-12-24 11:04   ` Christian Moe
  1 sibling, 1 reply; 17+ messages in thread
From: Ihor Radchenko @ 2024-12-23 17:32 UTC
  To: Joseph Turner; +Cc: Org Mode Mailing List, Bohong Huang

Joseph Turner via "General discussions about Org-mode."
<emacs-orgmode@gnu.org> writes:

> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
>
> --8<---------------cut here---------------start------------->8---
> #+TITLE: Foo
> * Bar
> Baz
> \f
> --8<---------------cut here---------------end--------------->8---

Looks like ^L is not allowed in ODT files.
However, I see no such information on
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html

May somebody check if there is an official list of unsupported
characters in ODT? Or maybe it is simply a bug in LibreOffice?

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Form feed characters break odt export
  2024-12-23 17:32 ` Ihor Radchenko
@ 2024-12-24 11:04   ` Christian Moe
  2024-12-24 14:14     ` Ihor Radchenko
  2024-12-24 14:25     ` Max Nikulin
  0 siblings, 2 replies; 17+ messages in thread
From: Christian Moe @ 2024-12-24 11:04 UTC
  To: Ihor Radchenko; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang

(re-sending to include the list, apologies, recent mu4e ui changes keep
tripping me up)

Ihor Radchenko <yantar92@posteo.net> writes:

> Joseph Turner via "General discussions about Org-mode."
> <emacs-orgmode@gnu.org> writes:
>
>> I can export the following Org content to a .odt file, but the exported
>> file cannot be opened ("Read Error. Format error discovered in the file
>> in sub-document content.xml at 368,2(row,col).")
>>
>> --8<---------------cut here---------------start------------->8---
>> #+TITLE: Foo
>> * Bar
>> Baz
>> \f
>> --8<---------------cut here---------------end--------------->8---
>
> Looks like ^L is not allowed in ODT files.
> However, I see no such information on
> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
>
> May somebody check if there is an official list of unsupported
> characters in ODT? Or maybe it is simply a bug in LibreOffice?

I don't think it's specific to ODT or LibreOffice, it's the underlying
XML 1.0 spec that "discourages" control characters and does not include
#xC in the range of characters that XML processors must accept.

Spec: https://www.w3.org/TR/REC-xml/#charsets

Some discussion:
https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
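
To make the range concrete, the Char production in that section of the
spec can be written as a small predicate. A sketch only: `my/xml10-char-p'
is a hypothetical name, not something provided by Org or Emacs.

```elisp
;; Hypothetical helper for illustration; not part of Org or Emacs.
(defun my/xml10-char-p (ch)
  "Return non-nil if character CH is allowed by the XML 1.0 Char production.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
         | [#x10000-#x10FFFF]"
  (or (memq ch '(#x9 #xA #xD))
      (<= #x20 ch #xD7FF)
      (<= #xE000 ch #xFFFD)
      (<= #x10000 ch #x10FFFF)))

(my/xml10-char-p ?\f)  ; form feed, #xC: nil
(my/xml10-char-p ?\t)  ; tab, #x9: non-nil
```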

Yours,
Christian



* Re: Form feed characters break odt export
  2024-12-24 11:04   ` Christian Moe
@ 2024-12-24 14:14     ` Ihor Radchenko
  2024-12-25 10:10       ` Joseph Turner
  2024-12-24 14:25     ` Max Nikulin
  1 sibling, 1 reply; 17+ messages in thread
From: Ihor Radchenko @ 2024-12-24 14:14 UTC
  To: Christian Moe; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang

[-- Attachment #1: Type: text/plain, Size: 520 bytes --]

Christian Moe <mail@christianmoe.com> writes:

> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
>
> Spec: https://www.w3.org/TR/REC-xml/#charsets
>
> Some discussion:
> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0

Thanks!
Then, we can simply remove the disallowed characters.
See the attached tentative patch.


[-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --]
[-- Type: text/x-patch, Size: 3707 bytes --]

From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-forbidden-char-re):
(org-odt-discouraged-char-re): New constants codifying characters that
are prohibited in XML spec.
(org-odt--remove-forbidden): New function removing the prohibited
characters.
(org-odt--encode-plain-text): Remove the prohibited characters.
(org-odt-plain-text): Update comment.

Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..61c8d4ec75 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+             (?\N{U+20} . ?\N{U+D7FF})
+             (?\N{U+E000} . ?\N{U+FFFD})
+             (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
+(defconst org-odt-discouraged-char-re
+  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
+	  (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
+          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
+	  (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
+          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
+          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
+	  (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
+          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
+          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
+	  (?\N{U+10FFFE} . ?\N{U+10FFFF})))
+  "Regexp matching discouraged XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text)
+  "Remove forbidden and discouraged characters from TEXT.
+https://www.w3.org/TR/REC-xml/#charsets"
+  (replace-regexp-in-string
+   org-odt-forbidden-char-re ""
+   (replace-regexp-in-string
+    org-odt-discouraged-char-re ""
+    text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-  (if no-whitespace-filling text
-    (org-odt--encode-tabs-and-spaces text)))
+  (org-odt--remove-forbidden
+   (if no-whitespace-filling text
+     (org-odt--encode-tabs-and-spaces text))))
 
 (defun org-odt-plain-text (text info)
   "Transcode a TEXT string from Org to ODT.
 TEXT is the string to transcode.  INFO is a plist holding
 contextual information."
   (let ((output text))
-    ;; Protect &, < and >.
+    ;; Protect &, < and >, and remove forbidden characters.
     (setq output (org-odt--encode-plain-text output t))
     ;; Handle smart quotes.  Be sure to provide original string since
     ;; OUTPUT may have been modified.
-- 
2.47.1


[-- Attachment #3: Type: text/plain, Size: 223 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>


* Re: Form feed characters break odt export
  2024-12-24 11:04   ` Christian Moe
  2024-12-24 14:14     ` Ihor Radchenko
@ 2024-12-24 14:25     ` Max Nikulin
  2024-12-24 14:30       ` Ihor Radchenko
  1 sibling, 1 reply; 17+ messages in thread
From: Max Nikulin @ 2024-12-24 14:25 UTC
  To: emacs-orgmode

On 24/12/2024 18:04, Christian Moe wrote:
> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.

Pandoc retains "^L" in export to markdown, but replaces the line with a
space in .odt. I am curious whether that is a dedicated output filter or
just a feature of the XML writer.




* Re: Form feed characters break odt export
  2024-12-24 14:25     ` Max Nikulin
@ 2024-12-24 14:30       ` Ihor Radchenko
  0 siblings, 0 replies; 17+ messages in thread
From: Ihor Radchenko @ 2024-12-24 14:30 UTC
  To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

> On 24/12/2024 18:04, Christian Moe wrote:
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>
> Pandoc retains "^L" in export to markdown, but replaces the line with a
> space in .odt. I am curious whether that is a dedicated output filter or
> just a feature of the XML writer.

What about other control characters? Does pandoc also replace them with a space?

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Form feed characters break odt export
  2024-12-21  3:56 ` Max Nikulin
  2024-12-21  6:52   ` Joseph Turner
@ 2024-12-24 16:23   ` Max Nikulin
  2024-12-25 10:16     ` Joseph Turner
  1 sibling, 1 reply; 17+ messages in thread
From: Max Nikulin @ 2024-12-24 16:23 UTC
  To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 10:56, Max Nikulin wrote:
> On 21/12/2024 08:48, Joseph Turner wrote:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
> 
> In this specific context a workaround should be
> 
> #+begin_comment
> ^L
> #+end_comment

To avoid confusing other contributors, it should be more verbose:

#+begin_comment
   Keep this block at the bottom of the file.
   It instructs Emacs to ignore examples
   of local variables sections above, see
   <info:emacs#Specifying File Variables>
   The following line contains the form feed 0x0c character.
^L
#+end_comment



* Re: Form feed characters break odt export
  2024-12-24 14:14     ` Ihor Radchenko
@ 2024-12-25 10:10       ` Joseph Turner
  2024-12-27 10:21         ` Ihor Radchenko
  0 siblings, 1 reply; 17+ messages in thread
From: Joseph Turner @ 2024-12-25 10:10 UTC
  To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Ihor Radchenko <yantar92@posteo.net> writes:

> Christian Moe <mail@christianmoe.com> writes:
>
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>>
>> Spec: https://www.w3.org/TR/REC-xml/#charsets
>>
>> Some discussion:
>> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
>
> Thanks!
> Then, we can simply remove the disallowed characters.
> See the attached tentative patch.
>
> From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
> Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Tue, 24 Dec 2024 15:11:22 +0100
> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> * lisp/ox-odt.el (org-odt-forbidden-char-re):
> (org-odt-discouraged-char-re): New constants codifying characters that
> are prohibited in XML spec.
> (org-odt--remove-forbidden): New function removing the prohibited
> characters.
> (org-odt--encode-plain-text): Remove the prohibited characters.
> (org-odt-plain-text): Update comment.
>
> Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
> ---
>  lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
>  1 file changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
> index ec81637ef0..61c8d4ec75 100644
> --- a/lisp/ox-odt.el
> +++ b/lisp/ox-odt.el
> @@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
>      ("\\.\\.\\." . "&#x2026;"))		; hellip
>    "Regular expressions for special string conversion.")
>
> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +             (?\N{U+20} . ?\N{U+D7FF})
> +             (?\N{U+E000} . ?\N{U+FFFD})
> +             (?\N{U+10000} . ?\N{U+10FFFF}))))
> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
> +(defconst org-odt-discouraged-char-re
> +  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
> +	  (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
> +          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
> +	  (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
> +          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
> +          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
> +	  (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
> +          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
> +          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
> +	  (?\N{U+10FFFE} . ?\N{U+10FFFF})))
> +  "Regexp matching discouraged XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
>         (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>
> +(defun org-odt--remove-forbidden (text)
> +  "Remove forbidden and discouraged characters from TEXT.
> +https://www.w3.org/TR/REC-xml/#charsets"
> +  (replace-regexp-in-string
> +   org-odt-forbidden-char-re ""
> +   (replace-regexp-in-string
> +    org-odt-discouraged-char-re ""
> +    text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> -  (if no-whitespace-filling text
> -    (org-odt--encode-tabs-and-spaces text)))
> +  (org-odt--remove-forbidden
> +   (if no-whitespace-filling text
> +     (org-odt--encode-tabs-and-spaces text))))
>
>  (defun org-odt-plain-text (text info)
>    "Transcode a TEXT string from Org to ODT.
>  TEXT is the string to transcode.  INFO is a plist holding
>  contextual information."
>    (let ((output text))
> -    ;; Protect &, < and >.
> +    ;; Protect &, < and >, and remove forbidden characters.
>      (setq output (org-odt--encode-plain-text output t))
>      ;; Handle smart quotes.  Be sure to provide original string since
>      ;; OUTPUT may have been modified.
> --
> 2.47.1

Thanks, Ihor!  Tested working on my machine.

Here's another potential solution to consider, which adds a defcustom to
let the user decide how to handle forbidden characters:

https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780

Joseph



* Re: Form feed characters break odt export
  2024-12-24 16:23   ` Max Nikulin
@ 2024-12-25 10:16     ` Joseph Turner
  0 siblings, 0 replies; 17+ messages in thread
From: Joseph Turner @ 2024-12-25 10:16 UTC
  To: emacs-orgmode; +Cc: Bohong Huang

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 10:56, Max Nikulin wrote:
>> On 21/12/2024 08:48, Joseph Turner wrote:
>>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>> In this specific context a workaround should be
>> #+begin_comment
>> ^L
>> #+end_comment
>
> To avoid confusing other contributors, it should be more verbose:
>
> #+begin_comment
>   Keep this block at the bottom of the file.
>   It instructs Emacs to ignore examples
>   of local variables sections above, see
>   <info:emacs#Specifying File Variables>
>   The following line contains the form feed 0x0c character.
> ^L
> #+end_comment

Thank you, Max.  Submitted PR:

https://github.com/bohonghuang/org-srs/pull/12

Joseph



* Re: Form feed characters break odt export
  2024-12-25 10:10       ` Joseph Turner
@ 2024-12-27 10:21         ` Ihor Radchenko
  2024-12-27 20:42           ` Joseph Turner
  0 siblings, 1 reply; 17+ messages in thread
From: Ihor Radchenko @ 2024-12-27 10:21 UTC
  To: Joseph Turner; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

[-- Attachment #1: Type: text/plain, Size: 443 bytes --]

Joseph Turner <joseph@breatheoutbreathe.in> writes:

> Thanks, Ihor!  Tested working on my machine.
>
> Here's another potential solution to consider, which adds a defcustom to
> let the user decide how to handle forbidden characters:
>
> https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780

Good idea!
I went even further and used a proper export setting.
See the attached 2nd version of the fix.


[-- Attachment #2: v2-0001-ox-odt-Avoid-putting-forbidden-characters-into-OD.patch --]
[-- Type: text/x-patch, Size: 4293 bytes --]

From de015e4a3b98bc975c2dcd1dfce7adcf77eb537c Mon Sep 17 00:00:00 2001
Message-ID: <de015e4a3b98bc975c2dcd1dfce7adcf77eb537c.1735294805.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH v2] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
control how to handle forbidden XML characters.
(org-odt--remove-forbidden): New filter removing/replacing forbidden
characters.

Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..635bf38971 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -94,7 +94,8 @@ (org-export-define-backend 'odt
 		    . (org-odt--translate-latex-fragments
 		       org-odt--translate-description-lists
 		       org-odt--translate-list-tables
-		       org-odt--translate-image-links)))
+		       org-odt--translate-image-links))
+                   (:filter-final-output . org-odt--remove-forbidden))
   :menu-entry
   '(?o "Export to ODT"
        ((?o "As ODT file" org-odt-export-to-odt)
@@ -108,6 +109,7 @@ (org-export-define-backend 'odt
     (:keywords "KEYWORDS" nil nil space)
     (:subtitle "SUBTITLE" nil nil parse)
     ;; Other variables.
+    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
     (:odt-content-template-file nil nil org-odt-content-template-file)
     (:odt-display-outline-level nil nil org-odt-display-outline-level)
     (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
@@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+             (?\N{U+20} . ?\N{U+D7FF})
+             (?\N{U+E000} . ?\N{U+FFFD})
+             (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -364,6 +374,19 @@ (defgroup org-export-odt nil
   :tag "Org Export ODT"
   :group 'org-export)
 
+(defcustom org-odt-with-forbidden-chars ""
+  "String to replace forbidden XML characters.
+When set to t, forbidden characters are retained.
+When set to nil, an error is thrown.
+See `org-odt-forbidden-char-re' for the list of forbidden characters
+that cannot occur inside ODT documents.
+
+You may also consider export filters to perform more fine-grained
+replacements.  See info node `(org)Advanced Export Configuration'."
+  :package-version '(Org . "9.8")
+  :type '(choice (const :tag "Strip forbidden characters" t)
+                 (const :tag "Err when forbidden characters encountered" nil)
+                 (string :tag "Replacement string")))
 
 ;;;; Debugging
 
@@ -2892,6 +2915,24 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text _backend info)
+  "Remove forbidden and discouraged characters from TEXT.
+INFO is the communication plist"
+  (pcase (plist-get info :odt-with-forbidden-chars)
+    ((and (pred stringp) rep)
+     (prog1 (replace-regexp-in-string org-odt-forbidden-char-re rep text)
+       (when (match-string 0 text)
+         (display-warning
+          '(ox-odt ox-odt-with-forbidden-chars)
+          (format "Replacing forbidden character '%s' with '%s'"
+                  (match-string 0 text) rep)))))
+    (`nil
+     (if (string-match org-odt-forbidden-char-re text)
+         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
+                (match-string 0 text))
+       text))
+    (_ text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-- 
2.47.1


[-- Attachment #3: Type: text/plain, Size: 223 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>


* Re: Form feed characters break odt export
  2024-12-27 10:21         ` Ihor Radchenko
@ 2024-12-27 20:42           ` Joseph Turner
  2024-12-28  8:32             ` Ihor Radchenko
  0 siblings, 1 reply; 17+ messages in thread
From: Joseph Turner @ 2024-12-27 20:42 UTC
  To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

[-- Attachment #1: Type: text/plain, Size: 5065 bytes --]

Ihor Radchenko <yantar92@posteo.net> writes:

[...]

> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +             (?\N{U+20} . ?\N{U+D7FF})
> +             (?\N{U+E000} . ?\N{U+FFFD})
> +             (?\N{U+10000} . ?\N{U+10FFFF}))))

Indentation mismatch ^

> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -364,6 +374,19 @@ (defgroup org-export-odt nil
>    :tag "Org Export ODT"
>    :group 'org-export)
>
> +(defcustom org-odt-with-forbidden-chars ""
> +  "String to replace forbidden XML characters.
> +When set to t, forbidden characters are retained.
> +When set to nil, an error is thrown.
> +See `org-odt-forbidden-char-re' for the list of forbidden characters
> +that cannot occur inside ODT documents.
> +
> +You may also consider export filters to perform more fine-grained
> +replacements.  See info node `(org)Advanced Export Configuration'."
> +  :package-version '(Org . "9.8")
> +  :type '(choice (const :tag "Strip forbidden characters" t)

According to the docstring, the above tag should say "Leave forbidden
characters as-is".  See patch which slightly rewords the docstring too.

> +                 (const :tag "Err when forbidden characters encountered" nil)
> +                 (string :tag "Replacement string")))
>
>  ;;;; Debugging
>
> @@ -2892,6 +2915,24 @@ (defun org-odt--encode-tabs-and-spaces (line)
>         (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>
> +(defun org-odt--remove-forbidden (text _backend info)
> +  "Remove forbidden and discouraged characters from TEXT.
> +INFO is the communication plist"
> +  (pcase (plist-get info :odt-with-forbidden-chars)

Should we use pcase-exhaustive?

> +    ((and (pred stringp) rep)
> +     (prog1 (replace-regexp-in-string org-odt-forbidden-char-re rep text)
> +       (when (match-string 0 text)

The replacement appears to work well on my machine, but there are
unnecessary warnings.  Run org-odt-export-to-odt on a buffer containing:

--8<---------------cut here---------------start------------->8---
* foo

bar
--8<---------------cut here---------------end--------------->8---

The (match-string 0 text) form inside org-odt--remove-forbidden evaluates to

"<?xml version=\"1.0\" "

which causes the incorrect warning message "Warning (ox-odt): Replacing forbidden character '' with ''"

Confusingly, `text' and the replacement text are string-equal, so it
appears that no replacement was made.

I suspect that match-string and replace-regexp-in-string do not play
well together.  Try this out:

(let* ((text "bar")
       (new (replace-regexp-in-string "r" "z" text)))
  new                    ; "baz", as expected
  (match-string 0 new)   ; signals error
  (match-string 0 text)) ; signals error

I get the following stack trace (for the first error):

Debugger entered--Lisp error: (args-out-of-range "baz" 402 403)
substring("baz" 402 403)
(if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num)))
(if (match-beginning num) (if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num))))
match-string(0 "baz")
(let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))
(progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)))
(let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))))
(setq elisp--eval-defun-result (let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)))))
elisp--eval-defun()
#<subr eval-defun>(nil)
edebug--eval-defun(#<subr eval-defun> nil)
apply(edebug--eval-defun #<subr eval-defun> nil)
eval-defun(nil)
funcall-interactively(eval-defun nil)
command-execute(eval-defun)
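
A minimal sketch of one way around this, assuming the goal is only to
report the first matching character: call string-match on the original
string and read the match data right away, before any other regexp
function runs.  The regexp "r" and the string "bar" come from the
example above; the helper name my/first-match is hypothetical and not
part of Org.

```elisp
(defun my/first-match (regexp string)
  "Return the first match of REGEXP in STRING, or nil if there is none.
Hypothetical helper, not part of Org."
  ;; `save-match-data' keeps the caller's match data untouched.
  (save-match-data
    (when (string-match regexp string)
      ;; The match data now refers to STRING, so reading it is safe.
      (match-string 0 string))))

(let* ((text "bar")
       (found (my/first-match "r" text))
       (new (replace-regexp-in-string "r" "z" text)))
  (list found new))
```

Here the match is captured before replace-regexp-in-string runs, so the
result no longer depends on whatever match data that function leaves
behind.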


Also, with the replace-regexp-in-string design, only one warning is
shown even when there are multiple forbidden characters.  See the patch
below.
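
For comparison, a minimal sketch of a string-only alternative, not what
the patch does: replace-regexp-in-string accepts a function as its REP
argument and calls it with the text of each match, so every replaced
character can be recorded along the way.  The helper name
my/replace-and-collect and the sample regexp "[\f\a]" are assumptions
made for illustration.

```elisp
(defun my/replace-and-collect (regexp rep string)
  "Replace every match of REGEXP in STRING with the string REP.
Return a cons cell (NEW-STRING . SEEN), where SEEN lists the replaced
substrings in order.  Hypothetical helper, not part of Org."
  (let* ((seen '())
         (new (replace-regexp-in-string
               regexp
               ;; A function REP receives the text of each match and
               ;; returns the replacement for it.
               (lambda (match) (push match seen) rep)
               ;; FIXEDCASE and LITERAL are t so REP is inserted as-is.
               string t t)))
    (cons new (nreverse seen))))

;; Strip form feed and bell characters and remember what was removed.
(my/replace-and-collect "[\f\a]" "" "a\fb\ac\fd")
```

The call returns the cleaned string "abcd" together with the three
removed characters, which is enough to issue one warning per distinct
character.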

> +         (display-warning
> +          '(ox-odt ox-odt-with-forbidden-chars)
> +          (format "Replacing forbidden character '%s' with '%s'"
> +                  (match-string 0 text) rep)))))
> +    (`nil
> +     (if (string-match org-odt-forbidden-char-re text)
> +         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
> +                (match-string 0 text))
> +       text))
> +    (_ text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> --
> 2.47.1


[-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --]
[-- Type: text/x-diff, Size: 4623 bytes --]

From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001
From: Ihor Radchenko <yantar92@posteo.net>
Date: Fri, 27 Dec 2024 10:21:02 +0000
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
control how to handle forbidden XML characters.
(org-odt--remove-forbidden): New filter removing/replacing forbidden
characters.

Co-authored-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef..960bab286 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -94,7 +94,8 @@ (org-export-define-backend 'odt
 		    . (org-odt--translate-latex-fragments
 		       org-odt--translate-description-lists
 		       org-odt--translate-list-tables
-		       org-odt--translate-image-links)))
+		       org-odt--translate-image-links))
+                   (:filter-final-output . org-odt--remove-forbidden))
   :menu-entry
   '(?o "Export to ODT"
        ((?o "As ODT file" org-odt-export-to-odt)
@@ -108,6 +109,7 @@ (org-export-define-backend 'odt
     (:keywords "KEYWORDS" nil nil space)
     (:subtitle "SUBTITLE" nil nil parse)
     ;; Other variables.
+    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
     (:odt-content-template-file nil nil org-odt-content-template-file)
     (:odt-display-outline-level nil nil org-odt-display-outline-level)
     (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
@@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -364,6 +374,19 @@ (defgroup org-export-odt nil
   :tag "Org Export ODT"
   :group 'org-export)
 
+(defcustom org-odt-with-forbidden-chars ""
+  "String to replace forbidden XML characters.
+When set to t, forbidden characters are left as-is.
+When set to nil, an error is thrown.
+See `org-odt-forbidden-char-re' for the list of forbidden characters
+that cannot occur inside ODT documents.
+
+You may also consider export filters to perform more fine-grained
+replacements.  See info node `(org)Advanced Export Configuration'."
+  :package-version '(Org . "9.8")
+  :type '(choice (const :tag "Leave forbidden characters as-is" t)
+                 (const :tag "Err when forbidden characters encountered" nil)
+                 (string :tag "Replacement string")))
 
 ;;;; Debugging
 
@@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text _backend info)
+  "Remove forbidden and discouraged characters from TEXT.
+INFO is the communication plist"
+  (pcase-exhaustive (plist-get info :odt-with-forbidden-chars)
+    ((and (pred stringp) rep)
+     (let ((replacements (make-hash-table :test 'equal)))
+       (with-temp-buffer
+         (insert text)
+         (goto-char (point-min))
+         (while (re-search-forward org-odt-forbidden-char-re nil t)
+           (cl-incf (gethash (match-string 0) replacements 0))
+           (replace-match rep))
+         (cl-loop for forbidden being the hash-keys of replacements
+                  using (hash-values count)
+                  do (display-warning
+                      '(ox-odt ox-odt-with-forbidden-chars)
+                      (format "Replaced forbidden character '%s' with '%s' %d times"
+                              forbidden rep count)))
+         (buffer-string))))
+    (`nil
+     (if (string-match org-odt-forbidden-char-re text)
+         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
+                (match-string 0 text))
+       text))
+    ('t text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-- 
2.46.0


[-- Attachment #3: Type: text/plain, Size: 21 bytes --]


Thank you!!

Joseph

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Form feed characters break odt export
  2024-12-27 20:42           ` Joseph Turner
@ 2024-12-28  8:32             ` Ihor Radchenko
  2024-12-28  9:50               ` Joseph Turner
  0 siblings, 1 reply; 17+ messages in thread
From: Ihor Radchenko @ 2024-12-28  8:32 UTC (permalink / raw)
  To: Joseph Turner; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

Joseph Turner <joseph@breatheoutbreathe.in> writes:

> From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001
> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Fri, 27 Dec 2024 10:21:02 +0000
> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

Thanks for helping with the patch!
I modified it further, adding an ORG-NEWS entry announcing the new
export option.
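
As a rough illustration of what counts as forbidden here, the sketch
below spells out the XML 1.0 character ranges that
org-odt-forbidden-char-re negates (tab, newline, carriage return,
U+20..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF) as a plain predicate.
The function names are hypothetical and not part of the patch.

```elisp
(defun my/xml10-char-p (char)
  "Return non-nil if CHAR is allowed by the XML 1.0 Char production.
Hypothetical helper, not part of Org."
  (or (memq char '(#x9 #xA #xD))
      (<= #x20 char #xD7FF)
      (<= #xE000 char #xFFFD)
      (<= #x10000 char #x10FFFF)))

(defun my/xml10-forbidden-chars (string)
  "Return a list of the characters in STRING that XML 1.0 forbids."
  (delq nil (mapcar (lambda (c) (unless (my/xml10-char-p c) c))
                    string)))

;; Form feed (character code 12) is the only forbidden character here.
(my/xml10-forbidden-chars "Baz\n\f")
```

The last form evaluates to (12): the newline is allowed, while the form
feed from the original report is not.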


[-- Attachment #2: v3-0001-ox-odt-Avoid-putting-forbidden-characters-into-OD.patch --]
[-- Type: text/x-patch, Size: 6021 bytes --]

From 89901da3a0d00598c5ac40cddb2f6dec7c7047cf Mon Sep 17 00:00:00 2001
Message-ID: <89901da3a0d00598c5ac40cddb2f6dec7c7047cf.1735374641.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Fri, 27 Dec 2024 10:21:02 +0000
Subject: [PATCH v3] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
control how to handle forbidden XML characters.
(org-odt--remove-forbidden): New filter removing/replacing forbidden
characters.
* etc/ORG-NEWS (ox-odt: New export option
~org-odt-with-forbidden-chars~): Announce the new option.

Co-authored-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 etc/ORG-NEWS   | 16 ++++++++++++++++
 lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/etc/ORG-NEWS b/etc/ORG-NEWS
index d26813c983..a56e105481 100644
--- a/etc/ORG-NEWS
+++ b/etc/ORG-NEWS
@@ -182,6 +182,22 @@ now be pasted as an Org table using ~yank-media~.
 # adding new customizations, or changing the interpretation of the
 # existing customizations.
 
+*** ox-odt: New export option ~org-odt-with-forbidden-chars~
+
+The new export option controls how to deal with characters that are forbidden
+inside ODT documents during export.
+
+The ODT documents must follow XML1.0 specification and cannot contain
+certain unicode characters.  For example, form feed characters like ^L
+are disallowed.
+
+By default, =ox-odt= will strip such characters and display warning.
+You may return to the previous behaviour by setting
+~org-odt-with-forbidden-chars~ to t.
+
+Note that Emacs warnings can always be suppressed by clicking on ⛔
+symbol or by customizing ~warning-suppress-types~.
+
 *** New option ~org-edit-keep-region~
 
 Since Org 9.7, structure editing commands do not deactivate region
diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..960bab286a 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -94,7 +94,8 @@ (org-export-define-backend 'odt
 		    . (org-odt--translate-latex-fragments
 		       org-odt--translate-description-lists
 		       org-odt--translate-list-tables
-		       org-odt--translate-image-links)))
+		       org-odt--translate-image-links))
+                   (:filter-final-output . org-odt--remove-forbidden))
   :menu-entry
   '(?o "Export to ODT"
        ((?o "As ODT file" org-odt-export-to-odt)
@@ -108,6 +109,7 @@ (org-export-define-backend 'odt
     (:keywords "KEYWORDS" nil nil space)
     (:subtitle "SUBTITLE" nil nil parse)
     ;; Other variables.
+    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
     (:odt-content-template-file nil nil org-odt-content-template-file)
     (:odt-display-outline-level nil nil org-odt-display-outline-level)
     (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
@@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -364,6 +374,19 @@ (defgroup org-export-odt nil
   :tag "Org Export ODT"
   :group 'org-export)
 
+(defcustom org-odt-with-forbidden-chars ""
+  "String to replace forbidden XML characters.
+When set to t, forbidden characters are left as-is.
+When set to nil, an error is thrown.
+See `org-odt-forbidden-char-re' for the list of forbidden characters
+that cannot occur inside ODT documents.
+
+You may also consider export filters to perform more fine-grained
+replacements.  See info node `(org)Advanced Export Configuration'."
+  :package-version '(Org . "9.8")
+  :type '(choice (const :tag "Leave forbidden characters as-is" t)
+                 (const :tag "Err when forbidden characters encountered" nil)
+                 (string :tag "Replacement string")))
 
 ;;;; Debugging
 
@@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text _backend info)
+  "Remove forbidden and discouraged characters from TEXT.
+INFO is the communication plist"
+  (pcase-exhaustive (plist-get info :odt-with-forbidden-chars)
+    ((and (pred stringp) rep)
+     (let ((replacements (make-hash-table :test 'equal)))
+       (with-temp-buffer
+         (insert text)
+         (goto-char (point-min))
+         (while (re-search-forward org-odt-forbidden-char-re nil t)
+           (cl-incf (gethash (match-string 0) replacements 0))
+           (replace-match rep))
+         (cl-loop for forbidden being the hash-keys of replacements
+                  using (hash-values count)
+                  do (display-warning
+                      '(ox-odt ox-odt-with-forbidden-chars)
+                      (format "Replaced forbidden character '%s' with '%s' %d times"
+                              forbidden rep count)))
+         (buffer-string))))
+    (`nil
+     (if (string-match org-odt-forbidden-char-re text)
+         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
+                (match-string 0 text))
+       text))
+    ('t text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-- 
2.47.1


[-- Attachment #3: Type: text/plain, Size: 223 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Form feed characters break odt export
  2024-12-28  8:32             ` Ihor Radchenko
@ 2024-12-28  9:50               ` Joseph Turner
  0 siblings, 0 replies; 17+ messages in thread
From: Joseph Turner @ 2024-12-28  9:50 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Ihor Radchenko <yantar92@posteo.net> writes:

> Joseph Turner <joseph@breatheoutbreathe.in> writes:
>
>> From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001
>> From: Ihor Radchenko <yantar92@posteo.net>
>> Date: Fri, 27 Dec 2024 10:21:02 +0000
>> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> Thanks for helping with the patch!
> I modified it further, adding ORG-NEWS entry announcing the new export
> option.
>
> [...]

LGTM!  TIL about clicking on ⛔ and warning-suppress-types.  Thank you!
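
A minimal configuration sketch of the two choices mentioned in the
ORG-NEWS entry, assuming an Org version that includes this patch (the
option is tagged for Org 9.8).  The warning type (ox-odt) is taken from
the display-warning call in the patch; the two settings are
alternatives and would not normally be combined.

```elisp
(require 'warnings)

;; Option 1: leave forbidden characters as-is, which restores the
;; behaviour from before the patch.
(setq org-odt-with-forbidden-chars t)

;; Option 2: keep the default stripping, but do not pop up warnings
;; whose type starts with `ox-odt'.
(add-to-list 'warning-suppress-types '(ox-odt))
```

Setting the option to a non-empty string instead would replace each
forbidden character with that string, and nil would make the export
signal an error.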

Joseph


^ permalink raw reply	[flat|nested] 17+ messages in thread
