* Form feed characters break odt export
@ 2024-12-21 1:48 Joseph Turner via General discussions about Org-mode.
2024-12-21 3:56 ` Max Nikulin
2024-12-23 17:32 ` Ihor Radchenko
0 siblings, 2 replies; 11+ messages in thread
From: Joseph Turner via General discussions about Org-mode. @ 2024-12-21 1:48 UTC (permalink / raw)
To: Org Mode Mailing List; +Cc: Bohong Huang
Tested on
GNU Emacs 29.4 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.41, cairo version 1.18.0)
Org mode version 9.7.6 (9.7.6-7a4527 @ /home/joseph/.emacs.d/elpa/org-9.7.6/)
I can export the following Org content to a .odt file, but the exported
file cannot be opened ("Read Error. Format error discovered in the file
in sub-document content.xml at 368,2(row,col).")
--8<---------------cut here---------------start------------->8---
#+TITLE: Foo
* Bar
Baz
\f
--8<---------------cut here---------------end--------------->8---
First reported by bohonghuang:
https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
Thanks!
Joseph
* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-21 3:56 UTC (permalink / raw)
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang
On 21/12/2024 08:48, Joseph Turner wrote:
>
> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
[...]
> First reported by bohonghuang:
> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
In this specific context a workaround should be
#+begin_comment
^L
#+end_comment
Or a commented out empty local variables block above.
I have already written that I do not like non-printable characters in Org
files.
I admit that special characters either should cause `org-lint' warnings
or should be filtered out by exporters.
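A minimal sketch of the "filtered out by exporters" option, done on the user side with an export filter rather than inside Org. It assumes the form feed reaches the exporter as ordinary plain text; the function name `my/org-odt-strip-form-feed` is hypothetical, while `org-export-filter-plain-text-functions` and `org-export-derived-backend-p` are the existing Org export hooks.

```elisp
(require 'ox)

(defun my/org-odt-strip-form-feed (text backend _info)
  "Remove form feed characters from TEXT when BACKEND derives from ODT.
Return nil for any other backend, which leaves TEXT unchanged."
  (when (org-export-derived-backend-p backend 'odt)
    ;; "\f" is the form feed character (0x0c, shown as ^L in Emacs).
    (replace-regexp-in-string "\f" "" text t t)))

;; Run the filter on every plain-text string produced during export.
(add-to-list 'org-export-filter-plain-text-functions
             #'my/org-odt-strip-form-feed)
```

This only handles the form feed; other control characters would need the same treatment.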
As for ^L specifically, there was a request to treat it as a page break in
all exporters (I would prefer some entity or macro instead, so as not to
deviate from plain text markup).
Marvin Gülker. Feature request: export form feed as page break. Sat, 21
Oct 2023 09:42:33 +0200.
<https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>
I have not had a close look at another proposed feature, but I suspect
that it might make filtering special characters more tricky. (I would be
happy to hear that I am wrong.)
Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed width
elements. Wed, 05 Apr 2023 07:03:43 -0500.
<https://list.orgmode.org/874jpuijpc.fsf@gmail.com>
* Re: Form feed characters break odt export
From: Joseph Turner @ 2024-12-21 6:52 UTC (permalink / raw)
To: emacs-orgmode; +Cc: Bohong Huang, Max Nikulin
Max Nikulin <manikulin@gmail.com> writes:
> On 21/12/2024 08:48, Joseph Turner wrote:
>> I can export the following Org content to a .odt file, but the
>> exported
>> file cannot be opened ("Read Error. Format error discovered in the file
>> in sub-document content.xml at 368,2(row,col).")
> [...]
>> First reported by bohonghuang:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>
> In this specific context a workaround should be
>
> #+begin_comment
> ^L
> #+end_comment
Thank you! Or even simpler:
# ^L
> Or a commented out empty local variables block above.
>
> I have already written that I do not like non-printable characters in
> Org files.
I agree that they make Org files less portable outside Emacs, and they
complicate org-export.
> I admit that special characters either should cause `org-lint'
> warnings or should be filtered out by exporters.
>
> As for ^L specifically, there was a request to treat it as a page break in
> all exporters (I would prefer some entity or macro instead, so as not to
> deviate from plain text markup).
>
> Marvin Gülker. Feature request: export form feed as page break. Sat,
> 21 Oct 2023 09:42:33 +0200.
> <https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>
>
> I have not had a close look at another proposed feature, but I suspect
> that it might make filtering special characters more tricky. (I would
> be happy to hear that I am wrong.)
Yes. Without digging into it, my gut feeling is also that handling one
non-printable character specially would open Pandora's box.
> Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed
> width elements. Wed, 05 Apr 2023 07:03:43 -0500.
> <https://list.orgmode.org/874jpuijpc.fsf@gmail.com>
Gratefully,
Joseph
* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-21 7:23 UTC (permalink / raw)
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang
On 21/12/2024 13:52, Joseph Turner wrote:
> Max Nikulin writes:
>>
>> #+begin_comment
>> ^L
>> #+end_comment
> Thank you! Or even simpler:
>
> # ^L
That was the first thing I tried, but Emacs 28.2 still asks whether the
local variables should be applied.
You may ask Emacs developers for a *plain text* spell to stop processing
of local variables (or to take the *last* block found).
Note that the commit diff looks confusing.
* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-24 16:23 UTC (permalink / raw)
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang
On 21/12/2024 10:56, Max Nikulin wrote:
> On 21/12/2024 08:48, Joseph Turner wrote:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>
> In this specific context a workaround should be
>
> #+begin_comment
> ^L
> #+end_comment
To avoid confusing other contributors, it should be more verbose:
#+begin_comment
Keep this block at the bottom of the file.
It instructs Emacs to ignore examples
of local variables sections above, see
<info:emacs#Specifying File Variables>
The following line contains the form feed 0x0c character.
^L
#+end_comment
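A rough sketch of why a form feed on its own line hides earlier local variables examples. It approximates the rule from the Emacs manual node cited above: the local variables list must start within the last 3000 characters of the file and on its last page. The function name `my/local-variables-visible-p` is hypothetical, and the search is a simplification of what Emacs itself does, not a copy of it.

```elisp
(defun my/local-variables-visible-p ()
  "Return non-nil if \"Local Variables:\" follows the last page break.
Only the final 3000 characters of the current buffer are considered,
and a page break is a form feed at the beginning of a line."
  (save-excursion
    (goto-char (point-max))
    ;; Move back to the last form feed that starts a line, or to the
    ;; 3000-character limit when there is no such form feed.
    (search-backward "\n\f" (max (- (point-max) 3000) (point-min)) 'move)
    (let ((case-fold-search t))
      (and (search-forward "Local Variables:" nil t) t))))
```

Under this approximation a form feed in the middle of a line, as in `# ^L`, does not count as a page break, which would be consistent with the report earlier in this thread that the one-line variant did not help.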
* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-23 17:32 UTC (permalink / raw)
To: Joseph Turner; +Cc: Org Mode Mailing List, Bohong Huang
Joseph Turner via "General discussions about Org-mode."
<emacs-orgmode@gnu.org> writes:
> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
>
> --8<---------------cut here---------------start------------->8---
> #+TITLE: Foo
> * Bar
> Baz
> \f
> --8<---------------cut here---------------end--------------->8---
Looks like ^L is not allowed in ODT files.
However, I see no such information on
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
Could somebody check whether there is an official list of unsupported
characters in ODT? Or maybe it is simply a bug in LibreOffice?
--
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: Form feed characters break odt export
From: Christian Moe @ 2024-12-24 11:04 UTC (permalink / raw)
To: Ihor Radchenko; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang
(re-sending to include the list, apologies, recent mu4e ui changes keep
tripping me up)
Ihor Radchenko <yantar92@posteo.net> writes:
> Joseph Turner via "General discussions about Org-mode."
> <emacs-orgmode@gnu.org> writes:
>
>> I can export the following Org content to a .odt file, but the exported
>> file cannot be opened ("Read Error. Format error discovered in the file
>> in sub-document content.xml at 368,2(row,col).")
>>
>> --8<---------------cut here---------------start------------->8---
>> #+TITLE: Foo
>> * Bar
>> Baz
>> \f
>> --8<---------------cut here---------------end--------------->8---
>
> Looks like ^L is not allowed in ODT files.
> However, I see no such information on
> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
>
> Could somebody check whether there is an official list of unsupported
> characters in ODT? Or maybe it is simply a bug in LibreOffice?
I don't think it's specific to ODT or LibreOffice, it's the underlying
XML 1.0 spec that "discourages" control characters and does not include
#xC in the range of characters that XML processors must accept.
Spec: https://www.w3.org/TR/REC-xml/#charsets
Some discussion:
https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
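For illustration, the Char production from the spec linked above (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) can be written as a small predicate. This is only a sketch; the name `my/xml-1.0-char-p` is hypothetical and not part of Org or Emacs.

```elisp
(defun my/xml-1.0-char-p (char)
  "Return non-nil if CHAR is allowed by the XML 1.0 Char production.
CHAR is a character code (an integer)."
  (and (or (memql char '(#x9 #xA #xD))     ; tab, line feed, carriage return
           (<= #x20 char #xD7FF)
           (<= #xE000 char #xFFFD)
           (<= #x10000 char #x10FFFF))
       t))
```

With this, `(my/xml-1.0-char-p ?\f)` is nil because the form feed is #xC, while tab and newline are accepted.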
Yours,
Christian
* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-24 14:14 UTC (permalink / raw)
To: Christian Moe; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang
Christian Moe <mail@christianmoe.com> writes:
> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
>
> Spec: https://www.w3.org/TR/REC-xml/#charsets
>
> Some discussion:
> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
Thanks!
Then, we can simply remove the disallowed characters.
See the attached tentative patch.
[-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --]
[-- Type: text/x-patch, Size: 3707 bytes --]
From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
* lisp/ox-odt.el (org-odt-forbidden-char-re):
(org-odt-discouraged-char-re): New constants codifying characters that
are prohibited in XML spec.
(org-odt--remove-forbidden): New function removing the prohibited
characters.
(org-odt--encode-plain-text): Remove the prohibited characters.
(org-odt-plain-text): Update comment.
Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..61c8d4ec75 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;")) ; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
+(defconst org-odt-discouraged-char-re
+  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
+          (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
+          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
+          (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
+          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
+          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
+          (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
+          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
+          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
+          (?\N{U+10FFFE} . ?\N{U+10FFFF})))
+  "Regexp matching discouraged XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text)
+  "Remove forbidden and discouraged characters from TEXT.
+https://www.w3.org/TR/REC-xml/#charsets"
+  (replace-regexp-in-string
+   org-odt-forbidden-char-re ""
+   (replace-regexp-in-string
+    org-odt-discouraged-char-re ""
+    text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-  (if no-whitespace-filling text
-    (org-odt--encode-tabs-and-spaces text)))
+  (org-odt--remove-forbidden
+   (if no-whitespace-filling text
+     (org-odt--encode-tabs-and-spaces text))))
 
 (defun org-odt-plain-text (text info)
   "Transcode a TEXT string from Org to ODT.
 TEXT is the string to transcode. INFO is a plist holding
 contextual information."
   (let ((output text))
-    ;; Protect &, < and >.
+    ;; Protect &, < and >, and remove forbidden characters.
     (setq output (org-odt--encode-plain-text output t))
     ;; Handle smart quotes. Be sure to provide original string since
     ;; OUTPUT may have been modified.
--
2.47.1
* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-24 14:25 UTC (permalink / raw)
To: emacs-orgmode
On 24/12/2024 18:04, Christian Moe wrote:
> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
Pandoc retains "^L" in export to markdown, but replaces the line with a
space in .odt. I am curious whether it is a dedicated output filter or just a
feature of the XML writer.
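The difference between deleting a forbidden character and replacing it with a space can be sketched as follows. This is an illustration only: it makes no claim about how Pandoc implements its behaviour, and the names `my/xml-forbidden-char-re`, `my/xml-drop-forbidden` and `my/xml-space-forbidden` are hypothetical.

```elisp
(require 'rx)

(defconst my/xml-forbidden-char-re
  (rx (not (in ?\t ?\n ?\r
               (#x20 . #xD7FF)
               (#xE000 . #xFFFD)
               (#x10000 . #x10FFFF))))
  "Regexp matching one character outside the XML 1.0 Char production.")

(defun my/xml-drop-forbidden (text)
  "Return TEXT with forbidden XML 1.0 characters deleted."
  (replace-regexp-in-string my/xml-forbidden-char-re "" text t t))

(defun my/xml-space-forbidden (text)
  "Return TEXT with each forbidden XML 1.0 character replaced by a space."
  (replace-regexp-in-string my/xml-forbidden-char-re " " text t t))
```

For the input "foo\fbar", deletion yields "foobar" while replacement yields "foo bar", so the second variant keeps words apart when a control character was the only separator between them.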
* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-24 14:30 UTC (permalink / raw)
To: Max Nikulin; +Cc: emacs-orgmode
Max Nikulin <manikulin@gmail.com> writes:
> On 24/12/2024 18:04, Christian Moe wrote:
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>
> Pandoc retains "^L" in export to markdown, but replaces the line with a
> space in .odt. I am curious whether it is a dedicated output filter or just a
> feature of the XML writer.
What about other control characters? Does pandoc also replace them with space?