emacs-orgmode@gnu.org archives
From: Joseph Turner <joseph@breatheoutbreathe.in>
To: Ihor Radchenko <yantar92@posteo.net>
Cc: Christian Moe <mail@christianmoe.com>,
	 Org Mode Mailing List <emacs-orgmode@gnu.org>,
	 Bohong Huang <bohonghuang@qq.com>
Subject: Re: Form feed characters break odt export
Date: Sat, 28 Dec 2024 01:50:32 -0800
Message-ID: <87frm8kufr.fsf@breatheoutbreathe.in>
In-Reply-To: <871pxs438i.fsf@localhost> (Ihor Radchenko's message of "Sat, 28 Dec 2024 08:32:29 +0000")

Ihor Radchenko <yantar92@posteo.net> writes:

> Joseph Turner <joseph@breatheoutbreathe.in> writes:
>
>> From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001
>> From: Ihor Radchenko <yantar92@posteo.net>
>> Date: Fri, 27 Dec 2024 10:21:02 +0000
>> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> Thanks for helping with the patch!
> I modified it further, adding ORG-NEWS entry announcing the new export
> option.
>
> From 89901da3a0d00598c5ac40cddb2f6dec7c7047cf Mon Sep 17 00:00:00 2001
> Message-ID: <89901da3a0d00598c5ac40cddb2f6dec7c7047cf.1735374641.git.yantar92@posteo.net>
> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Fri, 27 Dec 2024 10:21:02 +0000
> Subject: [PATCH v3] ox-odt: Avoid putting forbidden characters into ODT xml
>
> * lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
> control how to handle forbidden XML characters.
> (org-odt--remove-forbidden): New filter removing/replacing forbidden
> characters.
> * etc/ORG-NEWS (ox-odt: New export option
> ~org-odt-with-forbidden-chars~): Announce the new option.
>
> Co-authored-by: Joseph Turner <joseph@breatheoutbreathe.in>
> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
> ---
>  etc/ORG-NEWS   | 16 ++++++++++++++++
>  lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 66 insertions(+), 1 deletion(-)
>
> diff --git a/etc/ORG-NEWS b/etc/ORG-NEWS
> index d26813c983..a56e105481 100644
> --- a/etc/ORG-NEWS
> +++ b/etc/ORG-NEWS
> @@ -182,6 +182,22 @@ now be pasted as an Org table using ~yank-media~.
>  # adding new customizations, or changing the interpretation of the
>  # existing customizations.
>
> +*** ox-odt: New export option ~org-odt-with-forbidden-chars~
> +
> +The new export option controls how to deal with characters that are forbidden
> +inside ODT documents during export.
> +
> +ODT documents must follow the XML 1.0 specification and cannot contain
> +certain Unicode characters.  For example, the form feed character (^L)
> +is disallowed.
> +
> +By default, =ox-odt= will strip such characters and display a warning.
> +You may return to the previous behaviour by setting
> +~org-odt-with-forbidden-chars~ to t.
> +
> +Note that Emacs warnings can always be suppressed by clicking on the ⛔
> +symbol or by customizing ~warning-suppress-types~.
> +
>  *** New option ~org-edit-keep-region~
>
>  Since Org 9.7, structure editing commands do not deactivate region
> diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
> index ec81637ef0..960bab286a 100644
> --- a/lisp/ox-odt.el
> +++ b/lisp/ox-odt.el
> @@ -94,7 +94,8 @@ (org-export-define-backend 'odt
>  		    . (org-odt--translate-latex-fragments
>  		       org-odt--translate-description-lists
>  		       org-odt--translate-list-tables
> -		       org-odt--translate-image-links)))
> +		       org-odt--translate-image-links))
> +                   (:filter-final-output . org-odt--remove-forbidden))
>    :menu-entry
>    '(?o "Export to ODT"
>         ((?o "As ODT file" org-odt-export-to-odt)
> @@ -108,6 +109,7 @@ (org-export-define-backend 'odt
>      (:keywords "KEYWORDS" nil nil space)
>      (:subtitle "SUBTITLE" nil nil parse)
>      ;; Other variables.
> +    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
>      (:odt-content-template-file nil nil org-odt-content-template-file)
>      (:odt-display-outline-level nil nil org-odt-display-outline-level)
>      (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
> @@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
>      ("\\.\\.\\." . "&#x2026;"))		; hellip
>    "Regular expressions for special string conversion.")
>
> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +               (?\N{U+20} . ?\N{U+D7FF})
> +               (?\N{U+E000} . ?\N{U+FFFD})
> +               (?\N{U+10000} . ?\N{U+10FFFF}))))
> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -364,6 +374,19 @@ (defgroup org-export-odt nil
>    :tag "Org Export ODT"
>    :group 'org-export)
>
> +(defcustom org-odt-with-forbidden-chars ""
> +  "String to replace forbidden XML characters.
> +When set to t, forbidden characters are left as-is.
> +When set to nil, an error is thrown.
> +See `org-odt-forbidden-char-re' for the list of forbidden characters
> +that cannot occur inside ODT documents.
> +
> +You may also consider export filters to perform more fine-grained
> +replacements.  See info node `(org)Advanced Export Configuration'."
> +  :package-version '(Org . "9.8")
> +  :type '(choice (const :tag "Leave forbidden characters as-is" t)
> +                 (const :tag "Err when forbidden characters encountered" nil)
> +                 (string :tag "Replacement string")))
>
>  ;;;; Debugging
>
> @@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line)
>         (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>
> +(defun org-odt--remove-forbidden (text _backend info)
> +  "Remove forbidden and discouraged characters from TEXT.
> +INFO is the communication plist"
> +  (pcase-exhaustive (plist-get info :odt-with-forbidden-chars)
> +    ((and (pred stringp) rep)
> +     (let ((replacements (make-hash-table :test 'equal)))
> +       (with-temp-buffer
> +         (insert text)
> +         (goto-char (point-min))
> +         (while (re-search-forward org-odt-forbidden-char-re nil t)
> +           (cl-incf (gethash (match-string 0) replacements 0))
> +           (replace-match rep))
> +         (cl-loop for forbidden being the hash-keys of replacements
> +                  using (hash-values count)
> +                  do (display-warning
> +                      '(ox-odt ox-odt-with-forbidden-chars)
> +                      (format "Replaced forbidden character '%s' with '%s' %d times"
> +                              forbidden rep count)))
> +         (buffer-string))))
> +    (`nil
> +     (if (string-match org-odt-forbidden-char-re text)
> +         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
> +                (match-string 0 text))
> +       text))
> +    ('t text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> --
> 2.47.1

LGTM!  TIL about clicking on ⛔ and warning-suppress-types.  Thank you!
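
For anyone following along, here is a minimal sketch of how the new
option and the warning suppression might be set from an init file.  It
assumes an Org build that already includes the patch above; the "?"
placeholder is only an illustrative choice, not a recommended value.

```elisp
;; Sketch only: assumes ox-odt from an Org version containing this patch.
(require 'ox-odt)
(require 'warnings)

;; Replace each forbidden character with a visible placeholder instead
;; of the default "" (which strips it).  "?" is an arbitrary example.
(setq org-odt-with-forbidden-chars "?")

;; Do not pop up the replacement warnings.  The type list mirrors the
;; one passed to `display-warning' in `org-odt--remove-forbidden'.
(add-to-list 'warning-suppress-types '(ox-odt ox-odt-with-forbidden-chars))

;; Quick check that form feed (^L) is matched by the new regexp.
(string-match-p org-odt-forbidden-char-re "a\fb") ; => 1
```

Setting the option to t instead restores the old behaviour of leaving
the characters in place, and nil makes the export signal an error, as
described in the docstring.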

Joseph

