emacs-orgmode@gnu.org archives
From: Vaidheeswaran <vaidheeswaran.chinnaraju@gmail.com>
Cc: orgmode list <emacs-orgmode@gnu.org>
Subject: Re: Orgmode → ODT: Certain chars break export
Date: Sat, 14 Feb 2015 14:20:25 +0530
Message-ID: <54DF0C51.2090604@gmail.com>
In-Reply-To: <87r3tu5bus.fsf@gmail.com>


On Friday 13 February 2015 04:15 PM, Tory S. Anderson wrote:
> While we're on the topic of ODT export problems: I was in the process of converting PDF to Text to Org to ODT/DocX and discovered that certain characters seem to break exported odt documents, which fail with a line and col number. So far the only one I know for sure is the "\f" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle all such cases.
>
> You probably don't need it, but I verified with the following file:
> http://toryanderson.com/files/breakorg.org
>
> Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)
>
>

I assume that you are using pdftotext.  In that case, you can use the
following argument.

   -nopgbrk          : don't insert page breaks between pages

That said, it is very difficult to say what the right action should be
when encountering ^L or other problematic characters.  Much depends on
the context.  Neither outright removal nor replacement with a single
SPC, a NEWLINE, or a double NEWLINE may be satisfactory.  Specifically,
in the pdftotext case above, I believe the best action would be to run
M-x flush-lines on the lines that match ^L so that page headers are
stripped.
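
The flush-lines step can also be scripted.  What follows is only a
minimal sketch: the name my/flush-form-feed-lines is made up for
illustration and is not part of Emacs or Org.  Note that it deletes the
whole line holding the ^L, so it is only appropriate when that line
really is a page header.

```elisp
;; Hypothetical helper, not part of Emacs or Org.
(defun my/flush-form-feed-lines ()
  "Delete every line of the current buffer that contains a form feed (^L)."
  (interactive)
  (save-excursion
    ;; Called from Lisp without a region, `flush-lines' works from
    ;; point to the end of the buffer, so start at the beginning.
    (goto-char (point-min))
    (flush-lines "\f")))
```

Run it with M-x my/flush-form-feed-lines in the buffer that holds the
pdftotext output, before exporting.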

----------------------------------------------------------------

From the exporter's side of things, the best one could do is to catch
such exceptional cases and report them to the user for further repair,
i.e., instead of waiting for LibreOffice to catch the error and leave
the user in utter confusion, the export backend could catch it early
in the export process and report a useful message.

A variation of the following snippet can be used to catch the error
early.

(add-hook
 'org-export-before-parsing-hook
 (lambda (backend)
   ;; Abort ODT export if the buffer holds a char that XML 1.0 forbids.
   (when (eq backend 'odt)
     (goto-char (point-min))
     (when (re-search-forward
            (rx-to-string '(or (in (#x0 . #x8))
                               (in (#xB . #xC))
                               (in (#xE . #x1F))
                               (in (#xD800 . #xDFFF))
                               (in (#xFFFE . #xFFFF))
                               (in (#x110000 . #x3FFFFF))))
            nil t)
       ;; Report the code point of the first offending character.
       (user-error "Input file has a problematic char [#x%x]"
                   (string-to-char (match-string 0)))))))
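
Since LibreOffice reports a line and column, the early check could do
the same.  The following is only a sketch under assumptions: the my/...
names are hypothetical, the ranges are copied from the snippet above,
and the buffer is scanned character by character rather than with a
regexp.  The hook runs on the export copy of the buffer, so the
reported line may not match the original file exactly.

```elisp
;; Hypothetical helpers, not part of Org.
(defun my/odt-problem-char-p (c)
  "Return non-nil if character C is in one of the ranges checked above."
  (or (<= #x0 c #x8)
      (<= #xB c #xC)
      (<= #xE c #x1F)
      (<= #xD800 c #xDFFF)
      (<= #xFFFE c #xFFFF)
      (<= #x110000 c #x3FFFFF)))

(defun my/odt-problem-char-positions ()
  "Return a list of (CHAR LINE COLUMN) for each problematic character.
LINE is 1-based and COLUMN is 0-based, as in the mode line."
  (let (hits)
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (let ((c (char-after)))
          (when (my/odt-problem-char-p c)
            (push (list c (line-number-at-pos) (current-column)) hits)))
        (forward-char 1)))
    (nreverse hits)))

(defun my/org-odt-check-chars (backend)
  "Signal a `user-error' locating the first problematic char for ODT."
  (when (eq backend 'odt)
    (let ((hit (car (my/odt-problem-char-positions))))
      (when hit
        (user-error "Problematic char #x%x at line %d, column %d"
                    (nth 0 hit) (nth 1 hit) (nth 2 hit))))))

;; To enable:
;; (add-hook 'org-export-before-parsing-hook #'my/org-odt-check-chars)
```

The character-by-character scan is slow on very large buffers; the
regexp search above is the faster way to find the first hit.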

The following snippet could be used for outright removal of
problematic characters.

(add-hook
 'org-export-before-parsing-hook
 (lambda (backend)
   ;; Silently delete every run of chars that XML 1.0 forbids.
   (when (eq backend 'odt)
     (goto-char (point-min))
     ;; Loop with `while' so that all runs are removed, not just the first.
     (while (re-search-forward
             (rx-to-string '(one-or-more
                             (or (in (#x0 . #x8))
                                 (in (#xB . #xC))
                                 (in (#xE . #x1F))
                                 (in (#xD800 . #xDFFF))
                                 (in (#xFFFE . #xFFFF))
                                 (in (#x110000 . #x3FFFFF)))))
             nil t)
       (replace-match "" t t)))))

----------------------------------------------------------------

Note to the developers:

1. xmltok.el has `xmltok-valid-char-p'.
2. From http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]


Document authors are encouraged to avoid "compatibility characters",
as defined in section 2.3 of [Unicode]. The characters defined in the
following ranges are also discouraged. They are either control
characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
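
For illustration, the Char production and the discouraged ranges quoted
above can be written as two predicates.  This is a sketch only: the
names my/xml-char-p and my/xml-discouraged-char-p are hypothetical, and
whether they agree with `xmltok-valid-char-p' in every case is an
assumption I have not verified.

```elisp
;; Hypothetical helpers, transcribed from the W3C text quoted above.
(defun my/xml-char-p (c)
  "Return non-nil if code point C matches the XML 1.0 Char production."
  (or (memq c '(#x9 #xA #xD))
      (<= #x20 c #xD7FF)
      (<= #xE000 c #xFFFD)
      (<= #x10000 c #x10FFFF)))

(defun my/xml-discouraged-char-p (c)
  "Return non-nil if C is in one of the ranges the spec discourages."
  (or (<= #x7F c #x84)
      (<= #x86 c #x9F)
      (<= #xFDD0 c #xFDEF)
      ;; The last two code points of each of planes 1 through 16,
      ;; i.e. [#x1FFFE-#x1FFFF] up to [#x10FFFE-#x10FFFF].
      (and (<= #x10000 c #x10FFFF)
           (>= (logand c #xFFFF) #xFFFE))))
```

For example, (my/xml-char-p ?\f) is nil, which is why ^L breaks the
exported document, while #x85 passes both checks as valid and not
discouraged.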
