From: Vaidheeswaran
Subject: Re: Orgmode → ODT: Certain chars break export
Date: Sat, 14 Feb 2015 14:20:25 +0530
Message-ID: <54DF0C51.2090604@gmail.com>
In-Reply-To: <87r3tu5bus.fsf@gmail.com>
Cc: orgmode list

On Friday 13 February 2015 04:15 PM, Tory S.
Anderson wrote:
> While we're on the topic of ODT export problems: I was in the process
> of converting PDF to Text to Org to ODT/DocX and discovered that
> certain characters seem to break exported odt documents, which fail
> with a line and col number. So far the only one I know for sure is the
> "^L" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
> all such cases.
>
> You probably don't need it, but I verified with the following file:
> http://toryanderson.com/files/breakorg.org
>
> Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)

I assume that you are using pdftotext. In that case, you can use the
following argument:

    -nopgbrk : don't insert page breaks between pages

That said, it is very difficult to say what the right action should be
when encountering ^L or other problematic characters. Much depends on
the context. Neither outright removal nor replacement with a single
SPC, a NEWLINE, or a double NEWLINE may be satisfactory.

Specifically, in the pdftotext case above, I believe the best action
would be to run M-x flush-lines on lines that match ^L, so that page
headers are stripped.

----------------------------------------------------------------

From the exporter's side of things, the best that one could do is to
catch such exceptional cases and report them to the user for further
repair. That is, instead of waiting for LibreOffice to catch the
problem and leave the user in utter confusion, the export backend
could catch the error early in the export process and report a useful
message.

A variation of the following snippet can be used for catching the
error early.

    (add-hook 'org-export-before-parsing-hook
              (lambda (backend)
                (when (eq backend 'odt)
                  (goto-char (point-min))
                  ;; Stop at the first character that XML 1.0 does not allow.
                  (when (re-search-forward
                         (rx-to-string
                          '(or (in (#x0 . #x8))
                               (in (#xB . #xC))
                               (in (#xE . #x1F))
                               (in (#xD800 . #xDFFF))
                               (in (#xFFFE . #xFFFF))
                               (in (#x110000 . #x3FFFFF))))
                         nil t)
                    (user-error "Input file has a problematic char [%s]."
                                (format "#x%x" (string-to-char (match-string 0))))))))

The following snippet could be used for outright removal of
problematic characters.

    (add-hook 'org-export-before-parsing-hook
              (lambda (backend)
                (when (eq backend 'odt)
                  (goto-char (point-min))
                  ;; Loop so that every run is removed, not only the first one.
                  (while (re-search-forward
                          (rx-to-string
                           '(one-or-more
                             (or (in (#x0 . #x8))
                                 (in (#xB . #xC))
                                 (in (#xE . #x1F))
                                 (in (#xD800 . #xDFFF))
                                 (in (#xFFFE . #xFFFF))
                                 (in (#x110000 . #x3FFFFF)))))
                          nil t)
                    (replace-match "" t t)))))

----------------------------------------------------------------

Note to the developers:

1. xmltok.el has `xmltok-valid-char-p'.

2. From http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

     /* any Unicode character, excluding the surrogate blocks,
        FFFE, and FFFF. */
     Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
              | [#x10000-#x10FFFF]

   Document authors are encouraged to avoid "compatibility
   characters", as defined in section 2.3 of [Unicode]. The
   characters defined in the following ranges are also discouraged.
   They are either control characters or permanently undefined
   Unicode characters:

     [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
     [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
     [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
     [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
     [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
     [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
     [#x10FFFE-#x10FFFF].

----------------------------------------------------------------
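For what it is worth, here is a rough sketch of how the
`xmltok-valid-char-p' predicate mentioned in the note to the developers
might be used in place of a hand-written regexp. The two `my/' function
names are hypothetical and are not part of Org or of xmltok.el; only
`xmltok-valid-char-p', `user-error' and the export hook are existing
names. It scans the buffer one character at a time, so it may well be
slower than the regexp search on large files. Treat it as an
illustration, not as a tested patch.

```elisp
(require 'xmltok) ; ships with Emacs (nXML); provides `xmltok-valid-char-p'

;; Hypothetical helper, not part of Org or xmltok.el.
(defun my/first-invalid-xml-char-position ()
  "Return position of the first char that is not valid in XML 1.0.
Return nil if the current buffer contains no such character."
  (save-excursion
    (goto-char (point-min))
    (let (found)
      (while (and (not found) (not (eobp)))
        (if (xmltok-valid-char-p (char-after))
            (forward-char 1)
          (setq found (point))))
      found)))

;; Hypothetical hook function, not part of Org.
(defun my/odt-report-invalid-xml-char (backend)
  "Signal a `user-error' when exporting to ODT a buffer with a bad char.
BACKEND is the export back-end symbol passed by the hook."
  (when (eq backend 'odt)
    (let ((pos (my/first-invalid-xml-char-position)))
      (when pos
        (user-error "Invalid XML char #x%x at line %d, column %d"
                    (char-after pos)
                    (line-number-at-pos pos)
                    (save-excursion (goto-char pos) (current-column)))))))

(add-hook 'org-export-before-parsing-hook #'my/odt-report-invalid-xml-char)
```

Compared with the regexp version, this reports the line and column in
the Org source, which should make the offending character easier to
find than the position LibreOffice reports for the generated XML.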