From: torys.anderson@gmail.com (Tory S. Anderson)
Subject: Re: Orgmode → ODT: Certain chars break export
Date: Fri, 13 Feb 2015 10:18:24 -0500
Message-ID: <87egpt6drz.fsf@gmail.com>
References: <87r3tu5bus.fsf@gmail.com> <87386avzrl.fsf@gmx.us>
In-Reply-To: <87386avzrl.fsf@gmx.us> (rasmus@gmx.us's message of "Fri, 13 Feb 2015 12:04:14 +0100")
To: emacs-orgmode@gnu.org
List-Id: "General discussions about Org-mode."

From a user perspective, just stripping the characters seems best to me,
but finding out which characters those are seems obnoxious. Neither a
quick search nor skimming the ODT doc specification[1][2] gives any
insight into a set of illegal characters. Does elisp have anything
similar to Java's "isWhitespace"[3] that could be used to check
character features?

Rasmus writes:

> torys.anderson@gmail.com (Tory S. Anderson) writes:
>
>> While we're on the topic of ODT export problems: I was in the process
>> of converting PDF to Text to Org to ODT/DocX and discovered that
>> certain characters seem to break exported odt documents, which fail
>> with a line and col number. So far the only one I know for sure is the
>> "^L" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
>> all such cases.
>>
>> You probably don't need it, but I verified with the following file:
>> http://toryanderson.com/files/breakorg.org
>
> The export is fine, but the produced XML is invalid since it contains an
> illegal character. But how to resolve this? Should ox strip illegal
> characters (if so what are they)? If so, could they be used for entities?
>
> --Rasmus

Footnotes:
[1] https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
[2] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415196_253892949
[3] http://www.fileformat.info/info/unicode/char/000c/index.htm
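
On the question of which characters are illegal: one possible reading is
that the limit comes from XML itself rather than from anything specific to
ODT, since the files inside an .odt are XML. The XML 1.0 specification
defines a "Char" production listing the allowed code points, and form feed
(U+000C) is not among them. Below is a minimal sketch in Python (not
elisp, and not what ox-odt does) that strips or reports anything outside
that production. The function names strip_xml_illegal and
find_xml_illegal are made up for illustration, and the assumption that
XML 1.0 is the relevant limit has not been checked against the ODT spec.

```python
import re

# Characters permitted by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything outside those ranges.
_XML10_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text):
    """Return text with characters not allowed in XML 1.0 removed.

    Hypothetical helper: assumes XML 1.0 rules are what make the
    exported document invalid.
    """
    return _XML10_ILLEGAL.sub("", text)


def find_xml_illegal(text):
    """Return a list of (offset, "U+XXXX") pairs for each illegal character."""
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in _XML10_ILLEGAL.finditer(text)
    ]


if __name__ == "__main__":
    sample = "page one\x0cpage two"
    print(find_xml_illegal(sample))
    print(strip_xml_illegal(sample))
```

Reporting the offsets before stripping may be the friendlier option for a
user, since silently dropping a form feed also drops the page break it
stood for in the PDF-to-text output.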