From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jambunathan K
Subject: Re: ODT Charset/Encoding issues (was question about ODT export behavior)
Date: Mon, 18 Jul 2011 00:43:04 +0530
Message-ID: <81fwm43em7.fsf@gmail.com>
References: <817h7mce7q.fsf@gmail.com> <81oc0yaqes.fsf@gmail.com>
 <4E1E91D3.70800@diplan.de> <87wrfkx3wo.fsf@gnu.org>
 <81oc0w9jie.fsf@gmail.com> <81r55qx9uf.fsf_-_@gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Return-path:
Received: from eggs.gnu.org ([140.186.70.92]:56535)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1QiWmU-0005DP-V3 for emacs-orgmode@gnu.org;
 Sun, 17 Jul 2011 15:13:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1QiWmQ-0007fa-3a for emacs-orgmode@gnu.org;
 Sun, 17 Jul 2011 15:13:34 -0400
Received: from mail-pz0-f43.google.com ([209.85.210.43]:35912)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1QiWmP-0007fU-QY for emacs-orgmode@gnu.org;
 Sun, 17 Jul 2011 15:13:30 -0400
Received: by pzk1 with SMTP id 1so3786826pzk.30 for ;
 Sun, 17 Jul 2011 12:13:28 -0700 (PDT)
In-Reply-To: (Renzo Been's message of "Sun, 17 Jul 2011 16:12:09 +0200")
List-Id: "General discussions about Org-mode."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
To: Renzo Been , Christian Moe
Cc: emacs-orgmode@gnu.org

--=-=-=
Content-Type: text/plain


Hello Renzo & Christian

Thanks for the test files and sharing your views on this issue. With the
attached patch I can export the test files successfully.

The attached patch ensures that component xml files created by the odt
exporter are always utf-8 encoded. This is irrespective of the coding
system used by the Org buffer.

Jambunathan K.
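
The failure mode described above can be illustrated with a minimal Python
sketch. It is not part of the patch and does not use the exporter's code:
the function names, the sample text and the exact XML declaration are
assumptions made for illustration only. It shows that bytes written in
iso-8859-1 are not valid UTF-8 once they contain a non-ascii character,
whereas forcing utf-8 on write keeps the declaration truthful.

```python
# Illustrative sketch only; names and the XML declaration are hypothetical.
XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>\n'  # assumed declaration


def write_component(text: str, coding: str) -> bytes:
    """Serialize a tiny component XML file using the given coding system."""
    return (XML_DECL + "<p>" + text + "</p>").encode(coding)


def reads_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8, as the declaration claims."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "caf\u00e9"  # a word with an accent, as typed in an iso-8859-1 buffer

broken = write_component(sample, "iso-8859-1")  # follows the buffer's coding
fixed = write_component(sample, "utf-8")        # utf-8 forced on write

print(reads_as_utf8(broken))  # False
print(reads_as_utf8(fixed))   # True
```

A consumer that trusts the UTF-8 declaration would reject the first byte
string, which matches the reported symptom of a corrupt document.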
--=-=-=
Content-Type: text/plain
Content-Disposition: inline; filename=0001-org-odt-Correctly-export-iso-8859-1-files-with-non-a.patch
Content-Description: 0001-org-odt-Correctly-export-iso-8859-1-files-with-non-a.patch

>From 1ec1e3c9248387ab2daabe7b9c7cc4a3c42b4998 Mon Sep 17 00:00:00 2001
From: Jambunathan K
Date: Mon, 18 Jul 2011 00:26:41 +0530
Subject: [PATCH] org-odt: Correctly export iso-8859-1 files with non-ascii chars

* contrib/lisp/org-odt.el (org-odt-get): Set CODING-SYSTEM-FOR-WRITE
and CODING-SYSTEM-FOR-SAVE to 'utf-8 irrespective of
buffer-file-coding-system.

Fixes issue reported by Renzo Been in the following post.
http://lists.gnu.org/archive/html/emacs-orgmode/2011-07/msg00795.html
---
 contrib/lisp/org-odt.el |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/contrib/lisp/org-odt.el b/contrib/lisp/org-odt.el
index f3a4067..bd2ea33 100644
--- a/contrib/lisp/org-odt.el
+++ b/contrib/lisp/org-odt.el
@@ -1380,6 +1380,8 @@ MAY-INLINE-P allows inlining it as an image."
     (PLAIN-TEXT-MAP '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (TABLE-FIRST-COLUMN-AS-LABELS nil)
     (FOOTNOTE-SEPARATOR (org-lparse-format 'FONTIFY "," 'superscript))
+    (CODING-SYSTEM-FOR-WRITE 'utf-8)
+    (CODING-SYSTEM-FOR-SAVE 'utf-8)
     (t (error "Unknown property: %s" what))))
 
 (defun org-odt-parse-label (label)
-- 
1.7.2.3


--=-=-=
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

> Hi Jambunathan,
>
> See comments below.
>
> Ciao,
> Renzo
> P.S. I'm on a camping-site right now, so I do not have good Internet access...
>
> On 16 July 2011 22:13, Jambunathan K wrote:
>>
>> Renzo
>>
>>> I just want to add one point that I did not find in the org-manual. I tested
>>> some of my org-files and exported them to the OpenOffice format. When I tried to
>>> open these documents in OpenOffice, they were corrupt and could not be opened.
>>>
>>> I soon found out why.
>>> If you want to export an org-mode file to .odt, you need
>>> to explicitly set the file encoding to UTF-8 (I usually use iso-8859-1 encoding
>>> for my files), like:
>>> #-*- mode: org; coding: utf-8; -*-
>>> After that OpenOffice could open the files without any problems.
>>
>> I use English for communication and I have to admit that I have zero
>> understanding of things like character sets, encodings etc.
>
> As for communicating; I'm from the border regions of The Netherlands, Belgium
> and Germany... And therefore I'm multilingual, and often need to type words
> with accents.
>
>> Thanks for the above note. I surely see it is a bug but my poor
>> understanding prevents me from quantifying it further.
>
> Well... I would not really see it as a bug... As long as it is mentioned in the
> documentation, that org-file encodings other than utf-8 could result in corrupt
> output-files.
>
>> Could you please send me a minimal iso-8859-1 test.org file and the
>> associated corrupted test.odt file? I will look into this issue.
>
> See attachment. I can only send you the org file, because I do not have access
> to a working Emacs at the moment...
>
>> 1. Do you have any specific requirement on how the component xml files
>>   be encoded? A cursory look at the odt exporter suggests that it could
>>   actually be emitting xml files in iso-8859-1 format while wrongly
>>   claiming UTF-8 encoding as below
>>
>> --8<---------------cut here---------------start------------->8---
>>
>> --8<---------------cut here---------------end--------------->8---
>>
>> 2. Should the xml file always be emitted in UTF-8 irrespective of how
>>   the original Org file is encoded?
>
> Yes that would seem a good solution to me... If the odt-exporter checks the
> file's encoding, and then changes the encoding to utf-8 (maybe using a temporary
> buffer?) before the actual exporting, then there would be no further
> problems...
>
> As for the idea that the OpenOffice xml can actually be in another encoding
> than utf-8; I do not know how much work that would be for you, to implement in
> the odt-exporter. It might be too much effort...
> Also I don't know if such an OpenOffice document will open with no problems in
> all OpenOffice applications.
>
>> [Notes to Self]
>> [Notes from odbook]
>>
>> Para 3 of http://books.evc-cit.info/odbook/apa.html#appc-11-fm2xml
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> OpenDocument files are always encoded in UTF-8.
>> --8<---------------cut here---------------end--------------->8---
>>
>> Para 2 of
>> http://books.evc-cit.info/odbook/apa.html#xml-other-char-encodings-section
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> XML 1.0 allows a document to be encoded in any character set registered
>> with the Internet Assigned Numbers Authority (IANA). European documents
>> are commonly encoded in one of the ISO Latin character sets, such as
>> ISO-8859-1. Japanese documents commonly use Shift-JIS, and Chinese
>> documents use GB2312 and Big 5.
>> --8<---------------cut here---------------end--------------->8---
>>
>> Para 4 of
>> http://books.evc-cit.info/odbook/apa.html#xml-other-char-encodings-section
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> XML processors are not required by the XML 1.0 specification to support
>> any more than UTF-8 and UTF-16, but most commonly support other
>> encodings, such as US-ASCII and ISO-8859-1.
>> --8<---------------cut here---------------end--------------->8---
>>
>>
>> [Notes from XMLmind XSL-FO Converter]
>>
>>
>> XFC supports outputting of content.xml and styles.xml in UTF-8 as well
>> as ISO-8859-1.
>>
>> http://xml.web.cern.ch/XML/www.xmlmind.com/xfc_perso_java-4_4_0/doc/user/command_line_java.html
>>
>> says
>>
>> ,---- [see outputEncoding section]
>> | For OpenDocument output (.odt), this option specifies the encoding of
>> | XML content (files styles.xml and content.xml) in the output
>> | document. All encodings available in the current JVM are supported. The
>> | option value may be either the encoding name (e.g. ISO8859_1) or the
>> | charset name (e.g. ISO-8859-1). The default value is UTF8.
>> `----
>>
>> --
> -- 
--=-=-=--