From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jambunathan K <kjambunathan@gmail.com>
Subject: Re: ODT Charset/Encoding issues (was question about ODT export
 behavior)
Date: Mon, 18 Jul 2011 00:43:04 +0530
Message-ID: <81fwm43em7.fsf@gmail.com>
References: <ivk627$j0n$1@dough.gmane.org> <817h7mce7q.fsf@gmail.com>
	<81oc0yaqes.fsf@gmail.com> <4E1E91D3.70800@diplan.de>
	<87wrfkx3wo.fsf@gnu.org> <81oc0w9jie.fsf@gmail.com>
	<loom.20110715T221004-235@post.gmane.org>
	<81r55qx9uf.fsf_-_@gmail.com>
	<CAK_Tu5kA7Ln3kc501gdjJJ4Cxmx6=z5wW+3r-6avxYF2=5Xg3g@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Return-path: <emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([140.186.70.92]:56535)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kjambunathan@gmail.com>) id 1QiWmU-0005DP-V3
	for emacs-orgmode@gnu.org; Sun, 17 Jul 2011 15:13:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kjambunathan@gmail.com>) id 1QiWmQ-0007fa-3a
	for emacs-orgmode@gnu.org; Sun, 17 Jul 2011 15:13:34 -0400
Received: from mail-pz0-f43.google.com ([209.85.210.43]:35912)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kjambunathan@gmail.com>) id 1QiWmP-0007fU-QY
	for emacs-orgmode@gnu.org; Sun, 17 Jul 2011 15:13:30 -0400
Received: by pzk1 with SMTP id 1so3786826pzk.30
	for <emacs-orgmode@gnu.org>; Sun, 17 Jul 2011 12:13:28 -0700 (PDT)
In-Reply-To: <CAK_Tu5kA7Ln3kc501gdjJJ4Cxmx6=z5wW+3r-6avxYF2=5Xg3g@mail.gmail.com>
	(Renzo Been's message of "Sun, 17 Jul 2011 16:12:09 +0200")
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=unsubscribe>
List-Archive: </archive/html/emacs-orgmode>
List-Post: <mailto:emacs-orgmode@gnu.org>
List-Help: <mailto:emacs-orgmode-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=subscribe>
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
To: Renzo Been <swangdoodles@gmail.com>, Christian Moe <mail@christianmoe.com>
Cc: emacs-orgmode@gnu.org

--=-=-=
Content-Type: text/plain


Hello Renzo & Christian

Thanks for the test files and for sharing your views on this issue. With
the attached patch I can export the test files successfully.

The attached patch ensures that the component xml files created by the
odt exporter are always utf-8 encoded, irrespective of the coding system
used by the Org buffer.
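
The underlying Emacs mechanism can be sketched as below. This is only an
illustration, not the code in the patch: `my-write-utf-8-xml' is a
made-up helper, and I am assuming the exporter ends up binding
`coding-system-for-write' around the point where each component file is
written. Binding that variable makes the write ignore the coding system
the buffer would otherwise use.

```elisp
;; Hypothetical helper for illustration; not part of org-odt.el.
(defun my-write-utf-8-xml (contents file)
  "Write the string CONTENTS to FILE, always encoded as UTF-8."
  ;; `coding-system-for-write' overrides `buffer-file-coding-system'
  ;; for any write performed while the binding is in effect.
  (let ((coding-system-for-write 'utf-8))
    (with-temp-file file
      (insert contents))))

;; Example call (the file name is made up):
;; (my-write-utf-8-xml "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" "content.xml")
```

With this, a non-ascii character such as e-acute ends up as the two
bytes C3 A9 on disk even when the Org buffer itself is iso-8859-1.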

Jambunathan K.


--=-=-=
Content-Type: text/plain
Content-Disposition: inline;
 filename=0001-org-odt-Correctly-export-iso-8859-1-files-with-non-a.patch
Content-Description: 0001-org-odt-Correctly-export-iso-8859-1-files-with-non-a.patch

>From 1ec1e3c9248387ab2daabe7b9c7cc4a3c42b4998 Mon Sep 17 00:00:00 2001
From: Jambunathan K <kjambunathan@gmail.com>
Date: Mon, 18 Jul 2011 00:26:41 +0530
Subject: [PATCH] org-odt: Correctly export iso-8859-1 files with non-ascii chars

* contrib/lisp/org-odt.el (org-odt-get): Set
CODING-SYSTEM-FOR-WRITE and CODING-SYSTEM-FOR-SAVE to 'utf-8
irrespective of buffer-file-coding-system.

Fixes issue reported by Renzo Been in the following post.
http://lists.gnu.org/archive/html/emacs-orgmode/2011-07/msg00795.html
---
 contrib/lisp/org-odt.el |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/contrib/lisp/org-odt.el b/contrib/lisp/org-odt.el
index f3a4067..bd2ea33 100644
--- a/contrib/lisp/org-odt.el
+++ b/contrib/lisp/org-odt.el
@@ -1380,6 +1380,8 @@ MAY-INLINE-P allows inlining it as an image."
     (PLAIN-TEXT-MAP '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (TABLE-FIRST-COLUMN-AS-LABELS nil)
     (FOOTNOTE-SEPARATOR (org-lparse-format 'FONTIFY "," 'superscript))
+    (CODING-SYSTEM-FOR-WRITE 'utf-8)
+    (CODING-SYSTEM-FOR-SAVE 'utf-8)
     (t (error "Unknown property: %s"  what))))
 
 (defun org-odt-parse-label (label)
-- 
1.7.2.3


--=-=-=
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable


> Hi Jambunathan,
>
> See comments below.
>
> Ciao,
> Renzo
> P.S. I'm on a camping-site right now, so I do not have good Internet acce=
ss...
>
> On 16 July 2011 22:13, Jambunathan K <kjambunathan@gmail.com> wrote:
>>
>> Renzo
>>
>>> I just want to add one point that I did not find in the org-manual. =A0=
I tested
>>> some of my org-files and exported them to the OpenOffice format. When I=
 tried to
>>> open these documents in OpenOffice, they were corrupt and could not be =
opened.
>>>
>>> I soon found out why. If you want to export an org-mode file to .odt, y=
ou need
>>> to explicitly set the file encoding to UTF-8 (I usually use iso-8859-1 =
encoding
>>> for my files), like:
>>> #-*- mode: org; coding: utf-8; -*-
>>> After that OpenOffice could open the files without any problems.
>>
>> I use English for communication and I have to admit that I have zero
>> understanding of things like character sets, encodings etc.
>
> As for communicating; I'm from the border regions of The Netherlands, Bel=
gium
> and Germany... And therefore I'm multilingual, and often need to type wor=
ds
> with accents.
>
>> Thanks for the above note. I surely see this is a bug but my poor
>> understanding prevents me from quantifying it further.
>
> Well... I would not really see it as a bug... As long as it is mentioned =
in the
> documentation, that org-file encodings other than utf-8 could result in =
corrupt
> output-files.
>
>> Could you please send me a minimal iso-8859-1 test.org file and the
>> associated corrupted test.odt file? I will look into this issue.
>
> See attachment. I can only send you the org file, because I do not have a=
ccess
> to a working Emacs at the moment...
>
>> 1. Do you have any specific requirement on how the component xml files
>> =A0 be encoded? A cursory look at the odt exporter suggests that it could
>> =A0 actually be emitting xml files in iso-8859-1 format while wrongly
>> =A0 claiming UTF-8 encoding as below
>>
>> --8<---------------cut here---------------start------------->8---
>> <?xml version=3D"1.0" encoding=3D"UTF-8"?>
>> --8<---------------cut here---------------end--------------->8---
>>
>> 2. Should the xml file always be emitted in UTF-8 irrespective of how
>> =A0 the original Org file is encoded?
>
> Yes that would seem a good solution to me... If the odt-exporter checks t=
he
> files encoding, and then changes the encoding to utf-8 (maybe using a tem=
porary
> buffer?) before the actual exporting, then there would be no further
> problems...
>
> As for the idea that the OpenOffice xml can actually be in another encodi=
ng
> than utf-8; I do not know how much work that would be for you, to impleme=
nt in
> the odt-exporter. It might be too much effort...
> Also I don't know if such an OpenOffice document will open with no proble=
ms in
> all OpenOffice applications.
>
>> [Notes to Self]
>> [Notes from odbook]
>>
>> Para 3 of http://books.evc-cit.info/odbook/apa.html#appc-11-fm2xml
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> OpenDocument files are always encoded in UTF-8.
>> --8<---------------cut here---------------end--------------->8---
>>
>> Para 2 of
>> http://books.evc-cit.info/odbook/apa.html#xml-other-char-encodings-secti=
on
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> XML 1.0 allows a document to be encoded in any character set registered
>> with the Internet Assigned Numbers Authority (IANA). European documents
>> are commonly encoded in one of the ISO Latin character sets, such as
>> ISO-8859-1. Japanese documents commonly use Shift-JIS, and Chinese
>> documents use GB2312 and Big 5.
>> --8<---------------cut here---------------end--------------->8---
>>
>> Para 4 of
>> http://books.evc-cit.info/odbook/apa.html#xml-other-char-encodings-secti=
on
>> says
>>
>> --8<---------------cut here---------------start------------->8---
>> XML processors are not required by the XML 1.0 specification to support
>> any more than UTF-8 and UTF-16, but most commonly support other
>> encodings, such as US-ASCII and ISO-8859-1.
>> --8<---------------cut here---------------end--------------->8---
>>
>>
>> [Notes from XMLmind XSL-FO Converter]
>>
>>
>> XFC supports outputting of content.xml and styles.xml in UTF-8 as well
>> as ISO-8859-1.
>>
>> http://xml.web.cern.ch/XML/www.xmlmind.com/xfc_perso_java-4_4_0/doc/user=
/command_line_java.html
>>
>> says
>>
>> ,---- [see outputEncoding section]
>> | For OpenDocument output (.odt), this option specifies the encoding of
>> | XML content (files styles.xml and content.xml) in the output
>> | document. All encodings available in the current JVM are supported. The
>> | option value may be either the encoding name (e.g. ISO8859_1) or the
>> | charset name (e.g. ISO-8859-1). The default value is UTF8.
>> `----
>>
>> --
>

--=20

--=-=-=--