From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Maus
Subject: Re: org-feed XML entities and character encoding
Date: Fri, 13 Aug 2010 17:59:19 +0200
Message-ID: <8762ze7bc8.wl%dmaus@ictsoc.de>
References: <4C61AF9E.7040903@alumni.ethz.ch>
In-Reply-To: <4C61AF9E.7040903@alumni.ethz.ch>
Mime-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: multipart/mixed; boundary="===============1047680684=="
List-Id: "General discussions about Org-mode."
To: Michael Brand
Cc: julien@danjou.info, Org Mode, zwz

--===============1047680684==
Content-Type: multipart/signed;
 boundary="pgp-sign-Multipart_Fri_Aug_13_17:59:18_2010-1";
 micalg=pgp-sha256; protocol="application/pgp-signature"
Content-Transfer-Encoding: 7bit

--pgp-sign-Multipart_Fri_Aug_13_17:59:18_2010-1
Content-Type: multipart/mixed; boundary="Multipart_Fri_Aug_13_17:59:18_2010-1"

--Multipart_Fri_Aug_13_17:59:18_2010-1
Content-Type: text/plain; charset=US-ASCII

Michael Brand wrote:
>Hi all,

>org-feed is becoming very useful for me, so far to manage the
>episodes of podcasts. Now I have a patch and a request for help.

>1. patch for an issue with XML entities
>=======================================

>I found that some XML entities in my feeds are not substituted. The
>comments of two recent org-feed.el commits by David Maus
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>and
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>lead me to the thread
>http://thread.gmane.org/gmane.emacs.orgmode/26352
>and invited me to replace org-feed-unescape with xml-substitute-special
>which converts more XML entities. The resulting patch below helps for
>me but of course I would like it to be reviewed by an experienced elisp
>programmer and org-feed user before being applied.

This patch is fine, and `xml-substitute-special' is the right thing to
use (i.e. it converts numeric character references, too).

>2. request for help about an issue with multibyte character encoding
>====================================================================

>There is an issue with multibyte characters that appear in the input
>as unescaped, multibyte encoded characters (not as XML entities, as XML
>entities multibyte characters are simply substituted correctly). I
>looked for an example with a character encoding specified in the first
>line of the XML feed like
>
>and found one here:
>http://www.openscreencast.de/blog/rss.xml

The problem with this feed is that it contains raw (still encoded)
multibyte characters that must be decoded before they can be properly
inserted in the target buffer. The attached patch does this by
explicitly decoding new entries according to their detected character
encoding.

Btw., a helpful introduction to the topic is

  The Absolute Minimum Every Software Developer Absolutely,
  Positively Must Know About Unicode and Character Sets (No Excuses!)
  by Joel Spolsky
  http://www.joelonsoftware.com/articles/Unicode.html

Best,
  -- David
--
OpenPGP... 0x99ADB83B5A4478E6
Jabber.... dmjena@jabber.org
Email..... dmaus@ictsoc.de

--Multipart_Fri_Aug_13_17:59:18_2010-1
Content-Type: text/plain; type=patch; charset=US-ASCII
Content-Disposition: attachment; filename="0001-Decode-entry-according-to-its-character-encoding.patch"
Content-Transfer-Encoding: 7bit

>From 9e4885c9f1b987fb04c934f17dceb1a5f2bb3544 Mon Sep 17 00:00:00 2001
From: David Maus <dmaus@ictsoc.de>
Date: Fri, 13 Aug 2010 17:26:47 +0200
Subject: [PATCH] Decode entry according to its character encoding

* org-feed.el (org-feed-format-entry): Decode entry according to its
character encoding.
---
 lisp/org-feed.el |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/lisp/org-feed.el b/lisp/org-feed.el
index 073d344..984f896 100644
--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -553,7 +553,8 @@ If that property is already present, nothing changes."
 		  (setq tmp (org-feed-make-indented-block
 			     tmp (org-get-indentation))))))
 	    (replace-match tmp t t))))
-	(buffer-string)))))
+	(decode-coding-string
+	 (buffer-string) (detect-coding-region (point-min) (point-max) t))))))
 
 (defun org-feed-make-indented-block (s n)
   "Add indentation of N spaces to a multiline string S."
-- 
1.7.1

--Multipart_Fri_Aug_13_17:59:18_2010-1--

--pgp-sign-Multipart_Fri_Aug_13_17:59:18_2010-1
Content-Type: application/pgp-signature
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iF4EABEIAAYFAkxla9YACgkQma24O1pEeOZX0AEAz6GDjw/z5fo2SGFoTGaHK9uR
pe5qLeIqMzmcZYTwhKUA/1LBh4ur3HvvC5+ngFagjp86cPAum23DrLwpB2uqWRsX
=8EsA
-----END PGP SIGNATURE-----

--pgp-sign-Multipart_Fri_Aug_13_17:59:18_2010-1--

--===============1047680684==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Emacs-orgmode mailing list
Please use `Reply All' to send replies to the list.
Emacs-orgmode@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-orgmode

--===============1047680684==--
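
The two steps discussed in the message (decode the entry text according
to its character encoding, then substitute XML entity and numeric
character references) can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the org-feed code:
`my/decode-feed-text' and the sample string are made up for this
example, and the raw entry is assumed to be available as an undecoded
(unibyte) string.

```elisp
;; Sketch only: `my/decode-feed-text' is a hypothetical helper and is
;; not part of org-feed.el.
(require 'xml)

(defun my/decode-feed-text (raw &optional coding)
  "Decode the undecoded byte string RAW, then substitute XML references.
CODING is the coding system to decode with.  When it is nil, the
coding system is guessed with `detect-coding-string', which is a
heuristic and may guess wrong."
  (let* ((coding (or coding (detect-coding-string raw t)))
         ;; Turn the raw bytes into proper characters first ...
         (text (decode-coding-string raw coding)))
    ;; ... then replace &amp;, &#228; and similar references.
    (xml-substitute-special text)))

;; "\303\244" is the two-byte UTF-8 encoding of U+00E4.
(my/decode-feed-text "Tr\303\244ume &amp; Sch&#228;ume" 'utf-8)
;; => "Träume & Schäume"
```

Decoding comes before `xml-substitute-special' so that characters
produced from numeric references are not mixed with undecoded bytes.
When no coding system is passed, the guess made by
`detect-coding-string' depends on the current language environment, so
passing the encoding declared by the feed is probably the safer choice
where it is known.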