emacs-orgmode@gnu.org archives
* org-feed XML entities and character encoding
@ 2010-08-10 19:59 Michael Brand
  2010-08-13 15:59 ` David Maus
  0 siblings, 1 reply; 3+ messages in thread
From: Michael Brand @ 2010-08-10 19:59 UTC (permalink / raw)
  To: dmaus, Org Mode; +Cc: julien, zwz

Hi all,

org-feed is becoming very useful for me, so far to manage the
episodes of podcasts. Now I have a patch and a request for help.

1. patch for an issue with XML entities
=======================================

I found that some XML entities in my feeds are not substituted. The
comments of two recent org-feed.el commits by David Maus
http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
and
http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
led me to the thread
http://thread.gmane.org/gmane.emacs.orgmode/26352
and prompted me to replace org-feed-unescape with xml-substitute-special,
which converts more XML entities. The resulting patch below works for
me, but of course I would like it to be reviewed by an experienced elisp
programmer and org-feed user before it is applied.
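
As a rough illustration of the difference (a sketch only, not part of
the patch; `example-unescape-title' and the sample string are made up
for this example), xml-substitute-special from xml.el substitutes named
entities as well as numeric character references:

```elisp
(require 'xml)

;; Hypothetical helper, named only for this illustration.
(defun example-unescape-title (s)
  "Return S with XML entity and character references substituted."
  (xml-substitute-special s))

;; The sample string mixes a numeric character reference (&#233;)
;; with two named entities (&amp; and &lt;).
(example-unescape-title "Caf&#233; &amp; Bar: 1 &lt; 2")
```

org-feed-unescape only knows the names listed in xml-entity-alist, so
a reference like &#233; would be left untouched by it.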

2. request for help about an issue with multibyte character encoding
====================================================================

There is an issue with multibyte characters that appear in the input
as unescaped, multibyte-encoded characters (not as XML entities;
multibyte characters written as XML entities are substituted
correctly). I looked for an example with a character encoding
specified in the first line of the XML feed, like
<?xml version="1.0" encoding="utf-8"?>
and found one here:
http://www.openscreencast.de/blog/rss.xml

The W3C validator
http://validator.w3.org
seems to be happy with this feed, but when it is fetched into a
feeds.org file the unescaped, multibyte-encoded characters, e.g. in
the title `Screencast 076 [...]', come out garbled, even with
`coding: utf-8-unix' in the first line of feeds.org. Can someone
please help to get this issue resolved? If it is easily possible, as
I expect it to be, then ideally for all character encodings supported
by Emacs. I would even like it if UTF-8 feeds like
http://pod.drs.ch/world_music_special_mpx.xml
that do not have the character encoding specified worked too.

Thanks

- Michael

------------------------------------------------------------
--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -99,6 +99,7 @@
  (declare-function xml-get-children "xml" (node child-name))
  (declare-function xml-get-attribute "xml" (node attribute))
  (declare-function xml-get-attribute-or-nil "xml" (node attribute))
+(declare-function xml-substitute-special "xml" (string))
  (defvar xml-entity-alist)

  (defgroup org-feed  nil
@@ -269,17 +270,6 @@
  (defvar org-feed-buffer "*Org feed*"
    "The buffer used to retrieve a feed.")

-(defun org-feed-unescape (s)
-  "Unescape protected entities in S."
-  (require 'xml)
-  (let ((re (concat "&\\("
-		    (mapconcat 'car xml-entity-alist "\\|")
-		    "\\);")))
-    (while (string-match re s)
-      (setq s (replace-match
-	       (cdr (assoc (match-string 1 s) xml-entity-alist)) nil nil s)))
-    s))
-
  ;;;###autoload
  (defun org-feed-update-all ()
    "Get inbox items from all feeds in `org-feed-alist'."
@@ -613,6 +603,7 @@

  (defun org-feed-parse-rss-entry (entry)
    "Parse the `:item-full-text' field for xml tags and create new properties."
+  (require 'xml)
    (with-temp-buffer
      (insert (plist-get entry :item-full-text))
      (goto-char (point-min))
@@ -620,7 +611,7 @@
  			      nil t)
        (setq entry (plist-put entry
  			     (intern (concat ":" (match-string 1)))
-			     (org-feed-unescape (match-string 2)))))
+			     (xml-substitute-special (match-string 2)))))
      (goto-char (point-min))
      (unless (re-search-forward "isPermaLink[ \t]*=[ \t]*\"false\"" nil t)
        (setq entry (plist-put entry :guid-permalink t))))
@@ -633,7 +624,6 @@

  The `:item-full-text' property actually contains the sexp
  formatted as a string, not the original XML data."
-  (require 'xml)
    (with-current-buffer buffer
      (widen)
      (let ((feed (car (xml-parse-region (point-min) (point-max)))))
@@ -654,7 +644,7 @@
  			    'href)))
      ;; Add <title/> as :title.
      (setq entry (plist-put entry :title
-			   (org-feed-unescape
+			   (xml-substitute-special
  			    (car (xml-node-children
  				  (car (xml-get-children xml 'title)))))))
      (let* ((content (car (xml-get-children xml 'content)))
@@ -664,12 +654,12 @@
  	 ((string= type "text")
  	  ;; We like plain text.
  	  (setq entry (plist-put entry :description
-				 (org-feed-unescape
+				 (xml-substitute-special
  				  (car (xml-node-children content))))))
  	 ((string= type "html")
  	  ;; TODO: convert HTML to Org markup.
  	  (setq entry (plist-put entry :description
-				 (org-feed-unescape
+				 (xml-substitute-special
  				  (car (xml-node-children content))))))
  	 ((string= type "xhtml")
  	  ;; TODO: convert XHTML to Org markup.
------------------------------------------------------------


* Re: org-feed XML entities and character encoding
  2010-08-10 19:59 org-feed XML entities and character encoding Michael Brand
@ 2010-08-13 15:59 ` David Maus
  2010-08-13 19:03   ` Michael Brand
  0 siblings, 1 reply; 3+ messages in thread
From: David Maus @ 2010-08-13 15:59 UTC (permalink / raw)
  To: Michael Brand; +Cc: julien, Org Mode, zwz


[-- Attachment #1.1.1: Type: text/plain, Size: 2195 bytes --]

Michael Brand wrote:
>Hi all,

>org-feed is becoming very useful for me, so far to manage the
>episodes of podcasts. Now I have a patch and a request for help.

>1. patch for an issue with XML entities
>=======================================

>I found that some XML entities in my feeds are not substituted. The
>comments of two recent org-feed.el commits by David Maus
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>and
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>led me to the thread
>http://thread.gmane.org/gmane.emacs.orgmode/26352
>and invited me to replace org-feed-unescape with xml-substitute-special
>which converts more XML entities. The resulting patch below helps for
>me but of course I would like it to be reviewed by an experienced elisp
>programmer and org-feed user before being applied.

This patch is fine and `xml-substitute-special' is the right thing to
do (i.e. convert numeric character references, too).

>2. request for help about an issue with multibyte character encoding
>====================================================================

>There is an issue with multibyte characters that appear in the input
>as unescaped, multibyte encoded characters (not as XML entities, as XML
>entities multibyte characters are simply substituted correctly). I
>looked for an example with a character encoding specified in the first
>line of the XML feed like
><?xml version="1.0" encoding="utf-8"?>
>and found one here:
>http://www.openscreencast.de/blog/rss.xml

The problem with this feed is that it contains raw, still-encoded
multibyte characters that must be decoded before they can be properly
inserted in the target buffer.

The attached patch does this by explicitly decoding new entries
according to their detected character encoding.
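
To show what decoding means here, a minimal sketch (not the patch
itself; `example-decode-entry' is a hypothetical helper, and the
coding system is fixed to utf-8 for the illustration, whereas the
patch detects it with detect-coding-region):

```elisp
;; Hypothetical helper, named only for this illustration.
(defun example-decode-entry (raw)
  "Decode the unibyte string RAW, assumed to hold UTF-8 bytes."
  (decode-coding-string raw 'utf-8))

;; Build the raw bytes by encoding a known string, then decode them
;; again.  The raw form has one element per byte, the decoded form
;; one element per character.
(let* ((raw (encode-coding-string "Caf\u00e9" 'utf-8))
       (text (example-decode-entry raw)))
  (list (length raw) (length text)))
```

Until the bytes are decoded, Emacs treats each byte of a multibyte
sequence as a separate character, which is why the titles come out
garbled.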

By the way, a helpful introduction to the topic is

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)

by Joel Spolsky

http://www.joelonsoftware.com/articles/Unicode.html

Best,
  -- David
--
OpenPGP... 0x99ADB83B5A4478E6
Jabber.... dmjena@jabber.org
Email..... dmaus@ictsoc.de

[-- Attachment #1.1.2: 0001-Decode-entry-according-to-its-character-encoding.patch --]
[-- Type: text/plain, Size: 935 bytes --]

From 9e4885c9f1b987fb04c934f17dceb1a5f2bb3544 Mon Sep 17 00:00:00 2001
From: David Maus <dmaus@ictsoc.de>
Date: Fri, 13 Aug 2010 17:26:47 +0200
Subject: [PATCH] Decode entry according to its character encoding

* org-feed.el (org-feed-format-entry): Decode entry according to its
character encoding.
---
 lisp/org-feed.el |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/lisp/org-feed.el b/lisp/org-feed.el
index 073d344..984f896 100644
--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -553,7 +553,8 @@ If that property is already present, nothing changes."
 		  (setq tmp (org-feed-make-indented-block
 			     tmp (org-get-indentation))))))
 	    (replace-match tmp t t))))
-	(buffer-string)))))
+	(decode-coding-string
+	 (buffer-string) (detect-coding-region (point-min) (point-max) t))))))
 
 (defun org-feed-make-indented-block (s n)
   "Add indentation of N spaces to a multiline string S."
-- 
1.7.1



* Re: org-feed XML entities and character encoding
  2010-08-13 15:59 ` David Maus
@ 2010-08-13 19:03   ` Michael Brand
  0 siblings, 0 replies; 3+ messages in thread
From: Michael Brand @ 2010-08-13 19:03 UTC (permalink / raw)
  To: David Maus; +Cc: julien, Org Mode, zwz

Hi David

On 10-08-13 17:59 , David Maus wrote:
>> 2. request for help about an issue with multibyte character encoding
>> ====================================================================
>>
>> There is an issue with multibyte characters that appear in the input
>> as unescaped, multibyte encoded characters (not as XML entities, as XML
>> entities multibyte characters are simply substituted correctly). I
>> looked for an example with a character encoding specified in the first
>> line of the XML feed like
>> <?xml version="1.0" encoding="utf-8"?>
>> and found one here:
>> http://www.openscreencast.de/blog/rss.xml
>> [...]
>
> The problem with this feed is, that it contains raw unicode characters
> that must be converted to utf-8 before they can be properly inserted
> in the target buffer.
>
> Attached patch does this by explicitly decoding new entries according
> to their detected character encoding.
>
> Btw.: Helpful introduction to the topic gives
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
>
> by Joel Spolsky
>
> http://www.joelonsoftware.com/articles/Unicode.html

Thank you very much for your patch, it resolves this issue with
org-feed.el as expected. I tested your patch with the two feeds
http://www.openscreencast.de/blog/rss.xml  (declared utf-8)
and
http://pod.drs.ch/world_music_special_mpx.xml  (not declared utf-8)
that I described earlier and with a dozen other feeds, all with
character encoding utf-8.

Michael

