From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Brand
Subject: org-feed XML entities and character encoding
Date: Tue, 10 Aug 2010 21:59:26 +0200
Message-ID: <4C61AF9E.7040903@alumni.ethz.ch>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Received: from [140.186.70.92] (port=56736 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Oiuz0-0006id-9v
	for emacs-orgmode@gnu.org; Tue, 10 Aug 2010 15:59:35 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from ) id 1Oiuyy-000452-IF
	for emacs-orgmode@gnu.org; Tue, 10 Aug 2010 15:59:34 -0400
Received: from mail01.solnet.ch ([212.101.4.135]:49606)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from ) id 1Oiuyy-00044R-3d
	for emacs-orgmode@gnu.org; Tue, 10 Aug 2010 15:59:32 -0400
List-Id: "General discussions about Org-mode."
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
To: dmaus@ictsoc.de, Org Mode
Cc: julien@danjou.info, zwz

Hi all,

org-feed is becoming very useful for me, so far for managing podcast
episodes. Now I have a patch and a request for help.

1. patch for an issue with XML entities
=======================================

I found that some XML entities in my feeds are not substituted. The
comments of two recent org-feed.el commits by David Maus
  http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
and
  http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
led me to the thread
  http://thread.gmane.org/gmane.emacs.orgmode/26352
and prompted me to replace org-feed-unescape with
xml-substitute-special, which converts more XML entities.
The resulting patch below works for me, but of course I would like it
to be reviewed by an experienced elisp programmer and org-feed user
before it is applied.

2. request for help about an issue with multibyte character encoding
====================================================================

There is an issue with multibyte characters that appear in the input
as unescaped, multibyte-encoded characters (not as XML entities; when
written as XML entities, multibyte characters are substituted
correctly). I looked for an example with a character encoding
specified in the XML declaration on the first line of the feed and
found one here:
  http://www.openscreencast.de/blog/rss.xml
The W3C validator http://validator.w3.org seems to be happy with this
feed, but when it is fed into a feeds.org file, the unescaped,
multibyte-encoded characters, e.g. those of the title
`Screencast 076 [...]', come out garbled, even with
`coding: utf-8-unix' in the first line of feeds.org.

Can someone please help to get this issue resolved? If it is easily
possible, as I expect it to be, then ideally for all character
encodings supported by Emacs. I would even like it if UTF-8 feeds such
as
  http://pod.drs.ch/world_music_special_mpx.xml
that do not specify a character encoding worked too.

Thanks - Michael

------------------------------------------------------------
--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -99,6 +99,7 @@
 (declare-function xml-get-children "xml" (node child-name))
 (declare-function xml-get-attribute "xml" (node attribute))
 (declare-function xml-get-attribute-or-nil "xml" (node attribute))
+(declare-function xml-substitute-special "xml" (string))
 (defvar xml-entity-alist)
 
 (defgroup org-feed nil
@@ -269,17 +270,6 @@
 (defvar org-feed-buffer "*Org feed*"
   "The buffer used to retrieve a feed.")
 
-(defun org-feed-unescape (s)
-  "Unescape protected entities in S."
-  (require 'xml)
-  (let ((re (concat "&\\("
-                    (mapconcat 'car xml-entity-alist "\\|")
-                    "\\);")))
-    (while (string-match re s)
-      (setq s (replace-match
-               (cdr (assoc (match-string 1 s) xml-entity-alist)) nil nil s)))
-    s))
-
 ;;;###autoload
 (defun org-feed-update-all ()
   "Get inbox items from all feeds in `org-feed-alist'."
@@ -613,6 +603,7 @@
 
 (defun org-feed-parse-rss-entry (entry)
   "Parse the `:item-full-text' field for xml tags and create new properties."
+  (require 'xml)
   (with-temp-buffer
     (insert (plist-get entry :item-full-text))
     (goto-char (point-min))
@@ -620,7 +611,7 @@
                               nil t)
       (setq entry (plist-put entry
                              (intern (concat ":" (match-string 1)))
-                             (org-feed-unescape (match-string 2)))))
+                             (xml-substitute-special (match-string 2)))))
     (goto-char (point-min))
     (unless (re-search-forward "isPermaLink[ \t]*=[ \t]*\"false\"" nil t)
       (setq entry (plist-put entry :guid-permalink t))))
@@ -633,7 +624,6 @@
 
 The `:item-full-text' property actually contains the sexp
 formatted as a string, not the original XML data."
-  (require 'xml)
   (with-current-buffer buffer
     (widen)
     (let ((feed (car (xml-parse-region (point-min) (point-max)))))
@@ -654,7 +644,7 @@
                                       'href)))
               ;; Add as :title.
               (setq entry (plist-put entry :title
-                                     (org-feed-unescape
+                                     (xml-substitute-special
                                       (car (xml-node-children
                                             (car (xml-get-children xml 'title)))))))
               (let* ((content (car (xml-get-children xml 'content)))
@@ -664,12 +654,12 @@
                  ((string= type "text")
                   ;; We like plain text.
                   (setq entry (plist-put entry :description
-                                         (org-feed-unescape
+                                         (xml-substitute-special
                                           (car (xml-node-children content))))))
                  ((string= type "html")
                   ;; TODO: convert HTML to Org markup.
                   (setq entry (plist-put entry :description
-                                         (org-feed-unescape
+                                         (xml-substitute-special
                                           (car (xml-node-children content))))))
                  ((string= type "xhtml")
                   ;; TODO: convert XHTML to Org markup.
------------------------------------------------------------
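
For illustration only, and not part of the patch: a minimal sketch of
what xml-substitute-special does with a string that mixes a numeric
character reference and a named entity. The sample string and the
`my/' names are hypothetical, and the behavior shown assumes a recent
Emacs whose xml.el also substitutes numeric character references.

```elisp
(require 'xml)

;; Hypothetical sample text, not taken from any real feed.
(defvar my/sample-title "Caf&#233; &amp; Bar"
  "Sample with one numeric character reference and one named entity.")

(defun my/unescape-title (s)
  "Return S with XML entity and character references substituted.
This is only a thin wrapper around `xml-substitute-special'."
  (xml-substitute-special s))

;; Show the substituted title in the echo area (or on stderr in batch mode).
(message "%s" (my/unescape-title my/sample-title))
```

The removed org-feed-unescape only looked up names found in
xml-entity-alist, so a numeric reference such as &#233; would have
passed through unchanged.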
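
As a possible starting point for the encoding question in section 2,
here is a rough sketch, not a tested fix for org-feed. It assumes the
feed has been retrieved as raw bytes (a unibyte string), reads the
encoding named in the XML declaration, and falls back to UTF-8 when
none is declared or Emacs does not know the name. The helper
`my/feed-decode' is hypothetical and does not exist in org-feed.el.

```elisp
(defun my/feed-decode (raw)
  "Decode the unibyte string RAW according to its XML declaration.
Fall back to UTF-8 when no usable encoding is declared.
Hypothetical helper, not part of org-feed.el."
  (let* ((case-fold-search t)
         ;; Pick the encoding name out of <?xml ... encoding="..."?>.
         (name (and (string-match
                     "<\\?xml[^>]*encoding[ \t]*=[ \t]*[\"']\\([^\"']+\\)[\"']"
                     raw)
                    (match-string 1 raw)))
         (coding (and name (intern (downcase name)))))
    ;; `coding-system-p' also returns t for nil, so test for nil first.
    (decode-coding-string
     raw
     (if (and coding (coding-system-p coding)) coding 'utf-8))))
```

Mapping the declared name to a coding system with intern and downcase
is an assumption that holds for common names such as UTF-8 and
ISO-8859-1; other names may need an explicit lookup table.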