From: John Kitchin
Subject: html to org-mode
Date: Fri, 3 Jan 2014 21:40:14 -0500
To: "emacs-orgmode@gnu.org"
List-Id: "General discussions about Org-mode."

Hi everyone,

I was playing around with org-rss today, and it is pretty cool. I would like to customize the way the subheading bodies look though, primarily to unescape some html things like &lt;, to get rid of all the html tags, convert <a ...> to org-mode links, to download <img ...> so they can be displayed, etc...

For example, a body of an rss entry looks like:

    <title>Philip Herron: Cython Book</title>
    <guid>http://redbrain.co.uk/?p=147</guid>
    <link>http://redbrain.co.uk/cython-book/</link>
    <description><p>Hey all i thought i should really share that i actually wrote a book on Cython. The book has detailed examples and even shows you how you can extend native C/C++ applications in python by doing it for Tmux. <a href="http://bit.ly/195ahQs">http://bit.ly/195ahQs</a></p>
    <p><a href="http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg"><img class="aligncenter size-full wp-image-148" alt="photo" src="http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg" width="640" height="480" /></a>The code can be found: <a href="https://github.com/redbrain/cython-book">https://github.com/redbrain/cython-book</a></p></description>
    <pubDate>Tue, 10 Dec 2013 14:45:08 +0000</pubDate>

I would like this simplified to something like:

    Philip Herron: Cython Book
    http://redbrain.co.uk/?p=147
    http://redbrain.co.uk/cython-book/

    Hey all i thought i should really share that i actually wrote a book on Cython. The book has detailed examples and even shows you how you can extend native C/C++ applications in python by doing it for Tmux. http://bit.ly/195ahQs

    [[feed-images/photo.jpg]]

    The code can be found: https://github.com/redbrain/cython-book

Basically, get the html code as close to org as reasonable. I found a way to get an html parse tree (libxml-parse-html-region start end), but I can't figure out how to convert that to the text I want.

Has anyone done anything like this?

John

-----------------------------------
John Kitchin
Associate Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu
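For illustration, here is a minimal sketch of one way the tree returned by libxml-parse-html-region could be walked and turned into Org-style text. It assumes an Emacs built with libxml2 support. The names my/html-node-to-org and my/html-string-to-org are hypothetical helpers invented for this example and are not part of Org. The sketch leaves images as links to their remote URLs and does not download them into a local directory such as feed-images/.

```emacs-lisp
;; Sketch only.  `my/html-node-to-org' and `my/html-string-to-org' are
;; hypothetical helper names, not part of Org.
(require 'subr-x) ; for `string-trim' on older Emacs versions

(defun my/html-node-to-org (node)
  "Return an Org-flavoured string for NODE.
NODE is either a string (a text node) or a list of the form
\(TAG ATTRIBUTES . CHILDREN), as produced by
`libxml-parse-html-region'."
  (if (stringp node)
      node
    (let* ((tag (car node))
           (attrs (cadr node))
           (text (mapconcat #'my/html-node-to-org (cddr node) "")))
      (cond
       ;; Drop nodes whose content should never reach the Org text.
       ((memq tag '(comment script style)) "")
       ;; Paragraphs end with a blank line.
       ((eq tag 'p) (concat text "\n\n"))
       ((eq tag 'br) "\n")
       ;; Images become bare links to the remote URL; downloading
       ;; the file is left out of this sketch.
       ((eq tag 'img)
        (let ((src (cdr (assq 'src attrs))))
          (if src (format "[[%s]]" src) "")))
       ((eq tag 'a)
        (let ((href (cdr (assq 'href attrs))))
          (cond
           ((null href) text)
           ;; The child is already an Org link (e.g. a wrapped image).
           ((string-prefix-p "[[" text) text)
           ;; Link text repeats the URL: a plain URL is enough.
           ((or (string= text "") (string= text href)) href)
           (t (format "[[%s][%s]]" href text)))))
       ;; Any other tag: keep only the text of its children.
       (t text)))))

(defun my/html-string-to-org (html)
  "Parse the string HTML and return it as Org-flavoured text."
  (with-temp-buffer
    (insert html)
    (string-trim
     (my/html-node-to-org
      (libxml-parse-html-region (point-min) (point-max))))))
```

The parser already decodes entities such as &lt; in text nodes, so the walker only has to decide what each tag turns into. Handling more tags (lists, emphasis, headings) would mean adding further branches to the cond.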