From mboxrd@z Thu Jan  1 00:00:00 1970
From: John Kitchin <jkitchin@andrew.cmu.edu>
Subject: html to org-mode
Date: Fri, 3 Jan 2014 21:40:14 -0500
Message-ID: <CAJ51ETrsyuAwpYOvJ2yqYsVirJdX4qcmkmaRVROj-mm4C3LF_g@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b5d3610494c8504ef1bf23c
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
To: "emacs-orgmode@gnu.org" <emacs-orgmode@gnu.org>

--047d7b5d3610494c8504ef1bf23c
Content-Type: text/plain; charset=ISO-8859-1

Hi everyone,

I was playing around with org-rss today, and it is pretty cool. I would
like to customize the way the subheading bodies look, though, primarily to
unescape HTML entities like &lt;, get rid of all the HTML tags, convert
<a ...> tags to org-mode links, download <img ...> images so they can be
displayed, etc.

For example, the body of an RSS entry looks like:

     <title>Philip Herron: Cython Book</title>     <guid>
http://redbrain.co.uk/?p=147</guid>     <link>
http://redbrain.co.uk/cython-book/</link>     <description><p>Hey all i
thought i should really share that i actually wrote a book on Cython. The
book has detailed examples and even shows you how you can extend native
C/C++ applications in python by doing it for Tmux. <a href="
http://bit.ly/195ahQs">http://bit.ly/195ahQs</a></p> <p><a href="
http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg"><img
class="aligncenter size-full wp-image-148" alt="photo" src="
http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg" width="640"
height="480" /></a>The code can be found: <a href="
https://github.com/redbrain/cython-book">
https://github.com/redbrain/cython-book</a></p></description>
<pubDate>Tue, 10 Dec 2013 14:45:08 +0000</pubDate>

I would like this simplified to something like:
Philip Herron: Cython Book

http://redbrain.co.uk/?p=147

http://redbrain.co.uk/cython-book/
Hey all i thought i should really share that i actually wrote a book on
Cython. The book has detailed examples and even shows you how you can
extend native C/C++ applications in python by doing it for Tmux.
http://bit.ly/195ahQs

[[feed-images/photo.jpg]]

The code can be found: https://github.com/redbrain/cython-book

Basically, I want to get the HTML as close to org as is reasonable. I found
a way to get an HTML parse tree (libxml-parse-html-region start end), but I
can't figure out how to convert that to the text I want.
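
To make the goal concrete, here is a rough sketch of the conversion I have
in mind. It is written in Python with only the standard library, not in
elisp on top of the libxml parse tree, so treat it as an illustration of
the idea and not a solution. The names OrgConverter and html_to_org, and
the feed-images directory, are made up for this example. It only records
which images would need downloading; it does not fetch anything.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse


class OrgConverter(HTMLParser):
    """Collect org-flavoured text from an HTML fragment (hypothetical helper)."""

    def __init__(self, image_dir="feed-images"):
        # convert_charrefs=True means entities such as &lt; reach
        # handle_data already unescaped.
        super().__init__(convert_charrefs=True)
        self.image_dir = image_dir
        self.parts = []      # output pieces, joined at the end
        self.images = []     # (url, local_path) pairs still to be downloaded
        self.href = None     # href of the <a> we are inside, if any
        self.link_text = []  # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.href = (attrs.get("href") or "").strip()
            self.link_text = []
        elif tag == "img":
            url = (attrs.get("src") or "").strip()
            if url:
                local = f"{self.image_dir}/{basename(urlparse(url).path)}"
                self.images.append((url, local))
                # Put the image link in a paragraph of its own.
                self.parts.append(f"\n\n[[{local}]]\n\n")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self.href is None:
            return
        text = " ".join("".join(self.link_text).split())
        if text and (not self.href or text == self.href):
            # Bare URL (or link without href): plain text is enough.
            self.parts.append(text)
        elif text:
            self.parts.append(f"[[{self.href}][{text}]]")
        # An <a> that only wraps an <img> has no text, so nothing is added.
        self.href = None


def html_to_org(fragment, image_dir="feed-images"):
    """Return (org_text, images) for an HTML fragment."""
    conv = OrgConverter(image_dir)
    conv.feed(fragment)
    conv.close()
    # Collapse whitespace inside paragraphs and drop empty paragraphs.
    paragraphs = [" ".join(p.split()) for p in "".join(conv.parts).split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p), conv.images


if __name__ == "__main__":
    sample = (
        '<p>Book on Cython. <a href="http://bit.ly/195ahQs">'
        'http://bit.ly/195ahQs</a></p>'
        '<p><a href="http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg">'
        '<img src="http://redbrain.co.uk/wp-content/uploads/2013/12/photo.jpg" />'
        '</a>The code can be found: '
        '<a href="https://github.com/redbrain/cython-book">'
        'https://github.com/redbrain/cython-book</a></p>'
    )
    org, images = html_to_org(sample)
    print(org)
    print(images)
```

The second return value lists (url, local path) pairs, so a separate step
could download each image (for instance with urllib.request.urlretrieve)
before the [[feed-images/...]] links are displayed.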

Has anyone done anything like this?

John

-----------------------------------
John Kitchin
Associate Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu

--047d7b5d3610494c8504ef1bf23c--