emacs-orgmode@gnu.org archives
From: Adam Porter <adam@alphapapa.net>
To: emacs-orgmode@gnu.org
Subject: Re: org-board -- bookmarking and archival
Date: Thu, 15 Sep 2016 12:07:33 -0500
Message-ID: <87oa3pdrca.fsf@alphapapa.net>
In-Reply-To: m2r391i6ho.fsf@aurox.ch

Hi Charles,

Thanks for sharing that, I will check it out.  As was mentioned, it
seems ripe for integrating with browser capture.  On that note, have you
seen org-protocol-capture-html?  For articles that are primarily text,
I've been capturing articles directly in Org format, but your package
sounds good for capturing pages as-is.

By the way, you might want to consider integrating something like
Readability or the Python package python-readability (aka
readability-lxml) for reducing web pages to the primary content.  It's
worked out well in org-protocol-capture-html.

Also, here's some code I've been using to read and/or capture web
pages from URLs on the clipboard:

#+BEGIN_SRC elisp
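(require 'org) ; for `org-mode' and `org-time-stamp-formats'
(require 's)   ; for `s-trim' (third-party s.el package)

(defun url-to-org-with-readability (url)
  "Fetch URL with python-readability and display it as Org in a new buffer.
The page content is converted from HTML to Org with Pandoc."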

  (let (title content new-buffer)

    (with-temp-buffer
      (unless (= 0 (call-process "python" nil '(t t) nil "-m" "readability.readability" "-u" url))
        (error "Python readability-lxml script failed: %s" (buffer-string)))

      ;; Get title
      (goto-char (point-min))
      (setq title (buffer-substring-no-properties (search-forward "Title:") (line-end-position)))

      ;; Newer Pandoc releases reject "--no-wrap"; "--wrap=none" is
      ;; its replacement.
      (unless (= 0 (call-process-region (point-min) (point-max) "pandoc" t t nil "--wrap=none" "-f" "html" "-t" "org"))
        (error "Pandoc failed"))
      (setq content (buffer-substring (point-min) (point-max))))

    ;; Make new buffer
    (setq new-buffer (generate-new-buffer title))
    (with-current-buffer new-buffer
      (insert (concat "* [[" url "][" title "]]\n\n"))
      (insert content)
      (org-mode)
      (goto-char (point-min))
      (org-cycle)
      (switch-to-buffer new-buffer))))

(defun read-url-with-org ()
  "Call `url-to-org-with-readability' on URL in kill ring."
  (interactive)
  ;; `car' instead of `first', which needs the obsolete `cl' library
  (url-to-org-with-readability (car kill-ring)))

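(defun org-capture-web-page-with-readability (&optional url)
  "Return an Org capture entry for the web page at URL.
URL defaults to the most recent kill.  The result is a string meant
to be inserted by an org-capture template."
  (let ((url (or url (car kill-ring)))
        ;; From org-insert-time-stamp.  Assumes the format string is
        ;; wrapped in <>, which `substring' strips here.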
        (timestamp (format-time-string (concat "[" (substring (cdr org-time-stamp-formats) 1 -1) "]")))
        title title-linked content)

    (with-temp-buffer
      (unless (= 0 (call-process "python" nil '(t t) nil "-m" "readability.readability" "-u" url))
        (error "Python readability-lxml script failed: %s" (buffer-string)))

      ;; Get title
      (goto-char (point-min))
      (setq title (buffer-substring-no-properties (search-forward "Title:") (line-end-position)))
      (setq title-linked (concat "[[" url "][" title "]]"))

      ;; "--wrap=none" replaces the deprecated "--no-wrap"
      (unless (= 0 (call-process-region (point-min) (point-max) "pandoc" t t nil "--wrap=none" "-f" "html" "-t" "org"))
        (error "Pandoc failed"))

      ;; Demote page headings in capture buffer to below the
      ;; top-level Org heading and "Article" 2nd-level heading
      (save-excursion
        (goto-char (point-min))
        (while (re-search-forward (rx bol (1+ "*") (1+ space)) nil t)
          (beginning-of-line)
          (insert "**")
          (end-of-line)))

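      ;; Skip the first line (the "Title:" text).  `forward-line' is
      ;; used because `goto-line' is meant for interactive use.
      (goto-char (point-min))
      (forward-line 1)
      (setq content (s-trim (buffer-substring (point) (point-max))))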

      ;; Return capture for insertion
      (concat title-linked " :website:\n\n" timestamp "\n\n** Article\n\n" content))))

;; org-capture template (an entry for `org-capture-templates')
("wr" "Capture Web site with python-readability" entry
 (file "~/org/articles.org")
 "* %(org-capture-web-page-with-readability)")
#+END_SRC
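
The two text-processing steps in the code above (reading the title off
the first line of the readability output, and demoting the converted
headings by two levels) can also be sketched as standalone functions,
which makes them easy to try without network access, Python, or Pandoc.
This is only a sketch: the my/ names are hypothetical, and it assumes,
as the code above does, that the readability output starts with a
"Title:" line followed by the article HTML.

#+BEGIN_SRC elisp
(defun my/split-readability-output (output)
  "Split OUTPUT into a cons cell (TITLE . HTML).
OUTPUT is assumed to start with a \"Title:\" line, followed by the
article HTML.  (Hypothetical helper, not part of any package.)"
  (with-temp-buffer
    (insert output)
    (goto-char (point-min))
    (unless (looking-at "Title:[ \t]*\\(.*\\)$")
      (error "No \"Title:\" line at start of output"))
    (let ((title (match-string 1)))
      ;; Everything after the first line is treated as the HTML body.
      (forward-line 1)
      (cons title (buffer-substring-no-properties (point) (point-max))))))

(defun my/demote-org-headings (text &optional levels)
  "Return TEXT with every Org heading demoted by LEVELS (default 2).
\(Hypothetical helper, not part of any package.)"
  (let ((prefix (make-string (or levels 2) ?*)))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; A heading is one or more stars at the start of a line,
      ;; followed by a space or tab.
      (while (re-search-forward "^\\*+[ \t]" nil t)
        (beginning-of-line)
        (insert prefix)
        (end-of-line))
      (buffer-string))))
#+END_SRC

For example, (my/split-readability-output "Title:Example\n<p>Hi</p>")
evaluates to ("Example" . "<p>Hi</p>"), and (my/demote-org-headings
"* Intro\n") evaluates to "*** Intro\n".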
