/Semi-verbose Preamble/.

Having recently begun intensive use of org-mode for tangling source
files, I encountered four issues related to comment extraction (two
bugs, one undesirable behavior, and one ... unfulfilled need), which I
describe in detail below.

I started by creating an org file that would reproduce the problems,
and soon started /describing/ the problems in the org file as well as
putting my fixes in the source blocks. At the risk of it being too
meta or annoying, I've included that org file at the end of this
message as the problem description. All the details are there, as well
as two fixes. Tangling that file in the various ways described there
demonstrates the problems, and you can export to PDF for nicer
reading. (I've attached the PDF to this mail for convenience. It looks
good; kudos, org-mode!)

I've also attached a tarball with files that make it easy to try my
changes and to restore the original behavior, as well as tests and
results from the org file for easy comparison. See the included
README.

I've been using the revised code now for a few days. It fixes the
problems I describe, and I think it provides a flexible framework for
comment extraction with minimal change to the base code.

If the reaction to this is positive, I will happily submit a patch,
sign paperwork, or whatever is needed, after fixing any problems that
you all see. In any case, I very much look forward to any feedback you
can offer.

Thanks.

  -- Christopher

P.S. In case the attachments get dropped, I've put the PDF and the
tarball at

  http://www.stat.cmu.edu/~genovese/depot/tangle-error.pdf
  http://www.stat.cmu.edu/~genovese/depot/tangle-bundle.tgz

/Problem Description/

################ Cut Here ####################
# -*- org-return-follows-link: t; comment-empty-lines: t; -*-
#+TITLE: Tangle this file: four issues with org-babel-tangle
#+AUTHOR: Christopher Genovese
#+DATE: 14 Sep 2011\vspace*{-0.5cm}
#+EMAIL: genovese@cmu.edu
#+OPTIONS: toc:1 H:1
#+BABEL: :tangle yes :comments org :results silent :exports code
#+BIND: org-export-latex-hyperref-format "\\hyperlink{%s}{%s}"
#+STARTUP: showall
#+LATEX_HEADER: \usepackage[labelsep=period,labelfont=bf]{caption}
#+LaTeX: \vspace*{-1cm}\vspace*{-2pt}

* Four Related Issues with org-babel-tangle

Running org-mode version 7.7 on both GNU Emacs 23.2.1 and 24.0.50.1
with Mac OS X 10.5.8 (with and without a -Q option), I encountered the
following issues/problems/bugs when tangling files with code blocks
for which the :comments header argument is org:

1. The subtree associated with the very first code block has its
   headline tangled without leading stars, but all subsequent
   subtrees associated with code blocks have the leading stars
   included in the comments.

2. If the first code block comes before the first headline, the start
   of the comment text will be determined by pre-existing match data
   and thus will likely be incorrect.

3. Org structural elements such as headline stars, =#+= control lines,
   and drawers are included in the comments.

4. There is no easy way to delimit comment text or transform it that
   does not also change the structure of the org file (e.g., by
   adding headlines or source blocks).

Issues 1 and 2 seem to be genuine bugs. Issues 3 and 4 are more
subjective, I admit, but seem undesirable. Stars, drawers, and control
lines are org structure rather than content, and they are often
inappropriate in comments.

To reproduce the behaviors for issues 1 and 3, look at the result of
tangling this file. To reproduce issue 2 as well, remove the first two
stars from this file and tangle again. Alternatively, within Emacs,
evaluate the =(buffer-substring ...)= sexp from the
[[original-code][original code]] below with point at the beginning
character of a source block. (You can also export this file to PDF for
more pleasant reading.)

Below, I give details on these issues and code for two fixes: a
[[simple-fix][simple fix]] that handles the first two issues and the
stars in the third, and a [[preferred-fix][better fix]] that handles
all four issues in a more modular, customizable framework.

I'd be interested in hearing feedback on all of this. If the reaction
is positive, I will gladly submit a patch. Thanks for your
consideration.

* The Original Code and Details on the Problem

The relevant section of the original code from org-mode version 7.7 is
shown below, comprising lines 344 through 357 of ob-tangle.el in
function =org-babel-tangle-collect-blocks=. With point at a
=#+begin_src=, it scans back either for a heading line or for the end
of the previous code block, whichever comes later. The resulting
region becomes the comment text.

#+latex: \begin{figure}[h]
#+latex: \hypertarget{original-code}{} % <<original-code>>
#+source: original-code
#+begin_src emacs-lisp
(comment
 (when (or (string= "both" (cdr (assoc :comments params)))
           (string= "org" (cdr (assoc :comments params))))
   ;; from the previous heading or code-block end
   (buffer-substring
    (max (condition-case nil
             (save-excursion
               (org-back-to-heading t) (point))
           (error 0))
         (save-excursion
           (re-search-backward
            org-babel-src-block-regexp nil t)
           (match-end 0)))
    (point))))
#+end_src
#+latex: \caption{Original Code, lines 344--357 in {\tt ob-tangle.el}.}
#+latex: \label{fig::original-code}
#+latex: \end{figure}

/Issue 1/. When point is at the first code block in the file, the
second search fails (there is no previous code block), so the
=(match-end 0)= call uses the match data from the (implicit) match
during =org-back-to-heading=, which skips the stars. (Not a
particularly transparent reference, incidentally.) For subsequent
blocks, the =(match-end 0)= gives the end of the previous code block,
which in these examples is earlier than the previous headline
location.

/Issue 2/. When the first code block lies before the first headline
(say with some text before it), the searches fail in /both/ clauses of
the =max=. So, the =match-end= will return an essentially arbitrary
result, which is a bug.

/Issue 3/. =org-back-to-heading= leaves point at the beginning of the
stars, so a headline included in the text will have stars, except for
the first one.

/Issue 4/. Control lines after the end of the previous code block and
before point are not filtered out and so are included in the comments.

* A Simple Fix for the First Three Issues

A small change addresses issues 1, 2, and the stars for issue 3: in
both clauses of the =max=, simply use the =match-end= of a successful
search and replace the 0 fallback values with =(point-min)=. The
latter gives a sensible result even if both computed positions are
trivial (as when the first code block comes before the first headline)
and respects narrowing.
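
The guard used in this fix can be tried in isolation. The following is
only a minimal sketch: the helper name =demo-end-of-previous-match= is
hypothetical and not part of org-mode. It returns the end of the
nearest earlier match, or =(point-min)= when the search fails, rather
than reading whatever match data happens to be left over.

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper for illustration only; not part of ob-tangle.el.
(defun demo-end-of-previous-match (regexp)
  "Return the end of the nearest match for REGEXP before point.
Return `point-min' if there is no such match.  Point is not moved."
  (save-excursion
    (if (re-search-backward regexp nil t)
        (match-end 0)       ; search succeeded: match data is fresh
      (point-min))))        ; search failed: ignore stale match data

(with-temp-buffer
  (insert "END\nsome notes\n")
  (goto-char (point-max))
  (list (demo-end-of-previous-match "END")       ; match at 1..4
        (demo-end-of-previous-match "MISSING"))) ; no match
;; => (4 1)
#+end_src

Testing the return value of the search, instead of calling
=match-end= unconditionally, is the whole of the change in each
clause.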

#+latex: \begin{figure}[h]
#+latex: \hypertarget{simple-fix}{} % <<simple-fix>>
#+begin_src emacs-lisp
(comment
 (when (or (string= "both" (cdr (assoc :comments params)))
           (string= "org" (cdr (assoc :comments params))))
   ;; from the previous heading or code-block end
   (buffer-substring
    (max (condition-case nil
             (save-excursion
               (org-back-to-heading t) ; sets match data
               (match-end 0))
           (error (point-min)))
         (save-excursion
           (if (re-search-backward
                org-babel-src-block-regexp nil t)
               (match-end 0)
             (point-min))))
    (point))))
#+end_src
#+latex: \caption{Simple Fix, replacement for lines 344--357 in {\tt ob-tangle.el}.}
#+latex: \label{fig::simple-fix}
#+latex: \end{figure}

* A Fix for All Four Issues

A better fix that handles issues 1--4 starts with the region computed
as in the [[simple-fix][simple fix]] and then processes that text
through a user-configurable sequence of functions to derive the final
form of the comment text. The following changes are required.

** Extract Initial Comment Text and State from Org Buffer

The initial comment text starts at the latest of three positions: the
most recent headline at the point after the stars, the beginning of
the line after the =#+end_src= of the most recent code block, or the
beginning of the buffer. It extends through the line before the source
block.[fn:1] The [[preferred-fix][code]] to extract this is given
below.
#+latex: (See Figure \ref{fig::preferred-fix}.)
It replaces lines 344 through 357 of =ob-tangle.el= from org-mode
version 7.7 in the function =org-babel-tangle-collect-blocks=.

#+latex: \begin{figure}[h]
#+latex: \hypertarget{preferred-fix}{} % <<preferred-fix>>
#+begin_src emacs-lisp
(comment
 (when (or (string= "both" (cdr (assoc :comments params)))
           (string= "org" (cdr (assoc :comments params))))
   (let* ((prev-heading
           (condition-case nil
               (save-excursion
                 (org-back-to-heading t) ; sets match data
                 (match-end 0))
             (error (point-min))))
          (end-of-prev-src-block
           (save-excursion
             (if (null (re-search-backward
                        org-babel-src-block-regexp nil t))
                 (point-min)
               (goto-char (match-end 0))
               (forward-line 1)
               (point))))
          (comment-start (max prev-heading end-of-prev-src-block))
          (comment-end (save-excursion
                         (forward-line 0)
                         (point)))
          (state (list (cons 'org-drawers org-drawers)
                       (cons 'after-heading
                             (= comment-start prev-heading))
                       (cons 'first-line
                             (= comment-start (point-min))))))
     (org-babel-process-comment-text
      (buffer-substring comment-start comment-end) state))))
#+end_src
#+latex: \caption{Better Fix, replacement for lines 344--357 in {\tt ob-tangle.el}.}
#+latex: \label{fig::preferred-fix}
#+latex: \end{figure}

** Adjust =org-babel-spec-to-string=

The comment block collected by the [[original-code][original code]]
#+latex: (Figure \ref{fig::original-code})
in =org-babel-tangle-collect-blocks= is further processed in \newline
=org-babel-spec-to-string= to trim leading and trailing whitespace
from the string. This was needed because spaces after a source block
were included in the comment. In the revised code, however, this space
trimming is handled during text transformation, except for removing
trailing newlines. (Note: trailing /spaces/ are not removed to allow
more flexibility in comment processing.) Hence,
=org-babel-spec-to-string= needs to be slightly adjusted.
#+latex: See Figure \ref{fig::spec-string-diff}.
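
To make the difference between trimming and chomping concrete, here is
a small sketch. It deliberately avoids the org-babel helpers: the
function =demo-chomp-newlines= is a hypothetical stand-in written for
this illustration, and =string-trim= comes from =subr-x= in newer
Emacs versions, so neither is the code used in the patch.

#+begin_src emacs-lisp :tangle no
(require 'subr-x)                       ; provides `string-trim'

;; Hypothetical stand-in for illustration; not `org-babel-chomp'.
(defun demo-chomp-newlines (text)
  "Return TEXT without trailing newline-like characters.
Leading whitespace and trailing spaces are kept."
  (replace-regexp-in-string "[\f\n\r\v]+\\'" "" text))

(let ((text "  indented comment  \n\n"))
  (list (string-trim text)              ; strips both ends
        (demo-chomp-newlines text)))    ; strips trailing newlines only
;; => ("indented comment" "  indented comment  ")
#+end_src

Keeping the leading indentation and trailing spaces at this stage
leaves those decisions to the transform functions.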

#+latex: \begin{figure}[h]
#+begin_example
--- ob-tangle.el	2011-09-14 11:48:26.000000000 -0400
+++ new-ob-tangle.el	2011-09-14 11:55:56.000000000 -0400
@@ -398,3 +398,3 @@
     (flet ((insert-comment (text)
-            (let ((text (org-babel-trim text)))
+            (let ((text (org-babel-chomp text "[\f\t\n\r\v]")))
               (when (and comments (not (string= comments "no"))
#+end_example
#+latex: \caption{Changes to {\tt org-babel-spec-to-string} in {\tt ob-tangle.el}, unified diff, one line of context}
#+latex: \label{fig::spec-string-diff}
#+latex: \end{figure}

** Process Comment Text Through Sequence of Transforms

At the end of the revised [[preferred-fix][comment collection code]],
the comment text is passed to =org-babel-process-comment-text=, which
applies a sequence of transformation functions.
#+latex: (See Figure \ref{fig::comment-transformation}.)
The list of transformation functions is stored in a customizable
variable described
[[Define Customization Variable for Transforms][below]]. Several
predefined transformations are given
[[Define A Collection of Transform Functions][below]] as well.

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-process-comment-text (text &optional state)
  "Apply list of transforms to comment TEXT assuming bindings in alist STATE.
Returns the modified text string, which may have text properties.
See `org-babel-comment-processing-functions' for the transforms to
be applied and details on the allowed keys in the STATE alist."
  (let ((funcs org-babel-comment-processing-functions))
    (with-temp-buffer
      (insert text)
      ;; The state variables are bound dynamically so that the
      ;; transform functions can read them.
      (let ((org-drawers (or (cdr (assoc 'org-drawers state))
                             org-drawers))
            (after-heading (cdr (assoc 'after-heading state)))
            (first-line (cdr (assoc 'first-line state))))
        (while funcs
          (goto-char (point-min))
          (funcall (car funcs))
          (setq funcs (cdr funcs))))
      (buffer-substring (point-min) (point-max)))))
#+end_src
#+latex: \caption{Better Fix, comment transformation driver.}
#+latex: \label{fig::comment-transformation}
#+latex: \end{figure}

** Define Customization Variable for Transforms

The variable holds a list of nullary functions that are applied in
order to the comment text. The text is inserted in a temporary buffer,
so these functions can use the entire Emacs library for operating on
buffer text.
#+latex: See Figure \ref{fig::comment-transformation-function-list}.

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defcustom org-babel-comment-processing-functions
  '(org-babel-comment-delete-file-variables-line
    org-babel-comment-delete-org-control-lines
    org-babel-comment-delete-drawers
    org-babel-comment-trim-blank-lines
    org-babel-comment-trim-indent-prefix)
  "List of functions to transform source-block comment text before insertion.
Each function will be called with no arguments with point at the
beginning of a buffer containing only the comment text. Each
function can modify the text at will and leave point anywhere,
but it should *not* modify the narrowing state of the buffer.
Several dynamic state variables are set prior to execution that
each function can reference. These currently include:

  + org-drawers:    names of drawers in the original org buffer.
  + after-heading:  t if comment starts at an org heading line,
                    nil otherwise.
  + first-line:     t if initial comment starts on first line of
                    the original org buffer, nil otherwise.

If a function changes the value of these state variables, the new
value will be seen by all following functions in the list, but
this is not generally recommended.

The functions in this list are called *in order*, and this order
can influence the form of the resulting comment text."
  :group 'org-babel
  :type '(repeat function))
#+end_src
#+latex: \caption{Better Fix, customizable transform list.}
#+latex: \label{fig::comment-transformation-function-list}
#+latex: \end{figure}

** Define A Collection of Transform Functions

An advantage of this design is that transformation of the comments is
modular and customizable. We can include in =ob-tangle.el= a
collection of pre-defined transforms. The default processing stream in
=org-babel-comment-processing-functions= is as follows:

1. Delete a file-variables line if it is on the first line of the
   buffer.
2. Delete all org control lines from the comment text.
3. Delete all drawers and their contents.
4. Trim blank lines from the beginning and end.
5. Reindent the text by removing the longest common leading string of
   spaces.

#+ TANGLE: end-comment

These and several other useful transforms are given below (e.g.,
deleting drawer delimiters but not contents).
#+latex: See Figures \ref{fig::transformA}--\ref{fig::transformZ}.
It is easy to define new transforms; any function that operates on
text in the current buffer beginning at =point-min= will work.

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-delete-file-variables-line ()
  "Delete file variables comment line if at beginning of buffer.
This only checks the first line of the buffer, and so should be
placed first (or at least early enough) in the list
`org-babel-comment-processing-functions' to ensure that no
other text has been inserted earlier."
  (when (and first-line
             (looking-at            ; file-variables line
              "^#[ \t]*-\\*-.*:.*;[ \t]*-\\*-[ \t]*$"))
    ;; Delete rather than kill so the kill ring is left untouched.
    (delete-region (point) (progn (forward-line 1) (point)))))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \label{fig::transformA}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-delete-org-control-lines ()
  "Remove all org #+ control lines from comment."
  (let ((control-regexp "^[ \t]*#\\+.*$"))
    (delete-matching-lines control-regexp)))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-delete-org-in-buffer-settings ()
  "Remove all org #+ in-buffer setting lines, leaving other control lines.
In-buffer setting lines begin with #+ and have all caps keyword names."
  (let ((case-fold-search nil)  ; keep [A-Z_] from matching lowercase
        (setting-regexp "^#\\+[ \t]*[A-Z_]+:.*$"))
    (delete-matching-lines setting-regexp)))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-delete-drawers ()
  "Delete drawer delimiters and contents from comment.
Drawer names are restricted to those in the `org-drawers' state."
  (let ((drawer-start-regexp
         (format "^[ \t]*:\\(?:%s\\):[ \t]*$"
                 (mapconcat 'identity org-drawers "\\|")))
        (drawer-end-regexp "^[ \t]*:END:[ \t]*$"))
    (while (re-search-forward drawer-start-regexp nil t)
      (let ((beg (save-excursion
                   (forward-line 0)
                   (point)))
            (end (save-excursion
                   (re-search-forward drawer-end-regexp nil t)
                   (forward-line 1)
                   (point))))
        (goto-char end)
        (delete-region beg end)))))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-delete-drawer-delimiters ()
  "Delete drawer delimiters from comment leaving content.
Drawer names are restricted to those given by the `org-drawers' state."
  ;; Anchor on the closing colon so that only delimiter lines match,
  ;; not content lines that merely begin with a drawer name.
  (let ((drawer-delim-regexp
         (format "^[ \t]*:\\(?:%s\\):[ \t]*$"
                 (mapconcat 'identity (cons "END" org-drawers) "\\|"))))
    (delete-matching-lines drawer-delim-regexp)))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-trim-blank-lines ()
  "Trim whitespace-only lines from beginning and end of text."
  (while (and (looking-at "^[ \t\f]*$")
              (< (point) (point-max)))
    (forward-line 1))
  (delete-region (point-min) (point))
  (when (< (point) (point-max))
    (goto-char (point-max))
    (let ((last-point (point)))
      (forward-line 0)
      (while (and (looking-at "^[ \t\f]*$")
                  (> (point) (point-min)))
        (setq last-point (point))
        (forward-line -1))
      (delete-region last-point (point-max)))))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \end{figure}

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-trim-indent-prefix ()
  "Remove longest common leading prefix of spaces from each line of the text.
Prefix is computed from the initial whitespace on each line with
tabs converted to spaces, preserving indentation."
  (let* ((common-indent nil)
         (common-length (1+ (- (point-max) (point-min))))
         (current-indent "")               ; enter first loop
         (current-length common-length))   ; skip first assignment
    (goto-char (point-min))
    (while current-indent
      (when (< current-length common-length)
        (setq common-indent current-indent
              common-length current-length))
      (setq current-indent
            (let* ((found (re-search-forward
                           "^\\([ \t]*\\)\\S-" nil t))
                   (bol (match-beginning 0))
                   (eos (match-end 1))
                   (space-str (match-string 1))
                   (indent-tabs-mode nil))
              (cond
               ((not found) nil)
               ((not (string-match "\t" space-str)) space-str)
               (t                     ; detabify indent string
                (goto-char eos)
                (let ((col (current-column)))
                  (delete-region bol eos)
                  (indent-to col))
                (buffer-substring-no-properties bol (point))))))
      (setq current-length (length current-indent)))
    (when (and common-indent (> common-length 0))
      (let ((indent-re (concat "^" common-indent)))
        (goto-char (point-min))
        (while (re-search-forward indent-re nil t)
          (replace-match "" nil nil))))))
#+end_src
#+latex: \caption{Comment Transform.}
#+latex: \label{fig::transformZ}
#+latex: \end{figure}

#+latex: \end{itemize}
#+latex: \noindent
This kind of customization offers some nice possibilities, including
controlling indentation, eliminating or transforming org markup,
eliminating trailing whitespace, and automating specialized comment
formatting (e.g., javadoc).

As an additional illustration, consider the transform
=org-babel-comment-restrict-comment-range=
#+latex: in Figure \ref{fig::transform-illustration} below.
The idea is that it is sometimes useful to select from the text under
a headline a /part/ of the text for the comment. We want some org
markup that will not affect either the export or the structure of the
org file itself. To do this, we use the fact that
=#+=\textvisiblespace lines are not exported.[fn:2] So, we can
/de facto/ use the =#+ TANGLE:= construct to control various aspects
of tangling.
Here, we use the =#+ TANGLE: start-comment= and
=#+ TANGLE: end-comment= lines to delimit the comment text. (This
function needs to come earlier in the function list than the functions
that eliminate org control lines. It is sufficient to prepend it to
that list.) This is used in the current file, for example.

#+latex: \begin{figure}[h]
#+begin_src emacs-lisp
(defun org-babel-comment-restrict-comment-range ()
  "Remove all comment text outside start-comment and end-comment delimiters.
Comment delimiters are #+TANGLE lines with respective keywords
start-comment and end-comment. The #+TANGLE lines are also deleted.
To be effective, this function should be positioned in the list
`org-babel-comment-processing-functions' before any functions
that remove org control lines or process other co-occurring
attributes of #+TANGLE lines."
  (when (re-search-forward
         "^[ \t]*#\\+[ \t]*TANGLE:.*start-comment.*$" nil t)
    (forward-line 1)
    (delete-region (point-min) (point)))
  (when (re-search-forward
         "^[ \t]*#\\+[ \t]*TANGLE:.*end-comment.*$" nil t)
    (forward-line 0)
    (delete-region (point) (point-max))))
#+end_src
#+latex: \caption{Transform to illustrate some customization possibilities.}
#+latex: \label{fig::transform-illustration}
#+latex: \end{figure}

#+latex: \begin{itemize}

[fn:1] In the original code and in the simple fix above, the comment
starts /immediately/ after the =#+end_src= rather than at the start of
the next line. Starting at the next line seems more natural to me
because the comment being constructed relates to the /following/ code
block. But the original behavior is easily restored if people
disagree.

[fn:2] A feature request: I would propose that the =#+tangle:=
construct be recognized as non-exported even with spaces preceding the
=#= and no spaces after the =+=. This would enable a variety of
interesting customizations for tangled comments. Alternatively, a
generic construct such as =#+noop:= or =#+generic:= could be valuable
for user-defined tags in an org file that serve a similar purpose:
allowing customized processing without being exported.
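

As a usage sketch for the transform list: a user-defined transform is
just a nullary function that edits the current buffer. The function
=my-comment-strip-trailing-spaces= below is a hypothetical example, not
one of the proposed transforms, and the commented-out =add-to-list=
call assumes that the proposed
=org-babel-comment-processing-functions= variable has been loaded.

#+begin_src emacs-lisp :tangle no
;; Hypothetical user transform; not part of the proposed patch.
(defun my-comment-strip-trailing-spaces ()
  "Delete spaces and tabs at the end of every line in the buffer."
  (goto-char (point-min))
  (while (re-search-forward "[ \t]+$" nil t)
    (replace-match "")))

;; With the proposed code loaded, it could be appended to the list:
;;   (add-to-list 'org-babel-comment-processing-functions
;;                'my-comment-strip-trailing-spaces t)

;; The transform can be tried on its own, much as the driver calls it:
(with-temp-buffer
  (insert "first line  \nsecond line\t\n")
  (my-comment-strip-trailing-spaces)
  (buffer-string))
;; => "first line\nsecond line\n"
#+end_src

Appending it runs it after the default transforms, which matters
because the functions are applied in order.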