From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nick Dokos
Subject: Re: Erroneous "No such file or directory" with babel and remote dir
Date: Thu, 15 Nov 2012 01:55:29 -0500
Message-ID: <6907.1352962529@alphaville>
References: <87y5kiq71h.fsf@slate.zedat.fu-berlin.de>
 <87vcfapgfy.fsf@bzg.ath.cx>
 <87a9wd56i6.fsf@slate.zedat.fu-berlin.de>
 <878vbwhczx.fsf@Rainer.invalid>
 <87a9wc1uki.fsf@slate.zedat.fu-berlin.de>
 <87vcf0ft1s.fsf@Rainer.invalid>
 <87mx0chswg.fsf@slate.zedat.fu-berlin.de>
 <87r4pnpehe.fsf@Rainer.invalid>
 <87fw4dlei8.fsf@slate.zedat.fu-berlin.de>
 <7557.1352869451@alphaville>
 <9608.1352874199@alphaville>
Reply-To: nicholas.dokos@hp.com
In-Reply-To: Message from Nick Dokos of "Wed, 14 Nov 2012 01:23:19 EST." <9608.1352874199@alphaville>
List-Id: "General discussions about Org-mode."
To: Loris Bennett
Cc: emacs-orgmode@gnu.org

In a previous mail, I wrote:

,----
| Which tells me that the tmp file error is a red herring and the real
| breakage occurred after 5cb80c7, probably through a commit that touched
| ob-sh.el (although that's far from guaranteed). Here's that list:
|
| $ git log --oneline -- lisp/ob-sh.el
| 86e515d fix remote execution w/empty shebang header arg
| 70dd119 Massive code clean-up.
| 966447c Don't use `org-flet' in ob-awk.el and ob-sh.el
| 8eb5843 Add punctuation at the end of the first line of docstrings. Code cleanup.
| 63b5f8f replace flet/labels with org-flet/org-labels
| ecd0562 Fix the master branch.
| 6e306f6 Fix copyright years in maint.
| de42649 Manually revert maint to e85080.
| 73bb18b Manually revert to the Release 7.8.04 tag.
| 38c5045 Fix copyright years.
| 6e534f9 Manually revert back to commit e85080.
| 5cb80c7 apply :shebang and :padline to shell script execution
| ...
`----

Continuing along these lines, I think I've figured it out and it's not
pretty.

The executive summary is that the last commit in the list introduced a
small bug that was fixed by the first commit in the list. But neither of
those has much to do with the remote-dir breakage (except that the bug
made things harder to bisect). The principal culprits are two sets of
commits for code transformations that were supposed to do nothing
functionally: they were just supposed to get away from using flet/labels
(which are deprecated and will be obsolete by emacs version 24.3).

The end result is that one change fixes the remote dir problem in
Loris's example. I'm not sure that it solves every such problem though:
I haven't audited all the code.

The change is a one-liner: in org-babel-shell-command-on-region, replace
the line

    (call-process-region start end shell-file-name t

by

    (funcall call-process-region start end shell-file-name t

BTW, I'm using Org-mode version 7.9.2 (release_7.9.2-582-g6d099e.dirty @
/home/nick/elisp/org-mode/lisp/).

The rest of the email explains why the change is needed (in excruciating
detail: grab a beer or maybe a cup of coffee before starting on it).

I hope this is (mostly) correct but corrections would be more than
welcome. The problem seems multi-faceted and therefore needs as many
eyes on it as possible.

Nick

P.S.
I'm not sure whether to thank Loris or to curse him for pushing me
on this path, but there is no question that he is responsible for
finding the bug, providing the reproducer and then beating on the gates
with pitchfork and torch :-)

===========================================================================

* Root cause

The root cause of the problem was a set of code transformations that
were supposed to leave the functionality intact. The code
transformations were driven by the need to replace the flet/labels
constructs, which were declared obsolete as of 24.3 (they are still
available, but cause warnings to be issued).

The problem was hard to find because there were four commits (at widely
varying times) that contributed to various manifestations of the problem
and it was difficult to bisect. In chronological order, they were

- commit 5cb80c7 apply :shebang and :padline to shell script execution

  This did not cause the problem, but it introduced a bug that causes
  errors when executing a source block with a remote :dir spec. That
  confused the issue (at least it confused me: I fingered this as the
  culprit in a bisect, but it was only guilty of the bug fixed by commit
  86e515d, see below, not of the remote-dir problem).

- commit 63b5f8f replace flet/labels with org-flet/org-labels

  Section [[flet --> org-flet]] describes this.

- commit e85479a Don't use `org-flet' in some functions

  and several others that slowly got rid of org-flet in favor of let,
  and org-labels (somehow - I didn't check this carefully), the latter
  reverted and reapplied, presumably because problems were found and
  fixed in several iterations. I've only skimmed the surface here: I
  think this has the potential to be a minefield of problems waiting to
  explode - see the [[org-flet --> let]] section below.

- commit 86e515d fix remote execution w/empty shebang header arg

  This finally fixed the little bug that was introduced by 5cb80c7.

The first and the last of these commits are irrelevant to the remote-dir
problem, except that the bug gets in the way of testing: any version
later than 5cb80c7 exhibits the bug and that bug hides whether the
remote-dir problem is present or not.

The general procedure I followed was to make a branch with some commit
as its tip and then manually apply the patch of commit 86e515d. Only
then could I test for the remote-dir problem. In the following, when I
say commit X, I mean commit X *plus* the manually applied patch from
86e515d.

** flet --> org-flet
# <<flet --> org-flet>>

The first set of code transformations (implemented as commit
63b5f8f2e85b3059a2d30041db6939347a7a2d7d) dealt with the situation by
doing a mass substitution: flet --> org-flet and labels --> org-labels
(and in at least one case, flet --> org-labels to deal with a recursive
definition - I presume that was a preexisting bug that was fixed by this
substitution) and adding compatibility aliases in org-compat.el to use
the cl-flet/cl-labels macros from cl.el in emacs versions >= 24.1.50.

I found that this commit alone already broke the remote dir
functionality. Since it is a large but straightforward commit, I split
it up into separate patches for each affected file and applied each
patch in order: org-compat.el implements the org-flet/org-labels
functionality, but since the other patches were not applied yet, I was
still using flet/labels (obsolete but still present). I then applied the
patch to org-macs.el, org.el and ob.el, testing after each one. At
first, things were working but applying the patch to ob.el broke
remote-dir. I then determined that most of the transformations in ob.el
were benign (at least as far as *this* problem is concerned), except
one: when I reverted that single one from org-flet back to flet,
remote-dir started working again.

The "bad" transformation was in this code segment in
org-babel-execute-src-block around line 544 in ob.el (but before you go
looking for it, be warned: you will not find it if you have a reasonably
current version of org - read the next section):

    ...
    (unwind-protect
        (flet ((call-process-region (&rest args)
                 (apply 'org-babel-tramp-handle-call-process-region args)))
          (org-flet ((lang-check (f)
                       (let ((f (intern (concat "org-babel-execute:" f))))
                         (when (fboundp f) f))))
    ...

If I left the first flet alone, remote-dir worked; if I changed it to
org-flet, it did not. Here's the procedure I used to prove this to my
satisfaction, just in case anybody wants to try to duplicate my results:

- create a branch with 63b5f8f as its tip and switch to it:

      git checkout -b foo 63b5f8f

- make a patch for the "little" bug and apply it:

      git show 86e515d > patch.to.apply
      git apply patch.to.apply

- Put Loris's example in a file loris.org (with appropriate
  modifications - user id and remote machine name):

      #+BEGIN_SRC sh :dir /user@host:
      hostname
      #+END_SRC

- Add a (require 'ob-sh) to your minimal .emacs

- Start emacs with

      emacs -q -l /path/to/minimal/.emacs /path/to/loris.org

  and C-c C-c in the code block. If it behaves as I expect, it should
  print the local hostname, not the remote one.

- Now edit lisp/ob.el and change that single org-flet back to an flet
  (remember you are in a branch, so changes here won't affect the files
  in other branches) and test again: it should now print the remote
  hostname.

- Before you can switch to some other branch, you'll need to either
  commit these changes or throw them away - I'll leave that up to you.

So the moral of the story is that the code transformations have *not*
left functionality unchanged. Something went awry but to be honest, I
don't know what. I didn't spend much time on it because of what I found
out next.

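One possible explanation, which nothing in this thread confirms, is a
difference in scoping: the cl-flet macro in current Emacs's cl-lib binds
function names lexically, whereas the old flet temporarily replaced the
symbol's function cell, so functions further down the call chain saw the
override. Whether the org-flet alias at commit 63b5f8f behaved like
cl-flet was not checked here. The sketch below only demonstrates the two
scoping behaviours on a recent Emacs; all demo-* names are hypothetical
stand-ins, not Org functions.

```elisp
(require 'cl-lib)

;; Hypothetical stand-ins; none of these names exist in Org.
(defun demo-low ()
  "Lowest-level function, playing the role of `call-process-region'."
  'local)

(defun demo-mid ()
  "Intermediate function that calls `demo-low' by name."
  (demo-low))

(defun demo-lexical ()
  "Override `demo-low' lexically, then call through `demo-mid'."
  ;; Only calls written textually inside this body see the override;
  ;; `demo-mid' still calls the global `demo-low'.
  (cl-flet ((demo-low () 'remote))
    (demo-mid)))

(defun demo-dynamic ()
  "Override the function cell of `demo-low' for the duration of the body."
  ;; Every caller, however deep in the call chain, sees the override.
  (cl-letf (((symbol-function 'demo-low) (lambda () 'remote)))
    (demo-mid)))

(demo-lexical)  ; => local
(demo-dynamic)  ; => remote
```

If this is indeed what happened, it would match the symptom above:
hostname ran locally with org-flet and remotely with flet.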
** org-flet --> let
# <<org-flet --> let>>

The second set of code transformations is more difficult to describe
because it's not just a (more or less) straight substitution: it tries
to completely eliminate the need for an flet-like construct and replace
it with a simple let.

The trouble here is that in many, but by no means all, lisps (emacs lisp
is one such lisp), a symbol can have *two* bindings. The two are a
function binding and a value binding. Which binding is used depends on
the context:

    (f a b c)

is usually rendered as "call the function f with arguments a, b and c"
but it really says "look up the function binding of the symbol f and
apply the resulting function to the values obtained by looking up the
value bindings of the symbols a, b and c".

setq is the usual way to set the value binding of a symbol and defun is
the usual way to set the function binding of a symbol. So you could do
the following (although it's probably a bad idea):

    (defun f (x) (* x x))

to set the function binding of the symbol f and

    (setq f 3)

to set the value binding of the symbol f. Then you could do

    (f f)

calling the function in the function binding of f with the value binding
of the symbol f as its argument. Since the former is the squaring
function and the latter is the value 3, the result is 9. That may be
confusing but it is legal.

Now the *values* bound to symbols can be locally overridden with `let'
and the functions bound to symbols can be locally overridden with
`flet'. But now that `flet' is going away, how do you override the
function binding using `let'? Without `flet' (or something similar), you
just cannot: trying to use `let' to change the function binding of a
symbol is impossible.

So how exactly is this second set of transformations supposed to work?
It uses `let' to locally override the *value* binding of a symbol with a
*function*:

    (defun f (x y z)
      (let ((g (lambda (x) (* x x))))
        ....))

would bind the squaring function to the symbol g - but it is the value
binding of g, not its function binding. When the time comes to call that
function, you cannot just say

    (defun f (x y z)
      (let ((g (lambda (x) (* x x))))
        (+ (g x) (g y) (g z))))

There is no function binding for the symbol g so you cannot use it in
the function position. That's where funcall (or apply in a different
context) can help. The function can be written like this

    (defun f (x y z)
      (let ((g (lambda (x) (* x x))))
        (+ (funcall g x) (funcall g y) (funcall g z))))

That's the price one has to pay in order to eliminate flet and replace
it with let. As you can see, this is a rather intrusive transformation.

And it's even worse: an flet in some function would define a function
that could be passed down an arbitrarily long call chain and then called
at the lowest level. If you use let, you can still pass the symbol down
the chain but when it is time to call the function, you need to use the
funcall trick. And even more obscurely, you don't have to pass it down
explicitly: dynamic binding will take care of finding the definition of
the symbol an arbitrary distance up the call chain. But if you have
let-bound a function to a symbol, you have to funcall it explicitly. And
if the symbol *has* a function binding already, then calling it directly
will also work but it will call a completely different function!

That's an easy thing to overlook and that's exactly what happened here,
I believe. In org-babel-execute-src-block, the code looks like this:

    ...
    (unwind-protect
        (let ((call-process-region
               (lambda (&rest args)
                 (apply 'org-babel-tramp-handle-call-process-region args))))
          (let ((lang-check
                 (lambda (f)
                   (let ((f (intern (concat "org-babel-execute:" f))))
                     (when (fboundp f) f)))))
            (setq cmd
                  (or (funcall lang-check lang)
                      (funcall lang-check
                               (symbol-name
                                (cdr (assoc lang org-src-lang-modes))))
                      (error "No org-babel-execute function for %s!"
                             lang))))
    ...
          (setq result
                ((lambda (result)
                   (if (and (eq (cdr (assoc :result-type params))
                                'value)
                            (or (member "vector" result-params)
                                (member "table" result-params))
                            (not (listp result)))
                       (list (list result)) result))
                 (funcall cmd body params)))
    ...

Here we get the src block language (sh in Loris's example) and construct
a symbol based on it (org-babel-execute:sh in this case). Because of the
let, lang-check cannot be called as a function: it needs to be
funcall'ed. The code sets cmd to the symbol org-babel-execute:sh and
later funcalls it.

But note that call-process-region is also let-bound to a function. So we
have the call chain

    org-babel-execute-src-block
      --> [through (funcall cmd body params)] org-babel-execute:sh
      --> org-babel-sh-evaluate
      --> org-babel-eval
      --> org-babel-shell-command-on-region
      --> call-process-region

And here we have a symbol that has a function binding as well as a value
binding (also a function, but a *different* function). The symbol is
used in the function position of the function application so the
function binding (i.e. the usual call-process-region function) is used.
It runs hostname locally - Loris loses.

The correct thing to do (in this particular case at least) is to funcall
the symbol at the lowest level. That would call the let-bound function
(org-babel-tramp-handle-call-process-region) which would call hostname
remotely (making Loris happy).

Is it always correct to do that? I don't know but I suspect not: it
would probably be safer to have an org symbol
(org-call-process-region-function maybe) that's value-set to the
standard call-process-region function, but can then be let-bound and
dynamically passed all over the place.

And is this the only problem? Probably not: every flet->let
transformation would have to be scrutinized.
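
The suggestion in the last two paragraphs could be sketched roughly as
follows. This is only an illustration under stated assumptions: every
demo-* name is a hypothetical stand-in (nothing here exists in Org), and
the stub functions return strings instead of running a process. The
variable is declared with defvar so that a let around it is a dynamic
binding, visible to functions further down the call chain.

```elisp
;; Hypothetical names throughout; none of them exist in Org.
(defun demo-run (cmd)
  "Stand-in for `call-process-region': pretend to run CMD locally."
  (format "local: %s" cmd))

(defun demo-run-remote (cmd)
  "Stand-in for the tramp-aware handler: pretend to run CMD remotely."
  (format "remote: %s" cmd))

(defvar demo-run-function #'demo-run
  "Function used to run a command; callers may `let'-bind it.")

(defun demo-lowest-direct (cmd)
  "Call the runner in function position."
  ;; This uses the function binding of `demo-run', so a `let' further
  ;; up the call chain has no effect on which function runs.
  (demo-run cmd))

(defun demo-lowest-funcall (cmd)
  "Call the runner through the variable's current value binding."
  (funcall demo-run-function cmd))

(defun demo-top (lowest)
  "Call LOWEST with the runner variable let-bound to the remote handler."
  (let ((demo-run-function #'demo-run-remote))
    (funcall lowest "hostname")))

(demo-top #'demo-lowest-direct)   ; => "local: hostname"
(demo-top #'demo-lowest-funcall)  ; => "remote: hostname"
```

Using a separate variable name, rather than let-binding a symbol that
also names a real function, means a forgotten funcall at the lowest
level is easier to spot when reading the code.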