emacs-orgmode@gnu.org archives
 help / color / mirror / code / Atom feed
* Problems with org publish cache checking
@ 2015-11-24 15:14 Matt Lundin
  2015-11-25 16:56 ` Matt Lundin
  2015-11-26  2:30 ` [PATCH] " Matt Lundin
  0 siblings, 2 replies; 6+ messages in thread
From: Matt Lundin @ 2015-11-24 15:14 UTC (permalink / raw)
  To: Org Mode

I've been doing some testing of org-publish functions and have found a
few problems with org-publish-cache-file-needs-publishing. They arise
from the fact that it attempts to take included files into account.

The logic is simple enough: while a file may not have changed, the files
it includes may have. So during the publishing process the function
scans every file in a project for #+INCLUDE keywords, comparing the last
modified time of those included files against the timestamps of the
included files stored in the cache. However, there are several
limitations:

1. Unlike org-export-expand-include-keyword,
   org-publish-cache-file-needs-publishing takes no account of recursive
   includes: i.e., included files within included files.

2. It does not cache timestamps for included files that are not also
   project files (i.e.,, files stored outside of the project or excluded
   via the :exclude plist option). Since org-publish caches the
   timestamps of only those files that are published directly (i.e., not
   as includes), the result is that files that files that include files
   outside of a publishing project are always republished. 
   
3. It is slow!!! The function visits every file in a project to check
   for #+INCLUDE declarations, thus offsetting much of the benefit of
   caching timestamps. To test this, I created a dummy project with over
   1000 pages (not typical usage, of course, but possible for someone
   writing a blog over several years or creating a large interlinked
   wiki).

   During the first publishing run on an old (2007) duo-core machine,
   org-mode generated the entire site in 3 minutes (not bad). However,
   over 40 seconds of that time was spent by
   org-publish-cache-file-needs-publishing (something that is entirely
   redundant on the first publishing run).

--8<---------------cut here---------------start------------->8---
 org-publish-all                          1           180.82396367  180.82396367
 org-publish-projects                     1           180.82375580  180.82375580
 org-publish-file                         1008        180.41644274  0.1789845662
 org-publish-org-to                       1000        138.45729874  0.1384572987
 org-publish-needed-p                     1008        41.538426420  0.0412087563
 org-publish-cache-file-needs-publishing  1008        41.210540305  0.040883472
--8<---------------cut here---------------end--------------->8---

  During subsequent runs, publishing still took over 40 seconds, despite
  the existence of the cache. This is chiefly because
  org-publishing-cache-file-needs-publishing checks every file for includes:

--8<---------------cut here---------------start------------->8---
 org-publish-all                          1           41.335711491  41.335711491
 org-publish-projects                     1           41.335444938  41.335444938
 org-publish-file                         1008        40.918752137  0.0405940001
 org-publish-needed-p                     1008        40.669991543  0.0403472138
 org-publish-cache-file-needs-publishing  1008        40.566117665  0.040244164
--8<---------------cut here---------------end--------------->8---

Perhaps the simplest solution to all this would be to give users an
option to turn off checking for #+INCLUDE declarations. This would
reduce subsequent publishing runs to a mere second, so long as one does
not use included files.

A more complex solution would be to cache the names of included files
and to store timestamps for the included files if they are outside of
the project (optionally including recursive logic). I am still trying to
figure out the best way to do this.

Advice on how to proceed would be greatly appreciated.

Thanks,
Matt

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Problems with org publish cache checking
  2015-11-24 15:14 Problems with org publish cache checking Matt Lundin
@ 2015-11-25 16:56 ` Matt Lundin
  2015-11-26  2:30 ` [PATCH] " Matt Lundin
  1 sibling, 0 replies; 6+ messages in thread
From: Matt Lundin @ 2015-11-25 16:56 UTC (permalink / raw)
  To: Org Mode

Matt Lundin <mdl@imapmail.org> writes:

> 2. It does not cache timestamps for included files that are not also
>    project files (i.e.,, files stored outside of the project or excluded
>    via the :exclude plist option). Since org-publish caches the
>    timestamps of only those files that are published directly (i.e., not
>    as includes), the result is that files that files that include files
>    outside of a publishing project are always republished. 

I did not quite diagnose this properly. Org-publish does not check the
cache of included files at all. It simply compares the last modified
time of an included file with the last modified time of the
master/including file. The result is that a master file will perpetually
be republished if an included file happened to be changed afterwards
(even if both files were changed years ago and the project has been
published 100s of times since then). I'll work up a patch to fix this
and submit it in a day or two.

Matt

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] Re: Problems with org publish cache checking
  2015-11-24 15:14 Problems with org publish cache checking Matt Lundin
  2015-11-25 16:56 ` Matt Lundin
@ 2015-11-26  2:30 ` Matt Lundin
  2015-11-26  8:25   ` Nicolas Goaziou
  1 sibling, 1 reply; 6+ messages in thread
From: Matt Lundin @ 2015-11-26  2:30 UTC (permalink / raw)
  To: Org Mode

[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]

Matt Lundin <mdl@imapmail.org> writes:

> I've been doing some testing of org-publish functions and have found a
> few problems with org-publish-cache-file-needs-publishing. They arise
> from the fact that it attempts to take included files into account.

OK, I've worked up a patch that solves several of these issues. The
basic idea is to check when publishing an org file whether it includes
other org files and then to store that data in the cache. That way,
org-publish-cache-file-needs-publishing does not need to open each
buffer but rather can compare the stored timestamp data against the
actual modified times of the included files.

> Org-publish does not check the cache of included files at all. It
> simply compares the last modified time of an included file with the
> last modified time of the master/including file. The result is that a
> master file will perpetually be republished if an included file
> happened to be changed afterwards (even if both files were changed
> years ago and the project has been published 100s of times since
> then).

This patch fixes this by caching timestamps for included files, thus
allowing org-publish to track changes in included files.

> 3. It is slow!!! The function visits every file in a project to check
>    for #+INCLUDE declarations, thus offsetting much of the benefit of
>    caching timestamps. To test this, I created a dummy project with over
>    1000 pages (not typical usage, of course, but possible for someone
>    writing a blog over several years or creating a large interlinked
>    wiki).

This patch should make things much faster, since we only need to scan
for included files during publishing (when the buffer is already
active). Org-publish no longer has to visit each file individually
during publishing (which takes a lot of time); rather, it can just use
the cache.

Matt


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Speed-up-publishing-by-caching-included-file-data.patch --]
[-- Type: text/x-diff, Size: 4167 bytes --]

From 7a69052334416309802c861a7b6b72865c331a37 Mon Sep 17 00:00:00 2001
From: Matt Lundin <mdl@imapmail.org>
Date: Wed, 25 Nov 2015 20:23:39 -0600
Subject: [PATCH] Speed up publishing by caching included file data

* lisp/ox-publish.el: (org-publish-cache-get-included-files): New function
  (org-publish-org-to): Use new function
  (org-publish-cache-file-needs-publishing): Use cache instead of
  visiting every file in a project.

Org-publish can now quickly determine a) whether an org source includes
other files and b) whether those files have changed. This speeds up the
publishing process and makes tracking of changes in included files more
reliable.
---
 lisp/ox-publish.el | 62 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/lisp/ox-publish.el b/lisp/ox-publish.el
index 90f307c..ba85c7e 100644
--- a/lisp/ox-publish.el
+++ b/lisp/ox-publish.el
@@ -574,6 +574,7 @@ Return output file name."
 	     (let ((output-file
 		    (org-export-output-file-name extension nil pub-dir))
 		   (body-p (plist-get plist :body-only)))
+	       (when org-publish-cache (org-publish-cache-get-included-files))
 	       (org-export-to-file backend output-file
 		 nil nil nil body-p
 		 ;; Add `org-publish--collect-references' and
@@ -1227,36 +1228,41 @@ the file including them will be republished as well."
   (unless org-publish-cache
     (error
      "`org-publish-cache-file-needs-publishing' called, but no cache present"))
-  (let* ((case-fold-search t)
-	 (key (org-publish-timestamp-filename filename pub-dir pub-func))
+  (let* ((key (org-publish-timestamp-filename filename pub-dir pub-func))
 	 (pstamp (org-publish-cache-get key))
-	 (org-inhibit-startup t)
-	 (visiting (find-buffer-visiting filename))
-	 included-files-ctime buf)
-    (when (equal (file-name-extension filename) "org")
-      (setq buf (find-file (expand-file-name filename)))
-      (with-current-buffer buf
-	(goto-char (point-min))
-	(while (re-search-forward "^[ \t]*#\\+INCLUDE:" nil t)
-	  (let* ((element (org-element-at-point))
-		 (included-file
-		  (and (eq (org-element-type element) 'keyword)
-		       (let ((value (org-element-property :value element)))
-			 (and value
-			      (string-match "^\\(\".+?\"\\|\\S-+\\)" value)
-			      ;; Ignore search suffix.
-			      (car (split-string
-				    (org-remove-double-quotes
-				     (match-string 1 value)))))))))
-	    (when included-file
-	      (push (org-publish-cache-ctime-of-src
-		     (expand-file-name included-file))
-		    included-files-ctime)))))
-      (unless visiting (kill-buffer buf)))
+	 (ctime (when pstamp (org-publish-cache-ctime-of-src filename))))
     (or (null pstamp)
-	(let ((ctime (org-publish-cache-ctime-of-src filename)))
-	  (or (< pstamp ctime)
-	      (cl-some (lambda (ct) (< ctime ct)) included-files-ctime))))))
+	(< pstamp ctime)
+	(cl-some (lambda (incl)
+		   ;; See if cached time is before modification time.
+		   (< (cdr incl)
+		      (org-publish-cache-ctime-of-src (car incl))))
+	  (org-publish-cache-get-file-property filename :includes)))))
+
+(defun org-publish-cache-get-included-files ()
+  "Get names and last modified times of included files in current buffer."
+  (let ((case-fold-search t)
+	included)
+    (save-excursion
+      (goto-char (point-min))
+      (while (re-search-forward "^[ \t]*#\\+INCLUDE:" nil t)
+	(let* ((element (org-element-at-point))
+	       (included-file
+		(and (eq (org-element-type element) 'keyword)
+		     (let ((value (org-element-property :value element)))
+		       (and value
+			    (string-match "^\\(\".+?\"\\|\\S-+\\)" value)
+			    ;; Ignore search suffix.
+			    (car (split-string
+				  (org-remove-double-quotes
+				   (match-string 1 value)))))))))
+	  (when included-file
+	    (let ((iname (expand-file-name included-file)))
+	      (push (cons iname (org-publish-cache-ctime-of-src
+				 (expand-file-name iname)))
+		    included))))))
+    (org-publish-cache-set-file-property (buffer-file-name)
+					 :includes included)))
 
 (defun org-publish-cache-set-file-property
   (filename property value &optional project-name)
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Re: Problems with org publish cache checking
  2015-11-26  2:30 ` [PATCH] " Matt Lundin
@ 2015-11-26  8:25   ` Nicolas Goaziou
  2015-11-27  1:30     ` Matt Lundin
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Goaziou @ 2015-11-26  8:25 UTC (permalink / raw)
  To: Matt Lundin; +Cc: Org Mode

Hello,

Matt Lundin <mdl@imapmail.org> writes:

> OK, I've worked up a patch that solves several of these issues. The
> basic idea is to check when publishing an org file whether it includes
> other org files and then to store that data in the cache. That way,
> org-publish-cache-file-needs-publishing does not need to open each
> buffer but rather can compare the stored timestamp data against the
> actual modified times of the included files.

This is much better, indeed. Thank you.

One suggestion: wouldn't it make sense to also apply check to SETUPFILE
keywords?


Regards,

-- 
Nicolas Goaziou

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Re: Problems with org publish cache checking
  2015-11-26  8:25   ` Nicolas Goaziou
@ 2015-11-27  1:30     ` Matt Lundin
  2015-11-29 16:18       ` Nicolas Goaziou
  0 siblings, 1 reply; 6+ messages in thread
From: Matt Lundin @ 2015-11-27  1:30 UTC (permalink / raw)
  To: Org Mode

[-- Attachment #1: Type: text/plain, Size: 879 bytes --]

Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> Hello,
>
> Matt Lundin <mdl@imapmail.org> writes:
>
>> OK, I've worked up a patch that solves several of these issues. The
>> basic idea is to check when publishing an org file whether it includes
>> other org files and then to store that data in the cache. That way,
>> org-publish-cache-file-needs-publishing does not need to open each
>> buffer but rather can compare the stored timestamp data against the
>> actual modified times of the included files.
>
> This is much better, indeed. Thank you.
>
> One suggestion: wouldn't it make sense to also apply check to SETUPFILE
> keywords?

Yes, that's a great idea. I've added it to the patch.

One caveat: this patch does not implement recursive checking of included
files (i.e., included files that include other files), but this could be
added in the future.

Thanks,
Matt


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Speed-up-publishing-by-caching-included-file-data.patch --]
[-- Type: text/x-diff, Size: 4773 bytes --]

From cc66884f2836bee1203d06618828c0339ea2e4e2 Mon Sep 17 00:00:00 2001
From: Matt Lundin <mdl@imapmail.org>
Date: Thu, 26 Nov 2015 19:22:00 -0600
Subject: [PATCH] Speed up publishing by caching included file data

* lisp/ox-publish.el: (org-publish-cache-get-included-files): New function
  (org-publish-org-to): Use new function
  (org-publish-cache-file-needs-publishing): Use cache instead of
  visiting every file in a project.

Org-publish can now quickly determine a) whether an org source includes
other files (either via #+INCLUDE or #+SETUPFILE) and b) whether those
files have changed. This speeds up the publishing process and makes
tracking of changes in included files more reliable.
---
 lisp/ox-publish.el | 70 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/lisp/ox-publish.el b/lisp/ox-publish.el
index 90f307c..02eb06d 100644
--- a/lisp/ox-publish.el
+++ b/lisp/ox-publish.el
@@ -574,6 +574,7 @@ Return output file name."
 	     (let ((output-file
 		    (org-export-output-file-name extension nil pub-dir))
 		   (body-p (plist-get plist :body-only)))
+	       (when org-publish-cache (org-publish-cache-get-included-files))
 	       (org-export-to-file backend output-file
 		 nil nil nil body-p
 		 ;; Add `org-publish--collect-references' and
@@ -1221,42 +1222,49 @@ If FREE-CACHE, empty the cache."
 (defun org-publish-cache-file-needs-publishing
     (filename &optional pub-dir pub-func _base-dir)
   "Check the timestamp of the last publishing of FILENAME.
-Return non-nil if the file needs publishing.  Also check if
-any included files have been more recently published, so that
-the file including them will be republished as well."
+Return non-nil if the file needs publishing.  Also use the cache
+to check if any included files have changed, so that the file
+including them will be republished."
   (unless org-publish-cache
     (error
      "`org-publish-cache-file-needs-publishing' called, but no cache present"))
-  (let* ((case-fold-search t)
-	 (key (org-publish-timestamp-filename filename pub-dir pub-func))
+  (let* ((key (org-publish-timestamp-filename filename pub-dir pub-func))
 	 (pstamp (org-publish-cache-get key))
-	 (org-inhibit-startup t)
-	 (visiting (find-buffer-visiting filename))
-	 included-files-ctime buf)
-    (when (equal (file-name-extension filename) "org")
-      (setq buf (find-file (expand-file-name filename)))
-      (with-current-buffer buf
-	(goto-char (point-min))
-	(while (re-search-forward "^[ \t]*#\\+INCLUDE:" nil t)
-	  (let* ((element (org-element-at-point))
-		 (included-file
-		  (and (eq (org-element-type element) 'keyword)
-		       (let ((value (org-element-property :value element)))
-			 (and value
-			      (string-match "^\\(\".+?\"\\|\\S-+\\)" value)
-			      ;; Ignore search suffix.
-			      (car (split-string
-				    (org-remove-double-quotes
-				     (match-string 1 value)))))))))
-	    (when included-file
-	      (push (org-publish-cache-ctime-of-src
-		     (expand-file-name included-file))
-		    included-files-ctime)))))
-      (unless visiting (kill-buffer buf)))
+	 (ctime (when pstamp (org-publish-cache-ctime-of-src filename))))
     (or (null pstamp)
-	(let ((ctime (org-publish-cache-ctime-of-src filename)))
-	  (or (< pstamp ctime)
-	      (cl-some (lambda (ct) (< ctime ct)) included-files-ctime))))))
+	(< pstamp ctime)
+	(cl-some (lambda (incl)
+		   ;; See if cached time is before modification time.
+		   (< (cdr incl)
+		      (org-publish-cache-ctime-of-src (car incl))))
+	  (org-publish-cache-get-file-property filename :includes)))))
+
+(defun org-publish-cache-get-included-files ()
+  "Get data about included files in current buffer.
+Store file names and modification times in cache. Also store data
+about setupfiles."
+  (let ((case-fold-search t)
+	included)
+    (save-excursion
+      (goto-char (point-min))
+      (while (re-search-forward "^[ \t]*#\\+\\(INCLUDE\\|SETUPFILE\\):" nil t)
+	(let* ((element (org-element-at-point))
+	       (included-file
+		(and (eq (org-element-type element) 'keyword)
+		     (let ((value (org-element-property :value element)))
+		       (and value
+			    (string-match "^\\(\".+?\"\\|\\S-+\\)" value)
+			    ;; Ignore search suffix.
+			    (car (split-string
+				  (org-remove-double-quotes
+				   (match-string 1 value)))))))))
+	  (when included-file
+	    (let ((iname (expand-file-name included-file)))
+	      (push (cons iname (org-publish-cache-ctime-of-src
+				 (expand-file-name iname)))
+		    included))))))
+    (org-publish-cache-set-file-property (buffer-file-name)
+					 :includes included)))
 
 (defun org-publish-cache-set-file-property
   (filename property value &optional project-name)
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Re: Problems with org publish cache checking
  2015-11-27  1:30     ` Matt Lundin
@ 2015-11-29 16:18       ` Nicolas Goaziou
  0 siblings, 0 replies; 6+ messages in thread
From: Nicolas Goaziou @ 2015-11-29 16:18 UTC (permalink / raw)
  To: Matt Lundin; +Cc: Org Mode

Matt Lundin <mdl@imapmail.org> writes:

> Yes, that's a great idea. I've added it to the patch.

Thank you. LGTM.

> One caveat: this patch does not implement recursive checking of included
> files (i.e., included files that include other files), but this could be
> added in the future.

Well, if you're not out of steam (or time) yet, it would be nice to add
this bugfix in the same patch instead.

Of course, the current patch is already a great step forward.

WDYT?

Regards,

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-11-29 16:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-24 15:14 Problems with org publish cache checking Matt Lundin
2015-11-25 16:56 ` Matt Lundin
2015-11-26  2:30 ` [PATCH] " Matt Lundin
2015-11-26  8:25   ` Nicolas Goaziou
2015-11-27  1:30     ` Matt Lundin
2015-11-29 16:18       ` Nicolas Goaziou

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).