From: Matt Lundin <mdl@imapmail.org>
To: Org Mode <emacs-orgmode@gnu.org>
Subject: Problems with org publish cache checking
Date: Tue, 24 Nov 2015 09:14:47 -0600
Message-ID: <87r3jfh1js.fsf@fastmail.fm>
I've been doing some testing of org-publish functions and have found a
few problems with org-publish-cache-file-needs-publishing. They arise
from the fact that it attempts to take included files into account.
The logic is simple enough: while a file may not have changed, the files
it includes may have. So during the publishing process the function
scans every file in a project for #+INCLUDE keywords, comparing the last
modified time of those included files against the timestamps of the
included files stored in the cache. However, there are several
limitations:
1. Unlike org-export-expand-include-keyword,
org-publish-cache-file-needs-publishing takes no account of recursive
includes: i.e., included files within included files.
2. It does not cache timestamps for included files that are not also
project files (i.e., files stored outside of the project or excluded
via the :exclude plist option). Since org-publish caches the
timestamps of only those files that are published directly (i.e., not
as includes), the result is that files that include files
outside of a publishing project are always republished.
3. It is slow!!! The function visits every file in a project to check
for #+INCLUDE declarations, thus offsetting much of the benefit of
caching timestamps. To test this, I created a dummy project with over
1000 pages (not typical usage, of course, but possible for someone
writing a blog over several years or creating a large interlinked
wiki).
During the first publishing run on an old (2007) dual-core machine,
org-mode generated the entire site in 3 minutes (not bad). However,
over 40 seconds of that time was spent by
org-publish-cache-file-needs-publishing (something that is entirely
redundant on the first publishing run).
--8<---------------cut here---------------start------------->8---
Function                                 Calls  Elapsed (s)     Average (s)
org-publish-all                          1      180.82396367    180.82396367
org-publish-projects                     1      180.82375580    180.82375580
org-publish-file                         1008   180.41644274    0.1789845662
org-publish-org-to                       1000   138.45729874    0.1384572987
org-publish-needed-p                     1008   41.538426420    0.0412087563
org-publish-cache-file-needs-publishing  1008   41.210540305    0.040883472
--8<---------------cut here---------------end--------------->8---
During subsequent runs, with nothing changed, publishing still took over
40 seconds, despite the existence of the cache. This is chiefly because
org-publish-cache-file-needs-publishing visits every file to check for
includes:
--8<---------------cut here---------------start------------->8---
Function                                 Calls  Elapsed (s)     Average (s)
org-publish-all                          1      41.335711491    41.335711491
org-publish-projects                     1      41.335444938    41.335444938
org-publish-file                         1008   40.918752137    0.0405940001
org-publish-needed-p                     1008   40.669991543    0.0403472138
org-publish-cache-file-needs-publishing  1008   40.566117665    0.040244164
--8<---------------cut here---------------end--------------->8---
Perhaps the simplest solution to all this would be to give users an
option to turn off checking for #+INCLUDE declarations. This would
reduce subsequent publishing runs to a mere second, so long as one does
not use included files.
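
To illustrate what such a switch buys, here is a minimal sketch in
Python (used purely for illustration; it is not Emacs Lisp and not the
org-publish code). The names needs_publishing, direct_includes and
check_includes are hypothetical, and the cache is assumed to be a plain
mapping from absolute file name to the modification time recorded at the
last publish. With the flag off, the check is a single stat call and the
file is never opened.

```python
import os
import re

# Matches lines such as:  #+INCLUDE: "chapter.org" :lines "1-10"
INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]*(?:"([^"\n]+)"|(\S+))',
                        re.IGNORECASE | re.MULTILINE)


def direct_includes(path):
    """Return the files named by #+INCLUDE keywords in PATH (not recursive)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    base = os.path.dirname(os.path.abspath(path))
    return [os.path.normpath(os.path.join(base, m.group(1) or m.group(2)))
            for m in INCLUDE_RE.finditer(text)]


def needs_publishing(path, cache, check_includes=True):
    """Hypothetical cache check; CACHE maps absolute file name -> cached mtime."""
    path = os.path.abspath(path)
    cached = cache.get(path)
    if cached is None or os.path.getmtime(path) > cached:
        return True
    if not check_includes:
        return False  # fast path: one stat call, the file is never read
    # Slow path: the file is opened and scanned on every run.
    for inc in direct_includes(path):
        inc_cached = cache.get(inc)
        # An include with no cache entry always forces a republish,
        # which mirrors limitation 2 above.
        if (inc_cached is None or not os.path.exists(inc)
                or os.path.getmtime(inc) > inc_cached):
            return True
    return False
```

The trade-off is the one already stated: with check_includes off, a
change to an included file goes unnoticed until the including file itself
is touched.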
A more complex solution would be to cache the names of included files
and to store timestamps for the included files if they are outside of
the project (optionally including recursive logic). I am still trying to
figure out the best way to do this.
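
One possible shape for this, again as a rough Python sketch rather than
a proposed implementation: each cache entry stores the file's own
timestamp plus the names and timestamps of everything it includes,
directly or recursively. The functions all_includes, record and
needs_publishing and the cache layout are assumptions made up for this
example. A file is only opened when it is (re)published; the check itself
uses stat calls alone.

```python
import os
import re

INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]*(?:"([^"\n]+)"|(\S+))',
                        re.IGNORECASE | re.MULTILINE)


def mtime(path):
    """Return PATH's modification time, or None if it cannot be read."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def all_includes(path, seen=None):
    """Return the set of files PATH includes, directly or via other includes."""
    seen = set() if seen is None else seen
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read()
    except OSError:
        return seen  # missing or unreadable file: nothing further to scan
    base = os.path.dirname(os.path.abspath(path))
    for m in INCLUDE_RE.finditer(text):
        inc = os.path.normpath(os.path.join(base, m.group(1) or m.group(2)))
        if inc not in seen:  # also guards against include cycles
            seen.add(inc)
            all_includes(inc, seen)
    return seen


def record(path, cache):
    """Call after publishing PATH: store its mtime and those of its includes."""
    path = os.path.abspath(path)
    cache[path] = {"mtime": mtime(path),
                   "includes": {inc: mtime(inc) for inc in all_includes(path)}}


def needs_publishing(path, cache):
    """True if PATH or any cached include changed since record() was called."""
    entry = cache.get(os.path.abspath(path))
    if entry is None or mtime(path) != entry["mtime"]:
        return True
    return any(mtime(inc) != stamp for inc, stamp in entry["includes"].items())
```

The cached include list cannot go stale silently: it depends only on the
contents of the file and of its includes, so if none of their timestamps
changed, the list is still valid. Included files outside the project are
covered because their timestamps live inside the including file's entry.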
Advice on how to proceed would be greatly appreciated.
Thanks,
Matt