From: Matt Lundin <mdl@imapmail.org>
To: Org Mode <emacs-orgmode@gnu.org>
Subject: Problems with org publish cache checking
Date: Tue, 24 Nov 2015 09:14:47 -0600
Message-ID: <87r3jfh1js.fsf@fastmail.fm>

I've been doing some testing of org-publish functions and have found a
few problems with org-publish-cache-file-needs-publishing. They arise
from the fact that it attempts to take included files into account.

The logic is simple enough: while a file may not have changed, the files
it includes may have. So during the publishing process the function
scans every file in a project for #+INCLUDE keywords, comparing the last
modified time of those included files against the timestamps of the
included files stored in the cache. However, there are several
limitations:

1. Unlike org-export-expand-include-keyword,
   org-publish-cache-file-needs-publishing takes no account of recursive
   includes: i.e., included files within included files.

2. It does not cache timestamps for included files that are not also
   project files (i.e., files stored outside of the project or excluded
   via the :exclude plist option). Since org-publish caches the
   timestamps of only those files that are published directly (i.e., not
   as includes), the result is that files that include files outside of
   a publishing project are always republished.
   
3. It is slow!!! The function visits every file in a project to check
   for #+INCLUDE declarations, thus offsetting much of the benefit of
   caching timestamps. To test this, I created a dummy project with over
   1000 pages (not typical usage, of course, but possible for someone
   writing a blog over several years or creating a large interlinked
   wiki).

   During the first publishing run on an old (2007) dual-core machine,
   org-mode generated the entire site in 3 minutes (not bad). However,
   over 40 seconds of that time was spent in
   org-publish-cache-file-needs-publishing (something that is entirely
   redundant on the first publishing run).

--8<---------------cut here---------------start------------->8---
 Function Name                            Call Count  Elapsed Time  Average Time
 org-publish-all                          1           180.82396367  180.82396367
 org-publish-projects                     1           180.82375580  180.82375580
 org-publish-file                         1008        180.41644274  0.1789845662
 org-publish-org-to                       1000        138.45729874  0.1384572987
 org-publish-needed-p                     1008        41.538426420  0.0412087563
 org-publish-cache-file-needs-publishing  1008        41.210540305  0.040883472
--8<---------------cut here---------------end--------------->8---

   During subsequent runs, publishing still took over 40 seconds, despite
   the existence of the cache. This is chiefly because
   org-publish-cache-file-needs-publishing checks every file for includes:

--8<---------------cut here---------------start------------->8---
 Function Name                            Call Count  Elapsed Time  Average Time
 org-publish-all                          1           41.335711491  41.335711491
 org-publish-projects                     1           41.335444938  41.335444938
 org-publish-file                         1008        40.918752137  0.0405940001
 org-publish-needed-p                     1008        40.669991543  0.0403472138
 org-publish-cache-file-needs-publishing  1008        40.566117665  0.040244164
--8<---------------cut here---------------end--------------->8---
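
The three limitations above can be modelled roughly as follows. This is an
illustration in Python, not the actual Emacs Lisp implementation: the names
direct_includes and needs_publishing and the dict-based cache are
hypothetical, and the #+INCLUDE regexp is a simplification that ignores
include options and quoted paths containing spaces.

```python
import os
import re

# Simplified matcher for lines such as:  #+INCLUDE: "chapter.org" :minlevel 2
INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]+"?([^"\s]+)"?',
                        re.IGNORECASE | re.MULTILINE)


def direct_includes(path):
    """Return absolute paths named by #+INCLUDE lines in PATH (one level only)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    base = os.path.dirname(os.path.abspath(path))
    return [os.path.normpath(os.path.join(base, m.group(1)))
            for m in INCLUDE_RE.finditer(text)]


def needs_publishing(path, cache):
    """CACHE maps absolute path -> mtime recorded when that file was published."""
    path = os.path.abspath(path)
    # Point 3: the file is opened and scanned even if its own mtime is unchanged.
    for inc in direct_includes(path):
        # Point 2: an include that is never published directly has no cache
        # entry, so this test succeeds on every run.
        # Point 1: includes inside INC are never examined.
        if inc not in cache or os.path.getmtime(inc) > cache[inc]:
            return True
    return path not in cache or os.path.getmtime(path) > cache[path]
```

In this model the cost of a "nothing changed" run is one full read per
project file, which matches the shape of the second profile.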

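Perhaps the simplest solution to all this would be to give users an
option to turn off checking for #+INCLUDE declarations. This would
reduce subsequent publishing runs to about a second (in the second
profile above, all but roughly 0.8 of the 41.3 seconds are spent in
org-publish-cache-file-needs-publishing), so long as one does not use
included files.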

A more complex solution would be to cache the names of included files
and to store timestamps for the included files if they are outside of
the project (optionally including recursive logic). I am still trying to
figure out the best way to do this.
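
One possible shape for this second approach, sketched in Python rather than
Emacs Lisp and not taken from org-publish itself: after publishing a file,
store the full (recursive) list of its includes together with their
timestamps; on later runs, compare timestamps only. The functions
collect_includes, record_publish and needs_publishing and the cache layout
are all hypothetical, and the #+INCLUDE regexp ignores include options and
quoted paths containing spaces.

```python
import os
import re

INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]+"?([^"\s]+)"?',
                        re.IGNORECASE | re.MULTILINE)


def collect_includes(path, seen=None):
    """Return the set of files reachable from PATH via #+INCLUDE, recursively."""
    seen = set() if seen is None else seen
    path = os.path.abspath(path)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    base = os.path.dirname(path)
    for m in INCLUDE_RE.finditer(text):
        inc = os.path.normpath(os.path.join(base, m.group(1)))
        # The SEEN set guards against circular includes.
        if inc not in seen and os.path.isfile(inc):
            seen.add(inc)
            collect_includes(inc, seen)
    return seen


def record_publish(path, cache):
    """Call after publishing PATH: remember its includes and their mtimes."""
    path = os.path.abspath(path)
    includes = collect_includes(path)
    includes.discard(path)  # a cycle may lead back to PATH itself
    cache[path] = {
        "mtime": os.path.getmtime(path),
        # Stored per publishing file, so files outside the project are covered.
        "includes": {inc: os.path.getmtime(inc) for inc in includes},
    }


def needs_publishing(path, cache):
    """Decide from timestamps alone; no file is opened here."""
    path = os.path.abspath(path)
    entry = cache.get(path)
    if entry is None or os.path.getmtime(path) > entry["mtime"]:
        return True
    for inc, mtime in entry["includes"].items():
        if not os.path.exists(inc) or os.path.getmtime(inc) > mtime:
            return True
    return False
```

The cached include list stays valid as long as neither the file nor any of
its cached includes has changed, because an unchanged file cannot have gained
or lost an #+INCLUDE line. Any change triggers a republish, at which point
record_publish rebuilds the list.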

Advice on how to proceed would be greatly appreciated.

Thanks,
Matt
