From: Matt Lundin
Subject: Problems with org publish cache checking
Date: Tue, 24 Nov 2015 09:14:47 -0600
Message-ID: <87r3jfh1js.fsf@fastmail.fm>
To: Org Mode

I've been doing some testing of org-publish functions and have found a
few problems with org-publish-cache-file-needs-publishing. They arise
from the fact that it attempts to take included files into account.

The logic is simple enough: while a file may not have changed, the
files it includes may have. So during the publishing process the
function scans every file in a project for #+INCLUDE keywords and
compares the last-modified time of each included file against the
timestamp stored for it in the cache.

However, there are several limitations:

1. Unlike org-export-expand-include-keyword,
   org-publish-cache-file-needs-publishing takes no account of
   recursive includes, i.e., included files within included files.

2. It does not cache timestamps for included files that are not also
   project files (i.e., files stored outside of the project or
   excluded via the :exclude plist option). Since org-publish caches
   the timestamps of only those files that are published directly
   (i.e., not as includes), files that include files from outside the
   publishing project are always republished.

3. It is slow!!! The function visits every file in a project to check
   for #+INCLUDE declarations, thus offsetting much of the benefit of
   caching timestamps.

To test this, I created a dummy project with over 1000 pages (not
typical usage, of course, but possible for someone writing a blog over
several years or creating a large interlinked wiki). During the first
publishing run on an old (2007) dual-core machine, org-mode generated
the entire site in 3 minutes (not bad). However, over 40 seconds of
that time was spent in org-publish-cache-file-needs-publishing
(something that is entirely redundant on the first publishing run).
The columns below are function name, call count, total elapsed time in
seconds, and average time per call:

--8<---------------cut here---------------start------------->8---
org-publish-all                          1     180.82396367  180.82396367
org-publish-projects                     1     180.82375580  180.82375580
org-publish-file                         1008  180.41644274  0.1789845662
org-publish-org-to                       1000  138.45729874  0.1384572987
org-publish-needed-p                     1008  41.538426420  0.0412087563
org-publish-cache-file-needs-publishing  1008  41.210540305  0.040883472
--8<---------------cut here---------------end--------------->8---

During subsequent runs, publishing still took over 40 seconds, despite
the existence of the cache.
This is chiefly because org-publish-cache-file-needs-publishing checks
every file for includes:

--8<---------------cut here---------------start------------->8---
org-publish-all                          1     41.335711491  41.335711491
org-publish-projects                     1     41.335444938  41.335444938
org-publish-file                         1008  40.918752137  0.0405940001
org-publish-needed-p                     1008  40.669991543  0.0403472138
org-publish-cache-file-needs-publishing  1008  40.566117665  0.040244164
--8<---------------cut here---------------end--------------->8---

Perhaps the simplest solution to all this would be to give users an
option to turn off checking for #+INCLUDE declarations. This would
reduce subsequent publishing runs to a mere second, so long as one
does not use included files.

A more complex solution would be to cache the names of included files
and to store timestamps for the included files if they are outside of
the project (optionally including recursive logic).

I am still trying to figure out the best way to do this. Advice on how
to proceed would be greatly appreciated.

Thanks,
Matt
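
For illustration only, the two proposals above (an option to skip
include checking, and a cache that remembers included files with their
timestamps, recursively) could be sketched roughly as follows. This is
a language-neutral sketch in Python, not Emacs Lisp and not Org's
implementation. The names collect_includes, needs_publishing and
record_published, the check_includes flag, and the cache layout are
all hypothetical. The #+INCLUDE parsing is deliberately simplified: it
only drops a "::location" suffix and ignores other include options.

```python
import os
import re

# Simplified matcher for lines such as:
#   #+INCLUDE: "chapter.org"    or    #+include: chapter.org
INCLUDE_RE = re.compile(
    r'^[ \t]*#\+INCLUDE:[ \t]*(?:"([^"]+)"|(\S+))',
    re.IGNORECASE | re.MULTILINE,
)


def collect_includes(path, seen=None):
    """Return the set of files reachable from PATH via #+INCLUDE, recursively.

    Files already in SEEN are not visited again, so include cycles terminate.
    """
    if seen is None:
        seen = set()
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    base = os.path.dirname(os.path.abspath(path))
    for match in INCLUDE_RE.finditer(text):
        name = (match.group(1) or match.group(2)).split("::", 1)[0]
        target = os.path.normpath(os.path.join(base, name))
        if target in seen or not os.path.isfile(target):
            continue
        seen.add(target)
        collect_includes(target, seen)
    return seen


def record_published(path, cache):
    """Store PATH's timestamp plus the timestamps of everything it includes.

    Hypothetical cache layout:
      {source: {"mtime": float, "includes": {included_file: mtime}}}
    Included files are recorded whether or not they belong to the project.
    """
    cache[path] = {
        "mtime": os.path.getmtime(path),
        "includes": {inc: os.path.getmtime(inc) for inc in collect_includes(path)},
    }


def needs_publishing(path, cache, check_includes=True):
    """Return True if PATH, or any file it is known to include, has changed."""
    entry = cache.get(path)
    if entry is None or os.path.getmtime(path) != entry["mtime"]:
        return True
    if not check_includes:
        return False
    # The source is unchanged, so only stat the remembered includes;
    # the source itself is not reopened or rescanned.
    for inc, mtime in entry["includes"].items():
        if not os.path.exists(inc) or os.path.getmtime(inc) != mtime:
            return True
    return False
```

The point of the design is that an unchanged file costs only a few
stat calls on later runs: its include list is read from the cache
instead of being rediscovered by scanning the file. The list can only
go stale if some file in the include chain changes, and that change
itself triggers republishing and a fresh record_published call.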