Poor org-babel-tangle-file performance with more than 100 trivial noweb-references

emacs-orgmode@gnu.org archives
 help / color / mirror / code / Atom feed

* Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
@ 2022-01-25  5:19 pareto optimal
  2022-01-25 14:03 ` Sébastien Miquel
  2022-01-25 14:11 ` Ihor Radchenko
  0 siblings, 2 replies; 6+ messages in thread
From: pareto optimal @ 2022-01-25  5:19 UTC (permalink / raw)
  To: emacs-orgmode

[-- Attachment #1: Type: text/plain, Size: 6766 bytes --]

Using =emacs -Q= to tangle org files with more than over 100 noweb-refs gets slow fast.

Given this org code for N=2:

#+begin_src org :tangle 2.org
  ,#+begin_src elisp :noweb yes :tangle 2.el
  <<some-reference>>
  ,#+end_src

  ,#+begin_src elisp :noweb-ref some-reference :tangle no
  ;; comment
  ,#+end_src

  ,#+begin_src elisp :noweb-ref some-reference :tangle no
  ;; comment
  ,#+end_src
#+end_src

and this macro for benchmarking:

#+begin_src emacs-lisp
(cl-defmacro bench (&optional (times 100000) &rest body)
  (declare (indent defun))
  `(progn
     (garbage-collect)
     (list '("Total runtime" "# of GCs" "Total GC runtime")
           'hline
           (benchmark-run-compiled ,times
             (progn
               ,@body)))))
#+end_src

and the org-babel-tangle-file call with gc-cons-threshold at most-positive-fixnum:

#+begin_src emacs-lisp
  (let ((gc-cons-threshold most-positive-fixnum))
    (bench (org-babel-tangle-file "2.org")))
#+end_src

Using Gcc Emacs 28.0.91 (which I typicall use) I get these results:

| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.027 |        0 |
|       20 |   0.049 |        0 |
|       40 |   0.121 |        0 |
|       80 |   0.391 |        0 |
|      160 |   1.426 |        0 |
|      320 |   6.765 |        0 |
|      640 |  23.972 |        0 |

so roughly it scales like:

#+begin_example
0.8x (10-20)
2.5x  (20-40)
3x  (40-80)
3.5x  (160-320)
4x  (320-640)
#+end_example

Though I'm not sure how much that can tell us... my guess is noweb-ref's don't use a dictionary with =O(1)= lookups?

Out of curiousity I also benchmarked emacs 27.2 which came with org 9.4.4 and it gave roughly the same results:

* emacs -Q (emacs 27.2/org-9.4.4)
** =gc-cons-threshold= most-positive-fixnum

| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.031 |        0 |
|       20 |   0.054 |        0 |
|       40 |   0.142 |        0 |
|       80 |   0.445 |        0 |
|      160 |   1.641 |        0 |
|      320 |   6.478 |        0 |
|      640 |  25.729 |        0 |

I have the full benchmarking experiment here: https://www.paretooptimal.dev/poor-org-babel-performance-with-lots-of-noweb-references/

Emacs  : GNU Emacs 28.0.91 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.30, cairo version 1.16.0)
Package: Org mode version 9.5.2 (9.5.2-gfbff08 @ /nix/store/3ff9visj5fl1a3d4krpv9wgmd078f5l1-emacs-packages-deps/share/emacs/site-lisp/elpa/org-9.5.2/)

current state:
==============
(setq
 org-link-elisp-confirm-function 'yes-or-no-p
 org-bibtex-headline-format-function #[257 "\300\x01\236A\207" [:title] 3 "\n\n(fn ENTRY)"]
 org-export-before-parsing-hook '(org-attach-expand-links)
 org-archive-hook '(org-attach-archive-delete-maybe)
 org-cycle-hook '(org-cycle-hide-archived-subtrees org-cycle-hide-drawers org-cycle-show-empty-lines
          org-optimize-window-after-visibility-change)
 org-mode-hook '(#[0 "\300\301\302\303\304$\207" [add-hook change-major-mode-hook org-show-all append local] 5]
         #[0 "\300\301\302\303\304$\207"
           [add-hook change-major-mode-hook org-babel-show-result-all append local] 5]
         org-babel-result-hide-spec org-babel-hide-all-hashes)
 org-confirm-shell-link-function 'yes-or-no-p
 outline-isearch-open-invisible-function 'outline-isearch-open-invisible
 org-agenda-before-write-hook '(org-agenda-add-entry-text)
 org-src-mode-hook '(org-src-babel-configure-edit-buffer org-src-mode-configure-edit-buffer)
 org-confirm-elisp-link-function 'yes-or-no-p
 org-speed-command-hook '(org-speed-command-activate org-babel-speed-command-activate)
 org-tab-first-hook '(org-babel-hide-result-toggle-maybe org-babel-header-arg-expand)
 org-link-shell-confirm-function 'yes-or-no-p
 org-babel-pre-tangle-hook '(save-buffer)
 org-agenda-loop-over-headlines-in-active-region nil
 org-occur-hook '(org-first-headline-recenter)
 org-metadown-hook '(org-babel-pop-to-session-maybe)
 org-link-parameters '(("attachment" :follow org-attach-follow :complete org-attach-complete-link)
               ("id" :follow org-id-open) ("eww" :follow org-eww-open :store org-eww-store-link)
               ("rmail" :follow org-rmail-open :store org-rmail-store-link)
               ("mhe" :follow org-mhe-open :store org-mhe-store-link)
               ("irc" :follow org-irc-visit :store org-irc-store-link :export org-irc-export)
               ("info" :follow org-info-open :export org-info-export :store org-info-store-link)
               ("gnus" :follow org-gnus-open :store org-gnus-store-link)
               ("docview" :follow org-docview-open :export org-docview-export :store
            org-docview-store-link)
               ("bibtex" :follow org-bibtex-open :store org-bibtex-store-link)
               ("bbdb" :follow org-bbdb-open :export org-bbdb-export :complete org-bbdb-complete-link
            :store org-bbdb-store-link)
               ("w3m" :store org-w3m-store-link)
               ("doi" :follow org-link-doi-open :export org-link-doi-export) ("file+sys") ("file+emacs")
               ("shell" :follow org-link--open-shell)
               ("news" :follow
            #[514 "\301\300\302\x04Q\x02\"\207" ["news" browse-url ":"] 6 "\n\n(fn URL ARG)"])
               ("mailto" :follow
            #[514 "\301\300\302\x04Q\x02\"\207" ["mailto" browse-url ":"] 6 "\n\n(fn URL ARG)"])
               ("https" :follow
            #[514 "\301\300\302\x04Q\x02\"\207" ["https" browse-url ":"] 6 "\n\n(fn URL ARG)"])
               ("http" :follow
            #[514 "\301\300\302\x04Q\x02\"\207" ["http" browse-url ":"] 6 "\n\n(fn URL ARG)"])
               ("ftp" :follow
            #[514 "\301\300\302\x04Q\x02\"\207" ["ftp" browse-url ":"] 6 "\n\n(fn URL ARG)"])
               ("help" :follow org-link--open-help :store org-link--store-help)
               ("file" :complete org-link-complete-file) ("elisp" :follow org-link--open-elisp))
 org-metaup-hook '(org-babel-load-in-session-maybe)
 )

-- Sent with https://mailfence.com  Secure and private email

[-- Attachment #2: Type: text/html, Size: 9606 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
  2022-01-25  5:19 Poor org-babel-tangle-file performance with more than 100 trivial noweb-references pareto optimal
@ 2022-01-25 14:03 ` Sébastien Miquel
  2022-01-25 14:25   ` Ihor Radchenko
  2022-01-25 14:11 ` Ihor Radchenko
  1 sibling, 1 reply; 6+ messages in thread
From: Sébastien Miquel @ 2022-01-25 14:03 UTC (permalink / raw)
  To: pareto optimal, emacs-orgmode; +Cc: Ihor Radchenko

[-- Attachment #1: Type: text/plain, Size: 513 bytes --]

Hi,

pareto optimal writes:
> Using =emacs -Q= to tangle org files with more than over 100 
> noweb-refs gets slow fast.
I can reproduce the slow down with my config. The culprit is
~org-element--cache-verify-element~. Significantly decreasing
=org-element--cache-self-verify-frequency= yields a 5x speedup.

Perhaps the frequency of the second
: (< (random 1000) (* 1000 org-element--cache-self-verify-frequency))
call in ~org-element--cache-verify-element~ should be decreased.

Regards,

-- 
Sébastien Miquel

[-- Attachment #2: Type: text/html, Size: 1265 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
  2022-01-25  5:19 Poor org-babel-tangle-file performance with more than 100 trivial noweb-references pareto optimal
  2022-01-25 14:03 ` Sébastien Miquel
@ 2022-01-25 14:11 ` Ihor Radchenko
  2022-01-25 16:25   ` Tim Cross
  2022-01-26 11:43   ` Ihor Radchenko
  1 sibling, 2 replies; 6+ messages in thread
From: Ihor Radchenko @ 2022-01-25 14:11 UTC (permalink / raw)
  To: pareto optimal; +Cc: emacs-orgmode

[-- Attachment #1: Type: text/plain, Size: 2138 bytes --]

pareto optimal <pareto.optimal@mailfence.com> writes:

> Using =emacs -Q= to tangle org files with more than over 100 noweb-refs gets slow fast.
>
> Given this org code for N=2:
> Using Gcc Emacs 28.0.91 (which I typicall use) I get these results:
>
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.027 |        0 |
> |       20 |   0.049 |        0 |
> |       40 |   0.121 |        0 |
> |       80 |   0.391 |        0 |
> |      160 |   1.426 |        0 |
> |      320 |   6.765 |        0 |
> |      640 |  23.972 |        0 |
>
> so roughly it scales like:
>
> #+begin_example
> 0.8x (10-20)
> 2.5x  (20-40)
> 3x  (40-80)
> 3.5x  (160-320)
> 4x  (320-640)
> #+end_example
>
> Though I'm not sure how much that can tell us... my guess is noweb-ref's don't use a dictionary with =O(1)= lookups?

Thanks for the nice benchmark and for providing a test case!

You are using stable version of Org. We did some important performance
improvements on main. The same benchmark using latest main (26decec):

| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.030 |        0 |
|       20 |   0.067 |        0 |
|       40 |   0.065 |        0 |
|       80 |   0.081 |        0 |
|      160 |   0.214 |        0 |
|      320 |   0.597 |        0 |
|      640 |   2.225 |        0 |
|     1280 |   9.221 |        0 |
|     2560 |  36.966 |        0 |

And with disabled self-verification code:
(setq org-element--cache-self-verify nil)
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.078 |        0 |
|       20 |   0.075 |        0 |
|       40 |   0.063 |        0 |
|       80 |   0.085 |        0 |
|      160 |   0.095 |        0 |
|      320 |   0.178 |        0 |
|      640 |   0.311 |        0 |
|     1280 |   0.819 |        0 |
|     2560 |   2.302 |        0 |
|     5120 |   8.878 |        0 |
|    10240 |  32.706 |        0 |

Graphical comparison:

[-- Attachment #2: benchmark1.png --]
[-- Type: image/png, Size: 8259 bytes --]

[-- Attachment #3: Type: text/plain, Size: 77 bytes --]


There is ~80x improvement.

However, the scaling is still not quite linear:

[-- Attachment #4: benchmark2.png --]
[-- Type: image/png, Size: 9706 bytes --]

[-- Attachment #5: Type: text/plain, Size: 5111 bytes --]


So, there is still some power-law nonlinearity in the tangle code.

Benchmarking revealed the following:
        8259  95%               - org-babel-tangle-file
        8259  95%                - org-babel-tangle
        8170  94%                 - org-babel-tangle-collect-blocks
        3620  41%                  - org-in-archived-heading-p
        3600  41%                     org-before-first-heading-p

Showing that nonlinear scaling comes from regexp search.

Caching org-before-first-heading-p improves the performance further (see
Org 9.6-dev no self-verify + patch curves):

| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.118 |        0 |
|       20 |   0.106 |        0 |
|       40 |   0.136 |        0 |
|       80 |   0.157 |        0 |
|      160 |   0.139 |        0 |
|      320 |   0.212 |        0 |
|      640 |   0.542 |        0 |
|     1280 |   0.797 |        0 |
|     2560 |   1.533 |        0 |
|     5120 |   4.651 |        0 |
|    10240 |  16.745 |        0 |

The profiling gives:

       16817  63%                    - org-babel-tangle-file
       16280  61%                     - org-babel-tangle
       16200  61%                      - org-babel-tangle-collect-blocks
        1360   5%                       + org-babel-tangle-single-block
        1210   4%                       + org-babel-get-src-block-info

Most of the CPU time is spent in org-babel-tangle-collect-blocks, which
is basically another regexp search for all the source blocks in buffer.
The scaling is still slightly non-linear - maybe your source block
regexp is too complex:

(defvar org-babel-src-block-regexp
  (concat
   ;; (1) indentation                 (2) lang
   "^\\([ \t]*\\)#\\+begin_src[ \t]+\\([^ \f\t\n\r\v]+\\)[ \t]*"
   ;; (3) switches
   "\\([^\":\n]*\"[^\"\n*]*\"[^\":\n]*\\|[^\":\n]*\\)"
   ;; (4) header arguments
   "\\([^\n]*\\)\n"
   ;; (5) body
   "\\([^\000]*?\n\\)??[ \t]*#\\+end_src")
  "Regexp used to identify code blocks.")

On the other hand, the test data is already within quite non-realistic
over 5000 number of blocks. Probably, further performance improvement is
not very useful (we already have some overheads at smaller N).

Best,
Ihor

-----
The Org file used to generate plots:

#+name: nocache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.027 |        0 |
|       20 |   0.049 |        0 |
|       40 |   0.121 |        0 |
|       80 |   0.391 |        0 |
|      160 |   1.426 |        0 |
|      320 |   6.765 |        0 |
|      640 |  23.972 |        0 |

#+name: cache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.030 |        0 |
|       20 |   0.067 |        0 |
|       40 |   0.065 |        0 |
|       80 |   0.081 |        0 |
|      160 |   0.214 |        0 |
|      320 |   0.597 |        0 |
|      640 |   2.225 |        0 |
|     1280 |   9.221 |        0 |
|     2560 |  36.966 |        0 |

#+name: cache-no-self
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.078 |        0 |
|       20 |   0.075 |        0 |
|       40 |   0.063 |        0 |
|       80 |   0.085 |        0 |
|      160 |   0.095 |        0 |
|      320 |   0.178 |        0 |
|      640 |   0.311 |        0 |
|     1280 |   0.819 |        0 |
|     2560 |   2.302 |        0 |
|     5120 |   8.878 |        0 |
|    10240 |  32.706 |        0 |

#+name: cache-no-self+fix
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.118 |        0 |
|       20 |   0.106 |        0 |
|       40 |   0.136 |        0 |
|       80 |   0.157 |        0 |
|      160 |   0.139 |        0 |
|      320 |   0.212 |        0 |
|      640 |   0.542 |        0 |
|     1280 |   0.797 |        0 |
|     2560 |   1.533 |        0 |
|     5120 |   4.651 |        0 |
|    10240 |  16.745 |        0 |


#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark1.png
set title 'Tangle code performance timing'
US='u 1:2 w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, sec"
set key top right
plot d1 @US t'Org 9.5.2 stable', d2 @US t'Org 9.6-dev', d3 @US t'Org 9.6-dev no self-verify', d4 @US t'Org 9.6-dev no self-verify + patch'
#+end_src

#+RESULTS[edd2a2d5d80b31876917faee349ed71ba0fe239a]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark1.png]]

#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark2.png
set title 'Tangle code performance scaling (normalized)'
US='w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, normalized by time at N=640"
set key top left
set xrange [0:6000]
plot d2 u 1:($2/2.225) @US t'Org 9.6-dev', d3 u 1:($2/0.311) @US t'Org 9.6-dev no self-verify', d1 u 1:($2/23.972) @US t'Org 9.5.2 stable', d4 u 1:($2/0.542) @US t'Org 9.6-dev no self-verify + patch', x*x/870000 t'x^2', 
#+end_src

#+RESULTS[b170ef78ac377f6d2ad1c2eabb20cd62dc0ea33f]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark2.png]]


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
  2022-01-25 14:03 ` Sébastien Miquel
@ 2022-01-25 14:25   ` Ihor Radchenko
  0 siblings, 0 replies; 6+ messages in thread
From: Ihor Radchenko @ 2022-01-25 14:25 UTC (permalink / raw)
  To: sebastien.miquel; +Cc: emacs-orgmode, pareto optimal

Sébastien Miquel <sebastien.miquel@posteo.eu> writes:

> pareto optimal writes:
>> Using =emacs -Q= to tangle org files with more than over 100 
>> noweb-refs gets slow fast.
> I can reproduce the slow down with my config. The culprit is
> ~org-element--cache-verify-element~. Significantly decreasing
> =org-element--cache-self-verify-frequency= yields a 5x speedup.

Even with verification, we already have 10x speed improvement. I do not
recommend decreasing verification frequency. Not until we make sure that
there are no more lurking cache bugs on main. You might do it in your
personal config, but please be careful and make sure that you have
backups/version control your Org files.

Best,
Ihor

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
  2022-01-25 14:11 ` Ihor Radchenko
@ 2022-01-25 16:25   ` Tim Cross
  2022-01-26 11:43   ` Ihor Radchenko
  1 sibling, 0 replies; 6+ messages in thread
From: Tim Cross @ 2022-01-25 16:25 UTC (permalink / raw)
  To: emacs-orgmode


Ihor Radchenko <yantar92@gmail.com> writes:

> pareto optimal <pareto.optimal@mailfence.com> writes:
>
>> Using =emacs -Q= to tangle org files with more than over 100 noweb-refs gets slow fast.
>>
>> Given this org code for N=2:
>> Using Gcc Emacs 28.0.91 (which I typicall use) I get these results:
>>
>> | N blocks | runtime | # of GCs |
>> |----------+---------+----------|
>> |       10 |   0.027 |        0 |
>> |       20 |   0.049 |        0 |
>> |       40 |   0.121 |        0 |
>> |       80 |   0.391 |        0 |
>> |      160 |   1.426 |        0 |
>> |      320 |   6.765 |        0 |
>> |      640 |  23.972 |        0 |
>>
>> so roughly it scales like:
>>
>> #+begin_example
>> 0.8x (10-20)
>> 2.5x  (20-40)
>> 3x  (40-80)
>> 3.5x  (160-320)
>> 4x  (320-640)
>> #+end_example
>>
>> Though I'm not sure how much that can tell us... my guess is noweb-ref's don't use a dictionary with =O(1)= lookups?
>
> Thanks for the nice benchmark and for providing a test case!
>
> You are using stable version of Org. We did some important performance
> improvements on main. The same benchmark using latest main (26decec):
>
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.030 |        0 |
> |       20 |   0.067 |        0 |
> |       40 |   0.065 |        0 |
> |       80 |   0.081 |        0 |
> |      160 |   0.214 |        0 |
> |      320 |   0.597 |        0 |
> |      640 |   2.225 |        0 |
> |     1280 |   9.221 |        0 |
> |     2560 |  36.966 |        0 |
>
> And with disabled self-verification code:
> (setq org-element--cache-self-verify nil)
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.078 |        0 |
> |       20 |   0.075 |        0 |
> |       40 |   0.063 |        0 |
> |       80 |   0.085 |        0 |
> |      160 |   0.095 |        0 |
> |      320 |   0.178 |        0 |
> |      640 |   0.311 |        0 |
> |     1280 |   0.819 |        0 |
> |     2560 |   2.302 |        0 |
> |     5120 |   8.878 |        0 |
> |    10240 |  32.706 |        0 |
>
> Graphical comparison:
>
>
>
>
> There is ~80x improvement.
>
> However, the scaling is still not quite linear:
>
>
>
>
> So, there is still some power-law nonlinearity in the tangle code.
>
> Benchmarking revealed the following:
>         8259  95%               - org-babel-tangle-file
>         8259  95%                - org-babel-tangle
>         8170  94%                 - org-babel-tangle-collect-blocks
>         3620  41%                  - org-in-archived-heading-p
>         3600  41%                     org-before-first-heading-p
>
> Showing that nonlinear scaling comes from regexp search.
>
> Caching org-before-first-heading-p improves the performance further (see
> Org 9.6-dev no self-verify + patch curves):
>
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.118 |        0 |
> |       20 |   0.106 |        0 |
> |       40 |   0.136 |        0 |
> |       80 |   0.157 |        0 |
> |      160 |   0.139 |        0 |
> |      320 |   0.212 |        0 |
> |      640 |   0.542 |        0 |
> |     1280 |   0.797 |        0 |
> |     2560 |   1.533 |        0 |
> |     5120 |   4.651 |        0 |
> |    10240 |  16.745 |        0 |
>
> The profiling gives:
>
>        16817  63%                    - org-babel-tangle-file
>        16280  61%                     - org-babel-tangle
>        16200  61%                      - org-babel-tangle-collect-blocks
>         1360   5%                       + org-babel-tangle-single-block
>         1210   4%                       + org-babel-get-src-block-info
>
> Most of the CPU time is spent in org-babel-tangle-collect-blocks, which
> is basically another regexp search for all the source blocks in buffer.
> The scaling is still slightly non-linear - maybe your source block
> regexp is too complex:
>
> (defvar org-babel-src-block-regexp
>   (concat
>    ;; (1) indentation                 (2) lang
>    "^\\([ \t]*\\)#\\+begin_src[ \t]+\\([^ \f\t\n\r\v]+\\)[ \t]*"
>    ;; (3) switches
>    "\\([^\":\n]*\"[^\"\n*]*\"[^\":\n]*\\|[^\":\n]*\\)"
>    ;; (4) header arguments
>    "\\([^\n]*\\)\n"
>    ;; (5) body
>    "\\([^\000]*?\n\\)??[ \t]*#\\+end_src")
>   "Regexp used to identify code blocks.")
>
> On the other hand, the test data is already within quite non-realistic
> over 5000 number of blocks. Probably, further performance improvement is
> not very useful (we already have some overheads at smaller N).
>
> Best,
> Ihor
>
> -----
> The Org file used to generate plots:
>
> #+name: nocache
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.027 |        0 |
> |       20 |   0.049 |        0 |
> |       40 |   0.121 |        0 |
> |       80 |   0.391 |        0 |
> |      160 |   1.426 |        0 |
> |      320 |   6.765 |        0 |
> |      640 |  23.972 |        0 |
>
> #+name: cache
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.030 |        0 |
> |       20 |   0.067 |        0 |
> |       40 |   0.065 |        0 |
> |       80 |   0.081 |        0 |
> |      160 |   0.214 |        0 |
> |      320 |   0.597 |        0 |
> |      640 |   2.225 |        0 |
> |     1280 |   9.221 |        0 |
> |     2560 |  36.966 |        0 |
>
> #+name: cache-no-self
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.078 |        0 |
> |       20 |   0.075 |        0 |
> |       40 |   0.063 |        0 |
> |       80 |   0.085 |        0 |
> |      160 |   0.095 |        0 |
> |      320 |   0.178 |        0 |
> |      640 |   0.311 |        0 |
> |     1280 |   0.819 |        0 |
> |     2560 |   2.302 |        0 |
> |     5120 |   8.878 |        0 |
> |    10240 |  32.706 |        0 |
>
> #+name: cache-no-self+fix
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.118 |        0 |
> |       20 |   0.106 |        0 |
> |       40 |   0.136 |        0 |
> |       80 |   0.157 |        0 |
> |      160 |   0.139 |        0 |
> |      320 |   0.212 |        0 |
> |      640 |   0.542 |        0 |
> |     1280 |   0.797 |        0 |
> |     2560 |   1.533 |        0 |
> |     5120 |   4.651 |        0 |
> |    10240 |  16.745 |        0 |
>
> #+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark1.png
> set title 'Tangle code performance timing'
> US='u 1:2 w lp ps 2'
> set xlabel "N blocks"
> set ylabel "Time, sec"
> set key top right
> plot d1 @US t'Org 9.5.2 stable', d2 @US t'Org 9.6-dev', d3 @US t'Org 9.6-dev no self-verify', d4 @US t'Org 9.6-dev no self-verify + patch'
> #+end_src
>
>
> #+RESULTS[edd2a2d5d80b31876917faee349ed71ba0fe239a]:
> [[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark1.png]]
>
> #+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark2.png
> set title 'Tangle code performance scaling (normalized)'
> US='w lp ps 2'
> set xlabel "N blocks"
> set ylabel "Time, normalized by time at N=640"
> set key top left
> set xrange [0:6000]
> plot d2 u 1:($2/2.225) @US t'Org 9.6-dev', d3 u 1:($2/0.311) @US t'Org 9.6-dev no self-verify', d1 u 1:($2/23.972) @US t'Org 9.5.2 stable', d4 u 1:($2/0.542) @US t'Org 9.6-dev no self-verify + patch', x*x/870000 t'x^2', 
> #+end_src
>
> #+RESULTS[b170ef78ac377f6d2ad1c2eabb20cd62dc0ea33f]:
> [[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark2.png]]

This is nice work. Well done on all you and others have been doing along these
lines to improve performance. 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references
  2022-01-25 14:11 ` Ihor Radchenko
  2022-01-25 16:25   ` Tim Cross
@ 2022-01-26 11:43   ` Ihor Radchenko
  1 sibling, 0 replies; 6+ messages in thread
From: Ihor Radchenko @ 2022-01-26 11:43 UTC (permalink / raw)
  To: pareto optimal; +Cc: emacs-orgmode

[-- Attachment #1: Type: text/plain, Size: 1168 bytes --]

Ihor Radchenko <yantar92@gmail.com> writes:

> Most of the CPU time is spent in org-babel-tangle-collect-blocks, which
> is basically another regexp search for all the source blocks in buffer.
> The scaling is still slightly non-linear - maybe your source block
> regexp is too complex:

After further investigation I found that it was not the problem with
source block regexp. The code was doing an extra backward regexp search
to find current headline. When there are no headlines in the Org file,
that search created quadratic scaling.

After caching the backwards search, the performance is further improved:
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.204 |        0 |
|       20 |   0.047 |        0 |
|       40 |   0.171 |        0 |
|       80 |   0.063 |        0 |
|      160 |   0.096 |        0 |
|      320 |   0.155 |        0 |
|      640 |   0.651 |        0 |
|     1280 |   0.419 |        0 |
|     2560 |   0.799 |        0 |
|     5120 |   1.628 |        0 |
|    10240 |   3.306 |        0 |
|    20480 |   5.633 |        0 |
|    40960 |  11.415 |        0 |

41k blocks in 12sec.

Graphical comparison:

[-- Attachment #2: benchmark1.png --]
[-- Type: image/png, Size: 8212 bytes --]

[-- Attachment #3: Type: text/plain, Size: 53 bytes --]


The scaling becomes strictly linear after this fix:

[-- Attachment #4: benchmark2.png --]
[-- Type: image/png, Size: 9648 bytes --]

[-- Attachment #5: Type: text/plain, Size: 3678 bytes --]


Best,
Ihor

-----
Code used to generate the final set of the graphs:

#+name: nocache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.027 |        0 |
|       20 |   0.049 |        0 |
|       40 |   0.121 |        0 |
|       80 |   0.391 |        0 |
|      160 |   1.426 |        0 |
|      320 |   6.765 |        0 |
|      640 |  23.972 |        0 |

#+name: cache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.030 |        0 |
|       20 |   0.067 |        0 |
|       40 |   0.065 |        0 |
|       80 |   0.081 |        0 |
|      160 |   0.214 |        0 |
|      320 |   0.597 |        0 |
|      640 |   2.225 |        0 |
|     1280 |   9.221 |        0 |
|     2560 |  36.966 |        0 |

#+name: cache-no-self
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.078 |        0 |
|       20 |   0.075 |        0 |
|       40 |   0.063 |        0 |
|       80 |   0.085 |        0 |
|      160 |   0.095 |        0 |
|      320 |   0.178 |        0 |
|      640 |   0.311 |        0 |
|     1280 |   0.819 |        0 |
|     2560 |   2.302 |        0 |
|     5120 |   8.878 |        0 |
|    10240 |  32.706 |        0 |

#+name: cache-no-self+fix
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.118 |        0 |
|       20 |   0.106 |        0 |
|       40 |   0.136 |        0 |
|       80 |   0.157 |        0 |
|      160 |   0.139 |        0 |
|      320 |   0.212 |        0 |
|      640 |   0.542 |        0 |
|     1280 |   0.797 |        0 |
|     2560 |   1.533 |        0 |
|     5120 |   4.651 |        0 |
|    10240 |  16.745 |        0 |

#+name: cache-no-self+fix+fix2
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.204 |        0 |
|       20 |   0.047 |        0 |
|       40 |   0.171 |        0 |
|       80 |   0.063 |        0 |
|      160 |   0.096 |        0 |
|      320 |   0.155 |        0 |
|      640 |   0.651 |        0 |
|     1280 |   0.419 |        0 |
|     2560 |   0.799 |        0 |
|     5120 |   1.628 |        0 |
|    10240 |   3.306 |        0 |
|    20480 |   5.633 |        0 |
|    40960 |  11.415 |        0 |


#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :var d5=cache-no-self+fix+fix2 :file benchmark1.png
set title 'Tangle code performance timing'
US='u 1:2 w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, sec"
set key top right
plot d1 @US t'Org 9.5.2 stable', d2 @US t'Org 9.6-dev', d3 @US t'Org 9.6-dev no self-verify', d4 @US t'Org 9.6-dev no self-verify + patch', d5 @US t'Org 9.6-dev no self-verify + 2xpatch'
#+end_src

#+RESULTS[e95dafc7253d218d2a9ed19caa75911660e72f77]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark1.png]]

#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :var d5=cache-no-self+fix+fix2 :file benchmark2.png
set title 'Tangle code performance scaling (normalized)'
US='w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, normalized by time at N=640"
set key top right
set yrange [0:100]
plot d2 u 1:($2/2.225) @US t'Org 9.6-dev', d3 u 1:($2/0.311) @US t'Org 9.6-dev no self-verify', d1 u 1:($2/23.972) @US t'Org 9.5.2 stable', d4 u 1:($2/0.542) @US t'Org 9.6-dev no self-verify + patch', d5 u 1:($2/0.651) @US t'Org 9.6-dev no self-verify + 2xpatch', x*x/870000 t'x^2', [0:10000] x*x/3500000 t'x^{2}', x/2400 t'x^{1}'
#+end_src

#+RESULTS[c69cdd0dfb08f37c73c6a202f415336155a390ab]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark2.png]]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-01-26 11:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-01-25  5:19 Poor org-babel-tangle-file performance with more than 100 trivial noweb-references pareto optimal
2022-01-25 14:03 ` Sébastien Miquel
2022-01-25 14:25   ` Ihor Radchenko
2022-01-25 14:11 ` Ihor Radchenko
2022-01-25 16:25   ` Tim Cross
2022-01-26 11:43   ` Ihor Radchenko

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).