So, there is still some power-law nonlinearity in the tangle code.

Benchmarking revealed the following:
        8259  95%               - org-babel-tangle-file
        8259  95%                - org-babel-tangle
        8170  94%                 - org-babel-tangle-collect-blocks
        3620  41%                  - org-in-archived-heading-p
        3600  41%                     org-before-first-heading-p

Showing that nonlinear scaling comes from regexp search.

Caching org-before-first-heading-p improves the performance further (see
Org 9.6-dev no self-verify + patch curves):

| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.118 |        0 |
|       20 |   0.106 |        0 |
|       40 |   0.136 |        0 |
|       80 |   0.157 |        0 |
|      160 |   0.139 |        0 |
|      320 |   0.212 |        0 |
|      640 |   0.542 |        0 |
|     1280 |   0.797 |        0 |
|     2560 |   1.533 |        0 |
|     5120 |   4.651 |        0 |
|    10240 |  16.745 |        0 |

The profiling gives:

       16817  63%                    - org-babel-tangle-file
       16280  61%                     - org-babel-tangle
       16200  61%                      - org-babel-tangle-collect-blocks
        1360   5%                       + org-babel-tangle-single-block
        1210   4%                       + org-babel-get-src-block-info

Most of the CPU time is spent in org-babel-tangle-collect-blocks, which
is basically another regexp search for all the source blocks in buffer.
The scaling is still slightly non-linear - maybe your source block
regexp is too complex:

(defvar org-babel-src-block-regexp
  (concat
   ;; (1) indentation                 (2) lang
   "^\\([ \t]*\\)#\\+begin_src[ \t]+\\([^ \f\t\n\r\v]+\\)[ \t]*"
   ;; (3) switches
   "\\([^\":\n]*\"[^\"\n*]*\"[^\":\n]*\\|[^\":\n]*\\)"
   ;; (4) header arguments
   "\\([^\n]*\\)\n"
   ;; (5) body
   "\\([^\000]*?\n\\)??[ \t]*#\\+end_src")
  "Regexp used to identify code blocks.")

On the other hand, the test data is already within quite non-realistic
over 5000 number of blocks. Probably, further performance improvement is
not very useful (we already have some overheads at smaller N).

Best,
Ihor

-----
The Org file used to generate plots:

#+name: nocache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.027 |        0 |
|       20 |   0.049 |        0 |
|       40 |   0.121 |        0 |
|       80 |   0.391 |        0 |
|      160 |   1.426 |        0 |
|      320 |   6.765 |        0 |
|      640 |  23.972 |        0 |

#+name: cache
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.030 |        0 |
|       20 |   0.067 |        0 |
|       40 |   0.065 |        0 |
|       80 |   0.081 |        0 |
|      160 |   0.214 |        0 |
|      320 |   0.597 |        0 |
|      640 |   2.225 |        0 |
|     1280 |   9.221 |        0 |
|     2560 |  36.966 |        0 |

#+name: cache-no-self
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.078 |        0 |
|       20 |   0.075 |        0 |
|       40 |   0.063 |        0 |
|       80 |   0.085 |        0 |
|      160 |   0.095 |        0 |
|      320 |   0.178 |        0 |
|      640 |   0.311 |        0 |
|     1280 |   0.819 |        0 |
|     2560 |   2.302 |        0 |
|     5120 |   8.878 |        0 |
|    10240 |  32.706 |        0 |

#+name: cache-no-self+fix
| N blocks | runtime | # of GCs |
|----------+---------+----------|
|       10 |   0.118 |        0 |
|       20 |   0.106 |        0 |
|       40 |   0.136 |        0 |
|       80 |   0.157 |        0 |
|      160 |   0.139 |        0 |
|      320 |   0.212 |        0 |
|      640 |   0.542 |        0 |
|     1280 |   0.797 |        0 |
|     2560 |   1.533 |        0 |
|     5120 |   4.651 |        0 |
|    10240 |  16.745 |        0 |


#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark1.png
set title 'Tangle code performance timing'
US='u 1:2 w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, sec"
set key top right
plot d1 @US t'Org 9.5.2 stable', d2 @US t'Org 9.6-dev', d3 @US t'Org 9.6-dev no self-verify', d4 @US t'Org 9.6-dev no self-verify + patch'
#+end_src

#+RESULTS[edd2a2d5d80b31876917faee349ed71ba0fe239a]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark1.png]]

#+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var d4=cache-no-self+fix :file benchmark2.png
set title 'Tangle code performance scaling (normalized)'
US='w lp ps 2'
set xlabel "N blocks"
set ylabel "Time, normalized by time at N=640"
set key top left
set xrange [0:6000]
plot d2 u 1:($2/2.225) @US t'Org 9.6-dev', d3 u 1:($2/0.311) @US t'Org 9.6-dev no self-verify', d1 u 1:($2/23.972) @US t'Org 9.5.2 stable', d4 u 1:($2/0.542) @US t'Org 9.6-dev no self-verify + patch', x*x/870000 t'x^2', 
#+end_src

#+RESULTS[b170ef78ac377f6d2ad1c2eabb20cd62dc0ea33f]:
[[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark2.png]]