Profiling build graphs honestly

Neul Labs · June 2, 2026 ·

profilingbuild graphsmeasurement

Every build engineer has done some version of this dance: a teammate says “the build is slow,” you run time ninja, get a number, change something, run time ninja again, get a smaller number, declare victory. A week later it is slow again. The change was real but the measurement was wrong, and the slowdown that came back was wearing a different hat.

This post is the inventory of how that goes wrong, and what we do at rninja to keep our own measurements honest. None of this is rninja-specific — it works just as well on stock ninja, bazel, buck, or make — but rninja’s tooling makes the harder parts cheaper, so the examples lean on it.

Wall clock is the wrong metric

Wall clock is the metric users feel, so it is the one teams optimize. It is also the noisiest signal in your toolbox. A time ninja run is influenced by:

Whether your laptop was thermally throttled. Sustained CPU runs degrade clocks on most machines under load.
Whether the file cache was warm. The first build after a reboot hits cold disk.
Whether another process was using the disk, the network, or the L3 cache during your measurement.
Whether your build is IO-bound, CPU-bound, or memory-bound — and whether the machine state at measurement time matches the user complaint.

The result is that two runs of the same target can differ by 20% on identical hardware without any actual change to the build. Trying to detect a 5% regression in this signal is fishing for noise.

The fix is to measure things wall clock cannot lie about. Total CPU seconds across actions is more stable than wall clock, because it sums work rather than time, and is less sensitive to scheduling. Critical-path length — the longest chain of dependent edges — is more stable still, because it ignores how many workers were running in parallel. We use both, and we trust them more than wall clock for regression detection.

Action granularity is what matters

The other common mistake is profiling at the wrong granularity. Wall-clock total tells you the build is slow. Per-target wall-clock tells you which final outputs are slow. Neither tells you which actions are slow. A 30-minute build that produces ten final targets is not necessarily ten 3-minute targets — it is more likely 8000 actions of varying duration distributed across them.

The unit of optimization is the action: a single compile, a single link, a single codegen pass. Until you have action-level numbers, every conversation about “why is X slow” is going to be wrong.

rninja’s executor records a row per action when you pass -t timeline: start time, end time, edge id, command, cache outcome, peak memory if measured. The result is a JSON file that drops cleanly into any flame-graph or Gantt-chart tool. Stock ninja’s .ninja_log covers the basics here too — it gives you start and end times per edge — but it does not record cache state because stock ninja has no cache. The point is the same: you need a row per action, not a total at the bottom.

Critical path versus total work

The two numbers that actually decide your build time are critical path length and available parallelism. Critical path is the longest chain of actions where each depends on the previous; you cannot finish the build sooner than the critical path. Available parallelism is how much CPU/IO you can saturate during that critical path.

Most “make the build faster” conversations are about total work — “reduce the number of files compiled” — when they should be about critical path. Cutting 100 actions out of the build does not move the wall clock if those actions were never on the critical path. Cutting one action that was on the critical path can save more time than cutting fifty that weren’t.

The honest way to profile is:

Compute the critical path of the last build. Most build graph tools can do this from the timeline data.
Look at the actions on the critical path. These are the only places where optimization will move wall clock.
Look at total CPU seconds across all actions. This tells you whether you have available parallelism to spend if you can break up the critical path.

If your critical path is one long link step at the end, no amount of parallelism upstream will help. If your critical path is a chain of ten medium actions, breaking one of them up reduces wall clock by however much that action took. The geometry matters more than the gross count.

The cache changes what “slow” means

When you introduce a cache, the meaning of “the build is slow” gets a second axis: was it slow because of the work it did, or because of the work it would have skipped if the cache had hit? A build that takes ten minutes with a 90% cache hit rate is different from a build that takes ten minutes with a 0% hit rate. The first is mostly dispatch and downloading artifacts; the second is mostly compilers. Both are “ten minutes,” but the things you should change about them are different.

rninja’s stats output breaks this down: total actions, cache hits, cache misses, and the cumulative time spent on each. If your hit rate is high and the build is still slow, the bottleneck is the miss path (slow compilers, large links, or pathologies in the long tail of the graph). If your hit rate is low and the build is slow, the bottleneck is whatever is invalidating the cache (header churn, non-deterministic generators, env var leakage into the action). They look the same from the outside; they require different fixes.

The trap is reporting “cache hit rate” without reporting the cumulative time of the hits and misses. A 95% hit rate that masks 5% of misses that account for 90% of build time looks great in the dashboard and tells you nothing about why the build is slow. Always look at the time, not the count.

Pathologies that hide in averages

A profile that shows totals can hide pathologies that show up in distributions. Three patterns we look for every time:

Long-tail actions. A small number of actions accounting for a disproportionate share of total time. Usually a giant compile-time-templated header or a link step that wants four gigabytes of RAM. These don’t average out across the build; they sit on the critical path.
Cache thrashing. An action that misses on every build despite identical-looking inputs. Usually means the cache key is including something it shouldn’t (a timestamp, a random ID, a counter). The fix is to remove the leak, not to disable the cache.
Cliff effects on -j. A build that takes 30s at -j8, 25s at -j16, and 28s at -j32. The increase past the peak is a contention signal — usually disk, sometimes memory pressure. Profiling with peak resource usage per action surfaces this.

All three are invisible in time ninja. All three are obvious in a Gantt-chart of the timeline. The cost of switching to the better tool is a JSON parse and a couple of plotting calls. Teams that have done it once never go back.

Honest reporting

The last piece is presentation. If you measure five times and average, report the average and the standard deviation. If you compare two configurations, report both numbers and the delta with a confidence interval (or at least a min/max across runs). If you make a change that improves the average by 1.5% with a standard deviation of 3%, you have not made the build faster — you have found noise.

It is uncomfortable to publish numbers that say “I optimized this and it didn’t move.” It is more uncomfortable to publish numbers that say “this got 8% faster” and have someone else discover next month that the 8% was noise and the real change was zero. The build engineer who shows their work, including the uncertainty, ends up trusted more than the one who claims wins they cannot reproduce.

We try to hold ourselves to this in the rninja documentation. The performance table on the README quotes ranges: 2× to 5× for warm incremental, 1.3× to 2× for cold builds. Those are honest ranges with real variance behind them. They are not the headline numbers a marketing page would prefer. They are the numbers the build is actually going to give you, and they are the ones we want you to be able to verify.