Where ninja leaves perf on the table

Neul Labs
scheduling, ninja, performance

ninja’s reputation for speed is deserved. It parses a flat graph format, dispatches edges with almost no per-action overhead, and gets out of the way of the compiler. That minimalism is the reason ninja outran make and the reason it became the default backend for GN and Meson and a first-class generator for CMake. It is also the reason there is still a lot of performance left on the table: most of what makes a build slow today is not where ninja decided to optimize a decade ago.

This post is a tour of those places. None of it is a criticism of ninja’s design; it is an observation that the world has changed underneath that design. We rebuilt the executor as rninja because the gaps are now bigger than the dispatch loop, and that is where the work has moved.

The dispatch loop is already optimal

Start with what ninja does extraordinarily well. The build graph is parsed once into a compact in-memory representation. When you ask it to build a target, it walks the graph, finds the leaves that need rebuilding, and pushes them into a ready queue. As edges complete, it walks the reverse-dependency edges and pushes the newly-ready ones in. The per-edge cost of this is small enough to disappear behind anything the toolchain does.
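The walk described above is essentially a topological traversal driven by a ready queue. Here is a minimal sketch of that idea, not ninja's actual code: the `deps` mapping, the `dirty` set, and the function name are hypothetical, and "running" an edge is reduced to appending it to a list.

```python
from collections import deque

def dispatch_order(deps, dirty):
    """Return the order in which dirty edges become ready.

    deps maps each edge to the edges it depends on (toy graph, assumed).
    dirty is the set of edges that need rebuilding.
    """
    # Count unfinished dirty dependencies and record reverse dependencies.
    pending = {edge: 0 for edge in dirty}
    dependents = {edge: [] for edge in dirty}
    for edge in dirty:
        for dep in deps.get(edge, ()):
            if dep in dirty:
                pending[edge] += 1
                dependents[dep].append(edge)

    # Leaves: dirty edges with no dirty dependencies left to wait for.
    ready = deque(sorted(e for e in dirty if pending[e] == 0))
    order = []
    while ready:
        edge = ready.popleft()
        order.append(edge)  # a real executor would spawn the command here
        for parent in dependents[edge]:
            pending[parent] -= 1
            if pending[parent] == 0:
                ready.append(parent)  # newly ready, push it into the queue
    return order

deps = {"a.o": [], "b.o": [], "app": ["a.o", "b.o"]}
order = dispatch_order(deps, {"a.o", "b.o", "app"})
print(order)
```

Each edge is touched once when it is queued and once per dependent when it completes, which is why the per-edge cost stays negligible next to a compiler invocation.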

If you look at a flamegraph of ninja running a non-trivial build, you will see ninja itself accounting for low single-digit percent of total time. The rest is the compiler, the linker, and the kernel waiting for IO. The dispatch loop is not the problem to solve. Anything that adds milliseconds per edge here is a net loss.

This is the boundary condition for any drop-in replacement. You cannot beat ninja by adding cleverness to the part it already does well. You can only beat it by addressing the parts it doesn’t touch.

The single-threaded planner

Here is the first crack. ninja’s status display, the bookkeeping that decides what to schedule next, and the work of marking edges done after a process exits all run on the main thread. With small builds this is invisible. With twelve thousand edges and a sixty-four-core machine, you can watch the planner become a serializing bottleneck: workers finish and ask for more work faster than the main thread can hand it out, so cores sit idle.

The fix is to move the bookkeeping off the hot path. rninja’s executor is built on tokio; the planner runs as an async task, the workers run on their own runtime, and the channel between them is lock-free. On a twelve-thousand-edge graph with cheap actions (think header generation), that change alone is worth a noticeable fraction of total time, because the workers stop blocking on the main thread.
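The shape of that split can be sketched as a planner task and a set of worker tasks that talk only through queues. This is an illustration of the structure, not rninja's implementation: rninja uses tokio in Rust, whereas this sketch uses Python's `asyncio`, which runs on a single thread, so it shows the decoupling but not the parallelism. The names `planner`, `worker`, and the `gen_*.h` edges are made up for the example.

```python
import asyncio

async def worker(jobs, done):
    # Workers only pull a job, "run" it, and report completion.
    while True:
        edge = await jobs.get()
        if edge is None:  # sentinel: no more work
            return
        await asyncio.sleep(0)  # stand-in for spawning and awaiting a process
        await done.put(edge)

async def planner(edges, n_workers):
    jobs, done = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(worker(jobs, done)) for _ in range(n_workers)]
    for edge in edges:
        jobs.put_nowait(edge)  # hand out work without waiting on any worker
    finished = []
    while len(finished) < len(edges):
        finished.append(await done.get())  # bookkeeping happens here
    for _ in workers:
        jobs.put_nowait(None)
    await asyncio.gather(*workers)
    return finished

edges = [f"gen_{i}.h" for i in range(8)]
finished = asyncio.run(planner(edges, n_workers=4))
print(len(finished), "edges finished")
```

The point of the design is that a worker never waits for the planner to finish its bookkeeping before picking up the next job; it only waits on the queue.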

You also see this gap with -j values above the machine’s core count. ninja’s default -j sits only slightly above the number of cores, and raising it much further rarely pays off when the planner cannot dispatch fast enough to keep the extra slots busy. rninja’s scheduler can.

Scheduling without a model of contention

ninja’s scheduling decision is essentially: “is there a ready edge? Is there a worker slot? Dispatch.” That is fine when the bottleneck is CPU and you have plenty of disk and memory. It stops being fine when you have eight linker invocations all hitting the same SSD, or four codegen passes each demanding two gigabytes of RAM, or a dozen tests fighting for the same network port.

A modern executor needs a model of the resources actions actually consume. ninja has a partial answer for this with pool declarations — you can put expensive actions into a named pool with a fixed concurrency limit. But pools are static and have to be declared by the generator. They cannot adapt to machine size, and they cannot reason across pool boundaries.

rninja’s scheduler tracks the resources each action actually claims (CPU, IO, memory) and dispatches against the machine’s actual capacity. On a CI runner with 16 cores and 32 GB of RAM, a graph that wants to run six 8-GB link steps in parallel is asking for 48 GB. rninja holds some of those links back, so the runner does not run out of memory, get OOM-killed, and force a restart of the build. Stock ninja, given enough job slots, will dispatch all six and let the kernel sort it out.
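The admission decision in that example can be sketched as a greedy fit against remaining capacity. This is a toy model under stated assumptions: the `Action` type, the per-action CPU and memory claims, and the `admit` function are hypothetical, and the source does not say how rninja obtains or enforces these claims.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    cpus: int
    mem_gb: int

def admit(ready, free_cpus, free_mem_gb):
    """Greedily start ready actions that fit in the remaining capacity."""
    running, waiting = [], []
    for action in ready:
        if action.cpus <= free_cpus and action.mem_gb <= free_mem_gb:
            free_cpus -= action.cpus
            free_mem_gb -= action.mem_gb
            running.append(action)
        else:
            waiting.append(action)  # retried when a running action finishes
    return running, waiting

# The scenario from the text: six 8-GB links on a 16-core, 32-GB runner.
links = [Action(f"link_{i}", cpus=1, mem_gb=8) for i in range(6)]
running, waiting = admit(links, free_cpus=16, free_mem_gb=32)
print(len(running), "running,", len(waiting), "waiting")
```

In this model CPU is nowhere near the limit; memory is what caps concurrency, which is exactly the case a slot-counting scheduler cannot see. A real scheduler would also leave headroom for the OS rather than filling memory to the last gigabyte.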

No memory of past work

This is the big one. ninja has no memory. Every time you run it, it walks the graph, finds the dirty edges, and runs them. “Dirty” means, roughly, “an output is missing or older than an input, the depfile says so, or the command line differs from the one recorded in .ninja_log.” Those are fast checks, but none of them looks at content. ninja never asks: have I run this exact action with these exact inputs before?
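To make the limitation concrete, here is a minimal sketch of an mtime-only dirty check. It is a simplification for illustration, not ninja's code: it ignores depfile-discovered inputs and everything else ninja tracks, and the function name and file names are hypothetical.

```python
import os
import tempfile

def is_dirty(output, inputs):
    """Dirty if the output is missing or older than any input."""
    try:
        out_mtime = os.stat(output).st_mtime_ns
    except FileNotFoundError:
        return True
    return any(os.stat(path).st_mtime_ns > out_mtime for path in inputs)

with tempfile.TemporaryDirectory() as d:
    src, obj = os.path.join(d, "a.c"), os.path.join(d, "a.o")
    for path in (src, obj):
        open(path, "w").close()
    # Output newer than input: nothing to do.
    os.utime(src, ns=(1_000_000_000, 1_000_000_000))
    os.utime(obj, ns=(2_000_000_000, 2_000_000_000))
    clean = is_dirty(obj, [src])
    # Touch the input without changing a byte of its content.
    os.utime(src, ns=(3_000_000_000, 3_000_000_000))
    touched = is_dirty(obj, [src])

print(clean, touched)
```

The second check reports the edge as dirty even though the source is byte-for-byte identical, which is what happens to most files when you switch branches and back.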

So consider the common case in a monorepo: you check out main, build, write some code, build, throw the branch away, check out main, build. The third build is almost identical to the first. ninja will redo every action that mtime says is dirty. The compiler is invoked, reads the same inputs, computes the same output, writes it to disk, and walks away. ninja moves on. The compiler doesn’t tell ninja anything about its work, and ninja doesn’t ask.

A content-addressed action cache solves this by remembering. Hash the inputs (sources, headers, command line, environment) up front; check the cache; if there’s a hit, skip the action entirely. On the third build above, almost every action is a cache hit. The build time drops to the cost of walking the graph and confirming the hits — which is exactly the speed at which ninja can dispatch when there is no work to do.
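The lookup described above can be sketched in a few lines. This is an illustrative model, not rninja's cache: the `action_key` and `ActionCache` names, the in-memory dictionary, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

def action_key(command, input_blobs, env):
    """Digest everything that can change the output of an action."""
    h = hashlib.sha256()
    h.update(command.encode() + b"\0")
    for name in sorted(input_blobs):  # sorted so ordering cannot change the key
        h.update(name.encode() + b"\0")
        h.update(hashlib.sha256(input_blobs[name]).digest())
    for var in sorted(env):
        h.update(f"{var}={env[var]}".encode() + b"\0")
    return h.hexdigest()

class ActionCache:
    def __init__(self):
        self.store = {}  # key -> output bytes (a real cache would use disk)
        self.hits = 0
        self.misses = 0

    def run(self, command, input_blobs, env, execute):
        key = action_key(command, input_blobs, env)
        if key in self.store:
            self.hits += 1
            return self.store[key]  # skip the action entirely
        self.misses += 1
        self.store[key] = execute()
        return self.store[key]

cache = ActionCache()
cmd, env = "cc -c a.c -o a.o", {"CC": "cc"}
compile_fn = lambda: b"object code"
cache.run(cmd, {"a.c": b"int main(){}"}, env, compile_fn)          # build on main
cache.run(cmd, {"a.c": b"int main(){return 1;}"}, env, compile_fn)  # branch edit
cache.run(cmd, {"a.c": b"int main(){}"}, env, compile_fn)          # back on main
print(cache.hits, cache.misses)
```

The third call mirrors the third build in the monorepo scenario: the inputs hash to the key from the first build, so the compile is skipped no matter what the mtimes say.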

The footnote here is that you can do this in user space by wrapping the compiler with sccache. That gives you compile-level caching but doesn’t help with linking, codegen, asset processing, or any other action type in your graph. The executor is the right place for an action-level cache because the executor already knows the action exists.

No memory across machines

The same argument generalizes. If your CI runs the same commit on three runners — release, debug, sanitizer — they all do the same compile work for the parts of the graph that don’t depend on the variant. The compiler runs three times to produce three identical object files. ninja, having no memory, has nothing to share.

A remote cache solves this. Whoever produces the artifact first uploads it; everyone else downloads it. The first compile on a new machine becomes a download instead of a compile. The numbers here are dramatic: a build that takes thirty minutes from cold can take three when the remote cache is warm and the network is good.
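The sharing effect is easy to see in a toy model. In this sketch a plain dictionary stands in for the remote store, and the step names, the `run_runner` function, and the two shared codegen steps are hypothetical; in practice the keys would be content digests of each action, and upload and download would go over the network.

```python
def run_runner(variant, shared_steps, remote_cache):
    """Return how many steps this runner had to execute locally."""
    steps = list(shared_steps) + [f"{variant}:app"]  # one variant-specific step
    executed = 0
    for key in steps:
        if key in remote_cache:
            continue  # someone already uploaded it: download instead of build
        remote_cache[key] = f"artifact for {key}"  # build locally, then upload
        executed += 1
    return executed

remote_cache = {}
shared = ["codegen:schema", "codegen:icons"]  # steps independent of the variant
executed = [
    run_runner(variant, shared, remote_cache)
    for variant in ("release", "debug", "sanitizer")
]
print(executed)
```

Without the shared store the three runners would execute nine steps between them; with it they execute five, and the saving grows with the share of the graph that does not depend on the variant.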

ninja has no opinion about remote caching because remote caching is outside its scope. That is consistent — ninja is an executor for one machine. But “one machine” is no longer the unit of work in most modern CI setups.

Subtools that nearly answer the right question

ninja ships subtools that look like they answer build-engineering questions: -t graph produces a graphviz dot file, -t deps shows stored dependencies, -t query shows inputs and outputs for a path. They are useful. They are also static views of the graph; they don’t tell you what actually happened during the last build.

To answer “what was the critical path of my last build?” with stock ninja, you parse .ninja_log, correlate timestamps, and write your own analysis. To answer “why did my build take twice as long today?” you compare two such analyses. Most teams that care never get around to building this tooling because the gap between “exists” and “useful enough to lean on” is too wide for a side project.
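As a starting point for that kind of analysis, here is a minimal sketch that ranks edges by duration from a `.ninja_log`. It assumes the tab-separated layout of recent log versions (start ms, end ms, mtime, output path, command hash); the sample data is made up. Note that this only finds the slowest edges: a true critical path also needs the dependency graph, which the log does not contain.

```python
def slowest_edges(log_text, top=3):
    """Return (duration_ms, output) pairs, longest first."""
    latest = {}
    for line in log_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the version header and blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a record in the assumed layout
        start_ms, end_ms, output = int(fields[0]), int(fields[1]), fields[3]
        # The log is appended to, so a later entry for an output wins.
        latest[output] = end_ms - start_ms
    ranked = sorted(((ms, out) for out, ms in latest.items()), reverse=True)
    return ranked[:top]

sample = "\n".join([
    "# ninja log v5",
    "0\t1200\t1700000000\tobj/a.o\t1a2b",
    "0\t800\t1700000000\tobj/b.o\t3c4d",
    "1200\t4200\t1700000003\tapp\t5e6f",
])
top_two = slowest_edges(sample, top=2)
print(top_two)
```

Comparing the output of two such runs is the crude version of “why did my build take twice as long today?”; the dependency information needed for a real critical path would have to come from another source, such as the graph that -t graph emits.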

This isn’t a perf gap directly; it is a perf-debugging gap. You cannot fix what you can’t see, and ninja’s introspection ends where the build does.

Why this all sits in the executor

The pattern across these gaps is the same: each of them is something the executor already knows about. The executor knows which edges are ready, what they consume, what their inputs are, and when they start and end. Putting the cache, the scheduler, the resource model, and the timeline collection at that layer means none of them have to re-derive what the executor already has.

That is rninja’s bet. Stay drop-in for the format and the CLI — because those are the parts ninja got right — and close the gaps in the executor where the modern wins live. The dispatch loop stays fast. The bookkeeping moves off the main thread. The cache joins the graph walk. The result is a build tool that fits in ninja’s slot and pays for itself on warm incremental builds the same day you install it.