Build Systems a la carte

Goal of the paper is to develop a generic build system which can be later instantiated.
It should be minimal
It should be correct
Minimalism is achieved with a Scheduler. Correctness with Rebuilder.

The Scheduler: Respecting the Dependency Order

Topological (but doesn't handle dynamic dependencies)
Restarting (abort on encountering fetch and restart once it's fetched)
Suspending (wait for fetch to resolve the value. Like promises, OCaml's Lwt etc)

Rebuilder: Determining Out-of-date Keys

Initial thoughts: timestamp of the files. But this paper is developing a generic framework

A Dirty bit

:CUSTOM_ID: 4

An additional bit set to true if the key changes. And set to false after every successful build. Any key in the transitive closure with bit to true will trigger the build for that key

Examples: Excel, Make

Verifying traces

:CUSTOM_ID: 5

Recording hashes/values of dependencies and determining if they changed

TODO Constructive traces

:CUSTOM_ID: 6

example: bazel

TODO Deep constructive traces

:CUSTOM_ID: 7

example: nix

TODO Concrete implementations of build systems

(very dense. Mostly code explanations)

Engineering challenges

Partial Stores and Exceptions

v -> Maybe v because we often run into file not found etc. Task is polymorphic enough to handle this change

Parallelism

Purely a scheduler's problem.

Topological: simplest case. Everything is statically available.
Restarting: Queue is dequeued with n readers
Suspending: static are resolved parallely. Dynamics ones as in Marlow et al

TODO Marlow et al

Impure Computations

C compiler version not accounted for. Different version may produce different errrors.
Non-determinism from parallelism: different but semantically identical results because tasks executed in different order
Volatility: PHONY rules in Make. Random() in Excel

Mitigation
1. Bazel uses sandboxes to makes sure missing dependencies are caught
2. Untracked tasks are marked volatile to ensure correctness
3. Non-determinism could be handled by tweaking the definition of correctness from exactly equal to one of the possible values.
- TODO What is MonadPlus. It seems it can be used to address non-determinism

Cloud implementation details overlooked

Communication protocol details (What is TBs of data transfer is needed)
Offloading: remotely run a task instead of local machine
Eviction of old cache entries
Shallow builds: discarding intermediate package builds and only retaining end products. Then correctness definition will need a tweak.

We have to define a new store - shallow which only contains the subset of key/values we care about and update the equality condition for this subset
Deep Constructive traces with non-determinism is problematic: eg: Frankenbuild

profilers can emit different outputs for same executable. Simply hydrating the local store with a remote one will lead to incorrect results by all our definitions of correctness (main, non-determinism, shallow)

Self tracking

When users edit the build rules, a simple approach is to recompute everything

Self-tracked systems only to compute only the changed rules.

Ex: Excel

sprsh4 "B1" = Just $ Task $ \fetch -> do
  formula <- fetch "B1-formula"
  evalFormula fetch formula

Loops for computing tasks

Ex: excel and Latex

TODO recurrence relations as solutions for Excel

We dont deal with cyclic dependencies, but some frameworks do.

Ex: Lazy Least Fixed Points in ML https://gallium.inria.fr/~fpottier/publis/fpottier-fix.pdf

Polymorphic kv stores

It seems some build systems (Shake) provide ways to specify keys (and values) with more than one types.

Ex: gcc --version as a volatile key to pin gcc dependency version

Ex: GHC producing Foo.hi and Foo.o could be represented as a tuple. I dont fully understand this. (Feel free to enlighten!)

Related work

TODO Arrows

Other build systems

Dune. Uses arrows
Ninja topological scheduler with verifying traces
Nix: coarse grained. Deals with packages. Not files.
Pluto allow cycles in build graph
Redo: polymorphic dependencies
TUP: improved dirty-bit checking avoiding entire re-checking of graph
Fabricate. Never heard of it. Nor do I understand it

Self adjusting computation (Incr)

Often a graph of computations that automatically adjust to change in inputs.
No concrete implementation explored in the paper yet. Future Work.

Memoization

Nearly the same as Dynamic programming.
Ends up being a self-adjusting computation system. Are the abstractions similar then? Question for further exploration.

Conclusion

Investigated multiple build systems
Properties of build systems are a function of two choices: order of tasks and scheduler
One of the combinations that hasn't been implemented: Cloud based suspending monadic build. A discovery!

Thank you.

I'm @ManasJayanth on twitter.

You can email me at hello [at] manas-jayanth [dot] com

These slides can be found at slides.manas-jayanth.com