Build Systems a la carte

  1. Goal of the paper is to develop a generic build system which can be later instantiated.
  2. It should be minimal
  3. It should be correct
  4. Minimalism is achieved with a Scheduler. Correctness with Rebuilder.

The Scheduler: Respecting the Dependency Order

  1. Topological (but doesn't handle dynamic dependencies)
  2. Restarting (abort on encountering fetch and restart once it's fetched)
  3. Suspending (wait for fetch to resolve the value. Like promises, OCaml's Lwt etc)

Rebuilder: Determining Out-of-date Keys

Initial thoughts: timestamp of the files. But this paper is developing a generic framework

A Dirty bit

:CUSTOMID: 4

An additional bit set to true if the key changes. And set to false after every successful build. Any key in the transitive closure with bit to true will trigger the build for that key

Examples: Excel, Make

Verifying traces

:CUSTOMID: 5

Recording hashes/values of dependencies and determining if they changed

TODO Constructive traces

:CUSTOMID: 6

example: bazel

TODO Deep constructive traces

:CUSTOMID: 7

example: nix

TODO Concrete implementations of build systems

(very dense. Mostly code explanations)

Engineering challenges

Partial Stores and Exceptions

v -> Maybe v because we often run into file not found etc. Task is polymorphic enough to handle this change

Parallelism

Purely a scheduler's problem.

  1. Topological: simplest case. Everything is statically available.
  2. Restarting: Queue is dequeued with n readers
  3. Suspending: static are resolved parallely. Dynamics ones as in Marlow et al
  • TODO Marlow et al

Impure Computations

  1. C compiler version not accounted for. Different version may produce different errrors.
  2. Non-determinism from parallelism: different but semantically identical results because tasks executed in different order
  3. Volatility: PHONY rules in Make. Random() in Excel
  • Mitigation
    1. Bazel uses sandboxes to makes sure missing dependencies are caught
    2. Untracked tasks are marked volatile to ensure correctness
    3. Non-determinism could be handled by tweaking the definition of correctness from exactly equal to one of the possible values.
    • TODO What is MonadPlus. It seems it can be used to address non-determinism

Cloud implementation details overlooked

  1. Communication protocol details (What is TBs of data transfer is needed)
  2. Offloading: remotely run a task instead of local machine
  3. Eviction of old cache entries
  4. Shallow builds: discarding intermediate package builds and only retaining end products. Then correctness definition will need a tweak.

    We have to define a new store - shallow which only contains the subset of key/values we care about and update the equality condition for this subset

  5. Deep Constructive traces with non-determinism is problematic: eg: Frankenbuild

    profilers can emit different outputs for same executable. Simply hydrating the local store with a remote one will lead to incorrect results by all our definitions of correctness (main, non-determinism, shallow)

Self tracking

When users edit the build rules, a simple approach is to recompute everything

Self-tracked systems only to compute only the changed rules.

Ex: Excel

sprsh4 "B1" = Just $ Task $ \fetch -> do
  formula <- fetch "B1-formula"
  evalFormula fetch formula

Loops for computing tasks

Ex: excel and Latex

Polymorphic kv stores

It seems some build systems (Shake) provide ways to specify keys (and values) with more than one types.

Ex: gcc --version as a volatile key to pin gcc dependency version

Ex: GHC producing Foo.hi and Foo.o could be represented as a tuple. I dont fully understand this. (Feel free to enlighten!)

Related work

TODO Arrows

Other build systems

  1. Dune. Uses arrows
  2. Ninja topological scheduler with verifying traces
  3. Nix: coarse grained. Deals with packages. Not files.
  4. Pluto allow cycles in build graph
  5. Redo: polymorphic dependencies
  6. TUP: improved dirty-bit checking avoiding entire re-checking of graph
  7. Fabricate. Never heard of it. Nor do I understand it

Self adjusting computation (Incr)

  1. Often a graph of computations that automatically adjust to change in inputs.
  2. No concrete implementation explored in the paper yet. Future Work.

Memoization

  1. Nearly the same as Dynamic programming.
  2. Ends up being a self-adjusting computation system. Are the abstractions similar then? Question for further exploration.

Conclusion

  1. Investigated multiple build systems
  2. Properties of build systems are a function of two choices: order of tasks and scheduler
  3. One of the combinations that hasn't been implemented: Cloud based suspending monadic build. A discovery!

Thank you.

I'm @ManasJayanth on twitter.

You can email me at hello [at] manas-jayanth [dot] com

These slides can be found at slides.manas-jayanth.com