# A Grammar of Benchmarking
## The Grammar of Graphics
In 1999, Leland Wilkinson published *The Grammar of Graphics*, arguing that statistical
visualizations are not a fixed taxonomy of chart types (bar, line, scatter, pie) but
compositions of a small set of orthogonal components:
- **Data** — the variables to visualize
- **Aesthetics** — mappings from data to visual properties (position, color, size)
- **Geometry** — the visual marks (points, lines, bars, areas)
- **Statistics** — transformations applied before rendering (binning, smoothing, aggregation)
- **Scales** — functions that map data values to aesthetic values (linear, log, discrete)
- **Coordinates** — the coordinate system (Cartesian, polar, geographic)
- **Facets** — splitting data into subplots by a categorical variable
The key insight is *decomposition*: instead of memorizing when to use each chart type, you
compose a visualization from independent building blocks. Hadley Wickham's **ggplot2** and
later **Vega-Lite** made this practical — you declare what you want, and the library assembles
the pieces.
The pattern generalizes: take a complex domain, decompose it into a small set of composable
primitives, and let combinations emerge from composition rather than enumeration.
## From Graphics to Data
The grammar of graphics addresses the *visualization* step — it assumes the data already
exists. But where does the data come from?
In benchmarking and parameter studies, data is generated by evaluating a function across
combinations of input parameters. The conventional approach is to write nested for-loops,
manually manage results arrays, and hand-pick plot types. This is the benchmarking equivalent
of manually drawing charts: tedious, error-prone, and tightly coupled to the specific
dimensionality of your experiment.
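For contrast, the conventional approach might look like the following sketch. The function `measure` and the two parameter lists are hypothetical; the point is that the results container and the loop nesting are both shaped by hand around the number of parameters.

```python
import math


def measure(x: float, y: float) -> float:
    """Hypothetical function under test."""
    return math.sin(x) * math.cos(y)


xs = [0.0, 0.5, 1.0]
ys = [0.0, 1.0]

# The results container is shaped by hand to match the loop nesting.
results = [[0.0 for _ in ys] for _ in xs]
for i, x in enumerate(xs):
    for j, y in enumerate(ys):
        results[i][j] = measure(x, y)

# Adding a third parameter means a third loop, a third index,
# and a reshaped results container.
```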
Bencher extends the grammar of graphics principle *upstream* to data generation. Just as
ggplot2 replaces "pick a chart type" with "compose visualization components", Bencher replaces
"write nested for-loops" with "declare parameter spaces and compose". The core abstraction is
`ParametrizedSweep`: a class where typed parameter declarations define the input space, and a
`benchmark` method defines the function to evaluate. Bencher handles the rest: computing the
Cartesian product, caching results, selecting appropriate visualizations, and composing panels.
## Architecture Overview
Just as the grammar of graphics decomposes a chart into Data, Aesthetics, Geometry, and so on,
Bencher decomposes a benchmark into three stages — each mapping directly onto the grammar
primitives introduced below:
```{mermaid}
flowchart LR
subgraph Problem [" "]
direction TB
PT["① Problem Definition"]
subgraph PS ["ParametrizedSweep"]
direction TB
Inputs[FloatSweep · IntSweep · EnumSweep]
Results[ResultFloat · ResultBool · ResultImage]
Fn["def benchmark(self)"]
Inputs ~~~ Results ~~~ Fn
end
PT ~~~ PS
end
subgraph Sweep [" "]
direction TB
ST["② Sweep Definition"]
subgraph SW ["plot_sweep()"]
direction TB
IV[input_vars]
RV[result_vars]
CV[const_vars]
IV ~~~ RV ~~~ CV
end
ST ~~~ SW
end
subgraph Run [" "]
direction TB
RT["③ Run Definition"]
subgraph RN ["bn.run()"]
direction TB
Level[subsampling_divisions]
Repeats[repeats]
Opts[save · optimise · over_time]
Level ~~~ Repeats ~~~ Opts
end
RT ~~~ RN
end
Problem == .to_bench() ==> Sweep == bn.run() ==> Run
classDef title fill:none,stroke:none,color:#2a2a2a,font-size:16px
classDef blueLight fill:#e8f4fc,stroke:#c8dff0,color:#3a6a8a
classDef purpleLight fill:#f4ecf8,stroke:#e0d0ea,color:#5a4068
classDef greenLight fill:#ecf6ee,stroke:#cce6d2,color:#3a5e40
class PT,ST,RT title
class Inputs,Results,Fn blueLight
class IV,RV,CV purpleLight
class Level,Repeats,Opts greenLight
style Problem fill:#fafcfe,stroke:#c8dff0,stroke-width:2px
style Sweep fill:#fdfafe,stroke:#e0d0ea,stroke-width:2px
style Run fill:#fafefa,stroke:#cce6d2,stroke-width:2px
style PS fill:#e8f4fc,stroke:#8ec0e4,stroke-width:2px,color:#2c5f7a
style SW fill:#f4ecf8,stroke:#c4a4dc,stroke-width:2px,color:#5a3d6e
style RN fill:#ecf6ee,stroke:#98d0a4,stroke-width:2px,color:#2e5e3a
```
Every auto-generated example follows this pattern:
```python
import bencher as bn


def f(x):
    """Placeholder for the function under test; substitute your own."""
    return x**2


# 1. Problem Definition: declare the parameter space and benchmark logic
class MyBench(bn.ParametrizedSweep):
    x = bn.FloatSweep(default=0, bounds=[0, 10])
    score = bn.ResultFloat(units="pts")

    def benchmark(self):
        self.score = f(self.x)


# 2. Sweep Definition: choose what to sweep and what to measure
def example_my_bench(run_cfg=None):
    bench = MyBench().to_bench(run_cfg)
    bench.plot_sweep(input_vars=["x"], result_vars=["score"])
    return bench


# 3. Run Definition: set sampling density, repeats, and output options
if __name__ == "__main__":
    bn.run(example_my_bench, subsampling_divisions=4, repeats=5)
```
1. **Problem Definition** (`ParametrizedSweep`) — Declares the grammar's *Data* and
*Aesthetics*: typed input parameters (`FloatSweep`, `EnumSweep`, …) define the input space,
result variables (`ResultFloat`, `ResultImage`, …) define the output space, and a
`benchmark` method holds the evaluation logic.
2. **Sweep Definition** (`plot_sweep`) — Declares *Scales* and *Statistics*: configures which
input parameters to vary (`input_vars`), which metrics to collect (`result_vars`), which
parameters to pin (`const_vars`), and adds descriptions for the report.
3. **Run Definition** (`bn.run()` / `BenchRunCfg`) — Controls *Scales* and *Statistics*:
`subsampling_divisions` sets sampling density, `repeats` determines statistical power, and flags like
`save`, `optimise`, `over_time`, and `publish` control output and execution behavior.
### Iterative Workflow
The three stages above support a natural iterative workflow — you change one stage at a time
while holding the others fixed:
```{mermaid}
flowchart TD
Define(["① Define — ParametrizedSweep"])
Configure(["② Configure — plot_sweep()"])
Debug(["③ Debug — bn.run( subsampling_divisions=2 )"])
Check{"Works?"}
Refine(["④ Refine — bn.run( subsampling_divisions=5, repeats=10 )"])
Done{"Add params?"}
Define --> Configure --> Debug --> Check
Check -- "No — fix & rerun (cached)" --> Debug
Check -- "Yes" --> Refine --> Done
Done -- "Yes" --> Define
Done -- "No" --> Stop([Done])
style Define fill:#e8f4fc,stroke:#8ec0e4,color:#2c5f7a,stroke-width:2px
style Configure fill:#f4ecf8,stroke:#c4a4dc,color:#5a3d6e,stroke-width:2px
style Debug fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
style Refine fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
style Check fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
style Done fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
style Stop fill:#f0f0f0,stroke:#c0c0c0,color:#505050,stroke-width:2px
```
1. **Define** — Write a `ParametrizedSweep` subclass (Stage 1) with your inputs, outputs,
and benchmark function.
2. **Configure** — Set up `plot_sweep()` calls (Stage 2) to choose which parameters to vary
and which results to collect.
3. **Debug** — Run with a low `subsampling_divisions` value and few repeats (Stage 3: `subsampling_divisions=2, repeats=1`) to verify
the pipeline works end-to-end. Because results are cached, fixing and re-running is cheap.
4. **Refine** — Increase `subsampling_divisions` and `repeats` (Stage 3 only) to get publication-quality
statistics. Because the `subsampling_divisions` system uses binary subdivision, a higher value reuses all previously
cached points, so you only pay for the new midpoints.
## Bencher's Primitives
Bencher's design maps onto six primitives, each paralleling a grammar of graphics concept:
| Grammar of Graphics | Bencher Equivalent | Role |
|---|---|---|
| Data | xarray Dataset (N-D tensor) | The Cartesian product of all input parameters |
| Aesthetics | Input variable types | Float maps to axis, categorical maps to color/facet |
| Geometry | Plot result classes | `LineResult`, `BarResult`, `HeatmapResult`, etc. |
| Statistics | Repeats + `ReduceType` | Mean/std/min/max over repeated measurements |
| Facets | Recursive panel slicing | Extra dimensions beyond plot capacity become nested subplots |
| Scales | `SweepBase` + subsampling divisions system | Bounds, sampling density, type-aware ranges |
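As a rough illustration of the Statistics row, the sketch below reduces repeated measurements to a mean and standard deviation per sampled input. The `samples` dictionary is made-up data and the code uses only the standard library; it is not how Bencher's `ReduceType` is implemented.

```python
from statistics import mean, stdev

# Hypothetical repeated measurements: one list of repeats per sampled x value.
samples = {
    0.0: [1.0, 1.2, 0.8],
    0.5: [2.0, 2.2, 1.8],
}

# Collapse the repeat dimension into (mean, standard deviation) per input value.
summary = {x: (mean(values), stdev(values)) for x, values in samples.items()}
```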
### Parameters (Input Space)
Bencher provides typed sweep classes that declare the input space:
- `FloatSweep` — continuous float range with bounds
- `IntSweep` — discrete integer range with bounds
- `EnumSweep` — Python enum members
- `BoolSweep` — True/False
- `StringSweep` — categorical string values
Each sweep carries metadata: bounds, default value, units, and sampling density. Parameters are
defined as class attributes on a `ParametrizedSweep` subclass using the `param` library,
making them introspectable and hashable. See the gallery to explore how the number of input
parameters changes the visualization:
[0 float](reference/meta/0_float/no_repeats/index),
[1 float](reference/meta/1_float/no_repeats/index),
[2 float](reference/meta/2_float/no_repeats/index),
[3 float](reference/meta/3_float/no_repeats/index).
### Results (Output Space)
Result types declare what a benchmark function returns:
- `ResultFloat` — a numeric scalar with units and an optimization direction (minimize/maximize)
- `ResultBool` — a boolean result (stored as 0/1 numeric)
- `ResultVec` — a fixed-size numeric vector
- `ResultImage` — a file path to an image
- `ResultVideo` — a file path to a video
- `ResultPath` — an arbitrary file path
- `ResultString` — a string result
- `ResultDataSet` — an xarray Dataset
- `ResultVolume` — volume data
Bencher distinguishes inputs from results by type: anything that is a subclass of a result type
is an output; everything else is an input. This split drives the entire downstream pipeline.
See the [Result Types gallery](reference/meta/result_types/index) for examples of each type.
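The type-based split can be sketched in plain Python. The classes `InputStub`, `ResultStub`, and `MyBenchStub` below are hypothetical stand-ins, not Bencher classes; the sketch only shows the idea of partitioning class attributes with `isinstance`.

```python
class InputStub:
    """Hypothetical stand-in for an input parameter declaration."""


class ResultStub:
    """Hypothetical stand-in for a result declaration."""


class MyBenchStub:
    x = InputStub()
    y = InputStub()
    score = ResultStub()


def split_inputs_and_results(cls):
    """Partition public class attributes: result subclasses are outputs, the rest inputs."""
    inputs, results = [], []
    for name, value in vars(cls).items():
        if name.startswith("_"):
            continue  # skip dunder and private attributes
        if isinstance(value, ResultStub):
            results.append(name)
        else:
            inputs.append(name)
    return inputs, results


inputs, results = split_inputs_and_results(MyBenchStub)
```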
### Design (Sampling Strategy)
Bencher computes the full Cartesian product of all input parameter values using
`itertools.product`. Given N parameters with sizes `[s1, s2, ..., sN]`, the total number of
evaluations is `s1 * s2 * ... * sN * repeats`. Each combination is represented as both an
index tuple (for storage in the N-D array) and a value tuple (for passing to the benchmark
function).
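A minimal sketch of this enumeration, using `itertools.product` over index ranges so that each job carries both an index tuple and a value tuple. The parameter names and values are illustrative, and the job layout is an assumption rather than Bencher's internal representation.

```python
from itertools import product

# Illustrative parameter values; sizes are 3 and 2.
param_values = {
    "x": [0.0, 0.5, 1.0],
    "mode": ["fast", "slow"],
}
repeats = 2

names = list(param_values)
index_ranges = [range(len(param_values[name])) for name in names]

jobs = []
for index_tuple in product(*index_ranges):
    # The index tuple addresses a cell in the N-D array;
    # the value tuple is what the benchmark function receives.
    value_tuple = tuple(param_values[n][i] for n, i in zip(names, index_tuple))
    for repeat in range(repeats):
        jobs.append((index_tuple, value_tuple, repeat))

total = len(jobs)  # s1 * s2 * repeats
```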
See the [Cartesian Animation](reference/meta/cartesian_animation/index) gallery for an
animated visualization of how each dimension builds on the last — from a single point to a
line, grid, 3D stack, repeated measurements, and time-series film strip. The
[Sampling Strategies gallery](reference/meta/sampling/index) shows how different sweep types
(uniform, custom values, int vs float) produce different sample distributions.
### Execution
Each parameter combination is hashed to produce a persistent cache key. Results are stored
using `diskcache`, so re-running a benchmark with the same parameters skips already-computed
points. The `repeats` meta-variable controls how many times each combination is evaluated,
enabling statistical analysis of stochastic functions. Compare the
[no repeats](reference/meta/1_float/no_repeats/index) and
[with repeats](reference/meta/1_float/with_repeats/index) galleries to see how repeats
add confidence intervals to plots. The
[Statistics gallery](reference/meta/statistics/index) shows distributions, error bands,
and the effect of different repeat counts. For caching patterns, see the
[Cache Patterns example](reference/meta/advanced/example_advanced_cache_patterns).
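The caching idea can be sketched as follows. An in-memory `dict` stands in for the persistent `diskcache` store, and the key scheme (a SHA-256 hash of the sorted parameters plus the repeat index) is an assumption made for illustration, not Bencher's actual hashing.

```python
import hashlib
import json

cache = {}  # stand-in for a persistent on-disk store
call_count = 0


def expensive(x, mode):
    """Hypothetical benchmark function; counts how often it really runs."""
    global call_count
    call_count += 1
    return x * (2 if mode == "fast" else 3)


def cache_key(params: dict, repeat: int) -> str:
    # sort_keys makes the key independent of argument order.
    payload = json.dumps({"params": params, "repeat": repeat}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluate(params: dict, repeat: int):
    key = cache_key(params, repeat)
    if key not in cache:
        cache[key] = expensive(**params)
    return cache[key]


first = evaluate({"x": 1.5, "mode": "fast"}, repeat=0)
second = evaluate({"mode": "fast", "x": 1.5}, repeat=0)  # same point: cache hit
third = evaluate({"x": 1.5, "mode": "fast"}, repeat=1)   # new repeat: computed
```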
### Presentation (Automatic Plot Selection)
Bencher classifies each input parameter as either **continuous** (float, int) or **categorical**
(enum, bool, string). The counts of each type, along with the number of repeats, form a
*data signature* — a tuple `(float_count, cat_count, repeats)`. Each plot type declares
which signatures it can handle via a `PlotFilter` with `VarRange` bounds on each dimension.
The general mapping:
- **0 float + categories + 1 repeat** — Bar chart
([gallery](reference/meta/0_float/no_repeats/index))
- **1 float + categories + 1 repeat** — Line plot
([gallery](reference/meta/1_float/no_repeats/index))
- **1 float + categories + N repeats** — Curve with spread (mean +/- std)
([gallery](reference/meta/1_float/with_repeats/index))
- **2 float** — Heatmap
([gallery](reference/meta/2_float/no_repeats/index))
- **3+ float** — Surface / Volume
([gallery](reference/meta/3_float/no_repeats/index))
- **0 inputs + N repeats** — Histogram / Distribution
([gallery](reference/meta/0_float/with_repeats/index))
When there are more dimensions than a plot type can display, the extra dimensions become
**facets** — nested panels arranged in rows and columns, automatically labeled. Users can
override automatic selection with explicit `.to_*()` calls on the result object.
### Composition
The `ComposableContainerBase` framework handles spatial layout of multi-dimensional results. It
supports four composition methods:
- **right** — append horizontally (row)
- **down** — append vertically (column)
- **sequence** — display sequentially (animation/video)
- **overlay** — alpha-blend on top of each other
Different backends implement these operations: `ComposableContainerPanel` uses Panel's `Row`
and `Column` widgets for interactive dashboards, `ComposableContainerVideo` uses `moviepy` for
video compositing, and `ComposableContainerDataset` uses `xr.concat` for data merging.
See the [Composable Containers gallery](reference/meta/composable_containers/index) for
interactive examples of each backend and composition mode.
## Automatic Plot Selection
The type signature system deserves elaboration because it is central to Bencher's
"declare, don't configure" philosophy.
When you define a `ParametrizedSweep` with, say, one `FloatSweep`, one `EnumSweep`, and one
`ResultFloat`, Bencher counts 1 continuous input and 1 categorical input, and takes the
repeat count from the run configuration. This signature `(1, 1, ...)` is matched against each
registered plot type's `PlotFilter`.
A `PlotFilter` specifies acceptable ranges for:
- `float_range` — how many continuous inputs
- `cat_range` — how many categorical inputs
- `repeats_range` — how many repeats
- `panel_range` — how many panel-type results (images, videos)
- `result_vars` — how many numeric result variables
- `input_range` — total input count
Each range is a `VarRange(lower, upper)` where `None` for upper means unbounded. A plot type
matches only when *all* ranges are satisfied. When multiple plot types match, Bencher renders
all of them, giving a multi-perspective view of the data.
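The matching rule can be sketched with two small dataclasses. `CountRange`, `SignatureFilter`, and the `PLOT_FILTERS` registry are hypothetical names with illustrative bounds taken from the general mapping above; Bencher's real `VarRange` and `PlotFilter` cover more dimensions and may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CountRange:
    lower: int = 0
    upper: Optional[int] = None  # None means unbounded

    def matches(self, count: int) -> bool:
        if count < self.lower:
            return False
        return self.upper is None or count <= self.upper


@dataclass(frozen=True)
class SignatureFilter:
    float_range: CountRange = CountRange()
    cat_range: CountRange = CountRange()
    repeats_range: CountRange = CountRange(1, None)

    def matches(self, floats: int, cats: int, repeats: int) -> bool:
        # A plot type matches only when every range is satisfied.
        return (
            self.float_range.matches(floats)
            and self.cat_range.matches(cats)
            and self.repeats_range.matches(repeats)
        )


# Illustrative registry, not Bencher's actual filter definitions.
PLOT_FILTERS = {
    "bar": SignatureFilter(CountRange(0, 0), CountRange(1, None), CountRange(1, 1)),
    "line": SignatureFilter(CountRange(1, 1), CountRange(), CountRange(1, 1)),
    "curve": SignatureFilter(CountRange(1, 1), CountRange(), CountRange(2, None)),
    "heatmap": SignatureFilter(CountRange(2, 2)),
}


def matching_plots(floats: int, cats: int, repeats: int) -> list:
    return [name for name, flt in PLOT_FILTERS.items() if flt.matches(floats, cats, repeats)]
```

With this registry, `matching_plots(1, 1, 1)` selects the line plot, while adding a second float input, `matching_plots(2, 1, 1)`, selects the heatmap without touching any plotting code.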
This mechanism means that adding a dimension to your sweep — say, adding a second float
parameter — automatically changes the visualization from line plots to heatmaps without any
code changes to the plotting logic. See the [Plot Types gallery](reference/meta/plot_types/index)
for every available plot type, and the
[Bool Plot Types gallery](reference/meta/bool_plot_types/index) for boolean-specific variants.
## The Subsampling Divisions System
The `subsampling_divisions` parameter provides a single knob to control sampling density across all dimensions
simultaneously. It indexes into a predefined sample count table:
| Subsampling Divisions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | 1 | 2 | 3 | 5 | 9 | 17 | 33 | 65 | 129 | 257 | 513 | 1025 |
For every `subsampling_divisions` value of 2 or more, the count follows the formula `2^(subsampling_divisions - 2) + 1`: a value of 4 gives
`2^2 + 1 = 5`, a value of 5 gives `2^3 + 1 = 9`, a value of 6 gives `2^4 + 1 = 17`, and so on. Only the value 1 (a single sample) is a special case.
Samples are distributed evenly across each parameter's range using `numpy.linspace`.
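The table and the even spacing can be reproduced with a short sketch. The closed form below is an assumption that happens to match every table entry from 2 upward (Bencher itself is described as using a predefined table), and `linspace` is a pure-Python stand-in for `numpy.linspace`.

```python
def samples_for(subsampling_divisions: int) -> int:
    """Sample count per dimension, matching the table above."""
    if subsampling_divisions < 1:
        raise ValueError("subsampling_divisions must be >= 1")
    if subsampling_divisions == 1:
        return 1  # special case: the closed form does not cover it
    return 2 ** (subsampling_divisions - 2) + 1


def linspace(lo: float, hi: float, n: int) -> list:
    """Evenly spaced samples including both endpoints (stand-in for numpy.linspace)."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


counts = [samples_for(d) for d in range(1, 13)]
grid = linspace(0.0, 1.0, samples_for(4))
```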
The `2n - 1` relationship between consecutive counts is deliberate. Because each count is
exactly twice the previous one minus one, and both grids span the same range with evenly spaced points, the new samples land at the
midpoints between existing ones. For example, on a `[0, 1]` range:
- **Subsampling Divisions 5** (9 samples): 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
- **Subsampling Divisions 6** (17 samples): 0, 0.0625, 0.125, 0.1875, 0.25, ...
Every sample from `subsampling_divisions=5` appears at an even (0-based) index in the `subsampling_divisions=6` grid. The odd indices are
new points filling the gaps between previous samples. This is binary subdivision, the same
principle used in multigrid methods and progressive image rendering.
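The superset property can be checked exactly with rational arithmetic. This sketch assumes the closed-form sample count for values of 2 or more and a `[0, 1]` range; `grid` is a hypothetical helper, not a Bencher function.

```python
from fractions import Fraction


def grid(subsampling_divisions: int) -> list:
    """Evenly spaced points on [0, 1] as exact fractions (valid for values >= 2)."""
    if subsampling_divisions < 2:
        raise ValueError("this sketch covers subsampling_divisions >= 2 only")
    n = 2 ** (subsampling_divisions - 2) + 1
    return [Fraction(i, n - 1) for i in range(n)]


coarse = grid(5)  # 9 points: 0, 1/8, ..., 1
fine = grid(6)    # 17 points: 0, 1/16, ..., 1

reused = fine[::2]       # even indices: the previous grid
new_points = fine[1::2]  # odd indices: midpoints between previous samples
midpoints = [(a + b) / 2 for a, b in zip(coarse, coarse[1:])]
```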
This enables a natural workflow: start at a low `subsampling_divisions` value for quick iteration, then increase it for
publication-quality results. Because the sample grid at each higher value is a strict superset of the grid below it, cached
results from earlier runs are reused automatically, and you only pay for the new midpoints.
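As a worked example of the saving, the sketch below counts the grid points that are new when the value increases. It assumes every input is a float sweep sharing the same `subsampling_divisions`, that all earlier points are cached, and that repeats are left unchanged; `new_evaluations` is a hypothetical helper.

```python
def samples_for(subsampling_divisions: int) -> int:
    """Sample count per dimension, matching the table above."""
    if subsampling_divisions < 1:
        raise ValueError("subsampling_divisions must be >= 1")
    return 1 if subsampling_divisions == 1 else 2 ** (subsampling_divisions - 2) + 1


def new_evaluations(old: int, new: int, num_float_inputs: int) -> int:
    """Grid points present at `new` but not at `old` (old grid assumed fully cached)."""
    return samples_for(new) ** num_float_inputs - samples_for(old) ** num_float_inputs


one_dim = new_evaluations(5, 6, num_float_inputs=1)  # 17 - 9
two_dim = new_evaluations(5, 6, num_float_inputs=2)  # 17**2 - 9**2
```

In one dimension the refinement costs 8 new evaluations instead of 17; in two dimensions it costs 208 instead of 289.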
See the [Subsampling Divisions System gallery](reference/meta/levels/index) for an interactive demo showing how
increasing `subsampling_divisions` progressively refines the sample grid.
## Connections to Related Ideas
Bencher sits at the intersection of several established concepts:
- **Design of Experiments** — Factorial designs are exactly Cartesian products of factor levels.
Bencher's sweep system is a programmatic way to define full factorial experiments. The
`subsampling_divisions` system adds progressive refinement: each setting is still a full factorial, first on a coarse grid and then on denser grids that reuse the earlier points.
- **Tidy Data** (Wickham, 2014) — Bencher's xarray output is inherently tidy: each variable
forms a dimension or coordinate, each observation is a point in the N-D grid, and each type
of observational unit forms a Dataset.
- **Hyperparameter Tuning** — Frameworks like Optuna and Ray Tune solve the optimization
problem over parameter spaces. Bencher solves the *visualization and analysis* problem: not
"find the best point" but "understand the landscape".
- **Relational Algebra** — The Cartesian product is a fundamental operation. Bencher applies it
to typed parameter domains and extends it with caching, reduction, and visualization.
Bencher occupies the space between design-of-experiments frameworks (which focus on statistical
efficiency of sampling) and visualization grammars (which focus on rendering). It connects the
two: declare the experiment, generate the data, and visualize the results — all from a single
typed specification.