# A Grammar of Benchmarking

## The Grammar of Graphics

In 1999, Leland Wilkinson published *The Grammar of Graphics*, arguing that statistical
visualizations are not a fixed taxonomy of chart types (bar, line, scatter, pie) but
compositions of a small set of orthogonal components:

- **Data** — the variables to visualize
- **Aesthetics** — mappings from data to visual properties (position, color, size)
- **Geometry** — the visual marks (points, lines, bars, areas)
- **Statistics** — transformations applied before rendering (binning, smoothing, aggregation)
- **Scales** — functions that map data values to aesthetic values (linear, log, discrete)
- **Coordinates** — the coordinate system (Cartesian, polar, geographic)
- **Facets** — splitting data into subplots by a categorical variable

The key insight is *decomposition*: instead of memorizing when to use each chart type, you
compose a visualization from independent building blocks. Hadley Wickham's **ggplot2** and
later **Vega-Lite** made this practical — you declare what you want, and the library assembles
the pieces.

The pattern generalizes: take a complex domain, decompose it into a small set of composable
primitives, and let combinations emerge from composition rather than enumeration.

## From Graphics to Data

The grammar of graphics addresses the *visualization* step — it assumes the data already
exists. But where does the data come from?

In benchmarking and parameter studies, data is generated by evaluating a function across
combinations of input parameters. The conventional approach is to write nested for-loops,
manually manage results arrays, and hand-pick plot types. This is the benchmarking equivalent
of manually drawing charts: tedious, error-prone, and tightly coupled to the specific
dimensionality of your experiment.

Bencher extends the grammar of graphics principle *upstream* to data generation. Just as
ggplot2 replaces "pick a chart type" with "compose visualization components", Bencher replaces
"write nested for-loops" with "declare parameter spaces and compose". The core abstraction is
`ParametrizedSweep`: a class where typed parameter declarations define the input space, and a
`benchmark` method defines the function to evaluate. Bencher handles the rest: computing the
Cartesian product, caching results, selecting appropriate visualizations, and composing panels.
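
To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Bencher at all; `evaluate`, `xs`, and `modes` are hypothetical names chosen for illustration. The nested-loop version hard-codes two dimensions, while the declarative version describes the space once and works for any number of parameters.

```python
import itertools


def evaluate(x, mode):
    """Hypothetical function under test."""
    return x * 2 if mode == "double" else x ** 2


xs = [0.0, 1.0, 2.0]
modes = ["double", "square"]

# Conventional approach: one loop per parameter, dimensionality baked into the code.
loop_results = {}
for x in xs:
    for mode in modes:
        loop_results[(x, mode)] = evaluate(x, mode)

# Declarative approach: describe the space once, then walk its Cartesian product.
space = {"x": xs, "mode": modes}
product_results = {
    combo: evaluate(**dict(zip(space, combo)))
    for combo in itertools.product(*space.values())
}

print(loop_results == product_results)  # True
print(len(product_results))  # 6
```

Adding a third parameter to the declarative version only means adding one entry to `space`; the loop version would need another level of nesting and a new key layout.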

## Architecture Overview

Just as the grammar of graphics decomposes a chart into Data, Aesthetics, Geometry, and so on,
Bencher decomposes a benchmark into three stages — each mapping directly onto the grammar
primitives introduced below:

```{raw} html
<div style="max-width: 800px;">
```

```{mermaid}
flowchart LR
    subgraph Problem [" "]
        direction TB
        PT["① Problem Definition"]
        subgraph PS ["ParametrizedSweep"]
            direction TB
            Inputs[FloatSweep · IntSweep · EnumSweep]
            Results[ResultFloat · ResultBool · ResultImage]
            Fn["def benchmark(self)"]
            Inputs ~~~ Results ~~~ Fn
        end
        PT ~~~ PS
    end

    subgraph Sweep [" "]
        direction TB
        ST["② Sweep Definition"]
        subgraph SW ["plot_sweep()"]
            direction TB
            IV[input_vars]
            RV[result_vars]
            CV[const_vars]
            IV ~~~ RV ~~~ CV
        end
        ST ~~~ SW
    end

    subgraph Run [" "]
        direction TB
        RT["③ Run Definition"]
        subgraph RN ["bn.run()"]
            direction TB
            Level[subsampling_divisions]
            Repeats[repeats]
            Opts[save · optimise · over_time]
            Level ~~~ Repeats ~~~ Opts
        end
        RT ~~~ RN
    end

    Problem == .to_bench() ==> Sweep == bn.run() ==> Run

    classDef title fill:none,stroke:none,color:#2a2a2a,font-size:16px
    classDef blueLight fill:#e8f4fc,stroke:#c8dff0,color:#3a6a8a
    classDef purpleLight fill:#f4ecf8,stroke:#e0d0ea,color:#5a4068
    classDef greenLight fill:#ecf6ee,stroke:#cce6d2,color:#3a5e40

    class PT,ST,RT title
    class Inputs,Results,Fn blueLight
    class IV,RV,CV purpleLight
    class Level,Repeats,Opts greenLight

    style Problem fill:#fafcfe,stroke:#c8dff0,stroke-width:2px
    style Sweep fill:#fdfafe,stroke:#e0d0ea,stroke-width:2px
    style Run fill:#fafefa,stroke:#cce6d2,stroke-width:2px
    style PS fill:#e8f4fc,stroke:#8ec0e4,stroke-width:2px,color:#2c5f7a
    style SW fill:#f4ecf8,stroke:#c4a4dc,stroke-width:2px,color:#5a3d6e
    style RN fill:#ecf6ee,stroke:#98d0a4,stroke-width:2px,color:#2e5e3a
```

```{raw} html
</div>
```

Every auto-generated example follows this pattern:

```python
import math

import bencher as bn


# 1. Problem Definition — declare the parameter space and benchmark logic
class MyBench(bn.ParametrizedSweep):
    x = bn.FloatSweep(default=0, bounds=[0, 10])
    score = bn.ResultFloat(units="pts")

    def benchmark(self):
        # Placeholder workload; substitute the function you want to measure
        self.score = math.sin(self.x)


# 2. Sweep Definition — choose what to sweep and what to measure
def example_my_bench(run_cfg=None):
    bench = MyBench().to_bench(run_cfg)
    bench.plot_sweep(input_vars=["x"], result_vars=["score"])
    return bench


# 3. Run Definition — set sampling density, repeats, and output options
if __name__ == "__main__":
    bn.run(example_my_bench, subsampling_divisions=4, repeats=5)
```

1. **Problem Definition** (`ParametrizedSweep`) — Declares the grammar's *Data* and
   *Aesthetics*: typed input parameters (`FloatSweep`, `EnumSweep`, …) define the input space,
   result variables (`ResultFloat`, `ResultImage`, …) define the output space, and a
   `benchmark` method holds the evaluation logic.
2. **Sweep Definition** (`plot_sweep`) — Declares *Scales* and *Statistics*: configures which
   input parameters to vary (`input_vars`), which metrics to collect (`result_vars`), which
   parameters to pin (`const_vars`), and adds descriptions for the report.
3. **Run Definition** (`bn.run()` / `BenchRunCfg`) — Controls *Scales* and *Statistics*:
   `subsampling_divisions` sets sampling density, `repeats` determines statistical power, and flags like
   `save`, `optimise`, `over_time`, and `publish` control output and execution behavior.

### Iterative Workflow

The three stages above support a natural iterative workflow — you change one stage at a time
while holding the others fixed:

```{raw} html
<div style="max-width: 600px;">
```

```{mermaid}
flowchart TD
    Define(["① Define — ParametrizedSweep"])
    Configure(["② Configure — plot_sweep()"])
    Debug(["③ Debug — bn.run( subsampling_divisions=2 )"])
    Check{"Works?"}
    Refine(["④ Refine — bn.run( subsampling_divisions=5, repeats=10 )"])
    Done{"Add params?"}

    Define --> Configure --> Debug --> Check
    Check -- "No — fix & rerun (cached)" --> Debug
    Check -- "Yes" --> Refine --> Done
    Done -- "Yes" --> Define
    Done -- "No" --> Stop([Done])

    style Define fill:#e8f4fc,stroke:#8ec0e4,color:#2c5f7a,stroke-width:2px
    style Configure fill:#f4ecf8,stroke:#c4a4dc,color:#5a3d6e,stroke-width:2px
    style Debug fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Refine fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Check fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Done fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Stop fill:#f0f0f0,stroke:#c0c0c0,color:#505050,stroke-width:2px
```

```{raw} html
</div>
```

1. **Define** — Write a `ParametrizedSweep` subclass (Stage 1) with your inputs, outputs,
   and benchmark function.
2. **Configure** — Set up `plot_sweep()` calls (Stage 2) to choose which parameters to vary
   and which results to collect.
3. **Debug** — Run with a low sampling density and a single repeat (Stage 3: `subsampling_divisions=2, repeats=1`) to verify
   the pipeline works end-to-end. Because results are cached, fixing and re-running is cheap.
4. **Refine** — Increase `subsampling_divisions` and `repeats` (Stage 3 only) to get publication-quality
   statistics. Because the `subsampling_divisions` system uses binary subdivision, a higher value reuses all
   previously cached points, so you only pay for the new midpoints.

## Bencher's Primitives

Bencher's design maps onto six primitives, each paralleling a grammar of graphics concept:

| Grammar of Graphics | Bencher Equivalent | Role |
|---|---|---|
| Data | xarray Dataset (N-D tensor) | The Cartesian product of all input parameters |
| Aesthetics | Input variable types | Float maps to axis, categorical maps to color/facet |
| Geometry | Plot result classes | `LineResult`, `BarResult`, `HeatmapResult`, etc. |
| Statistics | Repeats + `ReduceType` | Mean/std/min/max over repeated measurements |
| Facets | Recursive panel slicing | Extra dimensions beyond plot capacity become nested subplots |
| Scales | `SweepBase` + subsampling divisions system | Bounds, sampling density, type-aware ranges |

### Parameters (Input Space)

Bencher provides typed sweep classes that declare the input space:

- `FloatSweep` — continuous float range with bounds
- `IntSweep` — discrete integer range with bounds
- `EnumSweep` — Python enum members
- `BoolSweep` — True/False
- `StringSweep` — categorical string values

Each sweep carries metadata: bounds, default value, units, and sampling density. Parameters are
defined as class attributes on a `ParametrizedSweep` subclass using the `param` library,
making them introspectable and hashable. See the gallery to explore how the number of input
parameters changes the visualization:
[0 float](reference/meta/0_float/no_repeats/index),
[1 float](reference/meta/1_float/no_repeats/index),
[2 float](reference/meta/2_float/no_repeats/index),
[3 float](reference/meta/3_float/no_repeats/index).

### Results (Output Space)

Result types declare what a benchmark function returns:

- `ResultFloat` — a numeric scalar with units and an optimization direction (minimize/maximize)
- `ResultBool` — a boolean result (stored as 0/1 numeric)
- `ResultVec` — a fixed-size numeric vector
- `ResultImage` — a file path to an image
- `ResultVideo` — a file path to a video
- `ResultPath` — an arbitrary file path
- `ResultString` — a string result
- `ResultDataSet` — an xarray Dataset
- `ResultVolume` — volume data

Bencher distinguishes inputs from results by type: anything that is a subclass of a result type
is an output; everything else is an input. This split drives the entire downstream pipeline.
See the [Result Types gallery](reference/meta/result_types/index) for examples of each type.
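
The type-based split can be illustrated without Bencher. The sketch below is an assumption-laden stand-in, not the library's code: `SweepVar`, `ResultVar`, `ResultNumber`, `ExampleBench`, and `split_inputs_and_results` are all hypothetical names. It only shows the idea of classifying class attributes by whether they derive from a result base class.

```python
class SweepVar:
    """Hypothetical stand-in for an input sweep declaration."""

    def __init__(self, values):
        self.values = values


class ResultVar:
    """Hypothetical stand-in for the base class of all result declarations."""

    def __init__(self, units=""):
        self.units = units


class ResultNumber(ResultVar):
    """A subclass of the result base class still counts as a result."""


class ExampleBench:
    x = SweepVar([0, 1, 2])
    mode = SweepVar(["a", "b"])
    score = ResultNumber(units="pts")


def split_inputs_and_results(cls):
    """Classify public class attributes: result subclasses are outputs, the rest inputs."""
    inputs, results = [], []
    for name, attr in vars(cls).items():
        if name.startswith("_"):
            continue  # skip dunder entries such as __module__ and __doc__
        if isinstance(attr, ResultVar):
            results.append(name)
        else:
            inputs.append(name)
    return inputs, results


inputs, results = split_inputs_and_results(ExampleBench)
print(inputs)   # ['x', 'mode']
print(results)  # ['score']
```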

### Design (Sampling Strategy)

Bencher computes the full Cartesian product of all input parameter values using
`itertools.product`. Given N parameters with sizes `[s1, s2, ..., sN]`, the total number of
evaluations is `s1 * s2 * ... * sN * repeats`. Each combination is represented as both an
index tuple (for storage in the N-D array) and a value tuple (for passing to the benchmark
function).
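
As a rough illustration of the bookkeeping described above, the sketch below builds index tuples and value tuples with `itertools.product`. The parameter names, their values, and the `repeats` count are made up for the example, and this is not Bencher's internal code.

```python
import itertools
import math

# Hypothetical sampled values for three parameters.
param_values = {
    "x": [0.0, 0.5, 1.0],
    "mode": ["a", "b"],
    "flag": [False, True],
}
repeats = 2

sizes = [len(values) for values in param_values.values()]

# Index tuples address a cell in the N-D result array.
index_tuples = list(itertools.product(*(range(size) for size in sizes)))
# Value tuples are what would be passed to the benchmark function.
value_tuples = list(itertools.product(*param_values.values()))

total_evaluations = math.prod(sizes) * repeats

print(sizes)              # [3, 2, 2]
print(len(index_tuples))  # 12
print(total_evaluations)  # 24
print(index_tuples[5], value_tuples[5])  # (1, 0, 1) (0.5, 'a', True)
```

Both lists are generated in the same order, so the entry at a given position in `index_tuples` always corresponds to the entry at the same position in `value_tuples`.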

See the [Cartesian Animation](reference/meta/cartesian_animation/index) gallery for an
animated visualization of how each dimension builds on the last — from a single point to a
line, grid, 3D stack, repeated measurements, and time-series film strip. The
[Sampling Strategies gallery](reference/meta/sampling/index) shows how different sweep types
(uniform, custom values, int vs float) produce different sample distributions.

### Execution

Each parameter combination is hashed to produce a persistent cache key. Results are stored
using `diskcache`, so re-running a benchmark with the same parameters skips already-computed
points. The `repeats` meta-variable controls how many times each combination is evaluated,
enabling statistical analysis of stochastic functions. Compare the
[no repeats](reference/meta/1_float/no_repeats/index) and
[with repeats](reference/meta/1_float/with_repeats/index) galleries to see how repeats
add confidence intervals to plots. The
[Statistics gallery](reference/meta/statistics/index) shows distributions, error bands,
and the effect of different repeat counts. For caching patterns, see the
[Cache Patterns example](reference/meta/advanced/example_advanced_cache_patterns).
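
The caching idea can be sketched in a few lines of standard-library Python. This is only an illustration: it uses an in-memory dict instead of `diskcache`, it hashes a JSON dump with SHA-256 (an assumption, since the text above does not say how Bencher derives its keys), and `expensive`, `cache_key`, and `evaluate_cached` are hypothetical names.

```python
import hashlib
import json

cache = {}          # in-memory stand-in for a persistent store
calls = {"n": 0}    # counts how often the expensive function really runs


def expensive(x, mode):
    """Hypothetical slow function under test."""
    calls["n"] += 1
    return x * 10 if mode == "fast" else x * 100


def cache_key(params):
    """Hash a parameter combination; sorted keys make the hash order-independent."""
    payload = json.dumps(params, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluate_cached(params):
    key = cache_key(params)
    if key not in cache:
        cache[key] = expensive(**params)
    return cache[key]


first = evaluate_cached({"x": 2, "mode": "fast"})
second = evaluate_cached({"mode": "fast", "x": 2})  # same point, different key order
third = evaluate_cached({"x": 3, "mode": "fast"})   # a new point

print(first, second, third)  # 20 20 30
print(calls["n"])            # 2
```

The second call is served from the cache, so the function runs only twice for three requests.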

### Presentation (Automatic Plot Selection)

Bencher classifies each input parameter as either **continuous** (float, int) or **categorical**
(enum, bool, string). The counts of each type, along with the number of repeats, form a
*data signature* — a tuple `(float_count, cat_count, repeats)`. Each plot type declares
which signatures it can handle via a `PlotFilter` with `VarRange` bounds on each dimension.

The general mapping:

- **0 float + categories + 1 repeat** — Bar chart
  ([gallery](reference/meta/0_float/no_repeats/index))
- **1 float + categories + 1 repeat** — Line plot
  ([gallery](reference/meta/1_float/no_repeats/index))
- **1 float + categories + N repeats** — Curve with spread (mean +/- std)
  ([gallery](reference/meta/1_float/with_repeats/index))
- **2 float** — Heatmap
  ([gallery](reference/meta/2_float/no_repeats/index))
- **3+ float** — Surface / Volume
  ([gallery](reference/meta/3_float/no_repeats/index))
- **0 inputs + N repeats** — Histogram / Distribution
  ([gallery](reference/meta/0_float/with_repeats/index))

When there are more dimensions than a plot type can display, the extra dimensions become
**facets** — nested panels arranged in rows and columns, automatically labeled. Users can
override automatic selection with explicit `.to_*()` calls on the result object.

### Composition

The `ComposableContainerBase` framework handles spatial layout of multi-dimensional results. It
supports four composition methods:

- **right** — append horizontally (row)
- **down** — append vertically (column)
- **sequence** — display sequentially (animation/video)
- **overlay** — alpha-blend on top of each other

Different backends implement these operations: `ComposableContainerPanel` uses Panel's `Row`
and `Column` widgets for interactive dashboards, `ComposableContainerVideo` uses `moviepy` for
video compositing, and `ComposableContainerDataset` uses `xr.concat` for data merging.
See the [Composable Containers gallery](reference/meta/composable_containers/index) for
interactive examples of each backend and composition mode.
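
To show what the composition methods mean independently of any backend, here is a toy sketch that treats a "panel" as a 2-D list of cells. The functions `right`, `down`, and `sequence` are hypothetical helpers written for this example (overlay is omitted); they are not the `ComposableContainerBase` API.

```python
def right(a, b):
    """Append b to the right of a: rows are joined pairwise."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same number of rows")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def down(a, b):
    """Append b below a: rows are stacked."""
    return a + b


def sequence(*frames):
    """Keep blocks as separate frames, to be shown one after another."""
    return list(frames)


a = [["A"]]
b = [["B"]]

row = right(a, b)
grid = down(row, right(b, a))
frames = sequence(a, b)

print(row)          # [['A', 'B']]
print(grid)         # [['A', 'B'], ['B', 'A']]
print(len(frames))  # 2
```

Because each operation returns another block, the calls nest freely, which is the property that lets extra dimensions be laid out recursively.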

## Automatic Plot Selection

The type signature system deserves elaboration because it is central to Bencher's
"declare, don't configure" philosophy.

When you define a `ParametrizedSweep` with, say, one `FloatSweep`, one `EnumSweep`, and one
`ResultFloat`, Bencher counts: 1 continuous input, 1 categorical input, and determines the
repeat count from the run configuration. This signature `(1, 1, ...)` is matched against each
registered plot type's `PlotFilter`.

A `PlotFilter` specifies acceptable ranges for:

- `float_range` — how many continuous inputs
- `cat_range` — how many categorical inputs
- `repeats_range` — how many repeats
- `panel_range` — how many panel-type results (images, videos)
- `result_vars` — how many numeric result variables
- `input_range` — total input count

Each range is a `VarRange(lower, upper)` where `None` for upper means unbounded. A plot type
matches only when *all* ranges are satisfied. When multiple plot types match, Bencher renders
all of them, giving a multi-perspective view of the data.
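
The matching rule can be sketched as follows. This is a simplified re-creation under stated assumptions, not Bencher's implementation: `SketchRange` and `SketchFilter` are hypothetical classes, bounds are assumed to be inclusive, only three of the six ranges are modelled, and the filter values in `PLOT_FILTERS` are illustrative guesses based on the general mapping listed earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchRange:
    """Hypothetical inclusive range; upper=None means unbounded."""

    lower: int = 0
    upper: Optional[int] = None

    def matches(self, count: int) -> bool:
        if count < self.lower:
            return False
        return self.upper is None or count <= self.upper


@dataclass(frozen=True)
class SketchFilter:
    """Hypothetical filter: a plot matches only when every range is satisfied."""

    float_range: SketchRange
    cat_range: SketchRange
    repeats_range: SketchRange

    def matches(self, floats: int, cats: int, repeats: int) -> bool:
        return (
            self.float_range.matches(floats)
            and self.cat_range.matches(cats)
            and self.repeats_range.matches(repeats)
        )


# Illustrative values only, not the library's real filter definitions.
PLOT_FILTERS = {
    "bar": SketchFilter(SketchRange(0, 0), SketchRange(0, None), SketchRange(1, 1)),
    "line": SketchFilter(SketchRange(1, 1), SketchRange(0, None), SketchRange(1, 1)),
    "curve_with_spread": SketchFilter(
        SketchRange(1, 1), SketchRange(0, None), SketchRange(2, None)
    ),
    "heatmap": SketchFilter(SketchRange(2, 2), SketchRange(0, None), SketchRange(1, None)),
}


def matching_plots(floats, cats, repeats):
    """Return every plot name whose filter accepts the data signature."""
    return [name for name, flt in PLOT_FILTERS.items() if flt.matches(floats, cats, repeats)]


print(matching_plots(1, 1, 1))  # ['line']
print(matching_plots(1, 1, 5))  # ['curve_with_spread']
print(matching_plots(2, 0, 1))  # ['heatmap']
```

Changing the signature from one float to two switches the result from a line to a heatmap without touching any plotting code, which is the behaviour the surrounding text describes.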

This mechanism means that adding a dimension to your sweep — say, adding a second float
parameter — automatically changes the visualization from line plots to heatmaps without any
code changes to the plotting logic. See the [Plot Types gallery](reference/meta/plot_types/index)
for every available plot type, and the
[Bool Plot Types gallery](reference/meta/bool_plot_types/index) for boolean-specific variants.

## The Subsampling Divisions System

The `subsampling_divisions` parameter provides a single knob to control sampling density across all dimensions
simultaneously. It indexes into a predefined sample count table:

| Subsampling Divisions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | 1 | 2 | 3 | 5 | 9 | 17 | 33 | 65 | 129 | 257 | 513 | 1025 |

From `subsampling_divisions` 2 onward, the count follows the formula `2^(subsampling_divisions-2) + 1`:
a value of 4 gives `2^2 + 1 = 5`, 5 gives `2^3 + 1 = 9`, 6 gives `2^4 + 1 = 17`, and so on. Only the first
entry, a single sample, falls outside the formula.
Samples are distributed evenly across each parameter's range using `numpy.linspace`.

The `2n - 1` relationship between consecutive counts is deliberate. From 3 onward, each value has
exactly `2n - 1` samples, where `n` is the previous count, so the new samples land at the
midpoints between existing ones. For example, on a `[0, 1]` range:

- **Subsampling Divisions 5** (9 samples): 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
- **Subsampling Divisions 6** (17 samples): 0, 0.0625, 0.125, 0.1875, 0.25, ...

Every sample from `subsampling_divisions=5` appears at an even index in the `subsampling_divisions=6` grid.
The odd indices are new points filling the gaps between previous samples. This is binary subdivision,
the same principle used in multigrid methods and progressive image rendering.

This enables a natural workflow: start at a low `subsampling_divisions` for quick iteration, then increase it for
publication-quality results. Because each finer grid is a superset of the coarser ones, cached
results from earlier runs are reused automatically, so you only pay for the new midpoints.
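
The nesting property is easy to check numerically. The sketch below reproduces the sample count table and the `[0, 1]` grids using exact fractions instead of `numpy.linspace`, so that membership tests are not affected by floating-point rounding. `sample_count` and `grid` are hypothetical helper names, not Bencher functions.

```python
from fractions import Fraction


def sample_count(divisions):
    """Counts matching the table: 1 for the first entry, then 2^(n-2) + 1."""
    return 1 if divisions == 1 else 2 ** (divisions - 2) + 1


def grid(divisions, low=0, high=1):
    """Evenly spaced samples including both endpoints, as exact fractions."""
    n = sample_count(divisions)
    if n == 1:
        return [Fraction(low)]
    step = Fraction(high - low, n - 1)
    return [low + i * step for i in range(n)]


counts = [sample_count(d) for d in range(1, 13)]
coarse = grid(5)
fine = grid(6)
new_points = [p for p in fine if p not in coarse]

print(counts)                # [1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513, 1025]
print(fine[::2] == coarse)   # True: old samples sit at the even indices
print(len(new_points))       # 8 new midpoints to evaluate
```

Going from 9 to 17 samples therefore costs 8 new evaluations per dimension rather than 17, provided the earlier results are still cached.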

See the [Subsampling Divisions System gallery](reference/meta/levels/index) for an interactive demo showing how
increasing `subsampling_divisions` progressively refines the sample grid.

## Connections to Related Ideas

Bencher sits at the intersection of several established concepts:

- **Design of Experiments** — Factorial designs are exactly Cartesian products of factor levels.
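  Bencher's sweep system is a programmatic way to define full factorial experiments. The
  `subsampling_divisions` system adds progressive refinement: each coarser grid is a subset of
  the next finer one, loosely analogous to how a fractional factorial design is a subset of the
  full design.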

- **Tidy Data** (Wickham, 2014) — Bencher's xarray output is inherently tidy: each variable
  forms a dimension or coordinate, each observation is a point in the N-D grid, and each type
  of observational unit forms a Dataset.

- **Hyperparameter Tuning** — Frameworks like Optuna and Ray Tune solve the optimization
  problem over parameter spaces. Bencher solves the *visualization and analysis* problem: not
  "find the best point" but "understand the landscape".

- **Relational Algebra** — The Cartesian product is a fundamental operation. Bencher applies it
  to typed parameter domains and extends it with caching, reduction, and visualization.

Bencher occupies the space between design-of-experiments frameworks (which focus on statistical
efficiency of sampling) and visualization grammars (which focus on rendering). It connects the
two: declare the experiment, generate the data, and visualize the results — all from a single
typed specification.