A Grammar of Benchmarking

The Grammar of Graphics

In 1999, Leland Wilkinson published The Grammar of Graphics, arguing that statistical visualizations are not a fixed taxonomy of chart types (bar, line, scatter, pie) but compositions of a small set of orthogonal components:

Data — the variables to visualize
Aesthetics — mappings from data to visual properties (position, color, size)
Geometry — the visual marks (points, lines, bars, areas)
Statistics — transformations applied before rendering (binning, smoothing, aggregation)
Scales — functions that map data values to aesthetic values (linear, log, discrete)
Coordinates — the coordinate system (Cartesian, polar, geographic)
Facets — splitting data into subplots by a categorical variable

The key insight is decomposition: instead of memorizing when to use each chart type, you compose a visualization from independent building blocks. Hadley Wickham’s ggplot2 and later Vega-Lite made this practical — you declare what you want, and the library assembles the pieces.

The pattern generalizes: take a complex domain, decompose it into a small set of composable primitives, and let combinations emerge from composition rather than enumeration.

From Graphics to Data

The grammar of graphics addresses the visualization step — it assumes the data already exists. But where does the data come from?

In benchmarking and parameter studies, data is generated by evaluating a function across combinations of input parameters. The conventional approach is to write nested for-loops, manually manage results arrays, and hand-pick plot types. This is the benchmarking equivalent of manually drawing charts: tedious, error-prone, and tightly coupled to the specific dimensionality of your experiment.
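
For comparison, the conventional approach described above might look like the following sketch. The function `simulate` and its two parameters are hypothetical placeholders, not part of any library. Adding a third parameter would mean another loop level, another dimension in the results container, and reworked plotting code.

```python
# Hand-rolled parameter study: the loop nesting mirrors the number of inputs.
def simulate(gain: float, delay: int) -> float:
    """Hypothetical stand-in for the function being benchmarked."""
    return gain * gain - delay

gains = [0.0, 0.5, 1.0]
delays = [1, 2]

# The results container is shaped by hand to match the two loops below.
results = [[0.0 for _ in delays] for _ in gains]
for i, gain in enumerate(gains):
    for j, delay in enumerate(delays):
        results[i][j] = simulate(gain, delay)

print(results)
```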

Bencher extends the grammar of graphics principle upstream to data generation. Just as ggplot2 replaces “pick a chart type” with “compose visualization components”, Bencher replaces “write nested for-loops” with “declare parameter spaces and compose”. The core abstraction is ParametrizedSweep: a class where typed parameter declarations define the input space and a benchmark method defines the function to evaluate. Bencher handles the rest: computing the Cartesian product, caching results, selecting appropriate visualizations, and composing panels.

Architecture Overview

Just as the grammar of graphics decomposes a chart into Data, Aesthetics, Geometry, and so on, Bencher decomposes a benchmark into three stages — each mapping directly onto the grammar primitives introduced below:

flowchart LR
    subgraph Problem [" "]
        direction TB
        PT["① Problem Definition"]
        subgraph PS ["ParametrizedSweep"]
            direction TB
            Inputs[FloatSweep · IntSweep · EnumSweep]
            Results[ResultFloat · ResultBool · ResultImage]
            Fn["def benchmark(self)"]
            Inputs ~~~ Results ~~~ Fn
        end
        PT ~~~ PS
    end

    subgraph Sweep [" "]
        direction TB
        ST["② Sweep Definition"]
        subgraph SW ["plot_sweep()"]
            direction TB
            IV[input_vars]
            RV[result_vars]
            CV[const_vars]
            IV ~~~ RV ~~~ CV
        end
        ST ~~~ SW
    end

    subgraph Run [" "]
        direction TB
        RT["③ Run Definition"]
        subgraph RN ["bn.run()"]
            direction TB
            Level[subsampling_divisions]
            Repeats[repeats]
            Opts[save · optimise · over_time]
            Level ~~~ Repeats ~~~ Opts
        end
        RT ~~~ RN
    end

    Problem == .to_bench() ==> Sweep == bn.run() ==> Run

    classDef title fill:none,stroke:none,color:#2a2a2a,font-size:16px
    classDef blueLight fill:#e8f4fc,stroke:#c8dff0,color:#3a6a8a
    classDef purpleLight fill:#f4ecf8,stroke:#e0d0ea,color:#5a4068
    classDef greenLight fill:#ecf6ee,stroke:#cce6d2,color:#3a5e40

    class PT,ST,RT title
    class Inputs,Results,Fn blueLight
    class IV,RV,CV purpleLight
    class Level,Repeats,Opts greenLight

    style Problem fill:#fafcfe,stroke:#c8dff0,stroke-width:2px
    style Sweep fill:#fdfafe,stroke:#e0d0ea,stroke-width:2px
    style Run fill:#fafefa,stroke:#cce6d2,stroke-width:2px
    style PS fill:#e8f4fc,stroke:#8ec0e4,stroke-width:2px,color:#2c5f7a
    style SW fill:#f4ecf8,stroke:#c4a4dc,stroke-width:2px,color:#5a3d6e
    style RN fill:#ecf6ee,stroke:#98d0a4,stroke-width:2px,color:#2e5e3a

Every auto-generated example follows this pattern:

import bencher as bn

# 1. Problem Definition: declare the parameter space and benchmark logic
class MyBench(bn.ParametrizedSweep):
    x = bn.FloatSweep(default=0, bounds=[0, 10])
    score = bn.ResultFloat(units="pts")

    def benchmark(self):
        # f is a placeholder for the function under test
        self.score = f(self.x)

# 2. Sweep Definition: choose what to sweep and what to measure
def example_my_bench(run_cfg=None):
    bench = MyBench().to_bench(run_cfg)
    bench.plot_sweep(input_vars=["x"], result_vars=["score"])
    return bench

# 3. Run Definition: set sampling density, repeats, and output options
if __name__ == "__main__":
    bn.run(example_my_bench, subsampling_divisions=4, repeats=5)

Problem Definition (ParametrizedSweep) — Declares the grammar’s Data and Aesthetics: typed input parameters (FloatSweep, EnumSweep, …) define the input space, result variables (ResultFloat, ResultImage, …) define the output space, and a benchmark method holds the evaluation logic.
Sweep Definition (plot_sweep) — Declares Scales and Statistics: configures which input parameters to vary (input_vars), which metrics to collect (result_vars), which parameters to pin (const_vars), and adds descriptions for the report.
Run Definition (bn.run() / BenchRunCfg) — Controls Scales and Statistics: subsampling_divisions sets sampling density, repeats determines statistical power, and flags like save, optimise, over_time, and publish control output and execution behavior.

Iterative Workflow

The three stages above support a natural iterative workflow — you change one stage at a time while holding the others fixed:

flowchart TD
    Define(["① Define — ParametrizedSweep"])
    Configure(["② Configure — plot_sweep()"])
    Debug(["③ Debug — bn.run( subsampling_divisions=2 )"])
    Check{"Works?"}
    Refine(["④ Refine — bn.run( subsampling_divisions=5, repeats=10 )"])
    Done{"Add params?"}

    Define --> Configure --> Debug --> Check
    Check -- "No — fix & rerun (cached)" --> Debug
    Check -- "Yes" --> Refine --> Done
    Done -- "Yes" --> Define
    Done -- "No" --> Stop([Done])

    style Define fill:#e8f4fc,stroke:#8ec0e4,color:#2c5f7a,stroke-width:2px
    style Configure fill:#f4ecf8,stroke:#c4a4dc,color:#5a3d6e,stroke-width:2px
    style Debug fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Refine fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Check fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Done fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Stop fill:#f0f0f0,stroke:#c0c0c0,color:#505050,stroke-width:2px

Define — Write a ParametrizedSweep subclass (Stage 1) with your inputs, outputs, and benchmark function.
Configure — Set up plot_sweep() calls (Stage 2) to choose which parameters to vary and which results to collect.
Debug — Run at a low subsampling_divisions with a single repeat (Stage 3: subsampling_divisions=2, repeats=1) to verify the pipeline works end-to-end. Because results are cached, fixing and re-running is cheap.
Refine — Increase subsampling_divisions and repeats (Stage 3 only) to get publication-quality statistics. Because the sample grid is refined by binary subdivision, a higher subsampling_divisions reuses all previously cached points, so you only pay for the new midpoints.

Bencher’s Primitives

Bencher’s design maps onto six primitives, each paralleling a grammar of graphics concept:

Grammar of Graphics	Bencher Equivalent	Role
Data	xarray Dataset (N-D tensor)	The Cartesian product of all input parameters
Aesthetics	Input variable types	Float maps to axis, categorical maps to color/facet
Geometry	Plot result classes	`LineResult`, `BarResult`, `HeatmapResult`, etc.
Statistics	Repeats + `ReduceType`	Mean/std/min/max over repeated measurements
Facets	Recursive panel slicing	Extra dimensions beyond plot capacity become nested subplots
Scales	`SweepBase` + subsampling divisions system	Bounds, sampling density, type-aware ranges
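
To make the Statistics row concrete, the reduction over repeated measurements can be sketched in plain Python. This uses only the standard library and is not Bencher's ReduceType implementation; the measurement values are made up for illustration.

```python
from statistics import mean, stdev

# Three repeats of a measurement at each of two sweep points (made-up numbers).
repeats_by_point = {
    0.0: [1.0, 2.0, 3.0],
    1.0: [4.0, 4.0, 4.0],
}

# Collapse the repeat dimension into summary statistics per sweep point.
summary = {
    x: {"mean": mean(v), "std": stdev(v), "min": min(v), "max": max(v)}
    for x, v in repeats_by_point.items()
}

print(summary[0.0])
```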

Parameters (Input Space)

Bencher provides typed sweep classes that declare the input space:

FloatSweep — continuous float range with bounds
IntSweep — discrete integer range with bounds
EnumSweep — Python enum members
BoolSweep — True/False
StringSweep — categorical string values

Each sweep carries metadata: bounds, default value, units, and sampling density. Parameters are defined as class attributes on a ParametrizedSweep subclass using the param library, making them introspectable and hashable. See the gallery to explore how the number of input parameters changes the visualization: 0 float, 1 float, 2 float, 3 float.

Results (Output Space)

Result types declare what a benchmark function returns:

ResultFloat — a numeric scalar with units and an optimization direction (minimize/maximize)
ResultBool — a boolean result (stored as 0/1 numeric)
ResultVec — a fixed-size numeric vector
ResultImage — a file path to an image
ResultVideo — a file path to a video
ResultPath — an arbitrary file path
ResultString — a string result
ResultDataSet — an xarray Dataset
ResultVolume — volume data

Bencher distinguishes inputs from results by type: anything that is a subclass of a result type is an output; everything else is an input. This split drives the entire downstream pipeline. See the Result Types gallery for examples of each type.
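
The type-based split can be illustrated with a few stand-in classes. The `Demo*` names below are hypothetical and exist only for this sketch; they are not Bencher's classes, and the real check may differ in detail.

```python
# Hypothetical stand-ins for sweep and result declarations.
class DemoResult:
    """Base class marking a declaration as an output."""

class DemoResultFloat(DemoResult):
    pass

class DemoFloatSweep:
    pass

class DemoEnumSweep:
    pass

declared = {
    "x": DemoFloatSweep(),
    "mode": DemoEnumSweep(),
    "score": DemoResultFloat(),
}

# Anything derived from the result base is an output; everything else is an input.
outputs = [name for name, p in declared.items() if isinstance(p, DemoResult)]
inputs = [name for name, p in declared.items() if not isinstance(p, DemoResult)]

print(inputs, outputs)
```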

Design (Sampling Strategy)

Bencher computes the full Cartesian product of all input parameter values using itertools.product. Given N parameters with sizes [s1, s2, ..., sN], the total number of evaluations is s1 * s2 * ... * sN * repeats. Each combination is represented as both an index tuple (for storage in the N-D array) and a value tuple (for passing to the benchmark function).
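
As a minimal sketch of the enumeration described above, the following builds both the index tuples and the value tuples with itertools.product and computes the evaluation count. The parameter names and values are illustrative assumptions, not taken from Bencher.

```python
from itertools import product
from math import prod

# Example input space: two parameters with 3 and 2 values (illustrative only).
values = {"x": [0.0, 0.5, 1.0], "mode": ["a", "b"]}
repeats = 2

sizes = [len(v) for v in values.values()]

# Index tuples address cells of the N-D result array.
index_tuples = list(product(*(range(s) for s in sizes)))
# Value tuples are what gets passed to the benchmark function.
value_tuples = list(product(*values.values()))

# s1 * s2 * ... * sN * repeats
total_evaluations = prod(sizes) * repeats

print(index_tuples[1], value_tuples[1], total_evaluations)
```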

See the Cartesian Animation gallery for an animated visualization of how each dimension builds on the last — from a single point to a line, grid, 3D stack, repeated measurements, and time-series film strip. The Sampling Strategies gallery shows how different sweep types (uniform, custom values, int vs float) produce different sample distributions.

Execution

Each parameter combination is hashed to produce a persistent cache key. Results are stored using diskcache, so re-running a benchmark with the same parameters skips already-computed points. The repeats meta-variable controls how many times each combination is evaluated, enabling statistical analysis of stochastic functions. Compare the no repeats and with repeats galleries to see how repeats add confidence intervals to plots. The Statistics gallery shows distributions, error bands, and the effect of different repeat counts. For caching patterns, see the Cache Patterns example.
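
The hash-then-look-up idea can be sketched as follows. This is an assumption-laden illustration: the key scheme (SHA-256 over sorted JSON) and the in-memory dict are stand-ins chosen for the example, and Bencher's actual hashing and its diskcache usage are not shown here.

```python
import hashlib
import json

def cache_key(params: dict) -> str:
    """Stable key: the same parameter values always hash to the same string."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

cache = {}              # in-memory stand-in for a persistent store
stats = {"calls": 0}    # counts how often the expensive function really runs

def expensive(x: float) -> float:
    stats["calls"] += 1
    return x * x

def evaluate(params: dict) -> float:
    key = cache_key(params)
    if key not in cache:  # only compute points not seen before
        cache[key] = expensive(params["x"])
    return cache[key]

first = evaluate({"x": 3.0})
second = evaluate({"x": 3.0})  # served from the cache, no second call

print(first, second, stats["calls"])
```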

Presentation (Automatic Plot Selection)

Bencher classifies each input parameter as either continuous (float, int) or categorical (enum, bool, string). The counts of each type, along with the number of repeats, form a data signature — a tuple (float_count, cat_count, repeats). Each plot type declares which signatures it can handle via a PlotFilter with VarRange bounds on each dimension.

The general mapping:

0 float + categories + 1 repeat — Bar chart (gallery)
1 float + categories + 1 repeat — Line plot (gallery)
1 float + categories + N repeats — Curve with spread (mean +/- std) (gallery)
2 float — Heatmap (gallery)
3+ float — Surface / Volume (gallery)
0 inputs + N repeats — Histogram / Distribution (gallery)
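
The signature and the mapping listed above can be paraphrased in a few lines of Python. This is only a reading of the list, not Bencher's dispatch code: the kind names, the plot labels, and the handling of combinations the list does not mention (returned as None) are assumptions of this sketch.

```python
CONTINUOUS = {"float", "int"}
CATEGORICAL = {"enum", "bool", "string"}

def data_signature(input_kinds: list, repeats: int) -> tuple:
    """Return (float_count, cat_count, repeats) for a list of input kinds."""
    float_count = sum(kind in CONTINUOUS for kind in input_kinds)
    cat_count = sum(kind in CATEGORICAL for kind in input_kinds)
    return (float_count, cat_count, repeats)

def default_plot(signature: tuple):
    """Map a signature to the plot named in the general mapping."""
    float_count, cat_count, repeats = signature
    if float_count >= 3:
        return "surface/volume"
    if float_count == 2:
        return "heatmap"
    if float_count == 1:
        return "line" if repeats == 1 else "curve with spread"
    if cat_count >= 1 and repeats == 1:
        return "bar"
    if cat_count == 0 and repeats > 1:
        return "histogram"
    return None  # combination not covered by the list

sig = data_signature(["float", "enum"], repeats=1)
print(sig, default_plot(sig))
```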

When there are more dimensions than a plot type can display, the extra dimensions become facets — nested panels arranged in rows and columns, automatically labeled. Users can override automatic selection with explicit .to_*() calls on the result object.
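
A rough sketch of recursive faceting, under stated assumptions: results are held as a flat list of records, the leading dimensions are the ones split into panels, and `capacity` is the number of dimensions a single plot can show. The function `facet` and the record layout are hypothetical; which dimensions Bencher actually facets on is not specified here.

```python
def facet(records: list, dims: list, capacity: int):
    """Split records on leading dims until at most `capacity` dims remain."""
    if len(dims) <= capacity:
        return records  # leaf: small enough for a single plot
    head, rest = dims[0], dims[1:]
    groups = {}
    for rec in records:
        groups.setdefault(rec[head], []).append(rec)
    # Recurse so each panel is sliced again if it still has too many dims.
    return {
        f"{head}={value}": facet(group, rest, capacity)
        for value, group in groups.items()
    }

records = [
    {"mode": mode, "x": x, "score": x * (2 if mode == "b" else 1)}
    for mode in ("a", "b")
    for x in (0.0, 0.5, 1.0)
]

# A plot that can show one input dimension: "mode" becomes the facet.
panels = facet(records, dims=["mode", "x"], capacity=1)
print(list(panels))
```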

Composition

The ComposableContainerBase framework handles spatial layout of multi-dimensional results. It supports four composition methods:

right — append horizontally (row)
down — append vertically (column)
sequence — display sequentially (animation/video)
overlay — alpha-blend on top of each other

Different backends implement these operations: ComposableContainerPanel uses Panel’s Row and Column widgets for interactive dashboards, ComposableContainerVideo uses moviepy for video compositing, and ComposableContainerDataset uses xr.concat for data merging. See the Composable Containers gallery for interactive examples of each backend and composition mode.

Automatic Plot Selection

The type signature system deserves elaboration because it is central to Bencher’s “declare, don’t configure” philosophy.

When you define a ParametrizedSweep with, say, one FloatSweep, one EnumSweep, and one ResultFloat, Bencher counts: 1 continuous input, 1 categorical input, and determines the repeat count from the run configuration. This signature (1, 1, ...) is matched against each registered plot type’s PlotFilter.

A PlotFilter specifies acceptable ranges for:

float_range — how many continuous inputs
cat_range — how many categorical inputs
repeats_range — how many repeats
panel_range — how many panel-type results (images, videos)
result_vars — how many numeric result variables
input_range — total input count

Each range is a VarRange(lower, upper) where None for upper means unbounded. A plot type matches only when all ranges are satisfied. When multiple plot types match, Bencher renders all of them, giving a multi-perspective view of the data.
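
The matching rule can be sketched with two small dataclasses. `DemoRange` and `DemoFilter` are hypothetical analogues written for this example, not Bencher's VarRange and PlotFilter; the sketch assumes inclusive bounds, covers only three of the six ranges, and the registered ranges (including the catch-all "table" entry) are invented to show several plot types matching at once.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DemoRange:
    """Inclusive range; upper=None means unbounded."""
    lower: int = 0
    upper: Optional[int] = None

    def contains(self, n: int) -> bool:
        return n >= self.lower and (self.upper is None or n <= self.upper)

@dataclass(frozen=True)
class DemoFilter:
    float_range: DemoRange
    cat_range: DemoRange
    repeats_range: DemoRange

    def matches(self, floats: int, cats: int, repeats: int) -> bool:
        # A plot type matches only when every range is satisfied.
        return (
            self.float_range.contains(floats)
            and self.cat_range.contains(cats)
            and self.repeats_range.contains(repeats)
        )

ANY = DemoRange(0, None)
FILTERS = {
    "line": DemoFilter(DemoRange(1, 1), ANY, DemoRange(1, 1)),
    "heatmap": DemoFilter(DemoRange(2, 2), ANY, DemoRange(1, None)),
    "table": DemoFilter(ANY, ANY, DemoRange(1, None)),
}

def matching_plots(floats: int, cats: int, repeats: int) -> list:
    """Return every registered plot type whose filter accepts the signature."""
    return [name for name, f in FILTERS.items() if f.matches(floats, cats, repeats)]

print(matching_plots(1, 1, 1))
print(matching_plots(2, 0, 3))
```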

This mechanism means that adding a dimension to your sweep — say, adding a second float parameter — automatically changes the visualization from line plots to heatmaps without any code changes to the plotting logic. See the Plot Types gallery for every available plot type, and the Bool Plot Types gallery for boolean-specific variants.

The Subsampling Divisions System

The subsampling_divisions parameter provides a single knob to control sampling density across all dimensions simultaneously. It indexes into a predefined sample count table:

Subsampling Divisions	1	2	3	4	5	6	7	8	9	10	11	12
Samples	1	2	3	5	9	17	33	65	129	257	513	1025

From subsampling_divisions 2 onward, the count follows the formula 2^(subsampling_divisions-2) + 1: subsampling_divisions 2 gives 2^0 + 1 = 2, subsampling_divisions 4 gives 2^2 + 1 = 5, subsampling_divisions 5 gives 2^3 + 1 = 9, subsampling_divisions 6 gives 2^4 + 1 = 17, and so on. Only subsampling_divisions 1, a single sample, falls outside the formula. Samples are distributed evenly across each parameter’s range using numpy.linspace.
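
The table can be checked against the closed form with a short script. The helper name `samples_for` is an assumption of this sketch, and treating subsampling_divisions 1 as a special case is inferred from the table rather than from Bencher's source.

```python
# Sample counts copied from the table above.
TABLE = [1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513, 1025]

def samples_for(divisions: int) -> int:
    """Samples per dimension; the closed form matches the table from 2 upward."""
    if divisions == 1:
        return 1
    return 2 ** (divisions - 2) + 1

computed = [samples_for(d) for d in range(1, 13)]
print(computed == TABLE)
```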

The 2n - 1 relationship between consecutive counts is deliberate. From subsampling_divisions 2 onward, if one value yields n samples, the next yields 2n - 1: the n existing samples plus the n - 1 midpoints between them. The new samples therefore land exactly at the midpoints between existing ones. For example, on a [0, 1] range:

Subsampling Divisions 5 (9 samples): 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
Subsampling Divisions 6 (17 samples): 0, 0.0625, 0.125, 0.1875, 0.25, …

Every sample from subsampling_divisions 5 appears at an even index in the subsampling_divisions 6 grid. The odd indices are new points filling the gaps between previous samples. This is binary subdivision — the same principle used in multigrid methods and progressive image rendering.
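
The even-index claim is easy to verify. The sketch below uses a pure-Python equivalent of an evenly spaced grid rather than numpy.linspace so that it has no dependencies; the helper name `grid` is an assumption.

```python
def grid(samples: int, lo: float = 0.0, hi: float = 1.0) -> list:
    """Evenly spaced points including both ends of [lo, hi]."""
    step = (hi - lo) / (samples - 1)
    return [lo + i * step for i in range(samples)]

coarse = grid(9)   # subsampling_divisions 5
fine = grid(17)    # subsampling_divisions 6

# Even indices of the fine grid reproduce the coarse grid.
reused = fine[::2]
# Odd indices are the new midpoints.
new_points = fine[1::2]

print(reused == coarse, len(new_points))
```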

This enables a natural workflow: start at a low subsampling_divisions for quick iteration, then increase for publication-quality results. Because the sample grid at a higher subsampling_divisions contains every point of the coarser grids, cached results from earlier runs are reused automatically, and you only pay for the new midpoints.
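
As a worked piece of arithmetic, the number of new evaluations when refining can be estimated as the difference between the two grid sizes. The sketch assumes every swept dimension is a float using the same subsampling_divisions, a single repeat, and that every previously evaluated combination is found in the cache; the helper names are hypothetical.

```python
def samples_for(divisions: int) -> int:
    """Samples per dimension, following the table above."""
    return 1 if divisions == 1 else 2 ** (divisions - 2) + 1

def new_evaluations(old: int, new: int, n_dims: int) -> int:
    """Grid points at `new` that were not already evaluated at `old`."""
    return samples_for(new) ** n_dims - samples_for(old) ** n_dims

# One dimension: 17 - 9 new midpoints.
print(new_evaluations(5, 6, n_dims=1))
# Two dimensions: 17*17 - 9*9 new grid points.
print(new_evaluations(5, 6, n_dims=2))
```

In more than one dimension the saving is proportionally smaller, because most points of the finer grid combine at least one new coordinate with the existing ones.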

See the Subsampling Divisions System gallery for an interactive demo showing how increasing the subsampling_divisions progressively refines the sample grid.