A Grammar of Benchmarking
The Grammar of Graphics
In 1999, Leland Wilkinson published The Grammar of Graphics, arguing that statistical visualizations are not a fixed taxonomy of chart types (bar, line, scatter, pie) but compositions of a small set of orthogonal components:
Data — the variables to visualize
Aesthetics — mappings from data to visual properties (position, color, size)
Geometry — the visual marks (points, lines, bars, areas)
Statistics — transformations applied before rendering (binning, smoothing, aggregation)
Scales — functions that map data values to aesthetic values (linear, log, discrete)
Coordinates — the coordinate system (Cartesian, polar, geographic)
Facets — splitting data into subplots by a categorical variable
The key insight is decomposition: instead of memorizing when to use each chart type, you compose a visualization from independent building blocks. Hadley Wickham’s ggplot2 and later Vega-Lite made this practical — you declare what you want, and the library assembles the pieces.
The pattern generalizes: take a complex domain, decompose it into a small set of composable primitives, and let combinations emerge from composition rather than enumeration.
From Graphics to Data
The grammar of graphics addresses the visualization step — it assumes the data already exists. But where does the data come from?
In benchmarking and parameter studies, data is generated by evaluating a function across combinations of input parameters. The conventional approach is to write nested for-loops, manually manage results arrays, and hand-pick plot types. This is the benchmarking equivalent of manually drawing charts: tedious, error-prone, and tightly coupled to the specific dimensionality of your experiment.
Bencher extends the grammar of graphics principle upstream to data generation. Just as
ggplot2 replaces “pick a chart type” with “compose visualization components”, Bencher replaces
“write nested for-loops” with “declare parameter spaces and compose”. The core abstraction is
ParametrizedSweep: a class where typed parameter declarations define the input space, and a
benchmark method defines the function to evaluate. Bencher handles the rest: computing the
Cartesian product, caching results, selecting appropriate visualizations, and composing panels.
Architecture Overview
Just as the grammar of graphics decomposes a chart into Data, Aesthetics, Geometry, and so on, Bencher decomposes a benchmark into three stages — each mapping directly onto the grammar primitives introduced below:
flowchart LR
subgraph Problem [" "]
direction TB
PT["① Problem Definition"]
subgraph PS ["ParametrizedSweep"]
direction TB
Inputs[FloatSweep · IntSweep · EnumSweep]
Results[ResultFloat · ResultBool · ResultImage]
Fn["def benchmark(self)"]
Inputs ~~~ Results ~~~ Fn
end
PT ~~~ PS
end
subgraph Sweep [" "]
direction TB
ST["② Sweep Definition"]
subgraph SW ["plot_sweep()"]
direction TB
IV[input_vars]
RV[result_vars]
CV[const_vars]
IV ~~~ RV ~~~ CV
end
ST ~~~ SW
end
subgraph Run [" "]
direction TB
RT["③ Run Definition"]
subgraph RN ["bn.run()"]
direction TB
Level[subsampling_divisions]
Repeats[repeats]
Opts[save · optimise · over_time]
Level ~~~ Repeats ~~~ Opts
end
RT ~~~ RN
end
Problem == .to_bench() ==> Sweep == bn.run() ==> Run
classDef title fill:none,stroke:none,color:#2a2a2a,font-size:16px
classDef blueLight fill:#e8f4fc,stroke:#c8dff0,color:#3a6a8a
classDef purpleLight fill:#f4ecf8,stroke:#e0d0ea,color:#5a4068
classDef greenLight fill:#ecf6ee,stroke:#cce6d2,color:#3a5e40
class PT,ST,RT title
class Inputs,Results,Fn blueLight
class IV,RV,CV purpleLight
class Level,Repeats,Opts greenLight
style Problem fill:#fafcfe,stroke:#c8dff0,stroke-width:2px
style Sweep fill:#fdfafe,stroke:#e0d0ea,stroke-width:2px
style Run fill:#fafefa,stroke:#cce6d2,stroke-width:2px
style PS fill:#e8f4fc,stroke:#8ec0e4,stroke-width:2px,color:#2c5f7a
style SW fill:#f4ecf8,stroke:#c4a4dc,stroke-width:2px,color:#5a3d6e
style RN fill:#ecf6ee,stroke:#98d0a4,stroke-width:2px,color:#2e5e3a
Every auto-generated example follows this pattern:
import bencher as bn

# 1. Problem Definition: declare the parameter space and benchmark logic
class MyBench(bn.ParametrizedSweep):
    x = bn.FloatSweep(default=0, bounds=[0, 10])
    score = bn.ResultFloat(units="pts")

    def benchmark(self):
        self.score = f(self.x)  # f is a placeholder for the function under test

# 2. Sweep Definition: choose what to sweep and what to measure
def example_my_bench(run_cfg=None):
    bench = MyBench().to_bench(run_cfg)
    bench.plot_sweep(input_vars=["x"], result_vars=["score"])
    return bench

# 3. Run Definition: set sampling density, repeats, and output options
if __name__ == "__main__":
    bn.run(example_my_bench, subsampling_divisions=4, repeats=5)
Problem Definition (ParametrizedSweep): Declares the grammar’s Data and Aesthetics. Typed input parameters (FloatSweep, EnumSweep, …) define the input space, result variables (ResultFloat, ResultImage, …) define the output space, and a benchmark method holds the evaluation logic.
Sweep Definition (plot_sweep): Declares Scales and Statistics. Configures which input parameters to vary (input_vars), which metrics to collect (result_vars), which parameters to pin (const_vars), and adds descriptions for the report.
Run Definition (bn.run() / BenchRunCfg): Controls Scales and Statistics. subsampling_divisions sets sampling density, repeats determines statistical power, and flags like save, optimise, over_time, and publish control output and execution behavior.
Iterative Workflow
The three stages above support a natural iterative workflow — you change one stage at a time while holding the others fixed:
flowchart TD
Define(["① Define — ParametrizedSweep"])
Configure(["② Configure — plot_sweep()"])
Debug(["③ Debug — bn.run( subsampling_divisions=2 )"])
Check{"Works?"}
Refine(["④ Refine — bn.run( subsampling_divisions=5, repeats=10 )"])
Done{"Add params?"}
Define --> Configure --> Debug --> Check
Check -- "No — fix & rerun (cached)" --> Debug
Check -- "Yes" --> Refine --> Done
Done -- "Yes" --> Define
Done -- "No" --> Stop([Done])
style Define fill:#e8f4fc,stroke:#8ec0e4,color:#2c5f7a,stroke-width:2px
style Configure fill:#f4ecf8,stroke:#c4a4dc,color:#5a3d6e,stroke-width:2px
style Debug fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
style Refine fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
style Check fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
style Done fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
style Stop fill:#f0f0f0,stroke:#c0c0c0,color:#505050,stroke-width:2px
Define: Write a ParametrizedSweep subclass (Stage 1) with your inputs, outputs, and benchmark function.
Configure: Set up plot_sweep() calls (Stage 2) to choose which parameters to vary and which results to collect.
Debug: Run with a low subsampling_divisions and few repeats (Stage 3: subsampling_divisions=2, repeats=1) to verify the pipeline works end-to-end. Because results are cached, fixing and re-running is cheap.
Refine: Increase subsampling_divisions and repeats (Stage 3 only) to get publication-quality statistics. Because subsampling_divisions refines the grid by binary subdivision, a higher value reuses all previously cached points, so you only pay for the new midpoints.
Bencher’s Primitives
Bencher’s design maps onto six primitives, each paralleling a grammar of graphics concept:
| Grammar of Graphics | Bencher Equivalent | Role |
|---|---|---|
| Data | xarray Dataset (N-D tensor) | The Cartesian product of all input parameters |
| Aesthetics | Input variable types | Float maps to axis, categorical maps to color/facet |
| Geometry | Plot result classes | |
| Statistics | Repeats + | Mean/std/min/max over repeated measurements |
| Facets | Recursive panel slicing | Extra dimensions beyond plot capacity become nested subplots |
| Scales | | Bounds, sampling density, type-aware ranges |
Parameters (Input Space)
Bencher provides typed sweep classes that declare the input space:
FloatSweep: continuous float range with bounds
IntSweep: discrete integer range with bounds
EnumSweep: Python enum members
BoolSweep: True/False
StringSweep: categorical string values
Each sweep carries metadata: bounds, default value, units, and sampling density. Parameters are
defined as class attributes on a ParametrizedSweep subclass using the param library,
making them introspectable and hashable. See the gallery to explore how the number of input
parameters changes the visualization: 0 float, 1 float, 2 float, 3 float.
Results (Output Space)
Result types declare what a benchmark function returns:
ResultFloat: a numeric scalar with units and an optimization direction (minimize/maximize)
ResultBool: a boolean result (stored as 0/1 numeric)
ResultVec: a fixed-size numeric vector
ResultImage: a file path to an image
ResultVideo: a file path to a video
ResultPath: an arbitrary file path
ResultString: a string result
ResultDataSet: an xarray Dataset
ResultVolume: volume data
Bencher distinguishes inputs from results by type: anything that is a subclass of a result type is an output; everything else is an input. This split drives the entire downstream pipeline. See the Result Types gallery for examples of each type.
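The type-based split can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins rather than Bencher's own sweep and result types, and the helper split_inputs_and_results is an illustrative name; the sketch only shows the rule "result subclass means output, everything else is an input" applied to class attributes.

```python
class ResultBase:
    """Hypothetical stand-in for the common base of all result types."""

    def __init__(self, units=""):
        self.units = units


class ResultFloat(ResultBase):
    """Stand-in result declaration (not the real Bencher class)."""


class FloatSweep:
    """Stand-in input declaration (not the real Bencher class)."""

    def __init__(self, default=0.0):
        self.default = default


class EnumSweep:
    """Stand-in categorical input declaration."""

    def __init__(self, default=None):
        self.default = default


class MyBench:
    x = FloatSweep(default=0.0)
    mode = EnumSweep(default="fast")
    score = ResultFloat(units="pts")


def split_inputs_and_results(cls):
    """Partition declared class attributes: result subclasses are outputs,
    every other public attribute is treated as an input."""
    inputs, results = [], []
    for name, value in vars(cls).items():
        if name.startswith("_"):
            continue  # skip Python's own class bookkeeping attributes
        if isinstance(value, ResultBase):
            results.append(name)
        else:
            inputs.append(name)
    return inputs, results


inputs, results = split_inputs_and_results(MyBench)
print(inputs)   # ['x', 'mode']
print(results)  # ['score']
```

Because the split depends only on the declared types, adding a new result variable to the class requires no change to the code that walks the declarations.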
Design (Sampling Strategy)
Bencher computes the full Cartesian product of all input parameter values using
itertools.product. Given N parameters with sizes [s1, s2, ..., sN], the total number of
evaluations is s1 * s2 * ... * sN * repeats. Each combination is represented as both an
index tuple (for storage in the N-D array) and a value tuple (for passing to the benchmark
function).
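As a rough illustration of the paragraph above, the following sketch builds index tuples and value tuples with itertools.product. The parameter names and values are made up for the example, and the variable names (sweep_values, jobs) are illustrative rather than Bencher internals.

```python
import itertools

# Hypothetical sweep values for two inputs; names are illustrative only.
sweep_values = {
    "x": [0.0, 5.0, 10.0],
    "mode": ["fast", "slow"],
}
repeats = 2

names = list(sweep_values)
sizes = [len(sweep_values[name]) for name in names]

# Index tuples address cells in the N-D result array.
index_tuples = list(itertools.product(*(range(size) for size in sizes)))
# Value tuples are what gets passed to the benchmark function.
value_tuples = list(itertools.product(*(sweep_values[name] for name in names)))

# Both products iterate in the same order, so they can be paired directly.
jobs = list(zip(index_tuples, value_tuples))
total_evaluations = len(jobs) * repeats

print(jobs[0])            # ((0, 0), (0.0, 'fast'))
print(total_evaluations)  # 3 * 2 * 2 = 12
```

The count grows multiplicatively with every added dimension, which is why a single density control and result caching matter for larger sweeps.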
See the Cartesian Animation gallery for an animated visualization of how each dimension builds on the last — from a single point to a line, grid, 3D stack, repeated measurements, and time-series film strip. The Sampling Strategies gallery shows how different sweep types (uniform, custom values, int vs float) produce different sample distributions.
Execution
Each parameter combination is hashed to produce a persistent cache key. Results are stored
using diskcache, so re-running a benchmark with the same parameters skips already-computed
points. The repeats meta-variable controls how many times each combination is evaluated,
enabling statistical analysis of stochastic functions. Compare the
no repeats and
with repeats galleries to see how repeats
add confidence intervals to plots. The
Statistics gallery shows distributions, error bands,
and the effect of different repeat counts. For caching patterns, see the
Cache Patterns example.
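The caching idea can be sketched with the standard library alone. This is not Bencher's actual hashing scheme: the key layout, the cache_key and evaluate helpers, and the in-memory dict standing in for a persistent store are all assumptions made for illustration. One detail that does carry over is the need for a stable digest, since Python's built-in hash() of strings is randomized per process and cannot serve as a persistent key.

```python
import hashlib
import json

cache = {}            # in-memory stand-in for a persistent on-disk store
stats = {"calls": 0}  # counts how often the benchmark function actually runs


def cache_key(params, repeat):
    """Hash a parameter combination into a stable key (hypothetical scheme)."""
    payload = json.dumps({"params": params, "repeat": repeat}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def expensive_benchmark(x):
    """Placeholder for the function under test."""
    stats["calls"] += 1
    return x * x


def evaluate(params, repeat=0):
    """Return the cached result if present, otherwise compute and store it."""
    key = cache_key(params, repeat)
    if key not in cache:
        cache[key] = expensive_benchmark(**params)
    return cache[key]


first = evaluate({"x": 3.0})
second = evaluate({"x": 3.0})           # same key, served from the cache
third = evaluate({"x": 3.0}, repeat=1)  # a new repeat index is a new key

print(first, second, third)  # 9.0 9.0 9.0
print(stats["calls"])        # 2
```

Including the repeat index in the key is one way to let repeated measurements of the same combination coexist in the cache; whether Bencher does exactly this is not stated in the text above.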
Presentation (Automatic Plot Selection)
Bencher classifies each input parameter as either continuous (float, int) or categorical
(enum, bool, string). The counts of each type, along with the number of repeats, form a
data signature — a tuple (float_count, cat_count, repeats). Each plot type declares
which signatures it can handle via a PlotFilter with VarRange bounds on each dimension.
The general mapping:
0 float + categories + 1 repeat — Bar chart (gallery)
1 float + categories + 1 repeat — Line plot (gallery)
1 float + categories + N repeats — Curve with spread (mean +/- std) (gallery)
2 float — Heatmap (gallery)
3+ float — Surface / Volume (gallery)
0 inputs + N repeats — Histogram / Distribution (gallery)
When there are more dimensions than a plot type can display, the extra dimensions become
facets — nested panels arranged in rows and columns, automatically labeled. Users can
override automatic selection with explicit .to_*() calls on the result object.
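A small sketch of how a data signature could be derived and looked up against the general mapping listed above. The kind strings and both function names are illustrative assumptions, and combinations the list does not mention return None rather than a guess.

```python
# Kinds follow the continuous/categorical split described above.
CONTINUOUS_KINDS = {"float", "int"}
CATEGORICAL_KINDS = {"enum", "bool", "string"}


def data_signature(input_kinds, repeats):
    """Return (float_count, cat_count, repeats) for a dict of name -> kind."""
    kinds = list(input_kinds.values())
    float_count = sum(kind in CONTINUOUS_KINDS for kind in kinds)
    cat_count = sum(kind in CATEGORICAL_KINDS for kind in kinds)
    return (float_count, cat_count, repeats)


def general_plot(signature):
    """Look up the general mapping; None means the list does not cover it."""
    float_count, cat_count, repeats = signature
    if float_count >= 3:
        return "surface / volume"
    if float_count == 2:
        return "heatmap"
    if float_count == 1:
        return "line plot" if repeats == 1 else "curve with spread"
    if cat_count == 0:
        return "histogram / distribution" if repeats > 1 else None
    return "bar chart" if repeats == 1 else None


sig = data_signature({"x": "float", "mode": "enum"}, repeats=1)
print(sig, general_plot(sig))  # (1, 1, 1) line plot

sig2 = data_signature({"x": "float", "y": "float"}, repeats=5)
print(sig2, general_plot(sig2))  # (2, 0, 5) heatmap
```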
Composition
The ComposableContainerBase framework handles spatial layout of multi-dimensional results. It
supports four composition methods:
right — append horizontally (row)
down — append vertically (column)
sequence — display sequentially (animation/video)
overlay — alpha-blend on top of each other
Different backends implement these operations: ComposableContainerPanel uses Panel’s Row
and Column widgets for interactive dashboards, ComposableContainerVideo uses moviepy for
video compositing, and ComposableContainerDataset uses xr.concat for data merging.
See the Composable Containers gallery for
interactive examples of each backend and composition mode.
Automatic Plot Selection
The type signature system deserves elaboration because it is central to Bencher’s “declare, don’t configure” philosophy.
When you define a ParametrizedSweep with, say, one FloatSweep, one EnumSweep, and one
ResultFloat, Bencher counts: 1 continuous input, 1 categorical input, and determines the
repeat count from the run configuration. This signature (1, 1, ...) is matched against each
registered plot type’s PlotFilter.
A PlotFilter specifies acceptable ranges for:
float_range: how many continuous inputs
cat_range: how many categorical inputs
repeats_range: how many repeats
panel_range: how many panel-type results (images, videos)
result_vars: how many numeric result variables
input_range: total input count
Each range is a VarRange(lower, upper) where None for upper means unbounded. A plot type
matches only when all ranges are satisfied. When multiple plot types match, Bencher renders
all of them, giving a multi-perspective view of the data.
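The matching rule can be sketched as follows. These dataclasses are simplified stand-ins that reuse the names VarRange and PlotFilter for readability; they model only three of the six ranges, assume inclusive bounds, and use a made-up registry (including a hypothetical catch-all "table" entry) to show several plot types matching at once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VarRange:
    """Simplified stand-in: inclusive bounds, upper=None means unbounded."""

    lower: int
    upper: Optional[int]

    def matches(self, count: int) -> bool:
        if count < self.lower:
            return False
        return self.upper is None or count <= self.upper


@dataclass(frozen=True)
class PlotFilter:
    """Simplified stand-in modelling three of the six ranges."""

    float_range: VarRange
    cat_range: VarRange
    repeats_range: VarRange

    def matches(self, float_count: int, cat_count: int, repeats: int) -> bool:
        # A plot type matches only when every range is satisfied.
        return (
            self.float_range.matches(float_count)
            and self.cat_range.matches(cat_count)
            and self.repeats_range.matches(repeats)
        )


ANY = VarRange(0, None)

# Hypothetical registry; the entries are illustrative, not Bencher's filters.
PLOT_FILTERS = {
    "line": PlotFilter(VarRange(1, 1), ANY, VarRange(1, 1)),
    "curve_with_spread": PlotFilter(VarRange(1, 1), ANY, VarRange(2, None)),
    "heatmap": PlotFilter(VarRange(2, 2), ANY, ANY),
    "table": PlotFilter(ANY, ANY, ANY),  # hypothetical catch-all
}


def matching_plots(float_count, cat_count, repeats):
    """Return every registered plot type whose filter accepts the signature."""
    return [
        name
        for name, plot_filter in PLOT_FILTERS.items()
        if plot_filter.matches(float_count, cat_count, repeats)
    ]


print(matching_plots(1, 1, 1))  # ['line', 'table']
print(matching_plots(2, 0, 5))  # ['heatmap', 'table']
```

Keeping the acceptance ranges on each plot type, rather than in a central switch statement, means a new plot type only has to declare its own filter to take part in selection.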
This mechanism means that adding a dimension to your sweep — say, adding a second float parameter — automatically changes the visualization from line plots to heatmaps without any code changes to the plotting logic. See the Plot Types gallery for every available plot type, and the Bool Plot Types gallery for boolean-specific variants.
The Subsampling Divisions System
The subsampling_divisions parameter provides a single knob to control sampling density across all dimensions
simultaneously. It indexes into a predefined sample count table:
| Subsampling Divisions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | 1 | 2 | 3 | 5 | 9 | 17 | 33 | 65 | 129 | 257 | 513 | 1025 |
From subsampling_divisions 2 onward, the count follows the formula 2^(subsampling_divisions-2) + 1: subsampling_divisions 4 gives
2^2 + 1 = 5, subsampling_divisions 5 gives 2^3 + 1 = 9, subsampling_divisions 6 gives 2^4 + 1 = 17, and so on.
Samples are distributed evenly across each parameter’s range using numpy.linspace.
The 2n - 1 relationship between consecutive counts is deliberate. From subsampling_divisions 2 onward, a value
whose grid has n samples is followed by one with exactly 2n - 1 samples, so the new samples land at the
midpoints between existing ones. For example, on a [0, 1] range:
Subsampling Divisions 5 (9 samples): 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
Subsampling Divisions 6 (17 samples): 0, 0.0625, 0.125, 0.1875, 0.25, …
Every sample from subsampling_divisions 5 appears at an even (0-based) index in the subsampling_divisions 6 grid. The odd indices are new points filling the gaps between previous samples. This is binary subdivision, the same principle used in multigrid methods and progressive image rendering.
This enables a natural workflow: start at a low subsampling_divisions for quick iteration, then increase for publication-quality results. Because higher subsampling_divisions values are strict supersets of lower ones, cached results from earlier runs are reused automatically — you only pay for the new midpoints.
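The superset property can be checked with a short sketch. The sample counts are copied from the table above; the grid helper is an illustrative stand-in that uses exact fractions in place of numpy.linspace so that membership tests are not affected by floating-point rounding.

```python
from fractions import Fraction

# Sample counts for subsampling_divisions 1..12, copied from the table above.
SAMPLE_COUNTS = [1, 2, 3, 5, 9, 17, 33, 65, 129, 257, 513, 1025]


def sample_count(divisions):
    """Look up the number of samples for a subsampling_divisions value."""
    return SAMPLE_COUNTS[divisions - 1]


def grid(divisions, lo=0, hi=1):
    """Evenly spaced samples on [lo, hi], endpoints included (exact fractions)."""
    n = sample_count(divisions)
    if n == 1:
        return [Fraction(lo)]
    step = Fraction(hi - lo, n - 1)
    return [lo + i * step for i in range(n)]


coarse = grid(5)  # 9 samples
fine = grid(6)    # 17 samples

# Every coarse sample sits at an even index of the finer grid.
print(fine[::2] == coarse)  # True

# Only the midpoints are new, so only they would need evaluating.
coarse_set = set(coarse)
new_points = [p for p in fine if p not in coarse_set]
print(len(new_points))                    # 8
print([float(p) for p in new_points[:3]])  # [0.0625, 0.1875, 0.3125]
```

With a cache keyed on parameter values, moving from 9 to 17 samples along one dimension would therefore cost 8 new evaluations along that dimension rather than 17.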
See the Subsampling Divisions System gallery for an interactive demo showing how increasing the subsampling_divisions progressively refines the sample grid.