# A Grammar of Benchmarking

## The Grammar of Graphics

In 1999, Leland Wilkinson published *The Grammar of Graphics*, arguing that statistical visualizations are not a fixed taxonomy of chart types (bar, line, scatter, pie) but compositions of a small set of orthogonal components:

- **Data** — the variables to visualize
- **Aesthetics** — mappings from data to visual properties (position, color, size)
- **Geometry** — the visual marks (points, lines, bars, areas)
- **Statistics** — transformations applied before rendering (binning, smoothing, aggregation)
- **Scales** — functions that map data values to aesthetic values (linear, log, discrete)
- **Coordinates** — the coordinate system (Cartesian, polar, geographic)
- **Facets** — splitting data into subplots by a categorical variable

The key insight is *decomposition*: instead of memorizing when to use each chart type, you compose a visualization from independent building blocks. Hadley Wickham's **ggplot2** and later **Vega-Lite** made this practical — you declare what you want, and the library assembles the pieces.

The pattern generalizes: take a complex domain, decompose it into a small set of composable primitives, and let combinations emerge from composition rather than enumeration.

## From Graphics to Data

The grammar of graphics addresses the *visualization* step — it assumes the data already exists. But where does the data come from?

In benchmarking and parameter studies, data is generated by evaluating a function across combinations of input parameters. The conventional approach is to write nested for-loops, manually manage results arrays, and hand-pick plot types. This is the benchmarking equivalent of manually drawing charts: tedious, error-prone, and tightly coupled to the specific dimensionality of your experiment.

Bencher extends the grammar of graphics principle *upstream* to data generation.
Just as ggplot2 replaces "pick a chart type" with "compose visualization components", Bencher replaces "write nested for-loops" with "declare parameter spaces and compose".

The core abstraction is `ParametrizedSweep`: a class where typed parameter declarations define the input space, and a `benchmark` method defines the function to evaluate. Bencher handles the rest — computing the Cartesian product, caching results, selecting appropriate visualizations, and composing panels.

## Architecture Overview

Just as the grammar of graphics decomposes a chart into Data, Aesthetics, Geometry, and so on, Bencher decomposes a benchmark into three stages — each mapping directly onto the grammar primitives introduced below:

```{raw} html
```

```{mermaid}
flowchart LR
    subgraph Problem [" "]
        direction TB
        PT["① Problem Definition"]
        subgraph PS ["ParametrizedSweep"]
            direction TB
            Inputs[FloatSweep · IntSweep · EnumSweep]
            Results[ResultFloat · ResultBool · ResultImage]
            Fn["def benchmark(self)"]
            Inputs ~~~ Results ~~~ Fn
        end
        PT ~~~ PS
    end
    subgraph Sweep [" "]
        direction TB
        ST["② Sweep Definition"]
        subgraph SW ["plot_sweep()"]
            direction TB
            IV[input_vars]
            RV[result_vars]
            CV[const_vars]
            IV ~~~ RV ~~~ CV
        end
        ST ~~~ SW
    end
    subgraph Run [" "]
        direction TB
        RT["③ Run Definition"]
        subgraph RN ["bn.run()"]
            direction TB
            Level[subsampling_divisions]
            Repeats[repeats]
            Opts[save · optimise · over_time]
            Level ~~~ Repeats ~~~ Opts
        end
        RT ~~~ RN
    end
    Problem == .to_bench() ==> Sweep == bn.run() ==> Run
    classDef title fill:none,stroke:none,color:#2a2a2a,font-size:16px
    classDef blueLight fill:#e8f4fc,stroke:#c8dff0,color:#3a6a8a
    classDef purpleLight fill:#f4ecf8,stroke:#e0d0ea,color:#5a4068
    classDef greenLight fill:#ecf6ee,stroke:#cce6d2,color:#3a5e40
    class PT,ST,RT title
    class Inputs,Results,Fn blueLight
    class IV,RV,CV purpleLight
    class Level,Repeats,Opts greenLight
    style Problem fill:#fafcfe,stroke:#c8dff0,stroke-width:2px
    style Sweep fill:#fdfafe,stroke:#e0d0ea,stroke-width:2px
    style Run fill:#fafefa,stroke:#cce6d2,stroke-width:2px
    style PS fill:#e8f4fc,stroke:#8ec0e4,stroke-width:2px,color:#2c5f7a
    style SW fill:#f4ecf8,stroke:#c4a4dc,stroke-width:2px,color:#5a3d6e
    style RN fill:#ecf6ee,stroke:#98d0a4,stroke-width:2px,color:#2e5e3a
```

```{raw} html
```

Every auto-generated example follows this pattern:

```python
import bencher as bn


def f(x):
    """Placeholder for the function under test."""
    return x**2


# 1. Problem Definition — declare the parameter space and benchmark logic
class MyBench(bn.ParametrizedSweep):
    x = bn.FloatSweep(default=0, bounds=[0, 10])
    score = bn.ResultFloat(units="pts")

    def benchmark(self):
        self.score = f(self.x)


# 2. Sweep Definition — choose what to sweep and what to measure
def example_my_bench(run_cfg=None):
    bench = MyBench().to_bench(run_cfg)
    bench.plot_sweep(input_vars=["x"], result_vars=["score"])
    return bench


# 3. Run Definition — set sampling density, repeats, and output options
if __name__ == "__main__":
    bn.run(example_my_bench, subsampling_divisions=4, repeats=5)
```

1. **Problem Definition** (`ParametrizedSweep`) — Declares the grammar's *Data* and *Aesthetics*: typed input parameters (`FloatSweep`, `EnumSweep`, …) define the input space, result variables (`ResultFloat`, `ResultImage`, …) define the output space, and a `benchmark` method holds the evaluation logic.
2. **Sweep Definition** (`plot_sweep`) — Declares *Scales* and *Statistics*: configures which input parameters to vary (`input_vars`), which metrics to collect (`result_vars`), which parameters to pin (`const_vars`), and adds descriptions for the report.
3. **Run Definition** (`bn.run()` / `BenchRunCfg`) — Controls *Scales* and *Statistics*: `subsampling_divisions` sets sampling density, `repeats` determines statistical power, and flags like `save`, `optimise`, `over_time`, and `publish` control output and execution behavior.

### Iterative Workflow

The three stages above support a natural iterative workflow — you change one stage at a time while holding the others fixed:

```{raw} html
```

```{mermaid}
flowchart TD
    Define(["① Define — ParametrizedSweep"])
    Configure(["② Configure — plot_sweep()"])
    Debug(["③ Debug — bn.run( subsampling_divisions=2 )"])
    Check{"Works?"}
    Refine(["④ Refine — bn.run( subsampling_divisions=5, repeats=10 )"])
    Done{"Add params?"}
    Define --> Configure --> Debug --> Check
    Check -- "No — fix & rerun (cached)" --> Debug
    Check -- "Yes" --> Refine --> Done
    Done -- "Yes" --> Define
    Done -- "No" --> Stop([Done])
    style Define fill:#e8f4fc,stroke:#8ec0e4,color:#2c5f7a,stroke-width:2px
    style Configure fill:#f4ecf8,stroke:#c4a4dc,color:#5a3d6e,stroke-width:2px
    style Debug fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Refine fill:#ecf6ee,stroke:#98d0a4,color:#2e5e3a,stroke-width:2px
    style Check fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Done fill:#fff8e8,stroke:#e0c878,color:#6a5a20,stroke-width:2px
    style Stop fill:#f0f0f0,stroke:#c0c0c0,color:#505050,stroke-width:2px
```

```{raw} html
```

1. **Define** — Write a `ParametrizedSweep` subclass (Stage 1) with your inputs, outputs, and benchmark function.
2. **Configure** — Set up `plot_sweep()` calls (Stage 2) to choose which parameters to vary and which results to collect.
3. **Debug** — Run at a low `subsampling_divisions` with few repeats (Stage 3: `subsampling_divisions=2, repeats=1`) to verify the pipeline works end-to-end. Because results are cached, fixing and re-running is cheap.
4. **Refine** — Increase `subsampling_divisions` and `repeats` (Stage 3 only) to get publication-quality statistics. Because the subsampling divisions system uses binary subdivision, a higher `subsampling_divisions` value reuses all previously cached points, so you only pay for the new midpoints.

## Bencher's Primitives

Bencher's design maps onto six primitives, each paralleling a grammar of graphics concept:

| Grammar of Graphics | Bencher Equivalent | Role |
|---|---|---|
| Data | xarray Dataset (N-D tensor) | The Cartesian product of all input parameters |
| Aesthetics | Input variable types | Float maps to axis, categorical maps to color/facet |
| Geometry | Plot result classes | `LineResult`, `BarResult`, `HeatmapResult`, etc. |
| Statistics | Repeats + `ReduceType` | Mean/std/min/max over repeated measurements |
| Facets | Recursive panel slicing | Extra dimensions beyond plot capacity become nested subplots |
| Scales | `SweepBase` + subsampling divisions system | Bounds, sampling density, type-aware ranges |

### Parameters (Input Space)

Bencher provides typed sweep classes that declare the input space:

- `FloatSweep` — continuous float range with bounds
- `IntSweep` — discrete integer range with bounds
- `EnumSweep` — Python enum members
- `BoolSweep` — True/False
- `StringSweep` — categorical string values

Each sweep carries metadata: bounds, default value, units, and sampling density. Parameters are defined as class attributes on a `ParametrizedSweep` subclass using the `param` library, making them introspectable and hashable.
See the gallery to explore how the number of input parameters changes the visualization: [0 float](reference/meta/0_float/no_repeats/index), [1 float](reference/meta/1_float/no_repeats/index), [2 float](reference/meta/2_float/no_repeats/index), [3 float](reference/meta/3_float/no_repeats/index).

### Results (Output Space)

Result types declare what a benchmark function returns:

- `ResultFloat` — a numeric scalar with units and an optimization direction (minimize/maximize)
- `ResultBool` — a boolean result (stored as 0/1 numeric)
- `ResultVec` — a fixed-size numeric vector
- `ResultImage` — a file path to an image
- `ResultVideo` — a file path to a video
- `ResultPath` — an arbitrary file path
- `ResultString` — a string result
- `ResultDataSet` — an xarray Dataset
- `ResultVolume` — volume data

Bencher distinguishes inputs from results by type: anything that is a subclass of a result type is an output; everything else is an input. This split drives the entire downstream pipeline.

See the [Result Types gallery](reference/meta/result_types/index) for examples of each type.

### Design (Sampling Strategy)

Bencher computes the full Cartesian product of all input parameter values using `itertools.product`. Given N parameters with sizes `[s1, s2, ..., sN]`, the total number of evaluations is `s1 * s2 * ... * sN * repeats`. Each combination is represented as both an index tuple (for storage in the N-D array) and a value tuple (for passing to the benchmark function).

See the [Cartesian Animation](reference/meta/cartesian_animation/index) gallery for an animated visualization of how each dimension builds on the last — from a single point to a line, grid, 3D stack, repeated measurements, and time-series film strip. The [Sampling Strategies gallery](reference/meta/sampling/index) shows how different sweep types (uniform, custom values, int vs float) produce different sample distributions.

### Execution

Each parameter combination is hashed to produce a persistent cache key.
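The enumeration described under Design, together with per-combination hashing, can be sketched in plain Python. This is a minimal illustration, not Bencher's implementation: the parameter names, the `cache_key` helper, and the choice of SHA-1 over a `repr` payload are all assumptions made for the example.

```python
import hashlib
import itertools

# Hypothetical parameter values; in Bencher these come from sweep declarations.
param_values = {
    "x": [0.0, 0.5, 1.0],          # s1 = 3
    "mode": ["fast", "accurate"],  # s2 = 2
}
repeats = 2

names = list(param_values)
sizes = [len(values) for values in param_values.values()]

# Index tuples address cells of the N-D result array;
# value tuples are what the benchmark function receives.
index_tuples = list(itertools.product(*(range(size) for size in sizes)))
value_tuples = list(itertools.product(*param_values.values()))


def cache_key(values, repeat):
    """Illustrative key: a digest that is stable across processes (assumed scheme)."""
    payload = repr((sorted(zip(names, values)), repeat))
    return hashlib.sha1(payload.encode()).hexdigest()


# One job per (combination, repeat): s1 * s2 * repeats evaluations in total.
jobs = [
    (index, values, repeat, cache_key(values, repeat))
    for index, values in zip(index_tuples, value_tuples)
    for repeat in range(repeats)
]
total = len(jobs)
```

A digest from `hashlib` is used here rather than the built-in `hash()` because string hashes from `hash()` vary between interpreter runs, which would defeat a persistent cache.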
Results are stored using `diskcache`, so re-running a benchmark with the same parameters skips already-computed points. The `repeats` meta-variable controls how many times each combination is evaluated, enabling statistical analysis of stochastic functions.

Compare the [no repeats](reference/meta/1_float/no_repeats/index) and [with repeats](reference/meta/1_float/with_repeats/index) galleries to see how repeats add confidence intervals to plots. The [Statistics gallery](reference/meta/statistics/index) shows distributions, error bands, and the effect of different repeat counts. For caching patterns, see the [Cache Patterns example](reference/meta/advanced/example_advanced_cache_patterns).

### Presentation (Automatic Plot Selection)

Bencher classifies each input parameter as either **continuous** (float, int) or **categorical** (enum, bool, string). The counts of each type, along with the number of repeats, form a *data signature* — a tuple `(float_count, cat_count, repeats)`. Each plot type declares which signatures it can handle via a `PlotFilter` with `VarRange` bounds on each dimension.

The general mapping:

- **0 float + categories + 1 repeat** — Bar chart ([gallery](reference/meta/0_float/no_repeats/index))
- **1 float + categories + 1 repeat** — Line plot ([gallery](reference/meta/1_float/no_repeats/index))
- **1 float + categories + N repeats** — Curve with spread (mean ± std) ([gallery](reference/meta/1_float/with_repeats/index))
- **2 float** — Heatmap ([gallery](reference/meta/2_float/no_repeats/index))
- **3+ float** — Surface / Volume ([gallery](reference/meta/3_float/no_repeats/index))
- **0 inputs + N repeats** — Histogram / Distribution ([gallery](reference/meta/0_float/with_repeats/index))

When there are more dimensions than a plot type can display, the extra dimensions become **facets** — nested panels arranged in rows and columns, automatically labeled. Users can override automatic selection with explicit `.to_*()` calls on the result object.
### Composition

The `ComposableContainerBase` framework handles spatial layout of multi-dimensional results. It supports four composition methods:

- **right** — append horizontally (row)
- **down** — append vertically (column)
- **sequence** — display sequentially (animation/video)
- **overlay** — alpha-blend on top of each other

Different backends implement these operations: `ComposableContainerPanel` uses Panel's `Row` and `Column` widgets for interactive dashboards, `ComposableContainerVideo` uses `moviepy` for video compositing, and `ComposableContainerDataset` uses `xr.concat` for data merging.

See the [Composable Containers gallery](reference/meta/composable_containers/index) for interactive examples of each backend and composition mode.

## Automatic Plot Selection

The type signature system deserves elaboration because it is central to Bencher's "declare, don't configure" philosophy.

When you define a `ParametrizedSweep` with, say, one `FloatSweep`, one `EnumSweep`, and one `ResultFloat`, Bencher counts: 1 continuous input, 1 categorical input, and determines the repeat count from the run configuration. This signature `(1, 1, ...)` is matched against each registered plot type's `PlotFilter`.

A `PlotFilter` specifies acceptable ranges for:

- `float_range` — how many continuous inputs
- `cat_range` — how many categorical inputs
- `repeats_range` — how many repeats
- `panel_range` — how many panel-type results (images, videos)
- `result_vars` — how many numeric result variables
- `input_range` — total input count

Each range is a `VarRange(lower, upper)` where `None` for upper means unbounded. A plot type matches only when *all* ranges are satisfied. When multiple plot types match, Bencher renders all of them, giving a multi-perspective view of the data.
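The matching rule can be illustrated with a small stand-in model. The classes below are hypothetical re-creations written for this sketch, not Bencher's own `VarRange` and `PlotFilter`: they cover only three of the six ranges, assume both bounds are inclusive, and use an invented filter table (including a catch-all `table` entry) purely to show several plot types matching at once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VarRange:
    """Count range; upper=None means unbounded (bounds assumed inclusive)."""
    lower: int
    upper: Optional[int]

    def matches(self, count: int) -> bool:
        return count >= self.lower and (self.upper is None or count <= self.upper)


@dataclass(frozen=True)
class PlotFilter:
    """Simplified filter over the (float_count, cat_count, repeats) signature."""
    float_range: VarRange
    cat_range: VarRange
    repeats_range: VarRange

    def matches(self, float_count: int, cat_count: int, repeats: int) -> bool:
        # A plot type qualifies only when every range is satisfied.
        return (
            self.float_range.matches(float_count)
            and self.cat_range.matches(cat_count)
            and self.repeats_range.matches(repeats)
        )


ANY = VarRange(0, None)

# Invented filter table for illustration; not Bencher's registered plot types.
FILTERS = {
    "bar": PlotFilter(VarRange(0, 0), ANY, VarRange(1, 1)),
    "line": PlotFilter(VarRange(1, 1), ANY, VarRange(1, 1)),
    "curve": PlotFilter(VarRange(1, 1), ANY, VarRange(2, None)),
    "heatmap": PlotFilter(VarRange(2, 2), ANY, VarRange(1, None)),
    "table": PlotFilter(ANY, ANY, VarRange(1, None)),  # hypothetical catch-all
}


def matching_plots(float_count: int, cat_count: int, repeats: int) -> list:
    """Return every plot name whose filter accepts the signature."""
    return [
        name
        for name, plot_filter in FILTERS.items()
        if plot_filter.matches(float_count, cat_count, repeats)
    ]
```

With this table, `matching_plots(1, 1, 1)` selects `line` and `table`, while raising the float count to 2 selects `heatmap` and `table` instead, with no change to the calling code.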
This mechanism means that adding a dimension to your sweep — say, adding a second float parameter — automatically changes the visualization from line plots to heatmaps without any code changes to the plotting logic.

See the [Plot Types gallery](reference/meta/plot_types/index) for every available plot type, and the [Bool Plot Types gallery](reference/meta/bool_plot_types/index) for boolean-specific variants.

## The Subsampling Divisions System

The `subsampling_divisions` parameter provides a single knob to control sampling density across all dimensions simultaneously. It indexes into a predefined sample count table:

| Subsampling Divisions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | 1 | 2 | 3 | 5 | 9 | 17 | 33 | 65 | 129 | 257 | 513 | 1025 |

For subsampling_divisions 2 and above, the count follows the formula `2^(subsampling_divisions-2) + 1`: subsampling_divisions 4 gives `2^2 + 1 = 5`, subsampling_divisions 5 gives `2^3 + 1 = 9`, subsampling_divisions 6 gives `2^4 + 1 = 17`, and so on. Only subsampling_divisions 1, a single sample, falls outside the formula. Samples are distributed evenly across each parameter's range using `numpy.linspace`.

The `2n - 1` relationship between consecutive counts is deliberate. Because each value from subsampling_divisions 3 upward has exactly twice-minus-one the samples of the previous one, the new samples land at the midpoints between existing ones. For example, on a `[0, 1]` range:

- **Subsampling Divisions 5** (9 samples): 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0
- **Subsampling Divisions 6** (17 samples): 0, 0.0625, 0.125, 0.1875, 0.25, ...

Every sample from subsampling_divisions 5 appears at an even index in the subsampling_divisions 6 grid. The odd indices are new points filling the gaps between previous samples. This is binary subdivision — the same principle used in multigrid methods and progressive image rendering.
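The nesting property is easy to check numerically. The sketch below reproduces the sample counts from the table with a small helper and compares two consecutive grids on `[0, 1]`; the helper name `samples_for` is hypothetical, and it computes the count from the formula rather than looking it up in a table.

```python
import numpy as np


def samples_for(divisions: int) -> int:
    """Sample count per the table: 1 for divisions=1, else 2^(divisions-2) + 1."""
    return 1 if divisions == 1 else 2 ** (divisions - 2) + 1


coarse = np.linspace(0.0, 1.0, samples_for(5))  # 9 points, spacing 0.125
fine = np.linspace(0.0, 1.0, samples_for(6))    # 17 points, spacing 0.0625

# Every coarse sample sits at an even index of the fine grid ...
reused = fine[::2]
# ... and the odd indices hold the new midpoints between coarse neighbours.
new_points = fine[1::2]
midpoints = (coarse[:-1] + coarse[1:]) / 2

print(np.allclose(reused, coarse))         # coarse grid is contained in the fine grid
print(np.allclose(new_points, midpoints))  # new samples are exactly the midpoints
```

Going from 9 to 17 samples therefore requires only 8 new evaluations per dimension when the earlier 9 are already cached.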
This enables a natural workflow: start at a low subsampling_divisions for quick iteration, then increase for publication-quality results. Because higher subsampling_divisions values are strict supersets of lower ones, cached results from earlier runs are reused automatically — you only pay for the new midpoints.

See the [Subsampling Divisions System gallery](reference/meta/levels/index) for an interactive demo showing how increasing the subsampling_divisions progressively refines the sample grid.

## Connections to Related Ideas

Bencher sits at the intersection of several established concepts:

- **Design of Experiments** — Factorial designs are exactly Cartesian products of factor levels. Bencher's sweep system is a programmatic way to define full factorial experiments, with the subsampling_divisions system providing fractional-factorial-like progressive refinement.
- **Tidy Data** (Wickham, 2014) — Bencher's xarray output is inherently tidy: each variable forms a dimension or coordinate, each observation is a point in the N-D grid, and each type of observational unit forms a Dataset.
- **Hyperparameter Tuning** — Frameworks like Optuna and Ray Tune solve the optimization problem over parameter spaces. Bencher solves the *visualization and analysis* problem: not "find the best point" but "understand the landscape".
- **Relational Algebra** — The Cartesian product is a fundamental operation. Bencher applies it to typed parameter domains and extends it with caching, reduction, and visualization.

Bencher occupies the space between design-of-experiments frameworks (which focus on statistical efficiency of sampling) and visualization grammars (which focus on rendering). It connects the two: declare the experiment, generate the data, and visualize the results — all from a single typed specification.