bencher.regression

Benchmark regression detection for over-time benchmarks.

Provides statistical methods to detect whether benchmark values have changed significantly between runs. Supports four detectors: a percentage threshold, an adaptive MAD-based detector with an optional percent floor for dual-band suppression, an absolute-unit delta threshold, and a fixed absolute limit.

Attributes

_METHOD_DEFAULTS

_MAD_TO_SIGMA

_DRIFT_FRAC

_HAMPEL_K

Exceptions

RegressionError

Raised when regression detection finds regressions and regression_fail is True.

Classes

RegressionResult

Result of regression detection for a single variable.

RegressionReport

Aggregates regression results for all variables in a benchmark.

MethodCells

Per-method rendering of a single regression result.

Functions

method_cells(→ MethodCells)

Build the per-method cell bundle for a RegressionResult.

_format_summary_line(→ str)

_format_markdown_row(→ str)

_regression_plot_spec(→ dict)

Prepare the data + styling used by both the matplotlib and holoviews renderers.

_ensure_matplotlib_backend_loaded(→ None)

Register the holoviews matplotlib backend without changing the default.

build_regression_overlay(result[, historical, ...])

Build a holoviews.Overlay diagnostic of a regression result.

render_regression_png(→ str)

Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.

_clean_1d(→ numpy.ndarray)

Flatten to 1-D float and remove NaNs.

_safe_change_percent(→ float)

Calculate percentage change, handling zero baseline gracefully.

_is_regression(→ bool)

Determine if a change constitutes a regression given the optimization direction.

_exceeds_directional_threshold(→ bool)

Check if change exceeds threshold in the direction-appropriate sense.

detect_percentage(→ RegressionResult)

Compare current mean vs historical mean by percentage threshold.

_robust_scale(→ tuple[float, float])

Return (median, MAD-based sigma) for a 1-D numeric array.

_residual_sigma(→ float)

Estimate step-to-step noise via MAD of first differences.

detect_adaptive(→ RegressionResult)

Robust regression detection combining step and drift tests.

detect_delta(→ RegressionResult)

Fail when the current mean's delta from history exceeds max_delta.

detect_absolute(→ RegressionResult)

Fail when current mean violates an absolute limit in the direction of OptDir.

_compute_history_arrays(→ tuple[numpy.ndarray | None, ...)

Aggregate history into per-time means + per-sample scatter arrays.

_attach_plot_metadata(→ None)

Attach the history/current arrays a RegressionResult needs for replay plotting.

detect_regressions(→ RegressionReport)

Run regression detection on a dataset with over_time dimension.

Module Contents

bencher.regression._METHOD_DEFAULTS
bencher.regression._MAD_TO_SIGMA = 1.4826
bencher.regression._DRIFT_FRAC = 0.85
bencher.regression._HAMPEL_K = 5.0
exception bencher.regression.RegressionError

Bases: Exception

Raised when regression detection finds regressions and regression_fail is True.

class bencher.regression.RegressionResult

Result of regression detection for a single variable.

variable: str
method: str
regressed: bool
current_value: float
baseline_value: float
change_percent: float
threshold: float
direction: str
details: str
band_lower: float | None = None
band_upper: float | None = None
percent_band_lower: float | None = None
percent_band_upper: float | None = None
historical: numpy.ndarray | None = None
current_samples: numpy.ndarray | None = None
historical_all: numpy.ndarray | None = None
historical_all_x: numpy.ndarray | None = None
historical_x: numpy.ndarray | None = None
current_x: numpy.ndarray | None = None
render_png(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) str

Render this result as a diagnostic PNG (see render_regression_png()).

render_overlay(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None)

Build a holoviews.Overlay of this result (see build_regression_overlay()).

class bencher.regression.RegressionReport

Aggregates regression results for all variables in a benchmark.

results: list[RegressionResult] = []
property has_regressions: bool
property regressed_variables: list[RegressionResult]
summary() str
to_markdown() str

Return a nicely formatted Markdown summary of all regression results.

append_to_report(report) None

Append a formatted regression summary to a BenchReport.

prepend_to_result(report, bench_res) None

Insert a formatted regression summary at the top of bench_res’s tab.

class bencher.regression.MethodCells

Per-method rendering of a single regression result.

Each detector has a different gate — percent ratio, MAD-sigma, absolute delta, hard limit — so the report cells must describe it in its own units. This bundle is the single source of truth consumed by both the built-in text summary and the markdown table, and is exposed as public API so downstream report builders can produce their own layouts (custom columns, non-markdown output, templated HTML, GitHub PR comments with status decoration, etc.) without reimplementing method dispatch and drifting when new detection methods are added.

Example — building a minimal custom row from a RegressionResult:

from bencher import method_cells
cells = method_cells(result)
row = f"{result.variable}: {cells.change} (gate {cells.threshold})"
change

Change column (markdown) — gated quantity in its own units.

baseline

Baseline column (markdown) — em-dash for absolute (no historical baseline exists).

threshold

Threshold column (markdown): carries the gate’s native units (±T%, a MAD-sigma multiple, ±T, or a direction-aware inequality).

summary_lead

First clause of the summary line, before the details parenthesis. Captures the gated quantity in sentence form.

summary_standalone

When True, the summary line skips the (baseline=…, current=…, threshold=…) tail because summary_lead already contains the relevant values. Used by the absolute method (no baseline, limit is in the lead).

change: str
baseline: str
threshold: str
summary_lead: str
summary_standalone: bool = False
bencher.regression.method_cells(r: RegressionResult) MethodCells

Build the per-method cell bundle for a RegressionResult.

Returns a MethodCells with pre-rendered display strings for the result’s change, baseline, and threshold, plus the summary lead clause. Dispatches on r.method so each gate describes itself in its native units. Safe to call on any RegressionResult — unknown methods fall back to the percentage-style rendering.

Intended for consumers that want to embed regression results in a custom layout while staying consistent with how the built-in RegressionReport.summary() and RegressionReport.to_markdown() present each method.

Notes on the absolute branch: baseline_value and threshold both hold the limit for this detector (see detect_absolute()); the code reads from threshold to make the intent (“this is the gate value”) explicit.

bencher.regression._format_summary_line(r: RegressionResult) str
bencher.regression._format_markdown_row(r: RegressionResult) str
bencher.regression._regression_plot_spec(result: RegressionResult, historical: numpy.ndarray | None, current: numpy.ndarray | float | None) dict

Prepare the data + styling used by both the matplotlib and holoviews renderers.

Resolves the history and current arrays from the arguments first, falling back to anything stored on result. Returns a dict of primitives the backend-specific renderers consume. Keeping this shared guarantees the PNG and in-report plots stay in sync as the diagnostic evolves.

bencher.regression._ensure_matplotlib_backend_loaded() None

Register the holoviews matplotlib backend without changing the default.

render_regression_png needs matplotlib to export a PNG, but the report path uses bokeh; calling hv.extension('matplotlib') naively would flip the global default mid-run. This function loads the matplotlib renderer if it is missing, then restores the prior default. When no matplotlib backend has been configured yet, it selects the non-interactive Agg backend (with force=False) so that holoviews doesn’t pick up Tk/Qt on a fresh process, which otherwise leaks "main thread is not in main loop" tracebacks at interpreter shutdown. If the caller has already configured a backend (e.g., Jupyter’s inline backend), that choice is left alone.

bencher.regression.build_regression_overlay(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, width: int = 700, height: int = 350, fig_inches: tuple[float, float] = (7.0, 3.5))

Build a holoviews.Overlay diagnostic of a regression result.

Opts are applied per-backend so the same overlay renders correctly under both bokeh (for embedded HTML reports) and matplotlib (for PNG export via render_regression_png()). History always shows as mean line + raw alpha scatter; regression-specific layers (acceptance band, baseline, verdict-coloured current marker) are conditional on the data in result.

Parameters:
  • result – The RegressionResult to visualise.

  • historical – Optional 1-D array of historical per-time-point means. Falls back to result.historical if omitted.

  • current – Optional current-run sample array (or scalar). Falls back to result.current_samples / result.current_value.

  • width – Plot width in pixels for the bokeh backend.

  • height – Plot height in pixels for the bokeh backend.

  • fig_inches – Figure size in inches for the matplotlib backend.

bencher.regression.render_regression_png(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) str

Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.

Produces the same plot as the in-report bokeh overlay — it calls build_regression_overlay() and hands the result to holoviews’ matplotlib renderer, so there’s a single source of truth for the diagnostic visual.

Parameters:
  • result – The RegressionResult produced by a detect_* call.

  • historical – 1-D array of historical per-time-point means. Falls back to result.historical.

  • current – Current-run sample(s). Falls back to result.current_samples / result.current_value.

  • path – Output PNG path. If None, a path is generated via bencher.utils.gen_image_path() so the file lives under the bencher cache directory.

  • figsize – Figure size in inches (matplotlib fig_inches).

  • dpi – Output DPI. Pixel size scales as figsize × dpi; a 500x320 image at dpi=100 (that is, figsize=(5.0, 3.2)) works well for GitHub comments.

Returns:

Absolute path to the saved PNG as a string.

bencher.regression._clean_1d(a: numpy.ndarray) numpy.ndarray

Flatten to 1-D float and remove NaNs.

bencher.regression._safe_change_percent(current: float, baseline: float) float

Calculate percentage change, handling zero baseline gracefully.

bencher.regression._is_regression(change_percent: float, direction: bencher.variables.results.OptDir) bool

Determine if a change constitutes a regression given the optimization direction.

bencher.regression._exceeds_directional_threshold(change_percent: float, threshold_percent: float, direction: bencher.variables.results.OptDir) bool

Check if change exceeds threshold in the direction-appropriate sense.

bencher.regression.detect_percentage(variable: str, historical: numpy.ndarray, current: numpy.ndarray, threshold_percent: float = 5.0, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult

Compare current mean vs historical mean by percentage threshold.

Simple escape hatch: one directional rule comparing the current mean against the historical mean. Same shape as detect_delta() and detect_absolute(); contrast with detect_adaptive() which layers noise modelling, drift test, and a dual-band AND gate.
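
The rule described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the library's code: the helper names percent_change and exceeds are hypothetical, plain strings stand in for OptDir members, and both the zero-baseline handling and the symmetric treatment of the "none" direction are assumptions.

```python
from statistics import fmean


def percent_change(current: float, baseline: float) -> float:
    """Percent change of current relative to baseline.

    Assumption: a zero baseline gives 0.0 when current is also zero and
    +/-inf otherwise; the library's own zero handling may differ.
    """
    if baseline == 0.0:
        if current == 0.0:
            return 0.0
        return float("inf") if current > 0 else float("-inf")
    return (current - baseline) / abs(baseline) * 100.0


def exceeds(change_percent: float, threshold_percent: float, direction: str) -> bool:
    """Directional gate: only a change in the 'worse' direction can fire."""
    if direction == "minimize":  # lower is better, so an increase is bad
        return change_percent > threshold_percent
    if direction == "maximize":  # higher is better, so a decrease is bad
        return change_percent < -threshold_percent
    return abs(change_percent) > threshold_percent  # assumed: no preferred direction


historical = [10.0, 10.2, 9.8, 10.0]
current = [11.0, 11.2]
change = percent_change(fmean(current), fmean(historical))
regressed = exceeds(change, 5.0, "minimize")
```

With a historical mean of 10.0 and a current mean of 11.1, the change is about 11%, which crosses the default 5% threshold for a minimize metric but would not count as a regression for a maximize metric.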

bencher.regression._robust_scale(values: numpy.ndarray) tuple[float, float]

Return (median, MAD-based sigma) for a 1-D numeric array.

The MAD is scaled by 1.4826 so it matches the standard deviation for Gaussian data.
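
A minimal sketch of this estimator, using only the standard library. The function name robust_scale and the list-based input are illustrative; the real helper takes a NumPy array.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # same value as _MAD_TO_SIGMA


def robust_scale(values: list[float]) -> tuple[float, float]:
    """Return (median, MAD-based sigma) for a list of floats."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return med, MAD_TO_SIGMA * mad


# One wild outlier (50.0) barely moves either statistic.
med, sigma = robust_scale([10.0, 11.0, 9.0, 10.0, 50.0])
```

Here the median is 10.0 and the MAD is 1.0, so the sigma estimate is 1.4826 even though a plain standard deviation would be dominated by the outlier.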

bencher.regression._residual_sigma(values: numpy.ndarray) float

Estimate step-to-step noise via MAD of first differences.

For data y[i] = trend[i] + eps[i] the diff y[i+1] - y[i] has variance 2 * sigma^2, so MAD(diff) * 1.4826 / sqrt(2) recovers sigma even when trend is non-stationary. This prevents a gradual drift from inflating its own noise estimate and masking itself.
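
The sketch below illustrates why differencing helps. It is an approximation under stated assumptions: the names residual_sigma and naive_sigma are hypothetical, and centring the MAD on the median of the differences is an assumption about the implementation.

```python
import math
from statistics import median

MAD_TO_SIGMA = 1.4826


def residual_sigma(values: list[float]) -> float:
    """Noise estimate from first differences: MAD(diff) * 1.4826 / sqrt(2)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    med = median(diffs)
    mad = median(abs(d - med) for d in diffs)
    return MAD_TO_SIGMA * mad / math.sqrt(2.0)


def naive_sigma(values: list[float]) -> float:
    """MAD-based sigma taken directly on the values, for comparison."""
    med = median(values)
    return MAD_TO_SIGMA * median(abs(v - med) for v in values)


# A noiseless linear drift: every step is exactly 1.0.
trend = [float(i) for i in range(10)]
sigma_diff = residual_sigma(trend)
sigma_naive = naive_sigma(trend)
```

On this pure trend the differenced estimate is 0.0, while the naive estimate treats the drift itself as noise and reports a large sigma, which is the self-masking effect the docstring warns about.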

bencher.regression.detect_adaptive(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, regression_mad: float = 3.5, drift_threshold: float | None = None, mk_alpha: float = 0.1, direction: bencher.variables.results.OptDir = OptDir.minimize, historical_samples: numpy.ndarray | None = None, regression_percentage: float | None = None) RegressionResult

Robust regression detection combining step and drift tests.

The method estimates the metric’s inherent noise from history using a median + MAD (median absolute deviation) scale and expresses the current run’s deviation in those noise units. Two orthogonal tests run in parallel:

  • Short-term step — flags if (current_mean - baseline) / noise_floor exceeds regression_mad in the regression direction.

  • Long-term drift — fits a Theil–Sen slope on the historical time-point means (after a Hampel filter removes isolated outliers) and flags if the total projected drift, scaled by noise_floor, exceeds drift_threshold and a Mann–Kendall test confirms monotonic trend with p < mk_alpha.
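
The drift test above can be sketched with standard-library Python. Several points are assumptions rather than the library's confirmed behaviour: the Hampel filter is omitted, "total projected drift" is taken as slope × (n - 1), the Mann–Kendall p-value uses the normal approximation without tie correction, only the minimize direction is shown, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(y: list[float]) -> float:
    """Median of all pairwise slopes, with x taken as 0, 1, 2, ..."""
    return median((y[j] - y[i]) / (j - i) for i, j in combinations(range(len(y)), 2))


def mann_kendall_p(y: list[float]) -> float:
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(y)
    s = sum((y[j] > y[i]) - (y[j] < y[i]) for i, j in combinations(range(n), 2))
    if s == 0:
        return 1.0
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - 1 if s > 0 else s + 1) / math.sqrt(var_s)  # continuity correction
    return math.erfc(abs(z) / math.sqrt(2.0))


def drift_fired(history, noise_floor, drift_threshold, mk_alpha=0.1) -> bool:
    """Minimize metric: upward drift in noise units, confirmed by Mann-Kendall."""
    total_drift = theil_sen_slope(history) * (len(history) - 1)
    return total_drift / noise_floor > drift_threshold and mann_kendall_p(history) < mk_alpha


rising = [10.0 + 0.1 * i for i in range(8)]
flat = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.0]
slope = theil_sen_slope(rising)
rising_flagged = drift_fired(rising, noise_floor=0.05, drift_threshold=2.975)
flat_flagged = drift_fired(flat, noise_floor=0.05, drift_threshold=2.975)
```

The steadily rising series has a slope of about 0.1 per run and a small Mann–Kendall p-value, so it is flagged; the flat but noisy series has a median slope of zero and is not.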

Parameters:
  • variable – Name of the result variable being checked.

  • historical_time_means – 1-D array of per-time-point mean values from history (one entry per prior run).

  • current – Current run values (will be averaged).

  • regression_mad – Step-test threshold in MAD-sigma units.

  • drift_threshold – Drift-test threshold in MAD-sigma units. If None, defaults to _DRIFT_FRAC * regression_mad so users need to tune only one knob.

  • mk_alpha – Significance level for the Mann–Kendall trend guard.

  • direction – Optimization direction from the result variable.

  • historical_samples – Optional flat array of all historical samples (not per-time means). Used for the sparse-history fallback so the delegated percentage detector sees the same input it would have received from detect_regressions directly. Falls back to historical_time_means when not provided.

  • regression_percentage – Optional minimum percent change required to flag a regression (directional, i.e. interpreted against direction). When set, acts as a second acceptance band: a regression fires only when BOTH the MAD test and the percent change exceed their thresholds. Suppresses noise-floor false positives on metrics with few repeats or very tight history.
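
How the regression_mad and regression_percentage parameters combine can be sketched as follows. This is a simplified illustration for a minimize metric only; step_regressed is a hypothetical name, the noise floor is passed in rather than estimated, and the library's actual gate may differ in detail.

```python
DRIFT_FRAC = 0.85  # mirrors _DRIFT_FRAC


def step_regressed(current_mean, baseline, noise_floor,
                   regression_mad=3.5, regression_percentage=None) -> bool:
    """Step test for a minimize metric (higher is worse)."""
    z = (current_mean - baseline) / noise_floor  # deviation in noise units
    mad_fired = z > regression_mad
    if regression_percentage is None:
        return mad_fired
    change_percent = (current_mean - baseline) / abs(baseline) * 100.0
    # Dual-band AND gate: both the MAD test and the percent floor must fire.
    return mad_fired and change_percent > regression_percentage


# Default drift threshold when drift_threshold is None.
drift_threshold = DRIFT_FRAC * 3.5

tight_history = step_regressed(101.0, 100.0, noise_floor=0.1)
with_floor = step_regressed(101.0, 100.0, noise_floor=0.1, regression_percentage=5.0)
```

With a very tight history (noise floor 0.1), a 1% increase sits ten noise units above baseline and fires the MAD test on its own, but a 5% floor suppresses it. The default drift threshold works out to 0.85 × 3.5 = 2.975.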

bencher.regression.detect_delta(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, max_delta: float, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult

Fail when the current mean’s delta from history exceeds max_delta.

Simple escape hatch: one directional rule on the absolute-unit delta between the current mean and the mean of all historical per-time means. minimize fails when curr - hist_mean > max_delta; maximize fails when hist_mean - curr > max_delta; none uses |delta|. Same shape as detect_percentage() and detect_absolute(); contrast with detect_adaptive() which layers noise modelling and drift testing. Selected via regression_method='delta'.
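
The three directional rules translate directly into code. In this sketch delta_regressed is a hypothetical name and plain strings stand in for OptDir members.

```python
from statistics import fmean


def delta_regressed(hist_time_means, current, max_delta, direction="minimize") -> bool:
    """Apply the delta rule in the absolute units of the metric."""
    delta = fmean(current) - fmean(hist_time_means)
    if direction == "minimize":
        return delta > max_delta       # curr - hist_mean > max_delta
    if direction == "maximize":
        return -delta > max_delta      # hist_mean - curr > max_delta
    return abs(delta) > max_delta      # none: |delta|


hist = [10.0, 10.0, 10.0]
curr = [10.6, 10.8]  # mean 10.7, so delta is about +0.7
as_minimize = delta_regressed(hist, curr, max_delta=0.5, direction="minimize")
as_maximize = delta_regressed(hist, curr, max_delta=0.5, direction="maximize")
as_none = delta_regressed(hist, curr, max_delta=0.5, direction="none")
```

An increase of about 0.7 fails a minimize metric and the direction-free check, but passes for a maximize metric because the value moved in the good direction.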

bencher.regression.detect_absolute(variable: str, current: numpy.ndarray, limit: float, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult

Fail when current mean violates an absolute limit in the direction of OptDir.

Simple escape hatch: one directional rule against a fixed limit — no historical data required. For OptDir.minimize limit is a ceiling; for OptDir.maximize it’s a floor; OptDir.none records a non-regressed result and leaves it to the caller to log. Same shape as detect_percentage() and detect_delta(); contrast with detect_adaptive() which needs history to estimate noise.
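
A minimal sketch of the ceiling/floor rule. The name absolute_regressed is hypothetical, strings stand in for OptDir members, and the use of strict inequalities at the limit is an assumption.

```python
from statistics import fmean


def absolute_regressed(current, limit, direction="minimize") -> bool:
    """Compare the current mean against a fixed limit; no history needed."""
    mean = fmean(current)
    if direction == "minimize":
        return mean > limit   # limit acts as a ceiling
    if direction == "maximize":
        return mean < limit   # limit acts as a floor
    return False              # none: never flagged, caller decides what to log


samples = [95.0, 105.0]  # mean 100.0
over_ceiling = absolute_regressed(samples, limit=90.0, direction="minimize")
under_floor = absolute_regressed(samples, limit=120.0, direction="maximize")
no_direction = absolute_regressed(samples, limit=90.0, direction="none")
```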

bencher.regression._compute_history_arrays(da: xarray.DataArray) tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]

Aggregate history into per-time means + per-sample scatter arrays.

Returns (time_means, hist_samples_flat, hist_x_flat) or all-None when there is no history to summarise. Per-time means collapse every non-time dim into one scalar per run so detection and plotting both see a 1-D series; the scatter arrays preserve per-repeat spread broadcast against the historical over_time coords.

bencher.regression._attach_plot_metadata(result: RegressionResult, *, time_coord: numpy.ndarray, current_samples: numpy.ndarray, time_means: numpy.ndarray | None, hist_samples_flat: numpy.ndarray | None, hist_x_flat: numpy.ndarray | None) None

Attach the history/current arrays a RegressionResult needs for replay plotting.

bencher.regression.detect_regressions(dataset: xarray.Dataset, bench_cfg, run_cfg) RegressionReport

Run regression detection on a dataset with over_time dimension.

For each numeric result variable, dispatches to the detector chosen by run_cfg.regression_method (percentage, adaptive, delta, or absolute). absolute runs even with a single over_time point since it needs no baseline; every other method requires history.

Parameters:
  • dataset – xarray Dataset with an over_time dimension.

  • bench_cfg – BenchCfg with result_vars list.

  • run_cfg – BenchRunCfg. Reads regression_method and its method-specific threshold: regression_percentage for percentage; regression_mad (plus regression_percentage as a dual-band gate) for adaptive; regression_delta for delta; regression_absolute for absolute.

Returns:

RegressionReport with one result per variable per fired detector/guard.
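
The mapping from regression_method to the run_cfg fields it reads can be summarised as a lookup table. This sketch is illustrative only: THRESHOLD_FIELDS, NEEDS_HISTORY, and thresholds_for are hypothetical names, and a SimpleNamespace stands in for a real BenchRunCfg.

```python
from types import SimpleNamespace

# Which run_cfg attributes each method reads, per the parameter notes above.
THRESHOLD_FIELDS = {
    "percentage": ("regression_percentage",),
    "adaptive": ("regression_mad", "regression_percentage"),
    "delta": ("regression_delta",),
    "absolute": ("regression_absolute",),
}

# Only the absolute method can run on a single over_time point.
NEEDS_HISTORY = {method: method != "absolute" for method in THRESHOLD_FIELDS}


def thresholds_for(run_cfg) -> dict:
    """Collect the thresholds relevant to the configured method."""
    fields = THRESHOLD_FIELDS[run_cfg.regression_method]
    return {name: getattr(run_cfg, name, None) for name in fields}


run_cfg = SimpleNamespace(
    regression_method="adaptive",
    regression_mad=3.5,
    regression_percentage=None,
)
picked = thresholds_for(run_cfg)
```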