bencher.regression

Benchmark regression detection for over-time benchmarks.

Provides statistical methods to detect whether benchmark values have changed significantly between runs. Supports percentage, delta, and absolute thresholds, plus an adaptive MAD-based detector with an optional percent floor for dual-band suppression.

Attributes

`_METHOD_DEFAULTS`
`_MAD_TO_SIGMA`
`_DRIFT_FRAC`
`_HAMPEL_K`
`_YOUNG_MARKER`
`_METHOD_THRESHOLD_ATTR`
`_HISTORY_FREE_METHODS`

Exceptions

RegressionError

Raised when regression detection finds regressions and regression_fail is True.

Classes

`RegressionResult`	Result of regression detection for a single variable.
`RegressionReport`	Aggregates regression results for all variables in a benchmark.
`MethodCells`	Per-method rendering of a single regression result.

Functions

`method_cells`(→ MethodCells)	Build the per-method cell bundle for a `RegressionResult`.
`_format_summary_line`(→ str)
`_format_markdown_row`(→ str)
`_regression_plot_spec`(→ dict)	Prepare the data + styling used by both the matplotlib and holoviews renderers.
`_ensure_matplotlib_backend_loaded`(→ None)	Register the holoviews matplotlib backend without changing the default.
`build_regression_overlay`(result[, historical, ...])	Build a `holoviews.Overlay` diagnostic of a regression result.
`render_regression_png`(result[, historical, ...])	Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.
`_clean_1d`(→ numpy.ndarray)	Flatten to 1-D float and remove NaNs.
`_finite_or_none`(→ float \| None)	Coerce a float to a strict-JSON-safe value: non-finite (NaN/inf) -> None.
`_safe_change_percent`(→ float)	Calculate percentage change, handling zero baseline gracefully.
`_is_regression`(→ bool)	Determine if a change constitutes a regression given the optimization direction.
`_exceeds_directional_threshold`(→ bool)	Check if change exceeds threshold in the direction-appropriate sense.
`detect_percentage`(→ RegressionResult)	Compare current mean vs historical mean by percentage threshold.
`_robust_scale`(→ tuple[float, float])	Return (median, MAD-based sigma) for a 1-D numeric array.
`_residual_sigma`(→ float)	Estimate step-to-step noise via MAD of first differences.
`detect_adaptive`(→ RegressionResult \| None)	Robust regression detection combining step and drift tests.
`detect_delta`(→ RegressionResult)	Fail when the current mean's delta from history exceeds `max_delta`.
`detect_absolute`(→ RegressionResult \| None)	Fail when current mean violates an absolute limit in the direction of OptDir.
`_compute_history_arrays`(→ tuple[numpy.ndarray \| None, ...)	Aggregate history into per-time means + per-sample scatter arrays.
`_attach_plot_metadata`(→ None)	Attach the history/current arrays a RegressionResult needs for replay plotting.
`_valid_threshold`(→ float \| None)	Return `value` as a finite float, or `None` if it isn't one.
`_normalize_overrides`(→ tuple[dict, dict])	Validate `run_cfg.regression_overrides` into `{var: {method: threshold}}`.
`_run_check`(→ RegressionResult \| None)	Dispatch one (method, threshold) check for a single variable.
`_history_points_since_birth`(→ int)	Historical over_time points available for da, excluding the current run.
`detect_regressions`(→ RegressionReport)	Run regression detection on a dataset with over_time dimension.

Module Contents

bencher.regression._METHOD_DEFAULTS

bencher.regression._MAD_TO_SIGMA = 1.4826

bencher.regression._DRIFT_FRAC = 0.85

bencher.regression._HAMPEL_K = 5.0

bencher.regression._YOUNG_MARKER = '†'

exception bencher.regression.RegressionError

Bases: Exception

Raised when regression detection finds regressions and regression_fail is True.

class bencher.regression.RegressionResult

Result of regression detection for a single variable.

variable: str

method: str

regressed: bool

current_value: float

baseline_value: float

change_percent: float

threshold: float

direction: str

details: str

band_lower: float | None = None

band_upper: float | None = None

percent_band_lower: float | None = None

percent_band_upper: float | None = None

historical: numpy.ndarray | None = None

current_samples: numpy.ndarray | None = None

historical_all: numpy.ndarray | None = None

historical_all_x: numpy.ndarray | None = None

historical_x: numpy.ndarray | None = None

current_x: numpy.ndarray | None = None

young_baseline: bool = False

render_png(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) → str: Render this result as a diagnostic PNG (see render_regression_png()).

render_overlay(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None): Build a holoviews.Overlay of this result (see build_regression_overlay()).

to_dict() → dict

Return a JSON-serializable summary of this result.

Emits only scalar fields — the numpy historical/current_samples arrays (kept for replotting) are intentionally omitted. Non-finite floats (NaN/inf, e.g. a zero-baseline percent change) become None so the output is strict, json.dumps-able JSON.

class bencher.regression.RegressionReport

Aggregates regression results for all variables in a benchmark.

results: list[RegressionResult] = []

property has_regressions: bool

property has_blocking_regressions: bool

True when any regression has a mature baseline and may fail the run.

Regressions on young baselines (see regression_min_history) are notify-only: reported in the summary/export but never blocking.
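
The difference between has_regressions and has_blocking_regressions can be sketched as follows. This is a minimal illustration, not the library's implementation: Result and Report are hypothetical stand-ins for RegressionResult and RegressionReport, and the rule "regressed and not young_baseline" is an assumption drawn from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Result:  # hypothetical stand-in for RegressionResult
    variable: str
    regressed: bool
    young_baseline: bool = False


@dataclass
class Report:  # hypothetical stand-in for RegressionReport
    results: list = field(default_factory=list)

    @property
    def has_regressions(self) -> bool:
        # Any regression at all, blocking or notify-only.
        return any(r.regressed for r in self.results)

    @property
    def has_blocking_regressions(self) -> bool:
        # Assumption: a regression on a young baseline is reported but never blocks.
        return any(r.regressed and not r.young_baseline for r in self.results)


report = Report([Result("latency", True, young_baseline=True), Result("fps", False)])
print(report.has_regressions, report.has_blocking_regressions)
```

Here the only regression sits on a young baseline, so it is visible in the report but would not fail the run.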

property regressed_variables: list[RegressionResult]

summary() → str

to_markdown() → str: Return a nicely formatted Markdown summary of all regression results.

to_dict() → dict

Return a JSON-serializable summary of all regression results.

Mirrors to_markdown()/summary() but emits structured data for agents and CI to consume instead of prose.

append_to_report(report) → None: Append a formatted regression summary to a BenchReport.

prepend_to_result(report, bench_res) → None: Insert a formatted regression summary at the top of bench_res’s tab.

class bencher.regression.MethodCells

Per-method rendering of a single regression result.

Each detector has a different gate — percent ratio, MAD-sigma, absolute delta, hard limit — so the report cells must describe it in its own units. This bundle is the single source of truth consumed by both the built-in text summary and the markdown table, and is exposed as public API so downstream report builders can produce their own layouts (custom columns, non-markdown output, templated HTML, GitHub PR comments with status decoration, etc.) without reimplementing method dispatch and drifting when new detection methods are added.

Example — building a minimal custom row from a RegressionResult:

from bencher import method_cells
cells = method_cells(result)
row = f"{result.variable}: {cells.change} (gate {cells.threshold})"

change: Change column (markdown) — gated quantity in its own units.

baseline: Baseline column (markdown) — em-dash for absolute (no historical baseline exists).

threshold: Threshold column (markdown) — carries the gate’s native units (±T%, Tσ, ±T, or a direction-aware inequality).

summary_lead: First clause of the summary line, before the details parenthesis. Captures the gated quantity in sentence form.

summary_standalone: When True, the summary line skips the (baseline=…, current=…, threshold=…) tail because summary_lead already contains the relevant values. Used by the absolute method (no baseline, limit is in the lead).

change: str

baseline: str

threshold: str

summary_lead: str

summary_standalone: bool = False

bencher.regression.method_cells(r: RegressionResult) → MethodCells

Build the per-method cell bundle for a RegressionResult.

Returns a MethodCells with pre-rendered display strings for the result’s change, baseline, and threshold, plus the summary lead clause. Dispatches on r.method so each gate describes itself in its native units. Safe to call on any RegressionResult — unknown methods fall back to the percentage-style rendering.

Intended for consumers that want to embed regression results in a custom layout while staying consistent with how the built-in RegressionReport.summary() and RegressionReport.to_markdown() present each method.

Notes on the absolute branch: baseline_value and threshold both hold the limit for this detector (see detect_absolute()); the code reads from threshold to make the intent (“this is the gate value”) explicit.

bencher.regression._format_summary_line(r: RegressionResult) → str

bencher.regression._format_markdown_row(r: RegressionResult) → str

bencher.regression._regression_plot_spec(result: RegressionResult, historical: numpy.ndarray | None, current: numpy.ndarray | float | None) → dict

Prepare the data + styling used by both the matplotlib and holoviews renderers.

Resolves the history and current arrays from the arguments first, falling back to anything stored on result. Returns a dict of primitives the backend-specific renderers consume. Keeping this shared guarantees the PNG and in-report plots stay in sync as the diagnostic evolves.

bencher.regression._ensure_matplotlib_backend_loaded() → None

Register the holoviews matplotlib backend without changing the default.

render_regression_png needs matplotlib to export a PNG, but the report path uses bokeh, so calling hv.extension('matplotlib') naively would flip the global default mid-run. This loads the renderer if missing, then restores the prior default. Selects the non-interactive Agg backend when no matplotlib backend has been configured yet (force=False), so holoviews doesn't pick up Tk/Qt on a fresh process (which leaks "main thread is not in main loop" tracebacks at interpreter shutdown). If the caller has already configured a backend (e.g., Jupyter's inline backend), that choice is left alone.

bencher.regression.build_regression_overlay(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, width: int = 700, height: int = 350, fig_inches: tuple[float, float] = (7.0, 3.5))

Build a holoviews.Overlay diagnostic of a regression result.

Opts are applied per-backend so the same overlay renders correctly under both bokeh (for embedded HTML reports) and matplotlib (for PNG export via render_regression_png()). History always shows as mean line + raw alpha scatter; regression-specific layers (acceptance band, baseline, verdict-coloured current marker) are conditional on the data in result.

Parameters:

result – The RegressionResult to visualise.
historical – Optional 1-D array of historical per-time-point means. Falls back to result.historical if omitted.
current – Optional current-run sample array (or scalar). Falls back to result.current_samples / result.current_value.
width – Plot width in pixels for the bokeh backend.
height – Plot height in pixels for the bokeh backend.
fig_inches – Figure size in inches for the matplotlib backend.

bencher.regression.render_regression_png(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) → str

Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.

Produces the same plot as the in-report bokeh overlay — it calls build_regression_overlay() and hands the result to holoviews’ matplotlib renderer, so there’s a single source of truth for the diagnostic visual.

Parameters:

result – The RegressionResult produced by a detect_* call.
historical – 1-D array of historical per-time-point means. Falls back to result.historical.
current – Current-run sample(s). Falls back to result.current_samples / result.current_value.
path – Output PNG path. If None, a path is generated via bencher.utils.gen_image_path() so the file lives under the bencher cache directory.
figsize – Figure size in inches (matplotlib fig_inches).
dpi – Output DPI. A 500x320 pixel image, i.e. figsize=(5.0, 3.2) at dpi=100, works well for GitHub comments.

Returns:

Absolute path to the saved PNG as a string.

bencher.regression._clean_1d(a: numpy.ndarray) → numpy.ndarray: Flatten to 1-D float and remove NaNs.

bencher.regression._finite_or_none(value: float | None) → float | None: Coerce a float to a strict-JSON-safe value: non-finite (NaN/inf) -> None.
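
Why the coercion matters for strict JSON can be shown with a small sketch. finite_or_none below is a hypothetical stand-in written for this example, not the private helper itself; json.dumps(..., allow_nan=False) is the standard-library way to reject NaN/inf.

```python
import json
import math


def finite_or_none(value):
    """Hypothetical stand-in: map None, NaN and +/-inf to None; keep finite floats."""
    if value is None or not math.isfinite(value):
        return None
    return float(value)


payload = {
    "change_percent": finite_or_none(float("inf")),  # e.g. a zero-baseline change
    "current_value": finite_or_none(1.5),
}
# allow_nan=False raises ValueError on NaN/inf, so this only succeeds
# because the non-finite value was replaced by None beforehand.
text = json.dumps(payload, allow_nan=False)
print(text)
```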

bencher.regression._safe_change_percent(current: float, baseline: float) → float: Calculate percentage change, handling zero baseline gracefully.

bencher.regression._is_regression(change_percent: float, direction: bencher.variables.results.OptDir) → bool: Determine if a change constitutes a regression given the optimization direction.

bencher.regression._exceeds_directional_threshold(change_percent: float, threshold_percent: float, direction: bencher.variables.results.OptDir) → bool: Check if change exceeds threshold in the direction-appropriate sense.
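
A direction-aware threshold check might look like the sketch below. It is an assumption-laden illustration: direction is passed as a plain string standing in for OptDir, and the strict inequalities mirror the rule documented for detect_delta() rather than the private helper's actual code.

```python
def exceeds_directional_threshold(change_percent, threshold_percent, direction):
    """Hypothetical sketch; direction is 'minimize', 'maximize' or 'none'."""
    if direction == "minimize":
        # Lower is better, so only an increase beyond the threshold counts.
        return change_percent > threshold_percent
    if direction == "maximize":
        # Higher is better, so only a decrease beyond the threshold counts.
        return change_percent < -threshold_percent
    # No preferred direction: any large change counts.
    return abs(change_percent) > threshold_percent


print(exceeds_directional_threshold(6.0, 5.0, "minimize"))
print(exceeds_directional_threshold(-6.0, 5.0, "minimize"))
```

Under these assumptions a 6% speed-up on a minimize metric is not flagged, while a 6% slow-down is.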

bencher.regression.detect_percentage(variable: str, historical: numpy.ndarray, current: numpy.ndarray, threshold_percent: float = 5.0, direction: bencher.variables.results.OptDir = OptDir.minimize) → RegressionResult

Compare current mean vs historical mean by percentage threshold.

Simple escape hatch: one directional rule comparing the current mean against the historical mean. Same shape as detect_delta() and detect_absolute(); contrast with detect_adaptive() which layers noise modelling, drift test, and a dual-band AND gate.

bencher.regression._robust_scale(values: numpy.ndarray) → tuple[float, float]

Return (median, MAD-based sigma) for a 1-D numeric array.

The MAD is scaled by 1.4826 so it matches the standard deviation for Gaussian data.
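
The median/MAD estimate can be sketched with the standard library alone. robust_scale below is a hypothetical re-implementation for illustration (the real helper takes a numpy array); the constant 1.4826 is the one quoted above.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # makes the MAD comparable to a Gaussian standard deviation


def robust_scale(values):
    """Hypothetical sketch: return (median, MAD-based sigma) for a list of floats."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return center, mad * MAD_TO_SIGMA


# One wild outlier (50.0) barely moves either estimate.
center, sigma = robust_scale([10.0, 10.2, 9.8, 10.1, 9.9, 50.0])
print(center, sigma)
```

A plain mean and standard deviation on the same data would be dominated by the outlier, which is the reason for preferring the robust pair.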

bencher.regression._residual_sigma(values: numpy.ndarray) → float

Estimate step-to-step noise via MAD of first differences.

For data y[i] = trend[i] + eps[i] the diff y[i+1] - y[i] has variance 2 * sigma^2, so MAD(diff) * 1.4826 / sqrt(2) recovers sigma even when trend is non-stationary. This prevents a gradual drift from inflating its own noise estimate and masking itself.
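
The first-difference trick can be illustrated with a short sketch. residual_sigma here is a hypothetical stdlib version of the idea described above, and the synthetic series (linear trend plus unit Gaussian noise, fixed seed) is made up for the demonstration.

```python
import math
import random
from statistics import median, pstdev


def residual_sigma(values):
    """Hypothetical sketch: noise sigma from the MAD of first differences."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    center = median(diffs)
    mad = median([abs(d - center) for d in diffs])
    # Var(y[i+1] - y[i]) = 2 * sigma^2, hence the division by sqrt(2).
    return mad * 1.4826 / math.sqrt(2)


rng = random.Random(0)
# Steady upward drift of 0.5 per step, with noise of true sigma = 1.0.
series = [0.5 * i + rng.gauss(0.0, 1.0) for i in range(200)]

noise = residual_sigma(series)  # close to the true sigma
naive = pstdev(series)          # inflated by the trend itself
print(noise, naive)
```

The naive standard deviation mostly measures the drift, whereas the differenced estimate stays near the noise level, so a later drift test is not masked by its own trend.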

bencher.regression.detect_adaptive(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, regression_mad: float = 3.5, drift_threshold: float | None = None, mk_alpha: float = 0.1, direction: bencher.variables.results.OptDir = OptDir.minimize, historical_samples: numpy.ndarray | None = None, regression_percentage: float | None = None, sparse_fallback: bool = True) → RegressionResult | None

Robust regression detection combining step and drift tests.

The method estimates the metric’s inherent noise from history using a median + MAD (median absolute deviation) scale and expresses the current run’s deviation in those noise units. Two orthogonal tests run in parallel:

Short-term step — flags if (current_mean - baseline) / noise_floor exceeds regression_mad in the regression direction.
Long-term drift — fits a Theil–Sen slope on the historical time-point means (after a Hampel filter removes isolated outliers) and flags if the total projected drift, scaled by noise_floor, exceeds drift_threshold and a Mann–Kendall test confirms monotonic trend with p < mk_alpha.
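
The two tests above can be sketched in plain Python. Everything here is illustrative: the helper names, the sample history, the fixed noise_floor, the "slope times span" definition of total drift, and the tie-free normal approximation for Mann-Kendall are assumptions, and the Hampel pre-filter is omitted for brevity. The direction is taken to be minimize, so an increase is a regression.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(y):
    """Median of all pairwise slopes, assuming evenly spaced points x = 0, 1, 2, ..."""
    return median([(y[j] - y[i]) / (j - i) for i, j in combinations(range(len(y)), 2)])


def mann_kendall_p(y):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(y)
    s = sum((y[j] > y[i]) - (y[j] < y[i]) for i, j in combinations(range(n), 2))
    if s == 0:
        return 1.0
    var_s = n * (n - 1) * (2 * n + 5) / 18
    z = (s - 1) / math.sqrt(var_s) if s > 0 else (s + 1) / math.sqrt(var_s)
    return math.erfc(abs(z) / math.sqrt(2))


history = [10.0, 10.1, 10.3, 10.2, 10.5, 10.6, 10.8, 10.9]  # made-up per-run means
noise_floor = 0.1                                           # assumed noise estimate
regression_mad, mk_alpha = 3.5, 0.1
drift_threshold = 0.85 * regression_mad                     # _DRIFT_FRAC * regression_mad

# Short-term step: current mean vs. baseline, in noise units.
current_mean = 11.5
step_flag = (current_mean - median(history)) / noise_floor > regression_mad

# Long-term drift: total slope over the window, confirmed by the trend test.
total_drift = theil_sen_slope(history) * (len(history) - 1)
drift_flag = (total_drift / noise_floor > drift_threshold
              and mann_kendall_p(history) < mk_alpha)
print(step_flag, drift_flag)
```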

Parameters:

variable – Name of the result variable being checked.
historical_time_means – 1-D array of per-time-point mean values from history (one entry per prior run).
current – Current run values (will be averaged).
regression_mad – Step-test threshold in MAD-sigma units.
drift_threshold – Drift-test threshold in MAD-sigma units. If None, defaults to _DRIFT_FRAC * regression_mad so users need to tune only one knob.
mk_alpha – Significance level for the Mann–Kendall trend guard.
direction – Optimization direction from the result variable.
historical_samples – Optional flat array of all historical samples (not per-time means). Used for the sparse-history fallback so the delegated percentage detector sees the same input it would have received from detect_regressions directly. Falls back to historical_time_means when not provided.
regression_percentage – Optional minimum percent change required to flag a regression (directional, i.e. interpreted against direction). When set, acts as a second acceptance band: a regression fires only when BOTH the MAD test and the percent change exceed their thresholds. Suppresses noise-floor false positives on metrics with few repeats or very tight history.
sparse_fallback – When history is too sparse for a robust MAD scale (fewer than 4 points), fall back to a percentage check at regression_percentage. Override checks pass False so a listed variable is never judged by a threshold outside its spec; the check then returns None until history accumulates.
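
How regression_percentage and sparse_fallback interact with the MAD test might be sketched like this. adaptive_verdict is a hypothetical helper covering only the minimize direction; returning None when the fallback is enabled but no percentage is set is an assumption, since the text above does not cover that case.

```python
MIN_POINTS = 4  # history with fewer points is treated as too sparse


def adaptive_verdict(n_history, z_score, change_percent,
                     regression_mad=3.5, regression_percentage=None,
                     sparse_fallback=True):
    """Hypothetical sketch: True/False verdict, or None when no check can run."""
    if n_history < MIN_POINTS:
        if sparse_fallback and regression_percentage is not None:
            # Sparse history: degrade to a plain percentage check.
            return change_percent > regression_percentage
        return None  # wait until history accumulates
    mad_hit = z_score > regression_mad
    if regression_percentage is None:
        return mad_hit
    # Dual band: both the MAD gate and the percent gate must fire.
    return mad_hit and change_percent > regression_percentage


# Very tight history: 6 sigma, but only a 0.4% change.
print(adaptive_verdict(10, 6.0, 0.4, regression_percentage=5.0))
print(adaptive_verdict(10, 6.0, 0.4))
```

The first call is suppressed by the percent floor, while the second, without a floor, flags the same change.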

bencher.regression.detect_delta(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, max_delta: float, direction: bencher.variables.results.OptDir = OptDir.minimize) → RegressionResult

Fail when the current mean’s delta from history exceeds max_delta.

Simple escape hatch: one directional rule on the absolute-unit delta between the current mean and the mean of all historical per-time means. minimize fails when curr - hist_mean > max_delta; maximize fails when hist_mean - curr > max_delta; none uses |delta|. Same shape as detect_percentage() and detect_absolute(); contrast with detect_adaptive() which layers noise modelling and drift testing. Selected via regression_method='delta'.
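
The three directional rules translate almost directly into code. The sketch below is a hypothetical stdlib version that returns only the verdict (the real function returns a RegressionResult), with direction given as a string in place of OptDir.

```python
from statistics import fmean


def delta_exceeded(historical_time_means, current, max_delta, direction="minimize"):
    """Hypothetical sketch of the delta rule; returns True when the check fails."""
    delta = fmean(current) - fmean(historical_time_means)
    if direction == "minimize":
        return delta > max_delta    # curr - hist_mean > max_delta
    if direction == "maximize":
        return -delta > max_delta   # hist_mean - curr > max_delta
    return abs(delta) > max_delta   # no direction: |delta|


history = [10.0, 10.0, 10.0]
current = [10.6, 10.8]
print(delta_exceeded(history, current, max_delta=0.5))
print(delta_exceeded(history, current, max_delta=0.5, direction="maximize"))
```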

bencher.regression.detect_absolute(variable: str, current: numpy.ndarray, limit: float, direction: bencher.variables.results.OptDir = OptDir.minimize) → RegressionResult | None

Fail when current mean violates an absolute limit in the direction of OptDir.

Simple escape hatch: one directional rule against a fixed limit — no historical data required. For OptDir.minimize limit is a ceiling; for OptDir.maximize it’s a floor; OptDir.none has no direction to check against, so the guard warns and returns None rather than reporting a check that never ran. Same shape as detect_percentage() and detect_delta(); contrast with detect_adaptive() which needs history to estimate noise.
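
A minimal sketch of the ceiling/floor behaviour, under stated assumptions: violates_limit is a hypothetical helper returning a bare verdict, direction is a string standing in for OptDir, and the strict inequalities are a guess at how the boundary is treated.

```python
import warnings
from statistics import fmean


def violates_limit(current, limit, direction="minimize"):
    """Hypothetical sketch: True/False verdict, or None when no direction applies."""
    mean = fmean(current)
    if direction == "minimize":
        return mean > limit  # limit acts as a ceiling
    if direction == "maximize":
        return mean < limit  # limit acts as a floor
    warnings.warn("absolute check needs a direction; skipping")
    return None


print(violates_limit([240.0, 260.0, 280.0], limit=250.0))
print(violates_limit([58.0, 61.0], limit=60.0, direction="maximize"))
```

Because no history is consulted, such a check can run on the very first recording.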

bencher.regression._compute_history_arrays(da: xarray.DataArray) → tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]

Aggregate history into per-time means + per-sample scatter arrays.

Returns (time_means, hist_samples_flat, hist_x_flat) or all-None when there is no history to summarise. Per-time means collapse every non-time dim into one scalar per run so detection and plotting both see a 1-D series; the scatter arrays preserve per-repeat spread broadcast against the historical over_time coords.

bencher.regression._attach_plot_metadata(result: RegressionResult, *, time_coord: numpy.ndarray, current_samples: numpy.ndarray, time_means: numpy.ndarray | None, hist_samples_flat: numpy.ndarray | None, hist_x_flat: numpy.ndarray | None) → None: Attach the history/current arrays a RegressionResult needs for replay plotting.

bencher.regression._METHOD_THRESHOLD_ATTR

bencher.regression._HISTORY_FREE_METHODS

bencher.regression._valid_threshold(value) → float | None

Return value as a finite float, or None if it isn’t one.

Rejects bools (True would silently become 1.0) and non-finite numbers (a NaN threshold makes every comparison False, so the check would look configured but could never fire). Accepts any real number, including numpy scalars.
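
The validation rules above can be sketched as follows. valid_threshold is a hypothetical stand-in for the private helper; using numbers.Real as the "any real number" test is an assumption about how the check is implemented.

```python
import math
from numbers import Real


def valid_threshold(value):
    """Hypothetical sketch: finite float, or None for anything unusable."""
    # bool is a subclass of int, so it has to be rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        return None
    value = float(value)
    return value if math.isfinite(value) else None


for candidate in (5, 2.5, True, float("nan"), "5"):
    print(repr(candidate), "->", valid_threshold(candidate))
```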

bencher.regression._normalize_overrides(overrides) → tuple[dict, dict]

Validate run_cfg.regression_overrides into {var: {method: threshold}}.

A bare number is shorthand for {"absolute": value}. Malformed entries are dropped with a warning, never raised: an unknown method key or a bad threshold loses that one check, and a spec left with no valid checks falls back to the benchmark-wide method (so a typo’d key can’t silently disable detection). Only a literal empty spec opts the variable out of detection — the variable was explicitly listed with no checks.

A min_history key inside a spec is not a check: it is the per-variable override of regression_min_history and is returned separately as the second element {var: min_history}. A spec containing only min_history keeps the benchmark-wide method.
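
The normalisation rules can be illustrated with a compact sketch. It is not the library's code: the function and helper names, the KNOWN_METHODS set, the sample variable names, and the way each outcome is represented (an empty dict for an explicit opt-out, omission from the result for "fall back to the benchmark-wide method") are assumptions made for the example.

```python
import math
import warnings
from numbers import Real

KNOWN_METHODS = {"percentage", "adaptive", "delta", "absolute"}


def _num(value):
    ok = isinstance(value, Real) and not isinstance(value, bool) and math.isfinite(value)
    return float(value) if ok else None


def normalize_overrides(overrides):
    """Hypothetical sketch: return ({var: {method: threshold}}, {var: min_history})."""
    checks, min_history = {}, {}
    for var, spec in (overrides or {}).items():
        if not isinstance(spec, dict):
            spec = {"absolute": spec}  # bare number shorthand
        if not spec:
            checks[var] = {}           # literal empty spec: explicit opt-out
            continue
        valid = {}
        for key, raw in spec.items():
            value = _num(raw)
            if key == "min_history" and value is not None:
                min_history[var] = int(value)  # not a check, returned separately
            elif key in KNOWN_METHODS and value is not None:
                valid[key] = value
            else:
                warnings.warn(f"dropping malformed override {var}.{key}={raw!r}")
        if valid:
            checks[var] = valid
        # Otherwise the variable is left out, so the benchmark-wide method applies.
    return checks, min_history


overrides = {
    "latency": 250,                                  # bare number
    "fps": {"percentage": 5.0, "min_history": 3},
    "mem": {"percentge": 5.0},                       # typo'd method key
    "skip": {},                                      # explicit opt-out
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    checks, min_history = normalize_overrides(overrides)
print(checks, min_history, len(caught))
```

In this sketch the typo'd key produces one warning and "mem" drops out of the result, which keeps it under the benchmark-wide method instead of silently disabling detection.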

bencher.regression._run_check(check_method: str, threshold: float, *, var_name: str, direction: bencher.variables.results.OptDir, current_mean_scalar: numpy.ndarray, time_means_arr: numpy.ndarray | None, historical_clean: numpy.ndarray, dual_band_percentage: float, allow_sparse_fallback: bool) → RegressionResult | None

Dispatch one (method, threshold) check for a single variable.

Shared by the benchmark-wide method and per-variable overrides so both paths stay in lockstep. dual_band_percentage is the adaptive method’s percent gate (always the benchmark-wide regression_percentage; an adaptive override’s threshold is its MAD limit). allow_sparse_fallback is False for override checks: with sparse history the adaptive detector would otherwise degrade to a percentage check at the benchmark-wide threshold, contradicting the contract that a listed variable is checked only by its spec. Callers must ensure history exists for every method outside _HISTORY_FREE_METHODS.

bencher.regression._history_points_since_birth(dataset: xarray.Dataset, da: xarray.DataArray) → int

Historical over_time points available for da, excluding the current run.

Counts from the column’s birth coordinate (stamped by history reconciliation on freshly added or meaning_version-restarted columns) so NaN backfill before the column existed does not inflate its baseline age. Columns without a birth marker — as old as the history itself, or a dataset without over_time coordinates — count the full window. A birth value no longer present in the coordinates means the birth has aged out past max_time_events, so the whole (trimmed) window is real history.

When over_time labels are duplicated (a reused TimeEvent), the birth value can match more than one coordinate. We take the last occurrence: if the column was truly born on the later duplicate this is exact, and if it was born on an earlier one this undercounts history — erring toward “younger”, which is only ever notify-only and never a premature block.
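
The counting rule, including the last-occurrence choice for duplicated labels, might be sketched on plain lists. The helper below is hypothetical: it takes a list of over_time labels (current run last) instead of an xarray object, and it assumes the birth point itself counts as history.

```python
def history_points_since_birth(time_labels, birth=None):
    """Hypothetical sketch: historical points from birth onward, excluding the current run."""
    n_history = len(time_labels) - 1  # drop the current run
    if birth is None or birth not in time_labels:
        # No birth marker, or the birth has aged out of the window:
        # the whole window counts as real history.
        return max(n_history, 0)
    # Last occurrence of the birth label, erring toward a younger baseline.
    last = len(time_labels) - 1 - time_labels[::-1].index(birth)
    return max(n_history - last, 0)


print(history_points_since_birth(["a", "b", "c", "d"], birth="b"))
print(history_points_since_birth(["a", "b", "b", "c"], birth="b"))
print(history_points_since_birth(["a", "b", "c", "d"]))
```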

bencher.regression.detect_regressions(dataset: xarray.Dataset, bench_cfg, run_cfg) → RegressionReport

Run regression detection on a dataset with over_time dimension.

For each numeric result variable, dispatches to the detector chosen by run_cfg.regression_method (percentage, adaptive, delta, or absolute). absolute runs even with a single over_time point since it needs no baseline; every other method requires history.

Variables named in run_cfg.regression_overrides are instead checked by exactly the methods in their spec ({method: threshold}, or a bare number as shorthand for an absolute limit), so thresholds — and methods — can differ per variable, including multiple independent checks on one variable. History-needing override checks skip until history exists (including adaptive overrides, which never fall back to a percentage check); absolute checks fire from the first recording.

Parameters:

dataset – xarray Dataset with an over_time dimension.
bench_cfg – BenchCfg with result_vars list.
run_cfg – BenchRunCfg. Reads regression_method and its method-specific threshold: regression_percentage for percentage; regression_mad (plus regression_percentage as a dual-band gate) for adaptive; regression_delta for delta; regression_absolute for absolute. Also reads regression_overrides for per-variable specs.

Returns:

RegressionReport with one result per variable per fired detector/guard.