bencher.regression
Benchmark regression detection for over-time benchmarks.
Provides statistical methods to detect if benchmark values have changed significantly between runs. Supports a percentage threshold and an adaptive MAD-based detector with an optional percent floor for dual-band suppression.
Attributes
_METHOD_DEFAULTS |
_MAD_TO_SIGMA |
_DRIFT_FRAC |
_HAMPEL_K |
Exceptions
RegressionError | Raised when regression detection finds regressions and regression_fail is True.
Classes
RegressionResult | Result of regression detection for a single variable.
RegressionReport | Aggregates regression results for all variables in a benchmark.
MethodCells | Per-method rendering of a single regression result.
Functions
method_cells | Build the per-method cell bundle for a RegressionResult.
_format_summary_line |
_format_markdown_row |
_regression_plot_spec | Prepare the data + styling used by both the matplotlib and holoviews renderers.
_ensure_matplotlib_backend_loaded | Register the holoviews matplotlib backend without changing the default.
build_regression_overlay | Build a holoviews.Overlay diagnostic of a regression result.
render_regression_png | Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.
_clean_1d | Flatten to 1-D float and remove NaNs.
_safe_change_percent | Calculate percentage change, handling zero baseline gracefully.
_is_regression | Determine if a change constitutes a regression given the optimization direction.
_exceeds_directional_threshold | Check if change exceeds threshold in the direction-appropriate sense.
detect_percentage | Compare current mean vs historical mean by percentage threshold.
_robust_scale | Return (median, MAD-based sigma) for a 1-D numeric array.
_residual_sigma | Estimate step-to-step noise via MAD of first differences.
detect_adaptive | Robust regression detection combining step and drift tests.
detect_delta | Fail when the current mean's delta from history exceeds max_delta.
detect_absolute | Fail when current mean violates an absolute limit in the direction of OptDir.
_compute_history_arrays | Aggregate history into per-time means + per-sample scatter arrays.
_attach_plot_metadata | Attach the history/current arrays a RegressionResult needs for replay plotting.
detect_regressions | Run regression detection on a dataset with over_time dimension.
Module Contents
- bencher.regression._METHOD_DEFAULTS
- bencher.regression._MAD_TO_SIGMA = 1.4826
- bencher.regression._DRIFT_FRAC = 0.85
- bencher.regression._HAMPEL_K = 5.0
- exception bencher.regression.RegressionError
Bases: Exception
Raised when regression detection finds regressions and regression_fail is True.
- class bencher.regression.RegressionResult
Result of regression detection for a single variable.
- variable: str
- method: str
- regressed: bool
- current_value: float
- baseline_value: float
- change_percent: float
- threshold: float
- direction: str
- details: str
- band_lower: float | None = None
- band_upper: float | None = None
- percent_band_lower: float | None = None
- percent_band_upper: float | None = None
- historical: numpy.ndarray | None = None
- current_samples: numpy.ndarray | None = None
- historical_all: numpy.ndarray | None = None
- historical_all_x: numpy.ndarray | None = None
- historical_x: numpy.ndarray | None = None
- current_x: numpy.ndarray | None = None
- render_png(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) str
Render this result as a diagnostic PNG (see render_regression_png()).
- render_overlay(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None)
Build a holoviews.Overlay of this result (see build_regression_overlay()).
- class bencher.regression.RegressionReport
Aggregates regression results for all variables in a benchmark.
- results: list[RegressionResult] = []
- property has_regressions: bool
- property regressed_variables: list[RegressionResult]
- summary() str
- to_markdown() str
Return a nicely formatted Markdown summary of all regression results.
- append_to_report(report) None
Append a formatted regression summary to a BenchReport.
- prepend_to_result(report, bench_res) None
Insert a formatted regression summary at the top of bench_res’s tab.
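A minimal sketch of how a caller might gate a CI job on a report, using local stand-in classes rather than the real bencher types. The `Result` and `Report` dataclasses and the `check` helper are hypothetical; only the `regressed` field and the `has_regressions` / `regressed_variables` names come from the listing above, and the way they are computed here is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical stand-in for RegressionResult (only two fields kept)."""
    variable: str
    regressed: bool


@dataclass
class Report:
    """Hypothetical stand-in for RegressionReport."""
    results: list = field(default_factory=list)

    @property
    def regressed_variables(self):
        # Assumed behaviour: keep only the results flagged as regressed.
        return [r for r in self.results if r.regressed]

    @property
    def has_regressions(self):
        return bool(self.regressed_variables)


class RegressionError(Exception):
    """Local stand-in for bencher.regression.RegressionError."""


def check(report, regression_fail):
    # Raise only when regressions exist AND the caller opted into failing.
    if regression_fail and report.has_regressions:
        names = ", ".join(r.variable for r in report.regressed_variables)
        raise RegressionError(f"regressions detected: {names}")


report = Report([Result("latency", True), Result("throughput", False)])
check(report, regression_fail=False)  # does not raise
```

With `regression_fail=True` the same report would raise, which matches the documented condition for `RegressionError`.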
- class bencher.regression.MethodCells
Per-method rendering of a single regression result.
Each detector has a different gate — percent ratio, MAD-sigma, absolute delta, hard limit — so the report cells must describe it in its own units. This bundle is the single source of truth consumed by both the built-in text summary and the markdown table, and is exposed as public API so downstream report builders can produce their own layouts (custom columns, non-markdown output, templated HTML, GitHub PR comments with status decoration, etc.) without reimplementing method dispatch and drifting when new detection methods are added.
Example — building a minimal custom row from a RegressionResult:
    from bencher import method_cells
    cells = method_cells(result)
    row = f"{result.variable}: {cells.change} (gate {cells.threshold})"
- change
Change column (markdown) — gated quantity in its own units.
- baseline
Baseline column (markdown) — em-dash for absolute (no historical baseline exists).
- threshold
Threshold column (markdown), carrying the gate’s native units (±T%, Tσ, ±T, or a direction-aware inequality).
- summary_lead
First clause of the summary line, before the details parenthesis. Captures the gated quantity in sentence form.
- summary_standalone
When True, the summary line skips the (baseline=…, current=…, threshold=…) tail because summary_lead already contains the relevant values. Used by the absolute method (no baseline, limit is in the lead).
- change: str
- baseline: str
- threshold: str
- summary_lead: str
- summary_standalone: bool = False
- bencher.regression.method_cells(r: RegressionResult) MethodCells
Build the per-method cell bundle for a RegressionResult.
Returns a MethodCells with pre-rendered display strings for the result’s change, baseline, and threshold, plus the summary lead clause. Dispatches on r.method so each gate describes itself in its native units. Safe to call on any RegressionResult: unknown methods fall back to the percentage-style rendering.
Intended for consumers that want to embed regression results in a custom layout while staying consistent with how the built-in RegressionReport.summary() and RegressionReport.to_markdown() present each method.
Notes on the absolute branch: baseline_value and threshold both hold the limit for this detector (see detect_absolute()); the code reads from threshold to make the intent (“this is the gate value”) explicit.
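As an illustration of the "custom layout" use case, the sketch below renders an HTML table row from the pre-rendered cell strings. The `MethodCells` class here is a local stand-in that mirrors the documented fields so the snippet runs without bencher installed, `html_row` is a hypothetical helper, and the cell values are made up; in real use the bundle would come from `method_cells(result)`.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class MethodCells:
    """Local stand-in mirroring the documented MethodCells fields."""
    change: str
    baseline: str
    threshold: str
    summary_lead: str
    summary_standalone: bool = False


def html_row(variable: str, cells: MethodCells) -> str:
    """Render one <tr> from pre-rendered cell strings (hypothetical helper)."""
    columns = [variable, cells.change, cells.baseline, cells.threshold]
    return "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in columns) + "</tr>"


# Illustrative values only; real strings come from method_cells(result).
cells = MethodCells(change="+7.2%", baseline="100", threshold="±5%",
                    summary_lead="latency changed +7.2%")
row = html_row("latency", cells)
```

Because the strings are already formatted per method, the row builder needs no knowledge of which detector produced the result.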
- bencher.regression._format_summary_line(r: RegressionResult) str
- bencher.regression._format_markdown_row(r: RegressionResult) str
- bencher.regression._regression_plot_spec(result: RegressionResult, historical: numpy.ndarray | None, current: numpy.ndarray | float | None) dict
Prepare the data + styling used by both the matplotlib and holoviews renderers.
Resolves the history and current arrays from the arguments first, falling back to anything stored on result. Returns a dict of primitives the backend-specific renderers consume. Keeping this shared guarantees the PNG and in-report plots stay in sync as the diagnostic evolves.
- bencher.regression._ensure_matplotlib_backend_loaded() None
Register the holoviews matplotlib backend without changing the default.
render_regression_png needs matplotlib to export a PNG, but the report path uses bokeh; calling hv.extension('matplotlib') naively would flip the global default mid-run. This loads the renderer if missing, then restores the prior default. Selects the non-interactive Agg backend when no matplotlib backend has been configured yet (force=False), so holoviews doesn’t pick up Tk/Qt on a fresh process (which leaks "main thread is not in main loop" tracebacks at interpreter shutdown). If the caller has already configured a backend (e.g., Jupyter’s inline backend), that choice is left alone.
- bencher.regression.build_regression_overlay(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, width: int = 700, height: int = 350, fig_inches: tuple[float, float] = (7.0, 3.5))
Build a holoviews.Overlay diagnostic of a regression result.
Opts are applied per-backend so the same overlay renders correctly under both bokeh (for embedded HTML reports) and matplotlib (for PNG export via render_regression_png()). History always shows as mean line + raw alpha scatter; regression-specific layers (acceptance band, baseline, verdict-coloured current marker) are conditional on the data in result.
- Parameters:
result – The RegressionResult to visualise.
historical – Optional 1-D array of historical per-time-point means. Falls back to result.historical if omitted.
current – Optional current-run sample array (or scalar). Falls back to result.current_samples / result.current_value.
width – Pixel dimensions for the bokeh backend.
height – Pixel dimensions for the bokeh backend.
fig_inches – Figure size in inches for the matplotlib backend.
- bencher.regression.render_regression_png(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) str
Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.
Produces the same plot as the in-report bokeh overlay: it calls build_regression_overlay() and hands the result to holoviews’ matplotlib renderer, so there’s a single source of truth for the diagnostic visual.
- Parameters:
result – The RegressionResult produced by a detect_* call.
historical – 1-D array of historical per-time-point means. Falls back to result.historical.
current – Current-run sample(s). Falls back to result.current_samples / result.current_value.
path – Output PNG path. If None, a path is generated via bencher.utils.gen_image_path() so the file lives under the bencher cache directory.
figsize – Figure size in inches (matplotlib fig_inches).
dpi – Output DPI (500x320 at dpi=100 works well for GitHub comments).
- Returns:
Absolute path to the saved PNG as a string.
- bencher.regression._clean_1d(a: numpy.ndarray) numpy.ndarray
Flatten to 1-D float and remove NaNs.
- bencher.regression._safe_change_percent(current: float, baseline: float) float
Calculate percentage change, handling zero baseline gracefully.
- bencher.regression._is_regression(change_percent: float, direction: bencher.variables.results.OptDir) bool
Determine if a change constitutes a regression given the optimization direction.
- bencher.regression._exceeds_directional_threshold(change_percent: float, threshold_percent: float, direction: bencher.variables.results.OptDir) bool
Check if change exceeds threshold in the direction-appropriate sense.
- bencher.regression.detect_percentage(variable: str, historical: numpy.ndarray, current: numpy.ndarray, threshold_percent: float = 5.0, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult
Compare current mean vs historical mean by percentage threshold.
Simple escape hatch: one directional rule comparing the current mean against the historical mean. Same shape as detect_delta() and detect_absolute(); contrast with detect_adaptive(), which layers noise modelling, a drift test, and a dual-band AND gate.
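The rule can be sketched in a few lines of plain Python. This is an illustration, not the library's code: the function names are hypothetical, directions are plain strings instead of `OptDir` members, the zero-baseline handling (0 when both values are zero, signed infinity otherwise) is an assumption, and the directional comparison is assumed to mirror the one documented for `detect_delta()`.

```python
import math
from statistics import fmean


def safe_change_percent(current: float, baseline: float) -> float:
    """Percent change; the zero-baseline handling here is an assumption."""
    if baseline == 0.0:
        if current == 0.0:
            return 0.0
        return math.copysign(math.inf, current)
    return (current - baseline) / abs(baseline) * 100.0


def exceeds(change_percent: float, threshold_percent: float, direction: str) -> bool:
    """Directional gate: only a move in the 'bad' direction counts."""
    if direction == "minimize":   # lower is better, so increases regress
        return change_percent > threshold_percent
    if direction == "maximize":   # higher is better, so decreases regress
        return change_percent < -threshold_percent
    return abs(change_percent) > threshold_percent  # no preferred direction


def percentage_check(historical, current, threshold_percent=5.0, direction="minimize"):
    """Return (change_percent, regressed) for current mean vs historical mean."""
    change = safe_change_percent(fmean(current), fmean(historical))
    return change, exceeds(change, threshold_percent, direction)


# Historical mean is 100, current mean is 109: a 9% increase.
change, regressed = percentage_check([100.0, 102.0, 98.0], [108.0, 110.0])
```

Under `minimize` the 9% increase exceeds the default 5% gate; under `maximize` the same increase would be an improvement and would not be flagged.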
- bencher.regression._robust_scale(values: numpy.ndarray) tuple[float, float]
Return (median, MAD-based sigma) for a 1-D numeric array.
The MAD is scaled by 1.4826 so it matches the standard deviation for Gaussian data.
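A stdlib-only sketch of the same estimator, for illustration; the real function operates on NumPy arrays, and `robust_scale` here is a hypothetical name.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # makes the MAD consistent with sigma for Gaussian data


def robust_scale(values):
    """Return (median, MAD * 1.4826) for a non-empty sequence of floats."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center, MAD_TO_SIGMA * mad


# The outlier (50.0) barely moves either statistic.
center, sigma = robust_scale([10.0, 11.0, 9.0, 10.0, 50.0])
```

Here the median is 10.0 and the MAD is 1.0, so the scale estimate is 1.4826, whereas the sample standard deviation would be dominated by the single outlier.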
- bencher.regression._residual_sigma(values: numpy.ndarray) float
Estimate step-to-step noise via MAD of first differences.
For data y[i] = trend[i] + eps[i] with independent noise eps of standard deviation sigma, the noise part of the diff y[i+1] - y[i] has variance 2 * sigma^2, so MAD(diff) * 1.4826 / sqrt(2) recovers sigma even when trend is non-stationary. This prevents a gradual drift from inflating its own noise estimate and masking itself.
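The effect is easy to see on synthetic data. The sketch below is a stdlib-only illustration (the name `residual_sigma`, the empty-input fallback, and the test series are assumptions): a linear ramp with unit Gaussian noise yields a first-difference estimate close to 1, while the plain standard deviation of the series is dominated by the ramp.

```python
import math
import random
from statistics import median, pstdev


def residual_sigma(values):
    """Noise sigma from first differences: MAD(diff) * 1.4826 / sqrt(2)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if not diffs:
        return 0.0  # assumption: fewer than two points gives no estimate
    center = median(diffs)
    mad = median(abs(d - center) for d in diffs)
    return mad * 1.4826 / math.sqrt(2.0)


rng = random.Random(0)  # fixed seed so the example is repeatable
series = [0.5 * i + rng.gauss(0.0, 1.0) for i in range(400)]

sigma = residual_sigma(series)  # close to the true noise level of 1.0
naive = pstdev(series)          # dominated by the trend, not the noise
```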
- bencher.regression.detect_adaptive(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, regression_mad: float = 3.5, drift_threshold: float | None = None, mk_alpha: float = 0.1, direction: bencher.variables.results.OptDir = OptDir.minimize, historical_samples: numpy.ndarray | None = None, regression_percentage: float | None = None) RegressionResult
Robust regression detection combining step and drift tests.
The method estimates the metric’s inherent noise from history using a median + MAD (median absolute deviation) scale and expresses the current run’s deviation in those noise units. Two orthogonal tests run in parallel:
- Short-term step: flags if (current_mean - baseline) / noise_floor exceeds regression_mad in the regression direction.
- Long-term drift: fits a Theil–Sen slope on the historical time-point means (after a Hampel filter removes isolated outliers) and flags if the total projected drift, scaled by noise_floor, exceeds drift_threshold and a Mann–Kendall test confirms monotonic trend with p < mk_alpha.
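The drift test can be sketched with textbook versions of its two ingredients. This is not the library's implementation: the Hampel pre-filter is omitted, the "total projected drift" is assumed to be slope times the number of steps in the history, only the `minimize` direction (upward drift is bad) is shown, and the Mann–Kendall p-value uses the normal approximation without a tie correction.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(y):
    """Median of all pairwise slopes, with x taken as the index 0..n-1."""
    return median((y[j] - y[i]) / (j - i)
                  for i, j in combinations(range(len(y)), 2))


def mann_kendall_p(y):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(y)
    s = sum((y[j] > y[i]) - (y[j] < y[i])
            for i, j in combinations(range(n), 2))
    if s == 0:
        return 1.0
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - math.copysign(1, s)) / math.sqrt(var_s)  # continuity correction
    return math.erfc(abs(z) / math.sqrt(2.0))


def drift_flag(time_means, noise_floor, drift_threshold, mk_alpha=0.1):
    """Flag upward drift (direction=minimize); names are illustrative."""
    total_drift = theil_sen_slope(time_means) * (len(time_means) - 1)
    return (total_drift / noise_floor > drift_threshold
            and mann_kendall_p(time_means) < mk_alpha)


creeping = [10.0, 10.1, 10.3, 10.2, 10.5, 10.6, 10.8, 10.9, 11.1, 11.2]
flat = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.0, 10.2, 9.9]
creeping_flagged = drift_flag(creeping, noise_floor=0.1, drift_threshold=3.0)
flat_flagged = drift_flag(flat, noise_floor=0.1, drift_threshold=3.0)
```

Requiring both conditions means a large slope fitted through a few noisy points is not enough on its own; the trend must also be statistically monotonic.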
- Parameters:
variable – Name of the result variable being checked.
historical_time_means – 1-D array of per-time-point mean values from history (one entry per prior run).
current – Current run values (will be averaged).
regression_mad – Step-test threshold in MAD-sigma units.
drift_threshold – Drift-test threshold in MAD-sigma units. If None, defaults to _DRIFT_FRAC * regression_mad so users need to tune only one knob.
mk_alpha – Significance level for the Mann–Kendall trend guard.
direction – Optimization direction from the result variable.
historical_samples – Optional flat array of all historical samples (not per-time means). Used for the sparse-history fallback so the delegated percentage detector sees the same input it would have received from detect_regressions directly. Falls back to historical_time_means when not provided.
regression_percentage – Optional minimum percent change required to flag a regression (directional, i.e. interpreted against direction). When set, acts as a second acceptance band: a regression fires only when BOTH the MAD test and the percent change exceed their thresholds. Suppresses noise-floor false positives on metrics with few repeats or very tight history.
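To show why the percent floor matters, here is a rough sketch of the step test with the optional second band, for the `minimize` direction only. It assumes the baseline is the median of the history and the noise floor is the MAD-based sigma; the real detector may derive both differently, and the function name is hypothetical.

```python
from statistics import fmean, median


def step_with_percent_floor(history, current, regression_mad=3.5,
                            regression_percentage=None):
    """Step test for direction=minimize with an optional percent floor.

    Assumes a non-zero baseline; baseline and noise_floor choices are
    illustrative, not taken from the library.
    """
    baseline = median(history)
    noise_floor = 1.4826 * median(abs(v - baseline) for v in history)
    cur = fmean(current)
    if noise_floor > 0:
        z = (cur - baseline) / noise_floor
    else:
        z = float("inf") if cur > baseline else 0.0
    mad_hit = z > regression_mad
    if regression_percentage is None:
        return mad_hit
    pct = (cur - baseline) / abs(baseline) * 100.0
    # Dual band: both the MAD gate and the percent gate must be exceeded.
    return mad_hit and pct > regression_percentage


# Very tight history: a 1% bump is many sigmas away from the baseline.
history = [100.0, 100.1, 99.9, 100.0, 100.1, 99.9, 100.0]
mad_only = step_with_percent_floor(history, [101.0])
dual_band = step_with_percent_floor(history, [101.0], regression_percentage=5.0)
```

With the MAD gate alone the 1% change is flagged because the history is so tight; adding a 5% floor suppresses it, which is the false-positive case the parameter description mentions.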
- bencher.regression.detect_delta(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, max_delta: float, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult
Fail when the current mean’s delta from history exceeds max_delta.
Simple escape hatch: one directional rule on the absolute-unit delta between the current mean and the mean of all historical per-time means. minimize fails when curr - hist_mean > max_delta; maximize fails when hist_mean - curr > max_delta; none uses |delta|. Same shape as detect_percentage() and detect_absolute(); contrast with detect_adaptive(), which layers noise modelling and drift testing. Selected via regression_method='delta'.
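The three directional cases translate directly into code. The sketch below is illustrative only: `delta_regressed` is a hypothetical name, and directions are plain strings rather than `OptDir` members.

```python
from statistics import fmean


def delta_regressed(historical_time_means, current, max_delta, direction="minimize"):
    """Apply the directional absolute-unit rule described above."""
    delta = fmean(current) - fmean(historical_time_means)
    if direction == "minimize":
        return delta > max_delta      # curr - hist_mean > max_delta
    if direction == "maximize":
        return -delta > max_delta     # hist_mean - curr > max_delta
    return abs(delta) > max_delta     # no preferred direction


# Historical mean is 11.0; the current mean of 14.0 is 3.0 units higher.
flagged = delta_regressed([10.0, 12.0], [14.0], max_delta=2.0)
```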
- bencher.regression.detect_absolute(variable: str, current: numpy.ndarray, limit: float, direction: bencher.variables.results.OptDir = OptDir.minimize) RegressionResult
Fail when current mean violates an absolute limit in the direction of OptDir.
Simple escape hatch: one directional rule against a fixed limit, with no historical data required. For OptDir.minimize, limit is a ceiling; for OptDir.maximize it’s a floor; OptDir.none records a non-regressed result and leaves it to the caller to log. Same shape as detect_percentage() and detect_delta(); contrast with detect_adaptive(), which needs history to estimate noise.
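A minimal sketch of the ceiling/floor rule, using plain strings for the direction and a hypothetical function name. Whether the real comparison is strict or inclusive at exactly the limit is not stated above; strict inequality is assumed here.

```python
from statistics import fmean


def violates_limit(current, limit, direction="minimize"):
    """Check the current mean against a fixed limit (strictness is assumed)."""
    cur = fmean(current)
    if direction == "minimize":
        return cur > limit   # limit acts as a ceiling
    if direction == "maximize":
        return cur < limit   # limit acts as a floor
    return False             # no direction: never flagged


# Mean latency of 2.5 against a ceiling of 2.0.
over_ceiling = violates_limit([2.4, 2.6], limit=2.0)
```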
- bencher.regression._compute_history_arrays(da: xarray.DataArray) tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]
Aggregate history into per-time means + per-sample scatter arrays.
Returns (time_means, hist_samples_flat, hist_x_flat) or all-None when there is no history to summarise. Per-time means collapse every non-time dim into one scalar per run so detection and plotting both see a 1-D series; the scatter arrays preserve per-repeat spread broadcast against the historical over_time coords.
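The shape of the three outputs may be easier to see on plain lists. The sketch below is an assumption-laden illustration: the real function takes an `xarray.DataArray`, whereas this hypothetical `history_arrays` takes a list of `(time_coord, samples)` pairs covering the historical runs only and ignores NaN handling.

```python
from statistics import fmean


def history_arrays(runs):
    """runs: list of (time_coord, samples) pairs for the historical runs only."""
    if not runs:
        return None, None, None
    # One scalar per run, for detection and the mean line.
    time_means = [fmean(samples) for _, samples in runs]
    # Every individual sample, with its run's time coordinate repeated
    # alongside it, for the scatter layer.
    hist_samples_flat = [s for _, samples in runs for s in samples]
    hist_x_flat = [t for t, samples in runs for _ in samples]
    return time_means, hist_samples_flat, hist_x_flat


time_means, samples_flat, x_flat = history_arrays(
    [(0, [1.0, 3.0]), (1, [2.0, 4.0, 6.0])]
)
```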
- bencher.regression._attach_plot_metadata(result: RegressionResult, *, time_coord: numpy.ndarray, current_samples: numpy.ndarray, time_means: numpy.ndarray | None, hist_samples_flat: numpy.ndarray | None, hist_x_flat: numpy.ndarray | None) None
Attach the history/current arrays a RegressionResult needs for replay plotting.
- bencher.regression.detect_regressions(dataset: xarray.Dataset, bench_cfg, run_cfg) RegressionReport
Run regression detection on a dataset with over_time dimension.
For each numeric result variable, dispatches to the detector chosen by run_cfg.regression_method (percentage, adaptive, delta, or absolute). absolute runs even with a single over_time point since it needs no baseline; every other method requires history.
- Parameters:
dataset – xarray Dataset with an over_time dimension.
bench_cfg – BenchCfg with result_vars list.
run_cfg – BenchRunCfg. Reads regression_method and its method-specific threshold: regression_percentage for percentage; regression_mad (plus regression_percentage as a dual-band gate) for adaptive; regression_delta for delta; regression_absolute for absolute.
- Returns:
RegressionReport with one result per variable per fired detector/guard.
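The method-to-threshold mapping and the history requirement can be summarised as a small dispatch table. This is a sketch of the documented relationships only: `plan` is a hypothetical helper, `SimpleNamespace` stands in for `BenchRunCfg`, "history" is assumed to mean at least two over_time points, and raising `ValueError` for an unknown method is an assumption.

```python
from types import SimpleNamespace

# Method name -> run_cfg attribute holding its primary threshold
# (taken from the run_cfg parameter description above).
THRESHOLD_ATTR = {
    "percentage": "regression_percentage",
    "adaptive": "regression_mad",
    "delta": "regression_delta",
    "absolute": "regression_absolute",
}


def plan(run_cfg, n_time_points):
    """Return (method, threshold), or None when the method cannot run yet."""
    method = run_cfg.regression_method
    if method not in THRESHOLD_ATTR:
        raise ValueError(f"unknown regression_method: {method!r}")
    # absolute needs no baseline; every other method needs prior runs.
    if method != "absolute" and n_time_points < 2:
        return None
    return method, getattr(run_cfg, THRESHOLD_ATTR[method])


delta_cfg = SimpleNamespace(regression_method="delta", regression_delta=0.5)
abs_cfg = SimpleNamespace(regression_method="absolute", regression_absolute=2.0)

first_run_delta = plan(delta_cfg, n_time_points=1)   # no history yet
first_run_abs = plan(abs_cfg, n_time_points=1)       # still runs
```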