bencher.regression
==================

.. py:module:: bencher.regression

.. autoapi-nested-parse::

   Benchmark regression detection for over-time benchmarks.

   Provides statistical methods to detect if benchmark values have changed
   significantly between runs. Supports a percentage threshold and an
   adaptive MAD-based detector with an optional percent floor for dual-band
   suppression.


Attributes
----------

.. autoapisummary::

   bencher.regression._METHOD_DEFAULTS
   bencher.regression._MAD_TO_SIGMA
   bencher.regression._DRIFT_FRAC
   bencher.regression._HAMPEL_K


Exceptions
----------

.. autoapisummary::

   bencher.regression.RegressionError


Classes
-------

.. autoapisummary::

   bencher.regression.RegressionResult
   bencher.regression.RegressionReport
   bencher.regression.MethodCells


Functions
---------

.. autoapisummary::

   bencher.regression.method_cells
   bencher.regression._format_summary_line
   bencher.regression._format_markdown_row
   bencher.regression._regression_plot_spec
   bencher.regression._ensure_matplotlib_backend_loaded
   bencher.regression.build_regression_overlay
   bencher.regression.render_regression_png
   bencher.regression._clean_1d
   bencher.regression._safe_change_percent
   bencher.regression._is_regression
   bencher.regression._exceeds_directional_threshold
   bencher.regression.detect_percentage
   bencher.regression._robust_scale
   bencher.regression._residual_sigma
   bencher.regression.detect_adaptive
   bencher.regression.detect_delta
   bencher.regression.detect_absolute
   bencher.regression._compute_history_arrays
   bencher.regression._attach_plot_metadata
   bencher.regression.detect_regressions


Module Contents
---------------

.. py:data:: _METHOD_DEFAULTS

.. py:data:: _MAD_TO_SIGMA
   :value: 1.4826


.. py:data:: _DRIFT_FRAC
   :value: 0.85


.. py:data:: _HAMPEL_K
   :value: 5.0


.. py:exception:: RegressionError

   Bases: :py:obj:`Exception`

   Raised when regression detection finds regressions and regression_fail is True.


.. py:class:: RegressionResult

   Result of regression detection for a single variable.


   .. py:attribute:: variable
      :type: str


   .. py:attribute:: method
      :type: str


   .. py:attribute:: regressed
      :type: bool


   .. py:attribute:: current_value
      :type: float


   .. py:attribute:: baseline_value
      :type: float


   .. py:attribute:: change_percent
      :type: float


   .. py:attribute:: threshold
      :type: float


   .. py:attribute:: direction
      :type: str


   .. py:attribute:: details
      :type: str


   .. py:attribute:: band_lower
      :type: float | None
      :value: None


   .. py:attribute:: band_upper
      :type: float | None
      :value: None


   .. py:attribute:: percent_band_lower
      :type: float | None
      :value: None


   .. py:attribute:: percent_band_upper
      :type: float | None
      :value: None


   .. py:attribute:: historical
      :type: numpy.ndarray | None
      :value: None


   .. py:attribute:: current_samples
      :type: numpy.ndarray | None
      :value: None


   .. py:attribute:: historical_all
      :type: numpy.ndarray | None
      :value: None


   .. py:attribute:: historical_all_x
      :type: numpy.ndarray | None
      :value: None


   .. py:attribute:: historical_x
      :type: numpy.ndarray | None
      :value: None


   .. py:attribute:: current_x
      :type: numpy.ndarray | None
      :value: None


   .. py:method:: render_png(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) -> str

      Render this result as a diagnostic PNG (see :func:`render_regression_png`).


   .. py:method:: render_overlay(historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None)

      Build a :class:`holoviews.Overlay` of this result (see :func:`build_regression_overlay`).


.. py:class:: RegressionReport

   Aggregates regression results for all variables in a benchmark.


   .. py:attribute:: results
      :type: list[RegressionResult]
      :value: []


   .. py:property:: has_regressions
      :type: bool


   .. py:property:: regressed_variables
      :type: list[RegressionResult]


   .. py:method:: summary() -> str


   .. py:method:: to_markdown() -> str

      Return a nicely formatted Markdown summary of all regression results.


   .. py:method:: append_to_report(report) -> None

      Append a formatted regression summary to a :class:`BenchReport`.


   .. py:method:: prepend_to_result(report, bench_res) -> None

      Insert a formatted regression summary at the top of *bench_res*'s tab.


.. py:class:: MethodCells

   Per-method rendering of a single regression result.

   Each detector has a different gate — percent ratio, MAD-sigma, absolute
   delta, hard limit — so the report cells must describe it in its own units.
   This bundle is the single source of truth consumed by both the built-in
   text summary and the markdown table, and is exposed as public API so
   downstream report builders can produce their own layouts (custom columns,
   non-markdown output, templated HTML, GitHub PR comments with status
   decoration, etc.) without reimplementing method dispatch and drifting when
   new detection methods are added.

   Example — building a minimal custom row from a RegressionResult::

       from bencher import method_cells

       cells = method_cells(result)
       row = f"{result.variable}: {cells.change} (gate {cells.threshold})"

   .. attribute:: change

      Change column (markdown) — gated quantity in its own units.

   .. attribute:: baseline

      Baseline column (markdown) — em-dash for absolute (no historical
      baseline exists).

   .. attribute:: threshold

      Threshold column (markdown) — carries the gate's native units
      (``±T%``, ``Tσ``, ``±T``, or a direction-aware inequality).

   .. attribute:: summary_lead

      First clause of the summary line, before the details parenthesis.
      Captures the gated quantity in sentence form.

   .. attribute:: summary_standalone

      When True, the summary line skips the
      ``(baseline=…, current=…, threshold=…)`` tail because ``summary_lead``
      already contains the relevant values. Used by the absolute method
      (no baseline, limit is in the lead).


   .. py:attribute:: change
      :type: str


   .. py:attribute:: baseline
      :type: str


   .. py:attribute:: threshold
      :type: str


   .. py:attribute:: summary_lead
      :type: str


   .. py:attribute:: summary_standalone
      :type: bool
      :value: False


.. py:function:: method_cells(r: RegressionResult) -> MethodCells

   Build the per-method cell bundle for a :class:`RegressionResult`.

   Returns a :class:`MethodCells` with pre-rendered display strings for the
   result's change, baseline, and threshold, plus the summary lead clause.
   Dispatches on ``r.method`` so each gate describes itself in its native
   units. Safe to call on any ``RegressionResult`` — unknown methods fall
   back to the percentage-style rendering.

   Intended for consumers that want to embed regression results in a custom
   layout while staying consistent with how the built-in
   :meth:`RegressionReport.summary` and :meth:`RegressionReport.to_markdown`
   present each method.

   Notes on the ``absolute`` branch: ``baseline_value`` and ``threshold``
   both hold the limit for this detector (see :func:`detect_absolute`); the
   code reads from ``threshold`` to make the intent ("this is the gate
   value") explicit.


.. py:function:: _format_summary_line(r: RegressionResult) -> str

.. py:function:: _format_markdown_row(r: RegressionResult) -> str

.. py:function:: _regression_plot_spec(result: RegressionResult, historical: numpy.ndarray | None, current: numpy.ndarray | float | None) -> dict

   Prepare the data + styling used by both the matplotlib and holoviews renderers.

   Resolves the history and current arrays from the arguments first, falling
   back to anything stored on *result*. Returns a dict of primitives the
   backend-specific renderers consume. Keeping this shared guarantees the PNG
   and in-report plots stay in sync as the diagnostic evolves.


.. py:function:: _ensure_matplotlib_backend_loaded() -> None

   Register the holoviews matplotlib backend without changing the default.

   render_regression_png needs matplotlib to export a PNG, but the report
   path uses bokeh — calling hv.extension('matplotlib') naively would flip
   the global default mid-run. This loads the renderer if missing, then
   restores the prior default.

   Selects the non-interactive Agg backend when no matplotlib backend has
   been configured yet (``force=False``), so holoviews doesn't pick up Tk/Qt
   on a fresh process (which leaks ``main thread is not in main loop``
   tracebacks at interpreter shutdown). If the caller has already configured
   a backend (e.g., Jupyter's inline backend), that choice is left alone.


.. py:function:: build_regression_overlay(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, width: int = 700, height: int = 350, fig_inches: tuple[float, float] = (7.0, 3.5))

   Build a :class:`holoviews.Overlay` diagnostic of a regression result.

   Opts are applied per-backend so the same overlay renders correctly under
   both bokeh (for embedded HTML reports) and matplotlib (for PNG export via
   :func:`render_regression_png`). History always shows as mean line + raw
   alpha scatter; regression-specific layers (acceptance band, baseline,
   verdict-coloured current marker) are conditional on the data in *result*.

   :param result: The :class:`RegressionResult` to visualise.
   :param historical: Optional 1-D array of historical per-time-point means.
      Falls back to ``result.historical`` if omitted.
   :param current: Optional current-run sample array (or scalar). Falls back
      to ``result.current_samples`` / ``result.current_value``.
   :param width: Pixel dimensions for the bokeh backend.
   :param height: Pixel dimensions for the bokeh backend.
   :param fig_inches: Figure size in inches for the matplotlib backend.


.. py:function:: render_regression_png(result: RegressionResult, historical: numpy.ndarray | None = None, current: numpy.ndarray | float | None = None, path: str | pathlib.Path | None = None, figsize: tuple[float, float] = (8.0, 5.0), dpi: int = 100) -> str

   Render a diagnostic PNG by saving the shared holoviews overlay via matplotlib.

   Produces the same plot as the in-report bokeh overlay — it calls
   :func:`build_regression_overlay` and hands the result to holoviews'
   matplotlib renderer, so there's a single source of truth for the
   diagnostic visual.

   :param result: The :class:`RegressionResult` produced by a ``detect_*`` call.
   :param historical: 1-D array of historical per-time-point means. Falls
      back to ``result.historical``.
   :param current: Current-run sample(s). Falls back to
      ``result.current_samples`` / ``result.current_value``.
   :param path: Output PNG path. If ``None``, a path is generated via
      :func:`bencher.utils.gen_image_path` so the file lives under the
      bencher cache directory.
   :param figsize: Figure size in inches (matplotlib ``fig_inches``).
   :param dpi: Output DPI (500x320 at ``dpi=100`` works well for GitHub comments).

   :returns: Absolute path to the saved PNG as a string.


.. py:function:: _clean_1d(a: numpy.ndarray) -> numpy.ndarray

   Flatten to 1-D float and remove NaNs.


.. py:function:: _safe_change_percent(current: float, baseline: float) -> float

   Calculate percentage change, handling zero baseline gracefully.


.. py:function:: _is_regression(change_percent: float, direction: bencher.variables.results.OptDir) -> bool

   Determine if a change constitutes a regression given the optimization direction.


.. py:function:: _exceeds_directional_threshold(change_percent: float, threshold_percent: float, direction: bencher.variables.results.OptDir) -> bool

   Check if change exceeds threshold in the direction-appropriate sense.


.. py:function:: detect_percentage(variable: str, historical: numpy.ndarray, current: numpy.ndarray, threshold_percent: float = 5.0, direction: bencher.variables.results.OptDir = OptDir.minimize) -> RegressionResult

   Compare current mean vs historical mean by percentage threshold.

   Simple escape hatch: one directional rule comparing the current mean
   against the historical mean.
   Same shape as :func:`detect_delta` and :func:`detect_absolute`; contrast
   with :func:`detect_adaptive` which layers noise modelling, drift test, and
   a dual-band AND gate.


.. py:function:: _robust_scale(values: numpy.ndarray) -> tuple[float, float]

   Return (median, MAD-based sigma) for a 1-D numeric array.

   The MAD is scaled by 1.4826 so it matches the standard deviation for
   Gaussian data.


.. py:function:: _residual_sigma(values: numpy.ndarray) -> float

   Estimate step-to-step noise via MAD of first differences.

   For data ``y[i] = trend[i] + eps[i]`` the diff ``y[i+1] - y[i]`` has
   variance ``2 * sigma^2``, so ``MAD(diff) * 1.4826 / sqrt(2)`` recovers
   sigma even when ``trend`` is non-stationary. This prevents a gradual drift
   from inflating its own noise estimate and masking itself.


.. py:function:: detect_adaptive(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, regression_mad: float = 3.5, drift_threshold: float | None = None, mk_alpha: float = 0.1, direction: bencher.variables.results.OptDir = OptDir.minimize, historical_samples: numpy.ndarray | None = None, regression_percentage: float | None = None) -> RegressionResult

   Robust regression detection combining step and drift tests.

   The method estimates the metric's inherent noise from history using a
   median + MAD (median absolute deviation) scale and expresses the current
   run's deviation in those noise units. Two orthogonal tests run in
   parallel:

   * **Short-term step** — flags if
     ``(current_mean - baseline) / noise_floor`` exceeds ``regression_mad``
     in the regression direction.
   * **Long-term drift** — fits a Theil–Sen slope on the historical
     time-point means (after a Hampel filter removes isolated outliers) and
     flags if the total projected drift, scaled by ``noise_floor``, exceeds
     ``drift_threshold`` and a Mann–Kendall test confirms monotonic trend
     with ``p < mk_alpha``.

   :param variable: Name of the result variable being checked.
   :param historical_time_means: 1-D array of per-time-point mean values from
      history (one entry per prior run).
   :param current: Current run values (will be averaged).
   :param regression_mad: Step-test threshold in MAD-sigma units.
   :param drift_threshold: Drift-test threshold in MAD-sigma units. If
      ``None``, defaults to ``_DRIFT_FRAC * regression_mad`` so users need to
      tune only one knob.
   :param mk_alpha: Significance level for the Mann–Kendall trend guard.
   :param direction: Optimization direction from the result variable.
   :param historical_samples: Optional flat array of all historical samples
      (not per-time means). Used for the sparse-history fallback so the
      delegated ``percentage`` detector sees the same input it would have
      received from ``detect_regressions`` directly. Falls back to
      ``historical_time_means`` when not provided.
   :param regression_percentage: Optional minimum percent change required to
      flag a regression (directional, i.e. interpreted against
      ``direction``). When set, acts as a second acceptance band: a
      regression fires only when BOTH the MAD test and the percent change
      exceed their thresholds. Suppresses noise-floor false positives on
      metrics with few repeats or very tight history.


.. py:function:: detect_delta(variable: str, historical_time_means: numpy.ndarray, current: numpy.ndarray, max_delta: float, direction: bencher.variables.results.OptDir = OptDir.minimize) -> RegressionResult

   Fail when the current mean's delta from history exceeds ``max_delta``.

   Simple escape hatch: one directional rule on the absolute-unit delta
   between the current mean and the mean of all historical per-time means.
   ``minimize`` fails when ``curr - hist_mean > max_delta``; ``maximize``
   fails when ``hist_mean - curr > max_delta``; ``none`` uses ``|delta|``.
   Same shape as :func:`detect_percentage` and :func:`detect_absolute`;
   contrast with :func:`detect_adaptive` which layers noise modelling and
   drift testing. Selected via ``regression_method='delta'``.


.. py:function:: detect_absolute(variable: str, current: numpy.ndarray, limit: float, direction: bencher.variables.results.OptDir = OptDir.minimize) -> RegressionResult

   Fail when current mean violates an absolute limit in the direction of OptDir.

   Simple escape hatch: one directional rule against a fixed limit — no
   historical data required. For ``OptDir.minimize`` ``limit`` is a ceiling;
   for ``OptDir.maximize`` it's a floor; ``OptDir.none`` records a
   non-regressed result and leaves it to the caller to log. Same shape as
   :func:`detect_percentage` and :func:`detect_delta`; contrast with
   :func:`detect_adaptive` which needs history to estimate noise.


.. py:function:: _compute_history_arrays(da: xarray.DataArray) -> tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]

   Aggregate history into per-time means + per-sample scatter arrays.

   Returns ``(time_means, hist_samples_flat, hist_x_flat)`` or all-``None``
   when there is no history to summarise. Per-time means collapse every
   non-time dim into one scalar per run so detection and plotting both see a
   1-D series; the scatter arrays preserve per-repeat spread broadcast
   against the historical over_time coords.


.. py:function:: _attach_plot_metadata(result: RegressionResult, *, time_coord: numpy.ndarray, current_samples: numpy.ndarray, time_means: numpy.ndarray | None, hist_samples_flat: numpy.ndarray | None, hist_x_flat: numpy.ndarray | None) -> None

   Attach the history/current arrays a RegressionResult needs for replay plotting.


.. py:function:: detect_regressions(dataset: xarray.Dataset, bench_cfg, run_cfg) -> RegressionReport

   Run regression detection on a dataset with over_time dimension.

   For each numeric result variable, dispatches to the detector chosen by
   ``run_cfg.regression_method`` (``percentage``, ``adaptive``, ``delta``, or
   ``absolute``). ``absolute`` runs even with a single over_time point since
   it needs no baseline; every other method requires history.

   :param dataset: xarray Dataset with an over_time dimension.
   :param bench_cfg: BenchCfg with ``result_vars`` list.
   :param run_cfg: BenchRunCfg. Reads ``regression_method`` and its
      method-specific threshold: ``regression_percentage`` for
      ``percentage``; ``regression_mad`` (plus ``regression_percentage`` as a
      dual-band gate) for ``adaptive``; ``regression_delta`` for ``delta``;
      ``regression_absolute`` for ``absolute``.

   :returns: RegressionReport with one result per variable per fired detector/guard.
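
The noise estimates described for ``_robust_scale`` and ``_residual_sigma``, and the dual-band step gate described for :func:`detect_adaptive`, can be illustrated with a short NumPy sketch. This is not bencher's implementation. The names ``robust_scale``, ``residual_sigma`` and ``step_flag`` are hypothetical, the noise floor is assumed to be the plain MAD sigma of the history, only the ``minimize`` direction is handled, and the drift test is left out.

```python
import numpy as np

MAD_TO_SIGMA = 1.4826  # same value as the documented _MAD_TO_SIGMA constant


def _clean(values):
    """Flatten to 1-D float and drop NaNs."""
    v = np.asarray(values, dtype=float).ravel()
    return v[~np.isnan(v)]


def robust_scale(values):
    """Return (median, MAD-based sigma) for a 1-D numeric array."""
    v = _clean(values)
    median = float(np.median(v))
    mad = float(np.median(np.abs(v - median)))
    return median, MAD_TO_SIGMA * mad


def residual_sigma(values):
    """Estimate step-to-step noise as MAD(diff) * 1.4826 / sqrt(2)."""
    diffs = np.diff(_clean(values))
    mad = float(np.median(np.abs(diffs - np.median(diffs))))
    return float(MAD_TO_SIGMA * mad / np.sqrt(2.0))


def step_flag(history, current, regression_mad=3.5, regression_percentage=None):
    """Hypothetical step test for a lower-is-better metric.

    Flags when the current mean sits more than ``regression_mad`` sigmas
    above the historical median. If ``regression_percentage`` is given,
    the percent change must exceed it as well (dual-band AND gate).
    """
    baseline, sigma = robust_scale(history)
    if sigma <= 0.0 or baseline == 0.0:
        # Assumption of this sketch only: degenerate inputs are rejected.
        raise ValueError("sketch assumes non-zero spread and baseline")
    current_mean = float(np.mean(_clean(current)))
    flagged = (current_mean - baseline) / sigma > regression_mad
    if regression_percentage is not None:
        change_percent = 100.0 * (current_mean - baseline) / abs(baseline)
        flagged = flagged and change_percent > regression_percentage
    return bool(flagged)


history = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3])
print(robust_scale(history))
print(step_flag(history, [11.0, 11.1]))
print(step_flag(history, [11.0, 11.1], regression_percentage=20.0))

# A steady upward drift inflates the plain MAD sigma but not the
# first-difference estimate, which stays close to the injected noise level.
rng = np.random.default_rng(0)
drifting = np.linspace(0.0, 50.0, 2000) + rng.normal(0.0, 1.0, 2000)
print(residual_sigma(drifting), robust_scale(drifting)[1])
```

In this sketch the second ``step_flag`` call is suppressed by the percent floor even though the MAD test alone would fire, which is the behaviour the ``regression_percentage`` parameter is documented to provide.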