bencher.report_export

Machine-readable export of benchmark results for agents and CI.

Bencher already computes per-metric verdicts, optimal values, and regression deltas during collection, but historically emitted them only as HTML, pickle, or prose markdown. This module turns those already-computed values into a stable JSON contract, so an automated workflow can read ground truth instead of scraping logs or parsing rendered reports.

Two artifacts:

result_to_dict() / result_to_json() — a single run’s metrics + regression verdicts + provenance (result.json).
compare_results() — an A/B diff between two independently-collected results (comparison.json). It reuses the over-time detect_regressions() path verbatim by stacking the two results on a synthetic 2-point over_time axis, so the A/B verdict shares identical direction/threshold semantics with the normal pipeline.

The contracts carry schema_version so downstream consumers can pin to a shape.
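As an illustration of pinning, a downstream consumer might check schema_version before reading anything else. This is a minimal sketch, not part of the module: load_contract and SUPPORTED_SCHEMA_VERSIONS are hypothetical names, and the only field it relies on is the schema_version key described above.

```python
import json

# Hypothetical: the set of contract versions this consumer was written against.
SUPPORTED_SCHEMA_VERSIONS = {1}


def load_contract(text: str) -> dict:
    """Parse an exported contract and refuse shapes this consumer does not know."""
    data = json.loads(text)
    version = data.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return data


# Usage with a hand-written stand-in for result.json
doc = load_contract('{"schema_version": 1, "bench_name": "demo"}')
print(doc["bench_name"])
```

Failing fast on an unknown version keeps a CI job from silently misreading a contract whose shape has changed.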

Attributes

SCHEMA_VERSION

Functions

`_provenance`(→ dict)	Best-effort provenance for a result (time-event label if recorded).
`_metric_entry`(→ dict)	Per-metric summary: identity + optimal value/inputs when computable.
`_coord_scalar`(values)	Coerce an optimal-input coordinate to a JSON-safe scalar.
`series_for_var`(→ list[dict])	Per-time-event mean/std/n for a scalar result var across the over_time axis.
`result_to_dict`(→ dict)	Build the stable, JSON-serializable contract for a single result.
`result_to_json`(→ pathlib.Path)	Write `result_to_dict()` for bench_res to path as JSON.
`_snapshot_ds`(→ xarray.Dataset)	Return a single-snapshot dataset (collapse a pre-existing over_time axis).
`_verdict`(→ str)	Classify a metric movement as improved / regressed / unchanged.
`compare_results`(→ dict)	Diff two independently-collected results into an A/B comparison contract.
`comparison_to_json`(→ pathlib.Path)	Write `compare_results()` for the two results to path as JSON.

Module Contents

bencher.report_export.SCHEMA_VERSION = 1

bencher.report_export._provenance(bench_res: bencher.results.bench_result.BenchResult) → dict: Best-effort provenance for a result (time-event label if recorded).

bencher.report_export._metric_entry(bench_res: bencher.results.bench_result.BenchResult, rv) → dict: Per-metric summary: identity + optimal value/inputs when computable.

bencher.report_export._coord_scalar(values): Coerce an optimal-input coordinate to a JSON-safe scalar.

bencher.report_export.series_for_var(ds: xarray.Dataset, var_name: str) → list[dict]

Per-time-event mean/std/n for a scalar result var across the over_time axis.

Reduces over every dimension except over_time (the sweep inputs plus repeat) with NaN-aware reductions, mirroring the history reduction used elsewhere. Because over_time coordinate labels can carry embedded newlines (long labels are wrapped in place), they are stripped back to single-line strings.

Returns one {time_event, mean, std, n} record per over-time event, with mean/std coerced finite-or-None so the output stays strict-JSON safe.
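The reduction and coercion described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: summarize_events and finite_or_none are hypothetical helpers, the samples are assumed to be already flattened per time event, the standard deviation is assumed to be the population form, and wrapped labels are assumed to be rejoined with a single space.

```python
import json
import math


def finite_or_none(x):
    """Return x as a float if it is finite, else None (strict JSON has no NaN/Infinity)."""
    if x is None:
        return None
    x = float(x)
    return x if math.isfinite(x) else None


def summarize_events(samples_by_event):
    """Hypothetical stand-in: one {time_event, mean, std, n} record per event."""
    records = []
    for label, samples in samples_by_event.items():
        # NaN-aware: missing samples are skipped rather than poisoning the mean.
        valid = [s for s in samples if not math.isnan(s)]
        n = len(valid)
        mean = sum(valid) / n if n else None
        # Assumption: population standard deviation (divide by n).
        std = math.sqrt(sum((v - mean) ** 2 for v in valid) / n) if n else None
        records.append({
            # Assumption: wrapped labels are rejoined with single spaces.
            "time_event": " ".join(str(label).split()),
            "mean": finite_or_none(mean),
            "std": finite_or_none(std),
            "n": n,
        })
    return records


records = summarize_events({"2024-01-01\nnightly": [1.0, 3.0, float("nan")]})
# allow_nan=False makes json.dumps raise if a NaN or Infinity slipped through.
print(json.dumps(records, allow_nan=False))
```

Passing allow_nan=False is one way to verify the strict-JSON property, since the default json.dumps would otherwise emit the non-standard NaN literal.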

bencher.report_export.result_to_dict(bench_res: bencher.results.bench_result.BenchResult, *, include_series: bool = False) → dict

Build the stable, JSON-serializable contract for a single result.

Parameters:

bench_res – A collected BenchResult (e.g. from plot_sweep(auto_plot=False) / Bench.collect()).
include_series – When True and the result carries an over_time axis, attach a per-time-event series (series_for_var()) to each scalar metric — the trend behind the regression verdict, for callers that render sparklines. Off by default so the base contract stays byte-stable.

Returns:

A dict with schema_version, bench_name, provenance, input_vars, over_time, metrics, and regressions.

bencher.report_export.result_to_json(bench_res: bencher.results.bench_result.BenchResult, path: str | pathlib.Path, *, indent: int = 2, include_series: bool = False) → pathlib.Path: Write result_to_dict() for bench_res to path as JSON.

bencher.report_export._snapshot_ds(bench_res: bencher.results.bench_result.BenchResult) → xarray.Dataset: Return a single-snapshot dataset (collapse a pre-existing over_time axis).

bencher.report_export._verdict(change_percent: float | None, direction: str, regressed: bool, threshold: float) → str

Classify a metric movement as improved / regressed / unchanged.

regressed comes straight from the detector (direction- and threshold-aware). An improvement is the mirror image: a beneficial-direction move whose magnitude clears the same threshold.
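The mirror-image rule can be sketched as follows. This is a hedged illustration, not the module's implementation: classify is a hypothetical name, the direction values "minimize" and "maximize" are assumptions, change_percent and threshold are assumed to share the same percentage scale, and "clears" is read here as greater than or equal to.

```python
def classify(change_percent, direction, regressed, threshold):
    """Illustrative verdict: 'improved', 'regressed', or 'unchanged'."""
    # The detector's flag wins: it is already direction- and threshold-aware.
    if regressed:
        return "regressed"
    # No measurable change (e.g. a zero or missing baseline) cannot be an improvement.
    if change_percent is None:
        return "unchanged"
    # Assumption: for "minimize" a drop is beneficial; otherwise a rise is.
    if direction == "minimize":
        beneficial = change_percent < 0
    else:
        beneficial = change_percent > 0
    # Mirror image of a regression: a beneficial move that clears the same threshold.
    if beneficial and abs(change_percent) >= threshold:
        return "improved"
    return "unchanged"


print(classify(-12.0, "minimize", False, 10.0))
print(classify(-3.0, "minimize", False, 10.0))
```

Reusing one threshold for both directions keeps the two verdicts symmetric: a move too small to count as a regression is also too small to count as an improvement.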

bencher.report_export.compare_results(baseline: bencher.results.bench_result.BenchResult, candidate: bencher.results.bench_result.BenchResult, *, run_cfg=None) → dict

Diff two independently-collected results into an A/B comparison contract.

Stacks baseline and candidate on a synthetic 2-point over_time axis (baseline first, candidate last) and runs the regular detect_regressions() over it, so the A/B verdict uses identical direction/threshold logic to the over-time path.

Parameters:

baseline – The reference result.
candidate – The result being compared against the baseline.
run_cfg – Optional BenchRunCfg controlling the detector. When omitted, a percentage comparison (regression_method='percentage') is used — the natural choice for a two-point A/B.

Returns:

A dict with schema_version, baseline/candidate provenance, per-metric metrics (with a verdict), and a summary count.

Raises:

ValueError – when the two results share no comparable scalar metric.
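A CI job consuming the comparison contract might tally verdicts and fail on any regression. The sketch below is only a guess at usage: tally_verdicts is a hypothetical helper, and the layout of metrics (assumed here to be a mapping from metric name to an entry with a "verdict" key) is not confirmed by this page, so adjust it to the actual comparison.json shape.

```python
from collections import Counter


def tally_verdicts(comparison):
    """Return (verdict counts, sorted names of regressed metrics)."""
    # ASSUMPTION: comparison["metrics"] maps metric name -> {"verdict": ...}.
    metrics = comparison["metrics"]
    counts = Counter(entry["verdict"] for entry in metrics.values())
    regressed = sorted(
        name for name, entry in metrics.items() if entry["verdict"] == "regressed"
    )
    return counts, regressed


# Hand-written stand-in for a loaded comparison.json
comparison = {
    "schema_version": 1,
    "metrics": {
        "latency": {"verdict": "regressed"},
        "throughput": {"verdict": "improved"},
        "memory": {"verdict": "unchanged"},
    },
}
counts, regressed = tally_verdicts(comparison)
if regressed:
    print("regressed metrics:", ", ".join(regressed))
```

In a real pipeline the final branch would typically exit non-zero so the regression blocks the build.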

bencher.report_export.comparison_to_json(baseline: bencher.results.bench_result.BenchResult, candidate: bencher.results.bench_result.BenchResult, path: str | pathlib.Path, *, run_cfg=None, indent: int = 2) → pathlib.Path: Write compare_results() for the two results to path as JSON.