bencher.report_export

Machine-readable export of benchmark results for agents and CI.

Bencher already computes per-metric verdicts, optimal values, and regression deltas during collection — but historically only emitted them as HTML, pickle, or human-prose markdown. This module turns those already-computed values into a stable JSON contract so an automated workflow can read ground truth instead of scraping logs or parsing rendered reports.

Two artifacts:

  • result_to_dict() / result_to_json() — a single run’s metrics + regression verdicts + provenance (result.json).

  • compare_results() — an A/B diff between two independently-collected results (comparison.json). It reuses the over-time detect_regressions() path verbatim by stacking the two results on a synthetic 2-point over_time axis, so the A/B verdict shares identical direction/threshold semantics with the normal pipeline.

The contracts carry schema_version so downstream consumers can pin to a shape.
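As a rough illustration of pinning to a shape, a consumer might check schema_version before reading anything else. This is a minimal sketch: `SUPPORTED_SCHEMA_VERSIONS` and `load_contract` are hypothetical consumer-side names, not part of bencher, and the payload below is a stand-in rather than real output.

```python
import json

# Hypothetical consumer-side pin; bencher itself only emits schema_version.
SUPPORTED_SCHEMA_VERSIONS = {1}


def load_contract(text: str) -> dict:
    """Parse an exported contract and refuse unknown schema versions."""
    data = json.loads(text)
    version = data.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return data


# Stand-in payload; a real one would be read from result.json.
example_text = json.dumps({"schema_version": 1, "bench_name": "example_bench"})
contract = load_contract(example_text)
print(contract["bench_name"])
```

Failing loudly on an unknown version keeps an agent from silently misreading a contract whose shape has changed.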

Attributes

SCHEMA_VERSION

Functions

_provenance(→ dict)

Best-effort provenance for a result (time-event label if recorded).

_metric_entry(→ dict)

Per-metric summary: identity + optimal value/inputs when computable.

_coord_scalar(values)

Coerce an optimal-input coordinate to a JSON-safe scalar.

result_to_dict(→ dict)

Build the stable, JSON-serializable contract for a single result.

result_to_json(→ pathlib.Path)

Write result_to_dict() for bench_res to path as JSON.

_snapshot_ds(→ xarray.Dataset)

Return a single-snapshot dataset (collapse a pre-existing over_time axis).

_verdict(→ str)

Classify a metric movement as improved / regressed / unchanged.

compare_results(→ dict)

Diff two independently-collected results into an A/B comparison contract.

comparison_to_json(→ pathlib.Path)

Write compare_results() for the two results to path as JSON.

Module Contents

bencher.report_export.SCHEMA_VERSION = 1

bencher.report_export._provenance(bench_res: bencher.results.bench_result.BenchResult) → dict

Best-effort provenance for a result (time-event label if recorded).

bencher.report_export._metric_entry(bench_res: bencher.results.bench_result.BenchResult, rv) → dict

Per-metric summary: identity + optimal value/inputs when computable.

bencher.report_export._coord_scalar(values)

Coerce an optimal-input coordinate to a JSON-safe scalar.
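The source does not show how the coercion is done. One plausible approach, sketched here under assumptions, is to unwrap NumPy-style scalars through their `.item()` method and map non-finite floats to null. `coord_scalar_sketch` is a hypothetical stand-in and may differ from what `_coord_scalar` actually does.

```python
import math


def coord_scalar_sketch(value):
    """Hypothetical stand-in for _coord_scalar; not bencher's implementation."""
    # Unwrap single-element containers such as a length-1 coordinate list.
    if isinstance(value, (list, tuple)) and len(value) == 1:
        value = value[0]
    # NumPy scalars and 0-d arrays expose .item(), which yields a Python scalar.
    item = getattr(value, "item", None)
    if callable(item):
        try:
            value = item()
        except (TypeError, ValueError):
            # Multi-element arrays cannot be reduced to one scalar.
            return str(value)
    # Strict JSON has no NaN or Infinity, so map non-finite floats to null.
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # Assumed fallback: stringify anything else (timestamps, enums, ...).
    return str(value)
```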

bencher.report_export.result_to_dict(bench_res: bencher.results.bench_result.BenchResult) → dict

Build the stable, JSON-serializable contract for a single result.

Parameters:

bench_res – A collected BenchResult (e.g. from plot_sweep(auto_plot=False) / Bench.collect()).

Returns:

A dict with schema_version, bench_name, provenance, input_vars, over_time, metrics, and regressions.

bencher.report_export.result_to_json(bench_res: bencher.results.bench_result.BenchResult, path: str | pathlib.Path, *, indent: int = 2) → pathlib.Path

Write result_to_dict() for bench_res to path as JSON.
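The signature suggests a thin wrapper: serialize the contract dict, write it, and return the path. A minimal sketch of that pattern follows. `write_contract` is a hypothetical helper, and creating missing parent directories is an assumption, not documented behaviour of result_to_json().

```python
import json
import pathlib
import tempfile


def write_contract(contract: dict, path, *, indent: int = 2) -> pathlib.Path:
    """Hypothetical writer: dump a contract dict to path and return the path."""
    out = pathlib.Path(path)
    # Assumption: create missing parent directories rather than fail.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(contract, indent=indent) + "\n", encoding="utf-8")
    return out


# Round-trip a stand-in contract through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_contract({"schema_version": 1}, pathlib.Path(tmp) / "out" / "result.json")
    round_trip = json.loads(written.read_text(encoding="utf-8"))
```

Returning the path lets a CI step pass the artifact location straight to an upload or comparison stage.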

bencher.report_export._snapshot_ds(bench_res: bencher.results.bench_result.BenchResult) → xarray.Dataset

Return a single-snapshot dataset (collapse a pre-existing over_time axis).

bencher.report_export._verdict(change_percent: float | None, direction: str, regressed: bool, threshold: float) → str

Classify a metric movement as improved / regressed / unchanged.

regressed comes straight from the detector (direction- and threshold-aware). An improvement is the mirror image: a beneficial-direction move whose magnitude clears the same threshold.
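The mirror-image rule can be sketched as follows. This is an illustration under assumptions, not the module's code: the direction labels ("minimize" and "maximize"), the threshold being in the same percent units as change_percent, and the inclusive comparison are all guesses.

```python
from typing import Optional


def verdict_sketch(change_percent: Optional[float], direction: str,
                   regressed: bool, threshold: float) -> str:
    """Hypothetical classifier mirroring the description of _verdict."""
    # The detector's flag wins: it is already direction- and threshold-aware.
    if regressed:
        return "regressed"
    # Without a measurable change there is nothing to call an improvement.
    if change_percent is None:
        return "unchanged"
    # Assumed labels: "minimize" means lower is better, anything else higher.
    if direction == "minimize":
        beneficial = change_percent < 0
    else:
        beneficial = change_percent > 0
    # An improvement must clear the same threshold a regression would.
    if beneficial and abs(change_percent) >= threshold:
        return "improved"
    return "unchanged"
```

Using one threshold for both outcomes keeps the classification symmetric: a 3% drop in latency is no more an improvement than a 3% rise is a regression when the threshold is 5%.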

bencher.report_export.compare_results(baseline: bencher.results.bench_result.BenchResult, candidate: bencher.results.bench_result.BenchResult, *, run_cfg=None) → dict

Diff two independently-collected results into an A/B comparison contract.

Stacks baseline and candidate on a synthetic 2-point over_time axis (baseline first, candidate last) and runs the regular detect_regressions() over it, so the A/B verdict uses identical direction/threshold logic to the over-time path.

Parameters:
  • baseline – The reference result.

  • candidate – The result being compared against the baseline.

  • run_cfg – Optional BenchRunCfg controlling the detector. When omitted, a percentage comparison (regression_method='percentage') is used — the natural choice for a two-point A/B.

Returns:

A dict with schema_version, baseline/candidate provenance, per-metric metrics (with a verdict), and a summary count.

Raises:

ValueError – when the two results share no comparable scalar metric.
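To make the two-point idea concrete, here is a plain-dict illustration that does not touch bencher or xarray. Each shared metric becomes a series of two values (baseline first, candidate last), a percentage change is taken across it, and the verdicts are tallied. `compare_sketch`, its 5% threshold, the `lower_is_better` flag, and the output field names are all hypothetical. The real function delegates to detect_regressions() instead.

```python
from collections import Counter


def compare_sketch(baseline: dict, candidate: dict, *, threshold: float = 5.0,
                   lower_is_better: bool = True) -> dict:
    """Plain-dict illustration of a two-point A/B diff; not bencher's code."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no comparable scalar metric shared by both results")
    metrics = {}
    for name in shared:
        # The synthetic two-point series: baseline first, candidate last.
        first, last = baseline[name], candidate[name]
        change = None if first == 0 else (last - first) / abs(first) * 100.0
        if change is None or abs(change) < threshold:
            verdict = "unchanged"
        elif (change > 0) == lower_is_better:
            # The metric moved against the preferred direction.
            verdict = "regressed"
        else:
            verdict = "improved"
        metrics[name] = {"baseline": first, "candidate": last,
                         "change_percent": change, "verdict": verdict}
    summary = dict(Counter(m["verdict"] for m in metrics.values()))
    return {"metrics": metrics, "summary": summary}


report = compare_sketch({"latency_ms": 100.0, "rss_mb": 50.0},
                        {"latency_ms": 120.0, "rss_mb": 50.5})
```

A CI job could then fail the build whenever the tally contains any regressed entry.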

bencher.report_export.comparison_to_json(baseline: bencher.results.bench_result.BenchResult, candidate: bencher.results.bench_result.BenchResult, path: str | pathlib.Path, *, run_cfg=None, indent: int = 2) → pathlib.Path

Write compare_results() for the two results to path as JSON.