
Methodology

SenseBench scoring, verification, and leaderboard ranking rules.

Scoring

Accuracy is the fraction of dataset items whose predicted WordNet sense key is a member of that item's gold sense key set.
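A minimal sketch of this rule, assuming predictions and gold labels are held in dictionaries keyed by item id. The function name, the item ids, the sense keys, and the choice to count a missing prediction as incorrect are illustrative assumptions, not SenseBench's actual code.

```python
def accuracy(predictions, gold):
    """predictions: item id -> predicted sense key (or absent).
    gold: item id -> set of acceptable sense keys."""
    if not gold:
        raise ValueError("gold must contain at least one item")
    # An item is correct when its predicted key is in the gold set.
    # Assumption: an item with no prediction counts as incorrect.
    correct = sum(
        1 for item_id, keys in gold.items()
        if predictions.get(item_id) in keys
    )
    return correct / len(gold)


# Hypothetical items; ids and sense keys are for illustration only.
gold = {
    "item-1": {"bank%1:14:00::"},
    "item-2": {"run%2:38:00::", "run%2:38:04::"},
    "item-3": {"plant%1:03:00::"},
    "item-4": {"bass%1:05:00::"},
}
predictions = {
    "item-1": "bank%1:14:00::",   # correct
    "item-2": "run%2:38:04::",    # correct: any key in the gold set counts
    "item-3": "plant%1:20:00::",  # incorrect
    "item-4": "bass%1:05:00::",   # correct
}
score = accuracy(predictions, gold)  # 3 of 4 items correct
```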

Confidence intervals are 95% bootstrap intervals over per-item correctness, computed with a fixed seed and shown as a ± half-width next to accuracy.
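One way such an interval could be computed is a percentile bootstrap over the 0/1 correctness vector. The sketch below assumes that method; the resample count, the seed value, and the function name are placeholders rather than the values SenseBench uses.

```python
import random


def bootstrap_half_width(correct, n_resamples=1000, seed=0, level=0.95):
    """correct: list of 0/1 flags, one per item.
    Returns the half-width of a percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(correct)
    # Resample items with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return (hi - lo) / 2


# Hypothetical run: 80 correct items out of 100.
flags = [1] * 80 + [0] * 20
half_width = bootstrap_half_width(flags)
```

Because the seed is fixed, repeated calls on the same correctness vector return the same half-width.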

Rank ranges list the positions a run could plausibly occupy among the visible rows given overlapping 95% confidence intervals.
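The exact rule is not spelled out here, so the following is only one plausible reading: a run's best rank is one plus the number of rows whose interval lies entirely above it, and its worst rank is the row count minus the number of rows whose interval lies entirely below it. All names and figures are hypothetical.

```python
def rank_ranges(rows):
    """rows: run name -> (accuracy, half_width).
    Returns run name -> (best_rank, worst_rank)."""
    out = {}
    for name, (acc, hw) in rows.items():
        lo, hi = acc - hw, acc + hw
        others = [(a, h) for other, (a, h) in rows.items() if other != name]
        # Rows whose whole interval sits above this run's interval.
        clearly_better = sum(1 for a, h in others if a - h > hi)
        # Rows whose whole interval sits below this run's interval.
        clearly_worse = sum(1 for a, h in others if a + h < lo)
        out[name] = (1 + clearly_better, len(rows) - clearly_worse)
    return out


# Hypothetical visible rows.
rows = {
    "run-a": (0.82, 0.02),
    "run-b": (0.80, 0.02),
    "run-c": (0.70, 0.03),
}
ranges = rank_ranges(rows)
# run-a and run-b overlap, so each could be 1st or 2nd; run-c is 3rd.
```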

The compare view tests paired per-item differences between runs on the same dataset version with McNemar's test, which is far more sensitive than comparing overlapping intervals.
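As an illustration, McNemar's test uses only the discordant items, those that exactly one of the two runs gets right. The sketch below uses the exact binomial form of the test; whether the compare view uses the exact or the chi-square variant is not stated, so treat that choice and the function name as assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """correct_a, correct_b: 0/1 flags for the same items, in the same order.
    Returns (b, c, two_sided_p_value)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("runs must cover the same items")
    # b: A right, B wrong; c: A wrong, B right.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant items, no evidence of a difference
    # Under the null hypothesis, discordant items split 50/50 between runs.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical paired results over 34 items.
run_a = [1] * 8 + [0] * 1 + [1] * 20 + [0] * 5
run_b = [0] * 8 + [1] * 1 + [1] * 20 + [0] * 5
b, c, p_value = mcnemar_exact(run_a, run_b)
```

Items that both runs get right or both get wrong do not affect the p-value, which is why the paired test can separate runs whose marginal intervals overlap.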

Reference baselines (MFS, BEM, ESCHER, ConSeC) are scored from per-item system predictions on exactly the same dataset items as the model runs.

Verification

Every public run is reloaded, replayed, and checked before it appears on the site.

Verification checks run metadata, prompt references, dataset hashes, candidate sets, raw output extraction, vote decisions, and correctness.

Ranking

Runs sort by higher accuracy, then lower cost per million items when available, then newer creation time.

The default leaderboard view lists every verified run; the collapsed view keeps only the best verified run per model and dataset version, across prompts and reasoning efforts.
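The sort order and the collapsed view can be sketched as a composite sort key followed by a first-seen filter per model and dataset version. The `Run` record and its field names are hypothetical, and two choices are assumptions: runs without a known cost sort after runs with one at equal accuracy, and "best" in the collapsed view means first in the sort order.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Run:  # hypothetical record, not the site's schema
    name: str
    model: str
    dataset_version: str
    accuracy: float
    cost_per_1m: Optional[float]
    created_at: datetime


def sort_key(run):
    # Assumption: a missing cost sorts after any known cost.
    cost = run.cost_per_1m if run.cost_per_1m is not None else float("inf")
    # Higher accuracy first, then lower cost, then newer creation time.
    return (-run.accuracy, cost, -run.created_at.timestamp())


def leaderboard(runs):
    return sorted(runs, key=sort_key)


def collapsed(runs):
    best = {}
    for run in leaderboard(runs):
        # The first run seen for each (model, dataset version) is its best.
        best.setdefault((run.model, run.dataset_version), run)
    return list(best.values())


def day(d):
    return datetime(2024, 1, d, tzinfo=timezone.utc)


runs = [
    Run("a", "model-x", "v1", 0.80, 12.0, day(1)),
    Run("b", "model-x", "v1", 0.80, 9.0, day(2)),
    Run("c", "model-y", "v1", 0.80, None, day(3)),
    Run("d", "model-y", "v1", 0.75, 5.0, day(4)),
    Run("e", "model-z", "v1", 0.80, 9.0, day(5)),
]
full_order = [r.name for r in leaderboard(runs)]
collapsed_order = [r.name for r in collapsed(runs)]
```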

Self-Hosted Runs

Self-hosted runs record the GPU machine they ran on and a benchmark time that covers only the per-item evaluation loop, excluding model download, weight loading, and inference engine startup.

Machine-hours per 1M items is the benchmark time divided by the item count, scaled to one million items and expressed in machine hours; it is comparable only across runs on the same GPU configuration.

When the machine's hourly rate is known, run cost is estimated as machine time multiplied by that rate (cost source machine_time_estimate); otherwise cost is unavailable.
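The arithmetic behind these two figures can be sketched as follows. The function names and example numbers are hypothetical, and the sketch assumes that "machine time" in the cost estimate means the benchmark time.

```python
def machine_hours_per_1m(benchmark_seconds, item_count):
    """Benchmark time per item, scaled to one million items, in hours."""
    if item_count <= 0:
        raise ValueError("item_count must be positive")
    return benchmark_seconds * 1_000_000 / (3600 * item_count)


def estimated_run_cost(benchmark_seconds, hourly_rate):
    """Returns (cost, cost_source); cost is None when the rate is unknown."""
    if hourly_rate is None:
        return None, None
    # Assumption: the machine is billed only for the benchmark time.
    return benchmark_seconds / 3600 * hourly_rate, "machine_time_estimate"


# Hypothetical run: 5,000 items in 30 minutes on a machine at 3.00 per hour.
hours_per_1m = machine_hours_per_1m(1800, 5000)
cost, source = estimated_run_cost(1800, 3.0)
no_cost, no_source = estimated_run_cost(1800, None)
```

In this example the run takes 0.5 machine hours, which scales to 100 machine-hours per 1M items and an estimated run cost of 1.50.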

Comparing Across GPUs and Quantization

Self-hosted rows record the quantization used (for example fp8 or bf16) alongside the GPU. The same model may appear at different quantization on different GPUs because each GPU is run at its best practical configuration: native fp8 on H100 and H200, and bf16 on A100, which has no native fp8 hardware. A cross-GPU accuracy difference for one model therefore reflects both the hardware and the quantization, and should not be attributed to the GPU alone.

Quantized inference is not bit-identical across GPU architectures, so the same model under greedy decoding can produce slightly different accuracy on different GPUs. Small cross-GPU accuracy differences for an identical model and quantization are expected and are a property of the kernels, not a measurement error.

Throughput is measured at a fixed per-GPU concurrency and reported as machine-hours per 1M items, so figures are comparable at a standard load rather than at each model's individually tuned optimum. Accuracy is computed under greedy decoding (temperature 0) and is deterministic for a given set of weights, quantization, and GPU architecture; reported confidence intervals are fixed-seed bootstrap intervals, and pairwise comparisons use McNemar's test.

The per-item generation cap is a runaway guard, not a scoring knob: it is set high enough that compliant models reach their answer and stop at the end-of-sequence token well before the cap. Submissions whose outputs are truncated by the cap on a material fraction of items are rejected by verification.
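A truncation check of this kind might look like the sketch below. The "length" and "stop" labels, the function names, and the threshold passed in the example are all hypothetical; the actual cutoff for a "material fraction" is not stated here.

```python
def truncated_fraction(finish_reasons):
    """finish_reasons: one label per item; "length" marks a cap hit."""
    if not finish_reasons:
        raise ValueError("finish_reasons must not be empty")
    hits = sum(1 for reason in finish_reasons if reason == "length")
    return hits / len(finish_reasons)


def passes_cap_check(finish_reasons, max_truncated_fraction):
    # The caller supplies the threshold; no default is implied.
    return truncated_fraction(finish_reasons) <= max_truncated_fraction


# Hypothetical run: 3 of 100 items stopped at the cap instead of at EOS.
reasons = ["stop"] * 97 + ["length"] * 3
fraction = truncated_fraction(reasons)
lenient = passes_cap_check(reasons, 0.05)  # placeholder threshold
strict = passes_cap_check(reasons, 0.01)   # placeholder threshold
```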