Label schemes

SenseBench scores a model's predicted WordNet sense against a gold answer key. The leaderboard's two selectors choose which gold answer key to score against and at what sense granularity, giving nine combinations in all. This page explains what each one means; see the coarsening page for a deeper guide to the two coarse inventories.

Axis 1 — Score against (the gold label set)

Every lexEN item is the same target word in the same sentence context under all three label sets, but three projects each assigned its "correct" sense, and they do not always agree. Any run can be scored against any of the three gold label sets:

lexEN v1 (default) — Glite's gold for these items, produced by a same-protocol three-annotator consensus review that corrects annotation errors in the source labels. This is the official SenseBench score. See the lexEN repository.
Maru 2022 (ALLamended) — the amended gold labels from Maru et al. 2022, Nibbling at the Hard Core of Word Sense Disambiguation (ACL 2022), released with the WSD Hard Benchmark.
Raganato 2017 (original) — the original gold labels from the Raganato et al. 2017 unified evaluation framework, the long-standing standard for English all-words WSD. See the WSD Evaluation Framework.

The three label sets differ on the lexEN v1 items: the lexEN gold changes 211 labels relative to Maru 2022 and 1,004 relative to the original Raganato labels. A model that predicts the corrected sense is marked wrong wherever the original label is in error, so scoring against the original labels generally reads lower than scoring against the corrected lexEN gold.
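To make the effect of the gold-set selector concrete, here is a minimal sketch that scores one set of predictions against two gold label sets and counts how many labels differ between them. The function name `accuracy`, the instance ids, and the toy sense keys are assumptions made for illustration; they are not SenseBench's actual scorer or data, and the sketch assumes each item may carry more than one acceptable gold key.

```python
from typing import Dict, Set


def accuracy(predictions: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Fraction of gold items whose predicted sense key is among the gold keys."""
    if not gold:
        return 0.0
    # A missing prediction yields None, which is never in a gold key set.
    hits = sum(1 for item_id, keys in gold.items()
               if predictions.get(item_id) in keys)
    return hits / len(gold)


# Toy data: instance ids and labels are illustrative only.
gold_original = {
    "d0.s0.t0": {"bank%1:14:00::"},
    "d0.s1.t0": {"plant%1:03:00::"},
}
gold_corrected = {
    "d0.s0.t0": {"bank%1:14:00::"},
    "d0.s1.t0": {"plant%1:06:01::"},  # label changed by a (hypothetical) review
}
predictions = {
    "d0.s0.t0": "bank%1:14:00::",
    "d0.s1.t0": "plant%1:06:01::",
}

# Number of items whose gold label differs between the two sets.
changed = sum(1 for item_id in gold_corrected
              if gold_corrected[item_id] != gold_original[item_id])

score_original = accuracy(predictions, gold_original)
score_corrected = accuracy(predictions, gold_corrected)
```

In this toy case the same predictions score lower against the uncorrected labels than against the corrected ones, because the one changed label is exactly the item the model got right.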

Axis 2 — Sense granularity

WordNet fine-grained (default) — the prediction must match the exact WordNet 3.0 sense key. This is the traditional, strict WSD metric.
Glite coarse-grained — WordNet's narrow senses are grouped into broader Glite Dictionary concepts, a more application-oriented sense inventory. A prediction is coarse-correct when its WordNet sense maps to the same Glite concept as a gold sense. Because every fine-grained hit is also a coarse hit, coarse accuracy is always at least the fine-grained accuracy for the same gold set.
CSI coarse-grained (Lacerra 2020) — an independent, third-party coarsening that groups WordNet synsets into 45 broad semantic domains (CSI), included so coarse results can be checked against an inventory the dataset authors do not control. Same coarse-correct rule as Glite, with the CSI map swapped in.
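The coarse-correct rule described above can be sketched as follows. This is an illustrative sketch, not the benchmark's implementation: the function name `coarse_correct`, the concept ids, and the toy sense-to-concept map are hypothetical, and letting an exact fine-grained match always count as a coarse hit (even when a key is missing from the map) is an assumption chosen so that coarse accuracy can never fall below fine-grained accuracy.

```python
from typing import Dict, Set


def coarse_correct(pred_key: str,
                   gold_keys: Set[str],
                   sense_to_concept: Dict[str, str]) -> bool:
    """True if the prediction lands in the same coarse concept as any gold sense."""
    # Assumption: an exact fine-grained hit is always a coarse hit.
    if pred_key in gold_keys:
        return True
    pred_concept = sense_to_concept.get(pred_key)
    if pred_concept is None:
        return False  # unmapped prediction cannot match by concept
    gold_concepts = {sense_to_concept[k] for k in gold_keys
                     if k in sense_to_concept}
    return pred_concept in gold_concepts


# Hypothetical coarse map; swapping in a different map (for example a
# CSI-style one) changes the inventory but not the rule.
toy_map = {
    "bank%1:14:00::": "concept.financial_bank",
    "bank%1:06:00::": "concept.financial_bank",
    "bank%1:17:01::": "concept.river_bank",
}

gold = {"bank%1:14:00::"}
same_concept = coarse_correct("bank%1:06:00::", gold, toy_map)
other_concept = coarse_correct("bank%1:17:01::", gold, toy_map)
exact_match = coarse_correct("bank%1:14:00::", gold, toy_map)
```

Here the second prediction is wrong at fine granularity but correct at coarse granularity, which is the only direction in which the two metrics can disagree.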

Both coarse inventories are explained in depth on the coarsening page.

The nine schemes

Score against	WordNet fine-grained	Glite coarse-grained	CSI coarse-grained (Lacerra 2020)
lexEN v1	lexEN v1 · WordNet fine-grained	lexEN v1 · Glite coarse-grained	lexEN v1 · CSI coarse-grained (Lacerra 2020)
Maru 2022 (ALLamended)	Maru 2022 (ALLamended) · WordNet fine-grained	Maru 2022 (ALLamended) · Glite coarse-grained	Maru 2022 (ALLamended) · CSI coarse-grained (Lacerra 2020)
Raganato 2017 (original)	Raganato 2017 (original) · WordNet fine-grained	Raganato 2017 (original) · Glite coarse-grained	Raganato 2017 (original) · CSI coarse-grained (Lacerra 2020)

The default and official SenseBench score is lexEN v1 · WordNet fine-grained. Switching the selectors on the leaderboard re-scores every run and reference baseline live; each run page reports all nine numbers in one table.
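The nine scheme names are simply the cross product of the two selectors. A small sketch, with the variable names (`GOLD_SETS`, `GRANULARITIES`, `SCHEMES`, `DEFAULT_SCHEME`) assumed for illustration rather than taken from the leaderboard's code:

```python
from itertools import product

# Selector values as they appear on this page.
GOLD_SETS = [
    "lexEN v1",
    "Maru 2022 (ALLamended)",
    "Raganato 2017 (original)",
]
GRANULARITIES = [
    "WordNet fine-grained",
    "Glite coarse-grained",
    "CSI coarse-grained (Lacerra 2020)",
]

# One scheme per (gold set, granularity) pair, in table order.
SCHEMES = [f"{gold} · {grain}" for gold, grain in product(GOLD_SETS, GRANULARITIES)]

# The first pair is the default and official score.
DEFAULT_SCHEME = SCHEMES[0]
```

Iterating over `SCHEMES` in this order reproduces the table row by row, one entry per number on a run page.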