Label schemes

SenseBench scores a model's predicted WordNet sense against a gold answer key. The leaderboard's two selectors let you choose which gold answer key to score against and at what sense granularity: three answer keys by three granularities, nine combinations in all. This page explains what each one means; see the coarsening page for a deeper guide to the two coarse inventories.

Axis 1 — Score against (the gold label set)

Every lexEN item is the same target word in the same sentence context in all three label sets, but three projects each assigned its "correct" sense, and they do not always agree. You can score every run against any of the three gold label sets: lexEN v1, Maru 2022 (ALLamended), or Raganato 2017 (original).

On lexEN v1 these labels differ: the lexEN gold changes 211 labels relative to Maru 2022 and 1,004 relative to the original Raganato labels. Scoring against the original labels therefore reads lower than scoring against the corrected lexEN gold.
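
As a rough illustration of how such counts arise, the sketch below compares two answer keys item by item. The item IDs, the sense labels and the `count_label_changes` helper are all hypothetical, and the sketch assumes each item carries a single gold label; it is not SenseBench's own tooling.

```python
def count_label_changes(gold_a, gold_b):
    """Count items present in both answer keys whose gold label differs."""
    shared_ids = gold_a.keys() & gold_b.keys()
    return sum(1 for item_id in shared_ids if gold_a[item_id] != gold_b[item_id])


# Hypothetical toy answer keys (made-up item IDs and sense labels).
lexen_gold = {"item1": "bank#institution", "item2": "plant#factory", "item3": "bass#fish"}
maru_gold = {"item1": "bank#institution", "item2": "plant#factory", "item3": "bass#music"}
raganato_gold = {"item1": "bank#building", "item2": "plant#factory", "item3": "bass#music"}

print(count_label_changes(lexen_gold, maru_gold))      # 1
print(count_label_changes(lexen_gold, raganato_gold))  # 2
```

A prediction of `bass#fish` for `item3` would count as correct against the toy lexEN key and as wrong against the other two, which is how the choice of answer key moves a score.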

Axis 2 — Sense granularity

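Scores can be computed at three sense granularities: WordNet fine-grained, Glite coarse-grained, and CSI coarse-grained (Lacerra 2020). Both coarse inventories are explained in depth on the coarsening page.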
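
A minimal sketch of what coarse-grained scoring could look like, assuming a coarse inventory can be represented as a mapping from fine senses to coarse labels. The mapping, the sense labels and the helper names below are made up for illustration and are not taken from the real inventories; the coarsening page remains the reference for how they actually work.

```python
def coarsen(sense, fine_to_coarse):
    """Map a fine sense to its coarse label.

    Assumption: a sense missing from the mapping is kept as-is.
    """
    return fine_to_coarse.get(sense, sense)


def is_correct(predicted, gold, fine_to_coarse=None):
    """Compare a prediction with the gold label, optionally after coarsening both."""
    if fine_to_coarse is None:
        return predicted == gold  # fine-grained: exact sense match
    return coarsen(predicted, fine_to_coarse) == coarsen(gold, fine_to_coarse)


# Hypothetical toy inventory: two money-related senses share one coarse label.
toy_inventory = {
    "bank#institution": "FINANCE",
    "bank#building": "FINANCE",
    "bank#river": "LANDFORM",
}

print(is_correct("bank#building", "bank#institution"))                 # False
print(is_correct("bank#building", "bank#institution", toy_inventory))  # True
print(is_correct("bank#river", "bank#institution", toy_inventory))     # False
```

Under this assumption a near-miss between two closely related fine senses is forgiven at coarse granularity, while a confusion across coarse labels is still an error.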

The nine schemes

| Score against | WordNet fine-grained | Glite coarse-grained | CSI coarse-grained (Lacerra 2020) |
| --- | --- | --- | --- |
| lexEN v1 | lexEN v1 · WordNet fine-grained | lexEN v1 · Glite coarse-grained | lexEN v1 · CSI coarse-grained (Lacerra 2020) |
| Maru 2022 (ALLamended) | Maru 2022 (ALLamended) · WordNet fine-grained | Maru 2022 (ALLamended) · Glite coarse-grained | Maru 2022 (ALLamended) · CSI coarse-grained (Lacerra 2020) |
| Raganato 2017 (original) | Raganato 2017 (original) · WordNet fine-grained | Raganato 2017 (original) · Glite coarse-grained | Raganato 2017 (original) · CSI coarse-grained (Lacerra 2020) |

The default and official SenseBench score is lexEN v1 · WordNet fine-grained. Switching the selectors on the leaderboard re-scores every run and reference baseline live; each run page reports all nine numbers in one table.
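
To make the nine-number table concrete, here is a small sketch that scores one run under every combination of answer key and granularity. Everything in it is an assumption for illustration: the toy items, the made-up coarse mappings (which are not the real Glite or CSI inventories), the single-label gold format, and the `accuracy` and `score_all_schemes` helpers. It is not the leaderboard's implementation.

```python
from itertools import product


def accuracy(predictions, gold, fine_to_coarse=None):
    """Fraction of gold items predicted correctly.

    Assumptions: one gold label per item, a missing prediction counts as
    wrong, and senses absent from the coarse mapping are kept as-is.
    """
    mapping = fine_to_coarse or {}
    hits = 0
    for item_id, gold_sense in gold.items():
        predicted = predictions.get(item_id)
        if mapping.get(predicted, predicted) == mapping.get(gold_sense, gold_sense):
            hits += 1
    return hits / len(gold)


def score_all_schemes(predictions, gold_sets, granularities):
    """Return {(gold set name, granularity name): accuracy} for every pairing."""
    return {
        (gold_name, gran_name): accuracy(predictions, gold_sets[gold_name], granularities[gran_name])
        for gold_name, gran_name in product(gold_sets, granularities)
    }


# Hypothetical toy data: four items with made-up sense labels.
predictions = {"i1": "a1", "i2": "b1", "i3": "c2", "i4": "d1"}
gold_sets = {
    "lexEN v1": {"i1": "a1", "i2": "b1", "i3": "c1", "i4": "d1"},
    "Maru 2022 (ALLamended)": {"i1": "a1", "i2": "b1", "i3": "c1", "i4": "d2"},
    "Raganato 2017 (original)": {"i1": "a2", "i2": "b1", "i3": "c1", "i4": "d2"},
}
granularities = {
    "WordNet fine-grained": None,  # no coarsening: exact sense match
    "Glite coarse-grained": {"a1": "A", "a2": "A", "c1": "C", "c2": "C"},  # toy mapping
    "CSI coarse-grained (Lacerra 2020)": {  # toy mapping
        "a1": "A", "a2": "A", "c1": "C", "c2": "C", "d1": "D", "d2": "D",
    },
}

scores = score_all_schemes(predictions, gold_sets, granularities)
for (gold_name, gran_name), value in scores.items():
    print(f"{gold_name} · {gran_name}: {value:.2f}")
```

On this toy data the fine-grained score falls from 0.75 against the lexEN key to 0.25 against the original key, while the coarser mappings absorb some of the disagreement.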