Progress in lexicographic data in Wikidata 2024

22 January 2025

Here are some highlights of the progress in lexicographic data in Wikidata in 2024:

Hausa: coverage jumped from 1.5% straight to 40%
Danish: made another huge jump forward, increasing the number of forms from 170k to 570k, form coverage from 33% to 52%, and token coverage from 87% to 92%
Italian: made another huge push, increasing the number of forms from 290k to 410k, and the coverage from 83% to 93%
Spanish: Spanish also kept pushing forward, increasing the number of forms from 440k to 560k, and the coverage from 91.3% to 91.8%
Norwegian (Nynorsk): increased the number of forms from 67k to 88k, and coverage from 80% to 82%
Czech: increased the coverage from 63% to 69%, the number of forms from 190k to 210k
Tamil: almost doubled the number of forms from 3800 to 6600, increasing coverage from 8% to 11%
Breton: added 1000 new forms, increasing the coverage from 67% to 71%
Croatian: increased from 4k to 5.5k forms, improving coverage from 45% to 48%

What does coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that each additional percentage point is much harder to gain than the previous one: an increase from 1% to 2% usually needs far less work than an increase from 91% to 92%.
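As a rough illustration of how such a coverage number could be computed, here is a minimal Python sketch. It is not the tooling behind the figures above: the function names `tokenize` and `coverage` are hypothetical, the regex tokenizer is a simplification, and reading "form coverage" as the share of distinct words in the text that are known forms is an assumption made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def coverage(text, known_forms):
    """Return (token_coverage, form_coverage) of `text` against `known_forms`.

    token_coverage: share of word occurrences that are known forms.
    form_coverage:  share of distinct words that are known forms
                    (an assumed reading of the term, see the note above).
    `known_forms` is expected to be a set of lowercase strings.
    """
    counts = Counter(tokenize(text))
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    covered_tokens = sum(n for word, n in counts.items() if word in known_forms)
    covered_types = sum(1 for word in counts if word in known_forms)
    return covered_tokens / total_tokens, covered_types / len(counts)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    token_cov, form_cov = coverage(sample, {"the", "and"})
    print(f"token coverage: {token_cov:.1%}, form coverage: {form_cov:.1%}")
```

In this toy sample, two known forms cover 5 of the 8 occurrences but only 2 of the 5 distinct words. Frequent words account for most occurrences, which is why the first few percent are cheap and the last few require many rare forms.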
