Progress in lexicographic data in Wikidata 2024

From Simia
Revision as of 00:00, 22 January 2025 by Denny

Here are some highlights of the progress in lexicographic data in Wikidata in 2024:

  • Hausa: jumped from 1.5% coverage right to 40%
  • Danish: made another huge jump forward, increasing the number of forms from 170k to 570k, form coverage from 33% to 52%, and token coverage from 87% to 92%
  • Italian: made another huge push, increasing the number of forms from 290k to 410k and the coverage from 83% to 93%
  • Spanish: kept pushing forward, increasing the number of forms from 440k to 560k and the coverage from 91.3% to 91.8%
  • Norwegian (Nynorsk): increased the number of forms from 67k to 88k, and coverage from 80% to 82%
  • Czech: increased the number of forms from 190k to 210k, and coverage from 63% to 69%
  • Tamil: almost doubled the number of forms from 3800 to 6600, increasing coverage from 8% to 11%
  • Breton: added 1000 new forms, increasing the coverage from 67% to 71%
  • Croatian: increased from 4k to 5.5k forms, improving coverage from 45% to 48%

What does coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that each additional percentage point is much harder to gain than the previous one: an increase from 1% to 2% usually needs far less work than one from 91% to 92%.
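To make the definition concrete, here is a minimal sketch in Python of how such a number could be computed. It is not the actual pipeline behind the figures above: the tokenizer (lowercased runs of letters), the function name `coverage`, and the toy data are all assumptions made for illustration. It also assumes that "form coverage" means the share of distinct word types found among the known forms, while "token coverage" counts every occurrence.

```python
import re
from collections import Counter


def coverage(text, known_forms):
    """Return (token_coverage, form_coverage) of `text` against `known_forms`.

    token_coverage: share of word occurrences whose form is known.
    form_coverage:  share of distinct word types whose form is known
                    (assumed reading of "form coverage").
    """
    # Naive tokenizer (an assumption): lowercase runs of Unicode letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0

    known = {form.lower() for form in known_forms}
    covered_tokens = sum(n for tok, n in counts.items() if tok in known)
    covered_types = sum(1 for tok in counts if tok in known)

    return covered_tokens / sum(counts.values()), covered_types / len(counts)


# Toy example: hypothetical text and a hypothetical set of known forms.
text = "the cat saw the other cat and the dog"
forms = {"the", "cat", "dog"}
token_cov, form_cov = coverage(text, forms)
print(f"token coverage: {token_cov:.1%}, form coverage: {form_cov:.1%}")
```

In the toy example a few frequent forms cover most occurrences while covering only half of the distinct words, which is in line with token coverage being higher than form coverage in the Danish numbers above.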

See also last year's progress.
