Wikidata lexicographic data coverage for Croatian in 2023

From Simia
Revision as of 10:31, 3 January 2024 by Denny (talk | contribs)
Jump to navigation Jump to search

Last year, I published ambitious goals for the coverage of lexicographic data for Croatian in Wikidata. My self-proclaimed goal was widely missed: I wanted to go from 40% coverage to 60% -- instead, thanks to the help of contributors, we reached 45%.

We grew from 3,124 forms to 4,115, i.e. almost a thousand new forms, or about 31%. The coverage grew from around 11 million tokens to about 13 million tokens in the Croatian Wikipedia, or, as said, from 40% to 45%. The covered forms grew from 1.4% to 1.9%, which illustrates neatly the increased difficulty to reach more coverage (thanks to Zipf's law): last year, we increased covered forms by 1%, which translated to an overall coverage increase of occurrences by 35%. This year, although we increased the covered forms by another 0.5%, we only got an overall coverage increase of occurrences by 5%.

But some of my energy was diverted from adding more lexicographic data to adding functions that help with adding and checking lexicographic data. We launched a new project, Wikifunctions, that can hold functions. There, we collected functions to create the regular forms for Croatian nouns. All nouns are now covered.

I think that's still a great achievement and progress. Sure, we didn't meet the 60%, but the functions helped a lot to get to the 45%, and they will continue to benefit us 2024 too. Again, I want to declare some goals, at least for myself, but not as ambitious with regards to coverage: the goal for 2024 is to reach 50% coverage of Croatian, and in addition, I would love us to have Lexeme forms available for verbs and adjectives, not only for nouns, (for verbs, Ivi404 did most of the work already), and maybe even have functions ready for adjectives.

Other languages that had great progress:

Statistics

Languages with the best coverage as of the end of 2023

  1. English 92.9%
  2. Spanish 91.3%
  3. Bokmal 89.1%
  4. Swedish 88.9%
  5. French 86.9%
  6. Danish 86.9%
  7. Latin 85.8%
  8. Italian 82.9%
  9. Estonian 81.2%
  10. Nynorsk 80.2%
  11. German 79.5%
  12. Basque 75.9%
  13. Portuguese 74.8%
  14. Malay 73.1%
  15. Panjabi 71.0%
  16. Slovak 67.8%
  17. Breton 67.3%
Simia

Previous entry:
Star Trek's 32nd century
Next entry:
RIP Niklaus Wirth