Wikidata lexicographic data coverage for Croatian in 2023
Last year, I published ambitious goals for the coverage of lexicographic data for Croatian in Wikidata. I missed my self-proclaimed goal by a wide margin: I wanted to go from 40% coverage to 60%. Instead, thanks to the help of contributors, we reached 45%.
We grew from 3,124 forms to 4,115, i.e. almost a thousand new forms, or about 32%. The coverage of the Croatian Wikipedia grew from around 11 million tokens to about 13 million tokens, or, as said, from 40% to 45%. The share of covered forms grew from 1.4% to 1.9%, which neatly illustrates how much harder it gets to reach more coverage (thanks to Zipf's law): last year, we increased the covered forms by 1 percentage point, which translated to an overall coverage increase of occurrences by 35 percentage points. This year, although we increased the covered forms by another 0.5 percentage points, we only got an overall coverage increase of occurrences by 5 percentage points.
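To make the difference between covered forms and covered occurrences concrete, here is a minimal sketch of how both numbers could be computed. It is not the script behind the statistics in this post: the function name `coverage`, the whitespace tokenization, and the toy corpus are all assumptions for illustration only.

```python
from collections import Counter


def coverage(tokens, known_forms):
    """Return (form_coverage, token_coverage) as fractions between 0 and 1.

    form_coverage:  share of distinct word forms in the corpus that are known.
    token_coverage: share of all occurrences in the corpus that are known.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    covered = [form for form in counts if form in known_forms]
    form_cov = len(covered) / len(counts)
    token_cov = sum(counts[form] for form in covered) / total
    return form_cov, token_cov


# Toy corpus (made up, not real Wikipedia data): a few very frequent words
# dominate the occurrences, as Zipf's law predicts.
tokens = "i u je i na u i se je i grad".split()
known = {"i", "u"}  # hypothetical set of forms already in Wikidata

form_cov, token_cov = coverage(tokens, known)
print(f"forms covered: {form_cov:.1%}, occurrences covered: {token_cov:.1%}")
```

In the toy corpus, knowing only two of six distinct forms already covers more than half of the occurrences, because those two forms are the most frequent ones. Every further form adds less, which is the effect described above.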
But some of my energy was diverted from adding more lexicographic data to adding functions that help with adding and checking lexicographic data. We launched a new project, Wikifunctions, that can hold functions. There, we collected functions to create the regular forms for Croatian nouns. All nouns are now covered.
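To give an idea of what such a function does, here is a minimal sketch in Python for one regular pattern, feminine nouns ending in -a. It is not the actual Wikifunctions implementation: the name `regular_a_stem_forms` and the ending table are assumptions for illustration, and the sketch ignores sound changes and irregular nouns.

```python
# Endings for regular feminine nouns ending in -a (illustrative table).
A_STEM_ENDINGS = {
    ("nominative", "singular"): "a",
    ("genitive", "singular"): "e",
    ("dative", "singular"): "i",
    ("accusative", "singular"): "u",
    ("vocative", "singular"): "o",
    ("locative", "singular"): "i",
    ("instrumental", "singular"): "om",
    ("nominative", "plural"): "e",
    ("genitive", "plural"): "a",
    ("dative", "plural"): "ama",
    ("accusative", "plural"): "e",
    ("vocative", "plural"): "e",
    ("locative", "plural"): "ama",
    ("instrumental", "plural"): "ama",
}


def regular_a_stem_forms(lemma):
    """Build all 14 case/number forms from the nominative singular."""
    if not lemma.endswith("a"):
        raise ValueError(f"expected a lemma ending in -a, got {lemma!r}")
    stem = lemma[:-1]  # drop the final -a to get the stem
    return {key: stem + ending for key, ending in A_STEM_ENDINGS.items()}


forms = regular_a_stem_forms("žena")
print(forms[("genitive", "singular")])     # žene
print(forms[("instrumental", "plural")])   # ženama
```

A function like this can be used both to fill in the forms of a new Lexeme and to check the forms of an existing one against the regular pattern.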
I think that's still a great achievement and real progress. Sure, we didn't meet the 60%, but the functions helped a lot to get to the 45%, and they will continue to benefit us in 2024 too. Again, I want to declare some goals, at least for myself, but not as ambitious with regard to coverage: the goal for 2024 is to reach 50% coverage of Croatian. In addition, I would love us to have Lexeme forms available for verbs and adjectives, not only for nouns (for verbs, Ivi404 did most of the work already), and maybe even have functions ready for adjectives.
Other languages that had great progress:
- Panjabi jumped right away from 0% to 71%
- Greek jumped from 0% to 45% right away
- Italian made a huge jump from 52% to 82% by increasing the number of forms from 9,000 to 286,000
- Turkish jumped from 0.9% to 22%
- Sindhi climbed from 15% to 25%
- Farsi climbed from 15% to 24%, increasing the number of forms from 4,000 to 33,000
- Western Panjabi climbed from 36.9% to 47.9%
- Hindi climbed from 49.9% to 65.9%
- Breton increased from 56% to 67%
- Dutch went from 20% to 29%
- French from 82.9% to 86.9%, mostly by dealing better with apostrophes in the analysis
- Nynorsk pushed from 75% to 80% by increasing the number of forms from 18,000 to 68,000
- Danish from 83.9% to 86.9% by increasing the number of forms in Wikidata from 65,000 to 170,000
- German from 76% to 79% by increasing the number of forms in Wikidata from 90,000 to 200,000
- Spanish pushed from 88% to 91% by increasing the number of forms from 280,000 to 430,000
Languages with the best coverage as of the end of 2023
- English 92.9%
- Spanish 91.3%
- Bokmål 89.1%
- Swedish 88.9%
- French 86.9%
- Danish 86.9%
- Latin 85.8%
- Italian 82.9%
- Estonian 81.2%
- Nynorsk 80.2%
- German 79.5%
- Basque 75.9%
- Portuguese 74.8%
- Malay 73.1%
- Panjabi 71.0%
- Slovak 67.8%
- Breton 67.3%