Difference between revisions of "Wikidata lexicographic data coverage for Croatian in 2023"

From Simia
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&diff=2044020185&oldid=1797161415 Changes in Croatian coverage in 2023]
 
 
* [https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/hr/Missing List of missing forms for Croatian]
 
 
Other languages that had great progress:
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/pa/Statistics&diff=2044024354&oldid=1869209736 Panjabi jumped right away from 0% to 71%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/el/Statistics&diff=2044019187&oldid=1754050586 Greek jumped from 0% to 45% right away]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/it/Statistics&diff=2028938353&oldid=1797161586 Italian made a huge jump from 52% to 82%] by increasing the number of forms from 9,000 to 286,000
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/tr/Statistics&diff=2044022567&oldid=1797162371 Turkish jumped from 0.9% to 22%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/sd/Statistics&diff=2019967207&oldid=1869209877 Sindhi climbed from 15% to 25%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/fa/Statistics&diff=2044019624&oldid=1797161189 Farsi climbed from 15% to 24%] increasing the number of forms from 4,000 to 33,000
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/pnb/Statistics&diff=2044024556&oldid=1869209814 Western Panjabi climbed from 36.9% to 47.9%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hi/Statistics&diff=2044020098&oldid=1797161375 Hindi climbed from 49.9% to 65.9%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/br/Statistics&diff=2038034728&oldid=1797162548 Breton increased from 56% to 67%]
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/nl/Statistics&diff=2044021248&oldid=1772624930 Dutch went from 20% to 29%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/fr/Statistics&diff=2044019870&oldid=1797161299 French from 82.9% to 86.9%], mostly by dealing better with apostrophes in the analysis
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/nn/Statistics&diff=2044024214&oldid=1797163178 Nynorsk pushed from 75% to 80%] by increasing the number of forms from 18,000 to 68,000
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/da/Statistics&diff=2044018934&oldid=1797160892 Danish from 83.9% to 86.9%] by increasing the number of forms in Wikidata from 65,000 to 170,000
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/de/Statistics&diff=2044019020&oldid=1797160946 German from 76% to 79%] by increasing the number of forms in Wikidata from 90,000 to 200,000
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/es/Statistics&diff=2044019397&oldid=1797161086 Spanish pushed from 88% to 91%] by increasing the number of forms from 280,000 to 430,000
 
 
Languages with the best coverage as of the end of 2023
 
# English 92.9%
# Spanish 91.3%
# Bokmal 89.1%
# Swedish 88.9%
# French 86.9%
# Danish 86.9%
# Latin 85.8%
# Italian 82.9%
# Estonian 81.2%
# Nynorsk 80.2%
# German 79.5%
# Basque 75.9%
# Portuguese 74.8%
# Malay 73.1%
# Panjabi 71.0%
# Slovak 67.8%
# Breton 67.3%
 
  
 
{{tag|Simia}}
 
 
<noinclude>{{simiapost|english}}</noinclude>
 

Latest revision as of 11:31, 3 January 2024

Last year, I published ambitious goals for the coverage of lexicographic data for Croatian in Wikidata. I missed my self-declared goal by a wide margin: I wanted to go from 40% coverage to 60%; instead, thanks to the help of contributors, we reached 45%.

We grew from 3,124 forms to 4,115, i.e. almost a thousand new forms, or about 32%. The coverage grew from around 11 million tokens to about 13 million tokens in the Croatian Wikipedia, or, as said, from 40% to 45%. The share of covered forms grew from 1.4% to 1.9%, which neatly illustrates how much harder it gets to gain coverage (thanks to Zipf's law): last year, we increased the covered forms by 1 percentage point, which translated into an increase in the overall coverage of occurrences of 35 percentage points. This year, although we increased the covered forms by another 0.5 percentage points, the overall coverage of occurrences rose by only 5 percentage points.
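
The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above; the variable names, the helper function and the "coverage points gained per point of covered forms" ratio are illustrative choices of mine, not something taken from the Wikidata statistics pages.

```python
# Figures quoted in the post (end of 2022 vs. end of 2023).
forms_2022, forms_2023 = 3124, 4115                   # Croatian forms in Wikidata
token_coverage_2022, token_coverage_2023 = 40.0, 45.0  # % of occurrences covered
form_coverage_2022, form_coverage_2023 = 1.4, 1.9      # % of distinct forms covered


def relative_growth(old: float, new: float) -> float:
    """Growth from old to new, in percent of the old value."""
    return (new - old) / old * 100


new_forms = forms_2023 - forms_2022
growth_percent = relative_growth(forms_2022, forms_2023)

# Gains expressed in percentage points.
token_gain_2023 = token_coverage_2023 - token_coverage_2022
form_gain_2023 = form_coverage_2023 - form_coverage_2022

# Illustrative ratio: coverage points gained per point of covered forms.
# The previous year's gains (35 points for 1 point) are the ones stated in the post.
yield_2022 = 35.0 / 1.0
yield_2023 = token_gain_2023 / form_gain_2023

print(f"new forms: {new_forms} ({growth_percent:.1f}% growth)")
print(f"coverage points per point of covered forms: "
      f"{yield_2022:.0f} last year, {yield_2023:.0f} this year")
```

Under these numbers the growth in forms comes out at roughly 31.7%, and each point of newly covered forms bought far fewer coverage points this year than last, which is the diminishing return one would expect from a Zipf-like distribution.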

But some of my energy was diverted from adding more lexicographic data to adding functions that help with adding and checking lexicographic data. We launched a new project, Wikifunctions, that can hold functions. There, we collected functions to create the regular forms for Croatian nouns. All nouns are now covered.
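
To give an idea of what "creating the regular forms" means, here is a minimal Python sketch for a single pattern: feminine nouns ending in -a, such as ''žena''. This is not the Wikifunctions implementation; the function name, the table of endings and the data layout are hypothetical, and the sketch deliberately ignores sound changes and irregular nouns (for example ''ruka'', whose dative is ''ruci'').

```python
# Hypothetical sketch: endings for ONE regular pattern (feminine nouns in -a).
# Not the actual Wikifunctions code; sound changes and exceptions are ignored.
A_DECLENSION_ENDINGS = {
    ("nominative", "singular"): "a",
    ("genitive", "singular"): "e",
    ("dative", "singular"): "i",
    ("accusative", "singular"): "u",
    ("vocative", "singular"): "o",
    ("locative", "singular"): "i",
    ("instrumental", "singular"): "om",
    ("nominative", "plural"): "e",
    ("genitive", "plural"): "a",
    ("dative", "plural"): "ama",
    ("accusative", "plural"): "e",
    ("vocative", "plural"): "e",
    ("locative", "plural"): "ama",
    ("instrumental", "plural"): "ama",
}


def regular_a_forms(lemma: str) -> dict[tuple[str, str], str]:
    """Return the 14 case/number forms for a regular feminine noun in -a."""
    if not lemma.endswith("a"):
        raise ValueError(f"{lemma!r} does not end in -a")
    stem = lemma[:-1]  # drop the final -a to get the stem
    return {slot: stem + ending for slot, ending in A_DECLENSION_ENDINGS.items()}


forms = regular_a_forms("žena")
for (case, number), form in forms.items():
    print(f"{case:<12} {number:<8} {form}")
```

A function like this is also useful for checking: its output can be compared against the forms already entered on a Lexeme to spot typos or gaps.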

I think that's still a great achievement and progress. Sure, we didn't meet the 60%, but the functions helped a lot to get to the 45%, and they will continue to benefit us in 2024 too. Again, I want to declare some goals, at least for myself, but not as ambitious with regard to coverage: the goal for 2024 is to reach 50% coverage of Croatian, and in addition, I would love us to have Lexeme forms available for verbs and adjectives, not only for nouns (for verbs, Ivi404 did most of the work already), and maybe even have functions ready for adjectives.
