Difference between revisions of "Progress in lexicographic data in Wikidata 2023"

From Simia
(Created page with "{{pubdate|{{subst:CURRENTDAY}}|{{subst:CURRENTMONTHNAME}}|{{subst:CURRENTYEAR}}}} Here are some highlights of the progress in lexicographic data in Wikidata in 2023 * [https...")
 
 
Line 3: Line 3:
 
Here are some highlights of the progress in lexicographic data in Wikidata in 2023
 
  
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/pa/Statistics&diff=2044024354&oldid=1869209736 Panjabi jumped right away from 0% to 71%]
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/el/Statistics&diff=2044019187&oldid=1754050586 Greek jumped from 0% to 45% right away]
 
 +
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/pa/Statistics&diff=2044024354&oldid=1869209736 Panjabi jumped right away from 0% to 71%] (but on an admittedly small corpus)
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/it/Statistics&diff=2028938353&oldid=1797161586 Italian made a huge jump from 52% to 82%] by increasing the number of forms from 9,000 to 286,000
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/tr/Statistics&diff=2044022567&oldid=1797162371 Turkish jumped from 0.9% to 22%]
 
Line 12: Line 12:
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hi/Statistics&diff=2044020098&oldid=1797161375 Hindi climbed from 49.9% to 65.9%]
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/br/Statistics&diff=2038034728&oldid=1797162548 Breton increased from 56% to 67%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&diff=2044020185&oldid=1797161415 Changes in Croatian coverage in 2023]
+
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&diff=2044020185&oldid=1797161415 Croatian increased from 40% to 45%]
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/nl/Statistics&diff=2044021248&oldid=1772624930 Dutch went from 20% to 29%]
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/fr/Statistics&diff=2044019870&oldid=1797161299 French from 82.9% to 86.9%], mostly by dealing better with apostrophes in the analysis
 
Line 19: Line 19:
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/de/Statistics&diff=2044019020&oldid=1797160946 German from 76% to 79%] by increasing the number of forms in Wikidata from 90,000 to 200,000
 
 
* [https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/es/Statistics&diff=2044019397&oldid=1797161086 Spanish pushed from 88% to 91%] by increasing the number of forms from 280,000 to 430,000
 
 +
 +
What does the coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that each additional percentage point is much harder to gain than the previous one: an increase from 1% to 2% usually needs far less work than one from 91% to 92%.
  
 
{{tag|Simia}}
 
 
<noinclude>{{simiapost|english}}</noinclude>
 

Latest revision as of 09:57, 5 January 2024

Here are some highlights of the progress in lexicographic data in Wikidata in 2023

What does the coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that each additional percentage point is much harder to gain than the previous one: an increase from 1% to 2% usually needs far less work than one from 91% to 92%.
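
As a rough illustration of the coverage figure described above, here is a minimal Python sketch. It is not the code that produces the Wikidata statistics pages: the function name `coverage`, the lowercase `\w+` tokenization, and the toy set of forms are all assumptions made for this example. The real analysis is more careful, for example about how apostrophes are handled.

```python
import re
from collections import Counter


def coverage(text: str, known_forms: set[str]) -> float:
    """Share of word occurrences in `text` that appear in `known_forms`.

    Hypothetical helper: tokenization is a naive lowercase word split,
    which is an assumption and not how the actual statistics are computed.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Every occurrence counts, so a frequent word contributes many times.
    covered = sum(n for token, n in counts.items() if token in known_forms)
    return covered / len(tokens)


sample_text = "The cat saw the other cat"
sample_forms = {"the", "cat"}  # toy stand-in for forms stored in Wikidata
print(f"{coverage(sample_text, sample_forms):.1%}")
```

In this toy text, two known forms already account for four of the six occurrences, while each remaining word adds only one occurrence. That is one way to picture why the later percentage points take more work than the early ones.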

Simia
