Progress in lexicographic data in Wikidata 2023

5 January 2024

Here are some highlights of the progress in lexicographic data in Wikidata in 2023:

Greek jumped from 0% to 45% right away
Panjabi jumped right away from 0% to 71% (but on an admittedly small corpus)
Italian made a huge jump from 52% to 82% by increasing the number of forms from 9,000 to 286,000
Turkish jumped from 0.9% to 22%
Sindhi climbed from 15% to 25%
Farsi climbed from 15% to 24% by increasing the number of forms from 4,000 to 33,000
Western Panjabi climbed from 36.9% to 47.9%
Hindi climbed from 49.9% to 65.9%
Breton increased from 56% to 67%
Croatian increased from 40% to 45%
Dutch went from 20% to 29%
French went from 82.9% to 86.9%, mostly by handling apostrophes better in the analysis
Nynorsk pushed from 75% to 80% by increasing the number of forms from 18,000 to 68,000
Danish went from 83.9% to 86.9% by increasing the number of forms in Wikidata from 65,000 to 170,000
German went from 76% to 79% by increasing the number of forms in Wikidata from 90,000 to 200,000
Spanish pushed from 88% to 91% by increasing the number of forms from 280,000 to 430,000

What does coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that each additional percentage point is much harder to gain than the previous one: an increase from 1% to 2% usually needs far less work than one from 91% to 92%.
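
To make the definition concrete, here is a minimal sketch of how such a coverage number could be computed. It is not the actual analysis code: the tokenizer, the lowercasing, and the function names `tokenize` and `coverage` are assumptions made for illustration, and the real pipeline may normalize text differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a maximal run of letters, lowercased.
    # Apostrophes and other punctuation split tokens, so "l'homme"
    # yields "l" and "homme".
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(text, known_forms):
    # Share of token occurrences (not distinct words) found in known_forms.
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for token, n in counts.items() if token in known_forms)
    return covered / total


# Toy data, purely illustrative: 3 of the 5 occurrences are known forms.
print(coverage("The cat saw the dog", {"the", "cat"}))
```

Because the measure counts occurrences rather than distinct words, a handful of very frequent forms can account for a large share of the coverage, while rare words each add very little.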
