Languages with the best lexicographic data coverage in Wikidata 2024

23 January 2025

Languages with the best coverage as of the end of 2024

English 93.1% (=, +0.2%)
Italian 92.6% (+7, +9.7%)
Danish 92.3% (+3, +5.4%)
Spanish 91.8% (-2, +0.5%)
Norwegian Bokmål 89.4% (-2, +0.3%)
Swedish 89.3% (-2, +0.4%)
French 87.6% (-2, +0.6%)
Latin 85.7% (-1, -0.1%)
Norwegian Nynorsk 81.8% (+1, +1.6%)
Estonian 81.3% (-1, +0.1%)
German 79.6% (=, +0.1%)
Malay 77.8% (+2, +4.7%)
Basque 75.9% (-1, =)
Portuguese 74.9% (-1, +0.1%)
Panjabi 73.3% (=, +2.3%)
Breton 71.1% (+1, +3.8%)
Czech 69.3% (NEW, +6.1%)
Slovak 67.8% (-2, =)
Igbo 67.8% (NEW, +2.0%)

What does the coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. The first number in the parentheses is the change in rank compared to last year, and the second number is the change in coverage compared to last year.
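To make the definition concrete, here is a minimal sketch of how such a coverage figure could be computed. It is an illustration only, not the script behind the numbers above: the function name `token_coverage`, the regex-based tokenization, and the lowercasing are all assumptions, and the sample text and form set are made up.

```python
import re
from collections import Counter


def token_coverage(text: str, known_forms: set[str]) -> float:
    """Share of token occurrences in `text` whose form is in `known_forms`.

    Assumption: tokens are runs of word characters, compared in lowercase,
    and `known_forms` is already lowercased. A real pipeline may tokenize
    and normalize differently.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Count every occurrence, not just distinct words, so frequent forms
    # weigh more than rare ones.
    covered = sum(n for token, n in counts.items() if token in known_forms)
    return covered / len(tokens)


# Hypothetical sample data, for illustration only.
sample_text = "The cat sat on the mat. The dog sat too."
sample_forms = {"the", "cat", "sat", "on", "dog"}
coverage = token_coverage(sample_text, sample_forms)
print(f"{coverage:.1%}")  # 80.0%
```

Because occurrences are counted rather than distinct words, adding a handful of very frequent forms can move the figure more than adding many rare ones.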

The list contains all languages where the data covers more than two thirds of the selected corpus.

English managed to keep the lead, but its margin over second place shrank from 1.6 percentage points last year to a mere 0.5 this year. Italian and Danish made huge jumps forward, with Italian increasing its coverage by almost 10 percentage points and rising seven ranks to second place. Compared to last year, two new languages made it onto the list, Czech and Igbo, both crossing the ⅔ threshold. Hindi is just behind at 66.5%.

The complete data is available on Wikidata.

Simia
