Wikidata lexicographic data coverage for Croatian in 2024

From Simia

Last year I set an ambitious goal for growing the lexicographic data for Croatian in Wikidata in 2024. And, just like the year before, I missed it.

My goal was to grow the coverage to 50%, i.e. half of all the word occurrences in a Croatian corpus would be found in Wikidata. Instead, coverage grew from 45.5% to 47.9%, a gain of 2.4 percentage points. The number of forms grew from 4115 to 5506, which is 1391 new forms and a far bigger growth in forms than the year before. So, even though the goal was missed, the growth in Croatian is accelerating.
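To make the coverage number concrete, here is a minimal sketch of how such a figure could be computed, assuming coverage means the share of corpus tokens whose form is known. The function name `token_coverage`, the lowercasing step, and the toy data are my own assumptions for illustration, not the actual pipeline or corpus behind the numbers above.

```python
from collections import Counter


def token_coverage(corpus_tokens, known_forms):
    """Return the share of corpus tokens whose lowercased form is in known_forms."""
    # Count each distinct token once, so frequent words weigh more than rare ones.
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for form, n in counts.items() if form in known_forms)
    return covered / total


# Toy data, invented for illustration only (not the real corpus or form list).
corpus = "pas je velik i pas je brz".split()
forms = {"pas", "je", "i"}
coverage = token_coverage(corpus, forms)
print(f"coverage: {coverage:.1%}")
```

Counting tokens rather than distinct words means that adding a handful of very frequent forms moves the number far more than adding many rare ones.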

Part of that growth in forms is due to Google's Wordgraph release, a free dataset of words in about 40 languages that describe people, both demonyms and professions.

Do I want to set a goal again? After missing it twice, I am hesitant. Should I reduce the goal even further? Less than 50% sounds defeatist, and going back to 60% is obviously too much. So, yes, let's go for 50% again and see where it takes us this time. We are only 2.1 percentage points of coverage away from 50%, so that should be doable.

