Ten years of Wikidata

Today it's ten years since Wikidata launched. A few memories.

It's been an amazing time. In the summer of 2011, people still didn't believe Wikidata would happen. In the fall of 2012, it was there.

Markus Krötzsch and I had been pushing for the idea of a Semantic Wikipedia since 2005. Semantic MediaWiki was born from that idea, Freebase and DBpedia launched in 2007, microformats in Wikipedia became a grassroots thing, but no one was working on the real thing at the Wikimedia Foundation.

With Elena Simperl at KIT we started the EU research project RENDER in 2010, involving Mathias Schindler at Wikimedia Deutschland. It was about knowledge diversity on the Web, still an incredibly important topic. In RENDER, we developed ideas for the flexible representation of knowledge, and how to deal with contradicting and incomplete information. We analysed Wikipedia to understand the necessity of these ideas.

In 2010, I was finishing my PhD at KIT, and got an invitation from Yolanda Gil to work at the ISI at the University of Southern California for a half-year sabbatical. There, Yolanda, Varun Ratnakar, Markus and I developed a prototype for Wikidata, which won third place in the ISWC Semantic Web Challenge that year.

In 2011, the Wikimedia Data Summit took place at the headquarters of O'Reilly in Sebastopol, CA, at the invitation of Tim O'Reilly and organised by Danese Cooper. There were folks from the Wikimedia Foundation, Freebase, DBpedia, Semantic MediaWiki, and O'Reilly; there was Guha, Mark Greaves, I think, and others. I think that's where it became clear that Wikidata would be feasible.

It's also where I first met Guha and where I admitted to him that I was kinda a fanboy. He invented MCF, RDF, had worked with Douglas Lenat on CYC, and later that year introduced He's now working on Data Commons. Check it out, it's awesome.

Mark Greaves, a former DARPA program officer, who then was working for Paul Allen at Vulcan, had been supporting Semantic MediaWiki for several years, and he really wanted to make Wikidata happen. He knew my PhD was done, and that I was thinking about my next step. I thought it would be academia, but he suggested I should write up a project proposal for Wikidata.

After six years of advocating for it, I understood that someone would need to step up to make it happen. With the support and confidence of so many people - Markus Krötzsch, Elena Simperl, Mark Greaves, Guha, Jamie Taylor, Rudi Studer, John Giannandrea, and others - I drafted the proposal.

The Board of the Wikimedia Foundation approved the proposal as a new Wikimedia project, but neither allocated the funding, nor directed the Foundation to do it. In fact, the Foundation was reluctant to take it on, unsure whether they would be able to host such a project development at that time. Back then, that was a wise decision.

Erik Möller, then CTO of the Foundation, was the driving force behind a major change: instead of turning the individual Wikipedias semantic, we would have a single Wikidata for all languages. Erik was also the one who had secured the domain for Wikidata. Many years prior.

Over the next half year and with the help of the Wikimedia Foundation, we secured funding from AI2 (Paul Allen), Google (who had acquired Freebase in the meantime), and the Gordon and Betty Moore Foundation, for a total of 1.3 million euros.

Other funders backed out because I insisted that the Wikidata ontology be entirely under the control of the community. They argued for professional ontologists, for reusing existing ontologies, or for using DBpedia to seed Wikidata. I said no. I firmly believed, and still believe, that the ontology has to be owned, created and maintained by the community. I invited the ontologists to join the project as community members, but to the best of my knowledge, they never made significant contributions. We did miss out on quite a bit of funding, though.

There we were. We had the funding and the project proposal, but no one to host us. We were even thinking of founding a new organisation, or hosting it at KIT, but due to the RENDER collaboration, Mathias Schindler had us talk with Pavel Richter, ED of Wikimedia Deutschland, and Pavel offered to host the development of Wikidata.

For Pavel and Wikimedia Deutschland this was a big step: the development team would significantly enlarge WMDE (almost doubling it in size, if I remember correctly), which would necessitate a sudden transformation and increased professionalisation of WMDE. But Pavel was ready for it, and managed this growth admirably.

On April 1, 2012, we started the development of Wikidata. On October 29, 2012, we launched the site.

The original launch was utterly useless. All you could do was create new pages with QIDs (the Q being a homage to Kamara, my wife), associate those QIDs with labels in many languages, and connect them to articles in Wikipedia, the so-called sitelinks. You could not add any statements yet. You could not connect items with each other. The sitelinks were not used anywhere. The labels were not used anywhere. As I said, the site was completely useless. And great fun, at least to me.
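
To make concrete how little an item held at launch, here is a minimal Python sketch. The `Item` class, its field names, and the `label` helper are hypothetical illustrations, not Wikibase's actual data model or API. The example labels and sitelinks for Q1490 (Tokyo) are assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for a launch-era item: an ID, labels, sitelinks.

    There are deliberately no statements and no links to other items.
    """
    qid: str
    labels: dict[str, str] = field(default_factory=dict)     # language code -> label
    sitelinks: dict[str, str] = field(default_factory=dict)  # wiki ID -> article title

    def label(self, language: str, fallback: str = "en") -> str:
        """Return the label in the requested language, else the fallback, else the ID."""
        return self.labels.get(language) or self.labels.get(fallback) or self.qid


# Illustrative values only (assumed, not copied from the live site).
tokyo = Item(
    qid="Q1490",
    labels={"en": "Tokyo", "de": "Tokio", "ja": "東京"},
    sitelinks={"enwiki": "Tokyo", "dewiki": "Tokio", "jawiki": "東京"},
)

print(tokyo.label("de"))  # Tokio
```

The identifier stays the same in every language, and only the labels vary. An item without any label still has a stable name, which is roughly the point of using opaque IDs.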

QIDs for entities are still often disparaged. Why QIDs? Why not just the English name? Isn't dbp:Tokyo much easier to understand than Q1490? It was an uphill battle ten years ago to overcome the anglocentricity of many people. Unfortunately, this has not changed much. I am thankful to the Wikimedia movement for being one of the places that encourages, values, and supports the multilingual approach of Wikidata.

Over the next few months, the first few Wikipedias were able to access the sitelinks from Wikidata, and started deleting the sitelinks from their Wikipedias. This led to a removal of more than 240 million lines of wikitext across the Wikipedias. 240 million lines that didn't need to be maintained anymore. In some languages, these lines constituted more than half of the content of the Wikipedia. In many languages, editing activity dropped dramatically at first, sometimes by 80%.

But then something happened. Those edits were mostly bots. And with those bots gone, humans were suddenly better able to see each other and build a more meaningful community. In many languages, this eventually led to increased community activity.

One of my biggest miscalculations when launching Wikidata was to entirely dismiss the possibility of a SPARQL endpoint. I thought that none of the existing open source triple stores would be performant enough. Peter Haase was instrumental in showing that I was wrong. Today, the SPARQL endpoint is an absolutely crucial piece of the Wikidata infrastructure, and is widely used to explore the dataset. And with its beautiful visualisations, I find it almost criminally underused. Unfortunately, the SPARQL endpoint is also the piece of infrastructure that worries us the most. The Wikimedia Foundation is working hard on figuring out the future for this service, and if you can offer substantial help, please reach out.
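
For readers who have never touched the endpoint, here is a small, hedged Python sketch of what a request to it can look like. It only builds the request URL and sends nothing. The helper names `labels_query` and `request_url` are hypothetical. The sketch assumes the public endpoint at `https://query.wikidata.org/sparql` and that it accepts `query` and `format` as URL parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public endpoint of the Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"


def labels_query(qid: str) -> str:
    """Build a SPARQL query asking for all labels of one item (hypothetical helper)."""
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {{ wd:{qid} rdfs:label ?label }}
""".strip()


def request_url(query: str) -> str:
    """Encode the query as URL parameters; no network request is made here."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


url = request_url(labels_query("Q1490"))
print(url)
```

Pasting the query text into the query service's web interface should work just as well. The URL form is merely what a script would send.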

Today, Wikidata has more than 1.4 billion statements about approximately 100 million topics. It is by far the most edited Wikimedia project, with more edits than the English, German, and French Wikipedia together - even though they are each a decade older than Wikidata.

Wikidata is widely used. Almost every time Wikipedia serves one of its 24 billion monthly page views. Or during the pandemic in order to centralise the data about COVID cases in India to make them available across the languages of India. By large companies answering questions and fulfilling tasks with their intelligent assistants, be it Google or Apple or Microsoft. By academia, where you will find thousands of research papers using Wikidata. By numerous Open Source projects, by one-off analyses by data scientists, by small enterprises using the dataset, by student programmers exploring and playing with it on the weekend, by spreadsheet enthusiasts enriching their data, by scientists, librarians and curators linking their datasets to Wikidata, and thus to each other. Already, more than 7,000 catalogs are linked to Wikidata, and thus to each other, really and substantially establishing a Web of linked data.

I will always remember the Amazon developer who approached me after a talk. He had used Wikidata to gather data about movies. I was surprised: Amazon owns IMDb, why would they ever use anything else for movies? He said that IMDb was great for what it had, but Wikidata complemented it in unexpected ways, offering many interesting connections between the movies and other topics which would be out of scope for IMDb.

Not to be misunderstood: knowledge bases such as IMDb are amazing, and Wikidata does not aim to replace them. They often have a clear scope, have a higher quality, and almost always a better coverage in their field than Wikidata ever can hope to have, or aims to have. And that's OK. Wikidata's goal is not to replace these knowledge bases. But to provide the connective tissue between the many knowledge bases out there. To connect them. To provide a common set of entities to work with. To turn the individual knowledge bases into a large interconnected Web of knowledge.

I am still surprised that Wikidata is not known more widely among developers. It always makes me smile with joy when I see yet another developer who just discovered Wikidata and writes an excited post about it and how much it helped them. In the last two weeks, I stumbled upon two projects that used Wikidata identifiers where I didn't expect them at all, just used them as if it was the most normal thing in the world. This is something I hope we will see even more in the future. I hope that Wikidata will become the common knowledge base that is ubiquitously used by a large swarm of intelligent applications. Not only to make these applications smarter, by knowing more about the world - but also by allowing these applications to exchange data with each other more effectively because they are using the same language.

And most importantly: Wikidata has a healthy, large, and comparatively friendly and diverse community. It is one of the most active Wikimedia projects, only trailing the English Wikipedia, and usually about as active as Commons.

Last time I checked, more than 400,000 people have contributed to Wikidata. For me, that is easily the most surprising number about the project. If you had asked me in 2012 how many people would contribute to Wikidata, I would have sheepishly hoped for a few hundred, maybe a few thousand. And I would have defensively explained why that's OK. I am humbled and awestruck by the fact that several hundred thousand people have contributed to an open knowledge base that is available to everyone, and that everyone can contribute to.

And that, I think, is the most important role that Wikidata plays. That it is a place that everyone can contribute to. That the knowledge base that everyone uses is not owned and gatekept by any one company or government, but that it is a common good that everyone can contribute to. That everyone with an internet connection can lend their voice to the sum of all knowledge.

We all own Wikidata. We are responsible for Wikidata. And we all benefit from Wikidata.

It has been an amazing ten years. I am looking forward to many more years of Wikidata, and to the many new roles that it will play in the years to come, and to the many people who will contribute to it.

Shoutout to the brilliant team that started the work on Wikidata: Lydia Pintscher, Abraham Taherivand, Daniel Kinzler, Jeroen De Dauw, Katie Filbert, Tobias Gritschacher, Jens Ohlig, John Blad, Daniel Werner, Henning Snater, and Silke Meyer.

And thank you for all these amazing pictures of cakes for Wikidata's birthday. (And if you're curious what is coming next: we are working on Wikifunctions and Abstract Wikipedia, in order to allow more people to contribute more knowledge to even more people!)

Markus Krötzsch ISWC 2022 keynote

A brilliant keynote by Markus Krötzsch for this year's ISWC.

"The era of standard semantics has ended"

Yes, yes! 100%! That idea was in the air for a long time, but Markus really captured it in clear and precise language.

This talk is a great birthday present for Wikidata's ten year anniversary tomorrow. Over the last years, the Wikidata community has defined numerous little pockets of semantics for various use cases, shared SPARQL queries to capture some of those, and identified and shared constraints and reasoning patterns. And Wikidata connects to thousands of external knowledge bases and authorities, each with their own constraints, which is only feasible because we can, in a much more fine-grained way, use the semantics we need for a given context. The same is true for the billions of triples out there, and how they can be brought together.

The middle part of the talk goes into theory, but make sure to listen to the passionate summary at 59:40, where he emphasises shared understanding, that knowledge is human, and the importance of community.

"Why have people ever started to share ontologies? What made people collaborate in this way?" Because knowledge is human. Because knowledge is often more valuable when it is shared. The data available on the Web of linked data, including Wikidata, Data Commons,, can be used in many, many ways. It provides a common foundation of knowledge that enables many things. We are far away from using it to its potential.

A remark on triples, because I am still thinking too much about them: yes to Markus's comments: "The world is not triples, but we make it triples. We break down the world into triples, but we don't know how to rebuild it. That what people model should follow the technical format is wrong, it should be the other way around" (rough quotes)

At 1:17:56, Markus calls back to our discussions of the Wikidata data model in 2012. I remember how he was strongly advocating for more standard semantics (as he says), and I was pushing for more flexible knowledge representations. It's great to see the synthesis in this talk.

Karl-Heinz Witzko

I had heard incredibly good things about the DSA adventure "Jenseits des Lichts". But also that it was very hard to run as a game master. I brought it up with Karl-Heinz Witzko, the author of the adventure, and he said he would run it for me. We would just have to find a time.

Whenever we met, we promised each other to find the time for it. I had bought the book, but of course not read it, and was always very curious what the adventure was all about.

Karli contributed his very own, unique voice to DSA. A work like DSA, a world like Aventurien, does not spring from the head of a single person; hundreds created it and contributed to it. And Karli's voice had a humor all its own, and expanded the world with perspectives and peculiarities that otherwise would never have been discovered. I read his novels with many a smile, I played and explored his solo adventures gladly and repeatedly, only his single group adventure I did not know. After his time with DSA, Karli wrote further novels and created further worlds.

On September 29, 2022, Karli left us. The name Karl-Heinz Witzko was struck from the "Book of the Present" and entered into the "Book of the Absent". Following old custom on Maraskan, Karli is now given the Sixteen Counsels for the road, and the Sixteen Demands are made of him. I would have loved to hear or read what Karli would have made of these.

Thank you for your words. Thank you for your time. Thank you for your humor.

Today I opened "Jenseits des Lichts" and started to read.

RIP Steve Wilhite

RIP Steve Wilhite, who worked on CompuServe chat for decades and was the lead of the CompuServe team that developed the GIF format, which is still widely used, and which made the World Wide Web a much more colorful and dynamic place by having a format that allowed for animations. Wilhite incorrectly insisted on GIF being pronounced Jif. Wilhite died on March 14, 2022 at the age of 74.

RIP Christopher Alexander

RIP Christopher Alexander, probably the most widely read actual architect in all of computer science. His work, particularly his book "A Pattern Language", was popularized, among others, by the Gang of Four and their Design Patterns work, and is frequently read and cited in Future of Programming and UX circles for the idea that everyone should be able to create, but in order to enable them, they need patterns that make creation possible. His work inspired Ward Cunningham when developing wikis and Will Wright when developing that most ungamelike of games, SimCity. Alexander died on March 17, 2022 at the age of 85.

Ante Vrandečić (1919-1944)

I knew that my father was named for his uncle. His other brother told me about him, and told me that he had become a prisoner of war and that they lost track of him. Back then, I didn't dare to ask on which side he was fighting, and when I would have dared to ask, it was too late.

Today, thanks to the increasing digitisation of older sources, their publication on the Web, and the Web being indexed, I accidentally stumbled upon a record about him in a three-thousand-page book, Volume 8 of the "Victims of the War 1941-1945" (Žrtve rata 1941-1945).

He was a soldier in the NOV i POJ (Yugoslav partisans), became a prisoner of war, and was killed by Germans during a transport in 1944. I don't know where he was captured, from where to where he was transported, or where he was killed.

My father, his namesake, then moved to Germany in the 1970s, where he and my mother built a new life for themselves and their children, and where I was born.

I have a lot of complicated emotions and thoughts.

A quick draft for a curriculum for Computer Science

The other day, on Facebook, I asked who comes closest to being a popularizer of ideas in Computer Science for a wider audience, which led to an interesting and insightful discussion.

Pat Hayes asked what I would consider the five (or so) core concepts of Computer Science. Ernest Davis answered with the following short list (not in any particular order):

  1. Virtual machine
  2. Caching
  3. Algorithm
  4. Data structure
  5. Programming language

And I followed up with this drafty, much longer answer:

  1. how and why computation works; that a computation is a mapping from your problem domain into some machine state, then we have some automatic movement, and the result represents an answer to your question; that it is always layers of interpretation; that it doesn't matter whether the computing machine is made of ICs or of levers, marbles, and gravity (i.e. what is a function); that computation is always real and you can't simulate computation; what can be done with computation and what cannot; computational thinking - this might map to number 1 in Ernest's list
  2. that everything can be represented with zeros and ones, but doesn't have to be; it could also be represented by A and B and Cs, and many other ways; that two states are simply convenient for electric devices; that all information, all data, all input to all computation, and the steps for computations themselves are represented with zeros and ones (i.e. the von Neumann architecture and binary encoding); what can be represented in this paradigm and what cannot - this might map to number 4 in Ernest's list
  3. how are functions encoded; how many different functions can have the same results; how wildly different in efficiency functions can be even when they have the same result; why that makes some things quick to calculate whereas others take a long time; basically smearing ideas from lambda calculus and assembler and building everything from NAND circuits; why this all maps to higher level languages such as JavaScript - this might map to ideas from 2, 3, and 5 on Ernest's list
  4. bringing it back to the devices; where does, physically, the computation happen, where is physically the data stored, and why it matters in terms of privacy, equity, convenience, economics, interdependence, even freedom and independence; what kind of computations and data storage we can expect to have in our mobile phones, in a data center, in an RFID card; how long the turnaround times are in each case; how cryptography works and what kind of guarantees it can provide; why centralization is so alluring and what the price of that might be; and what might be the cost of computation for the environment
  5. given our times, and building on the previous lessons, what is the role of machine learning; how does it actually work, why does it work as well as it does, and why does it not work when it doesn't and where can't it work; what does this have to do with "intelligence", if it does; what becomes possible because of these methods, and what it costs; why these methods may reinforce inequities; but also how they might help us with significantly increasing access to better health care for many people or allow computers to have much more intuitive interfaces and thus democratize access to computing resources
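
Point 3 above, that different encodings of the same function can differ wildly in efficiency, can be made concrete with a small Python sketch. The choice of Fibonacci numbers and the step counters are illustrative assumptions, not part of the original discussion.

```python
def fib_recursive(n: int, calls: list[int]) -> int:
    """Naive definition: recomputes the same subproblems over and over."""
    calls[0] += 1  # count every invocation
    if n < 2:
        return n
    return fib_recursive(n - 1, calls) + fib_recursive(n - 2, calls)


def fib_iterative(n: int, steps: list[int]) -> int:
    """Same mapping from n to result, computed in a single pass."""
    a, b = 0, 1
    for _ in range(n):
        steps[0] += 1  # count every loop iteration
        a, b = b, a + b
    return a


recursive_calls, iterative_steps = [0], [0]
# Both encodings agree on the result ...
assert fib_recursive(20, recursive_calls) == fib_iterative(20, iterative_steps)
# ... but not on the amount of work needed to get there.
print(recursive_calls[0], iterative_steps[0])  # 21891 20
```

Both functions compute the same answer for every input. For n = 20, the naive version needs over twenty thousand calls where the loop needs twenty steps, and the gap grows exponentially with n.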

I think the intuitions in 1, 2, and maybe 3 are really the core of computer science, and then 4 and 5 provide shortcuts to important questions for ourselves and society that, I think, would be worthwhile for everyone to ponder and have an informed understanding of the situation so that they can meaningfully make relevant decisions.

The Strange Case of Booker T. Washington’s Birthday

A lovely geeky essay about how much work a single edit to Wikipedia can be. I went down this kind of rabbit hole myself more than once, and so I very much enjoyed the essay.

Wordle is good and pure

The nice thing about Wordle - whether you play it or not, whether you like it or not - is that it is one of those good, pure things the Web was made for. A simple Website, without ads, popups, monetization, invasive tracking, etc.

You know, something that can chiefly be done by someone who already has a comfortable life and won't regret not having monetized this. The same way scientists mainly have been "gentleman scientists". Or tenured professors who spent years on writing novels.

And that is why I think that we should have a Universal Basic Income. To unlock that creativity. To allow for ideas from people who are not already well off to see the light. To allow for a larger diversity of people to try more interesting things.

Thank you for coming to my TED talk.

P.S.: on January 31, five days after I wrote this text, Wordle was acquired by the New York Times for an undisclosed seven-digit sum. I think that is awesome for Wardle, the developer of Wordle, and I still think that what I said was true at that time and still mostly is, although I expect the Website now to slowly change to have more tracking, branding, and eventually a paywall.

Meat Loaf

"But it was long ago
And it was far away
Oh God, it seemed so very far
And if life is just a highway
Then the soul is just a car
And objects in the rear view mirror may appear closer than they are."

Bat Out of Hell II: Back into Hell was the first album I really listened to, over and over again. Where I translated the songs to better understand them. Paradise by the Dashboard Light is just a fun song. He was in cult classic movies such as The Rocky Horror Picture Show, Fight Club, and Wayne's World.

Many of the words we should remember him for are by Jim Steinman, who died last year and wrote many of the lyrics that became famous as Meat Loaf's songs. Some of Meat Loaf's own words are better not remembered.

Rock in Peace, Meat Loaf! You have arrived at your destination.