A sermon on tolerance and inclusion

Warning: meandering New Year's sermon ahead, starting at a random point and going somewhere entirely else.

I started reading Martin Kay's book on Translation, and I am enjoying it quite a bit so far. Kay passed away in August 2021. His work seems highly relevant for the work on Abstract Wikipedia.

One thing that bummed me out, though, is that for more than a page in the introduction he rants about pronouns and how he is going to use "he" to generically mean both men and women, and how all other solutions have deficits.

He culminates in the explanation: "Another solution to this problem is which is increasing in popularity, is to use both 'he' and 'she', shifting between them more or less randomly. So we will sometimes get 'When a translator is confronted with a situation of this kind, she must decide...'. The trouble with this is that some readers, including the present writer, reacts quite differently to the sentence depending on which version of the generic pronoun it contains. We read the one containing 'he' smoothly and, all else being equal, assimilate the intended meaning. Encountering the one with 'she', on the other hand, is like following a television drama that is suddenly interrupted by a commercial."

Sooo frustratingly close to getting it.

I wish he had not spent over a page on this topic, but had simply used the generic 'he' in the text and left it at that. I mean, I don't expect everyone born more than eighty years ago to adjust to the modern usage of pronouns.

Now, I am not saying that to drag Kay's name through the dirt, or to get him cancelled or whatever. I have never met him, but I am sure he was a person with many positive facets, and given my network I wouldn't be surprised if there are people who knew him and can confirm so. I'm also not saying that to virtue signal and say "oh man, look how much more progressive I am". Yes, I am slightly annoyed by this page. Unlike many others though, I am not actually personally affected by it - I use the pronoun "he" for myself and not any other pronoun, so this really is not about me. Is it because of that that it is easy for me to gloss over this and keep reading?

So is it because I am not affected personally that it is so easy for me to say the following: it is still worthwhile to keep reading his work, and the rest of the book, and to build on top of his work and learn from him. The people we learn some things from, the influences we accept, they don't have to be perfect in every way, right? Would it have been as easy for me to say that if I were personally affected? I don't know.

I am worried about how quickly parts of society seem to be ready to "cancel" and "call out" people, and how willing they are to tag a person as unacceptable because they do not necessarily share every single belief that is currently regarded as a required belief.

I have great difficulties in drawing the line. Which beliefs or actions of a person should be sufficient grounds to shun them or their work? When JK Rowling doubles down on her stance regarding trans women, is this enough to ask everyone to drop all interest in the world she created and the books she wrote? Do we reshoot movie scenes such as the cameo of Donald Trump in Home Alone 2 in order to "purify" the movie and make it acceptable for our new enlightened age again? When Johnny Depp was accused of domestic abuse, did he need to be recast in movies he had already signed on to? Do we also need to stop watching his previous movies? Do the believable accusations of child abuse against Marion Zimmer Bradley mean that we have to ignore her contributions to feminist causes, never mind her books? Should we stop using a font such as Gill Sans because of the sexual abuse Eric Gill committed against his daughters? Do we have to stop watching movies or listening to music produced by murderers such as OJ Simpson, Phil Spector, or Johnny Lewis?

I intentionally escalated the examples, and they don't compare at all to Kay's defence of his usage of pronouns.

I offer no answers as to where the line should be, I have none. I don't know. In my opinion, none of us is perfect, and none of our idols, paragons, or example model humans will survive the scrutiny for perfection. This is not a new problem. Think of Gandhi, Michael Jackson, Alice Schwarzer, Socrates - no matter where you draw your idols from, they all come with imperfections, sometimes massive ones.

Can we keep and accept their positive contributions - without ignoring their faults? Can we allow people with faults to still continue to contribute their skills to society, or do we reduce them to their faults and negatives? Do we have to get someone fired for tweeting a stupid joke? Do we demand perfection from everyone at all times?

Or do we allow everyone to be human, make and have errors, and have beliefs many don't deem acceptable? Committing or causing actions resulting from these beliefs? Even if these actions and beliefs hurt or endanger people, or deny the humanity of others? We don't have to and should not accept their racism, sexism, homo- and transphobia - but can and should we still recognise their other contributions?

I am worried about something else as well. By pushing out so many because of the one thing they don't want to accept in the basket of required beliefs, we push them all into the group of outsiders. But if there are too many outsiders, the whole system collapses. Do we all have to have the same belief on guns, on climate, on gender, on abortion, on immigration, on race, on crypto, on capitalism, on housing? Or can we integrate and work together even if we have differences?

The vast majority of Americans think that human-caused climate change is real and that we should act to avoid it. Only 10% don't. And yet, because of the way we define and fence our in- and outgroups, we have a strong voting bloc that repeatedly leads to outright sabotage of effective measures. A large majority of Americans support the right to abortion, but you would never be able to tell given the fights around laws and court cases. Taxing billionaires more effectively is highly popular with voters, but again these majorities fizzle away and don't translate to the respective changes in the tax code.

I think we should be able to work together with people we don't agree with on everything. We should stop requiring perfection and alignment on all issues before moving forward. But then again, that's what I am saying, and I am saying it from a position of privilege, am I not? I am male. I am White. I am heterosexual. I am not Muslim or Jewish. I am well educated. I am not poor. I am reasonably technologically savvy. I am not disabled. What right do I have at all to voice my opinion on these topics? To demand acceptance for people with beliefs that hurt or endanger people who are not like me? Or even to ask for your precious attention for these words of mine?

None.

And yet I hope that we will work together towards progress on the topics we agree on, that we will enlighten each other on the topics we disagree on, and that we will be able to embrace more of us on our way into the future.

P.S.: this post is problematic and not very well written, and I recognise that. Please refer to the discussion about it on Facebook.

Long John and Average Joe

You may know about Long John Silver. But who's the longest John? Here's the answer according to Wikidata: https://w.wiki/4dFL

What about your Average Joe? Here's the answer about the most average Joe, based on all the Joes in Wikidata: https://w.wiki/4dFR

Note, the average height of a Joe in Wikidata is 1.86 m or 6'1", which is quite a bit taller than the average height in the population. This is a data collection and coverage issue: Wikidata is much more likely to have the height of a basketball player than that of an author.

Just two silly queries for Wikidata, which are nice ways to show off the data set and what one can do with the SPARQL query endpoint. Especially the latter one shows off a rather interesting and complex SPARQL query.
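The linked queries are not reproduced here. As a rough sketch of the idea behind the "Average Joe" one, the following Python snippet shows what such a query might look like and how one could pick the Joe closest to the average height. Everything in it is an illustrative assumption rather than the query behind the short link: the query text, matching the given name by its English label, and the made-up sample heights. The query string is only defined, not sent; to run it one would post it to the public Wikidata SPARQL endpoint.

```python
from statistics import mean

# Public Wikidata SPARQL endpoint (the query below is not sent in this sketch).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed query shape: humans (P31 = Q5) whose given name (P735) is labelled
# "Joe" and who have a height (P2048). Note that wdt:P2048 returns the raw
# amount, so values may come in mixed units; this sketch ignores that.
JOE_HEIGHTS_QUERY = """
SELECT ?person ?personLabel ?height WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P735 ?name ;
          wdt:P2048 ?height .
  ?name rdfs:label "Joe"@en .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def most_average(heights):
    """Return (average height, name of the entry closest to that average)."""
    avg = mean(heights.values())
    closest = min(heights, key=lambda name: abs(heights[name] - avg))
    return avg, closest


# Hypothetical sample data in centimetres, standing in for query results.
sample = {"Joe A": 170.0, "Joe B": 185.0, "Joe C": 203.0}

avg, joe = most_average(sample)
print(f"average height: {avg:.1f} cm, most average Joe: {joe}")
```

The real query does the averaging inside SPARQL itself, which is what makes it the more interesting of the two; the sketch moves that step into Python only to keep the logic easy to follow.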

Temperatures in California

It has been a bit chillier the last few days. I noticed that after almost a decade in California, I feel pretty comfortable with understanding temperatures in Fahrenheit - as long as they are over 60° F. If it is colder, I need to switch to Celsius in order to understand how cold it exactly is. I have no idea what 40° or 45° or 50° F are, but I still know what 5° C is!

The fact that I still haven't acclimatised to Fahrenheit for the cooler temperatures tells you a lot about the climate in California.
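For anyone else who needs the translation, here is a minimal sketch of the standard conversion between the two scales. The function names are just illustrative.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


# The chilly temperatures mentioned above, plus the 60° F comfort threshold.
for f in (40, 45, 50, 60):
    print(f"{f}° F = {fahrenheit_to_celsius(f):.1f}° C")

print(f"5° C = {celsius_to_fahrenheit(5):.0f}° F")
```

So 50° F is exactly 10° C, and the familiar 5° C sits at 41° F.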

SWSA panel

Thursday, October 7, 2021, saw a panel of three founding members of the Semantic Web research community, who each have been my teachers and mentors over the years: Rudi Studer, Natasha Noy, and Jim Hendler. I loved watching the panel and enjoyed it thoroughly, also because it was just great to see all of them again.

There were many interesting insights and thoughts in this panel, too many to write them all down, but I want to mention a few.

It was interesting how much all panelists talked about creating the Semantic Web community, and how much of an intentional effort that was. Deciding that it needs a conference, a journal, an organization, setting those up, and their interactions. Seeing and fostering a sustainable research community grown out of an idea is a formidable and amazing effort. They all mentioned positively the diversity in the community, and that it was a conscious effort to work towards that. Rudi mentioned that the future challenge will be with ensuring that computer science students actually have Semantic Web technologies integrated into their standard curriculum.

They named a number of the successes that were influenced by the Semantic Web research work, such as Schema.org, the heavy use of SPARQL in supercomputing (I had no idea!), Wikidata (thanks for the shout out, Rudi!), and the development of scalable graph databases. Natasha raised the advantage of having common identifiers throughout an organization, i.e. that everyone refers to California the same way. They also named areas that have remained elusive and where they expect to see progress in the coming years: Rudi in particular mentioned Agents and Common Sense, which was echoed by the other participants, and Jim mentioned Personal Knowledge Graphs. Jim mentioned he was surprised by the growing importance of unstructured data. Jim is also hoping for something akin to “procedural attachments” - you see some new data coming in, you perform this action (I would like to think that a little Wikifunctions goes a long way).

We need both, open knowledge graphs and closed knowledge graphs (think of your personal ones, but also the ones by companies).

The most important contribution so far and also well into the future was the idea of decentralization of semantics. To allow different stakeholders to work asynchronously and separately on parts of the semantics and yet share data. This also includes the decentralization of knowledge graphs, but also in the future we will encounter a world where semantics are increasingly brought together and yet decentralized.

One interesting anecdote was shared by Natasha. She was talking about a keynote by Guha (one of the few researchers who were namechecked in the panel, along with Tim Berners-Lee) at ISWC in Sydney 2013. How Guha was saying how simple the technology needs to be, and how there were many in the audience who were aghast and shocked by the talk. Now, eight years later and given her experience building Dataset Search, she appreciates the insights. If they have a discussion about a new property for longer than five minutes, they drop it. It’s too complicated, and people will use it wrong so often that the data cleanup will become expensive.

All of them shared the advice for researchers in their early career stage to work on topics that truly inspire them, on problems that are real and that they and others care about, and that if they do so, the results have the best chance to have impact. Think about problems you can explain to people not in your field, about “how can we use triples to save the world” - and not just about “hey, look, that problem that we solved with these other technologies previously, now we can also solve it with Semantic Web technologies”. This doesn’t really help anyone. Solve new problems. Solve real problems. And do what you are truly passionate about.

I enjoyed the panel, and recommend that everyone in the Semantic Web research area, or in any related, nearby research area, check it out. Thanks to the organizers for this talk (which is the first session in a series of talks that will continue with Ora Lassila in early December).


Our four freedoms for our technology

(This is a draft. Comments are welcome. This is not meant as an attack on any person or company individually, but on certain practices that are becoming increasingly prevalent.)

We are not allowed to use the devices we paid for in the ways we want. We are not allowed to use our own data in the way we want. We are only allowed to use them in the way the companies who created the devices and services allow us.

Sometimes these companies are nice and give us a lot of freedom in how to use the devices and data. But often they don’t. They close them down for all kinds of reasons. They may say it is for your protection and safety. They might admit it is for profit. They may say it is for legal reasons. But in the end, you are buying a device, or you are creating some data, and you are not allowed to use that device and that data in the way you want to, you are not allowed to be creative.

The companies don’t want you to think of the devices that you bought and the data that you created as your devices and your data. They want you to think of them as black boxes that offer you services they create for you. They don’t want you to think of a Ring doorbell as a camera, a microphone, a speaker, and a button, but they want you to think of it as providing safety. They don’t want you to think of the garage door opener as a motor and a bluetooth module and a wifi module, but as a garage door opening service, and the company wants to control how you are allowed to use that service. Companies like Chamberlain and SkyLink and Genie don’t allow you to write a tool to check on your garage door, and to close or open it, but they make deals with Google and Amazon and Apple in order to integrate these services into their digital assistants, so that you can use it in the way these companies have agreed on together, through the few paths these digital assistants are available. The digital assistant that you buy is not a microphone and a speaker and maybe a camera and maybe a screen that you buy and use as you want, but you buy a service that happens to have some technical ingredients. But you cannot use that screen to display what you want. Whether you can watch your Amazon Prime show on the screen of a Google Nest Hub depends on whether Amazon and Google have an agreement with each other, not on whether you have paid for access to Amazon Prime and you have paid for a Google Nest Hub. You cannot use that camera to take a picture. You cannot use that speaker to make it say something you want it to say. You cannot use the rich plethora of services on the Web, and you cannot use the many interesting services these digital assistants rely on, in novel and creative combinations.

These companies don’t want you to think of the data that you have created and that they have about you as your data. They don’t want you to think about this data at all. They just want you to use their services in the way they want you to use their services. On the devices they approve. They don’t want you to create other surfaces that are suited to the way you use your data. They don’t want you to decide on what you want to see in your feed. They don’t want you to be able to take a list of your friends and do something with it. They will say it is to protect privacy. They will say that it is for safety. That is why you cannot use the data you and your friends have created. They want to exactly control what you can and cannot do with the data you and your friends have created. They want to control how many ads you must see in order to be allowed to see your friends’ posts. They don’t want anyone else to have the ability to provide you with creative new interfaces to your feed. They don’t want you yourself to have the ability to look at your feed and do whatever you want with it.

Those are devices you paid for.

These are data you and your friends have created.

And more and more we are losing our freedom of using our devices and our data as we like.

It would be impossible to invent email today. It would be impossible to invent the telephone today. Both are protocols that allow everyone to communicate with anyone no matter what their email provider or their phone is. Try reading your friend’s Facebook feed on Instagram, or send a direct message from your Twitter account to someone on WhatsApp, or call your Skype contact on Facetime.

It would be impossible to launch the Web today - many companies don’t want you browsing the Web. They want you to be inside of your Facebook feed and consume your content there. They want you to be on your Twitter feed. They don’t want you to go to the Website of the New York Times and read an article there, they don’t want you to visit the Website of your friend and read their blog there. They want you to stay on their apps. Per default, they open Websites inside their app, and not in your browser, so you are always within their app. They don’t want you to experience the Web. The Web is dwindling and all the good things on it are being recut and rebundled within the apps and services of tech companies.

We are seeing more and more walls in the world. Already, it is becoming impossible to pay for and watch certain movies and shows without buying into a full subscription to a service. We will likely see the day where you will need a specific device to watch a specific movie. Where the only way to watch a Disney+ exclusive movie is on a Disney+ tablet. You don’t think so? Think about how easy it is to get your Kindle books onto another Ebook reader. How do you enable a skill or capability available in Alexa on your Nest smart speaker? How can you search through the books that you bought and are in your digital library, besides by using a service provided by the company that allows you to search your digital library? When you buy a movie today on YouTube or on iMovies, what do you own? What are you left with when the companies behind these services close that service, or go out of business altogether?

Devices and content we pay for, data we and our friends create, should be ours to use in empowering and creative ways. Services and content should not be locked in with a certain device or subscription service. The bundling of services, content, devices, and locking up user data creates monopolies that stifle innovation and creativity. I am not asking to give away services or content or devices for free, I am asking to be allowed to pay for them and then use them as I see fit.

What can we do?

As far as I can tell, the solution, unfortunately, seems to be to ask for regulation. The market won’t solve it. The market doesn’t solve monopolies and oligopolies.

But don’t ask to regulate the tech giants individually. We don’t need a law that regulates Google and a law that regulates Apple and a law that regulates Amazon and a law to regulate Microsoft. We need laws to regulate devices, laws to regulate services, laws to regulate content, laws that regulate AI.

Don’t ask for Facebook to be broken up because you think Mark Zuckerberg is too rich and powerful. Breaking up Facebook, creating Baby Books, will ultimately make him and other Facebook shareholders richer than ever before. But breaking up Facebook will require the successor companies to work together on a protocol to collaborate. To share data. To be able to move from one service to another.

We need laws that require that every device we buy can be made fully ours. Yes, sure, Apple must still be allowed to provide us with the wonderful smooth User Experience we value Apple for. But we must also be able to access and share the data from the sensors in our devices that we have bought from them. We must be able to install and run software we have written or bought on the devices we paid for.

We need laws that require that our data is ours. We should be able to download our data from a service provider and use it as we like. We must be allowed to share with a friend the parts of our data we want to share with that friend. In real time, not in a dump download hours later. We must be able to take our social graph from one social service and move to a new service. The data must be sufficiently complete to allow for such a transfer, and not crippled.

We need laws that require that published content can be bought and used by us as we like. We should be able to store content on our hard disks. To lend it to a friend. To sell it. Anything I can legally do with a book I bought I must be able to legally do with a movie or piece of music I bought online. And just as with a book, you would not be allowed to give away copies as long as the work you bought still enjoys copyright.

We need laws that require that services and capabilities are unbundled and made available to everyone. Particularly as technological progress with regards to AI, quantum computing, and providing large amounts of compute becomes increasingly an exclusive domain for trillion dollar companies, we must enable other organizations and people to access these capabilities, or run the risk that sooner or later all and any innovation will be happening only in these few trillion dollar companies. Just because a company is really good at providing a specific service cheaply, it should not be allowed to unfairly gain advantage in all related areas and products and stifle competition and innovation. This company should still be allowed to use these capabilities in their products and services, but so should anyone else, fairly priced and accessible by everyone.

We want to unleash creativity and innovation. In our lifetimes we have seen the creation of technologies that would have been considered miracles and impossible just decades ago. These must belong to everybody. These must be available to everyone. There cannot be equity if all of these marvellous technologies can only be wielded by a few companies on the West coast of the United States. We must make them available to all the people of the world: the people of the Indian subcontinent, the people of Subsaharan Africa, the people of Latin America, and everyone else. They all should own the devices they paid for, the data they created, the content they paid for. They all should have access to the same digital services and capabilities that are available to the engineers at Amazon or Google or Microsoft. The universities and research centers of the world should be able to access the same devices and services and extend them with their novel and creative ideas. The scrappy engineers in Eastern Europe and India and Nigeria and Central Asia should be able to call the AI models trained by Google and Microsoft and use them in novel ways to run their devices and chip-powered cars and agricultural machines. We want a world of freedom, tinkering, where creativity and innovation are unleashed, and where everyone can contribute their ideas, their creativity, and where everyone can build their fortune.


The Center of the Universe

The discovery of the center of the universe led to a series of unexpected consequences. It killed some, it enlightened others, but most people just were left utterly confused in the end.

When the measurements from the Total Radiating Universal Tessellation Hyperfield satellites came in, it became depressingly clear that the universe was indeed contracting. Very slowly, but without any reasonable doubt; or, as the physicists said, they were five sigma sure about it. As the data from the measurements became available, physicists, cosmologists, topologists, even a few mathematically inclined philosophers, and a huge number of volunteers started to investigate it. And after a short period of time, they came to a whole set of staggering conclusions.

First, the Universe had a rather simple four-dimensional form. The only unfortunate blemishes in this theory were the black holes, but most of the volunteers, philosophers, and topologists decided to ignore these as accidental.

Second, the form was bounded. There was a beginning and an end in time, and there were boundaries in space, and those who understood that these were the same were enlightened about the form of the universe.

Third, since the form of the universe was bounded and simple, it had a center. Whereas this was slightly surprising, it was a necessary consequence of the previous findings. What first seemed merely exciting, but would soon turn out to be not only the heart of this report but the heart of all humanity, was that the data collected by the satellites allowed us to calculate the position of the center of the universe.

Before that, let me recap what we traditionally knew about how the universe is built. Our sun is a star, around which a few planets travel, one of them being our Earth. Our sun is one of a few tens of billions of stars that form a long curved thread which ties around a supermassive black hole. A small number of such threads are tangled together, forming the spiral arms of our galaxy, the Milky Way. Our galaxy consists of half a trillion stars like our sun.

Galaxies, like everything else in the universe, like to stick together and form groups. A few hundred thousand galaxies make up a supercluster. A few of these superclusters together build enormous walls of stars, filaments traversing the universe. The galaxies of such a wall are all in a single plane, more or less, and sometimes even in a single line.

Between these walls, walls made of superclusters and galaxies and stars and planets, there is, basically, nothing. The walls of stars are like gigantic honeycombs, and between them are enormous empty spaces, hundreds of millions of light years wide. When you look at a honeycomb, you will see that the empty spaces between the walls are much, much larger than the walls themselves. Such is the universe. You might think that the distance from here to the next grocery store is quite far, or that the ocean is quite big. But the distance from the earth to the sun is so much bigger, and the distance from the sun to the next star again so much more. And from our galaxy to the next, there is a huge empty space. Nevertheless, our galaxy is so close to the next group of galaxies that they together form a building block of a huge wall, separating two unimaginably large empty spaces from each other.

So when we figured out that we can calculate the center of the universe, it was widely expected that the center would be somewhere in one of those vast spaces of nothing. The chances that it would be in one of the filaments were tiny.

It turned out that this was not a question of chance.

The center of the universe was not only inside of a filament, but the first quick calculations (quick, though, has to be understood as taking three and a half years) suggested that the center is actually within our filament. And not only within our filament — but our galaxy. Within a one light year radius of our sun.

The team that made these calculations was working at a small research institute in rural Japan. They did not believe the results, and double and triple checked them. The head of the institute had graduated from Princeton, and called his former advisor there. Although it was deep in the night in Japan, they talked for many hours. In the end he learned that Princeton had made the same calculations, and had received their own results about eight months earlier. They hadn’t dared to publish them. There must have been a mistake. These results had to be wrong.

Science has humiliated the whole of humanity again and again. And it was quite successful in doing so. A scientist would much more easily accept that the center of the universe is some mathematical construct pointing to nothing than what the infallible mathematics indicated. But the data was out. And the number of people making the above mentioned realizations and calculations continued growing. It was only a matter of time. And when the Catholic University of Rio de Janeiro finally published the results, in a carefully written paper without any accompanying press release, formulated ever so cautiously and defensively, all the scientists who already knew the results held their breath.

The storm was unimaginable. Everyone demanded an explanation, but no one would listen to anyone offering one. The religions rejoiced, claiming they knew it all along, and many flocked to the mosques and churches and temples, as a proof of God was finally found. The irony of science leading humans to the embrace of religion was profoundly lost at that time, but later recognized as one of the largest jokes in history. Science has dealt its ultimate humiliation, not to humanity, but perversely to its most devout followers, the scientists. The scientists, who, while trashing the superiority of humans over the world, were secretly inflating their own, and were now reminded that they were merely slaves to a most cruel mistress. Their bitter resistance to the results did not stop them from emerging.

The mathematics and calculations were soon made public. The mathematics were deceptively simple, once the required factorizations were done, and easy to check. High school courses went through the proofs, and desperate parents peeked over the shoulders of their daughters and sons who, sometimes for the first time, talked of integrals and imaginary numbers. Television and streaming platforms were explaining discriminants and complex numbers and roots of higher degrees. Websites offering math courses bent under the load and moral weight.

There is one weird thing about roots. The root of a number is the number that, multiplied with itself, gives you the original number. The weird thing is that there is usually not a single, unique result to that question. For example, the root of the number four is not just two, but also minus two, as minus two times minus two results in four, too. There are two roots of the second degree (which we usually call the square root). There are three roots of the third degree (sometimes called the cube root). There are four roots of the fourth degree. And so on. All of them are correct. Sometimes you can discard one or the other because the result has to fit certain constraints (say, you are looking only for the positive root of four), but sometimes, you can not.

As the calculations went public, the methods became more and more refined. The results became increasingly precise, and as the data from the satellites poured in, one of the last steps involved a root of the seventh degree. First, this was regarded as a minor curiosity, especially because these seven results led to basically the same point. Cosmologically speaking.

Earth is moving. Earth is moving around the sun, at a speed of sixty-seven thousand miles per hour, or eighteen miles each second. The sun is moving too, and the earth is moving with the sun, and our galaxy is moving, and with our galaxy the sun moves along, and with the sun our earth. We are racing at a speed of a thousand miles each second in some direction away from the center of the universe.

And it was realized, maybe we just passed the center of the universe. Maybe it was just an accident, maybe all the planets and stars pass the center of the universe at some point. That we are so close to the center of the universe might be just a funny coincidence.

And maybe they are right. Maybe every star will at some point cross the center of the universe within the distance of a light year.

At some point though it was realized that, since the universe was bounded in all four dimensions, there was not only a center in space, but also a center in time, a midpoint between the beginning of the universe and its future end.

All human history is encompassed in the last hundred thousand years. From the mitochondrial Eve and the Y-Chromosomal Adam who lived in Africa, the mother of our mother of our mother, and so on, that we all share, and the father of our father of our father, and so on, that we all share, their descendants, our ancestors, who crossed the then fertile jungle of the Sahara and who afterwards settled the whole planet, painted on the walls of caves and filled the air with music by blowing over grass blades and into hollow bones, wandered over the land bridge connecting Asia with the Americas and traveled over the vast Pacific to discover tiny islands, until the recent invention of the alphabet, all of this happened in the last hundred thousand years. The universe has an age of a hundred thousand times a hundred thousand years, roughly. And the fabled midpoint turned out to be within the last few thousand years.

The hope that our earth was just accidentally next to the center of the universe was shattered. As the precision of the calculations increased, it became clearer and clearer that earth was not merely close to the center of the universe, but back at the midpoint of history, earth was right there in the center. In every single one of the seven possible results, Earth was right at the center of the universe. [1]

As the calculations continued over the years, a new class of mystic mathematicians emerged, and many walls between religion and science were shattered. On both sides the unshakeable ones remained: the scientists who would not admit that these results mean anything, that it all is merely a mathematical abstraction; and the priests who say that these results mean nothing, that they don’t tell us about how to live a good life. That these parallels intersect, is the only trace of infinity left.


[1] As the results were refined, it seemed that the seven mathematical solutions for the center of time and space turned out to be some very well known dates. So far the precision calculated was ten years here or there. The well known dates were: 3760 BC, 541 BC, 30 AD, and 610 AD. The other dates turned out to be much less well known: 10909 BC, 3114 BC, and 1989 AD. The interpretation of the dates led to a well-known series of events all over the world, which we will not discuss here.


(This story was first published on Medium on February 2, 2014 under CC-BY 4.0).

CodeNet problem descriptions on the Web

Project CodeNet is a large corpus of code published by IBM. It has close to one and a half million programs around a bit more than 4,000 problems.

I took the problem descriptions, created a simple index file to those, and uploaded them to the Web to make them easily browseable.

Wikidata or scraping Wikipedia

Yesterday I was pointed to a blog post describing how to answer an interesting question: how many generations from Alfred the Great to Elizabeth II? Alfred the Great was a king in England at the end of the 9th century, and Elizabeth II is the current Queen of England (and a bit more).

The author of the blog post, Bill P. Godfrey, describes in detail how he wrote a crawler that started downloading the English Wikipedia article of Queen Elizabeth II, and then followed the links in the infobox to download all her ancestors, one after the other. He used a scraper to get the information from the Wikipedia infoboxes from the HTML page. He invested quite a bit of work in cleaning the data, particularly doing entity reconciliation. This was then turned into a graph and the data analyzed, resulting in a number of paths from Elizabeth II to Alfred, the shortest being 31 generations.

I honestly love these kinds of projects, and I found Bill’s write-up interesting and read it with pleasure. It is totally something I would love to do myself. Congrats to Bill for doing it. Bill provided the dataset for further analysis on his Website. Thanks for that!

Everything I say in this post is not meant, in any way, as a criticism of Bill. As said, I think he did a fun project with interesting results, and he wrote a good write-up and published his data. All of this is great. I left a comment on the blog post sketching out how Wikidata could be used for similar results.

He submitted his blog post to Hacker News, where a, to me, extremely surprising discussion ensued. He was pointed rather naturally and swiftly to Wikidata and DBpedia. DBpedia is a project that started and invested heavily in scraping the infoboxes from Wikipedia. Wikidata is a sibling project of Wikipedia where data can be directly maintained by contributors and accessed in a number of machine-readable ways. Asked why he didn’t use Wikidata, he said he didn’t know about it. All fair and good.

But some of the discussions and comments on Hacker News surprised me entirely.

Expressing my consternation, I started discussions on Twitter and on Facebook. And there were some very interesting stories about the pain of using Wikidata, and I very much expect us to learn from them and hopefully make things easier. The number of API queries one has to make in order to get data (although, these numbers would be much smaller than with the scraping approach), the learning curve about SPARQL and RDF (although, you can ignore both, unless you want to use them explicitly - you can just use JSON and the Wikidata API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead of “mother” and “Queen Elizabeth II”) were just a few. The documentation seems hard to find, and there seems to be a lack of libraries and APIs that are easy to use. And yet, comments like "if you've actually tried getting data from wikidata/wikipedia you very quickly learn the HTML is much easier to parse than the results wikidata gives you" surprised me a lot.

Others asked about the data quality of Wikidata, and complained about the huge amount of bad data, duplicates, and the bad ontology in Wikidata (as if Wikipedia wouldn’t have these problems. I mean how do you figure out what a Wikipedia article is about? How do you get a list of all bridges or events from Wikipedia?)

I am not here to fight. I am here to listen and to learn, in order to help figure out what needs to be made better. I did dive into the question of data quality. Thankfully, Bill provides his dataset on the Website, and downloading the query result for the following query - select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q } - is just one click away. The result of this query is equivalent to what Bill was trying to achieve - a list of all ancestors of Elizabeth II. (The actual query is a little bit more complex, because we also fetch the names of the ancestors, and their Wikipedia articles, in order to help match the data to Bill’s data).
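For readers who would rather call the query from a script than click a link, here is a minimal sketch of how the request URL could be put together in Python. It assumes the public Wikidata Query Service endpoint and its `query` and `format` parameters; the helper name `build_query_url` is made up for illustration, and the sketch only builds the URL without sending anything over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public SPARQL endpoint of the Wikidata Query Service (assumed here).
ENDPOINT = "https://query.wikidata.org/sparql"

# The simplified query from the text: start at Elizabeth II (wd:Q9682),
# follow mother (wdt:P25) or father (wdt:P22) links any number of times
# to reach an ancestor ?p, then list every known parent ?q of ?p.
ANCESTOR_QUERY = (
    "select * { "
    "wd:Q9682 (wdt:P25|wdt:P22)* ?p . "
    "?p wdt:P25|wdt:P22 ?q "
    "}"
)


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL asking for JSON results (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_query_url(ANCESTOR_QUERY)
print(url)
```

Fetching that URL with any HTTP client should return one row per parenthood relationship, which is the graph data discussed in the rest of this post.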

I would claim that I invested far less work than Bill in creating my graph data. No data cleansing, no scraping, no crawling, no entity reconciliation, no manual checking. How about the quality of the two datasets?

Update: Note, this post is not a tutorial to SPARQL or Wikidata. You can find an explanation of the query in the discussion on Hacker News about this post. I really wanted to see how the quality of the data using the two approaches compares. Yes, it is an unfamiliar language for many, but I used to teach SPARQL and the basics of the language do not seem that hard to learn. Try out this tutorial for example. Update over

So, let’s look at the datasets. I will refer to the two datasets as the scrape (that’s Bill’s dataset) and Wikidata (that’s the query result from Wikidata, as of the morning of August 20 - in particular, none of the errors in Wikidata mentioned below have been fixed).

In the scrape, we find 2,584 ancestors of Elizabeth II (including herself). They are connected with 3,528 parenthood relationships.

In Wikidata, we find 20,068 ancestors of Elizabeth II (including herself). They are connected with 25,414 parenthood relationships.

So the scrape only found a bit less than 13% of the people that Wikidata knows about, and close to 14% of the relationships. If you ask me, that’s quite a bad recall - almost seven out of eight ancestors are missing.
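The percentages above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic, using the four counts quoted in the text; it treats the scrape as if it were a subset of Wikidata, which ignores the handful of entries that only the scrape has.

```python
# Counts quoted in the text.
scrape_people, scrape_links = 2_584, 3_528
wikidata_people, wikidata_links = 20_068, 25_414


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


people_recall = share(scrape_people, wikidata_people)
link_recall = share(scrape_links, wikidata_links)

print(f"people found by the scrape:        {people_recall:.1f}%")
print(f"relationships found by the scrape: {link_recall:.1f}%")
print(f"people missing from the scrape:    {100 - people_recall:.1f}%")
```

Seven out of eight would be 87.5% missing, so "almost seven out of eight" matches what the last line prints.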

Did the scrape find things that are missing in Wikidata? Yes. 43 ancestors are in the scrape which are missing in Wikidata, and 61 parenthood relationships are in the scrape which are missing from Wikidata. That’s about 1.7% of the data in the scrape, or 0.24% compared to the overall parent relationship data of Elizabeth II in Wikidata.

I evaluated the complete list of those relationships from the scrape missing from Wikidata. They fall into five categories:

  • Category 1: Errors that come from the scraper. 40 of the 61 relationships are errors introduced by the scraper. We have cities or countries being parents - which isn’t too terrible, as Bill says in the blog post, because they won’t have parents themselves and won’t participate in the original question of finding the lineage from Alfred to Elizabeth, so no problem. More problematic is when grandparents or great-grandparents are identified as the parent, because this directly messes up the counting of generations: Ügyek is thought to be a son, not a grandson of Prince Csaba, Anna Dalassene is skipping two generations to Theophylact Dalassenos, etc. This means we have an error rate of at least 1.1% in the scraper dataset, besides having the low recall rate mentioned above.
  • Category 2: Wikipedia has an error. Those are rare, it happened twice. Adelaide of Metz had the wrong father and Sophie of Mecklenburg linked to the wrong mother in the infobox (although the text was linking to the right one). The first one has been fixed since Bill ran his scraper (unlucky timing!), and I fixed the second one. Note I am linking to the historic version of the article with the error.
  • Category 3: Wikidata was missing data. Jeanne de Fougères, Countess of La Marche and of Angoulême and Albert Azzo II, Margrave of Milan were missing one or both of their parents, and Bill’s scraping found them. So of the more than 3,500 scraped relationships, only 2 were missing! I added both.
  • In addition, correct data was marked deprecated once. I fixed that, too.
  • Category 4: Wikidata has duplicates, and that breaks the chain. That happened five times. I think the following pairs are duplicates: Q28739301/Q106688884, Q105274433/Q40115489, Q56285134/Q354855, Q61578108/Q546165 and Q15730031/Q59578032. Duplicates were mentioned explicitly in one of the comments as a problem, and here we can see that they happen with quite a bit of frequency, particularly for non-central items. I merged all of these.
  • Category 5: the situation is complicated, and different Wikipedia versions disagree, because the sources seem to disagree. Sometimes Wikidata models that disagreement quite well - but often not. After all, we are talking about people who sometimes lived more than a millennium ago. Here are these cases: Albert II, Margrave of Brandenburg to Ada of Holland; Prince Álmos to Sophia to Emmo of Loon (complicated by a duplicate as well); Oldřich, Duke of Bohemia to Adiva; William III to Raymond III, both Counts of Toulouse; Thored to Oslac of York; Bermudo II of León to Ordoño III of León (Galician says IV); and Robert Fitzhamon to Hamo Dapifer. In total, eight cases. I didn't edit those as these require quite a bit of thought.

Note that there was not a single case of “Wikidata got it wrong”, which surprised me a lot - I totally expected errors to happen. Unless you count the cases in Category 5. I mean, even English Wikipedia had errors! This was a pleasant surprise. Also, the genuinely complicated cases are roughly as frequent as missing data, duplicates, and errors together. To be honest, that sounds like a pretty good result to me.

Also, the scraped data? Recall might be low, but the precision is pretty good: more than 98% of it is corroborated by Wikidata. Not all scraping jobs have such a high correctness.

In general, these results are comparable to a comparison of Wikidata with DBpedia and Freebase I did two years ago.

Oh, and what about Bill’s original question?

Turns out that Wikidata knows of a path between Alfred and Elizabeth II that is even shorter than the shortest 31 generations Bill found, as it takes only 30 generations.

This is Bill’s path:

  • Alfred the Great
  • Ælfthryth, Countess of Flanders
  • Arnulf I, Count of Flanders
  • Baldwin III, Count of Flanders
  • Arnulf II, Count of Flanders
  • Baldwin IV, Count of Flanders
  • Judith of Flanders
  • Henry IX, Duke of Bavaria
  • Henry X, Duke of Bavaria
  • Henry the Lion
  • Henry V, Count Palatine of the Rhine
  • Agnes of the Palatinate
  • Louis II, Duke of Bavaria
  • Louis IV, Holy Roman Emperor
  • Albert I, Duke of Bavaria
  • Joanna Sophia of Bavaria
  • Albert II of Germany
  • Elizabeth of Austria
  • Barbara Jagiellon
  • Christine of Saxony
  • Christine of Hesse
  • Sophia of Holstein-Gottorp
  • Adolphus Frederick I, Duke of Mecklenburg-Schwerin
  • Adolphus Frederick II, Duke of Mecklenburg-Strelitz
  • Duke Charles Louis Frederick of Mecklenburg
  • Charlotte of Mecklenburg-Strelitz
  • Prince Adolphus, Duke of Cambridge
  • Princess Mary Adelaide of Cambridge
  • Mary of Teck
  • George VI
  • Elizabeth II

And this is the path that I found using the Wikidata data:

  • Alfred the Great
  • Edward the Elder (surprisingly, it deviates right at the beginning)
  • Eadgifu of Wessex
  • Louis IV of France
  • Matilda of France
  • Gerberga of Burgundy
  • Matilda of Swabia (this is a weak link in the chain, though, as there might possibly be two Matildas having been merged together. Ask your resident historian)
  • Adalbert II, Count of Ballenstedt
  • Otto, Count of Ballenstedt
  • Albert the Bear
  • Bernhard, Count of Anhalt
  • Albert I, Duke of Saxony
  • Albert II, Duke of Saxony
  • Rudolf I, Duke of Saxe-Wittenberg
  • Wenceslaus I, Duke of Saxe-Wittenberg
  • Rudolf III, Duke of Saxe-Wittenberg
  • Barbara of Saxe-Wittenberg (Barbara has no article in the English Wikipedia, but in German, Bulgarian, and Italian. Since the scraper only looks at English, they would have never found this path)
  • Dorothea of Brandenburg
  • Frederick I of Denmark
  • Adolf, Duke of Holstein-Gottorp (husband to Christine of Hesse in Bill’s path)
  • Sophia of Holstein-Gottorp (and here the two lineages merge again)
  • Adolphus Frederick I, Duke of Mecklenburg-Schwerin
  • Adolphus Frederick II, Duke of Mecklenburg-Strelitz
  • Duke Charles Louis Frederick of Mecklenburg
  • Charlotte of Mecklenburg-Strelitz
  • Prince Adolphus, Duke of Cambridge
  • Princess Mary Adelaide of Cambridge
  • Mary of Teck
  • George VI
  • Elizabeth II
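Once the parenthood relationships are available as a graph, finding a shortest lineage like the ones above is a plain breadth-first search. The following is a minimal sketch under that assumption, not the code actually used for this post; the function name and the toy family with its single-letter names are made up for illustration.

```python
from collections import deque


def shortest_lineage(parents, descendant, ancestor):
    """Breadth-first search from descendant up to ancestor.

    parents maps each person to a list of their known parents.
    Returns the people on one shortest path, ancestor first,
    or None if no path is known.
    """
    previous = {descendant: None}  # person -> child we reached them from
    queue = deque([descendant])
    while queue:
        person = queue.popleft()
        if person == ancestor:
            path = []
            while person is not None:
                path.append(person)
                person = previous[person]
            return path  # already ordered from ancestor to descendant
        for parent in parents.get(person, []):
            if parent not in previous:
                previous[parent] = person
                queue.append(parent)
    return None


# Hypothetical toy family: E descends from A along two lines of different length.
toy = {
    "E": ["D1", "D2"],
    "D1": ["C"],
    "C": ["B"],
    "B": ["A"],
    "D2": ["A"],
}
print(shortest_lineage(toy, "E", "A"))
```

Because breadth-first search visits people generation by generation, the first time it reaches the ancestor it has found a path with the fewest possible steps, which is why a richer graph can only keep the shortest lineage the same or make it shorter.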

I hope that this is an interesting result for Bill coming out of this exercise.

I am super thankful to Bill for doing this work and describing it. It led to very interesting discussions and triggered insights into some shortcomings of Wikidata. I hope the above write-up is also helpful, particularly in providing some data regarding the quality of Wikidata, and I hope that it will lead to work on making Wikidata more easily accessible to explorers like Bill.

Update: there has been a discussion of this post on Hacker News.

Double copy in gravity

15 May 2021

When I was younger, I understood these theories much better. Today I read them like a fascinated, but a bit distant bystander.

But it is terribly interesting. What does turning physics into math mean? When we find a mathematical shortcut that works but we don't understand - is this real? What is the relation between mathematical formulas and reality? And will we finally understand gravity some day?

It was an interesting article, but I am not sure I understood it all. I guess, I'm getting old. Or just too specialized.

Zen and the Art of Motorcycle Maintenance

13 May 2021

During my PhD, on the topic of ontology evaluation - figuring out what a good ontology is and what is not - I was running circles up and down trying to define what "good" means for an ontology (Benjamin Good, another researcher on that topic, had it easier, as he could call his metric "Good metric" and be done with it).

So while I was struggling with the definition in one of my academic essays, a kind anonymous reviewer (I think it was Aldo Gangemi) suggested I should read "Zen and the Art of Motorcycle Maintenance".

When I read the title of the suggested book, I first thought the reviewer was being mean or silly and suggesting a made-up book because I was so incoherent. It took me two days to actually check whether that book existed, as I wouldn't believe it.

It existed. And it really helped me, by allowing me to set boundaries of how far I can go in my own work, and that it is OK to have limitations, and that trying to solve EVERYTHING leads to madness.

(Thanks to Brandon Harris for triggering this memory)

Keynote at Web Conference 2021

Today, I have the honor to give a keynote at the WWW Confe... sorry, the Web Conference 2021 in Ljubljana (and in the whole world). It's the 30th Web Conference!

Join Jure Leskovec, Evelyne Viegas, Marko Grobelnik, Stan Matwin and myself!

I am going to talk about how Abstract Wikipedia and Wikifunctions aim to contribute to Knowledge Equity. Register here for free:

Update: the talk can now be watched on VideoLectures:

Building a Multilingual Wikipedia

Communications of the ACM published my paper on "Building a Multilingual Wikipedia", a short description of the Wikifunctions and Abstract Wikipedia project that we are currently working on at the Wikimedia Foundation.


Jochen Witte

Jochen Witte was a friend from my school days. I learned a lot from him; he knew how to do all those practical things that I never found a way into and often wished I could do. From him I learned what a good sound system needs, why subwoofers have to be big, and what subwoofers are in the first place. Together we lugged heavy speakers around to make lower-school discos, Abitur pranks, and talks possible. He introduced me to the virtues of gaffer tape, and taught me that it is not just silver duct tape. He was the first to bring me a little closer to manga and anime; in particular, he had a passion for Akira. He was the one who first let me hear the electronic music of Chris Hülsbeck and Jean-Michel Jarre. He read ASM, I read Power Play. For a while we played DSA together. He was the first person I knew who had a pager. He always seemed as if he could repair anything, and it was good to know someone like that.

At the same time, some of my friends and I were not always kind to him, oh no, on the contrary; sometimes I was downright cruel. I made fun of his glasses or his weight, and I could score points by joking about him. I knew it was wrong. We were already the outsiders in the class, and I tried to make him the outsider among the outsiders. My only excuse is that we were children, and I did not yet have the strength to be better. I learned a lot from that, and never wanted to be like that again. Over time I came to understand myself better. Where this cruelty came from. And that it was not because of Jochen, but something in me. I am ashamed of much of what I did. I don't know whether I ever apologized to him.

And yet I believe we were friends.

After school we lost sight of each other. He studied chemistry in Esslingen, and we met now and then at the Movie Dick for the sneak preview. He moved to Staig in the Alb-Donau district and found himself as a goth. But over the years, we got in touch every now and then.

One of our shared memories was driving together to a talk by Erich von Däniken. It was my car. We had a flat tire, and while he got it running again - as I said, he could repair anything - he asked me when I had last checked the oil. I must have looked so dumbfounded that he could only laugh. The answer was "never", and he saw it in my face. Every time we met, he brought up that evening.

Jochen helped me with my move to Karlsruhe. The guest bed didn't fit together properly. He said he could tighten it, but I would never get it apart again. It would be hard to move with it. I said that's OK, it's just a cheap IKEA guest bed couch thing. I don't plan to move with it, I assured him.

I moved with it from Karlsruhe to Berlin. From Berlin to Alameda. Within Alameda. From Alameda to Berkeley. It gave the movers a headache every single time, just as Jochen had promised. Last week a piece broke off. I am sitting on it now as I write this. After almost a decade I should probably finally replace it.

The last time we met was entirely by chance, in 2017 at Stuttgart station. I had been back in Germany only once in the last half decade. And there, at the station, I met him. It was good to see Jochen again, and we talked as if we still saw each other every day, like twenty years before. As if the Abitur had been only yesterday.

This week I learned from Michael that Jochen has died. He died only a few months after our chance meeting, in April 2018. He was only forty years old.

I am sorry.

And even more: thank you.

Rest in peace, Jochen Witte.

The name Zdenko (translated from German)

Today I saw that the article Zdenko - my actual name - had been changed on the English Wikipedia. Someone had changed the meaning of the name from what I considered correct (Slavic form of Sidonius) to something I had never heard before (diminutive of Zdeslav), but had not adjusted the source. I thought this would be a quick fix, but looked into the source anyway - and, lo and behold, the source said neither one nor the other, but claimed the name comes from the Slavic word zidati, to build, to erect.

That led me on a two-hour odyssey through various sources from the 19th and 20th centuries, where I could find support for all three meanings - as well as sources claiming that the name is derived from the Slavic word zdenac, a well, that the name Sidney also derives from Sidonius, and a Hessian source that vehemently complained that Zdenko and Sidonius have nothing to do with each other (the Slovenian Wikipedia also says that the names Zdenko and Sidonius share a name day but are not the same name). The same source does, however, explain that the name Denje, used in East Hessian, probably comes from Zdenka (so close to Denny!)

I like Denje as a name.

In short: if you think etymology is complicated, be warned: anthroponymy is considerably worse!

The name Zdenko

Today I saw that the Wikipedia article on Zdenko - my actual name - was edited, and the meaning of the name was changed from something I considered correct (Slavic form of Sidonius) to something that I never heard of before (diminutive of Zdeslav), but the reference stayed intact, so I thought that'll be an easy revert. Just to do due process, I checked the given source - and funnily enough, it said neither one nor the other, but gave an etymology from the Slavic word zidati, to build, to create.

That led me down a two-hour rabbit hole through different sources crossing the 19th to 20th century, finding sources that claim the name is derived from the Slavic word zdenac, a well, or that Zdenko is cognate to Sidney, a Hessian source explaining that it is considered the root for the name Denje (so close to Denny!) (and saying it has nothing to do with Sidonius), and much more.

In short, if you think that etymology is messy, I tell you, anthroponymy is far worse!

Time on Mars

This is a fascinating and fun listen about the Mars mission. Because a day on Mars takes 40 minutes longer than on Earth, the people working on that mission had to live on Mars time, as the Mars rovers work with solar panels. So they have watches showing Mars time. They invent new words in their language, speaking about sol instead of day, of yestersol, and they start calling themselves Martians. 11 minutes.

Katherine Maher to step down from Wikimedia Foundation

Today Katherine Maher announced that she is stepping down as the CEO of the Wikimedia Foundation in April.

Thank you for everything!

Boole and Voynich and Everest

Did you know?

George Boole - after whom the Boolean data type and Boolean logic were named - was the father of Ethel Lilian Voynich - who wrote The Gadfly.

Her husband was Wilfrid Voynich - after whom the Voynich manuscript was named.

Ethel's mother and George Boole's wife was Mary Everest Boole - a self-taught mathematician who wrote educational books about mathematics. Her life is of interest to feminists as an example of how women made careers in an academic system that did not welcome them.

Mary Everest Boole's uncle was Sir George Everest - after whom Mount Everest is named.

And her daughter Lucy Everest was the first woman Fellow of the Royal Institute of Chemistry.

Geoffrey Hinton, great-great-grandson of George and Mary Everest Boole, received the Turing Award for his work on deep learning.

Abraham Taherivand to step down from Wikimedia Deutschland

Today Abraham Taherivand announced that he is stepping down as the CEO of Wikimedia Deutschland at the end of the year.

Thank you for everything!

Twenty years

On this day, twenty years ago, on January 15, 2001, I started my third Website, Nodix, and I kept it up since then (unlike my previous two Websites, which are lost to history as Internet Archive didn't capture them yet, it seems). A few years later I renamed it to Simia.

Here is the first entry: Willkommen auf der Webseite von Denny Vrandecic!

My Website never became particularly popular, although I was meticulously keeping track of how many hits I got and all of this. It was always a fun side project for which I had sometimes more and sometimes less time.

The funniest thing is that it was - and that was completely incidental - exactly the same day that another Website was started, which I, over the years, spent much more time on: Wikipedia.

Wikipedia changed my life, not only once, but many times.

It is how I met Kamara.

It is how I met a lot of other very smart people, too. It became part of my research work and my PhD thesis. It became the motivation for many of the projects I have started, be it Semantic MediaWiki, Wikidata, or Abstract Wikipedia. It is the reason for my career trajectory over the last fifteen years. It is hard to overstate how influential Wikipedia has been on my life.

It is hard to overstate how important Wikipedia has become for modern AI and for the Web of today. For smaller language communities. For many, many people looking for knowledge. And for the many people who realised that they can contribute to it too.

Thanks to the Wikipedia community, thanks to this marvellous project, and happy anniversary and many returns to Wikipedia!

Happy New Year 2021!

2020 was a challenging year, particularly due to the pandemic. Some things were very different, some things were dangerous, and the pandemic exposed the fault lines in many societies in a most tragic way around the world.

Let's hope that 2021 will be better in that respect, that we will have learned from how the events unfolded.

But I'm also amazed by how fast the vaccine was developed and made available to tens of millions.

I think there's some chance that the summer of '21 will become one to sing about for a generation.

Happy New Year 2021!

Keynote at SMWCon Fall 2020


I have the honor of being the invited keynote for the SMWCon Fall 2020. I am going to talk "From Semantic MediaWiki to Abstract Wikipedia", discussing fifteen years of Semantic MediaWiki, how it all started, where we are now - crossing Freebase, DBpedia, Wikidata - and now leading to Wikifunctions and Abstract Wikipedia. But, more importantly, how Semantic MediaWiki, over all these years, still holds up and what its unique value is.

Page about the talk on the official conference site: https://www.semantic-mediawiki.org/wiki/SMWCon_Fall_2020/Keynote:_From_Semantic_Wikipedia_to_Abstract_Wikipedia

Site went down

The site went down, again. First time was in July, when Apache had issues, this time it's due to MySQL acting up and frying the database. I found a snapshot from July 2019, and am trying to recreate the entries from in between (thanks, Wayback Machine!)

Until then, at least the site is back up, even though there might be some losses in the content.

P.S.: it should all be back up. If something is missing, please email me.

Wikidata crossed Q100000000

Wikidata crossed Q100000000 (and, in fact, skipped it and got Q100000001 instead).

Here's a small post by Lydia Pintscher and me: https://diff.wikimedia.org/2020/10/06/wikidata-reaches-q100000000/

Mulan

I was surprised when Disney made the decision to sell Mulan on Disney+. So if you want to watch Mulan, you not only have to buy it, so far so good, but you have to join their subscription service first. The price for Mulan is $30 in the US, in addition to the monthly streaming fee of $7. So the $30 don't buy you Mulan, but allow you to watch it if you keep up your subscription.

Additionally, on December 4 the movie becomes free for everyone with a Disney+ subscription.

I thought, that's a weird pricing model. Who'd pay that much money for streaming the movie a few weeks earlier? I know, it will be very long weeks due to the world being so 2020, but still. Money is tight for many people. Also, the movie had very mixed reviews and a number of controversies attached to it.

According to the linked report, Disney really knows what they're doing. 30% of subscribers bought the early streaming privilege! Disney made hundreds of millions in extra profit within the first few days (money they really will be thankful for right now given their business with the cruise ships and theme parks and movies this year).

The most interesting part is how this will affect the movie industry. Compare to Tenet - which was reviewed much better and which was the hope to revive the moribund US cinema industry, but made less than $30M - money which also needs to be shared with the theaters, and which came with much higher distribution costs. Disney keeps a much larger share of the $30 for Mulan than Tenet makes for its production company.

The lesson from Mulan and Trolls 2, which also did much better than I would ever have predicted, for the production companies experimenting with novel pricing models, could be disastrous for theaters.

I think we're going to see even more experimentation with pricing models. If the new Bond movie and/or the new Marvel movie should be pulled from cinemas, this might also be the end of cinemas as we know them.

I don't know how the industry will change, but the swing is from AMC to Netflix, with the producers being caught in between. The pandemic massively accelerated this transition, as it did so many others.

https://finance.yahoo.com/amphtml/news/nearly-onethird-of-us-households-purchased-mulan-on-disney-for-30-fee-data-221410961.html

Gödel's naturalization interview

When Gödel went to his naturalization interview, his good friend Einstein accompanied him as a witness. On the way, Gödel told Einstein about a gap in the US constitution that would allow the country to be turned into a dictatorship. Einstein told him to not mention it during the interview.

The judge they came to was the same judge who had already naturalized Einstein. The interview went well until the judge asked whether Gödel thought that the US could face the same fate and slip into a dictatorship, as Germany and Austria did. Einstein became alarmed, but Gödel started discussing the issue. The judge noticed, changed the topic quickly, and the process came to the desired outcome.

I wonder what that was, that Gödel found, but that's lost to history.

Gödel and Leibniz

Gödel in his later age became obsessed with the idea that Leibniz had written a much more detailed version of the Characteristica Universalis, and that this version was intentionally censored and hidden by a conspiracy. Leibniz had discovered what he had hunted for his whole life, a way to calculate truth and end all disagreements.

I'm surprised that it was Gödel in particular who obsessed over this idea, because I'd think that someone with Leibniz' smarts would have benefitted tremendously from Gödel's proofs, and it might have been a helpful antidote to his own obsession with making truth a question of mathematics.

And wouldn't it seem likely to Gödel that, even if there were such a Characteristica Universalis by Leibniz, he, Gödel himself, would have been the one to find the fatal bug in it, if no one else had before him?

Starting Abstract Wikipedia

I am very happy about the Board of the Wikimedia Foundation having approved the proposal for the multilingual Wikipedia aka Abstract Wikipedia aka Wikilambda aka we'll need to find a name for it.

In order to make that project a reality, I will as of next week join the Foundation. We will be starting with a small, exploratory team, which will allow us to have plenty of time to continue to socialize and discuss and refine the idea. Being able to work on this full time and with a team should allow us to make significant progress. I am very excited about that.

I am sad to leave Google. It was a great time, and I learned a lot about running *large* projects, and I met so many brilliant people, and I ... seriously, it was a great six and a half years, and I will very much miss it.

There is so much more I want to write but right now I am just super happy and super excited. Thanks everyone!

Lexical masks in JSON

We have released lexical masks as ShEx files before, schemata for lexicographic forms that can be used to validate whether the data is complete.

We saw that it was quite challenging to turn these ShEx files into forms for entering the data, such as Lucas Werkmeister’s Lexeme Forms. So we adapted our approach slightly to publish JSON files that keep the structures in an easier to parse and understand format, and to also provide a script that translates these JSON files into ShEx Entity Schemas.

Furthermore, we published more masks for more languages and parts of speech than before.

Full documentation can be found on wiki: https://www.wikidata.org/wiki/Wikidata:Lexical_Masks#Paper

Background can be found in the paper: https://www.aclweb.org/anthology/2020.lrec-1.372/

Thanks Bruno, Saran, and Daniel for your great work!

Major bill for US National Parks passed

Good news: the US Senate has passed a bipartisan large Public Lands Bill, which will provide billions right now and continued sustained funding for National Parks.

There are a number of interesting and good parts about this, besides the obvious that National Parks are being funded better and predictably:

  1. the main reason why this passed and was made was that the Evangelical movement in the US is increasingly reckoning that Pro-Life also means Pro-Environment, and this really helped with making this bill a reality. This is major as it could set the US on a path to become a more sane nation regarding environmental policies. If this could also extend to global warming, that would be wonderful, but let's for now be thankful for any momentum in this direction.
  2. the sustained funding comes from oil and gas operations, which has a certain satisfying irony to it. I expect this part to backfire a bit somehow, but I don't know how yet.
  3. Even though this is a political move by Republicans in order to save two of their Senators this fall, many Democrats supported it because the substance of the bill is good. Let's build on this momentum of bipartisanship.
  4. This has nothing to do with the pandemic, for once, but was in the works for a long time. So all of the reasons above are true even without the pandemic.

Black lives matter

Fun in coding

16 May 2020

This article really was grinding my gears today. Coding is not fun, it claims, and everyone who says otherwise is lying for evil reasons, like luring more people into programming.

Programming requires almost superhuman capabilities, it says. And other jobs that require this, such as brain surgery, would never be described as fun, so it is wrong to talk like this about coding.

That is all nonsense. The article not only misses the point, but it denies many people their experience. What's the goal? Tell those "pretty uncommon" people that they are not only different than other people, but that their experience is plain wrong, that when they say they are having fun doing this, they are lying to others, to the normal people, for nefarious reasons? To "lure people to the field" to "keep wages under control"?

I feel offended by this article.

There are many highly complex jobs that some people have fun doing some of the time. Think of writing a novel. Painting. Playing music. Cooking. Raising a child. Teaching. And many more.

To put it straight: coding can be fun. I have enjoyed hours and days of coding since I was a kid. I will not allow anyone to deny me that experience I had, and I was not a kid with nefarious plans like getting others into coding to make tech billionaires even richer. And many people I know have expressed fun with coding.

Also: coding does not *have* to be fun. Coding can be terribly boring, or difficult, or frustrating, or tedious, or bordering on painful. And there are people who never have fun coding, and yet are excellent coders. Or good enough to get paid and have an income. There are coders who code to pay for their rent and bills. There is nothing wrong with that either. It is a decent job. And many people I know have expressed not having fun with coding.

Having fun coding doesn't mean you are a good coder. Not having fun coding doesn't mean you are not a good coder. Being a good coder doesn't mean you have to have fun doing it. Being a bad coder doesn't mean you won't have fun doing it. It's the same for singing, dancing, writing, playing the trombone.

Also, professional coding today is rarely the kind of activity portrayed in this article, a solitary activity where you type code in green letters in a monospaced font on a black background, without having to answer to anyone, your code not being reviewed and scrutinized before it goes into production. For decades, coding has been a highly social activity that requires negotiation and discussion and social skills. I don't know if I know many senior coders who spend the majority of their work time actually coding. And it's at that level of activity where ethical decisions are made. Ethical decisions rarely happen at the moment the coder writes an if statement, or declares a variable. These decisions are made long in advance, documented in design docs and task descriptions, reviewed by a group of people.

So this article, although it has its heart in the right place, trying to point out that coding, like any engineering, also has many relevant ethical questions, goes about it entirely wrongly, and manages to offend me, and probably a lot of other people.

Sorry for my Saturday morning rant.

OK

11 May 2020

I often hear "don't go for the mediocre, go for the best!", or "I am the best, * the rest" and similar slogans. But striving for the best, for perfection, for excellence, is tiring in the best of times, never mind, forgive the cliché, in these unprecedented times.

Our brains are not wired for the best, we are not optimisers. We are naturally 'satisficers', we have evolved for the good-enough. For this insight, Herbert Simon received a Nobel prize, the only Turing Award winner to ever get one.

And yes, there are exceptional situations where only the best is good enough. But if good enough was good enough for a Turing-Award winning Nobel laureate, it is probably good enough for most of us too.

It is OK to strive for OK. OK can sometimes be hard enough, to be honest.

May is mental health awareness month. Be kind to each other. And, I know it is even harder, be kind to yourself.

Here is OK in different ways. I hope it is OK.

Oké ఓకే ਓਕੇ オーケー ओके 👌 ওকে או. קיי. Окей أوكي Օքեյ O.K.


Tim Bray leaving Amazon in protest

Tim Bray, co-author of XML, stepped down as Amazon VP on May 1st over the company's handling of whistleblowers. His post on this decision is worth reading.

If life was one day

If the evolution of animals was one day... (600 million years)

  • From 1am to 4am, most of the modern types of animals have evolved (Cambrian explosion)
  • Animals get on land a bit at 3am. Early risers! It takes them until 7am to actually breathe air.
  • Around noon, first octopuses show up.
  • Dinosaurs arrive at 3pm, and stick around until quarter to ten.
  • Humans and chimpanzees split off about fifteen minutes ago, modern humans and Neanderthals lived in the last minute, and the pyramids were built around 23:59:59.2.
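The clock positions above are plain proportional arithmetic: 600 million years squeezed into 24 hours means that one hour stands for 25 million years. Here is a minimal Python sketch of that conversion. The function name `years_ago` is made up for illustration, and the clock times fed into it are the rounded ones from the list, so the results are only as precise as those.

```python
TOTAL_YEARS = 600_000_000        # the time span squeezed into one day
SECONDS_PER_DAY = 24 * 60 * 60

def years_ago(hour, minute=0, second=0):
    """Convert a clock time on the one-day scale into years before now (midnight)."""
    seconds_before_midnight = SECONDS_PER_DAY - (hour * 3600 + minute * 60 + second)
    return seconds_before_midnight * TOTAL_YEARS / SECONDS_PER_DAY

print(years_ago(15))        # dinosaurs arriving at 3pm
print(years_ago(23, 45))    # fifteen minutes before midnight
```

On this scale, 3pm comes out at 225 million years ago, and fifteen minutes before midnight at 6.25 million years ago.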

In that world, if that was a Sunday:

  • Saturday would have started with the introduction of sexual reproduction
  • Friday would have started by introducing the nucleus to the cell
  • Thursday recovering from Wednesday's catastrophe
  • Wednesday photosynthesis started, and led to a lot of oxygen which killed a lot of beings just before midnight
  • Tuesday bacteria show up
  • Monday first forms of life show up
  • Sunday morning, planet Earth forms, pretty much at the same time as the Sun.
  • Our galaxy, the Milky Way, is about a week older
  • The Universe is about another week older - about 22 days.

There are several things that surprised me here.

  • That dinosaurs were around for such an incredibly long time. Dinosaurs were around for seven hours, and humans for a minute.
  • That life started so quickly after Earth was formed, but then took so long to get to animals.
  • That the Earth and the Sun started basically at the same time.

Addendum April 27: Álvaro Ortiz, a graphic designer from Madrid, turned this text into an infographic.

Architecture for a multilingual Wikipedia

I published a paper today:

"Architecture for a multilingual Wikipedia"

I have been working on this for more than half a decade, and I am very happy to have it finally published. The paper is a working paper and comments are very welcome.

Abstract:

Wikipedia’s vision is a world in which everyone can share in the sum of all knowledge. In its first two decades, this vision has been very unevenly achieved. One of the largest hindrances is the sheer number of languages Wikipedia needs to cover in order to achieve that goal. We argue that we need a new approach to tackle this problem more effectively, a multilingual Wikipedia where content can be shared between language editions. This paper proposes an architecture for a system that fulfills this goal. It separates the goal in two parts: creating and maintaining content in an abstract notation within a project called Abstract Wikipedia, and creating an infrastructure called Wikilambda that can translate this notation to natural language. Both parts are fully owned and maintained by the community, as is the integration of the results in the existing Wikipedia editions. This architecture will make more encyclopedic content available to more people in their own language, and at the same time allow more people to contribute knowledge and reach more people with their contributions, no matter what their respective language backgrounds. Additionally, Wikilambda will unlock a new type of knowledge asset people can share in through the Wikimedia projects, functions, which will vastly expand what people can do with knowledge from Wikimedia, and provide a new venue to collaborate and to engage the creativity of contributors from all around the world. These two projects will considerably expand the capabilities of the Wikimedia platform to enable every single human being to freely share in the sum of all knowledge.

Stanford seminar on Knowledge Graphs

My friend Vinay Chaudhri is organising a seminar on Knowledge Graphs with Naren Chittar and Michael Genesereth this semester at Stanford.

I have the honour to present in it as the opening guest lecturer, introducing what Knowledge Graphs are and what they are good for.

Due to the current COVID situation, the seminar was turned virtual, and opened for everyone to attend.

Other speakers during the semester include Juan Sequeda, Marie-Laure Mugnier, Héctor Pérez Urbina, Michael Uschold, Jure Leskovec, Luna Dong, Mark Musen, and many others.

Change is in the air

I'll be prophetic: the current pandemic will shine a bright light on the different social and political systems in the different countries. I expect to see noticeable differences in how disruptive the handling of the situation by the government is, how many issues will be caused by panic, and what effect freely available health care has. The US has always been on the very end of admiring the self sustained individual, and China has been on the other end of admiring the community and its power, and Europe is somewhere in the middle (I am grossly oversimplifying).

This pandemic will blow over in a year or two, it will sweep right through the US election, and the news about it might shape what we deem viable and possible in ways beyond the immediately obvious. The possible scenarios range all the way from high tech surveillance states to a much wider access to social goods such as health and education, and whatever it is, the pandemic might be a catalyst towards that.

Wired: "Wikipedia is the last best place on the Internet"

WIRED published a beautiful ode to Wikipedia, painting the history of the movement with broad strokes, aiming to capture its impact and ambition with beautiful prose. It is a long piece, but I found the writing exciting.

Here's my favorite paragraph:

"Pedantry this powerful is itself a kind of engine, and it is fueled by an enthusiasm that verges on love. Many early critiques of computer-assisted reference works feared a vital human quality would be stripped out in favor of bland fact-speak. That 1974 article in The Atlantic presaged this concern well: “Accuracy, of course, can better be won by a committee armed with computers than by a single intelligence. But while accuracy binds the trust between reader and contributor, eccentricity and elegance and surprise are the singular qualities that make learning an inviting transaction. And they are not qualities we associate with committees.” Yet Wikipedia has eccentricity, elegance, and surprise in abundance, especially in those moments when enthusiasm becomes excess and detail is rendered so finely (and pointlessly) that it becomes beautiful."

They also interviewed me and others for the piece, but the focus of the article is really on what the Wikipedia communities have achieved in our first two decades.

Two corrections:

  • I cannot be blamed for Wikidata alone, I blame Markus Krötzsch as well
  • the article says that half of the 40 million entries in Wikidata have been created by humans. I don't know if that is correct - what I said is that half of the edits are made by human contributors

Normbrunnenflasche

It's a pity there's no English Wikipedia article about this marvellous thing that exemplifies Germany so beautifully and quintessentially: the Normbrunnenflasche.

I was wondering the other day why in Germany sparkling water is being sold in 0.7l bottles and not in 1l or 2l or whatever, like in the US (when it's sold here at all, but that's another story).

Germany had a lot of small local producers and companies. To counter the advantages of the Coca Cola Company pressing into the German market, in 1969 a conference of representatives of the local companies decided to introduce a bottle design they all would use. This decision followed a half-year competition and discussion on what this bottle should look like.

Every company would use the same bottle for sparkling water and other carbonated drinks, and so no matter which one you bought, the empty bottle would afterwards be routed to the closest participating company, not back home, therefore reducing transport costs and increasing competitiveness against Coca Cola.

The bottle is full of smart features. The 0.7l size was chosen to ensure that the drink remained carbonated until the last sip, because larger bottles would last longer and thus gradually lose carbonation.

The form and the little pearls outside were chosen for improved grip, but also to symbolize the sparkles of the carbonation.

The metal screw cap was the real innovation there, useful for drinks that could increase pressure due to the carbonation.

And finally, there are two slightly thicker bands along the lower half of the bottle that would, as the bottle was rerouted for reuse again and again, slowly get more opaque due to mechanical pressure, thus indicating how well used the individual bottle was, so that it could be taken out of service in time before breaking at the customer's.

The bottles were reused an average of fifty times, their boxes an average of a hundred times. More than five billion of them have been brought into circulation in the fifty years since their adoption, for an estimated quarter of a trillion fillings.

A new decade?

The job of an ontologist is to define concepts. And since I see some posts commenting on whether a decade is closing and a new decade is starting tonight, here's my private, but entirely official position.

A decade is a consecutive timespan of ten years, and therefore at every given point a new decade starts and one ends. But that's a trivial answer to the question and not very useful.

There are two ways to count calendar decades, and both are arbitrary and rely on retconning, I mean, they rely on redefining the past. Therefore there is no right or wrong.

Method one is by using the proleptic Gregorian calendar, and starting with the year 1 and ending with the year 10, and calling that the first decade. If you keep counting, then the twohundredandthird decade will start on January 1st 2021, and we are currently firmly in the twohundredandsecond decade, and will stay there for another year.

Method two is based on the fact that for a millennium now and for many years to come there's a time period that conveniently lasts a decade where the years start with the same three digits. That is, the years starting with 202, which are called the 2020s, the ones with 199 which are called the 1990s (or sometimes just the 90s), etc. For centuries now we can find support for this kind of decade being widely used. According to this method, tonight marks a new decade.
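Both methods boil down to a line of integer arithmetic each. Here is a small Python sketch; the function names are mine and purely illustrative, and the first function assumes years counted from 1, as in method one.

```python
def ordinal_decade(year):
    """Method one: years 1-10 form decade 1, years 11-20 decade 2, and so on."""
    if year < 1:
        raise ValueError("method one starts counting with the year 1")
    return (year - 1) // 10 + 1

def named_decade(year):
    """Method two: group years by their leading digits, e.g. 2020-2029 are the '2020s'."""
    return f"{year // 10 * 10}s"

for year in (2019, 2020, 2021):
    print(year, ordinal_decade(year), named_decade(year))
```

By method one, 2019 and 2020 both sit in decade 202 and 2021 opens decade 203; by method two, 2019 belongs to the 2010s while 2020 and 2021 belong to the 2020s.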

So whether you are celebrating a new year tonight or not (because there are many other calendars out there too), or a new decade or not, I wish you wonderful 2020s!

SWAT4HCLS trip report

This week saw the 12th SWAT4HCLS event in Edinburgh, Scotland. It started with a day of tutorials and workshops on Monday, December 10th, on topics such as SPARQL, querying, ontology matching, and using Wikibase and Wikidata.

Conference presentations went on for two days, Tuesday and Wednesday. This included four keynotes, including mine on Wikidata, and how to move beyond Wikidata (presenting the ideas from my Abstract Wikipedia papers). The other three keynotes (as well as a number of the paper presentations) were all centered on the FAIR concept which I already saw being so prominent at the eScience conference earlier this year. FAIR as in Findable, Accessible, Interoperable, and Reusable publication of data. I am very happy to see these ideas spread out so prominently!

Birgitta König-Ries talked about how to use semantic technologies to manage FAIR data. Dov Greenbaum talked about how licenses interplay with data and what it means for FAIR data - my personal favorite of the keynotes, because of my morbid fascination regarding licenses and intellectual property rights pertaining to data and knowledge. He actually confirmed my understanding of the area - that you can’t really use copyright for data, and thus the application of CC-BY or similar licenses to data would stand on shaky grounds in a court. The last keynote was by Helen Parkinson, who gave a great talk on the issues that come up when building vocabularies, including issues around over-ontologizing (and the siren call of just keeping on modeling) and others. She put the issues in parallel to the travels of Odysseus, which was delightful.

The conference talks and posters were really spot on the topic of the conference: using semantic web technologies in the life sciences, health care, and related fields. It was a very satisfying experience to see so many applications of the technologies that Semantic Web researchers and developers have been creating over the years. My personal favorite was MetaStanza, web components that visualize SPARQL results in many different ways (a much needed update to SPARK, that Andreas Harth and I had developed almost a decade ago).

On Thursday, the conference closed with a Hackathon day, which I couldn’t attend unfortunately.

Thanks to the organizers for the event, and thanks again for the invitation to beautiful Edinburgh!

Other trip reports (send me more if you have them):

Frozen II in Korea

This is a fascinating story, that just keeps getting better (and Hollywood Reporter is only scratching the surface here, unfortunately): an NGO in South Korea is suing Disney for "monopolizing" the movie screens of the country, because Frozen II is shown on 88% of all screens.

Now, South Korea has a rich and diverse number of movie theatres - they have the large cineplexes in big cities, but in the less populated areas they have many small theatres, often with a small number of screens (I reckon it is similar to the villages in Croatia, where there was only a single screen in the theater, and most movies were shown only once, and there were only one or two screenings per day, and not on every day). The theatres are often independent, so there is no central planning about which movies are being shown (and today it rarely matters how many copies of a movie are being made, as many projectors are digital and thus unlimited copies can be created on the fly - instead of waiting for the one copy to travel from one town to the next, which was the case in my childhood).

So how would you ensure that these independent theatres don't show a movie too often? By having a centralized way that ensures that not too many screens show the same movie? (Preferably on the Blockchain, using an auction system?) Good luck with that, and with allowing the local theatres to adapt their screenings to their audiences.

But as said, it gets better: the 88% number is being arrived at by counting how many of the screens in the country showed Frozen II on a given day. It doesn't mean that that screen was used solely for Frozen II! If the screen was used at noon for a showing of Frozen II, and at 10pm for a Korean horror movie, that screen counts for both. Which makes the percentage a pretty useless number if you want to show monopolistic dominance (also, because the numbers add up to far more than 100%). Again, remember that in small towns there is often a small number of screens, and they have to show several different movies on the same screen. If the ideas of the lawsuit were enacted, you would need to keep Frozen II off a certain number of screens! Which basically makes it impossible to allow kids and teens in less populated areas to participate in event movie-going such as Frozen II and trying to avoid spoilers in Social Media afterwards.

Now, if you look how many screenings, instead of screens, were occupied by Frozen II, the number drops down to 46% - which is still impressive, but far less dominant and monopolistic than the 88% cited above (and in fact below the 50% the Korean law requires to establish dominance).
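To see how the two ways of counting diverge, here is a minimal Python sketch on a tiny schedule. The schedule, the screen names and the movie titles other than Frozen II are entirely made up for illustration, and so are the function names; the real 88% and 46% come from the reporting, not from this toy data.

```python
# Hypothetical one-day schedule: screen -> movies shown on that screen that day.
schedule = {
    "screen-1": ["Frozen II", "Frozen II", "Korean horror"],
    "screen-2": ["Frozen II", "Korean drama"],
    "screen-3": ["Korean horror", "Korean drama"],
    "screen-4": ["Frozen II"],
}

def screen_share(schedule, movie):
    """Share of screens that showed the movie at least once that day."""
    screens_with_movie = sum(1 for shows in schedule.values() if movie in shows)
    return screens_with_movie / len(schedule)

def screening_share(schedule, movie):
    """Share of all individual screenings that were of the movie."""
    all_screenings = [title for shows in schedule.values() for title in shows]
    return all_screenings.count(movie) / len(all_screenings)

movies = sorted({title for shows in schedule.values() for title in shows})
for movie in movies:
    print(movie, screen_share(schedule, movie), screening_share(schedule, movie))

# Screen shares can sum to more than 100%, screening shares always sum to 100%.
print(sum(screen_share(schedule, m) for m in movies))
print(sum(screening_share(schedule, m) for m in movies))
```

In this toy example Frozen II is on 75% of the screens but accounts for only 50% of the screenings, and the screen shares of all three movies add up to 175%.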

And even more impressive: in the end it is up to the audience. And even though 'only' 46% of the screenings were on Frozen II, every single day since its release between 60% and 85% of all revenue was going to Frozen II. So one could argue that the theatres were actually underserving the audience (but then again, that's not how it really works, because screenings are usually in rooms with a hundred or more seats, and they can be very differently filled - and showing a blockbuster three times with almost full capacity, and showing a less popular movie once with only a dozen or so tickets sold might still have served the local community better than only running the blockbuster).

I bet the NGO's goal is just to raise awareness about the dominance of the American entertainment industry, and for that, hey, it's certainly worth a shot! But would they really want to go back to a system where small local cinemas would not be able to show blockbusters for a long time, involving a complicated centralized planning component?

(Also, I wish there was a way to sign up for updates on a story, like this lawsuit. Let me know if anyone knows of such a system!)


Machine Learning and Metrology

There are many, many papers in machine learning these days. This paper takes a step back and thinks about how researchers measure their results, and how good a specific type of benchmark, the crowdsourced golden set, can even be. It brings a convincing example based on word similarity, using terminology and concepts from metrology, to show how many results that have been reported are actually not supported by the golden set, because the resolution of the golden set is actually insufficient. So there might be no improvement at all, and that new architecture might just be noise.

I think this paper is really worth the time of people in the research field. Written by Chris Welty, Lora Aroyo, and Praveen Paritosh.

The story of the Swedish calendar

Most of us are mostly aware how the calendar works. There’s twelve months in a year, each month has 30 or 31 days, and then there’s February, which usually has 28 days and sometimes, in what is called a leap year, 29. In general, years divisible by four are leap years.

This calendar was introduced by none other than Julius Caesar, before he became busy conquering the known world and becoming the Emperor of Rome. Before that he used to have the job title “supreme bridge builder” - the bridge connecting the human world with the world of the gods. One of the responsibilities of this role was to decide how many days to add to the end of the calendar year, because the Romans noticed that their calendar was getting misaligned with the seasons, because it was simply a bit too short. So, for every year, the supreme bridge builder had to decide how many days to add to the calendar.

Since we are talking about the Roman Republic, this was unsurprisingly misused for political gain. If the supreme bridge builder liked the people in power, he might have granted a few extra weeks. If not, no extra days. Instead of ensuring that the calendar and the seasons aligned, the calendar got even more out of whack.

Julius Caesar spearheaded a reform of the calendar, and instead of letting the supreme bridge builder decide how many days to add, the reform devised rules founded in observation and mathematical rules - leading to the calendar we still have today: twelve months each year, each with 30 or 31 days, besides February, which had 28, but every four years would have 29. This is what we today call the Julian calendar. This calendar was not perfect, but pretty good.

Over the following centuries, the role of the supreme bridge builder - or, in Latin, Pontifex Maximus - transferred from the Emperor of Rome to the Bishop of Rome, the Pope. And with continuing observations over centuries it was noticed that the calendar was again getting out of sync with the seasons. So it was the Pope - Gregory XIII, later called The Great - who, in his role as Pontifex Maximus, decided that the calendar should be fixed once again. The committee he set up to work on that came up with fabulous improvements, which would guarantee to keep the calendar in sync for a much longer time frame. In addition to the rules established by the Julian calendar, every hundred years we would drop a leap year. But every four hundred years, we would skip dropping the leap year (as we did in 2000, which not many people noticed). And in 1582, this calendar - called the Gregorian calendar - was introduced.
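Written down as code, the two sets of rules are only a few lines long. Here is a minimal Python sketch; the function names are my own, and the rules are applied to any year number, ignoring the question of when a given country actually adopted which calendar.

```python
def is_julian_leap_year(year):
    """Julian rule: every year divisible by four is a leap year."""
    return year % 4 == 0

def is_gregorian_leap_year(year):
    """Gregorian rule: like the Julian one, but century years are dropped,
    unless the year is divisible by four hundred."""
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0

for year in (1700, 1900, 2000, 2020, 2100):
    print(year, is_julian_leap_year(year), is_gregorian_leap_year(year))
```

The two rules disagree only for century years that are not divisible by 400, such as 1700, 1900 and 2100, which is exactly three times in four hundred years.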

Imagine leading a committee that comes up with rules on what the whole world would need to do once every four hundred years - and mostly having these rules implemented. How would you lead and design such a committee? I find this idea mind-blowing.

From the time of Caesar until 1582, about fifteen centuries had passed. And in this time, the calendar was getting slightly out of sync - by one day every century, skipping every fourth. In order to deal with that shift, they decided that ten calendar days needed to be skipped. Following the 4th of October 1582 was the 15th of October 1582. In 1582, there was no 5th or 14th of October, nor any of the days in between, in the countries that had adopted the Gregorian calendar.

This led to plenty of legal discussions, mostly about monthly rents and wages: is this still a full month, or should the rent or wage be paid prorated to the number of days? Should annual rents, interest, and taxes be prorated by these ten days, or not? What day of the week should the 15th of October be?


The Gregorian calendar was a marked improvement over the Julian calendar with regards to keeping the seasons in sync with the calendar. So one might think its adoption should be a no-brainer. But there was a slight complication: politics.

Now imagine that today the Pope gets out on his balcony, and declares that, starting in five years, January to November all have 30 days, and December has 35 or 36 days. How would the world react? Would they ponder the merits of the proposal, would they laugh, would they simply adopt it? Would a country such as Italy have a different public discourse about this topic than a country such as China?

In 1582, the situation was similarly difficult. Instead of pondering the benefits of the proposal, the source of the proposal and the relation to that source became the main deciding factor. Instead of adopting the idea because it is a good idea, the idea was adopted - or not - because the Pope of the Catholic Church declared it. The Papal state, the Spanish and French Kingdoms, were first to adopt it.

Queen Elizabeth wanted to adopt it in England, but the Anglican bishops were fiercely opposed to it because it was suggested by the Pope. Other Protestant countries and the Orthodox countries simply ignored it for centuries. And thus there was a 5th of October 1582 in England, but not in France, and that led to a number of confusions over the following centuries.

Ever wondered why the October Revolution started November 7? There we go. There is even a story that Napoleon won an important battle (either the Battle of Austerlitz or the Battle of Ulm) because the Russian and Austrian forces coordinated badly as the Austrians were using the Gregorian and the Russians the Julian calendar. The story is false, but it makes for a great story.

Today, the International Day of the Book is on April 23 - the death date of both Miguel de Cervantes and William Shakespeare in 1616, the two giants of literature in their respective languages - with the amusing side-effect that they actually died about ten days apart, even though they died on the same calendar day, but in different calendars.

It wasn’t until 1923 that for most purposes all countries had deprecated the Julian calendar, and for religious purposes some still follow it - which is why the Orthodox and the Amish celebrate Christmas on January 6. Starting 2101, that should shift by another day - and I would be very curious to see whether it will, or whether by then January 6th has solidified as the Christmas date.


Possibly the most confusing story about adopting the Gregorian calendar comes from Sweden. Like most Protestant countries, Sweden did not initially adopt the Gregorian calendar, and was sticking with the Julian calendar, until in 1699 they decided to switch.

Now, the idea of skipping ten or eleven days in one go did not sound appealing - remember all the chaos that occurred in the other countries when they dropped these days. So in Sweden they decided that instead of dropping the days all at once, they would drop them one by one, by skipping the leap days from 1700 until 1740, when the two calendars would finally be aligned.

In 1700, February 29 was skipped in Sweden. Which didn’t bring them any closer to Gregorian countries such as Spain, because those skipped the leap day in 1700 anyway. But it brought them out of alignment with Russia - by one day.

A war with Russia started (not about the calendar, but just a week before the calendars went out of sync, incidentally), and due to the war Sweden forgot to skip the leap days in 1704 and 1708 (they had other things on their mind). And as this was embarrassing, in 1711, King Charles XII of Sweden declared the plan abandoned, and added one extra day the following year to realign the calendar with Russia. And because 1712 was a leap year anyway, in Sweden there was not only a February 29, but also a February 30, 1712. The only legal February 30 in history so far.

It took not only the death of Charles XII, but also the deaths of his sister (who succeeded him) and of her husband (who succeeded her, and died in 1751), before Sweden could move beyond that embarrassing episode. In 1753, Sweden switched from the Julian to the Gregorian calendar, by cutting February short and ending it after February 17, following that by March 1.


Somewhere on my To-Do list, I have the wish to write a book on Wikidata. How it came to be, how it works, what it means, the complications we encountered, and the ones we missed, etc. One section in this book is planned to be about calendar models. This is an early, self-contained draft of part of that section. Feedback and corrections are very welcome.


Erdös number, update

I just made an update to a post from 2006, because I learned that my Erdös number has gone down from 4 to 3. I guess that's pretty much it - it is not likely I'll ever become a 2.

The Fourth Scream

Janie loved her research. It was at the intersection of so many interesting areas - genetics, linguistics, neuroscience. And the best thing about it - she could work the whole day with these adorable vervet monkeys.

One more time, she showed the video of the flying eagle to Kassandra. The MRI helmet on Kassandra’s little head measured the neuron activation, highlighting the same region on her computer screen as the other times, the same region as with the other monkeys. Kassandra let out the scream that Janie was able to understand herself by now, the scream meaning “Eagle!”, and the other monkeys behind the bars in the far end of the room, in a cage as large as half the room, ran for cover in the bushes and small caves, if they were close enough. As they did every time.

That MRI helmet was a masterpiece. She could measure the activation of the neurons in unprecedented high resolution. And not only that, she could even send inferencing waves back, stimulating very fine grained regions in the monkey’s brain. The stimulation wasn’t very fast, but it was a modern miracle.

She slipped a raspberry to Kassandra, and Kassandra quickly snatched it and stuffed it in her mouth. The monkeys came from different populations from all over Southern and Eastern Africa, and yet they all understood the same three screams. Even when the baby monkeys were raised by mute parents, the baby monkeys understood the same three screams. One scream was to warn them of leopards, one scream was to warn them of snakes, and the third scream was to warn them of eagles. The screams were universally understood by everyone across the globe - by every vervet monkey, that is. A language encoded in the DNA of the species.

She called up the aggregated areas from the scream from her last few experiments. In the last five years, she was able to trace back the proteins that were responsible for the growth of these three areas, and thus the DNA encoding these calls. She could prove that these three different screams, the three different words of Vervetian, were all encoded in DNA. That was very different from human language, where every word is learned, arbitrary, and none of the words are encoded in our DNA. Some researchers believed that other parts of our language were encoded in our DNA: deep grammatical patterns, the ability to merge chunks into hierarchies of meaning when parsing sentences, or the categorical difference between hearing the syllable ba and the syllable ga. But she was the first one to provably connect three different concrete genes with three different words that an animal produces and understands.

She told the software to create an overlapping picture of the three different brain areas activated by the three screams. It was a three dimensional picture that she could turn, zoom, and slice freely, in real time. The strands of DNA were highlighted at the bottom of the screen, in the same colors as the three different areas in the brain. One gene, then a break, then the other two genes she had identified. Leopard, snake, eagle.

She started to turn the visualization of the brain areas, as Kassandra started squealing in pain. Her hand was stuck between the cage bars and the plate with raspberries. The little thief was trying to sneak out a raspberry or two! Janie laughed, and helped the monkey get the hand unstuck. Kassandra yanked it back into the cage, looked at Janie accusingly, knowing that the pain was Janie’s fault for not giving her enough raspberries. Janie snickered, took out another raspberry and gave it to the monkey. She snatched it out of Janie’s hand, without stopping the accusing stare, and Janie then put the plate on the other side of the table, at a safe distance and out of sight of Kassandra.

She looked back at the screen. When Kassandra cried out, her hand had twitched, and turned the visualization to a weird angle. She just wanted to turn it back to a more common view, when she suddenly stopped.

From this angle, she could see the three different areas, connecting together with the audiovisual cortex at a common point, like the leaves of a clover. But that was just it. It really looked like three leaves of a four-leaf clover. The area where the fourth leaf would be - it looked a lot like the areas where the other three leaves were.

She zoomed into the audiovisual cortex. She marked the neurons that triggered each of the three leaves. And then she looked at the fourth leaf. The connection to the cortex was similar. A bit different, but similar enough. She was able to identify what were probably the trigger neurons, just like she was able to find them for the other three areas.

She targeted the MRI helmet on the neurons connected to the eagle trigger neurons, and with a click she sent a stimulus. Kassandra looked up, a bit confused. Janie looked at the neurons, how they triggered, unrolled the activation patterns, and saw how the signal was suppressed. She reprogrammed the MRI helmet, refined the neurons to be stimulated, and sent off another stimulus.

Kassandra yanked her head up, looking around, surprised. She looked at her screen, but it showed nothing as well. She walked nervously around inside the little cage, looking worriedly to the ceiling of the lab, confused. Janie again analyzed the activation patterns, and saw how it almost went through. There seemed to be a single last gatekeeper to pass. She reprogrammed the stimulator again. Third time's the charm, they say. She just remembered a former boyfriend, who was going on and on about this proverb. How no one knew how old it was, where it began, and how many different cultures all over the world associate trying something three times with eventual success, or an eventual curse. How some people believed you need to call the devil's name three times to —

Kassandra screamed out the same scream as before, the scream saying “Eagle!”. The MRI helmet had sent the stimulus, and it worked. The other monkeys jumped for cover. Kassandra raised her own arms above her head, peeking through her fingers to find the eagle she had just sensed.

Janie was more than excited! This alone will make a great paper. She could get the monkeys to scream out one of the three words of their language by a simple stimulation of particular neurons! Sure, she expected this to work - why wouldn’t it? But the actual scream, the confirmation, was exhilarating. As expected, the neurons now had a heightened potential, were easier to activate, waiting for more input. They slowly cooled down as Kassandra didn’t see any eagles.

She looked at the neurons connected to the fourth leaf. The gap. Was there a secret, fourth word hidden? One that all the zoologists studying vervet monkeys have missed so far? What would that word be? She reprogrammed the MRI helmet, aiming at the neurons that would trigger the fourth leaf. If her theory was right. With another click she sent a stimulus to the —

Janie was crouching in the corner of the room, breathing heavily, cold sweat was covering her arms, her face, her whole body. Her clothes were clammy. Her arms were slung above her head. She didn’t remember how she got here. The office chair she had been sitting in just a moment ago lay on the floor. The monkeys were quiet. Eerily quiet. She couldn’t see them from where she was, she couldn’t even see Kassandra from here, who was in the cage next to her computer. One of the halogen lamps in the ceiling was flickering. It wasn’t doing that before, was it?

She slowly stood up. Her body was shivering. She felt dizzy. She almost stumbled, just standing up. She slowly lowered her arms, but her arms were shaking. She looked for Kassandra. Kassandra was completely quiet, rolled up in the very corner of her cage, her arms slung around herself, her eyes staring catatonically forward, into nothing.

Janie took a step towards the middle of the room. She could see a bit more of the cage. The monkeys were partly huddled together, shaking in fear. One of them lay in the middle of the cage, his face in a grimace of terror. He was dead. She thought it was Rambo, but she wasn’t sure. She stumbled to the computer, pulled the chair from the floor, slumped into it.

The MRI helmet had recorded the activation pattern. She stepped through it. It did behave partially the same: the neurons triggered the unknown leaf, as expected, and that led to the activation of the muscles around the lungs, the throat, the tongue, the mouth - in short, that activated the scream. But, unlike with the eagle scream, the activation potential did not increase, it was now suppressed. As if it was trying to avoid a second triggering. She checked the pattern: yes, the neuron triggered that suppression itself. That was different. How did this secret scream sound?

Oh no! No, no, no, no, NOO!! She had not recorded the experiment. How stupid!

She was excited. She was scared, too, but she tried to push that away. She needed to record that scream. She needed to record the fourth word, the secret word of vervet monkeys. She switched on all three cameras in the lab, one pointed at the large cage with the monkeys, the other two pointing at Kassandra - and then she changed her mind, and turned one onto herself. What had happened to her? Why couldn’t she remember hearing the scream? Why had she been crouching on the floor like one of the monkeys?

She checked her computer. The MRI helmet was calibrated as before, pointing at the group of triggering neurons. The suppression was ebbing down, but not as fast as she wanted. She increased the stimulation power. She shouldn’t. She should follow protocol. But this all was crazy. This was a cover story for Nature. With her as first author. She checked the recording devices. All three were on. The streams were feeding back into her computer. She clicked to send the sti—

She felt the floor beneath her. It was dirty and cold. She was lying on the floor, face down. Her ears were ringing. She turned her head, opened her eyes. Her vision was blurred. Over the ringing in her ears she didn’t hear a single sound from the monkeys. She tried to move, and she felt her pants were wet. She tried to stand up, to push herself up.

She couldn’t.

She panicked. Shivered. And when she felt the tears running over her face, she clenched her teeth together. She tried to breathe, consciously, to collect herself, to gain control. Again she tried to stand up, and this time her arms and legs moved. Slower than she wanted. Weaker than she hoped. She was shaking. But she moved. She grabbed the chair. Pulled herself up a bit. The computer screen was as before, as if nothing had happened. She looked to Kassandra.

Kassandra was dead. Her eyes were bloodshot. Her face was a mask of pure terror, staring at nothing in the middle of the room. Janie tried to look at the cage with the other monkeys, but she couldn’t focus her gaze. She tried to yank herself into the chair.

The chair rolled away, and she crashed to the floor.

She had gone too far. She had made a mistake. She should have followed protocol. She was too ambitious, her curiosity and her impatience got the better of her. She had to focus. She had to fix things. But first she needed to call for help. She crawled to the chair. She pulled herself up, tried to sit in the chair, and she did it. She was sitting. Success.

Slowly, she rolled back to the computer. Her office didn’t have a phone. She double-clicked on the security app on her desktop. She had no idea how it worked, she never had to call security before. She hoped it would just work. A screen opened, asking her for some input. She couldn’t read it. She tried to focus. She didn’t know what to do. After a few moments the app changed, and it said in big letters: HELP IS ON THE WAY. STAY CALM. She closed her eyes. Breathed. Good.

After a few moments she felt better. She opened her eyes. HELP IS ON THE WAY. STAY CALM. She read it, once, twice. She nodded, her gaze jumping over the rest of the screen.

The recording was still on.

She moved the mouse cursor to the recording app. She wanted to see what had happened. There was nothing to do anyway, until security came. She clicked on the play button.

The recording filled three windows, one for each of the cameras. One pointed at the large cage with the vervet monkeys, two at Kassandra. Then, one of the cameras pointing at Kassandra was moved, pointing at Janie, just moments ago - it was moments, wasn’t it? - sitting at the desk. She saw herself getting ready to send the second stimulus to Kassandra, to make her call the secret scream a second time.

And then, from the recording, Kassandra called for a third time.

The end

History of knowledge graphs

An overview on the history of ideas leading to knowledge graphs, with plenty of references. Useful for anyone who wants to understand the background of the field, and probably the best current such overview.

On the competence of conspiracists

“Look, I’ll be honest, if living in the US for the last five years has taught me anything is that any government assemblage large enough to try to control a big chunk of the human population would in no way be consistently competent enough to actually cover it up. Like, we would have found out in three months and it wouldn’t even have been because of some investigative reporter, it would have been because one of the lizards forgot to put on their human suit one day and accidentally went out to shop for a pint of milk and like, got caught in a tik-tok video.” -- Os Keyes, WikidataCon, Keynote "Questioning Wikidata"

Power in California

It is wonderful to live in the Bay Area, where the future is being invented.

Sure, we might not have a reliable power supply, but hey, we have an app that connects people with dogs who don't want to pick up their poop with people who are desperate enough to do this shit.

Another example of how the capitalism that we currently live in failed massively: last year, PG&E was found responsible for killing people and destroying a whole city. Now they really want to play it safe, and switch off the power for millions of people. And they say this will go on for a decade. So in 2029 when we're supposed to have AIs, self-driving cars, and self-tying Nikes, there will be cities in California that will get their power shut off for days when there is a hot wind for an afternoon.

Why? Because the money that should have gone into, that was already earmarked for, making the power infrastructure more resilient and safe went into bonus payments for executives (that sounds so cliché!). They tried to externalize the cost of an aging power infrastructure - the cost being literally the lives and homes of people. And when told not to, they put millions of people in the dark.

This is so awfully on the nose that there is no need for metaphors.

San Francisco offered to buy the local power grid, to put it into public hands. But PG&E refused that offer of several billion dollars.

So if you live in an area that has a well working power infrastructure, appreciate it.