Semantic search

Jump to navigation Jump to search

A sermon on tolerance and inclusion

Warning: meandering New Year's sermon ahead, starting at a random point and going somewhere entirely else.

I started reading Martin Kay's book on Translation, and I am enjoying it quite a bit so far. Kay passed away August 2021. His work seems highly relevant for the work on Abstract Wikipedia.

One thing that bummed me though is that for more than a page in the introduction he rants about pronouns and how he is going to use "he" to generically mean both men and women, and how all other solutions have deficits.

He culminates in the explanation: "Another solution to this problem is which is increasing in popularity, is to use both 'he' and 'she', shifting between them more or less randomly. So we will sometimes get 'When a translator is confronted with a situation of this kind, she must decide...'. The trouble with this is that some readers, including the present writer, reacts quite differently to the sentence depending on which version of the generic pronoun it contains. We read the one containing 'he' smoothly and, all else being equal, assimilate the intended meaning. Encountering the one with 'she', on the other hand, is like following a television drama that is suddenly interrupted by a commercial."

Sooo frustratingly close to getting it.

I wish he'd had just not spent over a page on this topic, but just used the generic 'he' in the text, and that's it. I mean, I don't expect everyone born more than eighty years ago to adjust to the modern usage of pronouns.

Now, I am not saying that to drag Kay's name through dirt, or to get him cancelled or whatever. I have never met him, but I am sure he was a person with many positive facets, and given my network I wouldn't be surprised if there are people who knew him and can confirm so. I'm also not saying that to virtue signal and say "oh man, look how much more progressive I am". Yes, I am slightly annoyed by this page. Unlike many others though, I am not actually personally affected by it - I use the pronoun "he" for myself and not any other pronoun, so this really is not about me. Is it because of that that it is easy for me to gloss over this and keep reading?

So is it because I am not affected personally that it is so easy for me to say the following: it is still worthwhile to keep reading his work, and the rest of the book, and to build on top of his work and learn from him. The people we learn some things from, the influences we accept, they don't have to be perfect in every way, right? Would it have been as easy for me to say that if I were personally affected? I don't know.

I am worried about how quickly parts of society seems to be ready to "cancel" and "call out" people, and how willing they are to tag a person as unacceptable because they do not necessarily share every single belief that is currently regarded as a required belief.

I have great difficulties in drawing the line. Which beliefs or actions of a person should be sufficient grounds to shun them or their work? When JK Rowling doubles down on her stance regarding trans women, is this enough to ask everyone to drop all interest in the world she created and the books she wrote? Do we reshoot movie scenes such as the cameo of Donald Trump in Home Alone 2 in order to "purify" the movie and make it acceptable for our new enlightened age again? When Johnny Depp was accused of domestic abuse, does he need to be recast from movies he had already been signed on? Do we also need to stop watching his previous movies? Do the believable accusations of child abuse against Marion Zimmer Bradley mean that we have to ignore her contributions to feminist causes, never mind her books? Should we stop using a font such as Gill Sans because of the sexual abuse Erjc Gill committed against his daughters? Do we have to stop watching movies or listen to music produced by murderers such as OJ Simpson, Phil Spector, or Johnny Lewis?

I intentionally escalated the examples, and they don't compare at all to Kay's defence of his usage of pronouns.

I offer no answers as to where the line should be, I have none. I don't know. In my opinion, none of us is perfect, and none of our idols, paragons, or example model humans will survive the scrutiny for perfection. This is not a new problem. Think of Gandhi, Michael Jackson, Alice Schwarzer, Socrates - no matter where you draw your idols from, they all come with imperfections, sometimes massive ones.

Can we keep and accept their positive contributions - without ignoring their faults? Can we allow people with faults to still continue to contribute their skills to society, or do we reduce them to their faults and negatives? Do we have to get someone fired for tweeting a stupid joke? Do we demand perfection by everyone at all time?

Or do we allow everyone to be human, make and have errors, and have beliefs many don't deem acceptable? Committing or causing actions resulting from these beliefs? Even if these actions and beliefs hurt or endanger people, or deny the humanity of others? We don't have to and should not accept their racism, sexism, homo- and transphobia - but can and should we still recognise their other contributions?

I am worried about something else as well. By pushing out so many because of the one thing they don't want to accept in the basket of required beliefs, we push them all into the group of outsiders. But if there are too many outsiders, the whole system collapses. Do we all have to have the same belief on guns, on climate, on gender, on abortion, on immigration, on race, on crypto, on capitalism, on housing? Or can we integrate and work together even if we have differences?

The vast majority of Americans think that human-caused climate change is real and that we should act to avoid it. Only 10% don't. And yet, because of the way we define and fence our in- and outgroups, we have a strong voting block that repeatedly leads to outright sabotage to effective measures. A large majority of Americans support the right to abortion, but you would never be able to tell given the fights around laws and court cases. Taxing billionaires more effectively is highly popular with voters, but again these majorities fizzle away and don't translate to the respective changes in the tax code.

I think we should be able to work together with people we don't agree with on everything. We should stop requiring perfection and alignment on all issues before moving forward. But then again, that's what I am saying, and I am saying it from a position of privilege, am I not? I am male. I am White. I am heterosexual. I am not Muslim or Jewish. I am well educated. I am not poor. I am reasonably technologically savvy. I am not disabled. What right do I have at all to voice my opinion on these topics? To demand for acceptance people with beliefs that hurt or endanger people who are not like me. Or even to ask for your precious attention for these words of mine?


And yet I hope that we will work together towards progress on the topics we agree on, that we will enlighten each other on the topics we disagree on, and that we will be able to embrace more of us on our way into the future.

P.S.: this post is problematic and not very well written, and I recognise that. Please refer to the discussion about it on Facebook.

Long John and Average Joe

You may know about Long John Silver. But who's the longest John? Here's the answer according to Wikidata:

What about your Average Joe? Here's the answer about the most average Joe, based on all the Joes in Wikidata:

Note, the average height of a Joe in Wikidata is 1,86cm or 6'1", which is quite a bit higher than the average height in the population. A data collection and coverage issue: it is much more likely to have the height for a basketball player than for an author in Wikidata.

Just two silly queries for Wikidata, which are nice ways to show off the data set and what one can do with the SPARQL query endpoint. Especially the latter one shows off a rather interesting and complex SPARQL query.

Temperatures in California

It has been a bit chillier the last few days. I noticed that after almost a decade in California, I feel pretty comfortable with understanding temperatures in Fahrenheit - as long as they are over 60° F. If it is colder, I need to switch to Celsius in order to understand how cold it exactly is. I have no idea what 40° or 45° or 50° F are, but I still know what 5° C is!

The fact that I still haven't acclimatised to Fahrenheit for the cooler temperatures tells you a lot about the climate in California.

SWSA panel

Thursday, October 7, 2021, saw a panel of three founding members of the Semantic Web research community, who each have been my teachers and mentors over the years: Rudi Studer, Natasha Noy, and Jim Hendler. I loved watching the panel and enjoyed it thoroughly, also because it was just great to see all of them again.

There were many interesting insights and thoughts in this panel, too many to write them all down, but I want to mention a few.

It was interesting how much all panelists talked about creating the Semantic Web community, and how much of an intentional effort that was. Deciding that it needs a conference, a journal, an organization, setting those up, and their interactions. Seeing and fostering a sustainable research community grown out of an idea is a formidable and amazing effort. They all mentioned positively the diversity in the community, and that it was a conscious effort to work towards that. Rudi mentioned that the future challenge will be with ensuring that computer science students actually have Semantic Web technologies integrated into their standard curriculum.

They named a number of the successes that were influenced by the Semantic Web research work, such as, the heavy use of SPARQL in supercomputing (I had no idea!), Wikidata (thanks for the shout out, Rudi!), and the development of scalable graph databases. Natasha raised the advantage of having common identifiers throughout an organization, i.e. that everyone refers to California the same way. They also named areas that remained elusive and that they expect to see progress in the coming years, Rudi in particular mentioned Agents and Common Sense, which was echoed by the other participants, and Jim mentioned Personal Knowledge Graphs. Jim mentioned he was surprised by the growing importance of unstructured data. Jim is also hoping for something akin to “procedural attachments” - you see some new data coming in, you perform this action (I would like to think that a little Wikifunctions goes a long way).

We need both, open knowledge graphs and closed knowledge graphs (think of your personal ones, but also the ones by companies).

The most important contribution so far and also well into the future was the idea of decentralization of semantics. To allow different stakeholders to work asynchronously and separately on parts of the semantics and yet share data. This also includes the decentralization of knowledge graphs, but also in the future we will encounter a world where semantics are increasingly brought together and yet decentralized.

One interesting anecdote was shared by Natasha. She was talking about a keynote by Guha (one of the few researchers who were namechecked in the panel, along with Tim Berners-Lee) at ISWC in Sydney 2013. How Guha was saying how simple the technology needs to be, and how there were many in the audience who were aghast and shocked by the talk. Now, eight years later and given her experience building Dataset Search, she appreciates the insights. If they have a discussion about a new property for longer than five minutes, they drop it. It’s too complicated, and people will use it wrong so often that the data cleanup will become expensive.

All of them shared the advice for researchers in their early career stage to work on topics that truly inspire them, on problems that are real and that they and others care about, and that if they do so, the results have the best chance to have impact. Think about problems you can explain to people not in your field, about “how can we use triples to save the world” - and not just about “hey, look, that problem that we solved with these other technologies previously, now we can also solve it with Semantic Web technologies”. This doesn’t really help anyone. Solve new problems. Solve real problems. And do what you are truly passionate about.

I enjoyed the panel, and can recommend everyone in the Semantic Web research area or any related, nearby research, to check it out. Thanks to the organizers for this talk (which is the first session in a series of talks that will continue with Ora Lassila early December).

Our four freedoms for our technology

(This is a draft. Comments are welcome. This is not meant as an attack on any person or company individually, but at certain practises that are becoming increasingly prevalent)

We are not allowed to use the devices we paid for in the ways we want. We are not allowed to use our own data in the way we want. We are only allowed to use them in the way the companies who created the devices and services allow us.

Sometimes these companies are nice and give us a lot of freedom in how to use the devices and data. But often they don’t. They close them down for all kinds of reasons. They may say it is for your protection and safety. They might admit it is for profit. They may say it is for legal reasons. But in the end, you are buying a device, or you are creating some data, and you are not allowed to use that device and that data in the way you want to, you are not allowed to be creative.

The companies don’t want you to think of the devices that you bought and the data that you created as your devices and your data. They want you to think of them as black boxes that offer you services they create for you. They don’t want you to think of a Ring doorbell as a camera, a microphone, a speaker, and a button, but they want you to think of it as providing safety. They don’t want you to think of the garage door opener as a motor and a bluetooth module and a wifi module, but as a garage door opening service, and the company wants to control how you are allowed to use that service. Companies like Chamberlain and SkyLink and Genie don’t allow you to write a tool to check on your garage door, and to close or open it, but they make deals with Google and Amazon and Apple in order to integrate these services into their digital assistants, so that you can use it in the way these companies have agreed on together, through the few paths these digital assistants are available. The digital assistant that you buy is not a microphone and a speaker and maybe a camera and maybe a screen that you buy and use as you want, but you buy a service that happens to have some technical ingredients. But you cannot use that screen to display what you want. Whether you can watch your Amazon Prime show on the screen of a Google Nest Hub depends on whether Amazon and Google have an agreement with each other, not on whether you have paid for access to Amazon Prime and you have paid for a Google Nest Hub. You cannot use that camera to take a picture. You cannot use that speaker to make it say something you want it to say. You cannot use the rich plethora of services on the Web, and you cannot use the many interesting services these digital assistants rely on, in novel and creative combinations.

These companies don’t want you to think of the data that you have created and that they have about you as your data. They don’t want you to think about this data at all. They just want you to use their services in the way they want you to use their services. On the devices they approve. They don’t want you to create other surfaces that are suited to the way you use your data. They don’t want you to decide on what you want to see in your feed. They don’t want you to be able to take a list of your friends and do something with it. They will say it is to protect privacy. They will say that it is for safety. That is why you cannot use the data you and your friends have created. They want to exactly control what you can and cannot do with the data you and your friends have created. They want to control how many ads you must see in order to be allowed to see your friends’ posts. They don't want anyone else to have the ability to provide you creative new interfaces to your feed. They don’t want you yourself the ability to look at your feed and do whatever you want with it.

Those are devices you paid for.

These are data you and your friends have created.

And more and more we are losing our freedom of using our devices and our data as we like.

It would be impossible to invent email today. It would be impossible to invent the telephone today. Both are protocols that allow everyone to communicate with anyone no matter what their email provider or their phone is. Try reading your friend’s Facebook feed on Instagram, or send a direct message from your Twitter account to someone on WhatsApp, or call your Skype contact on Facetime.

It would be impossible to launch the Web today - many companies don’t want you browsing the Web. They want you to be inside of your Facebook feed and consume your content there. They want you to be on your Twitter feed. They don’t want you to go to the Website of the New York Times and read an article there, they don’t want you to visit the Website of your friend and read their blog there. They want you to stay on their apps. Per default, they open Websites inside their app, and not in your browser, so you are always within their app. They don’t want you to experience the Web. The Web is dwindling and all the good things on it are being recut and rebundled within the apps and services of tech companies.

Increasingly, we are seeing more and more walls in the world. Already, it is becoming impossible to pay and watch certain movies and shows without buying into a full subscription in a service. We will likely see the day where you will need a specific device to watch a specific movie. Where the only way to watch a Disney+ exclusive movie is on a Disney+ tablet. You don’t think so? Think about how easy it is to get your Kindle books onto another Ebook reader. How do you enable a skill or capability available in Alexa on your Nest smart speaker? How can you search through the books that you bought and are in your digital library, besides by using a service provided by the company that allows you to search your digital library? When you buy a movie today on YouTube or on iMovies, what do you own? What are you left with when the companies behind these services close that service, or go out of business altogether?

Devices and content we pay for, data we and our friends create, should be ours to use in empowering and creative ways. Services and content should not be locked in with a certain device or subscription service. The bundling of services, content, devices, and locking up user data creates monopolies that stifle innovation and creativity. I am not asking to give away services or content or devices for free, I am asking to be allowed to pay for them and then use them as I see fit.

What can we do?

As far as I can tell, the solution, unfortunately, seems to be to ask for regulation. The market won’t solve it. The market doesn’t solve monopolies and oligopolies.

But don’t ask to regulate the tech giants individually. We don’t need a law that regulates Google and a law that regulates Apple and a law that regulates Amazon and a law to regulate Microsoft. We need laws to regulate devices, laws to regulate services, laws to regulate content, laws that regulate AI.

Don’t ask for Facebook to be broken up because you think Mark Zuckerberg is too rich and powerful. Breaking up Facebook, creating Baby Books, will ultimately make him and other Facebook shareholders richer than ever before. But breaking up Facebook will require the successor companies to work together on a protocol to collaborate. To share data. To be able to move from one service to another.

We need laws that require that every device we buy can be made fully ours. Yes, sure, Apple must still be allowed to provide us with the wonderful smooth User Experience we value Apple for. But we must also be able to access and share the data from the sensors in our devices that we have bought from them. We must be able to install and run software we have written or bought on the devices we paid for.

We need laws that require that our data is ours. We should be able to download our data from a service provider and use it as we like. We must be allowed to share with a friend the parts of our data we want to share with that friend. In real time, not in a dump download hours later. We must be able to take our social graph from one social service and move to a new service. The data must be sufficiently complete to allow for such a transfer, and not crippled.

We need laws that require that published content can be bought and used by us as we like. We should be able to store content on our hard disks. To lend it to a friend. To sell it. Anything I can legally do with a book I bought I must be able to legally do with a movie or piece of music I bought online. Just as with a book you are not allowed to give away the copies if the work you bought still enjoys copyright.

We need laws that require that services and capabilities are unbundled and made available to everyone. Particularly as technological progress with regards to AI, Quantum computing, and providing large amounts of compute becomes increasingly an exclusive domain for trillion dollar companies, we must enable other organizations and people to access these capabilities, or run the risk that sooner or later all and any innovation will be happening only in these few trillion dollar companies. Just because a company is really good at providing a specific service cheaply, it should not be allowed to unfairly gain advantage in all related areas and products and stifle competition and innovation. This company should still be allowed to use these capabilities in their products and services, but so should anyone else, fairly prized and accessible by everyone.

We want to unleash creativity and innovation. In our lifetimes we have seen the creation of technologies that would have been considered miracles and impossible just decades ago. These must belong to everybody. These must be available to everyone. There cannot be equity if all of these marvellous technologies can be only wielded by a few companies on the West coast of the United States. We must make them available to all the people of the world: the people of the Indian subcontinent, the people of Subsaharan Africa,the people of Latin America, and everyone else. They all should own the devices they paid for, the data they created, the content they paid for. They all should have access to the same digital services and capabilities that are available to the engineers at Amazon or Google or Microsoft. The universities and research centers of the world should be able to access the same devices and services and extend them with their novel and creative ideas. The scrappy engineers in Eastern Europe and India and Nigeria and Central Asia should be able to call the AI models trained by Google and Microsoft and use them in novel ways to run their devices and chip-powered cars and agricultural machines. We want a world of freedom, tinkering, where creativity and innovation are unleashed, and where everyone can contribute their ideas, their creativity, and where everyone can build their fortune.

The Center of the Universe

The discovery of the center of the universe led to a series of unexpected consequences. It killed some, it enlightened others, but most people just were left utterly confused in the end.

When the results from the Total Radiating Universal Tessellation Hyperfield satellites measurements came in, it became depressingly clear that the universe was indeed contracting. Very slowly, but without any reasonable doubt — or, as the physicists said, they were five sigma sure about it. As the data from the measurements became available, physicists, cosmologists, topologists, even a few mathematically inclined philosophers, and a huge number of volunteers started to investigate it. And after a short period of time, they came to a whole set of staggering conclusions.

First, the Universe had a rather simple four-dimensional form. The only unfortunate blemishes in this theory were the black holes, but most of the volunteers, philosophers, and topologists decided to ignore these as accidental.

Second, the form was bounded. There was a beginning and an end in time, and there were boundaries in space, and those who understood that these were the same were enlightened about the form of the universe.

Third, since the form of the universe was bounded and simple, it had a center. Whereas this was slightly surprising it was a necessary consequence of the previous findings. What first seemed exciting, but soon will turn out not to be only the heart of this report, but the heart of all humanity, was that the data collected by the satellites allowed to calculate the position of the center of the universe.

Before that, let me recapture what we traditionally knew about how the universe is built. Our sun is a star, around which a few planets travel, one of them being our Earth. Our sun is one of a few tens of billions of stars that form a long curved thread which ties around a supermassive black hole. A small number of such threads are tangled together, forming the spiral arms of our galaxy, the Milky Way. Our galaxy consists of half a trillion stars like our sun.

Galaxies, like everything else in the universe, like to stick together and form groups. A few hundred thousand galaxies make up a supercluster. A few of these superclusters together build enormous walls of stars, filaments traversing the universe. The galaxies of such a wall are all in a single plane, more or less, and sometimes even in a single line.

Between these walls, walls made of superclusters and galaxies and stars and planets, there is, basically, nothing. The walls of stars are like gigantic honeycombs, and between them, are enormous empty spaces, hundred million of light years wide. When you look at a honeycomb, you will see that the empty spaces between the walls are much, much larger than the walls themselves. Such is the universe. You might think that the distance from here to the next grocery store is quite far, or that the ocean is quite big. But the distance from the earth to the sun is so much bigger, and the distance from the sun to the next star again so much more. And from our galaxy to the next, there is a huge empty space. Nevertheless, our galaxy is so close to the next group of galaxies that they together form a building block of a huge wall, separating two unimaginable large empty spaces from each other.

So when we figured out that we can calculate the center of the universe, it was widely expected that the center would be somewhere in one of those vast spaces of nothing. The chances that it would be in one of the filaments were tiny.

It turned out that this was not a question of chance.

The center of the universe was not only inside of a filament, but the first quick calculations (quick, though, has to be understood as taking three and a half years) suggested that the center is actually within our filament. And not only within our filament — but our galaxy. Within a one light year radius of our sun.

The team that made these calculations was working at a small research institute in rural Japan. They did not believe the results, and double and triple checked them. The head of the institute had graduated from Princeton, and called his former advisor there. Although it was deep in the night in Japan, they talked for many hours. In the end he learned that Princeton has made the same calculations, and received their own results about eight months ago. They didn’t dare to publish them. There must have been a mistake. These results had to be wrong.

Science has humiliated the whole of humanity again and again. And it was quite successful in doing so. A scientist would much easier accept that the center of the universe is some mathematical construct pointing to nothing than what the infallible mathematics indicated. But the data was out. And the number of people making the above mentioned realizations and calculations continued growing. It was only a matter of time. And when the Catholic University of Rio de Janeiro finally published the results — in a carefully written paper, without any accompanying press release, and formulated so cautiously and defensively — all the scientists who already knew the results held their breath.

The storm was unimaginable. Everyone demanded an explanation, but no one would listen to anyone offering one. The religions rejoiced, claiming they knew it all along, and many flocked to the mosques and churches and temples, as a proof of God was finally found. The irony of science leading humans to the embrace of religion was profoundly lost at that time, but later recognized as one of the largest jokes in history. Science has dealt its ultimate humiliation, not to humanity, but perversely to its most devout followers, the scientists. The scientists, who, while trashing the superiority of humans over the world, were secretly inflating their own, and were now reminded that they were merely slaves to a most cruel mistress. Their bitter resistance to the results did not stop them from emerging.

The mathematics and calculations were soon made public. The mathematics were deceptively simple, once the required factorizations were done, and easy to check. High school courses went through the proofs, and desperate parents peeked over the shoulders of their daughters and sons who, sometimes for the first time, talked of integrals and imaginary numbers. Television and streaming platforms were explaining discriminants and complex numbers and roots of higher degrees. Websites offering math courses bent under the load and moral weight.

There is one weird thing about roots. The root of a number is the number that, multiplied with itself, gives you the original number. The weird thing is that there is usually not a single, unique result to that question. For example, the root of the number four is not just two, but also minus two, as minus two times minus two results in four, too. There are two roots of the second degree (which we usually call the square root). There are three roots of the third degree (sometimes called the cube root). There are four roots of the fourth degree. And so on. All of them are correct. Sometimes you can discard one or the other because the result has to fit certain constraints (say, you are looking only for the positive root of four), but sometimes, you can not.

As the calculations went public, the methods became more and more refined. The results became increasingly precise, and as the data from the satellites poured in, one of the last steps involved a root of the seventh degree. First, this was regarded as a minor curiosity, especially because these seven results led to basically the same point. Cosmologically speaking.

Earth is moving. Earth is moving around the sun, with a speed of a sixty seven thousand miles per hour, or eighteen miles each second. Also the sun is moving, and the earth is moving with the sun, and our galaxy is moving, and with our galaxy the sun moves along, and with the sun our earth. We are racing with a speed of a thousand miles each second in some direction away from the center of the universe.

And it was realized, maybe we just passed the center of the universe. Maybe it was just an accident, maybe all the planets and stars pass the center of the universe at some point. That we are so close to the center of the universe might be just a funny coincidence.

And maybe they are right. Maybe every star will at some point cross the center of the universe within the distance of a light year.

At some point though it was realized that, since the universe was bounded in all four dimensions, there was not only a center in space, but also a center in time, a midpoint between the beginning of the universe and its future end.

All human history is encompassed in the last hundred thousand years. From the mitochondrial Eve and the Y-Chromosomal Adam who lived in Africa, the mother of our mother of our mother, and so on, that we all share, and the father of our father of our father, and so on, that we all share, their descendants, our ancestors, who crossed the then fertile jungle of the Sahara and who afterwards settled the whole planet, painted on the walls of caves and filled the air with music by blowing over grass blades and into hollow bones, wandered over the land bridge connecting Asia with the Americas and traveled over the vast Pacific to discover tiny islands, until the recent invention of the alphabet, all of this happened in the last hundred thousand years. The universe has an age of hundred thousand times a hundred thousand years, roughly. And the fabled midpoint turned out to be within the last few thousand years.

The hopes that our earth was just accidentally next to the center of the universe was shattered. As the precision of the calculations increased, it became clearer and clearer that earth was not merely close to the center of the universe, but back at the midpoint of history, earth was right there in the center. In every single of the seven possible results, Earth was right at the center of the universe. [1]

As the calculations continued over the years, a new class of mystic mathematicians emerged, and many walls between religion and science were shattered. On both sides the unshakeable ones remained: the scientists who would not admit that these results mean anything, that it all is merely a mathematical abstraction; and the priests who say that these results mean nothing, that they don’t tell us about how to live a good life. That these parallels intersect, is the only trace of infinity left.

[1] As the results refined, it seemed that the seven mathematical solutions for the center of time and space turned out to be some very well known dates. So far the precisions calculated was ten years here or there. The well known dates were: 3760 BC, 541 BC, 30 AD, and 610 AD. The other dates turned out to be quite less well known: 10909 BC, 3114 BC, and 1989 AD. The interpretation of the dates led to a well-known series of events all over the world, which we will not discuss here.

(This story was first published on Medium on February 2, 2014 under CC-BY 4.0).

CodeNet problem descriptions on the Web

Project CodeNet is a large corpus of code published by IBM. It has close to one and a half million programs around a bit more than 4,000 problems.

I took the problem descriptions, created a simple index file to those, and uploaded them to the Web to make them easily browseable.

Wikidata or scraping Wikipedia

Yesterday I was pointed to a blog post describing how to answer an interesting project: how many generations from Alfred the Great to Elizabeth II? Alfred the Great was a king in England at the end of the 9th century, and Elizabeth II is the current Queen of England (and a bit more).

The author of the blog post, Bill P. Godfrey, describes in detail how he wrote a crawler that started downloading the English Wikipedia article of Queen Elizabeth II, and then followed the links in the infobox to download all her ancestors, one after the other. He used a scraper to get the information from the Wikipedia infoboxes from the HTML page. He invested quite a bit of work in cleaning the data, particularly doing entity reconciliation. This was then turned into a graph and the data analyzed, resulting in a number of paths from Elizabeth II to Alfred, the shortest being 31 generations.

I honestly love these kinds of projects, and I found Bill’s write-up interesting and read it with pleasure. It is totally something I would love to do myself. Congrats to Bill for doing it. Bill provided the dataset for further analysis on his Website. Thanks for that!

Everything I say in this post is not meant, in any way, as a criticism of Bill. As said, I think he did a fun project with interesting results, and he wrote a good write-up and published his data. All of this is great. I left a comment on the blog post sketching out how Wikidata could be used for similar results.

He submitted his blog post to Hacker News, where a, to me, extremely surprising discussion ensued. He was pointed rather naturally and swiftly to Wikidata and DBpedia. DBpedia is a project that started and invested heavily in scraping the infoboxes from Wikipedia. Wikidata is a sibling project of Wikipedia where data can be directly maintained by contributors and accessed in a number of machine-readable ways. Asked why he didn’t use Wikidata, he said he didn’t know about it. All fair and good.

But some of the discussions and comments on Hacker News surprised me entirely.

Expressing my consternation, I started discussions on Twitter and on Facebook. And there were some very interesting stories about the pain of using Wikidata, and I very much expect us to learn from them and hopefully make things easier. The number of API queries one has to make in order to get data (although, these numbers would be much smaller than with the scraping approach), the learning curve about SPARQL and RDF (although, you can ignore both, unless you want to use them explicitly - you can just use JSON and the Wikidata API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead of “mother” and “Queen Elizabeth II”) were just a few. The documentation seems hard to find, there seem to be a lack of libraries and APIs that are easy to use. And yet, comments like "if you've actually tried getting data from wikidata/wikipedia you very quickly learn the HTML is much easier to parse than the results wikidata gives you" surprised me a lot.

Others asked about the data quality of Wikidata, and complained about the huge amount of bad data, duplicates, and the bad ontology in Wikidata (as if Wikipedia wouldn’t have these problems. I mean how do you figure out what a Wikipedia article is about? How do you get a list of all bridges or events from Wikipedia?)

I am not here to fight. I am here to listen and to learn, in order to help figuring out what needs to be made better. I did dive into the question of data quality. Thankfully, Bill provides his dataset on the Website, and downloading the query result for the following query - select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q } - is just one click away. The result of this query is equivalent to what Bill was trying to achieve - a list of all ancestors of Elizabeth II. (The actual query is a little bit more complex, because we also fetch the names of the ancestors, and their Wikipedia articles, in order to help match the data to Bill’s data).

I would claim that I invested far less work than Bill in creating my graph data. No data cleansing, no scraping, no crawling, no entity reconciliation, no manual checking. How about the quality of the two datasets?

Update: Note, this post is not a tutorial to SPARQL or Wikidata. You can find an explanation of the query in the discussion on Hacker News about this post. I really wanted to see how the quality of the data using the two approaches compares. Yes, it is an unfamiliar language for many, but I used to teach SPARQL and the basics of the languages seem not that hard to learn. Try out this tutorial for example. Update over

So, let’s look at the datasets. I will refer to the two datasets as the scrape (that’s Bill’s dataset) and Wikidata (that’s the query result from Wikidata, as of the morning of August 20 - in particular, none of the errors in Wikidata mentioned below have been fixed).

In the scrape, we find 2,584 ancestors of Elizabeth II (including herself). They are connected with 3,528 parenthood relationships.

In Wikidata, we find 20,068 ancestors of Elizabeth II (including herself). They are connected with 25,414 parenthood relationships.

So the scrape only found a bit less than 13% of the people that Wikidata knows about, and close to 14% of the relationships. If you ask me, that’s quite a bad recall - almost seven out of eight ancestors are missing.

Did the scrape find things that are missing in Wikidata? Yes. 43 ancestors are in the scrape which are missing in Wikidata, and 61 parenthood relationships are in the scrape which are missing from Wikidata. That’s about 1.8% of the data in the scrape, or 0.24% compared to the overall parent relationship data of Elizabeth II in Wikidata.

I evaluated the complete list of those relationships from the scrape missing from Wikidata. They fall into five categories:

  • Category 1: Errors that come from the scraper. 40 of the 61 relationships are errors introduced by the scrapers. We have cities or countries being parents - which isn’t too terrible, as Bill says in the blog post because they won’t have parents themselves and won’t participate in the original question of findinging the lineage from Alfred to Elizabeth, so no problem. More problematic is when grandparents or great-grandparents are identified as the parent, because this directly messes up the counting of generations: Ügyek is thought to be a son, not a grandson of Prince Csaba, Anna Dalassene is skipping two generations to Theophylact Dalassenos, etc. This means we have an error rate of at least 1.1% in the scraper dataset, besides having the low recall rate mentioned above.
  • Category 2: Wikipedia has an error. Those are rare, it happened twice. Adelaide of Metz had the wrong father and Sophie of Mecklenburg linked to the wrong mother in the infobox (although the text was linking to the right one). The first one has been fixed since Bill ran his scraper (unlucky timing!), and I fixed the second one. Note I am linking to the historic version of the article with the error.
  • Category 3: Wikidata was missing data. Jeanne de Fougères, Countess of La Marche and of Angoulême and Albert Azzo II, Margrave of Milan were missing one or both of their parents, and Bill’s scraping found them. So of the more than 3,500 scraped relationships, only 2 were missing! I added both.
  • In addition, correct data was marked deprecated once. I fixed that, too.
  • Category 4: Wikidata has duplicates, and that breaks the chain. That happened five times, I think the following pairs are duplicates: Q28739301/Q106688884, Q105274433/Q40115489, Q56285134/Q354855, Q61578108/Q546165 and Q15730031/Q59578032. Duplicates were mentioned explicitly in one of the comments as a problem, and here we can see that they happen with quite a bit of frequency, particularly for non-central items. I merged all of these.
  • Category 5: the situation is complicated, and different Wikipedia versions disagree, because the sources seem to disagree. Sometimes Wikidata models that disagreement quite well - but often not. After all, we are talking about people who sometimes lived more than a millennium ago. Here are these cases: Albert II, Margrave of Brandenburg to Ada of Holland; Prince Álmos to Sophia to Emmo of Loon (complicated by a duplicate as well); Oldřich, Duke of Bohemia to Adiva; William III to Raymond III, both Counts of Toulouse; Thored to Oslac of York; Bermudo II of León to Ordoño III of León (Galician says IV); and Robert Fitzhamon to Hamo Dapifer. In total, eight cases. I didn't edit those as these require quite a bit of thought.

Note that there was not a single case of “Wikidata got it wrong”, which surprised me a lot - I totally expected errors to happen. Unless you count the cases in Category 5. I mean, even English Wikipedia had errors! This was a pleasant surprise. Also, the genuine complicated cases are roughly as frequent as missing data, duplicates, and errors together. To be honest, that sounds like a pretty good result to me.

Also, the scraped data? Recall might be low, but the precision is pretty good: more than 98% of it is corroborated by Wikidata. Not all scraping jobs have such a high correctness.

In general, these results are comparable to a comparison of Wikidata with DBpedia and Freebase I did two years ago.

Oh, and what about Bill’s original question?

Turns out that Wikidata knows of a path between Alfred and Elizabeth II that is even shorter than the shortest 31 generations Bill found, as it takes only 30 generations.

This is Bill’s path:

  • Alfred the Great
  • Ælfthryth, Countess of Flanders
  • Arnulf I, Count of Flanders
  • Baldwin III, Count of Flanders
  • Arnulf II, Count of Flanders
  • Baldwin IV, Count of Flanders
  • Judith of Flanders
  • Henry IX, Duke of Bavaria
  • Henry X, Duke of Bavaria
  • Henry the Lion
  • Henry V, Count Palatine of the Rhine
  • Agnes of the Palatinate
  • Louis II, Duke of Bavaria
  • Louis IV, Holy Roman Emperor
  • Albert I, Duke of Bavaria
  • Joanna Sophia of Bavaria
  • Albert II o _Germany
  • Elizabeth of Austria
  • Barbara Jagiellon
  • Christine of Saxony
  • Christine of Hesse
  • Sophia of Holstein-Gottorp
  • Adolphus Frederick I, Duke of Mecklenburg-Schwerin
  • Adolphus Frederick II, Duke of Mecklenburg-Strelitz
  • Duke Charles Louis Frederick of Mecklenburg
  • Charlotte of Mecklenburg-Strelitz
  • Prince Adolphus, Duke of Cambridge
  • Princess Mary Adelaide of Cambridge
  • Mary of Teck
  • George VI
  • Elizabeth II

And this is the path that I found using the Wikidata data:

  • Alfred the Great
  • Edward the Elder (surprisingly, it deviates right at the beginning)
  • Eadgifu of Wessex
  • Louis IV of France
  • Matilda of France
  • Gerberga of Burgundy
  • Matilda of Swabia (this is a weak link in the chain, though, as there might possibly be two Matildas having been merged together. Ask your resident historian)
  • Adalbert II, Count of Ballenstedt
  • Otto, Count of Ballenstedt
  • Albert the Bear
  • Bernhard, Count of Anhalt
  • Albert I, Duke of Saxony
  • Albert II, Duke of Saxony
  • Rudolf I, Duke of Saxe-Wittenberg
  • Wenceslaus I, Duke of Saxe-Wittenberg
  • Rudolf III, Duke of Saxe-Wittenberg
  • Barbara of Saxe-Wittenberg (Barbara has no article in the English Wikipedia, but in German, Bulgarian, and Italian. Since the scraper only looks at English, they would have never found this path)
  • Dorothea of Brandenburg
  • Frederick I of Denmark
  • Adolf, Duke of Holstein-Gottorp (husband to Christine of Hesse in Bill’s path)
  • Sophia of Holstein-Gottorp (and here the two lineages merge again)
  • Adolphus Frederick I, Duke of Mecklenburg-Schwerin
  • Adolphus Frederick II, Duke of Mecklenburg-Strelitz
  • Duke Charles Louis Frederick of Mecklenburg
  • Charlotte of Mecklenburg-Strelitz
  • Prince Adolphus, Duke of Cambridge
  • Princess Mary Adelaide of Cambridge
  • Mary of Teck
  • George VI
  • Elizabeth II

I hope that this is an interesting result for Bill coming out of this exercise.

I am super thankful to Bill for doing this work and describing it. It led to very interesting discussions and triggered insights into some shortcomings of Wikidata. I hope the above write-up is also helpful, particularly in providing some data regarding the quality of Wikidata, and I hope that it will lead to work in making Wikidata more and easier accessible to explorers like Bill.

Update: there has been a discussion of this post on Hacker News.

Double copy in gravity

15 May 2021

When I was younger, I understood these theories much better. Today I read them like a fascinated, but a bit distant bystander.

But it is terribly interesting. What does turning physics into math mean? When we find a mathematical shortcut that works but we don't understand - is this real? What is the relation between mathematical formulas and reality? And will we finally understand gravity some day?

It was an interesting article, but I am not sure I understood it all. I guess, I'm getting old. Or just too specialized.

Zen and the Art of Motorcycle Maintenance

13 May 2021

During my PhD, on the topic of ontology evaluation - figuring out what a good ontology is and what is not - I was running circles up and down trying to define what "good" means for an ontology (Benjamin Good, another researcher on that topic, had it easier, as he could call his metric "Good metric" and be done with it).

So while I was struggling with the definition in one of my academic essays, a kind anonymous reviewer (I think it was Aldo Gangemi) suggested I should read "Zen and the Art of Motorcycle Maintenance".

When I read the title of the suggested book, I first thought the reviewer was being mean or silly and suggesting a made-up book because I was so incoherent. It took me two days to actually check whether that book existed, as I wouldn't believe it.

It existed. And it really helped me, by allowing me to set boundaries of how far I can go in my own work, and that it is OK to have limitations, and that trying to solve EVERYTHING leads to madness.

(Thanks to Brandon Harris for triggering this memory)