Semantic search

If life was one day

If the evolution of animals was one day... (600 million years)

  • From 1am to 4am, most of the modern types of animals have evolved (Cambrian explosion)
  • Animals get on land a bit at 3am. Early risers! It takes them until 7am to actually breathe air.
  • Around noon, first octopuses show up.
  • Dinosaurs arrive at 3pm, and stick around until quarter to ten.
  • Humans and chimpanzees split off about fifteen minutes ago, modern humans and Neanderthals lived in the last minute, and the pyramids were built around 23:59:59.2.
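
The clock times in this list come from a linear scaling: 600 million years are mapped onto 24 hours, so one hour stands for 25 million years. Here is a minimal sketch of that arithmetic. The function name and the sample ages in the usage note are illustrative assumptions derived from the clock times, not dates taken from a source.

```python
TOTAL_YEARS = 600_000_000        # span mapped onto one day, as stated above
SECONDS_PER_DAY = 24 * 60 * 60

def clock_time(years_ago: int) -> str:
    """Return the time of day (HH:MM:SS) for an event `years_ago` years back."""
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    elapsed = (TOTAL_YEARS - years_ago) * SECONDS_PER_DAY // TOTAL_YEARS
    hours, rest = divmod(elapsed, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

With this scaling, an event 225 million years ago lands at 15:00:00, and one 6.25 million years ago at 23:45:00, which is "about fifteen minutes ago".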

In that world, if that was a Sunday:

  • Saturday would have started with the introduction of sexual reproduction
  • Friday would have started by introducing the nucleus to the cell
  • Thursday recovering from Wednesday's catastrophe
  • Wednesday photosynthesis started, and led to a lot of oxygen which killed a lot of beings just before midnight
  • Tuesday bacteria show up
  • Monday first forms of life show up
  • Sunday morning, planet Earth forms, pretty much at the same time as the Sun.
  • Our galaxy, the Milky Way, is about a week older
  • The Universe is about another week older - about 22 days.

There are several things that surprised me here.

  • That dinosaurs were around for such an incredibly long time. Dinosaurs were around for seven hours, and humans for a minute.
  • That life started so quickly after Earth was formed, but then took so long to get to animals.
  • That the Earth and the Sun started basically at the same time.

Addendum April 27: Álvaro Ortiz, a graphic designer from Madrid, turned this text into an infographic.

Architecture for a multilingual Wikipedia

I published a paper today:

"Architecture for a multilingual Wikipedia"

I have been working on this for more than half a decade, and I am very happy to have it finally published. The paper is a working paper and comments are very welcome.

Abstract:

Wikipedia’s vision is a world in which everyone can share in the sum of all knowledge. In its first two decades, this vision has been very unevenly achieved. One of the largest hindrances is the sheer number of languages Wikipedia needs to cover in order to achieve that goal. We argue that we need a new approach to tackle this problem more effectively, a multilingual Wikipedia where content can be shared between language editions. This paper proposes an architecture for a system that fulfills this goal. It separates the goal in two parts: creating and maintaining content in an abstract notation within a project called Abstract Wikipedia, and creating an infrastructure called Wikilambda that can translate this notation to natural language. Both parts are fully owned and maintained by the community, as is the integration of the results in the existing Wikipedia editions. This architecture will make more encyclopedic content available to more people in their own language, and at the same time allow more people to contribute knowledge and reach more people with their contributions, no matter what their respective language backgrounds. Additionally, Wikilambda will unlock a new type of knowledge asset people can share in through the Wikimedia projects, functions, which will vastly expand what people can do with knowledge from Wikimedia, and provide a new venue to collaborate and to engage the creativity of contributors from all around the world. These two projects will considerably expand the capabilities of the Wikimedia platform to enable every single human being to freely share in the sum of all knowledge.

Stanford seminar on Knowledge Graphs

My friend Vinay Chaudhri is organising a seminar on Knowledge Graphs with Naren Chittar and Michael Genesereth this semester at Stanford.

I have the honour to present in it as the opening guest lecturer, introducing what Knowledge Graphs are and what they are good for.

Due to the current COVID situation, the seminar was turned virtual, and opened for everyone to attend.

Other speakers during the semester include Juan Sequeda, Marie-Laure Mugnier, Héctor Pérez Urbina, Michael Uschold, Jure Leskovec, Luna Dong, Mark Musen, and many others.

Change is in the air

I'll be prophetic: the current pandemic will shine a bright light on the different social and political systems in the different countries. I expect to see noticeable differences in how disruptive the handling of the situation by the government is, how many issues will be caused by panic, and what effect freely available health care has. The US has always been on the very end of admiring the self sustained individual, and China has been on the other end of admiring the community and its power, and Europe is somewhere in the middle (I am grossly oversimplifying).

This pandemic will blow over in a year or two, it will sweep right through the US election, and the news about it might shape what we deem viable and possible in ways beyond the immediately obvious. The possible scenarios range all the way from high tech surveillance states to a much wider access to social goods such as health and education, and whatever it is, the pandemic might be a catalyst towards that.

Wired: "Wikipedia is the last best place on the Internet"

WIRED published a beautiful ode to Wikipedia, painting the history of the movement with broad strokes, aiming to capture its impact and ambition with beautiful prose. It is a long piece, but I found the writing exciting.

Here's my favorite paragraph:

"Pedantry this powerful is itself a kind of engine, and it is fueled by an enthusiasm that verges on love. Many early critiques of computer-assisted reference works feared a vital human quality would be stripped out in favor of bland fact-speak. That 1974 article in The Atlantic presaged this concern well: “Accuracy, of course, can better be won by a committee armed with computers than by a single intelligence. But while accuracy binds the trust between reader and contributor, eccentricity and elegance and surprise are the singular qualities that make learning an inviting transaction. And they are not qualities we associate with committees.” Yet Wikipedia has eccentricity, elegance, and surprise in abundance, especially in those moments when enthusiasm becomes excess and detail is rendered so finely (and pointlessly) that it becomes beautiful."

They also interviewed me and others for the piece, but the focus of the article is really on what the Wikipedia communities have achieved in our first two decades.

Two corrections:

  • I cannot be blamed for Wikidata alone, I blame Markus Krötzsch as well.
  • The article says that half of the 40 million entries in Wikidata have been created by humans. I don't know if that is correct - what I said is that half of the edits are made by human contributors.

Normbrunnenflasche

It's a pity there's no English Wikipedia article about this marvellous thing that exemplifies Germany so beautifully and quintessentially: the Normbrunnenflasche.

I was wondering the other day why in Germany sparkling water is being sold in 0.7l bottles and not in 1l or 2l or whatever, like in the US (when it's sold here at all, but that's another story).

Germany had a lot of small local producers and companies. To counter the advantages of the Coca Cola Company pressing into the German market, in 1969 a conference of representatives of the local companies decided to introduce a bottle design they all would use. This decision followed a half-year competition and discussion on what this bottle should look like.

Every company would use the same bottle for sparkling water and other carbonated drinks, and so no matter which one you bought, the empty bottle would afterwards be routed to the closest participating company, not back home, therefore reducing transport costs and increasing competitiveness against Coca Cola.

The bottle is full of smart features. The 0.7l were chosen to ensure that the drink remained carbonated until the last sip, because larger bottles would last longer and thus gradually lose carbonation.

The form and the little pearls outside were chosen for improved grip, but also to symbolize the sparkles of the carbonation.

The metal screw cap was the real innovation there, useful for drinks that could increase pressure due to the carbonation.

And finally two slightly thicker bands along the lower half of the bottle that would, while being rerouted for another usage, slowly get more opaque due to mechanical pressure, thus indicating how well used the individual bottle was, so they could be taken out of service in time before breaking at the customer.

The bottles were reused an average of fifty times, their boxes an average of a hundred times. More than five billion of them have been brought into circulation in the fifty years since their adoption, for an estimated quarter of a trillion fillings.

A new decade?

The job of an ontologist is to define concepts. And since I see some posts commenting on whether a decade is closing and a new decade is starting tonight, here's my private, but entirely official position.

A decade is a consecutive timespan of ten years, and therefore at every given point a new decade starts and one ends. But that's a trivial answer to the question and not very useful.

There are two ways to count calendar decades, and both are arbitrary and rely on retconning, I mean, they rely on redefining the past. Therefore there is no right or wrong.

Method one is by using the proleptic Gregorian calendar, and starting with the year 1 and ending with the year 10, and calling that the first decade. If you keep counting, then the twohundredandthird decade will start on January 1st 2021, and we are currently firmly in the twohundredandsecond decade, and will stay there for another year.

Method two is based on the fact that for a millennium now and for many years to come there's a time period that conveniently lasts a decade where the years start with the same three digits. That is, the years starting with 202, which are called the 2020s, the ones with 199 which are called the 1990s (or sometimes just the 90s), etc. For centuries now we can find support for this kind of decade being widely used. According to this method, tonight marks a new decade.
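
The two counting methods can be written down as two small functions. This is only a sketch for positive year numbers, and the function names are made up for illustration.

```python
def ordinal_decade(year: int) -> int:
    """Method one: years 1-10 form the 1st decade, 11-20 the 2nd, and so on."""
    return (year - 1) // 10 + 1

def named_decade(year: int) -> str:
    """Method two: group years by their leading digits, e.g. 2020-2029 -> '2020s'."""
    return f"{year // 10 * 10}s"
```

Under method one, 2020 still belongs to decade 202 and 2021 opens decade 203. Under method two, 2019 is in the "2010s" and 2020 is already in the "2020s".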

So whether you are celebrating a new year tonight or not (because there are many other calendars out there too), or a new decade or not, I wish you wonderful 2020s!

SWAT4HCLS trip report

This week saw the 12th SWAT4HCLS event in Edinburgh, Scotland. It started with a day of tutorials and workshops on Monday, December 10th, on topics such as SPARQL, querying, ontology matching, and using Wikibase and Wikidata.

Conference presentations went on for two days, Tuesday and Wednesday. This included four keynotes, including mine on Wikidata, and how to move beyond Wikidata (presenting the ideas from my Abstract Wikipedia papers). The other three keynotes (as well as a number of the paper presentations) were all centered on the FAIR concept which I already saw being so prominent at the eScience conference earlier this year. FAIR as in Findable, Accessible, Interoperable, and Reusable publication of data. I am very happy to see these ideas spread out so prominently!

Birgitta König-Ries talked about how to use semantic technologies to manage FAIR data. Dov Greenbaum talked about how licenses interplay with data and what it means for FAIR data - my personal favorite of the keynotes, because of my morbid fascination regarding licenses and intellectual property rights pertaining to data and knowledge. He actually confirmed my understanding of the area - that you can’t really use copyright for data, and thus the application of CC-BY or similar licenses to data would stand on shaky grounds in a court. The last keynote was by Helen Parkinson, who gave a great talk on the issues that come up when building vocabularies, including issues around over-ontologizing (and the siren call of just keeping on modeling) and others. She put the issues in parallel to the travels of Odysseus, which was delightful.

The conference talks and posters were really spot on the topic of the conference: using semantic web technologies in the life sciences, health care, and related fields. It was a very satisfying experience to see so many applications of the technologies that Semantic Web researchers and developers have been creating over the years. My personal favorite was MetaStanza, web components that visualize SPARQL results in many different ways (a much needed update to SPARK, that Andreas Harth and I had developed almost a decade ago).

On Thursday, the conference closed with a Hackathon day, which I couldn’t attend unfortunately.

Thanks to the organizers for the event, and thanks again for the invitation to beautiful Edinburgh!

Other trip reports (send me more if you have them):

Frozen II in Korea

This is a fascinating story, that just keeps getting better (and Hollywood Reporter is only scratching the surface here, unfortunately): an NGO in South Korea is suing Disney for "monopolizing" the movie screens of the country, because Frozen II is shown on 88% of all screens.

Now, South Korea has a rich and diverse number of movie theatres - they have the large cineplexes in big cities, but in the less populated areas they have many small theatres, often with a small number of screens (I reckon it is similar to the villages in Croatia, where there was only a single screen in the theater, and most movies were shown only once, and there were only one or two screenings per day, and not on every day). The theatres are often independent, so there is no central planning about which movies are being shown (and today it rarely matters how many copies of a movie are being made, as many projectors are digital and thus unlimited copies can be created on the fly - instead of waiting for the one copy to travel from one town to the next, which was the case in my childhood).

So how would you ensure that these independent theatres don't show a movie too often? By having a centralized way that ensures that not too many screens show the same movie? (Preferably on the Blockchain, using an auction system?) Good luck with that, and with allowing the local theatres to adapt their screenings to their audiences.

But as said, it gets better: the 88% number is being arrived at by counting how many of the screens in the country showed Frozen II on a given day. It doesn't mean that that screen was used solely for Frozen II! If the screen was used at noon for a showing of Frozen II, and at 10pm for a Korean horror movie, that screen counts for both. Which makes the percentage a pretty useless number if you want to show monopolistic dominance (also, because the numbers add up to far more than 100%). Again, remember that in small towns there is often a small number of screens, and they have to show several different movies on the same screen. If the ideas of the lawsuit were enacted, you would need to keep Frozen II off a certain number of screens! Which basically makes it impossible to allow kids and teens in less populated areas to participate in event movie-going such as Frozen II and trying to avoid spoilers in Social Media afterwards.

Now, if you look at how many screenings, instead of screens, were occupied by Frozen II, the number drops down to 46% - which is still impressive, but far less dominant and monopolistic than the 88% cited above (and in fact below the 50% the Korean law requires to establish dominance).
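
To make the difference between the two ways of counting concrete, here is a small sketch with an entirely hypothetical one-day schedule for three screens. The data, the movie labels other than Frozen II, and the function names are invented for illustration and have nothing to do with the actual Korean figures.

```python
from collections import Counter

# Hypothetical one-day schedule: screen id -> movies shown on it, in order.
schedule = {
    "screen_1": ["Frozen II", "Frozen II", "Horror"],
    "screen_2": ["Frozen II", "Drama"],
    "screen_3": ["Horror", "Drama", "Drama"],
}

def screen_share(schedule):
    """Share of screens that showed each movie at least once that day."""
    counts = Counter(m for movies in schedule.values() for m in set(movies))
    return {m: c / len(schedule) for m, c in counts.items()}

def screening_share(schedule):
    """Share of individual screenings taken by each movie."""
    counts = Counter(m for movies in schedule.values() for m in movies)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}
```

In this toy schedule every movie reaches two of the three screens, so the per-screen shares add up to 200%, while the per-screening shares add up to 100% and give Frozen II three of eight screenings.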

And even more impressive: in the end it is up to the audience. And even though 'only' 46% of the screenings were on Frozen II, every single day since its release between 60% and 85% of all revenue was going to Frozen II. So one could argue that the theatres were actually underserving the audience (but then again, that's not how it really works, because screenings are usually in rooms with a hundred or more seats, and they can be very differently filled - and showing a blockbuster three times with almost full capacity, and showing a less popular movie once with only a dozen or so tickets sold might still have served the local community better than only running the blockbuster).

I bet the NGO's goal is just to raise awareness about the dominance of the American entertainment industry, and for that, hey, it's certainly worth a shot! But would they really want to go back to a system where small local cinemas would not be able to show blockbusters for a long time, involving a complicated centralized planning component?

(Also, I wish there was a way to sign up for updates on a story, like this lawsuit. Let me know if anyone knows of such a system!)


Machine Learning and Metrology

There are many, many papers in machine learning these days. This paper takes a step back and thinks about how researchers measure their results, and how good a specific type of benchmark - crowdsourced golden sets - can even be. It brings a convincing example based on word similarity, using terminology and concepts from metrology, to show how many results that have been reported are actually not supported by the golden set, because the resolution of the golden set is actually insufficient. So there might be no improvement at all, and that new architecture might just be noise.

I think this paper is really worth the time of people in the research field. Written by Chris Welty, Lora Aroyo, and Praveen Paritosh.

The story of the Swedish calendar

Most of us are roughly aware of how the calendar works. There are twelve months in a year, each month has 30 or 31 days, and then there’s February, which usually has 28 days and sometimes, in what is called a leap year, 29. In general, years divisible by four are leap years.

This calendar was introduced by none other than Julius Caesar, before he became busy conquering the known world and becoming the Emperor of Rome. Before that he used to have the job title “supreme bridge builder” - the bridge connecting the human world with the world of the gods. One of the responsibilities of this role was to decide how many days to add to the end of the calendar year, because the Romans noticed that their calendar was getting misaligned with the seasons, because it was simply a bit too short. So, for every year, the supreme bridge builder had to decide how many days to add to the calendar.

Since we are talking about the Roman Republic, this was unsurprisingly misused for political gain. If the supreme bridge builder liked the people in power, he might have granted a few extra weeks. If not, no extra days. Instead of ensuring that the calendar and the seasons aligned, the calendar got even more out of whack.

Julius Caesar spearheaded a reform of the calendar, and instead of letting the supreme bridge builder decide how many days to add, the reform devised rules founded in observation and mathematical rules - leading to the calendar we still have today: twelve months each year, each with 30 or 31 days, besides February, which had 28, but every four years would have 29. This is what we today call the Julian calendar. This calendar was not perfect, but pretty good.

Over the following centuries, the role of the supreme bridge builder - or, in Latin, Pontifex Maximus - transferred from the Emperor of Rome to the Bishop of Rome, the Pope. And with continuing observations over centuries it was noticed that the calendar was again getting out of sync with the seasons. So it was the Pope - Gregory XIII, later called The Great - who, in his role as Pontifex Maximus, decided that the calendar should be fixed once again. The committee he set up to work on that came up with fabulous improvements, which would guarantee to keep the calendar in sync for a much longer time frame. In addition to the rules established by the Julian calendar, every hundred years we would drop a leap year. But every four hundred years, we would skip dropping the leap year (as we did in 2000, which not many people noticed). And in 1582, this calendar - called the Gregorian calendar - was introduced.
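
The two leap year rules described so far fit into two one-line functions. The function names are chosen for illustration only.

```python
def is_leap_julian(year: int) -> bool:
    """Julian rule: every year divisible by four is a leap year."""
    return year % 4 == 0

def is_leap_gregorian(year: int) -> bool:
    """Gregorian rule: as Julian, but century years are leap years
    only when they are also divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

The two rules disagree only on century years such as 1700, 1800, and 1900, which are leap years in the Julian calendar but not in the Gregorian one. For 2000, both rules agree.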

Imagine leading a committee that comes up with rules on what the whole world would need to do once every four hundred years - and mostly having these rules implemented. How would you lead and design such a committee? I find this idea mind-blowing.

From the time of Caesar until 1582, about fifteen centuries had passed. And in this time, the calendar was getting slightly out of sync - by one day every century, skipping every fourth. In order to deal with that shift, they decided that ten calendar days needed to be skipped. Following the 4th of October 1582 was the 15th of October 1582. In 1582, there was no 5th or 14th of October, nor any of the days in between, in the countries that had adopted the Gregorian calendar.

This led to plenty of legal discussions, mostly about monthly rents and wages: is this still a full month, or should the rent or wage be paid prorated to the number of days? Should annual rents, interests, and taxes be prorated by these ten days, or not? What day of the week should the 15th of October be?
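
The size of the gap between the two calendars follows directly from the rule "one day every century, skipping every fourth". A minimal sketch, assuming years from 1582 onward and, for century years, dates after the end of February. The function name is made up for illustration.

```python
def calendar_gap_days(year: int) -> int:
    """Number of days the Julian calendar lags behind the Gregorian one.

    One day is added for each century year, except those divisible by 400;
    the constant 2 anchors the count so that 1582 yields the ten skipped days.
    """
    return year // 100 - year // 400 - 2
```

This gives 10 days for 1582, 11 days after February 1700, 13 days for 1917, and 14 days for 2101.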


The Gregorian calendar was a marked improvement over the Julian calendar with regards to keeping the seasons in sync with the calendar. So one might think its adoption should be a no-brainer. But there was a slight complication: politics.

Now imagine that today the Pope gets out on his balcony, and declares that, starting in five years, January to November all have 30 days, and December has 35 or 36 days. How would the world react? Would they ponder the merits of the proposal, would they laugh, would they simply adopt it? Would a country such as Italy have a different public discourse about this topic than a country such as China?

In 1582, the situation was similarly difficult. Instead of pondering the benefits of the proposal, the source of the proposal and the relation to that source became the main deciding factor. Instead of adopting the idea because it is a good idea, the idea was adopted - or not - because the Pope of the Catholic Church declared it. The Papal state, the Spanish and French Kingdoms, were first to adopt it.

Queen Elizabeth wanted to adopt it in England, but the Anglican bishops were fiercely opposed to it because it was suggested by the Pope. Other Protestant countries and the Orthodox countries simply ignored it for centuries. And thus there was a 5th of October 1582 in England, but not in France, and that led to a number of confusions over the following centuries.

Ever wondered why the October Revolution started November 7? There we go. There is even a story that Napoleon won an important battle (either the Battle of Austerlitz or the Battle of Ulm) because the Russian and Austrian forces coordinated badly as the Austrians were using the Gregorian and the Russians the Julian calendar. The story is false, but it makes for a great story.

Today, the International Day of the Book is on April 23 - the death date of both Miguel de Cervantes and William Shakespeare in 1616, the two giants of literature in their respective languages - with the amusing side-effect that they actually died about two weeks apart, even though they died on the same calendar day, but in different calendars.

It wasn’t until 1923 that for most purposes all countries had deprecated the Julian calendar, and for religious purposes some still follow it - which is why the Orthodox and the Amish celebrate Christmas on January 6. Starting 2101, that should shift by another day - and I would be very curious to see whether it will, or whether by then January 6th has solidified as the Christmas date.


Possibly the most confusing story about adopting the Gregorian calendar comes from Sweden. Like most protestant countries, Sweden did not initially adopt the Gregorian calendar, and was sticking with the Julian calendar, until in 1699 they decided to switch.

Now, the idea of skipping eleven or twelve days in one go did not sound appealing - remember all the chaos that occurred in the other countries for dropping these days. So in Sweden they decided that instead of dropping the days all at once, they would drop them one by one, by skipping the leap years from 1700 until 1740, when the two calendars would finally catch up.

In 1700, February 29 was skipped in Sweden. Which didn’t bring them any closer to Gregorian countries such as Spain, because they skipped the leap year in 1700 anyway. But it brought them out of alignment with Russia - by one day.

A war with Russia started (not about the calendar, but just a week before the calendars went out of sync, incidentally), and due to the war Sweden forgot to skip the leap days in 1704 and 1708 (they had other things on their mind). And as this was embarrassing, in 1711, King Charles XII of Sweden declared the plan abandoned, and added one extra day the following year to realign the calendar with Russia. And because 1712 was a leap year anyway, in Sweden there was not only a February 29, but also a February 30, 1712. The only legal February 30 in history so far.

It took not only the death of Charles XII, but also the deaths of his sister (who succeeded him) and, in 1751, of her husband (who succeeded her), before Sweden could move beyond that embarrassing episode. In 1752 Sweden switched from the Julian to the Gregorian calendar, by cutting February short and ending it after February 17, following that by March 1.


Somewhere on my To-Do list, I have the wish to write a book on Wikidata. How it came to be, how it works, what it means, the complications we encountered, and the ones we missed, etc. One section in this book is planned to be about calendar models. This is an early, self-contained draft of part of that section. Feedback and corrections are very welcome.


Erdös number, update

I just made an update to a post from 2006, because I learned that my Erdös number has gone down from 4 to 3. I guess that's pretty much it - it is not likely I'll ever become a 2.

The Fourth Scream

Janie loved her research. It was at the intersection of so many interesting areas - genetics, linguistics, neuroscience. And the best thing about it - she could work the whole day with these adorable vervet monkeys.

One more time, she showed the video of the flying eagle to Kassandra. The MRI helmet on Kassandra’s little head measured the neuron activation, highlighting the same region on her computer screen as the other times, the same region as with the other monkeys. Kassandra let out the scream that Janie was able to understand herself by now, the scream meaning “Eagle!”, and the other monkeys behind the bars in the far end of the room, in a cage as large as half the room, ran for cover in the bushes and small caves, if they were close enough. As they did every time.

That MRI helmet was a masterpiece. She could measure the activation of the neurons in unprecedented high resolution. And not only that, she could even send inferencing waves back, stimulating very fine grained regions in the monkey’s brain. The stimulation wasn’t very fast, but it was a modern miracle.

She slipped a raspberry to Kassandra, and Kassandra quickly snatched it and stuffed it in her mouth. The monkeys came from different populations from all over Southern and Eastern Africa, and yet they all understood the same three screams. Even when the baby monkeys were raised by mute parents, the baby monkeys understood the same three screams. One scream was to warn them of leopards, one scream was to warn them of snakes, and the third scream was to warn them of eagles. The screams were universally understood by everyone across the globe - by every vervet monkey, that is. A language encoded in the DNA of the species.

She called up the aggregated areas from the scream from her last few experiments. In the last five years, she was able to trace back the proteins that were responsible for the growth of these three areas, and thus the DNA encoding these calls. She could prove that these three different screams, the three different words of Vervetian, were all encoded in DNA. That was very different from human language, where every word is learned, arbitrary, and none of the words were encoded in our DNA. Some researchers believed that other parts of our language were encoded in our DNA: deep grammatical patterns, the ability to merge chunks into hierarchies of meaning when parsing sentences, or the categorical difference between hearing the syllable ba and the syllable ga. But she was the first one to provably connect three different concrete genes with three different words that an animal produces and understands.

She told the software to create an overlapping picture of the three different brain areas activated by the three screams. It was a three dimensional picture that she could turn, zoom, and slice freely, in real time. The strands of DNA were highlighted at the bottom of the screen, in the same colors as the three different areas in the brain. One gene, then a break, then the other two genes she had identified. Leopard, snake, eagle.

She started to turn the visualization of the brain areas, as Kassandra started squealing in pain. Her hand was stuck between the cage bars and the plate with raspberries. The little thief was trying to sneak out a raspberry or two! Janie laughed, and helped the monkey get the hand unstuck. Kassandra yanked it back into the cage, looked at Janie accusingly, knowing that the pain was Janie’s fault for not giving her enough raspberries. Janie snickered, took out another raspberry and gave it to the monkey. She snatched it out of Janie’s hand, without stopping the accusing stare, and Janie then put the plate to the other side of the table, at a safe distance and out of sight of Kassandra.

She looked back at the screen. When Kassandra cried out, her hand had twitched, and turned the visualization to a weird angle. She just wanted to turn it back to a more common view, when she suddenly stopped.

From this angle, she could see the three different areas, connecting together with the audiovisual cortex at a common point, like the leaves of a clover. But that was just it. It really looked like three leaves of a four-leaf clover. The area where the fourth leaf would be - it looked a lot like the areas where the other three leaves were.

She zoomed into the audiovisual cortex. She marked the neurons that triggered each of the three leaves. And then she looked at the fourth leaf. The connection to the cortex was similar. A bit different, but similar enough. She was able to identify what probably are the trigger-neurons, just like she was able to find them for the other three areas.

She targeted the MRI helmet on the neurons connected to the eagle trigger neurons, and with a click she sent a stimulus. Kassandra looked up, a bit confused. Janie looked at the neurons, how they triggered, unrolled the activation patterns, and saw how the signal was suppressed. She reprogrammed the MRI helmet, refined the neurons to be stimulated, and sent off another stimulus.

Kassandra yanked her head up, looking around, surprised. She looked at her screen, but it showed nothing as well. She walked nervously around inside the little cage, looking worriedly to the ceiling of the lab, confused. Janie again analyzed the activation patterns, and saw how it almost went through. There seemed to be a single last gatekeeper to pass. She reprogrammed the stimulator again. Third time's the charm, they say. She just remembered a former boyfriend, who was going on and on about this proverb. How no one knew how old it was, where it began, and how many different cultures all over the world associate trying something three times with eventual success, or an eventual curse. How some people believed you need to call the devil's name three times to —

Kassandra screamed out the same scream as before, the scream saying “Eagle!”. The MRI helmet had sent the stimulus, and it worked. The other monkeys jumped for cover. Kassandra raised her own arms above her head, peeking through her fingers to find the eagle she had just sensed.

Janie was more than excited! This alone will make a great paper. She could get the monkeys to scream out one of the three words of their language by a simple stimulation of particular neurons! Sure, she expected this to work - why wouldn’t it? But the actual scream, the confirmation, was exhilarating. As expected, the neurons now had a heightened potential, were easier to activate, waiting for more input. They slowly cooled down as Kassandra didn’t see any eagles.

She looked at the neurons connected to the fourth leaf. The gap. Was there a secret, fourth word hidden? One that all the zoologists studying vervet monkeys have missed so far? What would that word be? She reprogrammed the MRI helmet, aiming at the neurons that would trigger the fourth leaf. If her theory was right. With another click she sent a stimulus to the —

Janie was crouching in the corner of the room, breathing heavily, cold sweat was covering her arms, her face, her whole body. Her clothes were clammy. Her arms were slung above her head. She didn’t remember how she got here. The office chair she had been sitting in just a moment ago lay on the floor. The monkeys were quiet. Eerily quiet. She couldn’t see them from where she was, she couldn’t even see Kassandra from here, who was in the cage next to her computer. One of the halogen lamps in the ceiling was flickering. It wasn’t doing that before, was it?

She slowly stood up. Her body was shivering. She felt dizzy. She almost stumbled, just standing up. She slowly lowered her arms, but her arms were shaking. She looked for Kassandra. Kassandra was completely quiet, rolled up in the very corner of her cage, her arms slung around herself, her eyes staring catatonically forward, into nothing.

Janie took a step towards the middle of the room. She could see a bit more of the cage. The monkeys were partly huddled together, shaking in fear. One of them lay in the middle of the cage, his face in a grimace of terror. He was dead. She thought it was Rambo, but she wasn’t sure. She stumbled to the computer, pulled the chair from the floor, slumped into it.

The MRI helmet had recorded the activation pattern. She stepped through it. It did behave partially the same: the neurons triggered the unknown leaf, as expected, and that led to the activation of the muscles around the lungs, the throat, the tongue, the mouth - in short, that activated the scream. But, unlike with the eagle scream, the activation potential did not increase, it was now suppressed. As if it was trying to avoid a second triggering. She checked the pattern: yes, the neuron triggered that suppression itself. That was different. How did this secret scream sound?

Oh no! No, no, no, no, NOO!! She had not recorded the experiment. How stupid!

She was excited. She was scared, too, but she tried to push that away. She needed to record that scream. She needed to record the fourth word, the secret word of vervet monkeys. She switched on all three cameras in the lab, one pointed at the large cage with the monkeys, the other two pointing at Kassandra - and then she changed her mind, and turned one onto herself. What had happened to her? Why couldn’t she remember hearing the scream? Why had she been crouching on the floor like one of the monkeys?

She checked her computer. The MRI helmet was calibrated as before, pointing at the group of triggering neurons. The suppression was ebbing down, but not as fast as she wanted. She increased the stimulation power. She shouldn’t. She should follow protocol. But this all was crazy. This was a cover story for Nature. With her as first author. She checked the recording devices. All three were on. The streams were feeding back into her computer. She clicked to send the sti—

She felt the floor beneath her. It was dirty and cold. She was lying on the floor, face down. Her ears were ringing. She turned her head, opened her eyes. Her vision was blurred. Over the ringing in her ears she didn’t hear a single sound from the monkeys. She tried to move, and she felt her pants were wet. She tried to stand up, to push herself up.

She couldn’t.

She panicked. Shivered. And when she felt the tears running over her face, she clenched her teeth together. She tried to breathe, consciously, to collect herself, to gain control. Again she tried to stand up, and this time her arms and legs moved. Slower than she wanted. Weaker than she hoped. She was shaking. But she moved. She grabbed the chair. Pulled herself up a bit. The computer screen was as before, as if nothing had happened. She looked to Kassandra.

Kassandra was dead. Her eyes were bloodshot. Her face was a mask of pure terror, staring at nothing in the middle of the room. Janie tried to look at the cage with the other monkeys, but she couldn’t focus her gaze. She tried to yank herself into the chair.

The chair rolled away, and she crashed to the floor.

She had gone too far. She had made a mistake. She should have followed protocol. She was too ambitious, her curiosity and her impatience got the better of her. She had to focus. She had to fix things. But first she needed to call for help. She crawled to the chair. She pulled herself up, tried to sit in the chair, and she did it. She was sitting. Success.

Slowly, she rolled back to the computer. Her office didn’t have a phone. She double-clicked on the security app on her desktop. She had no idea how it worked, she never had to call security before. She hoped it would just work. A screen opened, asking her for some input. She couldn’t read it. She tried to focus. She didn’t know what to do. After a few moments the app changed, and it said in big letters: HELP IS ON THE WAY. STAY CALM. She closed her eyes. Breathed. Good.

After a few moments she felt better. She opened her eyes. HELP IS ON THE WAY. STAY CALM. She read it, once, twice. She nodded, her gaze jumping over the rest of the screen.

The recording was still on.

She moved the mouse cursor to the recording app. She wanted to see what had happened. There was nothing to do anyway, until security came. She clicked on the play button.

The recording filled three windows, one for each of the cameras. One pointed at the large cage with the vervet monkeys, two at Kassandra. Then, one of the cameras pointing at Kassandra was moved, pointing at Janie, just moments ago - it was moments, was it? - sitting at the desk. She saw herself getting ready to send the second stimulus to Kassandra, to make her call the secret scream a second time.

And then, from the recording, Kassandra called for a third time.

The end

History of knowledge graphs

An overview on the history of ideas leading to knowledge graphs, with plenty of references. Useful for anyone who wants to understand the background of the field, and probably the best current such overview.

On the competence of conspiracists

“Look, I’ll be honest, if living in the US for the last five years has taught me anything is that any government assemblage large enough to try to control a big chunk of the human population would in no way be consistently competent enough to actually cover it up. Like, we would have found out in three months and it wouldn’t even have been because of some investigative reporter, it would have been because one of the lizards forgot to put on their human suit one day and accidentally went out to shop for a pint of milk and like, got caught in a tik-tok video.” -- Os Keyes, WikidataCon, Keynote "Questioning Wikidata"

Power in California

It is wonderful to live in the Bay Area, where the future is being invented.

Sure, we might not have a reliable power supply, but hey, we have an app that connects people with dogs who don't want to pick up their poop with people who are desperate enough to do this shit.

Another example of how the capitalism that we currently live in has failed massively: last year, PG&E was found responsible for killing people and destroying a whole city. Now they really want to play it safe, and switch off the power for millions of people. And they say this will go on for a decade. So in 2029 when we're supposed to have AIs, self-driving cars, and self-tying Nikes, there will be cities in California that will get their power shut off for days when there is a hot wind for an afternoon.

Why? Because the money that should have gone into, that was already earmarked for, making the power infrastructure more resilient and safe went into bonus payments for executives (that sounds so cliché!). They tried to externalize the cost of an aging power infrastructure - the cost being literally the lives and homes of people. And when told not to, they put millions of people in the dark.

This is so awfully on the nose that there is no need for metaphors.

San Francisco offered to buy the local power grid, to put it into public hands. But PG&E refused that offer of several billion dollars.

So if you live in an area that has a well working power infrastructure, appreciate it.

Academic lineage

Sorry for showing off, but it is just too cool not to: here is a visualization of my academic lineage according to Wikidata.

Query: w.wiki/AE8

Bring me to your leader!

"Bring me to your leader!", the explorer demanded.

"What's a leader?", the natives asked.

"The guy who tells everyone what to do.", he explained with some consternation.

"Oh yeah, we have one like that, but why would you want to talk to him? He's unbearable."

AKTS 2019

September 24 was the AKTS workshop - Advanced Knowledge Technologies for Science in a FAIR world - co-located with the eScience and Gateways conferences in San Diego. As usual with my trip reports, I won't write about every single talk, but offer only my own personal selection and view. This is not an official report on the workshop.

I had the honor of kicking off the day. I made the proposal of using Wikidata for describing datasets so that dataset catalogs can add these descriptions to their indexes. The standard way to do so is to use Schema.org annotations describing the datasets, but our idea here was to provide a fallback solution in case Schema.org cannot be applied for one reason or the other. Since the following talks would also be talking about Wikidata I used the talk to introduce Wikidata in a bit more depth. In parallel, I kicked the same conversation off on Wikidata as well. The idea was well received, but one good question was raised by Andrew Su: why not add Schema.org annotations to Wikidata instead?
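For readers who have not seen such annotations: a minimal sketch, in Python, of what a Schema.org description of a dataset could look like. The dataset name, URL, and all other values are made up for illustration; only the `Dataset` type and the property names come from the Schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, license_url, keywords):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": list(keywords),
    }


# Hypothetical example values, not a real dataset.
example = dataset_jsonld(
    name="Example rainfall measurements",
    description="Daily rainfall measurements from a made-up weather station.",
    url="https://example.org/datasets/rainfall",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    keywords=["rainfall", "weather"],
)

# The resulting JSON is typically embedded in the dataset's landing page
# inside a <script type="application/ld+json"> element.
print(json.dumps(example, indent=2))
```

A dataset catalog that crawls the landing page can pick up this block without knowing anything else about the site, which is what makes the annotation approach attractive where it can be applied.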

After that, Daniel Garijo of USC's ISI presented WDPlus, Wikidata Plus, a prototype for how to extend Wikidata with more data (particularly tabular data) from external data sources, such as censuses and statistical publications. The idea is to surround Wikidata with a layer of so-called satellites, which materialize statistical and other external data into Wikidata's schema. They implemented a mapping language, T2WDML, that allows grabbing CSV numbers and turning them into triples that are compatible with Wikidata's schema, and thus can be queried together. There seems to be huge potential in this idea, particularly if one can connect the idea of federated SPARQL querying with on-the-fly mappings, extending Wikidata to a virtual knowledge base that would be easily several times its current size.

Andrew Su from Scripps Research talked about using Wikidata as a knowledge graph in a FAIR world. He presented their brilliant Gene Wiki project, about adding knowledge about genes and proteins to Wikidata. He presented the idea of using Wikidata as a generalized back-end for customized frontend-applications - which is perfect. Wikidata's frontend is solid and functional, but in many domains there is a large potential to improve the UX for users in specific domains (and we are seeing some of it flowering more around Lexemes, with Lucas Werkmeister's work on lexical forms). Su and his lab developed ChlamBase which allows the Chlamydia research community to look at the data they are interested in, and to easily add missing data. Another huge advantage of using Wikidata? Your data is going to live beyond the life of the grant. A great overview of the relevant data in Wikidata can be seen in this rich and huge and complex diagram.

The talks switched more to FAIR principles, first by Jeffrey Grethe of UCSD and then Mark Musen of Stanford. Mark was pointing out how quickly FAIR turned from a new idea to a pervasive meme, with the funding agencies now starting to require it. But data often has issues. One example: BioSample is the best metadata NIH has to offer. But 73% of the Boolean metadata values are not 'true' or 'false' but have values like "nonsmoker" or "recently quitted". 26% of the integers were not parseable. 68% of the entries that were supposed to come from a controlled vocabulary did not. A UX that helps with entering this data, such as CEDAR, would improve the quality considerably.
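To illustrate the kind of check behind such numbers, here is a small Python sketch that measures how many values in a metadata field fail a type check. It is not the analysis Mark presented; the function names and the sample values are assumptions made up for this example.

```python
def is_boolean(value):
    """Accept only the literal strings 'true' and 'false' (case-insensitive)."""
    return value.strip().lower() in {"true", "false"}


def is_integer(value):
    """Accept anything Python's int() can parse, e.g. '42' or ' -7 '."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_share(values, check):
    """Return the fraction of values that fail the given check."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if not check(v))
    return bad / len(values)


# Made-up sample values, not taken from BioSample.
smoker_field = ["true", "false", "nonsmoker", "recently quitted"]
age_field = ["34", "fifty", "61", "n/a"]

print(invalid_share(smoker_field, is_boolean))  # 0.5
print(invalid_share(age_field, is_integer))     # 0.5
```

A data entry form that runs checks like these while people type can reject "nonsmoker" in a Boolean field before it ever reaches the repository.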

Carole Goble then talked about moving towards using Schema.org for FAIRer Life Sciences resources and defining a Schema.org profile that makes datasets easier to use. The challenges in the field have been mostly social - there was a lot of confidence that we know how to solve the technical issues, but the social ones proved to be challenging. Carole named four of those explicitly:

  1. ontology-itis
  2. building consensus (it's harder than you think)
  3. the Schema.org Catch-22 (Schema.org won't take it if there is no usage, but people won't use it until it is in Schema.org)
  4. dedicated resources (people think you can do the social stuff in your spare time, but you can't)

Natasha Noy gave the keynote, talking about Google Dataset Search. The lessons learned from building it:

  1. Build an ecosystem first, be technically light-weight (a great lesson which was also true for Wikipedia and Wikidata)
  2. Use open, non-proprietary, standard solutions, don't ask people to build it just for Google (so in this case, use Schema.org for describing datasets)
  3. bootstrapping requires influencers (i.e. important players in the field, that need explicit outreach) and incentives (to increase numbers)
  4. semantics and the KG are critical ingredients (for quality assurance, to get the data in quickly, etc.)

At the same time, Natasha also reiterated one of Mark's points: no matter how simple the system is, people will get it wrong. The number of ways a date field can be written wrong is astounding. And often it is easier to make the ingester more accepting than try to get people to correct their metadata.
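What a more accepting ingester could look like is sketched below in Python: try a list of date formats in turn instead of rejecting everything that is not ISO 8601. This is only an illustration, not how Google Dataset Search works; the list of formats and the function name are assumptions, and the month-name formats rely on the default English (C) locale.

```python
from datetime import date, datetime

# Formats to try, in order. This list is an assumption for illustration.
# Ambiguous inputs such as 03/04/2019 are read day-first here.
ACCEPTED_FORMATS = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%d.%m.%Y",
    "%d/%m/%Y",
    "%B %d, %Y",
    "%d %B %Y",
]


def parse_date_leniently(text):
    """Return a date for the first matching format, or None if none match."""
    cleaned = text.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


for raw in ["2019-09-24", "24.09.2019", "September 24, 2019", "last Tuesday"]:
    print(raw, "->", parse_date_leniently(raw))
```

The trade-off is visible in the comment on ambiguous inputs: every format the ingester accepts is also a guess it may get wrong, so the list has to be ordered with care.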

Chris Gorgolewski followed with a session on increasing findability for datasets, basically a session on SEO for dataset search: add generic descriptions, because people who need to find your dataset probably don't know your dataset and the exact terms (or they would already use it). Ensure people coming to your landing site have a pleasant experience. And the description is markup, so you can even use images.

I particularly enjoyed a trio of paper presentations by Daniel Garijo, Maria Stoica, Basel Shbita and Binh Vu. Daniel spoke about OntoSoft, an ontology to describe software workflows in sufficient detail to allow executing them, and also to create input and output definitions, describe the execution environment, etc. Close to those input and output definitions we find Maria's work on an ontology of variables. Maria presented a lot of work to identify the meaning of variables, based on linguistic, semantic, and ontological reasoning. Basel and Binh talked about understanding data catalogs more deeply, being able to go deeper into the tables and understand the actual content in them. If one connected the results of these three papers, one could potentially see how data from published tables and datasets could come alive and answer questions almost out of the box: extracting knowledge from tables, understanding their roles with regards to the input variables, and how to execute the scientific workflows.

Sure, science fiction, and the question is how well would each of the methods work, and how well would they work in concert, but hey, it's a workshop. It's meant for crazy ideas.

Ibrahim Burak Ozyurt presented an approach towards question answering in the bio-domain using Deep Learning, including GloVe and BERT and all the other state of the art work. And it's all on GitHub! Go try it out.

The day closed with a panel with Mark Musen, Natasha Noy, and me, moderated by Yolanda Gil, discussing what we learned today. It quickly centered on the question of how to ensure that people publishing datasets get appropriate credit. For most researchers, and particularly for universities, paper publications and impact factors are the main metric to evaluate researchers. So how do we ensure that people creating datasets (and I might add, tools, workflows, and social consensus) receive their fair share of credit?

Thanks to Yolanda Gil and Andrew Su for organizing the workshop! It was an exhausting, but lovely experience, and it is great to see the interest in this field.

Illuminati and Wikibase

When I was a teenager I was far too much fascinated by the Illuminati. Much less about the actual historical order, and more about the memetic complex, the trilogy by Shea and Wilson, the card game by Steve Jackson, the secret society and esoteric knowledge, the Templar Story, Holy Blood of Jesus, the rule of 5, the secret of 23, all the literature and offsprings, etc etc...

Eventually I went to actual order meetings of the Rosicrucians, and learned about some of their "secret" teachings, and also read Eco's Foucault's Pendulum. That, and access to the Web and eventually Wikipedia, helped to "cure" me from this stuff: Wikipedia allowed me to put a lot of the bits and pieces into context, and the (fascinating) stories that people like Shea & Wilson or von Däniken or Baigent, Leigh & Lincoln tell, start falling apart. Eco's novel, by deconstructing the idea, helps to overcome it.

He probably doesn't remember it anymore, but it was Thomas Römer who, many years ago, told me that the trick of these authors is to tell ten implausible, but verifiable facts, and tie them together with one highly plausible, but made-up fact. The appeal of their stories is that all of it seems to check out (because back then it was hard to fact check stuff, so you would use your time to check the most implausible stuff).

I still understand the allure of these stories, and love to indulge in them from time to time. But it was the Web, and it was learning about knowledge representation, that clarified the view on the underlying facts, and when I tried to apply the methods I was learning to it, it fell apart quickly.

So it is rather fascinating to see that one of the largest and earliest applications of Wikibase, the software we developed for Wikidata, turned out to be actual bona fide historians (not the conspiracy theorists) using it to work on the Illuminati, to catalog the letters they sent to each other, to visualize the flow of information through the order, etc. Thanks to Olaf Simons for heading this project, and for this write up of their current state.

It's amusing to see things go round and round and realize that, indeed, everything is connected.

Wikidatan in residence at Google

Over the last few years, more and more research teams all around the world have started to use Wikidata. Wikidata is becoming a fundamental resource. That is also true for research at Google. One advantage of using Wikidata as a research resource is that it is available to everyone. Results can be reproduced and validated externally. Yay!

I had used my 20% time to support such teams. The requests became more frequent, and now I am moving to a new role in Google Research, akin to a Wikimedian in Residence: my role is to promote understanding of the Wikimedia projects within Google, work with Googlers to share more resources with the Wikimedia communities, and to facilitate the improvement of Wikimedia content by the Wikimedia communities, all with a strong focus on Wikidata.

One deeply satisfying thing for me is that the goals of my new role and the goals of the communities are so well aligned: it is really about improving the coverage and quality of the content, and about pushing the projects closer towards letting everyone share in the sum of all knowledge.

Expect to see more from me again - there are already a number of fun ideas in the pipeline, and I am looking forward to seeing them get out of the gates! I am looking forward to hearing your ideas and suggestions, and to continuing to contribute to the Wikimedia goals.

Deep kick


Mark Stoneward accepted the invitation immediately. Then it took two weeks for his lawyers at the Football Association to check the contracts and non-disclosure agreements prepared by the AI research company. Stoneward arrived at the glass and steel building in downtown London. He signed in at a fully automated kiosk, and was then accompanied by a friendly security guard to the office of the CEO.

Denise Mirza and Stoneward had met at social events, but never had time to talk at length. “Congratulations on the results of the World Cup!” Stoneward nodded, “Thank you.”

“You have performed better than most of our models have predicted. This was particularly due to your willingness to make strategic choices, where other associations would simply have told their players to do their best. I am very impressed.” She looked at Stoneward, trying to read his face.

Stoneward’s face didn’t move. He didn’t want to give away how much was planned, how much was luck. He knew these things travel fast, and every little bit he could keep secret gave his team an edge. Mirza smiled. She recognised that poker face. “We know how to develop a computer system that could help you with even better strategic decisions.”

Stoneward tried to keep his face unmoved, but his body turned to Mirza and his arms opened a bit wider. Mirza knew that he was interested.

“If our models are correct, we can develop an Artificial Intelligence that could help you discuss your plans, help you with making the right strategic decisions, and play through different scenarios. Such AIs are already used in board rooms, in medicine, to create new recipes for top restaurants, or training chess players.”

“What about the other teams?”

“Well, we were hoping to keep this exclusive for two or four years, to test and refine the methodology. We are not in a hurry. Our models give us an overwhelming probability to win both the European Championship and the World Cup in case you follow our advice.”

“Overwhelming probability?”

“About 96%.”

“For the European Championship?”

“No. To win both.”

Stoneward gasped. “That is… hard to believe.”

The CEO laughed. “It is good that you are sceptical. I also doubted these probabilities, but I had two teams double-check.”

“What is that advice?”

She shrugged. “I don’t know yet. We need to develop the AI first. But I wanted to be sure you are actually interested before we invest in it.”

“You already know how effective the system will be without even having developed it yet?”

She smiled. “Our own decision process is being guided by a similar AI. There are so many things we could be doing. So many possible things to work on and revolutionise. We have to decide how to spend our resources and our time wisely.”

“And you’d rather spend your time on football than on… I don’t know, healing cancer or making a product that makes tons of money?”

“Healing cancer is difficult and will take a long time. Regarding money… the biggest impediment to speeding up the impact of our work is currently not a lack of resources, but a lack of public and political goodwill. People are worried about what our technology can do, and parliament and the European Union are eager to throw more and more regulations at us. What we need is something that will make every voter in England fall in love with us. That will open up the room for us to move more freely.”

Stoneward smiled. “Winning the World Cup.”

She smiled. “Winning the World Cup.”


Three months later…

“So, how will this work? Do I, uhm, type something in a computer, or do we have to run some program and I enter possible players we are considering to select?”

Mirza laughed. “No, nothing that primitive. The AI already knows all of your players. In fact, it knows all professional players in the world. It has watched and analyzed every second of TV screening of any game around the world, every relevant online video, and everything written in local newspapers.”

Stoneward nodded. That sounded promising.

“Here comes a little complication, though. We have a protocol for using our AIs. The protocols are overcautious. Our AIs are still far away from human intelligence, but our Ethics and Safety boards insisted on implementing these protocols whenever we use some of the near-human intelligence systems. It is completely overblown, but we are basically preparing ourselves for the time we have actually intelligent systems, maybe even superhuman intelligent systems.”

“I am afraid I don’t understand.”

“Basically, instead of talking to the AI directly, we talk with them through an operator, or medium.”

“Talk to them? You simply talk with the AI? Like with Siri?”

Mirza scoffed. “Siri is just a set of hard coded scripts and triggers.”

Stoneward didn’t seem impressed by the rant.

“The medium talks with the AI, tries its best to understand it, and then relays the AI’s advice to us. The protocol is strict about not letting the AI interact with decision makers directly.”

“Why?”

“Ah, as said, it is just being overly cautious. The protocol is in place in case we ever develop a superhuman intelligence, in which case we want to ensure that the AI doesn’t have too much influence on actual decision makers. The fear is that a superhuman AI could possibly unduly influence the decision maker. But with the medium in between, we have a filter, a normal intelligence, so it won’t be able to invert the relationship between adviser and decision maker.”

Stoneward blinked. “Pardon me, but I didn’t entirely follow what you — ”

“It’s just a Science Fiction scenario, but in case the AI tries to gain control, the fear is that a superhuman intelligence could basically turn you into a mindless muppet. By putting a medium in between, well, even if the medium becomes enslaved, the medium can only use their own intelligence against you. And that will fail.”

The director took a sip of water, and pondered what he had just heard for a few moments. Denise Mirza was burning with frustration. Sometimes she forgot what it was like to deal with people this slow. And this guy had had more balls banged against his skull than was healthy, which wasn’t expected to speed his brain up. After what felt like half an eternity, he nodded.

“Are you ready for me to call the medium in?”

“Yes.”

She tapped her phone.

“Wait, does this mean that these mediums are slaves to your AI?”

She rolled her eyes. “Let us not discuss this in front of the medium, but I can assure you that our systems have not yet reached the level to convince a four year old to give up a lollipop, never mind a grown up person to do anything. We can discuss this more afterwards. Oh, there he is!”

Stoneward looked up surprised.

It was an old acquaintance, Nigel Ramsay. Ramsay used to manage some smaller teams in Lancashire, where Stoneward grew up. Ramsay was more known for his passion than for his talents.

“I am surprised to see you here.”

The medium smiled. “It was a great offer, and when I learned what we are aiming for, I was positively thrilled. If this works we are going to make history!”

They sat down. “So, what does the system recommend?”

“Well, it recommends increasing the pressure on the government for a second referendum on Brexit.”

Stoneward stared at Ramsay, stunned. “Pardon me?”

“It is quite clear that the Prime Minister is intentionally sabotaging any reasonable solution for Brexit, but is still too afraid to call a second referendum. She has been a double agent for the remainers the whole time. Once it is clear how much of a disaster leaving the European Union would be, we should call for a second referendum, reversing the result of the first.”

“I… I am not sure I follow… I thought we are talking football?”

“Oh, but yes! We most certainly are. Being part of an invigorated European Union after Brexit gets cancelled, we should strongly support a stronger Union, even the founding of a proper state.”

Stoneward looked at Ramsay with exasperation. Mirza motioned with her hands, asking for patience.

“Then, when the national football associations merge, this will pave the way for a single, unified European team.”

“The associations… merge?”

“Yes, an EU-wide all stars team. Just imagine that. Also, most of the serious competition would already be wiped out. No German team, no French team, just one European team and — ”

“This is ridiculous! Reversing Brexit? Just to get a single European team? Even if we did, a unified European team might kill any interest in international football.”

“Yeah, that is likely true, but our winning chances would go through the roof!”

“But even then, 96% winning chances?”

“Oh, yeah, I asked the same. So, that’s not all. We also need to cause a war between Argentina and Brazil, in order to get them disqualified. There are a number of ways to get to this — ”

“Stop! Stop right there.” Stoneward looked shocked, his hands raised like a goalie waiting for the penalty kick. “Look, this is ridiculous. We will not stop Brexit or cause a war between two countries just to win a game.”

The medium looked at Stoneward in surprise. “To ‘just’ win a game?” His eyes wandered to Mirza for support. “I thought this was the sole reason for our existence. What does he mean, ‘just’ win a game? He is a bloody director of the FA, and he doesn’t care to win?”

“Maybe we should listen to some of the other suggestions?”, the CEO asked, trying to soothe the tension in the room.

Stoneward was visibly agitated, but after a few moments, he nodded. “Please continue.”

“So even if we don’t merge the European associations due to Brexit, we should at least merge the English, Scottish, Welsh, and Northern Irish associations in — ”

“No, no, NO! Enough of this association merging nonsense. What else do you have?”

“Well, without mergers, and wars, we’re down to 44% probability to win both the European and World Cup within the next twenty years.” The medium sounded defeated.

“That’s OK, I’ll take that. Tell me more.” Stoneward had known that the probabilities given before were too good to be true. It was still a disappointment.

“England has some of the best schools in the world. We should use this asset to lure young talent to England, offer them scholarships in Oxford, in Cambridge.”

“But they wouldn’t be English? They can’t play for England.”

“We would need to make the path to citizenship easier for them, immigration laws should be more integrative for top talent. We need to give them the opportunity to become subjects of the Queen before they play their first international. And then offer them to play for England. There is so much talent out there, and if we can get them while they’re young, we could prep up our squad in just a few years.”

“Scholarships for Oxford? How much would that even cost?”

“20, 25 thousand per year and student? We can pay a hundred scholarships and it wouldn’t even show up in our budget.”

“We are cutting budgets left and right!”

“Since we’re not stopping Brexit, why not dip into those 350 million pounds per week that we will save.”

“That was a lie!”

“I was joking.”

“Well, the scholarship thing wasn’t bad. What else is on the table?”

“One idea was to hack the video stream and bribe the referee, and then we can safely gaslight everyone.”

“Next idea.”

“We could poison the other teams.”

“Just stop it.”

“Or give them substances that would mess up their drug tests.”

“Why not get FIFA to change the rules so we always win?”

“Oh, we considered it, but given the existing corruption inside FIFA it seems that would be difficult to outbid.”

Stoneward sighed. “Now I was joking.”

“One suggestion is to create a permanent national team, and have them play in the national league. So they would be constantly competing, playing with each other, be better used to each other. A proper team.”

“How would we even pay for the players?”

“It would be an honor to play for the national team. Also, it could be a new rule to require the best players to play in the national team.”

“I think we are done here. These suggestions were… rather interesting. But I think they were mostly unactionable.” He started standing up.

Mirza looked desperately from one to the other. This meeting did not go as she had intended. “I think we can acknowledge the breadth of the creative proposals that have been on the table today, and enjoy a tea before you leave?”, she said, forcing a smile.

Stoneward nodded politely. “We sure can appreciate the creativity.”

“Now imagine this creativity turned into strategies on the pitch. Tactical moves. Variations to set pieces,” the medium started, his voice slightly shifting.

“Yes, well, that would certainly be more interesting than most of the suggestions so far.”

“Wouldn’t it? And not only that, but if we could talk to the players. If we could expand their own creativity. Their own willpower. Their focus. Their energy to power through, not to give up.”

“If you’re suggesting giving them drugs, I am out.”

Ramsay laughed. “No, not drugs. But a helmet that emits electromagnetic waves and allows the brain muscles to work in more interesting ways.”

Stoneward looked over to the CEO. “Is that a possibility?”

Mirza looked uncomfortable, but tried to hide it. “Yes, yes, it is. We have tested it a few times, and the results were quite astonishing. It is just not what I would have expected as a proposal.”

“Why? Anything wrong with that?”

“Well, we use it for our top engineers, to help them focus when developing and designing solutions. The results are nothing short of marvelous. It is just, I didn’t think football would benefit that much from improved focus.”

Stoneward chuckled, as he sat down again. “Yes, many people underestimate the role of a creative mind in the game. I think I would now like a tea.” He looked to Ramsay. “Tell me more.”

The medium smiled. The system will be satisfied with the outcome.

(Originally published July 28, 2018 on Medium)

Saturn the alligator

Today at work I learned about Saturn the alligator. Born to humble origins in 1936 in Mississippi, he moved to Berlin where he became acquainted with Hitler. After the bombing of the Berlin Zoo he wandered through the streets. British troops found him, gave him to the Soviets, where against all odds he survived a number of near death situations - among others he refused to eat for a year - and still lives today, in an enclosure sponsored by Lacoste.

I also went to Wikidata to improve the entry on Saturn. For that I needed to find the right property to express the connection between Saturn, and the Moscow Zoo, where he is held.

The following SPARQL query was helpful: https://w.wiki/7ga

It tells you which properties connect animals with zoos, and how often. In the Query Helper UI it should be easy to change either type to find good candidates for the property you are looking for.
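For readers who cannot open the short link, here is a minimal sketch of what a query of this kind could look like. It is not the query behind the link, which is not reproduced here. The function name and the two class identifiers are placeholders I made up for illustration. Only `P31` (instance of), `P279` (subclass of), and the `wikibase:directClaim` and label-service patterns are standard Wikidata Query Service vocabulary.

```python
def build_property_count_query(subject_class: str, object_class: str, limit: int = 20) -> str:
    """Build a SPARQL query that counts which direct properties link
    instances of one class to instances of another class.

    subject_class and object_class are Wikidata item IDs chosen by the
    caller (hypothetical parameters; look up the right IDs on Wikidata).
    """
    return f"""
SELECT ?property ?propertyLabel (COUNT(*) AS ?uses) WHERE {{
  ?subject wdt:P31/wdt:P279* wd:{subject_class} .
  ?object  wdt:P31/wdt:P279* wd:{object_class} .
  ?subject ?direct ?object .
  ?property wikibase:directClaim ?direct .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?property ?propertyLabel
ORDER BY DESC(?uses)
LIMIT {limit}
""".strip()


# Placeholder IDs: replace them with the item IDs of the two classes
# you want to connect before pasting the result into the query service.
query_text = build_property_count_query("Q_SUBJECT_CLASS", "Q_OBJECT_CLASS", limit=10)
print(query_text)
```

Swapping either class ID plays the same role as changing a type in the Query Helper UI. On very large classes a query like this may time out, so it is best tried on narrow classes first.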

Wikidata reached a billion edits

As of today, Wikidata has reached a billion edits - 1,000,000,000.

This makes it the first Wikimedia project that has reached that number, and possibly the first wiki ever to have reached so many edits. Given that Wikidata was launched less than seven years ago, this means an average edit rate of 4-5 edits per second.
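The edit rate is easy to check with a few lines of arithmetic. This sketch uses seven full years as an upper bound for Wikidata's age (the text only says "less than seven years"), so the real average is a little higher than the number computed here. The names are my own.

```python
# Average length of a year in seconds, counting leap years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def average_edits_per_second(total_edits: int, years: float) -> float:
    """Average edit rate over the given time span."""
    return total_edits / (years * SECONDS_PER_YEAR)


# Seven years is an upper bound on the age, so this is a lower bound on the rate.
rate = average_edits_per_second(1_000_000_000, 7)
print(f"{rate:.2f} edits per second")
```

With seven years the result is a bit over 4.5 edits per second, which agrees with the 4-5 range given above.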

The billionth edit is the creation of an item for a 2006 physics article written in Chinese.

Congratulations to the community! This is a tremendous success.

In the beginning

"Let there be a planet with a hothouse effect, so that they can see what happens, as a warning."

"That is rather subtle, God", said the Archangel.

"Well, let it be the planet closest to them. That should do it. They're intelligent after all."

"If you say so."

Lion King 2019

Wow. The new version of the Lion King is technically brilliant, and story-wise mostly unnecessary (but see below for an exception). It is a mostly beat-for-beat retelling of the 1994 animated version. The graphics are breathtaking, and they show how far computer-generated imagery has come. For a measly million dollars per minute of film you can get a photorealistic animal movie. Because of the photorealism, it also loses some of the charm and the emotions that the animated version carried - in the original the animals were much more anthropomorphic, and the dancing was much more exaggerated, which the new version gave up. This is most noticeable in the song scene for "I can't wait to be king", which used to be a psychedelic, color-shifted sequence with elephants and tapirs and giraffes stacked upon each other, and is now replaced by a much more realistic sequence full of animals and fast cuts that simply looks amazing (I never was a big fan of the psychedelic music scenes that were so frequent in many animated movies, so I consider this a clear win).

I want to focus on the main change, and it is about Scar. I know the 1994 movie by heart, and Scar is its iconic villain, one of the villains that formed my understanding of a great villain. So why would the largest change be about Scar, changing him profoundly for this movie? How risky a choice in a movie that partly recreates whole sequences shot by shot?

There was one major criticism about Scar, and that is that he played with stereotypical tropes of gay grumpy men, frustrated, denied, uninterested in what the world is offering him, unable to take what he wants, effeminate, full of cliches.

That Scar is gone, replaced by a much more physically threatening Scar, one whose philosophy in life is that the strongest should take what they want. Chiwetel Ejiofor's voice for Scar is scary, threatening, strong, dominant, menacing. I am sure that some people won't like him, as the original Scar was also a brilliant villain, but this leads immediately to my big criticism of the original movie: if Scar was only half as effing intelligent as shown, why did he do such a miserable job in leading the Pride Lands? If he was so much smarter than Mufasa, why did the thriving Pride Lands turn into a wasteland, threatening the subsistence of Scar and his allies?

The answer in the original movie is clear: it's the absolutist identification of country and ruler. Mufasa was good, therefore the Pride Lands were doing well. When Scar takes over, they become a wasteland. When Simba takes over, in the next few shots, they start blooming again. Good people, good intentions, good outcomes. As simple as that.

The new movie changes that profoundly - and in a very smart way. The storytellers at Disney really know what they're doing! Instead of following the simple equation given above, they make it an explicit philosophical choice in leadership. This time around, the whole Circle of Life thing is not just an Act One lesson, but is the major difference between Mufasa and Scar. Mufasa describes a great king as searching for what they can give. Scar is about might is right, and about the strongest taking whatever they want. This is why he overhunts and allows overhunting. This is why the Pride Lands become a wasteland. Now the decline of the Pride Lands makes sense, and so does the idea that the return of Simba and his different style as a king would make a difference. The Circle of Life now became important for the whole movie, at the same time tying in with the reinterpretation of Scar, and also explaining the difference in outcome.

You can probably tell, but I am quite amazed at this feat in storytelling. They took a beloved story and managed to improve it.

Unfortunately, the new Scar also means that the song Be Prepared doesn't really work as it used to, and thus the song also got shortened and very much changed in a movie that otherwise became much longer. I am not surprised that they even wanted to remove it, and now I understand why (even though back then I grumbled about it). They also removed the Leni Riefenstahl imagery that was there in the original, which I find regrettable, but obviously necessary given the rest of the movie.

A few minor notes.

The voice acting was a mixed bag. Beyonce was surprisingly bland (speaking, her singing was beautiful), and so was John Oliver (singing, his speaking was perfect). I just listened again to I can't wait to be king, and John Oliver just sounds so much less emotional than Rowan Atkinson. Pity.

Another beautiful scene was the scene where Rafiki receives the message that Simba is still alive. In the original, this was a short transition of Simba ruffling up some flowers, and the wind takes them to Rafiki, he smells them, and realizes it is Simba. Now the scene is much more elaborate, funnier, and is reminiscent of Walt Disney's animal movies, which is a beautiful nod to the company founder. Simba's hair travels with the wind, birds, a giraffe, an ant, and more, until it finally reaches the shaman's home.

One of my best laughs was also due to another smart change: in Hakuna Matata, when they retell Pumbaa's story (with an incredibly cute little baby Pumbaa), Pumbaa laments that all his friends leaving him got him "downhearted, every time that he farted", and immediately complains to Timon as to why he didn't stop him singing it - a play on the original's joke, where Timon interrupts Pumbaa before he finishes the line with "Pumbaa! Not in front of the kids.", looking right at the camera and breaking the fourth wall.

Another great change was to give the Hyenas a bit more character - the interactions between the Hyena who wasn't much into personal space and the other who rather was, were really amusing. Unlike with the original version the differences in the looks of the Hyenas are harder to make out, and so giving them more personality is a great choice.

All in all, I really loved this version. Seeing it on the big screen pays off for the amazing imagery that really shines on a large canvas. I also love the original, and the original will always have a special place in my heart, but this is a wonderful tribute to a brilliant movie with an exceptional story.

210,000 year old human skull found in Europe

A Homo sapiens skull that is 210,000 years old has been found in Greece, together with a Neanderthal skull from 175,000 years ago.

The oldest European Homo sapiens remains known so far date to only 40,000 years ago.


Draft: Collaborating on the sum of all knowledge across languages

For the upcoming Wikipedia@20 book, I published my chapter draft. Comments are welcome on the pubpub Website until July 19.

Every language edition of Wikipedia is written independently of every other language edition. A contributor may consult an existing article in another language edition when writing a new article, or they might even use the Content Translation tool to help with translating one article to another language, but there is nothing that ensures that articles in different language editions are aligned or kept consistent with each other. This is often regarded as a contribution to knowledge diversity, since it allows every language edition to grow independently of all other language editions. So would creating a system that aligns the contents more closely with each other sacrifice that diversity?

Differences between Wikipedia language editions

Wikipedia is often described as a wonder of the modern age. There are more than 50 million articles in almost 300 languages. The goal of allowing everyone to share in the sum of all knowledge is achieved, right?

Not yet.

The knowledge in Wikipedia is unevenly distributed. Let’s take a look at where the first twenty years of editing Wikipedia have taken us.

The number of articles varies between the different language editions of Wikipedia: English, the largest edition, has more than 5.8 million articles, Cebuano — a language spoken in the Philippines — has 5.3 million articles, Swedish has 3.7 million articles, and German has 2.3 million articles. (Cebuano and Swedish have a large number of machine generated articles.) In fact, the top nine languages alone hold more than half of all articles across the Wikipedia language editions — and if you take the bottom half of all Wikipedias ranked by size, they together wouldn’t have 10% of the number of articles in the English Wikipedia.

It is not just the sheer number of articles that differs between editions, but their comprehensiveness as well: the English Wikipedia article on Frankfurt has a length of 184,686 characters, a table of contents spanning 87 sections and subsections, 95 images, tables and graphs, and 92 references — whereas the Hausa Wikipedia article states that it is a city in the German state of Hesse, and lists its population and mayor. Hausa is a language spoken natively by 40 million people and as a second language by another 20 million.

It is not always the case that the large Wikipedia language editions have more content on a topic. Although readers often consider large Wikipedias to be more comprehensive, local Wikipedias may frequently have more content on topics of local interest: the English Wikipedia knows about the Port of Călărași that it is one of the largest Romanian river ports, located at the Danube near the town of Călărași — and that’s it. The Romanian Wikipedia on the other hand offers several paragraphs of content about the port.

The topics covered by the different Wikipedias also overlap less than one would initially assume. The English Wikipedia has 5.8 million articles, German has 2.2 million articles — but only 1.1 million topics are covered by both Wikipedias. A full 1.1 million topics have an article in German — but not in English. The top ten Wikipedias by activity — each of them with more than a million articles — have articles on only a hundred thousand topics in common. 18 million topics are covered by articles in the different language Wikipedias — and English only covers 31% of these.

Besides coverage, there is also the question of how up to date the different language editions are: in June 2018, San Francisco elected London Breed as its new mayor. Nine months later, in March 2019, I conducted an analysis of who the mayor of San Francisco was, according to the different language versions of Wikipedia. Of the 292 language editions, a full 165 had a Wikipedia article on San Francisco. Of these, 86 named the mayor. The good news is that not a single Wikipedia lists a wrong mayor — but the vast majority are out of date. English switched the minute London Breed was sworn in. But 62 Wikipedia language editions list an out-of-date mayor — and not just the previous mayor Ed Lee, who became mayor in 2011, but also often Gavin Newsom (2004-2011), and his predecessor, Willie Brown (1996-2004). The most out-of-date entry is to be found in the Cebuano Wikipedia, which names Dianne Feinstein as the mayor of San Francisco. She took that role after the assassination of Harvey Milk and George Moscone in 1978, and remained in that position for a decade, until 1988 — Cebuano was more than thirty years out of date. Only 24 language editions listed the current mayor, London Breed, out of the 86 that listed the name at all.
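The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers quoted above. The variable and function names are my own.

```python
# Counts quoted in the paragraph above (March 2019 snapshot).
EDITIONS_TOTAL = 292
WITH_ARTICLE = 165
NAMING_MAYOR = 86
OUTDATED = 62
CURRENT = 24


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Every edition that names a mayor is either current or out of date.
counts_consistent = OUTDATED + CURRENT == NAMING_MAYOR

outdated_share = percent(OUTDATED, NAMING_MAYOR)
current_share = percent(CURRENT, NAMING_MAYOR)

# Feinstein left office in 1988; the analysis was done in 2019.
feinstein_lag_years = 2019 - 1988

print(counts_consistent, outdated_share, current_share, feinstein_lag_years)
```

So roughly 72% of the editions that name a mayor were out of date, and the Cebuano entry lagged by 31 years, which matches the "more than thirty years" above.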

An even more important metric for the success of a Wikipedia is the number of contributors: English has more than 31,000 active contributors — three out of seven active Wikimedians are active on the English Wikipedia. German, the second most active Wikipedia community, already only has 5,500 active contributors. Only eleven language editions have more than a thousand active contributors — and more than half of all Wikipedias have fewer than ten active contributors. To assume that fewer than ten active contributors can write and maintain a comprehensive encyclopedia in their spare time is optimistic at best. These numbers basically doom the mission of the Wikimedia movement to realize a world where everyone can contribute to the sum of all knowledge.

Enter Wikidata

Wikidata was launched in 2012 and offers a free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia, Wikimedia Commons, the other wikis of the Wikimedia movement, and to anyone in the world. Wikidata contains structured information in the form of simple claims, such as “San Francisco — Mayor — London Breed”, qualifiers, such as “since — July 11, 2018”, and references for these claims, e.g. a link to the official election results as published by the city.
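To make the shape of such a claim concrete, here is a minimal sketch of the claim, qualifier, and reference structure as plain Python data classes. This is an illustration only and not the actual Wikibase data model: the class and field names are my own, and it uses human-readable labels where Wikidata would use item and property identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Qualifier:
    """Extra context for a claim, e.g. since when it holds."""
    property: str
    value: str


@dataclass
class Claim:
    """A simple subject / property / value statement."""
    subject: str
    property: str
    value: str
    qualifiers: list[Qualifier] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


mayor_claim = Claim(
    subject="San Francisco",
    property="Mayor",
    value="London Breed",
    qualifiers=[Qualifier("since", "July 11, 2018")],
    # Placeholder text instead of a real URL, which the essay does not give.
    references=["official election results as published by the city"],
)

print(mayor_claim.subject, mayor_claim.property, mayor_claim.value)
```

A Wikipedia that wants the current mayor would look up the claim on the San Francisco item and read its value, instead of storing the name in its own article text.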

One of these structured claims would be on the Wikidata page about San Francisco and state the mayor, as discussed earlier. The individual Wikipedias can then query Wikidata for the current mayor. Of the 24 Wikipedias that named the current mayor, eight were current because they were querying Wikidata. I hope to see that number go up. Using Wikidata more extensively can, in the long run, allow for more comprehensive, current, and accessible content while decreasing the maintenance load for contributors.

Wikidata was developed in the spirit of Wikipedia’s increasing drive to add structure to its articles. Examples of this include the introduction of infoboxes, quick tabular overviews of facts about the topic of an article, as early as 2002, and of categories in 2004. Over the years, the structured features became increasingly intricate: infoboxes moved to templates, templates started using more sophisticated MediaWiki functions, and then later demanded the development of even more powerful MediaWiki features. In order to maintain the structured data, bots were created, software agents that could read content from Wikipedia or other sources and then perform automatic updates to other parts of Wikipedia. Before the introduction of Wikidata, bots keeping the language links between the different Wikipedias in sync easily contributed 50% and more of all edits.

Wikidata provided an outlet for many of these activities, and relieved the Wikipedias of having to run bots to keep language links in sync or of massive infobox maintenance tasks. But one lesson I learned from these activities is that I can trust the communities with mastering complex workflows spread out between community members with different capabilities: in fact, a small number of contributors working on intricate template code and developing bots can provide invaluable support to contributors who focus more on maintaining articles and contributors who write large swaths of prose. The community is very heterogeneous, and the different capabilities and backgrounds complement each other in order to create Wikipedia.

However, Wikidata’s structured claims are of limited expressivity: their subject must always be the topic of the page, and every object of a statement must exist as its own item and thus page in Wikidata. If it doesn’t fit in the rigid data model of Wikidata, it simply cannot be captured in Wikidata — and if it cannot be captured in Wikidata, it cannot be made accessible to the Wikipedias.

For example, let’s take a look at the following two sentences from the English Wikipedia article on Ontario, California:

“To impress visitors and potential settlers with the abundance of water in Ontario, a fountain was placed at the Southern Pacific railway station. It was turned on when passenger trains were approaching and frugally turned off again after their departure.”

There is no feasible way to express the content of these two sentences in Wikidata - the simple claim and qualifier structure that Wikidata supports cannot capture the subtle situation that is described here.

An Abstract Wikipedia

I suggest that the Wikimedia movement develop an Abstract Wikipedia, a Wikipedia in which the actual textual content is being represented in a language-independent manner. This is an ambitious goal — it requires us to push the current limits of knowledge representation, natural language generation, and collaborative knowledge construction by a significant amount: an Abstract Wikipedia must allow for:

  1. relations that connect more than just two participants with heterogeneous roles.
  2. composition of items on the fly from values and other items.
  3. expressing knowledge about arbitrary subjects, not just the topic of the page.
  4. ordering content, to be able to represent a narrative structure.
  5. expressing redundant information.

Let us explore one of these requirements, the last one: unlike the sentences of a declarative formal knowledge base, human language is usually highly redundant. Formal knowledge bases usually try to avoid redundancy, for good reasons. But in a natural language text, redundancy happens frequently. One example is the following sentence:

“Marie Curie is the only person who received two Nobel Prizes in two different sciences.”

The sentence is redundant given a list of Nobel Prize winners and the disciplines in which they were awarded — a list that basically every large Wikipedia will contain. But the content of the given sentence nevertheless appears in many of the different language articles on Marie Curie, and usually right in the first paragraph. So there is obviously something very interesting in this sentence, even though the knowledge expressed in this sentence is already fully contained in most of the Wikipedias it appears in. This form of redundancy is commonplace in natural language — but is usually avoided in formal knowledge bases.

The technical details of the Abstract Wikipedia proposal are presented in (Vrandečić, 2018). But the technical architecture is only half of the story. Much more important is the question of whether the communities can meet the challenges of this project.

Wikipedia and Wikidata have shown that the communities are capable of meeting difficult challenges: be it templates in Wikipedia, or constraints in Wikidata, the communities have shown that they can drive comprehensive policy and workflow changes as well as the necessary technological feature development. Not everyone needs to understand the whole stack in order to make a feature such as templates a crucial part of Wikipedia.

The Abstract Wikipedia is an ambitious future project. I believe that this is the only way for the Wikimedia movement to achieve its goal, short of developing an AI that will make the writing of a comprehensive encyclopedia obsolete anyway.

A plea for knowledge diversity?

When presenting the idea of the Abstract Wikipedia, the first question is usually: will this not massively reduce the knowledge diversity of Wikipedia? By unifying the content between the different language editions, does this not force a single point of view on all languages? Is the Abstract Wikipedia taking away the ability of minority language speakers to maintain their own encyclopedias, to have a space where, for example, indigenous speakers can foster and grow their own point of view, without being forced to unify under the western US-dominated perspective?

I am sympathetic with the intent of this question. The goal of this question is to ensure that a rich diversity in knowledge is retained, and to make sure that minority groups have spaces in which they can express themselves and keep their knowledge alive. These are, in my opinion, valuable goals.

The assumption that an Abstract Wikipedia, from which any of the individual language Wikipedias can draw content, will necessarily reduce this diversity is false. In fact, I believe that access to more knowledge and to more perspectives is crucial to achieve an effective knowledge diversity, and that the currently perceived knowledge diversity in different language projects is ineffective at best, and harmful at worst. In the rest of this essay I will argue why this is the case.

Language does not align with culture

First, it is wrong to use language as the dimension along which to draw the demarcation line between different content if the Wikimedia movement truly believes that different groups should be able to grow and maintain their own encyclopedias.

In case the Wikimedia movement truly believes that different groups or cultures should have their own Wikipedias, why is there only a single Wikipedia language edition for the English speakers from India, England, Scotland, Australia, the United States, and South Africa? Why is there only one Wikipedia for Brazil and Portugal, leading to much strife? Why are there no two Wikipedias for US Democrats and Republicans?

The conclusion is that the Wikimedia movement does not believe that language is the right dimension to split knowledge — it is a historical decision, driven by convenience. The core Wikipedia policies, vision, and mission are all geared towards enabling access to the sum of all knowledge to every single reader, no matter what their language, and not toward capturing all knowledge and then subdividing it for consumption based on the languages the reader is comfortable in.

The split along languages leads to the problem that it is much easier for a small language community to go “off the rails” — to either, as a whole, become heavily biased, or to adopt rules and processes which are problematic. The fact that the larger communities have different rules, processes, and outcomes can be beneficial for Wikipedia as a whole, since they can experiment with different rules and approaches. But this does not seem to hold true when the communities drop under a certain size and activity level, when there are not enough eyeballs to avoid the development of bad outcomes and traditions. For one example, the article about skirts in the Bavarian Wikipedia features three upskirt pictures, one porn actress, an anime screenshot, and a video showing a drawing of a woman with a skirt getting continuously shorter. The article became like this within a day or two of its creation, and, even though it has been edited by a dozen different accounts, has remained like this over the last seven years. (This describes the state of the article in April 2019 — I hope that with the publication of this essay, the article will finally be cleaned up).

A look at some south Slavic language Wikipedias

Second, a natural experiment is going on, where contributors that are more separated by politics than language differences have separate Wikipedias: there exist individual Wikipedia language editions for Croatian, Serbian, Bosnian, and Serbocroatian. Linguistically, the differences between the dialects of Croatian are often larger than the differences between standard Croatian and standard Serbian. Particularly the existence of the Serbocroatian Wikipedia poses interesting questions about these delineations.

Particularly the Croatian Wikipedia has turned to a point of view that has been described as problematic. Certain events and Croat actors during the 1990s independence wars or the 1940s fascist puppet state might be represented more favorably than in most other Wikipedias.

Here are two observations based on my work on south Slavic language Wikipedias:

First, claiming that a more fascist-friendly point of view within a Wikipedia increases the knowledge diversity across all Wikipedias might be technically true, but is practically insufficient. Being able to benefit from this diversity requires the reader to not only be comfortable reading several different languages, but also to engage deeply enough and spend the time and interest to actually read the article in different languages, which is mostly a profoundly boring exercise, since a lot of the content will be overlapping. Finding the juicy differences is anything but easy, especially considering that most readers are reading Wikipedia from mobile devices, and are just looking to satisfy a quick information need from a source whose curation they trust.

Most readers will only read a single language version of an article, and thus any diversity that exists across different language editions is practically lost. The sheer existence of this diversity might even be counterproductive, as one may argue that the communities should not spend resources on reflecting the true diversity of a topic within each individual language. This would cement the practical uselessness of the knowledge diversity across languages.

Second, many of the same contributors that write the articles with a certain point of view in the Croatian Wikipedia also contribute to the English Wikipedia articles about the same topics — but there they suddenly are forced and able to compromise and incorporate a much wider variety of points of view. One might hope the contributors would take the more diverse points of view and migrate them back to their home Wikipedias — but that is often not the case. If contributors harbor a certain point of view (and who doesn’t?) it often leads to a situation where they push that point of view as much as they can get away with in each of the projects.

It has to be noted that the most blatant deviations from a neutral point of view in Wikipedias like the Croatian Wikipedia will not be found in the most central articles, but in the large periphery of articles surrounding these central articles, which is much harder to keep an eye on.

Abstract Wikipedia and Knowledge diversity

The Abstract Wikipedia proposal does not require any of the individual language editions to use it. Each language community can decide for each article whether to fall back on the Abstract Wikipedia or whether to create their own article in their language. And even that decision can be more fine grained: a contributor can decide for an individual article to incorporate sections or paragraphs from the Abstract Wikipedia.
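The proposal does not prescribe how this fallback would be implemented, so the following is only a sketch under my own assumptions: articles are modeled as dictionaries from section title to text, a local section replaces the shared one with the same title, and a local entry of `None` suppresses a shared section (the ability to overlay and suppress content comes up again further down in this essay). All names here are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def resolve_article(
    abstract_sections: dict[str, str],
    local_sections: dict[str, Optional[str]],
) -> dict[str, str]:
    """Combine shared and local content for one article.

    - Local text overrides the shared section with the same title.
    - A local value of None suppresses the shared section.
    - Sections that exist only locally are appended at the end.
    - Everything else falls back to the shared content.
    """
    resolved: dict[str, str] = {}
    for title, shared_text in abstract_sections.items():
        if title in local_sections:
            local_text = local_sections[title]
            if local_text is not None:
                resolved[title] = local_text
        else:
            resolved[title] = shared_text
    for title, local_text in local_sections.items():
        if title not in abstract_sections and local_text is not None:
            resolved[title] = local_text
    return resolved


shared = {"Geography": "shared geography text", "History": "shared history text"}
local = {"History": "locally written history", "Local cuisine": "locally written section"}
article = resolve_article(shared, local)
print(list(article))
```

In this example the community keeps the shared geography section, replaces the history section with its own text, and adds a section that only exists locally. An empty `local` dictionary gives the pure fallback case.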

This allows the individual Wikipedia communities the luxury to entirely concentrate on the differences that are relevant to them. I distinctly remember that when I started the Croatian Wikipedia, it felt like I had the burden to first write an article about every country in the world before I could write the articles I cared about, such as my mother’s home village — because how could anyone defend a general purpose encyclopedia that might not even have an article on Nigeria, a country with a population of a hundred million, but one on Donji Humac, a village with a population of 157? Wouldn’t you first need an article on all of the chemical elements that make up the world before you can write about a local food?

The Abstract Wikipedia frees a language edition from this burden, and allows each community to entirely focus on the parts they care about most — and to simply import the articles from the common source for the topics that are less in their focus. It allows the community to make these decisions. As the communities grow and shift, they can revisit these decisions at any time and adapt them.

At the same time, the Abstract Wikipedia makes these differences more visible since they become explicit. Right now there is no easy way to say whether the fact that Dianne Feinstein is listed as the Mayor of San Francisco in the Cebuano Wikipedia is due to cultural particularities of the Cebuano language communities or not. Are the different population numbers of Frankfurt in the different language editions intentional expressions of knowledge diversity? With an Abstract Wikipedia, the individual communities could explicitly choose which articles to create and maintain on their own, and at the same time remove a lot of unintentional differences.

By making these decisions more explicit, it becomes possible to imagine an effective workflow that observes these intentional differences, and sets up a path to integrate them into the common article in the Abstract Wikipedia. Right now, there are 166 different language versions of the article on the chemical element Helium — it is basically impossible for a single person to go through all of them and find the content that is intentionally different between them. With an Abstract Wikipedia, which contains the common shared knowledge, contributors, researchers, and readers can actually take a look at those articles that intentionally have content that replaces or adds to the commonly shared one, assess these differences, and see if contributors should integrate the differences in the shared article.

The differences in content may reflect differences in policies, particularly in policies of notability and reliability. Whereas at first glance it might seem that the Abstract Wikipedia would require unified notability and reliability requirements across all Wikipedias, this is not the case: because local Wikipedias can overlay and suppress content from the Abstract Wikipedia, they can adjust their Wikipedias based on their own rules. And the increased visibility of such decisions will make it easier to identify biases, and hopefully also lead to updated rules that reduce said biases.

A new incentive infrastructure

The Abstract Wikipedia will evolve the incentive infrastructure of Wikipedia.

Presently, many underrepresented languages are spoken in areas that are multilingual. Often another language spoken in this area is regarded as a high-prestige language, and is thus the language of education and literature, whereas the underrepresented language is a low-prestige language. So even though the low-prestige language might have more speakers, the most likely recruits for the Wikipedia communities, people with education who can afford internet access and have enough free time, will be able to contribute in both languages.

In which language should I contribute? If I write the article about my mother’s home town in Croatian, I make it accessible to a few million people. If I write the article about my mother’s home town in English, it becomes accessible to more than a hundred times as many people! The work might be the same, but the perceived benefit is orders of magnitude higher: the question becomes, do I teach the world about a local tradition, or do I tell my own people about their tradition? The world is bigger, and thus more likely to react, creating a positive feedback loop.

This cannibalizes the communities for local languages by diverting them to the English Wikipedia, which is perceived as the global knowledge community (or to other high-prestige languages, such as Russian or French). This is also reflected in a lot of articles in the press and in academic works about Wikipedia, where the English Wikipedia is being understood as the Wikipedia. Whereas it is known that Wikipedia exists in many other languages, journalists and researchers are, often unintentionally, regarding the English Wikipedia as the One True Wikipedia.

Another strong impediment to recruiting contributors to smaller Wikipedia communities is rarely explicitly called out: it is pretty clear that, given the current architecture, these Wikipedias are doomed to fail at their mission. As discussed above, more than half of all Wikipedia language editions have fewer than ten active contributors, and writing a comprehensive, up-to-date Wikipedia is not an achievable goal with so few people writing in their free time. The translation tools offered by the Wikimedia Foundation can help considerably in certain circumstances, but for most of the Wikipedia languages automatic translation models don’t exist, and so the tools cannot help the languages that need them the most.

With the Abstract Wikipedia though, the goal of providing a comprehensive and current encyclopedia in almost any language becomes much more tangible: instead of taking on the task of creating and maintaining the entire content, only the grammatical and lexical knowledge of a given language needs to be created. This is a far smaller task. Furthermore, this grammatical and lexical knowledge is comparably static — it does not change as much as the encyclopedic content of Wikipedia, thus turning a task that is huge and ongoing into one where the content will grow and be maintained without the need of too much maintenance by the individual language communities.

Yes, the Abstract Wikipedia will require more and different capabilities from a community that has yet to be found, and the challenges will be both novel and big. But the communities of the many Wikimedia projects have repeatedly shown that they can meet complex challenges with ingenious combinations of processes and technological advancements. Wikipedia and Wikidata have both demonstrated the ability to draw on technologically rather simple canvasses, and create extraordinarily rich and complex masterpieces, which stand the test of time. The Abstract Wikipedia aims to challenge the communities once again, and the promise this time is nothing less than to finally reach the ultimate goal: to allow everyone, no matter what their native language is, to share in the sum of all knowledge.

Acknowledgements

Thanks to Jamie Taylor, Daniel Russell, Joseph Reagle, Stephen LaPorte, and Jake Orlowitz for their valuable suggestions on improving the article.

Toy Story 4

Toy Story 4 was great fun!

Toy Story 3 had a great closure (and a lot of tears), so what could they do to justify a fourth part? They developed the characters further than ever before. Woody is faced with a lot of decisions, and he has to grow in order to say an even bigger good-bye than last time.

Interesting fact: PETA protested the movie because Bo Peep uses a shepherd's crook, and those are considered a "symbol of domination over animals."

Bo Peep was a pretty cool character in the movie. And she used her crook well.

The cast was amazing: besides the many who kept their roles (Tom Hanks, Tim Allen, Annie Potts, Joan Cusack, Timothy Dalton, even keeping Don Rickles from archive footage after his death, and everyone else), there were many new voices (Betty White, Mel Brooks, Christina Hendricks, Keanu Reeves, Bill Hader, Tony Hale, Key and Peele, and Flea from the Red Hot Chili Peppers).

The end of civilization?

This might be controversial with some of my friends, but no, there is no high likelihood of human civilization ending within the next 30 years.

Yes, climate change is happening, and we're obviously not reacting fast and effectively enough. But that won't kill humanity, and it will not end civilization.

Some highly populated areas might become uninhabitable. No question about this. Whole countries in southern Asia, Central and South America, and Africa might become too hot and too humid or too dry for human living. This would lead to hundreds of millions, maybe billions of people, who will want to move, to save their lives and the lives of their loved ones. Many, many people would die in these migrations.

The migration pressures on the countries that are climatically better off may become enormous, and it will either lead to massive bloodshed or to enormous demographic changes, or, most likely, both.

But look at the map. There are large areas in northern Asia and North America that would dramatically improve in habitability for humans if they warmed a bit. Large areas could become viable for growing wheat, fruits, corn.

As is already the case today, and as it was for most of human history, we produce enough food and clean water and shelter and energy for everyone. The problem is not production, it is and will always be distribution. Facing huge upheaval and massive migration, the distribution channels will likely break down and become even more ineffective. The disruption of the distribution network will likely also endanger seemingly stable states, and places that expected to pass through the events unscathed will be hurt by that breakdown. The fact that there would be enough food will make the humanitarian catastrophes even more maddening.

Money will make it possible to shelter away from the most severe effects, no matter where you start now. It's the poor that will bear the brunt of the negative effects. I don't think that's surprising to anyone.

But even if almost none of today's countries survive as they are, and even if a few billion people die, the chances of humanity ending, of civilization ending, are negligible. Billions will survive into the 22nd century, and will carry on history.

So, yes, the changes might be massive and in some areas catastrophic. But humanity and civilization will persevere.

Why this post? I don't think it is responsible to exaggerate the bad predictions too much. It makes the predictions less believable. Also, to have a sober look at the possible changes may make it easier to understand why some countries react as they do. Does this mean we don't need to react and try to reduce climate change? If that's your conclusion, you haven't read carefully along. I said something about possibly billions becoming displaced.

IFLScience: New Report Warns "High Likelihood Of Human Civilization Coming To An End" Within 30 Years

Web Conference 2019

25 May 2019

Last week saw the latest incarnation of the Web Conference (previously known as WWW or dubdubdub), going from May 15 to 17 (with satellite events the two days before). When I was still in academia, WWW was one of the most prestigious conference series for my research area, so when it came to be held literally across the street from my office, I couldn’t resist going to it.

The conference featured two keynotes (the third, by Lawrence Lessig, was cancelled on short notice due to a family emergency):

Watch the talks on YouTube on the links given above. Thanks to Marco Neumann for pointing to the links!

The conference was attended by more than 1,400 people (closer to 1,600?), making it the second largest since its inception (trailing only Lyon from last year), and about double the size it used to be only four or five years ago. The conference dinner in the Exploratorium was relaxed and enjoyable. Acceptance rate was at 18%, which made for 225 accepted full papers.

The proceedings are available for free (yay!), so browse them for papers you find interesting. Personally, I really enjoyed the papers that looked into the use of WhatsApp to spread misinformation before the Brazil election, Dataset Search, and pre-empting SPARQL queries from blocking the endpoint. The proceedings span 5,047 pages, and are available online.

I had the feeling that Machine Learning was taking much more space in the program than it did when I used to attend the conference regularly - which is fine, but many of the ML papers were only tenuously connected to the Web (which was the same criticism that we raised against many of the Semantic Web / Description Logic papers back then).

Thanks to the general chairs, Leila Zia and Ricardo Baeza-Yates, for organizing the conference, and thanks to the sponsors, particularly Microsoft, Bloomberg, Amazon, and Google.

The two workshops I attended before the Web Conference were the Knowledge Graph Technology and Applications 2019 workshop on Monday, and the Wiki workshop 2019 on Tuesday. They have their own trip reports.

If you have trip reports, let me know and I will link to them.

Wiki workshop 2019

24 May 2019

Last week, May 14, saw the fifth incarnation of the Wiki workshop, co-located with the Web Conference (formerly known as dubdubdub), in San Francisco. The room was tight and very full - I am bad at estimating, but I guess 80-110 people were there.

I was honored to be invited to give the opening talk, and since I had a bit more time than in the last few talks, I really indulged in sketching out the proposal for the Abstract Wikipedia, providing plenty of figures and use cases. The response was phenomenal, and there were plenty of questions not only after the talk but also throughout the day and in the next few days. In fact, the Open Discussion slot was very much dominated by more questions about the proposal. I found that extremely encouraging. Some of the comments were immediately incorporated into a paper I am writing right now and that will be available for public reviews soon.

The other presentations - both the invited and the accepted ones - were super interesting.

Thanks to Dario Taraborelli, Bob West, and Miriam Redi for organizing the workshop.

A little extra was that I smuggled my brother and his wife into the workshop for my talk (they are visiting, and they have never been to one of my talks before). It was certainly interesting to hear their reactions afterwards - if you have non-academic relatives, you might underestimate how much they may enjoy such an event as mere spectators. I certainly did.

See also the #wikiworkshop2019 tag on Twitter.

Knowledge Graph Technology and Applications 2019

23 May 2019

Last week, on May 13, the Knowledge Graph Technology and Applications workshop happened, co-located with the Web Conference 2019 (formerly known as WWW), in San Francisco. I was invited to give the opening talk, and talked about the limits of Knowledge Graph technologies when trying to express knowledge. The talk resonated well.

Just like in last week's KGC, the breadth of KG users is impressive: NASA uses KGs to support air traffic management, Uber talks about the potential for their massive virtual KG over 200,000 schemas, and then there are LinkedIn, Alibaba, IBM, Genentech, etc. I found it particularly interesting that Microsoft has not one, but at least four large Knowledge Graphs: the generic Knowledge Graph Satori; an Academic Graph for science, papers, citations; the Enterprise Graph (mostly LinkedIn), with companies, positions, schools, employees and executives; and the Work graph about documents, conference rooms, meetings, etc. All in all, they boasted more than a trillion triples (why is it not a single graph? No idea).

Unlike last week, the focus was less on sharing experiences when working with Knowledge Graphs, and more on academic work, such as query answering, mixing embeddings with KGs, scaling, mapping ontologies, etc. Given that it is co-located with the Web Conference, this seems unsurprising.

One interesting point that was raised was the question of common sense: can we, and how can we use a knowledge graph to represent common sense? How can we say that a box of chocolate may fit in the trunk of a car, but a piano would not? Are KGs the right representation for that? The question remained unanswered, but lingered through the panel and some QnA sessions.

The workshop was very well visited - it got the second largest room of the day, and the room didn’t feel empty, but I have a hard time estimating how many people were there (about 100-150?). The audience was engaged.

The connection with the Web was often rather tenuous, unless one thinks of KGs as inherently associated with the Web (maybe because they often could use Semantic Web standards? But also often they don’t). On the other hand, it is a good outlet within the Web Conference for the Semantic Web crowd, and a way to make them mingle more with the KG crowd. I did see a few people brought together in one room who have often been separated, and I was able to point a few academic researchers to enterprise employees, where both sides would benefit from each other.

Thanks to Ying Ding from Indiana University and the other organizers for organizing the workshop, and for all the discussion and insights it generated!

Update: corrected that Uber talked about the potential of their knowledge graph, not about their realized knowledge graph. Thanks to Joshua Shivanier for the correction! Also added a paragraph on common sense.

Knowledge Graph Conference 2019, Day 1

On Tuesday, May 7, the first Knowledge Graph Conference began. Organized by François Scharffe and his colleagues at Columbia University, it was located in New York City. The conference goes for two days, and aims at a much more industry-oriented crowd than conferences such as ISWC. And this was reflected very prominently in the speaker line-up: finance especially was very well represented (no surprise, with Wall Street being just downtown).

Speakers and participants from Goldman Sachs, Capital One, Wells Fargo, Mastercard, Bank of America, and others were in the room, but also from companies in other industries, such as Astra Zeneca, Amazon, Uber, or AirBnB. The speakers and participants were rather open about their work, often listing numbers of triples and entities (which really is a weird metric to cite, but since it is readily available it is often expected to be stated), and these were usually in the billions. More interesting than the sheer size of their respective KGs were their use cases, and particularly in finance it was often ensuring compliance with insider trading rules and similar regulations.

I presented Wikidata and the idea of an Abstract Wikipedia as going beyond what a Knowledge Graph can easily express. I had the feeling the presentation was well received - it was obvious that many people in the audience were already fully aware of Wikidata and are actively using it or planning to use it. For others, particularly the SPARQL endpoint with its powerful visualization capabilities and the federated queries, and the external identifiers in Wikidata, and the approach to references for the claims in Wikidata were perceived as highlights. The proposal of an Abstract Wikipedia was very warmly received, and it was the first time no one called it out as a crazy idea. I guess the audience was very friendly, despite New York's reputation.

A second set of speakers were offering technologies and services - and I guess I belong to this second set by speaking about Wikidata - and among them were people like Juan Sequeda of Capsenta, who gave an extremely engaging and well-substantiated talk on how to bridge the chasm towards more KG adoption; Pierre Haren of Causality Link, who offered an interesting personal history through KR land from LISP to Causal Graphs; Dieter Fensel of OnLim, who had a number of really good points on the relation between intelligent assistants and their dialogue systems and KGs; as well as Neo4J, Eccenca, and Diffbot.

A highlight for me was the astute and frequent observation by a number of the speakers from the first set that the most challenging problems with Knowledge Graphs were rarely technical. I guess graph serving systems and cloud infrastructure have improved so much that we don't have to worry about these parts anymore unless you are doing crazy big graphs. The most frequently mentioned problems were social and organizational. Since Knowledge Graphs often pulled data sources from many different parts of an organization together, with a common semantics, they trigger feelings of territoriality. Who gets to define the common ontology? What if the data a team provides has problems or is used carelessly, who's at fault? What if others benefit from our data more than we did even though we put all the effort in to clean it up? How do we get recognized for our work? Organizational questions were often about a lack of understanding, especially among engineers, for fundamental Knowledge Graph principles, and a lack of enthusiasm in the management chain - especially when the costs are being estimated and the social problems mentioned before become apparent. One particularly visible moment was when Bethany Sehon from Capital One was asked about the major challenges to standardizing vocabularies - and her first answer was basically "egos".

All speakers talked about the huge benefits they reaped from using Knowledge Graphs (such as detecting likely cliques of potential insider trading that later indeed got convicted) - but then again, this is to be expected since conference participation is self-selecting, and we wouldn't hear of failures in such a setting.

I had a great day at the inaugural Knowledge Graph Conference, and am sad that I have to miss the second day. Thanks to François Scharffe for organizing the conference, and thanks to the sponsors, OntoText, Collibra, and TigerGraph.

For more, see:

Golden

I'd say that Golden might be the most interesting competitor to Wikipedia I've seen in a while (which really doesn't mean that much, it's just the others have been really terrible).

This one also has a few red flags:

  • closed source, as far as I can tell
  • aiming for ten billion topics in their first announcement, but lacking an article on Germany
  • obviously not understanding what the point of notability policies is, and no, it is not about server space

They also have a few features that, if they work, should be looked at and copied by Wikipedia - such as the editing assistants and some of the social features that are built into the platform.

Predictions:

  1. they will make a splash or two, and have corresponding news cycles to it
  2. they will, at some point, make an effort to import or transclude Wikipedia content
  3. they will never make a dent in Wikipedia readership, and will say that they wouldn't want to anyway because they love Wikipedia (which I believe)
  4. they will make a press release of donating all their content to Wikipedia (even though that's already possible thanks to their license)
  5. and then, being a for-profit company, they will pivot to something else within a year or two.

May 2019 talks

I am honored to give the following three invited talks in the next few weeks:

The topics will all be on Wikidata, how the Wikipedias use it, and the Abstract Wikipedia idea.

AI and role playing

An article about AI and role playing games, and thus in the perfect intersection of my interests.

But the article is entirely devoid of any interesting content, and basically boils down to asking the question "could RPGs be a Turing test for AI?"

I mean, the answer is so painfully obviously "yes" that no one ever bothered to write it down. I mean, Turing wrote the test as a role playing game basically!

Papaphobia

In a little knowledge engineering exercise, I was trying to add the causes of a phobia to the respective Wikidata items. There are currently about 160 phobias in Wikidata, and only a few list in a structured way what they are afraid of. So I was going through them, trying to capture it in a structured way. Here's a list of the current state:

Now, one of those phobias was the Papaphobia - the fear of the pope. Now, is that really a thing? I don't know. CDC does not seem to have an entry on it. On the Web, in the meantime, some pages have obviously taken to mining lists of phobias and creating advertising pages that "help" you with Papaphobia - such as this one:

This page is likely entirely auto-generated. I doubt that they have "clients for papaphobia in 70+ countries", whom they helped "in complete discretion" within a single day! "People with severe fears and phobias like papaphobia (which is in fact the formal diagnostic term for papaphobia) are held prisoners by their phobias."

This site offers more, uhm, useful information.

"Group psychotherapy can also help where individuals share their experience and, in the process, understand and recover from their phobia." Really? There are enough cases that we can even set up a group therapy?

Now, maybe I am entirely off here - maybe, papaphobia is really a thing. Searching Scholar, I couldn't find any medical sources (the term is mentioned in a number of sociological and historical works, to express general sentiments in a population or government against the authority of the pope, but I could not find any mentions of it in actual medical literature).

Now could those pages up there be benign cases of jokes? Or are they trying to scam people with promises to heal their actual fears, and they just didn't curate the list of fears sufficiently, because, really, you wouldn't find this page unless you actually search for this term?

And now what? Now what if we know these pages are made by scammers? Do we report them to the police? Do we send a tip to journalists? Or should we just do nothing, allowing them to scam people with actual fears? Well, by publishing this text, maybe I'll get a few people warned, but it won't reach the people it has to reach at the right time, unfortunately.

Also, was it always so hard to figure out what is real and what is not? Does papaphobia exist? Such a simple question. How should we deal with it on Wikidata? How many cases are there, if it exists? Did it get worse for people with papaphobia now that we have two people living who have been made pope?

My assumption now is that someone was basically working on a corpus, looking for words ending in -phobia, in order to generate a list of phobias. And then the term papaphobia from sociological and historical literature popped up, and it landed in some list, and was repeated in other places, etc., also because it is kind of a funny idea, and so a mixture of bad research and joking bubbled through, and rolled around on the Web for so long that it looks like it is actually a thing, to the point that there are now organizations who will gladly take your money (CTRN is not the only one) to treat you for papaphobia.

The world is weird.

An indigenous library

Great story about an indigenous library using their own categorization system instead of the Dewey Decimal System (which really doesn't work for indigenous topics - I mean it doesn't really work for the modern world either, but that's another story).

What I am wondering, though, is whether they're going far enough. Dewey's system is ultimately rooted in Aristotelian logic and categorization - with a good dash of practical concerns of running a physical library.

Today, these practical concerns can be overcome, and it is unlikely that indigenous approaches to knowledge representation would be rooted in Aristotelian logic. Yes, having your own categorization system is a great first step - but that's like writing your own anthem following the logic of European hymns or creating your own flag following the weird rules of European medieval heraldry. What would it look like if you were really going back to the principles and roots of the people represented in these libraries? Which novel alternatives to representing and categorizing knowledge could we uncover?

Via Jens Ohlig.

How much information is in a language?

About the paper "Humans store about 1.5 megabytes of information during language acquisition", by Francis Mollica and Steven T. Piantadosi.

This is one of those papers that I both love - I find the idea is really worthy of investigation, having an answer to this question would be useful, and the paper is very readable - and can't stand, because the assumptions in the paper are so unconvincing.

The claim is that a natural language can be encoded in ~1.5MB - a little bit more than a floppy disk. And the largest part of this is the lexical semantics (in fact, without the lexical semantics, the rest is less than 62kb, far less than a short novel or book).

They introduce two methods for estimating how many bytes we need to encode the lexical semantics:

Method 1: let's assume 40,000 words in a language (languages have more words, but the assumption in the paper is about how many words one learns before turning 18, and for that 40,000 is probably an OK estimate, although likely on the lower end). If there are 40,000 words, there must be 40,000 meanings in our heads, and lexical semantics is the mapping of words to meanings, and there are only so many possible mappings, and choosing one of those mappings requires 553,809 bits. That's their lower estimate.
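
As a sanity check on that number, here is a minimal Python sketch. It assumes that "possible mappings" means one-to-one assignments of 40,000 word forms to 40,000 meanings, i.e. 40,000! permutations. That reading is mine, and the function name is made up for illustration.

```python
import math


def bits_to_pick_one_mapping(num_words: int) -> float:
    """Bits needed to single out one of num_words! one-to-one mappings."""
    # lgamma(n + 1) equals ln(n!); dividing by ln(2) converts nats to bits.
    return math.lgamma(num_words + 1) / math.log(2)


bits = bits_to_pick_one_mapping(40_000)
print(f"{math.floor(bits):,} bits, roughly {bits / 8 / 1000:.0f} kB")
```

Under that assumption the result should land within a bit of the 553,809 quoted above, which is around 69 kB: tiny compared to the final 1.5MB estimate.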

Wow. I don't even know where to begin in commenting on this. The assumption that all the meanings of words just float in our head until they are anchored by actual word forms is so naive, it's almost cute. Yes, that is likely true for some words. Mother, Father, in the naive sense of a child. Red. Blue. Water. Hot. Sweet. But for a large number of word meanings I think it is safe to assume that without a language those word meanings wouldn't exist. We need language to construct these meanings in the first place, and then to fill them with life. You can't simply attach a word form to that meaning, as the meaning doesn't exist yet, breaking down the assumptions of this first method.

Method 2: let's assume all possible meanings occupy a vector space. Now the question becomes: how big is that vector space, how do we address a single point in that vector space? And then the number of addresses multiplied by how many bits you need for a single address results in how many bits you need to understand the semantics of a whole language. For the number of dimensions, their lower bound is 300 and their upper bound is 500. For the resolution, their lower bound is that you either have a dimension or not, i.e. that only a single bit per dimension is needed, and their upper bound is that you need 2 bits per dimension, so you can grade each dimension a little. I have read quite a few papers with this approach to lexical semantics. For example it defines "girl" as +female, -adult, "boy" as -female, -adult, "bachelor" as +adult, -married, etc.

So they get to 40,000 words x 300 dimensions x 1 bit = 12,000,000 bits, or 1.5MB, as the lower bound of Method 2 (which they then take as the best estimate because it is between the estimate of Method 1 and the upper bound of Method 2), or 40,000 words x 500 dimensions x 2 bits = 40,000,000 bits, or 5MB.
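The arithmetic of Method 2 is easy to check by hand, but here it is as a minimal sketch anyway. The function names are mine, and I am assuming decimal megabytes (1 MB = 1,000,000 bytes, 8 bits per byte).

```python
def lexicon_bits(words: int, dimensions: int, bits_per_dimension: int) -> int:
    """Bits needed to store one point in the vector space for every word."""
    return words * dimensions * bits_per_dimension

def bits_to_megabytes(bits: int) -> float:
    """Convert bits to decimal megabytes (8 bits per byte, 10^6 bytes per MB)."""
    return bits / 8 / 1_000_000

lower = lexicon_bits(40_000, 300, 1)   # one bit per dimension
upper = lexicon_bits(40_000, 500, 2)   # two bits per dimension

print(lower, bits_to_megabytes(lower))  # 12000000 1.5
print(upper, bits_to_megabytes(upper))  # 40000000 5.0
```

So the two bounds of Method 2 work out to 1.5 MB and 5 MB.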

Again, wow. Never mind that there is no place to store the dimensions - what are they, what do they mean? - probably the assumption is that they are, like the meanings in Method 1, stored prelinguistically in our brains and just need to be linked in as dimensions. But also the idea that all meanings expressible in language can fit in this simple vector space. I find that theory surprising.

Again, this reads like a rant, but really, I thoroughly enjoyed this paper, even if I entirely disagree with it. I hope it will inspire other papers with alternative approaches towards estimating these numbers, and I'm very much looking forward to reading them.

Milk consumption in China

Quite disappointed by The Guardian. Here's a (rather) interesting article on the history of milk consumption in China. But the whole article is trying to paint how catastrophic this development might be: the Chinese are trying to triple their intake of milk! That means more cows! That's bad because cows fart us into a hot house!

The argumentation is solid - more cows are indeed problematic. But blaming it on milk consumption in China? Let's take a look at a few numbers omitted from the article, or stuffed into the very last paragraph.

  • On average, a European consumes six times as much milk as a Chinese person. So, even if China achieves its goal and triples average milk consumption, they will drink only half as much as a European.
  • Europe has twice as many dairy cows as China.
  • China is planning to increase their milk output by 300% but only increase resources for that by 30% according to the article. I have no idea how that works, but it sounds like a great deal to me.
  • And why are we even talking about dairy cows? The beef cows in the US and in Europe each outnumber the dairy cows by a fair amount (unsurprisingly - a cow produces quite a lot of milk over a longer time, whereas its meat production is limited to a single event).
  • There are about 13 million dairy cows in China. The US has more than 94 million cattle, Brazil has more than 211 million, and worldwide it's more than 1.4 billion - but hey, it's the Chinese milk cows that are the problem.

Maybe the problem can be located more firmly in the consumption habits of people in the US and in Europe than the "unquenchable thirst of China".

The article is still interesting for a number of other reasons.

Shazam!

Shazam! was fun. And had more heart than many other superhero stories. I liked that, for the first time, a DC universe movie felt like it's organically part of that universe - with all the backpacks with Batman and Superman logos and stuff. That was really neat.

Since I saw him in the first trailer I was looking forward to seeing Steve Carell play the villain. Turns out it was Mark Strong, not Steve Carell. Ah well.

I am not sure the film knew exactly whom it was marketed to. The theater was full of kids, and given the trailers it was clear that the intention was to get as many families into it as possible. But the horror sequences, the graphic violence, the expletives, and the strip club scenes were not exactly for that audience. PG-13 is an appropriate rating.

It was a joy to watch the protagonist and his buddy explore and discover his powers. Colorful, lively, fun. Easily the best scenes of the movie.

The foster family drama gave the movie its heart, but the movie seemed a bit overwhelmed by it. I wish that part was executed a bit better. But then again, it's a superhero movie, and given that, it was far better than many of the other movies of its genre. But as far as high school and family drama superheroes go, it doesn't get anywhere near Spider-Man: Homecoming.

Mid-credits scenes. A tradition that Marvel started and that DC keeps copying - but unlike Marvel, DC hasn't really paid off the teasers in their scenes. And regarding cameos - also something where DC could learn so much from Marvel. Also, what's up with being afraid of naming their heroes? Be it in Man of Steel with Superman or here with Billy, the hero doesn't figure out his name (until the next movie comes along and everybody refers to him as Superman as if it was obvious all the time).

All in all, an enjoyable movie while waiting for Avengers: Endgame, and hopefully a sign that DC is finally getting on the right path.

EMWCon 2019, Day 2

Today was the second day of the Enterprise MediaWiki Conference, EMWCon, in Daly City at the Genesys headquarters.

The day started with my keynote on Wikidata and the Abstract Wikipedia idea. The idea was received very warmly.

The day was filled with stories from people building systems on top of MediaWiki, and in particular Semantic MediaWiki, Cargo, and some Wikibase. This included SFMOMA presenting their system to collaboratively document art, using Cargo and Lua on the League of Legends wiki, running a whole wiki farm for Finnish memory and language institutions, the Lost Plays database, and - what I found particularly impressive - an engineer at NASA who implemented a workflow for document approval including authorization, auditability, and a full Web interface within a mere week, and still thinking that it could have been done much faster.

A common theme was "how incredibly easy it was". Yes, almost everyone mentioned something they got stumped on, and this really points to the community needing maybe more usage on StackOverflow or IRC or something, but in so many use cases, people who were not developers were able to create pretty complex workflows and apps right there in their browsers. This also ties in with the second common theme, that a lot of the deployments of such wikis are often starting "under the radar".

There were also genuinely complex solutions that were using Semantic MediaWiki as a mere component: Matteo Busanelli was presenting a solution that included lifting external data sources, deploying ontologies, reasoning, and all the bells and whistles - a very impressive and powerful architecture.

The US government uses Semantic MediaWiki in many places, most notably Intellipedia used by more than 16 intelligence agencies, Diplopedia by the Department of State, and Powerpedia for the Department of Energy. EPA's Statipedia is no more, but new wikis are popping up in other agencies, such as WikITA for the International Trade Administration, and one for the Nuclear Regulatory Commission. Canada's GCpedia was mentioned with a lot of respect, and the wish that the US would have something similar.

NASA has a whole wiki farm: within mission control alone they had 12 different wikis after a short while, many grown bottom up. They noticed that it would make sense to merge them together - which wasn't easy, neither technically nor legally nor managerially. They found that a lot of their knowledge was misclassified - for example, they classified handbooks which can be bought by anyone on Amazon. One of the biggest changes the wiki caused at NASA was that the merged ISS wiki led to opening more knowledge to more people, and drawing the circles larger. 20% of the people who have access to the wikis actively contribute to the wikis! This is truly impressive.

So far, no edit has been made from space - due to technical issues. But they are working on it.

The day ended with a panel, asking the question where MediaWiki is in the marketplace, and how to grow.

Again, thanks to Yaron Koren and Cindy Cicalese for organizing the conference, and Genesys for hosting us. All presentations are available on YouTube.

EMWCon 2019, Day 1

Today was the first day of the Enterprise MediaWiki Conference, EMWCon, in Daly City. Among the attendees were people from NASA (6 or more people), UIC (International Union of Railways), the UK Ministry of Defence, the US radioactivity safety agencies, cancer research institutes, the Bureau of Labor Statistics, PG&E, General Electric, and a number of companies providing services around MediaWiki, such as WikiTeq, Wikiworks, dokit, etc., with or without semantic extensions. The conference was located at the headquarters of Genesys.

I'm not going to comment on all talks, and also I will not faithfully report on the talks - you can just go to YouTube to watch the talks themselves. The following is a personal, biased view of the first day.

NASA made an interesting comment early on: the discussion was about MediaWiki and its lack of fine-grained access control. You can set up a MediaWiki easily for a controlled group (so that not everyone in the world can access it), but it is not so easy to say "oh, this set of pages is available for people in this group, and managers in that org can access the pages with these markers", etc. So NASA, at first, set up a lot of wiki installations, each one for such specific groups - but eventually turned it all around and instead had a small number of well-defined groups and merged the wikis into them, tearing down barriers within the org and making knowledge more widely available.

Evita Hollis from General Electric had an interesting point in her presentation on how GE does knowledge sharing: they use SharePoint and Yammer to connect people to people, and MediaWiki to connect people to knowledge. MediaWiki has been not-exactly-great at allowing people to work together in real-time - it is a different flow, where you capture and massage knowledge slowly into it. There is a reason why Ops at Wikimedia do not use a wiki during an incident that much, but rather IRC. I think there is a lot of insight in her argument - and if we take that seriously, we could actually really lift MediaWiki to a new level, and take Wikipedia there too.

Another interesting point is that SharePoint at General Electric had three developers, and MediaWiki had one. The question from the audience was whether that reflected how difficult it is to work with SharePoint, or whether that reflected some bias of the company towards SharePoint. Hollis was adamant about how much she likes SharePoint, but the reason for the imbalance was that MediaWiki, particularly Semantic MediaWiki, actually allows much more flexibility and power than SharePoint without having to touch a single line of wiki source code. It is a platform that allows for rapid experimentation by the end user (I am adding the Spider-Man adage about great power coming with great responsibility).

Daren Welsh from NASA talked about many different forms of biases and how they can bubble up on your wiki. Very interesting was one effect: if knowledge from the wiki becomes too readily available, people may start to become dependent on it. They had tests where they took away the wiki randomly from flight controllers in training, in order to ensure they are resourceful enough to still figure out what to do - and some failed miserably.

Ike Hecht had a brilliant presentation on the kind of quick application development Semantic MediaWiki lends itself to. He presented a task manager, a news feed, and a file management system, calling them "Semantic Structures That Do Stuff" - which is basically a few pages for your wiki, instead of creating extensions for all of these. This also resonated with GE's statement about needing fewer developers. I think that this is wildly underutilized and there is a lot of value in this idea.

Thanks to Yaron Koren - who also gave an intro to the topic - and Cindy Cicalese for organizing the conference, and Genesys for hosting us. All presentations are available on YouTube.

EMWCon Spring 2019

I'm honored to be invited to keynote the Enterprise MediaWiki conference in Daly City. The keynote is on Thursday, I will talk about Wikidata and beyond - towards an abstract Wikipedia.

The talk is planned to be recorded, so it should be available afterwards for everyone interested.

Turing Award to Bengio, LeCun, and Hinton

Congratulations to Yoshua Bengio, Yann LeCun, and Geoffrey Hinton on being awarded the Turing Award, the most prestigious award in Computer Science.

Their work has revolutionized huge parts of computer science as it is used in research and industry, and has led to the current impressive results in AI and ML. They continued to work in an area that was deemed unpromising, and that has since swept through whole industries and reshaped them.

Something Positive in German online again

In 2005 and 2006, Ralf Baumgartner and I translated the first few Something Positive comics by R. K. Milholland into German. The 80 comics we translated back then are hereby online again. We translated four more comics, which will also go online one by one over the next few days.

Have fun! Oh, and the comics are for adults.

DSA success probabilities

I always found it exciting to work out how likely it is that a skill check (Talentprobe) in DSA succeeds or fails. Over the years I could not find a sensible closed formula, and so I always stuck with rough estimates. While doing so, I visualized the three dice rolls in my head as the three dimensions of a space, in which one part of the space represents successful checks and the rest of the space represents failed checks.
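The space of three dice rolls is small enough to simply count. The following is only a minimal sketch under simplified rules that I am assuming here and that are not spelled out above: three d20 rolls are each compared with one attribute value, the check succeeds if the combined overshoot of all three rolls is covered by the (non-negative) skill points, and special rules such as critical successes and botches are ignored. The function name and the example values are made up.

```python
from itertools import product

def success_probability(attributes, skill_points):
    """Share of the 20 x 20 x 20 roll space in which the check succeeds.

    Simplifying assumptions (not the full DSA rules): each d20 roll is
    compared with one attribute, the check succeeds if the total overshoot
    of the three rolls is at most skill_points, skill_points is not
    negative, and critical successes and botches are ignored.
    """
    successes = 0
    # Walk through every point of the three-dimensional roll space.
    for rolls in product(range(1, 21), repeat=3):
        overshoot = sum(max(0, roll - attribute)
                        for roll, attribute in zip(rolls, attributes))
        if overshoot <= skill_points:
            successes += 1
    return successes / 20 ** 3

# Hypothetical character: attributes 12, 13, 14 and 5 skill points.
print(success_probability((12, 13, 14), 5))
```

With only 8,000 points in the space, brute-force counting is fast enough that the lack of a closed formula does not really hurt.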

For a long time I thought that it would be interesting to actually visualize this space. In 2010, during a research stay in Los Angeles, I had to learn a few Web technologies - HTML Canvas, jQuery, Blueprint, etc. - and I learn best by doing a small project. So I took this opportunity. Back then DSA4 was current, and accordingly I built the project for the DSA4 rules.

In 2017, Hanno Müller-Kalthoff reworked the visualization and adapted it to the new rules of DSA5. Here are links to both pages and a DSA5 app:

A bitter, better lesson

Rich Sutton is expressing some frustration in his short essay on computation and simple methods beating smart methods again and again.

Rodney Brooks answers with great arguments on why this is not really the case, and how we're just hiding human ingenuity and smartness better.

They're both mostly right, and it was interesting to read the arguments on both sides. And yet, not really new - it's mostly rehashing the arguments from The unreasonable effectiveness of data by Alon Halevy, Peter Norvig, and Fernando Pereira ten years ago. But nicely updated and much shorter. So worth a read!

Wikipedia protests

A number of Wikipedias (German, Danish, Estonian, Czech) are going black today in order to prevent badly made changes to the law. I am proud of the volunteers of the Wikipedias who managed to organize this.

Spring cleaning

Going through my old online presence and cleaning it up is really a trip down memory lane. I am also happy that most - although not all - of the old demos still work. It is going to be fun to release it all again.

Today I discovered that we had four more German translations of Something Positive that we never published. So that's another thing that I am going to publish soon, yay!

Prediction coming true

I saw my post from seven years ago, where I said that I really like Facebook and Google+, but I want a space where I have more control over my content so it doesn't just disappear. "Facebook and Google+ -- maybe they won't disappear in a year. But what about ten?"

And there we go, Google+ is closing in a few days.

I downloaded my data from there (as well as my data from Facebook and Twitter), to see if there is anything to be salvaged from that, but I doubt it.

Restarting, 2019 edition

I had neglected Simia for too long - there were five entries in the last decade. A combination of events led me to pour some effort back into it - and so I want to use this chance to restart it, once again.

Until this weekend, Simia was still running on a 2007 version of Semantic MediaWiki and MediaWiki - which probably helped with Simia getting hacked a year or two ago. Now it is up to date with a current version, and I am trying to consolidate what is already there with some content I had created in other places.

Also, life has been happening. If you have been following me on Facebook (that link only works if you're logged in), you have probably seen some of that. I married, had a child, changed jobs, and moved. I will certainly catch up on this too, but there is no point in doing that all in one huge mega-post. Given that I am thinking about starting a new project just these days, this might be the perfect outlet to accompany that.

I make no promises with regards to the quality of Simia, or the frequency of entries. What I would really love to recreate is a space that is as interesting and fun as my Facebook wall was, before I stopped engaging there earlier this year - but since you cannot even create comments here, I have to figure out how to make this even remotely possible. For now, suggestions on Twitter or Facebook are welcome. And no, moving to WordPress or another platform is hardly an option, as I really want to stay with Semantic MediaWiki - but pointers to interesting MediaWiki extensions are definitely welcome!

Stars in our eyes

I grew up in a suburban, almost rural area in southern Germany, and I remember the hundreds, if not thousands of stars I could see at night. In the summers, which I spent on an island in Croatia, it was even more marvelous, and the dark night sky was breathtaking.

As I grew up, the stars dimmed, and I saw fewer and fewer of them, until only the brightest stars were visible. It was blindingly obvious that air and light pollution had swallowed that every-night miracle and confined it to my memory only.

Until, in my late twenties, I finally accepted it and got glasses. Now the stars are back, as beautiful and brilliant as they have ever been.

Croatian Elections 2016

Croatian elections are upcoming.

The number of Croatians living abroad - in the so-called Croatian diaspora - is estimated to be almost 4 million according to the Croatian state office for Croatians abroad - only a little less than the 4.3 million who live in Croatia. The estimates vary wildly, and most of them actually do not have Croatian citizenship. But it is estimated that between 9 and 10% of holders of Croatian citizenship live abroad.

These 9-10% are represented in the Croatian parliament: out of the 151 Members of Parliament, there are 3 (three) voted by the diaspora. That's 2% of the parliament representing 10% of the population.

In order for a member of the diaspora to vote, they have to register well before the election with their nearest diplomatic mission or consulate. The registration deadline is today, at least for my consulate. But for the election itself, you have to personally appear and vote at the consulate. For me, that would mean driving or flying all the way to Los Angeles from San Francisco. And I am rather close to one of the 9 consulates in the US. There are countries that do not have Croatian embassies at all. Want to vote? Better apply for a travel visa to the country with the nearest embassy. Live in Nigeria? Have a trip to Libya or South Africa. There is no way to vote by mail or - ohwow21stcentury? - electronically. For one of the three Members of Parliament that represent us.

I don't really feel like the parliament wants us to vote. Making the vote mean so little and making it so hard to vote.

Gödel and physics

"A logical paradox at the heart of mathematics and computer science turns out to have implications for the real world, making a basic question about matter fundamentally unanswerable."

I just love this sentence, published in "Nature". It raises (and somehow exposes the author's intuition about) one of the deepest questions in science: how are mathematics, logic, computer science, i.e. the formal sciences, on the one side, and the "real world" on the other side, related? What is the connection between math and reality? The author seems genuinely surprised that logic has "implications for the real world" (never mind that "implication" is a logical term), and seems to struggle with the idea that a counter-intuitive theorem by Gödel, which has been studied and scrutinized for 85 years, would also apply to equations in physics.

Unfortunately the fundamental question does not really get tackled: the work described here, as fascinating as it is, was an intentional, multi-year effort to find a place in the mathematical models used in physics where Gödel can be applied. They are not really discussing the relation between maths and reality, but between pure mathematics and mathematics applied in physics. The original deep question remains unsolved and will befuddle students of math and the natural sciences for years to come, and probably decades (besides Stephen Wolfram, who believes to have it all solved in NKS, but that's another story).

Nature: Paradox at the heart of mathematics makes physics problem unanswerable

Phys.org: Quantum physics problem proved unsolvable: Godel and Turing enter quantum physics

AI is coming, and it will be boring

I was asked about my opinion on this topic, and I thought I would have some profound thoughts on this. But I ended up rambling, and this post doesn’t really make any single strong point. tl;dr: Don’t worry about AI killing all humans. It’s not likely to happen.

In an interview with the BBC, Stephen Hawking stated that “the development of full artificial intelligence could spell the end of the human race”. Whereas this is hard to deny, it is rather trivial: any sufficiently powerful tool could potentially spell the end of the human race given a person who knows how to use that tool in order to achieve such a goal. There are far more dangerous developments - for example, global climate change, the arsenal of nuclear weapons, or an economic system that continues to sharpen inequality and social tension.

AI will be a very powerful tool. Like every powerful tool, it will be highly disruptive. Jobs and whole industries will be destroyed, and a few others will be created. Just as electricity, the car, penicillin, or the internet, AI will profoundly change your everyday life, the global economy, and everything in between. If you want to discuss consequences of AI, here are a few that are more realistic than human extermination: what will happen if AI makes many jobs obsolete? How do we ensure that AIs make choices compliant with our ethical understanding? How to define the idea of privacy in a world where your car is observing you? What does it mean to be human if your toaster is more intelligent than you?

The development of AI will be gradual, and so will the changes in our lives. And as AI keeps developing, things once considered magical will become boring. A watch you could talk to was powered by magic in Disney’s 1991 classic “Beauty and the Beast”, and 23 years later you can buy one for less than a hundred dollars. A self-driving car was the protagonist of the 80s TV show “Knight Rider”, and thirty years later they are driving on the streets of California. A system that checks if a bird is in a picture was considered a five-year research task in September 2014, and less than two months later Google announced a system that can provide captions for pictures - including birds. And these things will become boring in a few years, if not months. We will have to remind ourselves how awesome it is to have a computer in our pocket that is more powerful than the one that got Apollo to the moon and back. That we can make a video of our children playing and send it instantaneously to our parents on another continent. That we can search for any text in almost any book ever written. Technology is like that. What’s exciting today will become boring tomorrow. So will AI.

In the next few years, you will have access to systems that will gradually become capable of answering more and more of your questions. That will offer advice and guidance towards helping you navigate your life towards the goal you tell it. That will be able to sift through text and data and start to draw novel conclusions. They will become increasingly intelligent. And there are two major scenarios that people are afraid of at this point:

  1. That the system will become conscious and develop its own intentions and its own will, and will want to destroy humanity: the Skynet scenario from the Terminator movies.
  2. That the system might get a task, and figure out a novel solution for the task which unfortunately wipes out humanity. This is the paperclip scenario - an AI gets the task to create paperclips, and kills all humans by doing so - which has not yet been turned into a blockbuster.

The Skynet scenario is just mythos. There is no indication that raw intelligence is sufficient to create intrinsic intention or will.

The paperclip scenario is more realistic. And once we get closer to systems with such power, we will need to put the right safeguards in place. The good news is that we will have plenty of AIs at our disposal to help us with that. The bad news is that discussing such scenarios now is premature: we simply don’t know what these systems will look like. That’s like starting a committee a hundred years ago to discuss the danger coming from novel weaponry: no one in 1914 could have predicted nuclear weapons and their risks. It is unlikely that the results of such a committee would have provided much relevant ethical guidance for the Manhattan project three decades later. Why should that be any different today?

In summary: there are plenty of consequences of the development of AI that warrant intensive discussion (economic consequences, ethical decisions made by AIs, etc.), but it is unlikely that they will bring the end of humanity.

Further reading

Published originally on Medium on December 14, 2014

Start the website again

This is no blog anymore. I haven't had entries for years, and even before then sporadically. This is a wiki, but somehow it is not that either. Currently you cannot make comments. Updating the software is a pain in the ass. But I like to have a site where I can publish again. Switch to another CMS? Maybe one day. But I like Semantic MediaWiki. So what will I do? I do not know. But I know I will slowly refresh this page again. Somehow.

A new part of my life is starting soon. And I want to have a platform to talk about it. And as much as I like Facebook or Google+, I like to have some form of control over this platform. Facebook and Google+ -- maybe they won't disappear in a year. But what about ten? Twenty? Fifty years? I'll still be around (I hope), but they might not...

Let's see what will happen here. For now, I republished the retelling of a day as a story I first published on Google+ (My day in Jerusalem) and a poem that feels eerily relevant whenever I think about it (Wenn ich wollte)

Popculture in logics

  1. You ⊑ ∀need.Love (Lennon, 1967)
  2. ⊥ ≣ compare.You (Nelson, 1985)
  3. Cowboy ⊑ ∃sing.SadSadSong (Michaels, 1988)
  4. ∀t : I ⊑ love.You (Parton, 1973)
  5. ∄better.Time ⊓ ∄better⁻¹.Time (Dickens, 1859)
  6. {god} ⊑ Human ⊨ ? (Bazilian, 1995)
  7. Bad(X)? (Jackson, 1987)
  8. ◇(You ⊑ save.I) (Gallagher, 1995)
  9. Dreamer(i). ∃x : Dreamer(x) ∧ (x ≠ i). ◇ ∃t: Dreamer(you). (Lennon, 1971)
  10. Spoon ⊑ ⊥ (Wachowski, 1999)
  11. ¬Cry ← ¬Woman (Marley, 1974)
  12. ∄t (Poe, 1845)

Solutions: Turn around your monitor to read them.

sǝlʇɐǝq ǝɥʇ 'ǝʌol sı pǝǝu noʎ llɐ ˙ǝuo
ǝɔuıɹd ʎq ʎllɐuıƃıɹo sɐʍ ʇı 'ƃuos ǝɥʇ pǝɹǝʌoɔ ʇsnɾ pɐǝuıs ˙noʎ oʇ sǝɹɐdɯoɔ ƃuıɥʇou ˙oʍʇ
˙uosıod ʎq uɹoɥʇ sʇı sɐɥ ǝsoɹ ʎɹǝʌǝ ɯoɹɟ '"ƃuos pɐs pɐs ɐ sƃuıs ʎoqʍoɔ ʎɹǝʌǝ" ˙ǝǝɹɥʇ
ʞɔɐɹʇpunos ǝıʌoɯ pɹɐnƃʎpoq ǝɥʇ ɹoɟ uoʇsnoɥ ʎǝuʇıɥʍ ʎq ɹɐlndod ǝpɐɯ ʇnq uoʇɹɐd ʎllop ʎq ʎllɐuıƃıɹo 'noʎ ǝʌol sʎɐʍlɐ llıʍ ı 'ɹo - ",noʎ, ɟo ǝɔuɐʇsuı uɐ ɥʇıʍ pǝllıɟ ,ǝʌol, ʎʇɹǝdoɹd ɐ ƃuıʌɐɥ" uoıʇdıɹɔsǝp ǝɥʇ ʎq pǝɯnsqns ɯɐ ı 'ʇ sǝɯıʇ llɐ ɹoɟ ˙ɹnoɟ
suǝʞɔıp sǝlɹɐɥɔ ʎq sǝıʇıɔ oʍʇ ɟo ǝlɐʇ ɯoɹɟ sǝɔuǝʇuǝs ƃuıuǝdo ǝɥʇ sı sıɥʇ ˙(ʎʇɹǝdoɹd ǝɥʇ ɟo ǝsɹǝʌuı suɐǝɯ 1- ɟo ɹǝʍod" ǝɥʇ) ǝɯıʇ ɟo ʇsɹoʍ ǝɥʇ sɐʍ ʇı ˙sǝɯıʇ ɟo ʇsǝq ǝɥʇ sɐʍ ʇı ˙ǝʌıɟ
(poƃ)ɟoǝuo ƃuıɯnsqns sn ʎq pǝlıɐʇuǝ sı ʇɐɥʍ sʞsɐ ʇı ʎllɐɔısɐq ˙ʇıɥ ɹǝpuoʍ ʇıɥ ǝuo 5991 ǝɥʇ 'sn ɟo ǝuo sɐʍ poƃ ɟı ʇɐɥʍ ˙xıs
pɐq ǝlƃuıs ʇıɥ ǝɥʇ uı "pɐq s,oɥʍ" ƃuıʞsɐ 'uosʞɔɐɾ lǝɐɥɔıɯ ˙uǝʌǝs
ɔıƃol lɐpoɯ ɯoɹɟ ɹoʇɐɹǝdo ʎılıqıssod ǝɥʇ sı puoɯɐıp ǝɥʇ ˙"ǝɯ ǝʌɐs oʇ ǝuo ǝɥʇ ǝɹ,noʎ ǝqʎɐɯ" ǝuıl ǝɥʇ sɐɥ ʇı ˙sısɐo ʎq 'llɐʍɹǝpuoʍ ˙ʇɥƃıǝ
˙ooʇ ǝuo ǝɹɐ noʎ ǝɹǝɥʍ ǝɯıʇ ɐ sı ǝɹǝɥʇ ǝqʎɐɯ puɐ ˙(ǝɯ ʇou ǝɹɐ sɹǝɥʇo ǝsoɥʇ puɐ 'sɹǝɯɐǝɹp ɹǝɥʇo ǝɹɐ ǝɹǝɥʇ) ǝuo ʎluo ǝɥʇ ʇou ɯɐ ı ʇnq ˙ɹǝɯɐǝɹp ɐ ɯɐ ı" ˙ǝuıƃɐɯı 'uıɐƃɐ uouuǝl uɥoɾ ˙ǝuıu
(ǝlɔɐɹo ǝɥʇ sʇǝǝɯ ǝɥ ǝɹoɟǝq ʇsnɾ oǝu oʇ ƃuıʞɐǝds pıʞ oɥɔʎsd ǝɥʇ) xıɹʇɐɯ ǝıʌoɯ ǝɥʇ ɯoɹɟ ǝʇonb ssɐlɔ ˙uoods ou sı ǝɹǝɥʇ ˙uǝʇ
ʎǝuoɯ ǝɯos sʇǝƃ puǝıɹɟ sıɥ os ƃuıʎl ʎlqɐqoɹd sɐʍ ǝɥ ʇnq 'puǝıɹɟ ɐ oʇ sɔıɹʎl ǝɥʇ pǝʇnqıɹʇʇɐ ʎǝlɹɐɯ ˙"ʎɹɔ ʇou" sʍolloɟ "uɐɯoʍ ʇou" ɯoɹɟ ˙uǝʌǝlǝ
ǝod uɐllɐ ɹɐƃpǝ ʎq '"uǝʌɐɹ ǝɥʇ" ɯoɹɟ ɥʇonb ˙ǝɹoɯɹǝʌǝu :ɹo ˙ǝɯıʇ ou sı ǝɹǝɥʇ ˙ǝʌlǝʍʇ

My horoscope for today

Here's my horoscope for today:

You may be overly concerned with how your current job prevents you from reaching your long-term goals. The Sun's entry into your 9th House of Big Ideas can work against your efficiency by distracting you with philosophical discussions about the purpose of life. These conversations may be fun, but they should be kept to a minimum until you finish your work.

(from Tarot.com via iGoogle)

How the heck did they know??

England eagerly lacking confidence

My Google Alerts just sent me the following news alert about Croatia. At least the reporters checked all their sources :)

England players lacking confidence against Croatia International Herald Tribune - France AP ZAGREB, Croatia: England's players confessed to a lack of confidence when they took on football's No. 186-ranked nation in their opening World Cup ...

England eager to break Croatia run Reuters UK - UK By Igor Ilic ZAGREB (Reuters) - England hope to put behind their gloomy recent experiences against Croatia when they travel to Zagreb on Wednesday for an ...

Beating the Second Law

Yihong Ding has an interesting blogpost drawing analogies to the laws of thermodynamics and explaining why this means trouble for the Semantic Web.

I disagree in one aspect: I think it is possible to invest the required amount of human power into the system and still keep it going. I can't nail it down exactly -- I didn't read "Programming the Universe" yet, so I can't really discuss it, but the feeling goes along the following lines: the value of a network increases superlinearly, if not even quadratically (Metcalfe's Law), whereas the amount of information increases sublinearly (due to redundancies in human knowledge). Or, to put it another way: get more people and Wikipedia or Linux gets better, because they have a constrained scope. The more you constrain the scope, the more value is added by more people.

This is an oversimplification.

Blogging from an E90

28 May 2008

After pondering it for far too long, I finally got a new mobile phone: a Nokia E90. It is pretty big and heavy, but I don't mind really. I am looking at it as a light-weight laptop replacement. But I am not sure I will learn to love the keyboard, really. Experimenting.

But since it has a full keyboard, programming in Python is indeed an option. I had Python on my previous phone too, but heck, T9 is not cool for typing code.

One world. One web.

I am in Beijing at the opening of the WWW2008 conference. Like all WWWs I have been to before, it is amazing. The opening ceremony was preceded by a beautiful dance, combining tons of symbols. First a woman in a traditional Chinese dress, then eight dancers in astronaut uniforms, a big red flag with "Welcome to Beijing" on it (but not on the other side, when he came back), and then all of them together... beautiful.

Boris Motik's paper is a best paper candidate! Yay! Congratulations.

I'd rather listen to the keynote now :)

Blogging from my XO One.

Certificate of Coolness

Now that the Cool URIs for the Semantic Web note by Richard and Leo has been published -- congratulations, guys! -- I am looking forward to seeing whether anyone will create a nice badge and a procedure to get official Certificates of Coolness. Pretty please?

On a different note: I know, I should have blogged from New Zealand. It sure was beautiful. Maybe I will still blog about it a bit later. My sister has blogged extensively, and also took a few great pictures; take a look over there if you're interested.

Coming to New Zealand

Yes! Three weeks of vacation in New Zealand, which is rumoured to be quite a beauty. This also means: three weeks no work, no projects, no thesis, no Semantic We...

Oh, almost. Actually I will enjoy the opportunity to give a talk on Semantic Wikipedia while staying in Auckland. If you're around, you may want to come by.

It is on February 22nd, 1pm-2pm at the AUT. You may want to tell Dave Parry that you're coming, he is my host.

Looking forward to this trip a lot!

Charlie Wilson's War

An ordinary congressman (Tom Hanks). A very right-wing, rich Texan woman (Julia Roberts). An extremely good CIA agent (Philip Seymour Hoffman). And Soviets marching into Afghanistan during the Cold War. America has to defend itself, even at the Hindu Kush!

The film deals with the Afghan war (the one from the 1980s), and it is about very serious topics. It is also based on true events. Nevertheless it wraps all of this in absurd humour, gives us charming antiheroes, and finally also hints at how the next Afghanistan war (the one from the 2000s) could come about, though that is hardly the topic of this film.

Like The Hunting Party before it, this is a film that treats the truths it stands for with sensitivity, and denounces the grievances with all the coarser humour and clarity. The film pulls off a remarkable balancing act in that it does not simply equate the split of the political spectrum into right and left with evil and good, as Michael Moore likes to do, but, much like Team America, dishes out to both sides -- praise as well as criticism.

We saw the film yesterday at the sneak preview, unfortunately in the original version -- the Texan dialects in particular were really hard to understand, which is probably why one gag or another got lost. I hope to see it in the German dub as well.

Charlie Wilson's War (German title: Der Krieg des Charlie Wilson) opens in German cinemas on February 7, 2008.

Rating: 4 out of 5

7 years

It is exactly seven years today since I opened this website. And today it is being renamed! Over the last few weeks I moved the software completely to Semantic MediaWiki, which makes maintaining the site much easier for me than ever before.

Over the next few weeks I want to slowly but surely re-upload the old content that has been missing since a hacker attack on Nodix.

There is one important change, though (not the background colour, that has stayed): the name of the website has changed. No more Nodix, from now on the site is called Simia. And you will notice that many of the pages are in English, others in German, simply because by now much of what I do is in English. I hope this doesn't scare anyone off. There is still plenty of German-language content.

And what unfortunately does not work yet is commenting, sorry. That means that for the time being it is only possible by email. I am working on it.

Making childhood dreams come true

Randy Pausch is a professor of user interfaces at CMU, one of the best-known universities in the USA. In September 2006 he was diagnosed with pancreatic cancer. Since then he has been fighting for every day.

In CMU's lecture series Journeys, which Randy opened with his talk, the speakers are asked to consider what they would tell the audience if this were their last chance to give a talk. Their legacy, so to speak.

The talk -- even though it runs just under an hour and a half -- briskly and entertainingly presents Randy's childhood dreams, and how they came true, or did not. He tells many anecdotes and sums up important pieces of wisdom.

The video of the talk, with subtitles in German or English, is available on Google Video. Worth watching.

Darjeeling Limited

A wonderful film. Even though the Sympatexter was apparently not too thrilled, I liked it a lot. If you liked Wes Anderson's other films (especially The Royal Tenenbaums and The Life Aquatic with Steve Zissou), you will enjoy Darjeeling Limited a great deal too.

The film is colourful, has a witty soundtrack, countless quirky situations, and every now and then some very deep food for thought. Anderson stages his actors superbly, enchants with the wonderful quirks of the three brothers, and lets you leave the cinema certain that your own family is not nearly as crazy as you always assumed. The three brothers on their way through India can only find themselves once they have found each other -- and that only becomes possible after they run face first into a real stroke of fate.

Rating: 5 out of 5

Social Web and Knowledge Management

Obviously, the social web is coming. And it's also coming to this year's WWW conference in Beijing!

I find this topic very interesting. The SWKM picks up the theme of last year's very successful CKC2007 workshop, also at the WWW, where we aimed at enabling collaborative knowledge construction. The SWKM is a bit broader, since it is not just about knowledge construction, but about the whole topic of knowledge management, and how the web changes everything.

If you are interested in the social web, or the semantic web, or specifically in the intersection of these two, and how it can be applied to knowledge management within or outside an organisation, you will like the SWKM workshop at the WWW2008. You can submit papers until January 21st, 2008. All information can be found at the Social Web and Knowledge Management workshop website.

Semantic MediaWiki 1.0 released

After about two years of development and already with installations all over the world, we are very happy to announce the release of Version 1.0 of Semantic MediaWiki, and thus the first stable version. No alpha, no beta, it's out now, and we think you can use it productively. Markus managed to release it in 2007 (on the last day of the year), and it has moved far beyond what 0.7 was, in stability, features, and performance. The biggest change is a completely new ask syntax, much more powerful since it works much more smoothly with MediaWiki's other systems like the parser functions, and we keep baffling ourselves with what is possible with the new system.

We have finally reached a point where we can say, OK, let's go for massive user testing. We want big and heavily used installations to test our system. We are fully aware that the full power of the queries can easily kill an installation, but there are many ways to tweak performance and expressivity. We are now highly interested in performance reports, and then in moving towards our actual goal, Wikipedia.

A lot has changed. You can find a full list of changes in the release notes. And you can download and install Semantic MediaWiki from SourceForge. Spread the word!

There is still a lot to do. We have plenty of ideas for how to make it more useful, and our users and co-developers also seem to have plenty of ideas. It is great fun to see the number of contributors to the code increase, and also to see the mailing lists being very lively. Personally, I am very happy to see Semantic MediaWiki flourish as it does, and I am thankful to Markus for starting this year (or rather ending the last) with such a great step.

Welcome to Simia

Welcome to Simia, the new website of Denny Vrandecic. After not writing anything in my blog for what feels like three ages, and not putting any new content on my pages since the dawn of time, I can now tell you that it was because I wanted to switch over all the technology.

On which I have finally made good progress. At the moment you can find all the blog entries from Nodix and the comments here. The function for creating new comments does not work yet, but I am working on it. You will also notice that considerably more of the site's content is in English than before.

Technically speaking, Simia is a Semantic MediaWiki installation. This makes the blog part of my research as well, as I want to gather a little first-hand experience of what it is like to run one's blog and personal homepage with Semantic MediaWiki. (In that sense it is of course no longer a blog but a so-called bliki, but who cares?) And since the whole thing is semantic, I want to find out how such a personal website fits into the Semantic Web...

To stay up to date, there are a number of feeds on Simia. Pick whichever you want. Best wishes, and I hope you munched your way nicely through the Christmas season! :)

San Francisco and Challenges

Time has been running totally crazy on me in the last few weeks. Right now I am in San Francisco -- if you'd like to suggest a meeting, drop me a line.

The CKC Challenge is going on, and going well! If you haven't had the time yet, check it out! Everybody is speaking about how to foster communities for shared knowledge building; this challenge is actually doing it, and we hope to get some good numbers and figures out of it. And fun -- there is a mystery prize involved! Hope to see as many of you as possible at the CKC 2007 in a few days!

Yet another challenge with prizes is going on at Centiare. Believe it or not, you can actually make money by using a Semantic MediaWiki, with the Centiare Prize 2007. Read more there.

First look at Freebase

I got the chance to get a close look at Freebase (thanks, Robert!). And I must say -- I'm impressed. Sure, the system is still not ready, and you notice small glitches happening here and there, but that's not what I was looking for. What I really wanted to understand is the idea behind the system, how it works -- and, since it was mentioned together with Semantic MediaWiki once or twice, I wanted to see how the systems compare.

So, now here are my first impressions. I will surely play around with the system some more!

Freebase is a database with a flexible schema and a very user-friendly web front end. The data in the database is offered via an API, so that information from Freebase can be included in external applications. The web front end looks nice, is intuitive for simple things, and works for the not so simple things. In the background you basically have a huge graph, and the user surfs from node to node. Everything can be interconnected with named links, called properties. Individuals are called topics. Every topic can have a multitude of types: Arnold Schwarzenegger is of type politician, person, actor, and more. Every such type has a number of associated properties, each of which can point to a value, another topic, or a compound value (that's their solution for n-ary relations; it's basically an intermediate node). So the type politician adds the party, the office, etc. to Arnold, actor adds movies, and person adds the family relationships and dates of birth and death. (I felt existentially challenged after I created my user page: the system created a page about me inside Freebase, and there I had to deal with the system asking me for my date of death.)
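Here is how I picture that data model, as a minimal Python sketch. The class, method, and property names are all hypothetical and only mirror the description above; this is not Freebase's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union

# Hypothetical names throughout: this is not Freebase's real schema or API.
# A property value is a literal, another topic, or a compound value.
Value = Union[str, "Topic", Dict[str, object]]


@dataclass
class Topic:
    name: str
    types: Set[str] = field(default_factory=set)
    properties: Dict[str, List[Value]] = field(default_factory=dict)

    def add(self, type_name: str, prop: str, value: Value) -> None:
        """Give the topic a type and store a value under one of that type's properties."""
        self.types.add(type_name)
        self.properties.setdefault(f"{type_name}.{prop}", []).append(value)


arnold = Topic("Arnold Schwarzenegger")
arnold.add("person", "date_of_birth", "1947-07-30")          # plain value
arnold.add("politician", "party", Topic("Republican Party"))  # link to another topic
# Compound value: an intermediate node bundling several facts (an n-ary relation).
arnold.add("politician", "office_held",
           {"office": "Governor of California", "from": "2003"})
arnold.add("actor", "film", Topic("The Terminator"))

print(sorted(arnold.types))
```

The point of the sketch is that the properties hang off the types: a topic only gets a party because it is a politician, and only gets films because it is an actor.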

It is easy to see that types are crucial for the system to work. Are they the right types to be used? Do they cover the right things? Are they interconnected well? How do the types play together? A set of types and their properties form a domain, like actor, movie, director, etc. forming the domain "film", or album, track, musician, band forming the domain "music". A domain is administered by a group of users who care about that domain, and they decide on the properties and types. You can easily see ontology engineering par excellence going on here, done in a collaborative fashion.

Everyone can create new types, but in the beginning they belong to your personal domain. You may still use them as you like, and so may others. If your types, or your domain, turn out to be of interest, they may be promoted to a common domain. Obviously, since they are still in alpha, there is not yet too much experience with how this works out with the community, but time will tell.

Unsurprisingly, I am also very happy that Metaweb's Jamie Taylor will give an invited talk at the CKC2007 workshop in Banff in May.

The API is based on JSON, and offers a powerful query language to get the knowledge you need out of Freebase. The description is so good that I bet it will find almost immediate uptake. That's one of the things the Semantic Web community, including myself, has not yet managed to do too well: selling it to the hackers. Look at this API description for how it is done! Reading it, I wanted to start hacking right away. They also provide a few nice "featured" applications, like the Freebase movie game. I guess you can play it even without a Freebase account. It's fun, and it shows how to reuse the knowledge from Freebase. And they made some good tutorial movies.
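To give a feeling for why a JSON-based query language is so attractive to hackers, here is a toy query-by-example matcher in Python: you send a JSON template, and the empty (null) slots get filled in. This is only a sketch in the spirit of such an API. It is not the Freebase query language, and the data, field names, and functions are made up for illustration.

```python
import json
from typing import Any, Dict, List, Optional

# Toy in-memory data; the record layout and field names are assumptions.
TOPICS: List[Dict[str, Any]] = [
    {"type": "film", "name": "The Terminator", "year": 1984,
     "actor": "Arnold Schwarzenegger"},
    {"type": "film", "name": "Twins", "year": 1988,
     "actor": "Arnold Schwarzenegger"},
    {"type": "album", "name": "Synchronicity", "year": 1983,
     "artist": "The Police"},
]


def match(template: Dict[str, Any], topic: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the template with its null slots filled from topic, or None if no match."""
    result = {}
    for key, wanted in template.items():
        if key not in topic:
            return None  # the topic lacks a requested property
        if wanted is not None and topic[key] != wanted:
            return None  # a fixed value in the template does not match
        result[key] = topic[key]
    return result


def query(template_json: str) -> List[Dict[str, Any]]:
    """Run a JSON query-by-example template against TOPICS."""
    template = json.loads(template_json)
    hits = (match(template, topic) for topic in TOPICS)
    return [hit for hit in hits if hit is not None]


films = query('{"type": "film", "actor": "Arnold Schwarzenegger", "name": null}')
print([film["name"] for film in films])
```

The query has the same shape as the answer, which is a big part of what makes this style so easy to pick up.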

So, what are the differences to Semantic MediaWiki? Well, there are quite a lot. First, Semantic MediaWiki is totally open source; Metaweb, the system Freebase runs on, seems not to be. Well, if you ask me, Metaweb (also the name of the company) will probably want to sell Metaweb to companies. And if you ask me again, these companies will be getting a great deal, because this may replace many current databases and many of the problems people have with them due to their rigid structure. So it may be a good idea to keep the source closed. On the web, since Freebase is free, only a tiny number of users will care that the source of Metaweb is not free, anyway.

But now, on the content side: Semantic MediaWiki is a wiki that has some features to structure the wiki content with a flexible, collaboratively editable vocabulary. Metaweb is a database with a flexible, collaboratively editable schema. Semantic MediaWiki makes it easier to extend the vocabulary than Metaweb does (just type a new relation); Metaweb, on the other hand, makes instantiating the schema much easier because of its form-based user interface and autocompletion. Metaweb is about structured data, even though the structure is flexible and changing. Semantic MediaWiki is about unstructured data that can be enhanced with some structure between blobs of unstructured data, basically, text. Metaweb is actually much closer to a wiki like OntoWiki. Notice the name similarity of the domains: freebase.com (Metaweb) and 3ba.se (OntoWiki).

The query language that Metaweb brings along, MQL, seems to be almost exactly as powerful as the query language in Semantic MediaWiki. Our design has been driven by usability and scalability, and it seems that both teams arrived at basically the same conclusions. Just a funny coincidence? Both query languages are considerably weaker than SPARQL.

One last difference is that Semantic MediaWiki is fully standards based. We export all data in RDF and OWL. Standard-compliant tools can simply load our data, and there are tons of tools that can work with it, and numerous libraries in dozens of programming languages. Metaweb? No standard. A completely new vocabulary, a completely new API, but beautifully described. But due to the many similarities to Semantic Web standards, I would be surprised if there wasn't a mapping to RDF/OWL even before Freebase goes fully public. For all who know Semantic Web or Semantic MediaWiki, I tried to create a little dictionary of Semantic Web terms.

All in all, I am looking forward to seeing Freebase fully deployed! This is the most exciting Web thingy of 2007 so far, even after Yahoo! Pipes, and that was a tough one to beat.
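To illustrate what "standards based" buys you, here is a minimal Python sketch that writes a few facts as N-Triples, one of the simplest RDF serializations. The namespace is a placeholder and the helper names are mine; this is not Semantic MediaWiki's actual export code, and the escaping covers only backslashes and double quotes.

```python
from typing import Iterable, Tuple

BASE = "http://example.org/wiki/"  # placeholder namespace, an assumption

# (subject, property, value, value_is_literal)
Fact = Tuple[str, str, str, bool]


def uri(name: str) -> str:
    """Turn a page or property name into a URI reference (only spaces are handled)."""
    return "<" + BASE + name.replace(" ", "_") + ">"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(facts: Iterable[Fact]) -> str:
    """Serialize facts as one 'subject predicate object .' line each."""
    lines = []
    for subject, prop, value, is_literal in facts:
        obj = literal(value) if is_literal else uri(value)
        lines.append(f"{uri(subject)} {uri(prop)} {obj} .")
    return "\n".join(lines)


output = to_ntriples([
    ("Arnold Schwarzenegger", "type", "Politician", False),
    ("Arnold Schwarzenegger", "name", "Arnold Schwarzenegger", True),
])
print(output)
```

Any RDF tool should be able to load lines of this shape, which is exactly the benefit a new vocabulary and a new API do not give you for free.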


Comments are still missing on this post.

The benefit of Semantic MediaWiki

I can't comment on Tim O'Reilly's blog right now it seems, maybe my answer is too long, or it has too many links, or whatever. It only took some time, my mistake. He blogged about Semantic MediaWiki -- yaay! I'm a fanboy, really -- but he asks "but why hasn't this approach taken off? Because there's no immediate benefit to the user." So I wanted to answer that.

"About Semantic MediaWiki, you ask, "why hasn't this approach taken off?" Well, because we're still hacking :) But besides that, there is a growing number of pages who actually use our beta software, which we are very thankful to (because of all the great feedback). Take a look at discourseDB for example. Great work there!

You give the following answer to your question: "Because there's no immediate benefit". Actually, there is benefit inside the wiki: you can ask for the knowledge that you have made explicit within the wiki. So the idea is that you can make automatic tables like this list of Kings of Judah from the Bible wiki, or this list of upcoming conferences, including a nice timeline visualization. This is immediate benefit for wiki editors: they don't have to make pages like these examples (1, 2, 3, 4, 5, or any of these) by hand. Here's where we harness self-interest: wiki editors need to put in less work in order to achieve the same quality of information. Data needs to be entered only once. And as it is accessible to external scripts with standard tools, they can even write scripts to check the correctness, or at least some form of consistency, of the data in the wiki, and they are able to aggregate the data within the wiki and display it in a nice way. We are using it very successfully for our internal knowledge management, where we can simply grab the data and redisplay it as needed. Basically, like a wiki with a bit more DB functionality.

I will refrain from comparing to Freebase, because I haven't seen it yet -- but from what I heard from Robert Cook it seems that we are partially complementary to it. I hope to see it soon :)"
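The "enter the data once, get the tables for free" point from my comment can be sketched in a few lines of Python. The sketch pulls annotations in the `[[property::value]]` style out of page text and builds a table from them. The pages, property names, and the `ask` helper are made up for illustration, and the parsing is heavily simplified compared to what the real extension does.

```python
import re
from typing import Dict, List

# Matches annotations of the form [[property::value]]; simplified for this sketch.
ANNOTATION = re.compile(r"\[\[([^:\]|]+)::([^\]|]+)\]\]")

# Made-up wiki pages; titles, properties, and values are placeholders.
PAGES: Dict[str, str] = {
    "Conference A": "Held in [[location::Banff]], deadline [[deadline::2007-02-02]].",
    "Conference B": "Held in [[location::Beijing]], deadline [[deadline::2008-01-21]].",
    "Some essay": "No annotations here, just [[a plain link]].",
}


def annotations(text: str) -> Dict[str, str]:
    """Extract property/value pairs from the annotations in a page's text."""
    return {prop.strip(): value.strip() for prop, value in ANNOTATION.findall(text)}


def ask(pages: Dict[str, str], required: str, columns: List[str]) -> List[List[str]]:
    """List every page that has the required property, with the requested columns."""
    rows = []
    for title in sorted(pages):
        facts = annotations(pages[title])
        if required in facts:
            rows.append([title] + [facts.get(column, "") for column in columns])
    return rows


table = ask(PAGES, "deadline", ["location", "deadline"])
for row in table:
    print(" | ".join(row))
```

Nobody maintains the table by hand: add an annotated page, and it shows up in the list the next time the query runs.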

Now, I am afraid since my feed's broken this message will not get picked up by PlanetRDF, and therefore no one will ever see it, darn! :( And it seems I can't use trackback. I really need to update to a real blogging software.



DL Riddle

Yesterday we stumbled upon quite a hard description logics problem. At least I think it is hard. The question was, why is this ontology unsatisfiable? Just six axioms. The ontology is available in OWL RDF/XML, in PDF (created with the owl tools), and here in Abstract Syntax.

Class(Rigid complete restriction(subclassof allValuesFrom(complementOf(AntiRigid))))
Class(NonRigid partial)
DisjointClasses(NonRigid Rigid)
ObjectProperty(subclassof Transitive)
Individual(publishedMaterial type(NonRigid))
Individual(issue type(Rigid) value(subclassof publishedMaterial))

So, the question is, why is this ontology unsatisfiable? It is even a minimal unsatisfiable subset: remove any one of the axioms and you get a satisfiable ontology. Maybe you'd like to use it to test your students. Or yourself. The debugger in SWOOP actually gave me the right hint, but it didn't offer the full explanation. I figured it out after a few minutes of hard thinking (so, now you know how bad I am at DL).

Do you know? (I'll post the answer in the comments if no one else does in a few days)

(Just in case you wonder, this ontology is based on the OntOWLClean ontology from Chris Welty, see his paper at FOIS2006 if you'd like more info)



On the power of bloggers

sympatexter has written an entry arguing that bloggers like to take themselves too seriously (in reply to a piece by Robert Basic, who writes that bloggers do not yet take themselves seriously enough). As an aside: it is amusing to see sympatexter of all people pointing out this problem, especially since the tagline of their own blog is sympatexter rules the world. (My mistake, sorry)

What is the point of blogging? That would perhaps lead too far. But I would like to take a closer look at some of sympatexter's arguments:

  • "What interests bloggers unfortunately interests ONLY bloggers." Not quite true - or at least I would like to see more evidence for it. The advertising industry regards bloggers as multipliers - a property they can only have if more people read blogs than write them. Besides, many blog about topics of general interest, from Lost and Britney Spears, consumer experiences with products and services, and the Bundestag elections, to human rights violations in Guantanamo or first-hand reports from crisis regions in the Middle East or Thailand. Don't believe it? Look on Technorati, they have a current list of popular topics. Today's favourites: the Oscars, Antonella Barba, and Al Gore. All topics that are relevant outside the blogosphere too.
  • I agree with the observation about the statistics. The numbers quoted in the media are often misleading, but that is a property of statistics and media. If you trace the numbers back to their source, you will often be disappointed.
  • "Very few people in Germany read blogs." Here too I would like to see numbers. I am sure that a large part of the web-using population has read a blog at some point, simply because their search engine queries often land them on blog entries. Perhaps these readers are not even aware that they are reading a blog (just as the share of Wikipedia readers who do not know that Wikipedia can be edited by anyone has grown considerably). Some of my most visited entries have to do with cooking rice pudding, the dealings of the Kleeblatt-Verlag, and films. The people who read those are not the usual readers of my blog -- but an important share.
  • "Most blogs do not even get three-digit hits per day and are mostly read by friends." Agreed, and at the same time the question: so what? I do expect that this blog here is really only read by people who know me. That may be different for individual posts, but in general it holds. And what I write mostly interests only these few -- if at all. But that is OK. Blogs are often used to simplify communication with friends and acquaintances, or even family, to make it possible at all, or simply to keep it going. And that is a good thing. Not every blog has to have hundreds of thousands of readers; that would not even be possible. But then as a blogger you also must not expect hundreds of thousands to read your entries and be influenced by them.
  • "Linking to something older than a week is almost blasphemy - so most of it sinks, barely noticed, into the archives." A fair complaint. One should link into the archives more often, and write structured entries that are of long-term interest. Semantic technologies, such as the ones I develop in my own work, are meant to address exactly these problems. A sample chapter on semantic blogs and wikis from a recently published book about wikis and blogs gives some insight into what this could look like. Unfortunately only the first 8 pages are available online. (Warning, advertisement!) Buy the book! (End of advertisement) Such technologies should help make blog entries available when they are relevant. A first taste is offered by Google's Firefox extension Blogger Web Comments.

In the end, though, one argument remains above all: even if few people read it, and even if what bloggers do is far too often navel-gazing -- this entry included, ironically -- blogging is a technique that, for the first time in human history, actually gives so many people a concrete way to have an active voice. Whether what these people do with it is good or not is to be decided case by case. But the mere fact that today little Gretchen from Hintertupfingen can upload her hand-scrawled pictures, and that they are immediately accessible worldwide, is a step on the way to a global society. A small one, yes, but a necessary and important one.

Talk in Korea

If you're around this Tuesday, February 13th, in Seoul, come by the Semantic Web 2.0 conference. I had the honour of being invited to give a talk on the Semantic Wikipedia (where a lot is happening right now; I will blog about this when I come back from Korea, and when the stuff gets fixed).

Looking forward to seeing you there!

Mail problems

The last two days my mail account had trouble. If you could not send something to me, sorry! Now it should work again.

Since it is hard to guess who tried to email me in the last two days (I could probably guess three people correctly), I hope to reach some of you this way.

Building knowledge together - extended

In case you did not notice yet -- the CKC2007 Workshop on Social and Collaborative Construction of Structured Knowledge at the WWW2007 got an extended deadline due to a number of requests. So, you have time to rework your submissions or finish yours! Also, the demo submission deadline is coming up. We want to have a shootout of the tools that have been created in the last few years, and get hands-on with the differences, problems, and best ideas.

See you in Banff!

Nutkidz anniversary

The 50th episode of the nutkidz is out! And for the anniversary we came up with something special.

Have fun! And by the way, yes, they are appearing regularly again. Only monthly, admittedly, but regularly all the same. And how many monthly webcomics are there, anyway?

Collaborative Knowledge Construction

The deadline is coming up! This weekend the deadline for submissions to the Workshop on Social and Collaborative Construction of Structured Knowledge at the WWW2007 will be over. And this may easily be the hottest topic of the year, I think: how do people construct knowledge in a community?

Ontologies are meant to be shared conceptualizations -- but how many tools really allow building ontologies in a widely shared manner?

I am especially excited about the challenge that comes along with the workshop, to examine different tools, and to see how they perform. If you have a tool that fits here, write to us.

So, I know you have thought a lot about the topic of collaboratively building knowledge -- write your thoughts down! Send them to us! Come to Banff! Submit to CKC2007!

What a coincidence!

I write to a colleague in the Netherlands. He replies that he will soon be moving to Barcelona for a new position. Not two minutes later Sixt sends me an email with a special offer: hotel and rental car for three days in Barcelona for only X euros.

What a coincidence!

Semantic MediaWiki goes business

... but not with the developers. Harry Chen writes about it, and several places copy the press release about Centiare. Actually, we didn't even know about it, and were a bit surprised to hear that news after our trip to India (which was very exciting, by the way). But that's OK, and actually, it's pretty exciting as well. I wish Centiare all the best! Here is their press release.

They write:

Centiare's founder, Karl Nagel, genuinely feels that the world is on the verge of an enormous breakthrough in MediaWiki applications. He says, "What Microsoft Office has been for the past 15 years, MediaWiki will be for the next fifteen." And Centiare will employ the most robust extension of that software, Semantic MediaWiki.

Wow -- I'd never claim that SMW is the most robust extension of MediaWiki -- there are so many of them, and most of them have a much easier time being robust! But the view of MediaWiki taking the place of Office -- intriguing. Although I'd rather put my bets on stuff like Google Docs (formerly Writely), with some semantic spice added to it. Collaborative knowledge construction will be the next big thing. Really big, I mean. Oh, speaking about that, check out this WWW workshop on collaborative knowledge construction. Deadline is February 2nd, 2007.

Click here for more information about Centiare.



Golden die

No, not a role-playing award; that was the Goldener Becher. Rather, my little sister had picked up a letter from the post office while I was in India, and now that I am back I have finally opened it, and that is why I cannot write about India now and am devoting myself to this letter instead.

The contents? A die, apparently covered in golden foil, and where the one should be there is a small picture of something barely recognisable. At first I thought it was a Rorschach figure. Since my sister already took part in Hustle-the-Sluff last year, I assume this is something similar. So, off to Google's Blogsearch to look for it, and, what do you know, a hit right away, at Daniel Gramsch's, the artist of Alina Fox.

A comment there reveals that Daniel Rüd has also received such a letter but has not blogged about it yet. Another comment, by Angie, even contains a link showing that the supposed Rorschach test is in fact a dome, that of Charlottenburg Palace in Berlin, with Fortuna on top. Fitting for a die, indeed.

According to the Fuchsbau, the whole thing points to a so-called alternate reality game, a kind of elaborate mix of live-action role-playing and scavenger hunt. Something I have absolutely no time for at the moment, but then again it sounds so exciting that I really feel like it. Only recently I devoured the fascinating book Convergence Culture by Henry Jenkins of MIT CMS, in which he describes such and similar phenomena.

So I would find it exciting to be part of it after all. Off, then, to the website Angie discovered, to dig up more information. Angie, if you are reading this -- how did you track down the site? And who are you?

Who else has received a die?



Happy New Year!

Last year I wrote a long farewell entry for 2005. Not this year, not because the year wasn't good to me -- it was very good to me! -- but simply because I lack the time. I still have to pack. My trip to India starts in less than two hours. No, not for work, purely private. I am slowly getting nervous that I won't manage the packing in time.

2006 was unbelievably good. I was able to reap the fruits of some of my work, and watch other plants keep growing. Further harvests are due in 2007 and 2008. However, it strikes me that I am probably one of the worst bloggers on the planet. Here I am co-writing books, and I don't even mention them here! I will have to make up for that next year. Then again, that also has a bit to do with the planned relaunch of Nodix. A lot of content is still waiting for me to upload it again, but I will only do that once I have set up the new software. I will probably not make the date I was hoping for - Nodix's birthday - which is a pity. But I am digressing again, I really should pack. We will hear from each other again in a few weeks. Maybe I will even tell you about India. And about a few other trips from this year. There would certainly be a thing or two to tell.

My best wishes for the New Year to all readers! A lovely celebration to everyone, a happy New Year, and may a few of your wishes for 2007 come true.

Five things you don't know about me

Well, I don't think I have been tagged yet, but I could be within the next few days (the meme is spreading), and as I won't be here for a while, I decided to strike preemptively. If no one tags me, I will simply claim one of danah's tags.

So, here we go:

  1. I was born without fingernails. They grew after a few weeks. But nevertheless, whenever they wanted to cut my nails when I was a kid, no one could do it alone -- I always panicked and needed to be held down.
  2. Last year, I contributed to four hardcover books. Only one of them was scientific. The rest were modules for Germany's most popular role-playing game, The Dark Eye.
  3. I am a total optimist. OK, you knew that. But you did not know that I actually tend to forget everything bad. Even in songs, I noticed that I only remember the happy lines, and I forget the bad ones.
  4. I co-author a webcomic with my sister, the nutkidz. We don't manage to meet any schedule, but we do have a storyline. I use the characters quite often in my presentations, though.
  5. I still have an account with Ultima Online (although I play only three or four times a year), and I even have a CompuServe Classic account -- basically, because I like the chat software. I did not get rid of my old PC, because it still runs the old CompuServe Information Manager 3.0. I never figured out how to run IRC.

I bet none of you knew all of this! Now, let's tag some people: Max, Valentin, Nick, Elias, Ralf. It's your turn.


Comments are still missing on this post.

Semantic Web patent

Tim Finin and Jim Hendler are asking about the earliest usage of the term Semantic Web. Tim Berners-Lee (who else?) spoke about the need for semantics on the web at the WWW 1994 plenary talk in Geneva, though the term Semantic Web does not appear there directly. Whatever. What rather surprised me, though, was that, when surfing a bit for the term, I discovered that Amit Sheth, host of this year's ISWC, filed a patent on it back in 2000: System and method for creating a Semantic Web. My guess is that this is the oldest patent on it.

The most quickly broken resolution

  1. Have no resolutions.

No resolution can be broken faster than that one.

Supporting disaster relief with semantics

Soenke Ziesche, who has worked on humanitarian projects for the United Nations for the last six years, wrote an article for xml.com on the use of semantic wikis in disaster relief operations. That is a great scenario I never thought about, and basically one of those scenarios I have in mind when I say in my talks: "I'll be surprised if we don't get surprised by how this will be used." I would probably even go so far as to state the following: if nothing unexpected happens with it, the technology was too specific.

Just the thought that semantic technology in general, and maybe even Semantic MediaWiki in particular, could relieve the effects of a natural disaster, or maybe even save a life, is so incredibly exciting and rewarding. Thank you so much, Soenke!

All problems solved

Today I feel a lot like the nameless hero from the PhD comics, and what is currently happening to him (beginning of the storyline, continuation, especially here, and very much like here, but sadly, not at all like here). Today we had Boris Motik visiting the AIFB, who is one of the brightest people on this planet. And he gave us a more than interesting talk on how to integrate OWL with relational databases. What especially interested me was his great work on constraints -- especially since I was working on similar issues, unit tests for ontologies, as I think constraints are crucial for evaluating ontologies.

But Boris just did it much more cleanly, better, and more thoroughly. So, I will dive into his work and try to understand it, to see if there is anything left for me to do, or if I have to refocus. There's still much left, but I am afraid the most interesting part from a theoretical point of view is solved. Or rather, in the name of progress, I am happy it is solved. Let's get on with the next problem.

(I *know* it is my own fault)

Semantic Wikipedia presentations

Last week at Semantics 2006, Markus and I gave talks on Semantic MediaWiki. I was happy to be invited to give one of the keynotes at the event. A lot of people were nice enough to come to me later to tell me how much they liked the talk. And I got a lot of requests for the slides. I decided to upload them, but wanted to clean them up a bit first. I am pretty sure that the slides are not self-sufficient -- they are tailored to my style of presentation a lot. But I added some comments to the slides, so maybe this will help you understand what I tried to say if you were not in Vienna. Find the slides of the Semantics 2006 keynote on Semantic Wikipedia here. Careful, 25 MB.

But a few weeks ago I was at the KMi Podium for an invited talk there. The good thing is, they don't have just the slides, they also have a video of the talk, so this will help much more in understanding the slides. The talk at KMi was a bit more technical and a lot shorter (different audiences, different talks). Have fun!

Role-playing and Web 2.0

Last week in Vienna I gave a talk at Semantics 2006. Afterwards I was asked for a radio interview, and we talked about the Semantic Web, Web 2.0, and similar topics -- my work, basically. The Semantic Web being the web of data, Web 2.0 the participatory web (very roughly).

But suddenly the reporter changed the subject and said that I also work on Germany's most popular role-playing game, Das Schwarze Auge (The Dark Eye). Could I explain to the listeners what role-playing is? And so I explained role-playing as storytelling 2.0 -- participatory storytelling, where the point is to tell a shared story together as a group.

Well, if that isn't a new definition.

Time difference

It is noon in Hawaii, and I am tired! Good grief.

Probably because I am in Karlsruhe.

Semantic MediaWiki 0.6: Timeline support, ask pages, et al.

It has been quite a while since the last release of Semantic MediaWiki, but an enormous amount of work has gone into it. Huge thanks to all contributors, especially Markus, who has written the bulk of the new code, reworked much of the existing code, and pulled together the contributions from the other coders, and to the Simile team for their great Timeline code that we reused. (I lost track, because the last few weeks have seen some travels and a lot of work, especially ISWC2006 and the final review of the SEKT project I am working on. I will blog on SEKT more as soon as some further steps are done).

So, what's new in the second Beta-release of the Semantic MediaWiki? Besides about 2.7 tons of code fixes, usability and performance improvements, we also have a number of neat new features. I will outline just four of them:

  • Timeline support: you know SIMILE's Timeline tool? No? You should. It is like Google Maps for the fourth dimension. Take a look at the Timeline webpage to see some examples. Or at ontoworld's list of upcoming events. Yes, created dynamically out of the wiki data.
  • Ask pages: the simple semantic search was too simple, you think? Now we finally have a semantic search we dare not call simple. Based on the existing Ask Inline Queries, and actually making them fully functional as well, the ask pages let you dynamically query the wiki knowledge base. No more sandbox article editing to get your questions answered. Go for the semantic search, and build your ask queries there. And all retrievable via GET. Yes, you can link to custom-made queries from everywhere!
  • Service links: now all attributes can automatically link to further resources via the service links displayed in the fact box. Sounds abstract? It's not, it's rather a very powerful tool to weave the web tighter together: service links specify how to connect the attribute data to external services that use that data, for example, how to connect geographic coordinates with Yahoo Maps, or ontologies with Swoogle, or movies with IMDb, or books with Amazon, or ... well, you can configure it yourself, so your imagination is the limit.
  • Full RDF export: some people don't like pulling the RDF together from many different pages. Well, go and get the whole RDF export here. There is now a maintenance script included which can be used via a cron job (or manually) to create an RDF dump of the whole data inside the wiki. This is really useful for smaller wikis, and external tools can just take that data and try to use it. By the way, if you have an external tool and reuse the data, we would be happy if you tell us. We are really looking forward to more examples of reuse of data from a Semantic MediaWiki installation!

I am very much looking forward to December, when I can finally join Markus again with the coding and testing. Thank you so very much for your support, interest, and critical and encouraging remarks with regard to Semantic MediaWiki. Grab the code, update your installation, or take the chance and switch your wiki to Semantic MediaWiki.

Just a remark: my preferred way to install both MediaWiki and Semantic MediaWiki is to pull it directly from the SVN instead of taking the releases. It's actually less work and helps you tremendously in keeping up to date.

Semantic Web Challenge 2006 winners

Sorry for the terseness, but I am sitting in the ceremony.

18 submissions. 14 passed the minimal criteria.

Find more information on challenge.semanticweb.org -- list of Finalists, links, etc. See also on ontoworld.

And the winners are ...

3. Enabling Semantic Web communities with DBin: an overview (by Christian Morbidoni, Giovanni Tummarello, Michele Nucci)

2. Foafing the Music: Bridging the semantic gap in music recommendation (by Oscar Celma)

1. MultimediaN E-Culture demonstrator (by Alia Amin, Bob Wielinga, Borys Omelayenko, Guus Schreiber, Jacco van Ossenbruggen, Janneke van Kersen, Jan Wielemaker, Jos Taekema, Laura Hollink, Lynda Hardman, Marco de Niet, Mark van Assem, Michiel Hildebrand, Ronny Siebes, Victor de Boer, Zhisheng Huang)

Congratulations! It is great to have such great projects to show off! :)



ISWC 2008 coming to Karlsruhe

Yeah! ISWC2006 is just starting, and I am really looking forward to it. The schedule looks more than promising, and Semantic MediaWiki is among the finalists for the Semantic Web Challenge! I will write more about this year's ISWC over the next few days.

But, now the news: yesterday it was decided that ISWC2008 will be hosted by the AIFB in Karlsruhe! It's a pleasure and an honor -- and I am certainly looking forward to it. Yeah!



Semantic Web and Web 2.0

I usually don't just point to other blog entries (thus being a bad blogger regarding netiquette), but this time Benjamin Nowack nailed it in his post on the Semantic Web and Web 2.0. I read the article in iX (a popular German computer technology magazine), and I lost quite some respect for the magazine, as there were so many unfounded claims, offhand remarks, and so much bad attitude towards the Semantic Web in the article (and in further articles scattered around the issue) that I thought the publisher was personally set on a crusade. I could go through the article, write a commentary on it, and list the errors, but honestly, I don't see the point. At least it made me appreciate peer review and the scientific method a lot more. The implementation of peer review is flawed as well, but I realize it could be so much worse (and it could be better as well - maybe PLoS is a better implementation of peer review).

So, go to Benji's post and convince yourself: there is no "vs" in Semantic Web and Web 2.0.

Java developers f*** the least

Andrew Newman conducted a brilliant and significant study on how often programmers use f***, and he split it by programming language. Java developers f*** the least, whereas LISP programmers use it on every fourth opportunity. In absolute terms, there are still more Java f***s, but fewer than C++ f***s.

Just to add a further number to the study -- because Andrew inexplicably omitted Python -- here's the data: about 196,000 files / 200 occurrences -> 980. That's the second highest result, placing it between Java and Perl (note that the higher the number, the fewer f***s -- I would have normalized that by taking 1/n, but, fuck, there's always something to complain about).

Note that Google Code Search actually is totally inconsistent with regard to its results. A search for f*** alone returns 600 results, but if you look for f*** in C++ it returns 2000. So, take the numbers with more than a grain of salt. The bad thing is that Google counts are taken as a basis for a growing number of algorithms in NLP and machine learning (I co-authored a paper that does that too). Did anyone compare the results with Yahoo counts or MSN counts or Ask counts or whatever? This is not the best scientific practice, I am afraid. And I committed it too. Darn.
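For the record, the arithmetic behind that number, and the 1/n normalization I would have preferred, can be sketched in a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the only data in it are the two Python counts quoted above.

```python
def files_per_occurrence(files: int, occurrences: int) -> float:
    """The study's metric: how many files you read per hit (higher = cleaner)."""
    return files / occurrences


def occurrences_per_file(files: int, occurrences: int) -> float:
    """The 1/n normalization: hits per file (higher = more swearing)."""
    return occurrences / files


# Counts for Python quoted in this post (approximate search result counts).
PYTHON_FILES = 196_000
PYTHON_OCCURRENCES = 200

ratio = files_per_occurrence(PYTHON_FILES, PYTHON_OCCURRENCES)
rate = occurrences_per_file(PYTHON_FILES, PYTHON_OCCURRENCES)

print(f"files per occurrence: {ratio:.0f}")            # 980
print(f"occurrences per 1000 files: {rate * 1000:.2f}")  # 1.02
```

With the inverted metric, a bigger number means more swearing, which reads more naturally; the ranking of the languages stays the same, just reversed.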

