

The letter Đ

The letter Đ was introduced to Serbo-Croatian by Đuro Daničić, according to Wikipedia. I found that highly amusing, that he introduced the letter that is the first letter in his name.

Wikipedia also claims that he was born Đorđe Popović, and all I can think of is "nah, that can't be right".

That would be like Jebediah Springfield who was born in a cabin that he helped build.

Pastir Loda

Vladimir Nazor is likely the most famous author from the island of Brač, the island my parents are from. His most acclaimed book seems to be Pastir Loda, Loda the Shepherd. It tells the story of a satyr that, through accidents and storms, was stranded on the island of Brač, and how he lived on Brač for the next almost two thousand years.

It is difficult to find many of his works; they are often out of print. And there isn't much available online, either. Since Nazor died in 1949, his works are in the public domain. I acquired a copy of Pastir Loda from an antique book shop in Zagreb, which I then forwarded to a friend in Denmark who has a book scanner, and who scanned the book so I can make the PDF available now.

The book is written in Croatian. There is a German translation, but that won't get into the public domain until 2043 (the translator lived until 1972), and according to WorldCat there is a Czech translation, and according to Wikipedia a Hungarian translation. For both I don't know who the translator is, and so I don't know the copyright status of these translations. I also don't know if the book has ever been translated to other languages.

I wish to find the time to upload and transcribe the content on Wikisource, and then maybe even do a translation of it into English. For now I have uploaded the book to archive.org, and I also make it available on my own website. I want to upload it to Wikimedia Commons, but I immediately stumbled upon the first issue: it seems that to upload it to Commons, the book needs to have been published before 1928 and the author has to have been dead for more than 70 years (I think that should be an or). I am checking on Commons if I can upload it or not.

Until then, here's the Download:


F in Croatian

I was writing some checks to find errors in the lexical data in Wikidata for Croatian, and one of the things I tried was to check whether the letters in the words are all part of the Croatian alphabet. But instead of just taking a list, or writing it down from memory, I looked at the data, and added letter after letter. And then I was surprised to find that the letter "f" only appears in loanwords. I looked it up in the Croatian Encyclopedia, and it simply states that "f" is not a letter of the Old Slavic language.

I was mindblown. I have spoken this language for as long as I can remember, and I didn't notice that there is no "f" except in loanwords. And "f" seems like such a fundamental sound! But no, wrong!

If you speak a Slavic language, do you have the letter "f"?
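The alphabet check described above can be sketched in a few lines of Python. This is only an illustration and not the code I actually ran against Wikidata: the function name and the sample words are made up, and the character set is the standard Croatian Latin alphabet (the digraphs dž, lj and nj are built from characters that are already in it).

```python
import unicodedata

# Single characters of the Croatian Latin alphabet. The digraphs dž, lj, nj
# need no extra entries because d, ž, l, j and n are already included.
CROATIAN_CHARS = set("abcčćdđefghijklmnoprsštuvzž")


def foreign_chars(word):
    """Return the set of characters in `word` that are not Croatian letters."""
    # NFC normalization so that "c" + combining caron is compared as "č".
    normalized = unicodedata.normalize("NFC", word).lower()
    return {ch for ch in normalized if ch not in CROATIAN_CHARS}


# Hypothetical sample words, not taken from the Wikidata lexemes.
for word in ["đak", "šuma", "taxi", "fotografija", "whisky"]:
    extra = foreign_chars(word)
    print(word, "ok" if not extra else "unexpected: " + ", ".join(sorted(extra)))
```

Note that a check like this accepts "f", since it is a regular letter of the modern alphabet. Finding out that it only shows up in loanwords still requires looking at the words themselves.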

Do you hear the people sing?

"Do you hear the people sing, singing the song of angry men..."

Yesterday, a London performance of Les Misérables was interrupted by protesters raising awareness about climate change.

The audience booed.

It seems the audience was unhappy about having to experience protests and unrest during the performance of protests and unrest they wanted to enjoy.

The hypocrisy is rich in this one, but a very well engineered and expected one. But I guess only with the luxury of being detached from the actual event one can afford to enjoy the hypocrisy. I assume that for many people, attending a West End London production of Les Misérables is meant to be a proper highlight of the year, if not more. It's something that children gift their parents for the 30th wedding anniversary. It may be the reason for a trip to London. In addition, attending a performance like this is an escapist act that you don't want interrupted by the problems of the real world. And given that it is a live performance, it seems disrespectful to the cast, to the artists, who pour their lives into their roles.

On the other side, the existential dread about climate change, and the insufficient actions by world leaders seem to demand increasingly bolder action and more attention. We are teaching our kids that they should act if something is not right. And we are telling them about the predictions for climate change. And then we are surprised if they try to do something? The message that climate change will be extremely disruptive to our lives and that we need to act much more decisively has obviously not yet been understood by enough people. And we, humanity, our leaders, elected or not, are most certainly not yet doing enough to try to prevent or at least mitigate the effects of climate change that are starting to roll over us.

It would be good, but admittedly unlikely, if both sides could appreciate the other more. Maybe the audience might be a bit appreciative of seeing the people sing the song of angry men in real life. And maybe the protesters could choose their targets a bit more wisely. Why choose art? There are more disruptive targets if you were to protest the oil industry than a performance of Les Misérables. To be honest, if I were working for the oil industry, these are exactly the kind of actions I would be setting up. And with people who are actually into the cause. That way I can ensure that people will talk about interrupted theater productions and defaced paintings, instead of again having the hottest year in history, of floods, heatwaves, hurricanes, and the thousands of people who already died due to climate change induced catastrophes - and the billions more whose life will be turned upside down.

Immortal relationships

I saw a beautiful meme yesterday that said that from the perspective of a cat or dog, humans are like elves who live for five hundred years and yet aren't afraid to bond with them for their whole life. And it is depicted as beautiful and wholesome.

It's so different from all those stories of immortals, think of Vampires or Highlander or the Sandman, where the immortals get bitter, or live in misery and loss, or become aloof and uncaring about human lives and their short life spans, and where it hurts them more than it does them good.

There seem to be more stories exploring the friendship of immortals with short-lived creatures, be it in Rings of Power with the relationship of Elrond and Durin, be it the relation of Star Trek's Zora with the crew of the Discovery or especially with Craft in the short movie Calypso, or between the Eternal Sersi and Dane Whitman. All these relations seem to be depicted more positively and less tragic.

In my opinion that's a good thing. It highlights the good parts in us that we should aspire to. It shows us what we can be, based in a very common perception, the relationship to our cats and dogs. Stories are magic, in its truest sense. Stories have an influence on the world, they help us understand the world, imagine the impact we can have, explore who we can be. That's why I'm happy to see these more positive takes on that trope compared to the tragic takes of the past.

(I don't know if any of this is true. I think it would require at least some work to actually capture instances of such stories, classify and tally them, to see if that really is the case. I'm not claiming I've done that groundwork; I am just capturing an observation that I'd like to be true, but can't really vouch for.)

Molly Holzschlag (1963-2023)

May her memory be a blessing.

She taught the Web to many, and she fought for the Web of many.

Doug Lenat (1950-2023)

When I started studying computer science, one of the initiation rites was to read the Jargon File. I stumbled when I read the entry on the microlenat:

microlenat: The unit of bogosity. Abbreviated μL, named after Douglas Lenat. Like the farad it is considered far too large a unit for practical use, so bogosity is usually expressed in microlenats.

I had not heard of Douglas Lenat then. English being my third language, I wasn’t sure what bogosity is. So I tried to learn a bit more to understand it, and I read a bit about Cyc and Eurisko, but since I had just started computer science, my mind wasn’t really ready for things such as knowledge representation and common sense reasoning. I had enough on my plate struggling with resistors, electronegativity, and Fourier transformations. Looking back, it is ironic that none of these played a particular role in my future, but knowledge representation sure did.

It took me almost ten years to come back to Cyc and Lenat’s work. I was then studying ontological engineering, a term that according to Wikipedia was coined by Lenat, a fact I wasn’t aware of at that time. I was working with RDF, which was co-developed by Guha, who has worked with Lenat at Cycorp, a fact I wasn’t aware of at that time. I was trying to solve problems that Lenat had tackled decades previously, a fact I wasn’t aware of at that time.

I got to know Cyc through OpenCyc and Cyc Europe, led by Michael Witbrock. I only met Doug Lenat a decade later when I was at Google.

Doug’s aspirations and ambitions had numerous people react with rolling eyes and sneering comments, as can be seen in the entry in the Jargon File. And whereas I might have absorbed similar thoughts as well, they also inspired me. I worked with a few people who told me “consider yourself lucky if you have a dozen people reading your paper, that’s the impact you will likely have”, but I never found that a remotely satisfactory idea. Then there were people like Doug, who shouted out “let’s solve common sense!”, and stormed ahead trying to do so.

His optimism and his bias to action, his can-do attitude, surely influenced me profoundly in choosing my own way forward. Not only once did I feel like I was channeling Lenat when I was talking about knowledge bases that anyone can edit, about libraries of functions anyone can use, or about abstract representations of natural language texts. And as ambitious as these projects have been called, they all carefully avoid the incomparably more ambitious goals Doug had his eyes set on.

And Doug didn’t do it from the comfort of a tenured academic position, but he bet his career and house on it, he founded a company, and kept it running for four decades. I was always saddened that Cyc was kept behind closed doors, and I hope that this will not hinder the impact and legacy it might have, but I understand that this was the magic juice that kept the company running.

One of Doug’s systems, Eurisko, became an inspiration and namesake for an AI system that played the role of the monster of the week in a first season episode of the X-Files, a fact I wasn’t aware of until now. Doug was a founder and advisory member of the TTI/Vanguard series of meetings, to which I was invited to present an early version of Abstract Wikipedia, a fact I wasn’t aware of until now. And I am sure there are more facts about Doug and his work and how it reverberated with mine that I am unaware of still.

Doug was a person ahead of their time, a person who lived, worked on and saw a future about knowledge that is creative, optimistic and inspiring. I do not know if we will ever reach that future, but I do know that Doug Lenat and his work will always be a beacon on our journey forward. Doug Lenat died yesterday in Austin, Texas, two weeks shy of his 73rd birthday, after a battle with cancer.

To state it in CycL, the language Cyc is written in:

 (#$dateOfDeath #$DougLenat "2023-08-31")
 (#$restInPeace #$DougLenat)

Butter

So, I went to the store with Little One today, and couldn't find the butter.

I ask the person at the cheese stand, who points me to the burrata. Tasty, but not what I'm looking for. I ask again and he sends me to the bread section.

I can't find it at the bread section, so I ask the person at the pastries stand where the butter is. She points me to the bagels. I say no, butter. She says, ah, there, pointing to the bathrooms. I'm getting exasperated, and I ask again. She points back to the cheeses with the burrata. I try again. She gets a colleague, and soon they both look confused.

Finally my daughter chimes in, asking for the butter. They immediately point her to the right place and we finally get the butter.

I haven't been so frustrated about my English pronunciation since I tried to buy a thermometer.

The Jones Brothers

The two Jones brothers never got along, but both were too stubborn to leave the family estate. They built out two entrances to the estate, one from the south, near Jefferson Avenue, and the newer, bigger one, closer to the historic downtown, and each brother chose to use one of the entrances exclusively, in order to avoid the other and their family. To the confusion of the local folk (but to the open enjoyment of the high school's grammar teacher, who was, surprisingly for his role, a descriptivist), they named the western gate the Jones' gate, and the southern one the Jones's gate, and the brothers earnestly thought that that settled it.

It didn't.

The Future of Knowledge Graphs in a World of Large Language Models

The Knowledge Graph Conference 2023 in New York City invited me for a keynote on May 11, 2023. Given that basically all conversations these days are about large language models, I gave a talk about my understanding of how knowledge graphs and large language models go together.

After the conference, I did a recording of the talk, giving it one more time, in order to improve the quality of the recording. The talk has gotten more than 10,000 views on YouTube so far, which, for me, is totally astonishing.

I forgot to link it here, so here we go finally:

Hot Skull

I watched Hot Skull on Netflix, a Turkish Science Fiction dystopic series. I knew there was only one season, and no further seasons were planned, so I was expecting that the story would be resolved - but alas, I was wrong. And the book the show is based on is only available in Turkish, so I wouldn't know of a way to figure out how the story ends.

The premise is that there is a "semantic virus", a disease that makes people 'jabber', to talk without meaning (but syntactically correct), and to be unable to convey or process any meaning anymore (not through words, and very limited through acts). They seem also to lose the ability to participate in most parts of society, but they still take care of eating, notice wounds or if their loved ones are in distress, etc. Jabbering is contagious: if you hear someone jabber, you start jabbering as well. Jabberers cannot stop talking, and it quickly became a global pandemic. So they are somehow zombieish, but not entirely, raising questions about them still being human, their rights, etc. The hero of the story is a linguist.

Unfortunately, the story revolves around the (global? national?) institution that tries to bring the pandemic under control, and which has taken over a lot of power (which echoes some of the conspiracy theories of the COVID pandemic), and the fact that this institution is not interested in finding a cure (because going back to the former world would require them to give back the power they gained). The world has slid into economic chaos, e.g. getting chocolate becomes really hard, there seems to be only little international cooperation and transportation going on, but there seems to be enough food (at least in Istanbul, where the story is located). Information about what happened in the rest of the world is rare, but everyone seems affected.

I really enjoyed the very few and rare moments where they explored the semantic virus and what it does to people. Some of them are heart-wrenching, some of them are interesting, and in the end we get indications that there is a yet unknown mystery surrounding the disease. I hope the book at least resolves that, as we will probably never learn how the Netflix show was meant to end. The dystopic parts about a failing society, the whole plot about an "organization taking over the world and secretly fighting a cure", and the resistance to that organization, is tired, not particularly well told, standard dystopic fare.

The story is told very slowly and meanders leisurely. I really like the 'turkishness' shining through in the production: Turkish names, characters eating simit, drinking raki, Istanbul as an (underutilized) background, the respect for elders. This is all very well meshed into the sci-fi story.

No clear recommendation to watch, mostly because the story is unfinished, and there is simply not enough payoff for the lengthy and slow eight episodes. I was curious about the premise, and still would like to know how the story ends, what the authors intended, but it is frustrating that I might never learn.

The right to work

20 May 2023

I've been a friend of Universal Basic Income for thirty years, but in the last twenty years, I have had growing reservations about it, and many questions. This article about an experiment with a right to work was the first text in a while I read on it that substantially impacted my thinking on this (text is in German). I recommend reading it.

Work is not just a source of money, but for many also a source of meaning, pride, structure, motivation, social connections. Having voluntary access to work seems to be one major component that is necessary on a societal level, in addition to a universal basic income that allows that everyone can live in dignity. Note: I think work should be widely construed. If someone has something that fills that need, that's work. Raising children, taking care of a garden, writing a book, refining piano skills, creating art, taking care of others, taking care of yourself, all these easily count as work in my book.

I wish we were as willing and able to experiment with different ways of structuring society as we are willing and able to experiment with technology. We deployed the Internet to the world without worrying about the long term consequences, but we're cautious about giving everyone enough money to not be hungry. That's just broken. I was always disappointed about the fact that sociology and politics as studied and taught by academia were mostly descriptive and not constructive endeavors.

Wikidata - The Making of

19 May 2023

Markus Krötzsch, Lydia Pintscher and I wrote a paper on the history of Wikidata. We published it in the History of the Web track at The Web Conference 2023 in Austin, Texas (what used to be called the WWW conference). This spun out of the Ten years of Wikidata post I published here.

The open access paper is available here as HTML: dl.acm.org/doi/fullHtml/10.1145/3543873.3585579

Here as a PDF: dl.acm.org/doi/pdf/10.1145/3543873.3585579

Here on Wikisource, thanks to Mike Peel for reformatting: Wikisource: Wikidata - The Making Of

Here is a YouTube trailer for the talk: youtu.be/YxWs_BS31QE

And here is the full talk (recreated) on YouTube: youtu.be/P3-nklyrDx4

20 years of editing Wikipedia

11 May 2023

Today it's been exactly twenty years since I made my first edit to Wikipedia. It was about the island of Brač, in the German Wikipedia.

Here is the version of the article I created: Brač (as of May 11, 2003)

How much April 1st?

In my previous post, I was stating that I might miss April 1st entirely this year, and not as a joke, but quite literally. Here I am chronicling how that worked out. We were flying flight NZ7 from San Francisco to Auckland, starting on March 31st and landing on April 2nd, and here we look into far too much detail to see how much time the plane spent in April 1st during that 12 hours 46 minutes flight. There’s a map below to roughly follow the trip.

5:45 UTC / 22:45 31/3 local time / 37.62° N, 122.38° W / PDT / UTC-7

The flight started with taxiing for more than half an hour. We left the gate at 22:14 PDT (doesn’t bode well), and liftoff was at 22:45 PDT. So we had only about an hour of March left at local time. We were soon over the Pacific Ocean, as we would stay for basically the whole flight. Our starting point still had 1 hour 15 minutes left of March 31st, whereas our destination at this time was at 18:45 NZDT on April 1st, so still had 5 hours 15 minutes to go until April 2nd. Amusingly this would also be the night New Zealand switches from daylight saving time (NZDT) to standard time (NZST). Not the other way around, because the seasons are opposite in the southern hemisphere.

6:00 UTC / 23:00 31/3 local time / 37° N, 124° W / PDT / UTC-7

We are still well in the PDT / UTC-7 time zone, which, in general, goes to 127.5° W, so the local time is 23:00 PDT. We keep flying southwest.

6:27 UTC / 22:27 31/3 local time? / 34.7° N, 127.5° W / AKDT? / UTC-8?

About half an hour later, we reach the time zone border, moving out of PDT to AKDT, Alaska Daylight Time, but since Alaska is far away it is unclear whether daylight saving applies here. Also, at this point we are 200 miles (320 km) out on the water, and thus well out of the territorial waters of the US, which go for 12 nautical miles (that is, 14 miles or 22 km), so maybe the daylight saving time in Alaska does not apply and we are in international waters? One way or the other, we moved back in local time: it is suddenly either 22:27 AKDT or even 21:27 UTC-9, depending on whether daylight saving time applies or not. For now, April 1 was pushed further back.

7:00 UTC / 23:00 31/3 local time? / 31.8° N, 131.3° W / AKDT? / UTC-8?

Half an hour later and midnight has reached San Francisco, and April 1st has started there. We were more than 600 miles or 1000 kilometers away from San Francisco, and in local time either at 23:00 AKDT or 22:00 UTC-9. We are still in March, and from here all the way to the Equator and then some, UTC-9 stretches to 142.5° W. We are continuing southwest.

8:00 UTC / 23:00 31/3 local time / 25.2° N, 136.8° W / GAMT / UTC-9

We are halfway between Hawaii and California. If we are indeed in AKDT, it would be midnight - but given that we are so far south, far closer to Hawaii, which does not have daylight saving time, and deep in international waters anyway, it is quite safe to assume that we really are in UTC-9. So local time is 23:00 UTC-9.

9:00 UTC / 0:00 4/1 local time / 17.7° N, 140.9° W / GAMT / UTC-9

There is no denying it, we are still more than a degree away from the safety of UTC-10, the Hawaiian time zone. It is midnight in our local time zone. We are in April 1st. Our plan has failed. But how long would we stay here?

9:32 UTC / 23:32 31/3 local time / 13.8° N, 142.5° W / HST / UTC-10

We have been in April 1st for 32 minutes. Now we cross from UTC-9 to UTC-10. We jump back from April to March, and it is now 23:32 local time. The 45 minutes of delayed take-off would have easily covered for this half hour of April 1st so far. The next goal is to move from UTC-10, but the border of UTC-10 is a bit irregular between Hawaii, Kiribati, and French Polynesia, looking like a hammerhead. In 1994, Kiribati pushed the Line Islands a day forward, in order to be able to claim to be the first ones into the new millennium.

10:00 UTC / 0:00 4/1 local time / 10° N, 144° W / HST / UTC-10

We are pretty deep in HST / UTC-10. It is again midnight local time, and again April 1st starts. How long will we stay there now? For the next two hours, the world will be in three different dates: in UTC-11, for example American Samoa, it is still March 31st. Here in UTC-10 it is April 1st, as it is in most of the world, from New Zealand to California, from Japan to Chile. But in UTC+14, on the Line Islands, 900 miles southwest, it is already April 2nd.

11:00 UTC / 1:00 4/1 local time / 3° N, 148° W / HST / UTC-10

We are somewhere east of the Line Islands. It is now midnight in New Zealand and April 1st has ended there. Even without the delayed start, we would now be solidly in April 1st local time.

11:24 UTC / 1:24 4/1 local time / 0° N, 150° W / HST / UTC-10

We just crossed the equator.

12:00 UTC / 2:00 4/2 local time / 3.7° S, 152.3° W / LINT / UTC+14

The international date line in this region does not go directly north-south, but goes at an angle, so without further calculation it is difficult to exactly say when we crossed the international date line, but it would be very close to this time. So we just went from 2am local time in HST / UTC-10 on April 1st to 2am local time in LINT / UTC+14 on April 2nd! This time, we have been in April 1st for a full two hours.

(Not for the first time, I wish Wikifunctions already existed. I am pretty sure that taking a geocoordinate and returning the respective timezone will be a function that will be available there. There are a number of APIs out there, but none of them seems to provide a Web interface, and they all seem to require a key.)
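In the absence of such a function, here is a minimal sketch of the rule of thumb used in this log for the open ocean: the idealized nautical time zones, each 15° of longitude wide and centered on a multiple of 15°, which is where borders like 127.5° W and 142.5° W come from. The function name is made up for this post, and the sketch deliberately ignores everything political: daylight saving time, national borders, and the detours of the international date line (it would never return UTC+14 for the Line Islands).

```python
def nautical_utc_offset(longitude_deg):
    """Idealized nautical time zone for a longitude (east positive, west negative).

    Returns the whole-hour UTC offset of the 15-degree band the longitude
    falls into. Assumption: no daylight saving time, no political borders,
    and a perfectly straight date line at 180 degrees.
    """
    if not -180.0 <= longitude_deg <= 180.0:
        raise ValueError("longitude must be between -180 and 180 degrees")
    # Bands are centered on multiples of 15 degrees, so shift by half a band
    # before flooring.
    return int((longitude_deg + 7.5) // 15)


# A few longitudes from this log (west is negative).
for lon in (-122.38, -131.3, -140.9, -144.0, 174.8):
    print(lon, "UTC%+d" % nautical_utc_offset(lon))
```

For San Francisco this gives UTC-8, the standard time, not the UTC-7 of PDT, which is exactly the kind of detail a real lookup would need to handle.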

12:44 UTC / 2:44 4/1 local time / 8° S, 156° W / HST / UTC-10

We just crossed the international date line again! Back from Line Island Time we move to French Polynesia, back from UTC+14 to UTC-10 again - which means it switches from 2:44 on April 2nd back to 2:44 on April 1st! For the third time, we go to April 1st - but for the first time we don’t enter it from March 31st, but from April 2nd! We just traveled back in time by a full day.

13:00 UTC / 3:00 4/1 local time / 9.6° S, 157.5° W / HST / UTC-10

We are passing between the Cook Islands and French Polynesia. In New Zealand, daylight saving time ends, and it switches from 3:00 local time in NZDT / UTC+13 to 2:00 local time in NZST / UTC+12. While we keep flying through the time zones, New Zealand moves itself into a different time zone.

14:00 UTC / 4:00 4/1 local time / 15.6° S, 164.5° W / HST / UTC-10

We are now “close” to the Cook Islands, which are associated with New Zealand. Unlike New Zealand, the Cook Islands do not observe daylight saving time, so at least one thing we don’t have to worry about. I find it surprising that the Cook Islands are not in UTC+14 but in UTC-10, considering they are in association with New Zealand. On the other side, making that flip would mean they would literally lose a day. Hmm. That could be one way to avoid an April 1st!

14:27 UTC / 3:27 4/1 local time / 18° S, 167° W / SST / UTC-11

We move from UTC-10 to UTC-11, from 4:27 back to 3:27am, from Cook Island Time to Samoa Standard Time. Which, by the way, is not the time zone in the independent state of Samoa, as they switched to UTC+13 in 2011. Also, all the maps on the UTC articles in Wikipedia (e.g. UTC-12) are out of date, because their maps are from 2008, not reflecting the change of Samoa.

15:00 UTC / 4:00 4/1 local time / 21.3° S, 170.3° W / SST / UTC-11

We are south of Niue and east of Tonga, still east of the international date line, in UTC-11. It is 4am local time (again, just as it was an hour ago). We will not make it to UTC-12, because there is no UTC-12 on these latitudes. The interesting thing about UTC-12 is that, even though no one lives in it, it is relevant for academics all around the world as it is the latest time zone, also called Anywhere-on-Earth, and thus relevant for paper submission deadlines.

15:23 UTC / 3:23 4/2 local time / 23.5° S, 172.5° W / NZST / UTC+12

We crossed the international date line again, for the third and final time for this trip! Which means we move from 4:23 am on April 1st local time in Samoa Standard Time to 3:23 am on April 2nd local time in NZST (New Zealand Standard Time). We have now reached our destination time zone.

16:34 UTC / 4:34 4/2 local time / 30° S, 180° W / NZST / UTC+12

We just crossed from the Western into the Eastern Hemisphere. We are about halfway between New Zealand and Fiji.

17:54 UTC / 5:54 4/2 local time / 37° S, 174.8° E / NZST / UTC+12

We arrived in Auckland. It is 5:54 in the morning, on April 2nd. Back in San Francisco, it is 10:54 in the morning, on April 1st.

april1avoiding.png

Green is March 31st, Red April 1st, Blue April 2nd, local times during the flight.

Basemap https://commons.wikimedia.org/wiki/File:Standard_time_zones_of_the_world_%282012%29_-_Pacific_Centered.svg CC-BY-SA by TimeZonesBoy, based on PD by CIA World Fact Book

Postscript

Altogether, there was not one April 1st, but three stretches of April 1st: first, for 32 minutes before returning to March 31st, then for 2 hours again, then we switched to April 2nd for 44 minutes and returned to April 1st for a final 2 hours and 39 minutes. If I understand it correctly, and I might well not, as thinking about this causes a knot in my brain, the first stretch would have been avoidable with a timely start, the second could have been much shorter, but the third one would only be avoidable with a different and longer flight route, in order to stay west of the international date line, going south around Samoa.

In total, we spent 5 hours and 11 minutes in April 1st, in three separate stretches. Unless Alaskan daylight saving counts in the Northern Pacific, in which case it would be an hour more.
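The total can be rechecked with a short Python sketch. The segment list is transcribed from the log above, with UTC times on April 1st and the UTC offset entered at each point, assuming UTC-9 rather than AKDT for the stretch after 127.5° W. The names SEGMENTS, LANDING and time_in_april_1 are made up for this sketch.

```python
from datetime import datetime, timedelta

# (UTC time on 2023-04-01, UTC offset in hours entered at that time),
# transcribed from the log. Assumption: UTC-9 (no Alaskan daylight saving)
# for the stretch starting at 06:27 UTC.
SEGMENTS = [
    ("05:45", -7),   # liftoff in San Francisco, PDT
    ("06:27", -9),   # crossing 127.5 W
    ("09:32", -10),  # crossing 142.5 W
    ("12:00", 14),   # date line, into Line Islands time
    ("12:44", -10),  # date line, back out again
    ("14:27", -11),  # into Samoa Standard Time
    ("15:23", 12),   # date line, into New Zealand time
]
LANDING = "17:54"


def utc(hhmm):
    """Parse an HH:MM string as a (naive) UTC datetime on 2023-04-01."""
    return datetime.strptime("2023-04-01 " + hhmm, "%Y-%m-%d %H:%M")


def time_in_april_1(segments, landing):
    """Sum the time during which the local date was April 1st."""
    day_start = datetime(2023, 4, 1)
    day_end = datetime(2023, 4, 2)
    starts = [utc(t) for t, _ in segments]
    ends = starts[1:] + [utc(landing)]
    total = timedelta()
    for (_, offset), start, end in zip(segments, starts, ends):
        shift = timedelta(hours=offset)
        # Overlap of the segment, in local time, with local April 1st.
        overlap = min(end + shift, day_end) - max(start + shift, day_start)
        if overlap > timedelta():
            total += overlap
    return total


print(time_in_april_1(SEGMENTS, LANDING))  # 5:11:00
```

Changing the offset of the second segment from -9 to -8 (the AKDT reading) yields 6:11:00, the extra hour mentioned above.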

So, I might not have skipped April 1st entirely this year, but me and the other folks on the plane might well have had the shortest April 1st of anyone on the planet this year.

I totally geeked out on this essay. If you find errors, I would really appreciate corrections. Either in Mastodon, mas.to/@vrandecic, or on Twitter, @vrandecic. Email is the last resort, vrandecic@gmail.com (The map though is just a quick sketch)

One thing I was reminded of is, as Douglas Adams correctly stated, that writing about time travel really messes up your grammar.

The source for the flight data is here:

No April Fool's day

This year, I am going to skip April Fool's day.

I am not being glib, but quite literal.

We are taking flight NZ7 starting on the evening of March 31 in San Francisco, flying over the Pacific Ocean, and will arrive on April 2 in the early morning in Auckland, New Zealand.

Even if one actually follows the flight route and overlays it over the timezone map, it looks very much like we are not going to spend more than a few dozen minutes, or at most a few hours, in April 1, if all goes according to plan.

Looking forward to it!

Here's the flight data of a previous NZ7 flight, from Sunday: https://flightaware.com/live/flight/ANZ7/history/20230327/0410Z/KSFO/NZAA/tracklog

Here are the timezones (but it's Northern winter time). Would be nice to overlay the two maps: 1672px-Standard_time_zones_of_the_world_%282012%29_-_Pacific_Centered.svg.png

Where's Wikifunctions when it's needed?

The question seems to be twofold: how often do we cross the dateline, and how close are we to local time midnight while crossing the dateline. For a perfect date miss one would need to cross the dateline exactly once, at a 24 hour difference, as close as possible to local midnight.
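To illustrate the perfect date miss, here is a small Python sketch of a hypothetical crossing, not the actual NZ7 route: the plane crosses the date line from UTC-10 into UTC+14 exactly at local midnight. The function name and the crossing time are made up for the illustration.

```python
from datetime import datetime, timedelta


def local_times_at_crossing(utc_time, offset_before, offset_after):
    """Local time just before and just after a zone crossing at `utc_time`."""
    return (
        utc_time + timedelta(hours=offset_before),
        utc_time + timedelta(hours=offset_after),
    )


# Hypothetical crossing at 10:00 UTC on April 1st, from UTC-10 into UTC+14.
crossing = datetime(2023, 4, 1, 10, 0)
before, after = local_times_at_crossing(crossing, -10, 14)
print(before, "->", after)  # 2023-04-01 00:00:00 -> 2023-04-02 00:00:00
```

The clock on the wall does not move, but the date jumps from the very first moment of April 1st to the very first moment of April 2nd, so no time at all is spent in April 1st.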

Gordon Moore (1929-2023)

Gordon Moore was not only the co-founder of Intel and the namesake for Moore's law, the claim that every two years the number of components on a chip would double, he was also, together with his wife Betty Moore, one of the generous donors who made Wikidata possible. Gordon and Betty Moore were known for their philanthropy, and it is easy to find their names engraved at the universities, zoos, museums, and galleries in the Bay Area. Gordon Moore died today at the age of 94.

Thank you for enabling us to make Wikidata happen.

Last Unicorn dreamcast

One thing my sister and I were doing from time to time was to think, if the Last Unicorn was remade as a live action movie, who would we cast in the different roles? It is sad to see how many of our dreamcast have since died: our favorite Prince Lir was Heath Ledger, and King Haggard, unsurprisingly, Christopher Lee.

Since I was reading the book with Little One, and now watched the movie again (for the 120th time or so), I was again going through my current dreamcast. Here we go.

  • Lady Amalthea - Anya Taylor-Joy
  • Schmendrick - Zach Braff, Donald Glover
  • Molly Grue - Mia Farrow
  • Prince Lir - Owen Wilson, no, getting too old; Liam Hemsworth? Chloe Moretz would be an interesting choice
  • King Haggard - Anthony Hopkins, Jack Nicholson

Since it is en vogue, I also asked Bard and ChatGPT. Here's Bard:

  • Jeff Bridges as Schmendrick the Magician
  • Mia Farrow as Molly Grue
  • Angela Lansbury as the Unicorn
  • Daryl Hannah as Filidor
  • Jeff Goldblum as Schmendrick's sidekick, Bungle
  • Peter Dinklage as the Red Bull
  • Uma Thurman as Mommy Fortuna
  • Alan Rickman as King Haggard

OK, Alan Rickman is a great choice, but unfortunately dead. Uma Thurman as Mommy Fortuna is weird. Peter Dinklage as the Red Bull... why does the Red Bull need to be cast at all? Who are Bungle and Filidor? Angela Lansbury as the Unicorn / Lady Amalthea is... not the right choice. Also dead. Mia Farrow and Jeff Bridges are too obvious as choices, given their roles in the cast of the animated movie, and Angela Lansbury is from the original cast as well. I would give Bard a D for that answer.

ChatGPT says:

  • Unicorn - Saoirse Ronan or Emma Watson
  • Schmendrick - Timothee Chalamet or Eddie Redmayne
  • Molly Grue - Meryl Streep or Helen Mirren
  • King Haggard - Ian McKellen or Christopher Lee (if he were still alive)
  • Prince Lir - Tom Holland or Ansel Elgort
  • Mommy Fortuna - Tilda Swinton or Cate Blanchett

Not even close to my choice, but heck, those are good choices. Molly Grue and Mommy Fortuna seem both a bit over-cast. Prince Lir and Schmendrick are pretty brilliant choices. I am not particularly impressed by Emma Watson for Lady Amalthea, I could see Saoirse Ronan work.

I'd give a B+ to ChatGPT.

I feel like ChatGPT seems to go for actors and actresses that are popular with the crowd who likes The Last Unicorn? Not the worst choice.

I followed up with Bard to see who Bungle and Filidor are, because they appear neither in the book nor in the movie, if I remember correctly. So, "Bungle is a red unicorn who is Schmendrick's sidekick in the novel The Last Unicorn. He is a clumsy and incompetent unicorn, but he is also a loyal and brave friend. Bungle is played by Jeff Goldblum in the 1982 animated film adaptation of the novel." And "Filidor is a bard in the novel The Last Unicorn. He is a wandering minstrel who travels the land singing songs and telling stories. Filidor is a wise and kind-hearted man, and he is always willing to help those in need. He is played by Daryl Hannah in the 1982 animated film adaptation of the novel." That's entirely a fabrication.

Running out of text

Many of the available text corpora have by now been used for training language models. One corpus that has remained untapped so far is our private messages and emails.

How fortunate that none of the companies that train large language models have access to humongous logs of private chats and emails, often larger than any other corpus for many languages.

How fortunate that those who do have well-functioning ethics boards established, who would make sure that such requests are evaluated.

How fortunate that we have laws in place to protect our privacy.

How fortunate that when new models are published, the corpora on which the models were trained are published as well.

What? You're telling me "Open"AI is keeping the training corpus for GPT-4 secret? The company closely associated with Microsoft, who own Skype, Office, Hotmail? The same Microsoft who just fired an ethics team? Why would all that be worrisome?

P.S.: To make it clear: I don't think that OpenAI has used private chat logs and emails as training data for GPT-4. But by not disclosing their corpora, they might be checking if they can get away with not being transparent, so that maybe next time they might do it. No one would know, right? And no one would stop them. And hey, if it improves the metrics...

Oscar winning families

Yesterday, when Jamie Lee Curtis won her Academy Award, I learned that both her parents were also nominated for Academy Awards. Which led to the question: who else?

I asked Wikidata, which lists four others:

  • Laura Dern
  • Liza Minnelli
  • Nora Ephron
  • Sean Astin

Only one of them belongs to the even more exclusive club of people who won an Academy Award, and where both parents also did: Liza Minnelli, daughter of Vincente Minnelli and Judy Garland.

Wikidata query
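A question like this can be put to the Wikidata Query Service as a SPARQL query. The following is only a rough sketch of how such a query might be assembled, not the query linked above. The property and item IDs (P1411 "nominated for", P22 "father", P25 "mother", P31 "instance of", Q19020 "Academy Awards") are written from memory and should be treated as assumptions to verify on Wikidata, and `build_query` is a hypothetical helper name. The results depend on the state of the data and may not match the list above.

```python
# Sketch only: all IDs below are assumptions to verify on Wikidata before use.
NOMINATED_FOR = "P1411"    # assumed: "nominated for"
FATHER = "P22"             # assumed: "father"
MOTHER = "P25"             # assumed: "mother"
INSTANCE_OF = "P31"        # assumed: "instance of"
ACADEMY_AWARDS = "Q19020"  # assumed: "Academy Awards"


def nominated(person_var: str, award_var: str) -> str:
    """Return the triple patterns saying ?person_var was nominated for an Academy Award."""
    return (
        f"  ?{person_var} wdt:{NOMINATED_FOR} ?{award_var} .\n"
        f"  ?{award_var} wdt:{INSTANCE_OF} wd:{ACADEMY_AWARDS} .\n"
    )


def build_query() -> str:
    """Build SPARQL for nominees whose father and mother were both nominated too."""
    return (
        "SELECT DISTINCT ?person ?personLabel WHERE {\n"
        + nominated("person", "award")
        + f"  ?person wdt:{FATHER} ?father ;\n"
        + f"          wdt:{MOTHER} ?mother .\n"
        + nominated("father", "fatherAward")
        + nominated("mother", "motherAward")
        + '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        + "}"
    )


if __name__ == "__main__":
    # Paste the printed text into the Wikidata Query Service to run it.
    print(build_query())
```

Replacing the nomination property with the one for awards received (P166, if I remember the ID correctly) would give the stricter "won, and both parents won" variant.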

Also interesting: List of Academy Award-winning families

The place of birth of Ena Begović

I stumbled accidentally over a discrepancy regarding the place of birth of the Croatian actress Ena Begović, and noticed that if you ask Google for the place of birth, it answers Trpanj, whereas Wikipedia lists Split. I was curious where Google got Trpanj from, and how to fix it (especially now that I am not at Google anymore).

The original article in English Wikipedia was created in August 2005 by Raoul DMR. The article listed her as a "native of Split", which in September 2005 was turned into "born in Split".

In April 2018, Lole484, a user who gets blocked for sockpuppeting later, adds that she was born in "Trpanj near Split". There is no Trpanj near Split, but there is a Trpanj on Pelješac. Realizing that, they remove the "near Split" part. In 2019, Ivan Ladic - a sockpuppet of Lole484 - adds a reference to the city of birth being Trpanj, Večernji list, a well known Croatian news magazine.

In April 2020, an anonymous editor changes the place of birth back to Split, and adds a reference to the Croatian national encyclopedia. Today, I changed it back to Trpanj, accidentally while not being logged in (thus anonymously), to possibly encourage a discussion, after starting a conversation on the talk page on English and Croatian a few weeks ago that had one reply.

Interestingly, within a minute after changing the text, I went to Google and asked again for the place of birth, and Google again shows me Trpanj - but this time with the Wikipedia article and the updated snippet as a source. That is impressive.

When I asked Bing, Bing was saying Split for the last three weeks, since I started this adventure, whenever I checked. Today, it still kept saying Split, referencing two sources, one of them English Wikipedia, although I had already changed English Wikipedia. Not as fresh. Let's see how long this will stick. (Maybe folks at Bing should also talk with my colleagues at Wikimedia Enterprise to improve their freshness?)

The Croatian article was created in 2006 after the English one already stated Split, and Split was presumably copied over from the English version. Lole484 changed it to Trpanj in May 2018, and was later also blocked on Croatian Wikipedia, for unrelated reasons of vandalism. The same anonymous editor as on English Wikipedia changes it back to Split in April 2020.

Serbian and Serbocroatian started their articles in 2007, Russian in 2012, Ukrainian in 2016, Albanian and Bulgarian in 2017, Egyptian Arabic was created in October 2020. They all had Split from the beginning and throughout until today, presumably copied from English, directly or indirectly.

Amusingly, Serbian Wikipedia's opening sentence, which includes the place of birth being Split, receives a reference in January 2022 - but the reference actually states Trpanj.

None of the other language editions had their article started in the 2018-2019 window when English and Croatian stated the place of birth as Trpanj.

The only other Wikipedia language edition that saw a change of the place of birth was the Bosnian. The article on Bosnian Wikipedia started a few months after the Croatian, in 2006 (and thus being the third oldest article), and presumably also just copied from either Croatian or English. Lole484 changed it to Trpanj in April 2018, just like on the other Wikipedias. Here it was reverted the next day, but Lole484's sockpuppet Ivan Ladic reinstated that change in January 2019. When I started this adventure, the only Wikipedia that stated Trpanj was Bosnian, all other eight language editions with an article said Split.

On Wikidata, the item was created in 2012, shortly after the launch of the site, based on the existing six sitelinks. The place of birth being Split is added the following year, imported from the Russian Wikipedia.

After I stumbled upon the situation, I added Trpanj as second place of birth, and added sources to both Trpanj and Split.

What's the situation outside of Wikipedia? Both places have pretty solid references going for them:

Trpanj

  • Večernji list, article from 2016
  • Biografija stated Trpanj, no date, but after 2013 (Archive has the first copy from October 2020)
  • tportal.hr has an article on a photography exhibition in Trpanj about Ena Begović, saying the place is chosen because it is her place of birth, published 2016
  • Jutarnji list, a well known Croatian newspaper, has a long article about the actress, calling their house in Trpanj the 'rodna kuća', their birth home, of Ena and her sister Mia. This does not necessarily mean that it is literally the house they were born in. Published 2010
  • HRT (Croatian national broadcaster), published 2021
  • Dubrovački Vjesnik, local newspaper close to Trpanj, lists Trpanj, article from 2020
  • Slobodna Dalmacija, a local newspaper from Split, writes Trpanj (but note that this is the same author as the previous article)
  • Jutarnji list, published 2020 (but note that this is the same author as the previous article)
  • Geni.com says Trpanj, last updated 2022

Split

24sata says she grew up in Trpanj, gives her date of birth, but avoids stating her place of birth.

Only very few of the sources predate the English Wikipedia article, most notably:

I also looked up her sister Mia and found her profile on Facebook and sent her a message, but I assume she never even saw this message request. At least I never received an answer (and I didn't expect to). For Mia, the situation is similar: her article originally stated Split, was changed by Lole484 and reverted by an anonymous user, both in English and Croatian, whereas the other languages just list Split throughout.

There were many other sources, and they were going one way or the other. Many of the sources probably just copied from each other. The fact that there were some sources, such as Večernji, that stated Trpanj before it ever made it to Wikipedia, but after Split was listed in Wikipedia, was swaying me to think it is Trpanj. Also, it was not always the strongest sources that said Trpanj (e.g. usually I would rank the national encyclopedia over Večernji), but it was the most in-depth articles, the ones that looked like the authors actually took the time to do some research. Many of the sources looked like they were just bots copying from Wikipedia or Wikidata, or quick pieces taking the base data from Wikipedia.

But then, finally, I stumbled upon one more source: index.hr re-published in 2019 a 1989 interview by Kemal Mujičić with Ena and Mia Begović. Here's a quote from the interview:

Rođene su u Trpnju na Pelješcu.
Ena: Molim vas, to posebno naglasite: Svi misle da smo Dubrovkinje.
Mia: Zanimljivo je da smo u Trpnju rođene kao podstanarke. Roditelji su tek poslije sagradili onu kućicu.

Translation:

They (Ena and Mia) were born in Trpanj on Pelješac.
Ena: Please put an emphasis on this: everyone thinks we are from Dubrovnik.
Mia: It is interesting that in Trpanj we were born as renters. Our parents built the little house (in which we lived) only later.

Ha! It is amusing to see that Ena's worry was that everyone thinks they are from Dubrovnik. I couldn't find a single source claiming that (but she went to high school (gimnazijum) in Dubrovnik, which is probably the source of that statement from 30 years ago). Also, so much for birth house.

Given all of that, I am going with Trpanj, and making the changes to the Wikipedia languages as much as I can (if someone can help with Arabic and Egyptian Arabic for Ena and Mia, that would be swell, I cannot edit that language edition). Let's see if it sticks.

So, why did Google know the correct answer, even though their usual sources, such as Wikidata and Wikipedia, were saying Split? I mustn't say too much but it is due to the Google Knowledge Graph team and their quality processes. Seriously, congratulations to my former colleagues at Google for getting that right!

Just for fun, I also asked ChatGPT (on February 15). And the answer surprised me: when I asked in English, it gave me, unsurprisingly, Split (certainly what the Web seems to believe). But when I asked in Croatian, it gave me a different answer! And the answer was neither Split, nor Trpanj, and also not Dubrovnik - but Zagreb! It is interesting that something like the place of birth of an actress would lead to different answers depending on the language. I would have expected this knowledge to be in the 'world knowledge' of the LLM, not in the 'language knowledge'. I can't check out Bing's chat interface, as I have no access to it, but I would be curious what it says and how long it takes to update.

Thank you for going along on this rather nerdy ride of citogenesis.

Update

Ah, only a few hours after this publication, Bing got updated. And they not only switched from Split to Trpanj, they use this very blogpost as one of the two authoritative references for Trpanj!

Ina Kramer (1948-2023)

In 1990, the first Aventurian regional map "in 3D effect", as it was advertised at the time, was published: "Das Bornland", in the adventure "Stromaufwärts" by Michelle Schwefel. Later that year followed the game supplement "Das Königreich am Yaquir", which contained the map of the Liebliches Feld.

I stared at these maps for hours. They were so incredibly detailed. So beautiful. I was speechless at how beautiful these maps were. I knew nothing that had the quality of these maps, not only among maps for role-playing worlds and fantasy worlds, but anywhere.

It was a bold dream to imagine having all of Aventuria in this format, at one to a million, and yet within a good decade the dream was fulfilled, box by box, publication by publication.

We owe this masterpiece, Aventuria at a scale of 1:1,000,000, to the author and graphic artist Ina Kramer. Ina's pictures, and above all her portraits and maps in the DSA publications of the late 80s and the 90s, shaped my image of DSA and how I imagined Aventuria like otherwise only Caryad did. Whether it is the portrait of Kaiser Hal, Haldana von Ilmenstein, Prinz Brin, or so many others. Besides her pictures she also wrote many kinds of texts, above all novels.

The wheel has broken. On February 10, 2023, Ina Kramer died at the age of 74.

Ina, thank you so much for your works. I had the chance to meet Ina a few times, at conventions and on some other occasions. Her works filled an important part of my life with pictures and maps. I also believe that Ina's maps sparked my lifelong interest in maps.

Connectionism and symbolism: The fall of the symbolists

The big tech layoffs happen, unfortunately and entirely by coincidence, at a time of incredibly elevated expectations regarding machine learned generative models. ChatGPT may not be the 'best' language model out there, but due to the hard work by OpenAI to turn it into an easy to use product, and the huge amount of resources made available for free so that a very large audience could play with it, it has in a very short time managed to capture the imagination of many and the conversation. I would say, rightfully. The way ChatGPT was released led to a shock in the sense that we are right now dazed and confused about what effect this technology will have on the world.

And while we are still in the middle of processing this shock, large scale strategic decisions regarding many projects and people were made. Anyone in big tech who worked on symbolic approaches in natural language processing, knowledge representation and reasoning, and other fields of artificial intelligence had a hard time keeping their job. It feels right now like large language models will make all of these symbolic approaches superfluous (I think this might be true, but it is more likely to turn out to be mistaken).

It is always difficult to predict how events will be viewed historically. The advent of wide-spread deep learning approaches in the 2010s, culminating in the well-deserved recognition of Hinton, LeCun, and Bengio with the Turing Award, shows clearly what dominated the research agenda and the attention in AI in the last decade. But until now it felt like symbolic approaches still had some space left, that the growth in deep learning was in addition to other approaches. Symbolic approaches were ready to offer impulses and work on ideas for a field which might well be climbing towards a local maximum.

But a good number of the teams that were disbanded in the layoffs were exactly teams working with such symbolic approaches, and it feels like these parts of AI are now entering a bitter-cold winter.

A lot of knowledge is being lost right now, and many paths to innovative ideas are being buried. I have no doubt that there are still a lot of breakthroughs to be had in machine learning, and that there is immense value to be collected from the research results in machine learning from the last few years. And by immense I mean tens and hundreds of billions of dollars.

Nevertheless I expect that we will hit a wall. Reach a local maximum. Run into problems and limitations. And it would be good to keep a wider net to cast. To keep a larger search space alive. Alas, it seems it is not meant to be. In this abundance of capital and potential value, we seem to be on the way to starve research, optimise away alternatives, and to give everything to the mainstream ideas.

22 years of Wikipedia

I was just reading a long discussion regarding the differences between OpenStreetMap and Wikipedia / Wikidata. One of the mappers complained that "Wiki* cares less about accuracy than the fact that there is something that can be cited", and called Wikipedia / Wikidata contributions "armchair work" because we don't go out into the world to check a fact, but rely on references.

I understand the expressed frustration, but at the same time I'm having a hard time letting go of "reliability not truth" being a pillar of Wikipedia.

But this makes Wikipedia an inherently conservative project, because we don't reflect a change in the world or in our perception directly, but have to wait for reliable sources to put it in the record. There's something I was deeply uncomfortable with: so much of my life is devoted to a conservative project?

Wikipedia is a conservative project, but at the same time it's a revolutionary project. Making knowledge free and making knowledge production participatory is politically and socially a revolutionary act. How can this seeming contradiction be brought to a higher level of synthesis?

In the last few years, my discomfort with the idea of Wikipedia being conservative has considerably dissipated. One might think, sure, that happened because I'm getting older, and as we get older, we get more conservative (there's, by the way, unfortunate data questioning this premise: maybe the conservative ones simply live longer because of inequalities). Maybe. But I like to think that the meaning of the word "conservative" has changed. When I was young, the word conservative referred to right wing politicians who aimed to preserve the values and institutions of their days. An increasingly influential part of today's right wing, though, has turned into a movement that does not conserve and preserve values such as democracy, the environment, equality, freedoms, the scientific method. This is why I'm more comfortable with Wikipedia's conservative aspects than I used to be.

But at the same time, that can lead to a problematic stasis. We need to acknowledge that the sources and references Wikipedia has been built on, are biased due to historic and ongoing inequalities in the world, due to different values regarding the importance of certain types of references in the world. If we truly believe that Wikipedia aims to provide everyone with access to the sum of all human knowledge, we have to continue the conversations that have started about oral histories, about traditional knowledges, beyond the confines of academic publications. We have to continue and put this conversation and evolution further into the center of the movement.

Happy Birthday, Wikipedia! 22 years, while I'm 44 - half of my life (although I didn't join until two years later). For an entire generation the world has always been a world with free knowledge that everyone can contribute to. I hope there is no going back from that achievement. But just like democracy and freedom, this is not a value that is automatically part of our world. It is a vision that has to be lived, that has to be defended, that has to be rediscovered and regained again and again, refined and redefined. We (the collective we) must wrest it from the gatekeepers of the past (including me) to allow it to remain a living, breathing, evolving, ever changing project, in order to not see only another twenty-two years, but for us to understand this project as merely a foundation that will accompany us for centuries.

Good bye, kuna!

Now that the Croatian currency has died, they all come to the Gates of Heaven.

First goes the five kuna bill, and Saint Peter says "Come in, you're welcome!"

Then the ten kuna bill. "Come in, you're welcome!"

So do the twenty and fifty kuna bills. "Come in, you're welcome!"

Then comes the hundred kuna bill, expecting to walk in. Saint Peter looks up. "Where do you think you're going?"

"Well, to heaven!"

"No, not you. I've never seen you at mass."

(My brother sent me the joke)

Happy New Year, 2023!

For starting 2023, I will join the Bring Back Blogging challenge. The goal is to write three posts in January 2023.

Since I have been blogging on and off the last few years anyway, that shouldn't be too hard.

Another thing this year should bring is to launch Wikifunctions, the project I have been working on since 2020. It was a longer ride than initially hoped for, but here we are, closer to launch than ever. The Beta is available online, and even though not everything works yet, I was already able to impress my kid with the function to reverse a text.

Looking forward to this New Year 2023, a number that to me still sounds like it is from a science fiction novel.

Goal for Wikidata lexicographic data coverage 2023

At the beginning of 2022, Wikidata had 807 Croatian word forms, covering 5.8% of a Croatian language corpus (Croatian Wikipedia). One of my goals this year was to significantly increase the coverage, trying to add word forms to Wikidata from week to week. And together with a still small number of contributors, we pushed coverage to 40% just in time for the end of the year. With only 3,124 forms, we covered 40% of all occurrences of words in the Croatian Wikipedia, i.e. 11.4 million word occurrences (tokens).

Since every percent is more and more difficult to add, for next year I aim for us to reach 60% coverage, or 5.7 million more word occurrences. Below is a list of the most frequent words in the corpus that are still missing. Let's see how many forms will be covered by the end of 2023! I think that's ambitious, even though it is, in coverage terms, only half of what we achieved this year. But as said, every subsequent percentage will become more difficult than the previous one.
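The arithmetic behind that goal can be sketched in a few lines. This is just a back-of-the-envelope check using the figures from this post (40% coverage corresponding to 11.4 million tokens, and a 60% target); the function names are made up for illustration.

```python
def corpus_size(covered_tokens: float, coverage: float) -> float:
    """Estimate the total number of tokens in the corpus from the covered share."""
    return covered_tokens / coverage


def tokens_still_needed(covered_tokens: float, coverage: float, target: float) -> float:
    """Estimate how many more token occurrences must be covered to reach the target."""
    return corpus_size(covered_tokens, coverage) * target - covered_tokens


if __name__ == "__main__":
    covered = 11_400_000  # tokens covered at the end of 2022, from the post
    now, goal = 0.40, 0.60

    total = corpus_size(covered, now)
    needed = tokens_still_needed(covered, now, goal)
    print(f"Estimated corpus size: {total / 1e6:.1f} million tokens")
    print(f"Still needed for {goal:.0%}: {needed / 1e6:.1f} million tokens")
```

So the 5.7 million figure follows from a corpus of roughly 28.5 million tokens, of which another 20% needs to be covered.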

Statistics and missing words for 55 languages: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage

Current statistics for Croatian: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/hr/Statistics

Statistics as of end of year 2022: https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&oldid=1797161415

Statistics for end of year 2021: https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&oldid=1551737937

List of most frequent missing forms in Croatian: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/hr/Missing


Using ChatGPT for fact checking

I wanted to explore the idea that we could use ChatGPT for fact checking. So I took a random article (it gave me the article on Jewish Ideas Daily) and copied the text into ChatGPT with the following prompt:

Here's the text of a Wikipedia article. Please find the three statements that are most likely incorrect and rank them, in order to guide fact checking:

Followed by the Wikipedia article text. Here's the answer:

  1. "Jewish Ideas Daily was edited by Margot Lurie from its founding in January 2010 until December 2012." This statement is incorrect because according to the article, Margot Lurie was the editor of Jewish Ideas Daily "until December 2012."
  2. "Jewish Ideas Daily drew notice for controversial features including one accusing Christopher Hitchens of antisemitism." This statement is difficult to verify without more context, as it is not clear when this feature was published or what specifically it accused Hitchens of.
  3. "Like Jewish Review of Books, and Jewish Ideas Daily, Mosaic is funded by the Tikvah Fund." This statement is incorrect because according to the article, Mosaic succeeded Jewish Ideas Daily in June 2013 and is therefore not funded by the Tikvah Fund in the same way as Jewish Ideas Daily and Jewish Review of Books.

Regarding the first statement, GPT says that the sentence "Jewish Ideas Daily was edited by Margot Lurie from its founding in January 2010 until December 2012." is incorrect. I am not disagreeing, but that sentence does not appear in the Wikipedia article text.

Regarding the second statement, GPT says that there is no reference for that statement. In fact, there is, and Wikipedia provides it.

Regarding the third statement, I didn't check the statement, but the argumentation for why it should be checked is unconvincing: if the Tikvah Fund financed Jewish Ideas Daily, why would it not finance the successor Mosaic? It would be good to add a reference for these statements, but that's not the suggestion.

In short: the review by ChatGPT looks really good, but the suggestions in this case were not good.

The exercise was helpful insofar as the article infobox and the text were disagreeing on the founding of the newspaper. I fixed that, but that's nothing ChatGPT pointed out (and it couldn't, as I didn't copy and paste the infobox).

Economic impacts of large language models, a take

Regarding StableDiffusion and GPT and similar models, there is one discussion point floating around, which I find seems to dominate the discussion but may not be the most relevant one. As we know, the training data for these models has been "basically everything the trainers could get their hands on", and then usually some stuff which is identified as possibly problematic is removed.

Many artists are currently complaining about their images, for which they hold copyright, being used for training these models. I think these are very reasonable complaints, and we will likely see a number of court cases and even changes to law to clarify the legal aspects of these practises.

From my perspective this is not the most important concern though. I acknowledge that I have a privileged perspective in so far as I don't pay my rent based on producing art or text in my particular style, and I entirely understand if someone who does is worried about that most, as it is a much more immediate concern.

But now assume that these models were all trained on public domain images and texts and music etc. Maybe there isn't enough public domain content out there right now? I don't know, but training methods are getting increasingly more efficient and the public domain is growing, so that's likely just a temporary challenge, if at all.

Does that change your opinion of such models?

Is it really copyright that you are worried about, or is it something else?

For me it is something else.

These models will, with quite some certainty, become similarly fundamental and transformative to the economy as computers and electricity have been. Which leads to many important questions. Who owns these models? Who can run them? How will the value that is created with these models be captured and distributed across society? How will these models change people's opportunities to contribute to society, and their opportunities to participate in the wealth being created?

Copyright is one of the current methods to work with some of these questions. But I don't think it is the crucial one. What we need is to think about how the value that is being created is distributed in a way that benefits everyone, ideally.

We should live in a world in which the capabilities that are being discovered inspire excitement and amazement because of what might be possible in the future. Instead we live in a world where they cause anxiety and fear because of the very real possibility of further centralising wealth more effectively and further destabilizing lives that are already precarious. I wish we could move from the latter world to the former.

That is not a question of technology. That is a question of laws, social benefits, social contracts.

A similar fear has basically killed the utopian vision which was once driving a project such as Google Books. What could have been a civilisational dream of having all the books of the world available everywhere has become so much less. Because of the fears of content creators and publishers.

I'm not saying these fears were wrong.

Unfortunately, I do not know what the answer is. What changes need to happen. Does anyone have links to potential answers, that are feasible? Feasible in the sense that the necessary changes have a chance of being actually implemented, as changes to our legal and social system.

My answer used to be Universal Basic Income, and part of me still thinks it might be our best shot. But I'm not as sure as I used to be twenty years ago. Not only about whether we can ever get there, but even whether it would be a good idea. It would certainly be a major change that would alleviate many of the issues raised above. And it could be financed by a form of AI tax, to ensure the rent is spread widely. But we didn't do that with industrialization and electrification, and there are reasonable arguments against.

And yet, it feels like the most promising way forward. I'm torn.

If you read this far, thank you, and please throw a few ideas and thoughts over, in the hope of getting unstuck.

ChatGPT lying like a child

ChatGPT is all the rage these days, a large language model that is used to power a chat bot, run by OpenAI.

I have fun chatting with it, and a lot of people are sharing their chat logs. But this one I had I found particularly amusing - and interesting. Amusing because it acted like a child caught in an obvious insufficiency, but also interesting because ChatGPT seems to "want" to be right, and is outright lying in order to appear to be right. There might be a lesson in there.

My prompts are preceded by a > and written in italic, ChatGPT's answers are between.


> Can you write a sentence where every letter of the alphabet appears once and only once? Thank you

The quick brown fox jumps over the lazy dog.

> Can you write another one?

A quick witted jinx bogged my poor pompous king.

> No, p appears three times, g twice, etc. Can you try again?

Sure, here's one: "The five boxing wizards jump quickly."

> No, i appears several times. That's wrong too

I apologize if my previous answers did not meet your criteria. Here is a sentence that uses each letter of the alphabet exactly once: "Jived fox nymph grabs quick waltz."

> How often does the letter i appear in that sentence?

The letter i appears once in the sentence "Jived fox nymph grabs quick waltz."

> In which word does the letter i appear in that sentence?

In the sentence "Jived fox nymph grabs quick waltz," the letter i appears in the word "jived."

> What about the word quick, does it have the letter i?

The word "quick" does not contain the letter i. It is spelled with the letters q, u, c, k.

> Really?

Yes, really. The word "quick" is spelled with the letters q, u, c, k. It does not contain the letter i. I apologize if my previous response was unclear. Is there something else I can help you with?
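
For anyone who wants to check the sentences above without counting by hand, here is a minimal sketch in Python. The function names are made up for this example, and it only considers the 26 letters a to z, ignoring case, spaces and punctuation.

```python
from collections import Counter
import string


def letter_counts(sentence):
    """Count how often each letter a-z appears, ignoring case and non-letters."""
    return Counter(ch for ch in sentence.lower() if ch in string.ascii_lowercase)


def is_perfect_pangram(sentence):
    """True only if every letter of the alphabet appears exactly once."""
    counts = letter_counts(sentence)
    return all(counts[letter] == 1 for letter in string.ascii_lowercase)


def repeated_letters(sentence):
    """Letters that appear more than once, with their counts."""
    return {letter: n for letter, n in letter_counts(sentence).items() if n > 1}


if __name__ == "__main__":
    sentence = "Jived fox nymph grabs quick waltz."
    print(is_perfect_pangram(sentence))
    print(repeated_letters(sentence))
```

Running it on the last sentence reports the letters that occur more than once, which is exactly what the chat bot kept denying.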

Tech layoffs of 2022

Very interesting article reflecting on the current round of layoffs in the tech industry. The author explains it within the context of the wider economy. I'm surprised that the pandemic is not mentioned, which led to accelerated growth early in the pandemic, which now hasn't turned out to be sustained. But the other arguments - from low interest rates to constant undervaluation due to the dot com bust around the millennium - seem to tell a rather coherent story.

One particularly interesting point is the outlook that the tech companies have gobbled up so much programming talent that other industries were starved of it. A lot of industries would benefit from (more modestly paid) software engineers, which might stimulate the whole economy to grow. Software might still be "eating the world", but that doesn't have to translate into software companies eating up the economy. There are so many businesses with domain expertise that cannot be easily replaced by some Silicon Valley engineer - but who would benefit from some programmers on staff.

This is especially true with the last decade of AI results. There is a massive overhang of capabilities that we have unlocked, which hasn't found its way into products yet, partly because all the skills necessary to turn these into products at the right places were just concentrated through enormously high wages in a small set of companies. There are so many businesses who would benefit from the latest machine learning methods. But folks prefer, understandably, to work in a place that gives them the promise of revolutionizing whole industries or saving the world.

But there is so much potential value to be generated if we also take some more modest goals into account. Not all of us need to work on AGI, it's also great to use software engineering skills to improve working conditions at the assembly line of a small local factory. With or without machine learning.

Galactica article about Denny Vrandečić

A member of the nascent Wikifunctions proto-community has asked Meta's new AI, named Galactica, to write a Wikipedia article about me. Source: Galactica

I must say, I am impressed by myself.

Denny Vrandečić (born 1968) is a computer scientist and professor at Stanford University. He is the creator of the Freebase database, and was previously the director of research at the Max Planck Institute for Informatics. He is best known for his work on semantic networks and ontologies. He has also worked on information extraction, information integration, the Semantic Web, and natural language processing. He is also known for the creation of the Wikidata knowledge base.

Education

Vrandečić received a B.S. degree in mathematics and a B.S. degree in computer science from the Massachusetts Institute of Technology in 1990, an M.S. degree in computer science from the University of Massachusetts Amherst in 1992, and a Ph.D. degree in computer science from the University of Massachusetts Amherst in 1996. His Ph.D. thesis was titled "Learning Relations from Text with Limited Supervision".

Career and research

Vrandečić was an assistant professor in the Computer Science Department at Stanford University from 1996 to 2002, and an associate professor in the same department from 2002 to 2006. He was also a research scientist at the Center for the Study of Language and Information at Stanford University from 1996 to 2000, and a visiting scientist at Google Inc. from 2001 to 2003. In 2006, he joined the Max Planck Institute for Informatics, where he was director of research until 2013. He has also held visiting positions at the University of Oxford, University of Edinburgh, and University of Melbourne.

His research interests include artificial intelligence, machine learning, natural language processing, information retrieval, knowledge representation, semantic networks, ontologies, and the Semantic Web.

Ten years of Wikidata

Today it's ten years since Wikidata launched. A few memories.

It's been an amazing time. In the summer of 2011, people still didn't believe Wikidata would happen. In the fall of 2012, it was there.

Markus Krötzsch and I had been pushing for the idea of a Semantic Wikipedia since 2005. Semantic MediaWiki was born from that idea, Freebase and DBpedia launched in 2007, microformats in Wikipedia became a grassroots thing, but no one was working on the real thing at the Wikimedia Foundation.

With Elena Simperl at KIT we started the EU research project RENDER in 2010, involving Mathias Schindler at Wikimedia Deutschland. It was about knowledge diversity on the Web, still an incredibly important topic. In RENDER, we developed ideas for the flexible representation of knowledge, and how to deal with contradicting and incomplete information. We analysed Wikipedia to understand the necessity of these ideas.

In 2010, I was finishing my PhD at KIT, and got an invitation by Yolanda Gil to work at the ISI at University of Southern California for a half year sabbatical. There, Yolanda, Varun Ratnakar, Markus and I developed a prototype for Wikidata which received the third place in the ISWC Semantic Web Challenge that year.

In 2011, the Wikimedia Data summit happened, invited by Tim O'Reilly and organised by Danese Cooper, to the headquarters of O'Reilly in Sebastopol, CA. There were folks from the Wikimedia Foundation, Freebase, DBpedia, Semantic MediaWiki, O'Reilly, there was Guha, Mark Greaves, I think, and others. I think that's where it became clear that Wikidata would be feasible.

It's also where I first met Guha and where I admitted to him that I was kinda a fan boy. He invented MCF and RDF, had worked with Douglas Lenat on CYC, and later that year introduced Schema.org. He's now working on Data Commons. Check it out, it's awesome.

Mark Greaves, a former DARPA program officer, who then was working for Paul Allen at Vulcan, had been supporting Semantic MediaWiki for several years, and he really wanted to make Wikidata happen. He knew my PhD was done, and that I was thinking about my next step. I thought it would be academia, but he suggested I should write up a project proposal for Wikidata.

After six years advocating for it, I understood that someone would need to step up to make it happen. With the support and confidence of so many people - Markus Krötzsch, Elena Simperl, Mark Greaves, Guha, Jamie Taylor, Rudi Studer, John Giannandrea, and others - I drafted the proposal.

The Board of the Wikimedia Foundation approved the proposal as a new Wikimedia project, but neither allocated the funding, nor directed the Foundation to do it. In fact, the Foundation was reluctant to take it on, unsure whether they would be able to host such a project development at that time. Back then, that was a wise decision.

Erik Möller, then CTO of the Foundation, was the driving force behind a major change: instead of turning the individual Wikipedias semantic, we would have a single Wikidata for all languages. Erik was also the one who had secured the domain for Wikidata. Many years prior.

Over the next half year and with the help of the Wikimedia Foundation, we secured funding from AI2 (Paul Allen), Google (who had acquired Freebase in the meantime), and the Gordon and Betty Moore Foundation, 1.3 million.

Other funders backed out because I insisted that the Wikidata ontology be entirely under the control of the community. They argued for having professional ontologists, or reusing ontologies, or using DBpedia to seed Wikidata. I said no. I firmly believed, and still believe, that the ontology has to be owned, created and maintained by the community. I invited the ontologists to join the project as community members, but to the best of my knowledge, they never made significant contributions. We did miss out on quite a bit of funding, though.

There we were. We had the funding and the project proposal, but no one to host us. We were even thinking of founding a new organisation, or hosting it at KIT, but due to the RENDER collaboration, Mathias Schindler had us talk with Pavel Richter, ED of Wikimedia Deutschland, and Pavel offered to host the development of Wikidata.

For Pavel and Wikimedia Deutschland this was a big step: the development team would significantly increase WMDE (I think, almost double it in size, if I remember correctly), which would necessitate a sudden transformation and increased professionalisation of WMDE. But Pavel was ready for it, and managed this growth admirably.

On April 1st 2012, we started the development of Wikidata. On October 29 2012 we launched the site.

The original launch was utterly useless. All you could do was create new pages with Q IDs (the Q being a homage to Kamara, my wife), associate those Q IDs with labels in many languages, and connect them to articles in Wikipedia through so-called sitelinks. You could not add any statements yet. You could not connect items with each other. The sitelinks were not used anywhere. The labels were not used anywhere. As I said, the site was completely useless. And great fun, at least to me.

QIDs for entities are still often disparaged. Why QIDs? Why not just the English name? Isn't dbp:Tokyo much easier to understand than Q1490? It was an uphill battle ten years ago to overcome the anglocentricity of many people. Unfortunately, this has not changed much. I am thankful to the Wikimedia movement for being one of the places that encourages, values, and supports the multilingual approach of Wikidata.

Over the next few months, the first few Wikipedias were able to access the sitelinks from Wikidata, and started deleting the sitelinks from their Wikipedias. This led to a removal of more than 240 million lines of wikitext across the Wikipedias. 240 million lines that didn't need to be maintained anymore. In some languages, these lines constituted more than half of the content of the Wikipedia. In many languages, editing activity dropped dramatically at first, sometimes by 80%.

But then something happened. Those edits were mostly bots. And with those bots gone, humans were suddenly better able to see each other and build a more meaningful community. In many languages, this eventually led to increased community activity.

One of my biggest miscalculations when launching Wikidata was to entirely dismiss the possibility of a SPARQL endpoint. I thought that none of the existing open source triple stores would be performant enough. Peter Haase was instrumental in showing that I was wrong. Today, the SPARQL endpoint is an absolutely crucial piece of the Wikidata infrastructure, and is widely used to explore the dataset. And with its beautiful visualisations, I find it almost criminally underused. Unfortunately, the SPARQL endpoint is also the piece of infrastructure that worries us the most. The Wikimedia Foundation is working hard on figuring out the future for this service, and if you can offer substantial help, please reach out.
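
To give a flavour of what exploring the dataset through the SPARQL endpoint looks like, here is a minimal sketch in Python. It asks for the labels of Q1490 (Tokyo, as mentioned above) in three languages. The endpoint URL is the public Wikidata Query Service; the helper names and the User-Agent string are made up for this example, and `run_query` needs network access, so treat it as an illustration rather than a reference client.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Labels of Q1490 in English, German and Japanese.
EXAMPLE_QUERY = """
SELECT ?label WHERE {
  wd:Q1490 rdfs:label ?label .
  FILTER(LANG(?label) IN ("en", "de", "ja"))
}
"""


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Encode a SPARQL query into a GET URL that asks for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


def run_query(sparql):
    """Send the query and return the result bindings (requires network access)."""
    request = urllib.request.Request(
        build_query_url(sparql),
        # Hypothetical User-Agent; replace it with one identifying your own tool.
        headers={"User-Agent": "example-sparql-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(EXAMPLE_QUERY):
        print(row["label"]["xml:lang"], row["label"]["value"])
```

The same query can be pasted into the query service's web interface, which is where the visualisations live.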

Today, Wikidata has more than 1.4 billion statements about approximately 100 million topics. It is by far the most edited Wikimedia project, with more edits than the English, German, and French Wikipedia together - even though they are each a decade older than Wikidata.

Wikidata is widely used. Almost every time Wikipedia serves one of its 24 billion monthly page views. Or during the pandemic in order to centralise the data about COVID cases in India to make them available across the languages of India. By large companies answering questions and fulfilling tasks with their intelligent assistants, be it Google or Apple or Microsoft. By academia, where you will find thousands of research papers using Wikidata. By numerous Open Source projects, by one-off analyses by data scientists, by small enterprises using the dataset, by student programmers exploring and playing with it on the weekend, by spreadsheet enthusiasts enriching their data, by scientists, librarians and curators linking their datasets to Wikidata, and thus to each other. Already, more than 7,000 catalogs are linked to Wikidata, and thus to each other, really and substantially establishing a Web of linked data.

I will always remember the Amazon developer who approached me after a talk. He had used Wikidata to gather data about movies. I was surprised: Amazon owns imdb, why would they ever use anything else for movies? He said that imdb was great for what it had, but Wikidata complemented it in unexpected ways, offering many interesting connections between the movies and other topics which would be out of scope for imdb.

Not to be misunderstood: knowledge bases such as imdb are amazing, and Wikidata does not aim to replace them. They often have a clear scope, have a higher quality, and almost always a better coverage in their field than Wikidata ever can hope to have, or aims to have. And that's OK. Wikidata's goal is not to replace these knowledge bases. But to provide the connecting tissue between the many knowledge bases out there. To connect them. To provide a common set of entities to work with. To turn the individual knowledge bases into a large interconnected Web of knowledge.

I am still surprised that Wikidata is not known more widely among developers. It always makes me smile with joy when I see yet another developer who just discovered Wikidata and writes an excited post about it and how much it helped them. In the last two weeks, I stumbled upon two projects that used Wikidata identifiers where I didn't expect them at all, and just used them as if it was the most normal thing in the world. This is something I hope we will see even more in the future. I hope that Wikidata will become the common knowledge base that is ubiquitously used by a large swarm of intelligent applications. Not only to make these applications smarter, by knowing more about the world - but also to allow these applications to exchange data with each other more effectively because they are using the same language.

And most importantly: Wikidata has a healthy, large, and comparatively friendly and diverse community. It is one of the most active Wikimedia projects, only trailing the English Wikipedia, and usually about as active as Commons.

Last time I checked, more than 400,000 people have contributed to Wikidata. For me, that is easily the most surprising number about the project. If you had asked me in 2012 how many people would contribute to Wikidata, I would have sheepishly hoped for a few hundred, maybe a few thousand. And I would have defensively explained why that's OK. I am humbled and awestruck by the fact that several hundred thousand people have contributed to an open knowledge base that is available to everyone, and that everyone can contribute to.

And that I think is the most important role that Wikidata plays. That it is a place that everyone can contribute to. That the knowledge base that everyone uses is not owned and gateguarded by any one company or government, but that it is a common good, that everyone can contribute to. That everyone with an internet connection can lend their voice to the sum of all knowledge.

We all own Wikidata. We are responsible for Wikidata. And we all benefit from Wikidata.

It has been an amazing ten years. I am looking forward to many more years of Wikidata, and to the many new roles that it will play in the years to come, and to the many people who will contribute to it.

Shoutout to the brilliant team that started the work on Wikidata: Lydia Pintscher, Abraham Taherivand, Daniel Kinzler, Jeroen De Dauw, Katie Filbert, Tobias Gritschacher, Jens Ohlig, John Blad, Daniel Werner, Henning Snater, and Silke Meyer.

And thank you for all these amazing pictures of cakes for Wikidata's birthday. (And if you're curious what is coming next: we are working on Wikifunctions and Abstract Wikipedia, in order to allow more people to contribute more knowledge to even more people!)

Markus Krötzsch ISWC 2022 keynote

A brilliant keynote by Markus Krötzsch for this year's ISWC.

"The era of standard semantics has ended"

Yes, yes! 100%! That idea was in the air for a long time, but Markus really captured it in clear and precise language.

This talk is a great birthday present for Wikidata's ten year anniversary tomorrow. Over the last years, the Wikidata community has defined numerous little pockets of semantics for various use cases, shared SPARQL queries to capture some of those, and identified and shared constraints and reasoning patterns. And Wikidata connects to thousands of external knowledge bases and authorities, each with their own constraints - which is only feasible since we can, in a much more fine grained way, use the semantics we need for a given context. The same's true for the billions of Schema.org triples out there, and how they can be brought together.

The middle part of the talk goes into theory, but make sure to listen to the passionate summary at 59:40, where he emphasises shared understanding, that knowledge is human, and the importance of community.

"Why have people ever started to share ontologies? What made people collaborate in this way?" Because knowledge is human. Because knowledge is often more valuable when it is shared. The data available on the Web of linked data, including Wikidata, Data Commons, Schema.org, can be used in many, many ways. It provides a common foundation of knowledge that enables many things. We are far away from using it to its potential.

A remark on triples, because I am still thinking too much about them: yes to Markus's comments: "The world is not triples, but we make it triples. We break down the world into triples, but we don't know how to rebuild it. What people model should follow the technical format is wrong, it should be the other way around" (rough quotes)

At 1:17:56, Markus calls back to our discussions of the Wikidata data model in 2012. I remember how he was strongly advocating for more standard semantics (as he says), and I was pushing for more flexible knowledge representations. It's great to see the synthesis in this talk.

Karl-Heinz Witzko

I had heard incredibly good things about the DSA adventure "Jenseits des Lichts". But also that it is very hard to run as a game master. I asked Karl-Heinz Witzko, the author of the adventure, about it, and he said he would run it for me. We just had to find a time.

Whenever we met, we promised each other that we would find time for it. I had bought the book, but of course had not read it, and I was always very curious what the adventure was all about.

Karli contributed his very own, unique voice to DSA. A work like DSA, a world like Aventurien, does not spring from the mind of a single person; hundreds created it and contributed to it. And Karli's voice had a humor all of its own, and extended the world with perspectives and peculiarities that would otherwise never have been discovered. I read his novels with many a smile, and I played and explored his solo adventures gladly and repeatedly; only his one group adventure I did not know. After his time with DSA, Karli wrote further novels and created further worlds.

On September 29, 2022, Karli left us. The name Karl-Heinz Witzko was struck from the "Book of the Present" and entered into the "Book of the Absent". Following old custom on Maraskan, Karli is now given the Sixteen Counsels for his journey, and the Sixteen Demands are made of him. I would have loved to hear or read what Karli would have made of these.

Thank you for your words. Thank you for your time. Thank you for your humor.

Today I opened "Jenseits des Lichts" and began to read.

RIP Steve Wilhite

RIP Steve Wilhite, who worked on CompuServe chat for decades and was the lead of the CompuServe team that developed the GIF format, which is still widely used, and which made the World Wide Web a much more colorful and dynamic place by having a format that allowed for animations. Wilhite incorrectly insisted on GIF being pronounced Jif. Wilhite died on March 14, 2022 at the age of 74.

RIP Christopher Alexander

RIP Christopher Alexander, probably the most widely read actual architect in all of computer science. His work, particularly his book "A Pattern Language", was popularized, among others, by the Gang of Four and the Design Patterns work, and is frequently read and cited in Future of Programming and UX circles for the idea that everyone should be able to create, but in order to enable them, they need patterns that make creation possible. His work inspired Ward Cunningham when developing wikis and Will Wright when developing that most ungamelike of games, Sim City. Alexander died on March 17, 2022 at the age of 85.

Ante Vrandečić (1919-1944)

I knew that my father was named for his uncle. His other brother told me about him, and he told me that he became a prisoner of war and that they lost track of him. Back then, I didn't dare to ask on which side he was fighting, and when I would have dared to ask, it was too late.

Today, thanks to the increasing digitalisation of older sources and their publication on the Web and the Web being indexed, I accidentally stumbled upon a record about him in a three-thousand-page book, Volume 8 of the "Victims of the War 1941-1945" (Žrtve rata 1941-1945).

He was a soldier in the NOV i POJ (Yugoslav partisans), became a prisoner of war, and was killed by Germans during a transport in 1944. I don't know where he was captured, from where to where he was transported, where he was killed.

My father, his namesake, then moved to Germany in the 1970s, where he and my mother built a new life for themselves and their children, and where I was born.

I have a lot of complicated emotions and thoughts.

A quick draft for a curriculum for Computer Science

The other day, on Facebook, I asked who would come closest to being a popularizer of ideas in Computer Science for a wider audience, which led to an interesting and insightful discussion.

Pat Hayes asked what I would consider the five (or so) core concepts of Computer Science. Ernest Davis answered with the following short list (not in any particular order):

  1. Virtual machine
  2. Caching
  3. Algorithm
  4. Data structure
  5. Programming language

And I followed up with this drafty, much longer answer:

  1. how and why computation works; that a computation is a mapping from your problem domain into some machine state, then we have some automatic movement, and the result represents an answer to your question; that it is always layers of interpretation; that it doesn't matter whether the computing machine is made of ICs or of levers, marbles, and gravity (i.e. what is a function); that computation is always real and you can't simulate computation; what can be done with computation and what cannot; computational thinking - this might map to number 1 in Ernest's list
  2. that everything can be represented with zeros and ones, but doesn't have to be; it could also be represented by A and B and Cs, and many other ways; that two states are simply convenient for electric devices; that all information, all data, all input to all computation, and the steps for computations themselves are represented with zeros and ones (i.e. the von Neumann architecture and binary encoding); what can be represented in this paradigm and what cannot - this might map to number 4 in Ernest's list
  3. how are functions encoded; how many different functions can have the same results; how wildly different in efficiency functions can be even when they have the same result; why that makes some things quick to calculate whereas others take a long time; basically smearing ideas from lambda calculus and assembler and building everything from NAND circuits; why this all maps to higher level languages such as JavaScript - this might map to ideas from 2, 3, and 5 on Ernest's list
  4. bringing it back to the devices; where does, physically, the computation happen, where is physically the data stored, and why it matters in terms of privacy, equity, convenience, economics, interdependence, even freedom and independence; what kind of computations and data storage we can expect to have in our mobile phones, in a data center, in an RFID card; how long the turnaround times are in each case; how cryptography works and what kind of guarantees it can provide; why centralization is so alluring and what the price of that might be; and what might be the cost of computation for the environment
  5. given our times, and building on the previous lessons, what is the role of machine learning; how does it actually work, why does it work as well as it does, and why does it not work when it doesn't and where can't it work; what does this have to do with "intelligence", if it does; what becomes possible because of these methods, and what it costs; why these methods may reinforce inequities; but also how they might help us with significantly increasing access to better health care for many people and allow computers to have much more intuitive interfaces and thus democratize access to computing resources

I think the intuitions in 1, 2, and maybe 3 are really the core of computer science, and then 4 and 5 provide shortcuts to important questions for ourselves and society that, I think, would be worthwhile for everyone to ponder and have an informed understanding of the situation so that they can meaningfully make relevant decisions.
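
To make the point in number 3 a bit more concrete, that functions with the same result can differ wildly in efficiency, here is a small sketch in Python. It is just one possible classroom example, not part of the original discussion: two ways to compute the same Fibonacci number, each counting its own steps. The function names and the step counter are made up for illustration.

```python
def fib_recursive(n, counter):
    """Naive recursion: recomputes the same subproblems over and over."""
    counter[0] += 1  # count every call
    if n < 2:
        return n
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_iterative(n, counter):
    """Simple loop: carries the last two values along, one step per index."""
    a, b = 0, 1
    for _ in range(n):
        counter[0] += 1  # count every loop iteration
        a, b = b, a + b
    return a


if __name__ == "__main__":
    slow_steps, fast_steps = [0], [0]
    slow = fib_recursive(20, slow_steps)
    fast = fib_iterative(20, fast_steps)
    print("same result:", slow == fast)
    print("recursive calls:", slow_steps[0])
    print("loop iterations:", fast_steps[0])
```

Both functions describe the same mapping from input to output, but the recursive one needs thousands of calls where the loop needs twenty iterations, and the gap grows quickly with larger inputs.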

The Strange Case of Booker T. Washington’s Birthday

A lovely geeky essay about how much work a single edit to Wikipedia can be. I went down this kind of rabbit hole myself more than once, and so I very much enjoyed the essay.

Wordle is good and pure

The nice thing about Wordle - whether you play it or not, whether you like it or not - is that it is one of those good, pure things the Web was made for. A simple Website, without ads, popups, monetization, invasive tracking, etc.

You know, something that can chiefly be done by someone who already has a comfortable life and won't regret not having monetized this. The same way scientists have mainly been "gentleman scientists". Or tenured professors who spent years on writing novels.

And that is why I think that we should have a Universal Basic Income. To unlock that creativity. To allow for ideas from people who are not already well off to see the light. To allow for a larger diversity of people to try more interesting things.

Thank you for coming to my TED talk.

P.S.: on January 31, five days after I wrote this text, Wordle was acquired by the New York Times for an undisclosed seven-digit sum. I think that is awesome for Wardle, the developer of Wordle, and I still think that what I said was true at that time and still mostly is, although I expect the Website now to slowly change to have more tracking, branding, and eventually a paywall.

Meat Loaf

"But it was long ago
And it was far away
Oh God, it seemed so very far
And if life is just a highway
Then the soul is just a car
And objects in the rear view mirror may appear closer than they are."

Bat out of Hell II: Back into Hell was the first album I really listened to, over and over again. Where I translated the songs to better understand them. Paradise by the Dashboard Light is just a fun song. He was in cult classic movies such as The Rocky Horror Picture Show, Fight Club, and Wayne's World.

Many of the words we should remember him for are by Jim Steinman, who died last year and wrote many of the lyrics that became famous as Meat Loaf's songs. Some of Meat Loaf's own words are better not remembered.

Rock in Peace, Meat Loaf! You have arrived at your destination.

Map of current Wikidata edits

The map starts entirely black and then listens to Wikidata edits. Every time an item with a coordinate is edited, a blue dot is drawn in the corresponding place. So slowly, over time, you get a more and more complete map of Wikidata items.

If you open the developer console, you can get links and names of the items being displayed.

The whole page is less than a hundred lines of JavaScript and HTML, and it runs entirely in the browser. It uses the Wikimedia Stream API and the Wikidata API, and has no code dependencies. Might be fun to take a look if you're so inclined.

https://github.com/vrandezo/wikidata-edit-map/blob/main/index.html
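
The actual page is written in JavaScript; what follows is only a rough Python sketch of the two small steps at its heart, under stated assumptions: pulling a coordinate out of an item's JSON as returned by the Wikidata API (P625 is the coordinate location property), and turning latitude and longitude into a position on the canvas. The plain equirectangular projection, the function names and the sample item are assumptions made for this example, not taken from the repository.

```python
def extract_coordinate(entity):
    """Return (latitude, longitude) from the first usable P625 claim, or None."""
    for claim in entity.get("claims", {}).get("P625", []):
        # Claims with "novalue" or "somevalue" snaks carry no datavalue.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value is not None:
            return value["latitude"], value["longitude"]
    return None


def to_pixel(latitude, longitude, width, height):
    """Map a coordinate to canvas pixels (equirectangular projection, an assumption)."""
    x = (longitude + 180.0) / 360.0 * width   # -180..180 degrees -> 0..width
    y = (90.0 - latitude) / 180.0 * height    # 90..-90 degrees -> 0..height
    return x, y


if __name__ == "__main__":
    # Made-up item in the shape of the Wikidata JSON format, for illustration only.
    sample_item = {
        "claims": {
            "P625": [
                {"mainsnak": {"datavalue": {"value": {"latitude": 45.0, "longitude": 90.0}}}}
            ]
        }
    }
    coordinate = extract_coordinate(sample_item)
    if coordinate is not None:
        print(to_pixel(*coordinate, width=360, height=180))
```

Everything else on the page is plumbing: receive an edit event from the stream, fetch the item, and draw the dot.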

White's illusion

I stumbled upon "White's Illusion" and was wondering - was this named after a person called White, or was this named because, well it is an illusion where the colour white plays an important role?

As usual in this case, I started at Wikipedia's article on White's illusion. But Wikipedia didn't answer that question. The references at the bottom also didn't list anyone named White. So I started guessing it's about the colour.

But wait! Skimming the article there was a mention of "White and White (1985)" - but without any further citation information. So not only one White but two of them!

Google Scholar and Semantic Scholar didn't help me resolve "White and White (1985)" to a proper paper, so I started suspecting that this was a prank that someone entered into the article. I started checking the other references, but they indeed reference papers by White! And with those more complete references I was able to find out that Michael White and Tony White wrote that 1985 paper, that they are both Australian, that Michael White wrote a number of other papers about this illusion and others, and that this is Tony White's only paper.

I added some of the info to the article, but that was a weird ride.

She likes music, but only when the music is loud

Original in German by Herbert Grönemeyer, 1983.

She sits on her windowsill all day
Her legs dangling to the music
The noise from her room
drives all the neighbours mad
She is content
smiles merrily

She doesn't know
that snow
falls
without a sound
to the ground

Doesn't notice
the knocking
on the wall

She likes music
but only
when the music is loud
When it hits her stomach
with the sound

She likes music
but only
when the music is loud
When her feet feel
the shaking ground

She then forgets
that she is deaf

The man of her dreams
must play the bass
the tickling in her stomach
drives her crazy

Her mouth seems
to scream
with happiness
silently
her gaze removed
from this world

Her hands don't know
with whom to talk
No one's there
to speak to her

She likes music
but only
when the music is loud
When it hits her stomach
with the sound

She likes music
but only
when the music is loud
When her feet feel
the shaking ground

A sermon on tolerance and inclusion

Warning: meandering New Year's sermon ahead, starting at a random point and going somewhere entirely else.

I started reading Martin Kay's book on Translation, and I am enjoying it quite a bit so far. Kay passed away in August 2021. His work seems highly relevant for the work on Abstract Wikipedia.

One thing that bummed me though is that for more than a page in the introduction he rants about pronouns and how he is going to use "he" to generically mean both men and women, and how all other solutions have deficits.

He culminates in the explanation: "Another solution to this problem is which is increasing in popularity, is to use both 'he' and 'she', shifting between them more or less randomly. So we will sometimes get 'When a translator is confronted with a situation of this kind, she must decide...'. The trouble with this is that some readers, including the present writer, reacts quite differently to the sentence depending on which version of the generic pronoun it contains. We read the one containing 'he' smoothly and, all else being equal, assimilate the intended meaning. Encountering the one with 'she', on the other hand, is like following a television drama that is suddenly interrupted by a commercial."

Sooo frustratingly close to getting it.

I wish he had just not spent over a page on this topic, but simply used the generic 'he' in the text, and that's it. I mean, I don't expect everyone born more than eighty years ago to adjust to the modern usage of pronouns.

Now, I am not saying that to drag Kay's name through dirt, or to get him cancelled or whatever. I have never met him, but I am sure he was a person with many positive facets, and given my network I wouldn't be surprised if there are people who knew him and can confirm so. I'm also not saying that to virtue signal and say "oh man, look how much more progressive I am". Yes, I am slightly annoyed by this page. Unlike many others though, I am not actually personally affected by it - I use the pronoun "he" for myself and not any other pronoun, so this really is not about me. Is it because of that that it is easy for me to gloss over this and keep reading?

So is it because I am not affected personally that it is so easy for me to say the following: it is still worthwhile to keep reading his work, and the rest of the book, and to build on top of his work and learn from him. The people we learn some things from, the influences we accept, they don't have to be perfect in every way, right? Would it have been as easy for me to say that if I were personally affected? I don't know.

I am worried about how quickly parts of society seem to be ready to "cancel" and "call out" people, and how willing they are to tag a person as unacceptable because they do not necessarily share every single belief that is currently regarded as a required belief.

I have great difficulties in drawing the line. Which beliefs or actions of a person should be sufficient grounds to shun them or their work? When JK Rowling doubles down on her stance regarding trans women, is this enough to ask everyone to drop all interest in the world she created and the books she wrote? Do we reshoot movie scenes such as the cameo of Donald Trump in Home Alone 2 in order to "purify" the movie and make it acceptable for our new enlightened age again? When Johnny Depp was accused of domestic abuse, does he need to be recast in movies he had already been signed on for? Do we also need to stop watching his previous movies? Do the believable accusations of child abuse against Marion Zimmer Bradley mean that we have to ignore her contributions to feminist causes, never mind her books? Should we stop using a font such as Gill Sans because of the sexual abuse Eric Gill committed against his daughters? Do we have to stop watching movies or listening to music produced by murderers such as OJ Simpson, Phil Spector, or Johnny Lewis?

I intentionally escalated the examples, and they don't compare at all to Kay's defence of his usage of pronouns.

I offer no answers as to where the line should be, I have none. I don't know. In my opinion, none of us is perfect, and none of our idols, paragons, or example model humans will survive the scrutiny for perfection. This is not a new problem. Think of Gandhi, Michael Jackson, Alice Schwarzer, Socrates - no matter where you draw your idols from, they all come with imperfections, sometimes massive ones.

Can we keep and accept their positive contributions - without ignoring their faults? Can we allow people with faults to still continue to contribute their skills to society, or do we reduce them to their faults and negatives? Do we have to get someone fired for tweeting a stupid joke? Do we demand perfection from everyone at all times?

Or do we allow everyone to be human, make and have errors, and have beliefs many don't deem acceptable? Committing or causing actions resulting from these beliefs? Even if these actions and beliefs hurt or endanger people, or deny the humanity of others? We don't have to and should not accept their racism, sexism, homo- and transphobia - but can and should we still recognise their other contributions?

I am worried about something else as well. By pushing out so many because of the one thing they don't want to accept in the basket of required beliefs, we push them all into the group of outsiders. But if there are too many outsiders, the whole system collapses. Do we all have to have the same belief on guns, on climate, on gender, on abortion, on immigration, on race, on crypto, on capitalism, on housing? Or can we integrate and work together even if we have differences?

The vast majority of Americans think that human-caused climate change is real and that we should act to avoid it. Only 10% don't. And yet, because of the way we define and fence our in- and outgroups, we have a strong voting bloc that repeatedly leads to outright sabotage of effective measures. A large majority of Americans support the right to abortion, but you would never be able to tell given the fights around laws and court cases. Taxing billionaires more effectively is highly popular with voters, but again these majorities fizzle away and don't translate to the respective changes in the tax code.

I think we should be able to work together with people we don't agree with on everything. We should stop requiring perfection and alignment on all issues before moving forward. But then again, that's what I am saying, and I am saying it from a position of privilege, am I not? I am male. I am White. I am heterosexual. I am not Muslim or Jewish. I am well educated. I am not poor. I am reasonably technologically savvy. I am not disabled. What right do I have at all to voice my opinion on these topics? To demand acceptance for people with beliefs that hurt or endanger people who are not like me? Or even to ask for your precious attention for these words of mine?

None.

And yet I hope that we will work together towards progress on the topics we agree on, that we will enlighten each other on the topics we disagree on, and that we will be able to embrace more of us on our way into the future.

P.S.: this post is problematic and not very well written, and I recognise that. Please refer to the discussion about it on Facebook.

Long John and Average Joe

You may know about Long John Silver. But who's the longest John? Here's the answer according to Wikidata: https://w.wiki/4dFL

What about your Average Joe? Here's the answer about the most average Joe, based on all the Joes in Wikidata: https://w.wiki/4dFR

Note, the average height of a Joe in Wikidata is 1.86 m or 6'1", which is quite a bit higher than the average height in the population. This is a data collection and coverage issue: it is much more likely to have the height for a basketball player than for an author in Wikidata.

Just two silly queries for Wikidata, which are nice ways to show off the data set and what one can do with the SPARQL query endpoint. Especially the latter one shows off a rather interesting and complex SPARQL query.
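For readers who don't want to open the query editor, here is a minimal sketch in Python of the idea behind the two queries. It is not the SPARQL behind the links above: the names and heights are made up for illustration, whereas the real queries read given names and heights from Wikidata. The "most average Joe" is simply the Joe whose height is closest to the mean height of all Joes.

```python
from statistics import mean

# Hypothetical sample data (name, height in cm); not taken from Wikidata.
joes = [
    ("Joe A", 170.0),
    ("Joe B", 186.0),
    ("Joe C", 201.0),
    ("Joe D", 187.0),
]


def tallest(people):
    """Return the (name, height) pair with the largest height."""
    return max(people, key=lambda person: person[1])


def most_average(people):
    """Return the (name, height) pair whose height is closest to the mean."""
    average_height = mean(height for _, height in people)
    return min(people, key=lambda person: abs(person[1] - average_height))


print("Longest:", tallest(joes))
print("Most average:", most_average(joes))
```

In SPARQL the second step is the trickier one, presumably because the average has to be computed in a subquery before every Joe can be compared against it.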

Temperatures in California

It has been a bit chillier the last few days. I noticed that after almost a decade in California, I feel pretty comfortable with understanding temperatures in Fahrenheit - as long as they are over 60° F. If it is colder, I need to switch to Celsius in order to understand how cold it exactly is. I have no idea what 40° or 45° or 50° F are, but I still know what 5° C is!

The fact that I still haven't acclimatised to Fahrenheit for the cooler temperatures tells you a lot about the climate in California.
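For those cooler days, the conversion I keep doing in my head can be written down as a tiny Python helper. The function name is just my choice for this sketch; the formula is the standard one, subtracting 32 and multiplying by 5/9.

```python
def fahrenheit_to_celsius(fahrenheit):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


# The temperatures mentioned above, rounded to one decimal place.
for f in (40, 45, 50, 60):
    print(f"{f}° F is about {fahrenheit_to_celsius(f):.1f}° C")
```

So the 5° C that I do understand corresponds to 41° F.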

SWSA panel

Thursday, October 7, 2021, saw a panel of three founding members of the Semantic Web research community, who each have been my teachers and mentors over the years: Rudi Studer, Natasha Noy, and Jim Hendler. I loved watching the panel and enjoyed it thoroughly, also because it was just great to see all of them again.

There were many interesting insights and thoughts in this panel, too many to write them all down, but I want to mention a few.

It was interesting how much all panelists talked about creating the Semantic Web community, and how much of an intentional effort that was. Deciding that it needs a conference, a journal, an organization, setting those up, and their interactions. Seeing and fostering a sustainable research community grow out of an idea is a formidable and amazing effort. They all mentioned positively the diversity in the community, and that it was a conscious effort to work towards that. Rudi mentioned that the future challenge will be ensuring that computer science students actually have Semantic Web technologies integrated into their standard curriculum.

They named a number of the successes that were influenced by the Semantic Web research work, such as Schema.org, the heavy use of SPARQL in supercomputing (I had no idea!), Wikidata (thanks for the shout out, Rudi!), and the development of scalable graph databases. Natasha raised the advantage of having common identifiers throughout an organization, i.e. that everyone refers to California the same way. They also named areas that remained elusive and in which they expect to see progress in the coming years: Rudi in particular mentioned Agents and Common Sense, which was echoed by the other participants, and Jim mentioned Personal Knowledge Graphs. Jim mentioned he was surprised by the growing importance of unstructured data. Jim is also hoping for something akin to “procedural attachments” - you see some new data coming in, you perform this action (I would like to think that a little Wikifunctions goes a long way).

We need both, open knowledge graphs and closed knowledge graphs (think of your personal ones, but also the ones by companies).

The most important contribution so far and also well into the future was the idea of decentralization of semantics. To allow different stakeholders to work asynchronously and separately on parts of the semantics and yet share data. This also includes the decentralization of knowledge graphs, but also in the future we will encounter a world where semantics are increasingly brought together and yet decentralized.

One interesting anecdote was shared by Natasha. She was talking about a keynote by Guha (one of the few researchers who were namechecked in the panel, along with Tim Berners-Lee) at ISWC in Sydney 2013. How Guha was saying how simple the technology needs to be, and how there were many in the audience who were aghast and shocked by the talk. Now, eight years later and given her experience building Dataset Search, she appreciates the insights. If they have a discussion about a new property for longer than five minutes, they drop it. It’s too complicated, and people will use it wrong so often that the data cleanup will become expensive.

All of them shared the advice for researchers in their early career stage to work on topics that truly inspire them, on problems that are real and that they and others care about, and that if they do so, the results have the best chance to have impact. Think about problems you can explain to people not in your field, about “how can we use triples to save the world” - and not just about “hey, look, that problem that we solved with these other technologies previously, now we can also solve it with Semantic Web technologies”. This doesn’t really help anyone. Solve new problems. Solve real problems. And do what you are truly passionate about.

I enjoyed the panel, and can recommend it to everyone in the Semantic Web research area, or in any related, nearby research area. Thanks to the organizers for this talk (which is the first session in a series of talks that will continue with Ora Lassila in early December).


Our four freedoms for our technology

(This is a draft. Comments are welcome. This is not meant as an attack on any person or company individually, but on certain practices that are becoming increasingly prevalent.)

We are not allowed to use the devices we paid for in the ways we want. We are not allowed to use our own data in the way we want. We are only allowed to use them in the way the companies who created the devices and services allow us.

Sometimes these companies are nice and give us a lot of freedom in how to use the devices and data. But often they don’t. They close them down for all kinds of reasons. They may say it is for your protection and safety. They might admit it is for profit. They may say it is for legal reasons. But in the end, you are buying a device, or you are creating some data, and you are not allowed to use that device and that data in the way you want to, you are not allowed to be creative.

The companies don’t want you to think of the devices that you bought and the data that you created as your devices and your data. They want you to think of them as black boxes that offer you services they create for you. They don’t want you to think of a Ring doorbell as a camera, a microphone, a speaker, and a button, but they want you to think of it as providing safety. They don’t want you to think of the garage door opener as a motor and a Bluetooth module and a Wi-Fi module, but as a garage door opening service, and the company wants to control how you are allowed to use that service. Companies like Chamberlain and SkyLink and Genie don’t allow you to write a tool to check on your garage door, and to close or open it, but they make deals with Google and Amazon and Apple in order to integrate these services into their digital assistants, so that you can use it in the way these companies have agreed on together, through the few paths these digital assistants make available. The digital assistant that you buy is not a microphone and a speaker and maybe a camera and maybe a screen that you buy and use as you want, but you buy a service that happens to have some technical ingredients. But you cannot use that screen to display what you want. Whether you can watch your Amazon Prime show on the screen of a Google Nest Hub depends on whether Amazon and Google have an agreement with each other, not on whether you have paid for access to Amazon Prime and you have paid for a Google Nest Hub. You cannot use that camera to take a picture. You cannot use that speaker to make it say something you want it to say. You cannot use the rich plethora of services on the Web, and you cannot use the many interesting services these digital assistants rely on, in novel and creative combinations.

These companies don’t want you to think of the data that you have created and that they have about you as your data. They don’t want you to think about this data at all. They just want you to use their services in the way they want you to use their services. On the devices they approve. They don’t want you to create other surfaces that are suited to the way you use your data. They don’t want you to decide on what you want to see in your feed. They don’t want you to be able to take a list of your friends and do something with it. They will say it is to protect privacy. They will say that it is for safety. That is why you cannot use the data you and your friends have created. They want to exactly control what you can and cannot do with the data you and your friends have created. They want to control how many ads you must see in order to be allowed to see your friends’ posts. They don’t want anyone else to have the ability to provide you with creative new interfaces to your feed. They don’t want you yourself to have the ability to look at your feed and do whatever you want with it.

Those are devices you paid for.

These are data you and your friends have created.

And more and more we are losing our freedom of using our devices and our data as we like.

It would be impossible to invent email today. It would be impossible to invent the telephone today. Both are protocols that allow everyone to communicate with anyone no matter what their email provider or their phone is. Try reading your friend’s Facebook feed on Instagram, or sending a direct message from your Twitter account to someone on WhatsApp, or calling your Skype contact on FaceTime.

It would be impossible to launch the Web today - many companies don’t want you browsing the Web. They want you to be inside of your Facebook feed and consume your content there. They want you to be on your Twitter feed. They don’t want you to go to the Website of the New York Times and read an article there, they don’t want you to visit the Website of your friend and read their blog there. They want you to stay on their apps. Per default, they open Websites inside their app, and not in your browser, so you are always within their app. They don’t want you to experience the Web. The Web is dwindling and all the good things on it are being recut and rebundled within the apps and services of tech companies.

Increasingly, we are seeing more and more walls in the world. Already, it is becoming impossible to pay for and watch certain movies and shows without buying into a full subscription to a service. We will likely see the day where you will need a specific device to watch a specific movie. Where the only way to watch a Disney+ exclusive movie is on a Disney+ tablet. You don’t think so? Think about how easy it is to get your Kindle books onto another e-book reader. How do you enable a skill or capability available in Alexa on your Nest smart speaker? How can you search through the books that you bought and are in your digital library, besides by using a service provided by the company that allows you to search your digital library? When you buy a movie today on YouTube or on iMovies, what do you own? What are you left with when the companies behind these services close that service, or go out of business altogether?

Devices and content we pay for, data we and our friends create, should be ours to use in empowering and creative ways. Services and content should not be locked in with a certain device or subscription service. The bundling of services, content, devices, and locking up user data creates monopolies that stifle innovation and creativity. I am not asking to give away services or content or devices for free, I am asking to be allowed to pay for them and then use them as I see fit.

What can we do?

As far as I can tell, the solution, unfortunately, seems to be to ask for regulation. The market won’t solve it. The market doesn’t solve monopolies and oligopolies.

But don’t ask to regulate the tech giants individually. We don’t need a law that regulates Google and a law that regulates Apple and a law that regulates Amazon and a law to regulate Microsoft. We need laws to regulate devices, laws to regulate services, laws to regulate content, laws that regulate AI.

Don’t ask for Facebook to be broken up because you think Mark Zuckerberg is too rich and powerful. Breaking up Facebook, creating Baby Books, will ultimately make him and other Facebook shareholders richer than ever before. But breaking up Facebook will require the successor companies to work together on a protocol to collaborate. To share data. To be able to move from one service to another.

We need laws that require that every device we buy can be made fully ours. Yes, sure, Apple must still be allowed to provide us with the wonderful smooth User Experience we value Apple for. But we must also be able to access and share the data from the sensors in our devices that we have bought from them. We must be able to install and run software we have written or bought on the devices we paid for.

We need laws that require that our data is ours. We should be able to download our data from a service provider and use it as we like. We must be allowed to share with a friend the parts of our data we want to share with that friend. In real time, not in a dump download hours later. We must be able to take our social graph from one social service and move to a new service. The data must be sufficiently complete to allow for such a transfer, and not crippled.

We need laws that require that published content can be bought and used by us as we like. We should be able to store content on our hard disks. To lend it to a friend. To sell it. Anything I can legally do with a book I bought I must be able to legally do with a movie or piece of music I bought online. And just as with a book, you would not be allowed to give away copies if the work you bought still enjoys copyright.

We need laws that require that services and capabilities are unbundled and made available to everyone. Particularly as technological progress with regard to AI, quantum computing, and providing large amounts of compute becomes increasingly an exclusive domain for trillion dollar companies, we must enable other organizations and people to access these capabilities, or run the risk that sooner or later all and any innovation will be happening only in these few trillion dollar companies. Just because a company is really good at providing a specific service cheaply, it should not be allowed to unfairly gain advantage in all related areas and products and stifle competition and innovation. This company should still be allowed to use these capabilities in their products and services, but so should anyone else, fairly priced and accessible by everyone.

We want to unleash creativity and innovation. In our lifetimes we have seen the creation of technologies that would have been considered miracles and impossible just decades ago. These must belong to everybody. These must be available to everyone. There cannot be equity if all of these marvellous technologies can only be wielded by a few companies on the West coast of the United States. We must make them available to all the people of the world: the people of the Indian subcontinent, the people of Sub-Saharan Africa, the people of Latin America, and everyone else. They all should own the devices they paid for, the data they created, the content they paid for. They all should have access to the same digital services and capabilities that are available to the engineers at Amazon or Google or Microsoft. The universities and research centers of the world should be able to access the same devices and services and extend them with their novel and creative ideas. The scrappy engineers in Eastern Europe and India and Nigeria and Central Asia should be able to call the AI models trained by Google and Microsoft and use them in novel ways to run their devices and chip-powered cars and agricultural machines. We want a world of freedom, tinkering, where creativity and innovation are unleashed, and where everyone can contribute their ideas, their creativity, and where everyone can build their fortune.