30 years of wikis

25 March 2025

Today is the 30th anniversary of the launch of the first wiki by Ward Cunningham. A page that anyone could edit. Right from the browser. It was generally seen as a bad idea. What if people did bad things?

Originally created to support the software development community in building a repository of software design patterns, wikis were later used for many other purposes (even an encyclopedia!), and became part of the recipe, together with blogs, fora and early social media, that was considered the Web 2.0.

Thank you, Ward, and congratulations on the first 30 years.

A wiki birthday card is being collected on Wikiindex.

Simia

My thoughts on Alignment research

2 March 2025

Alignment research seeks to ensure that hypothetical future superintelligent AIs will be beneficial to humanity—that they are "aligned" with "our goals," that they won’t turn into Skynet or universal paperclip factories.

But these AI systems will be embedded in larger processes and organizations. And the problem is: we haven’t even figured out how to align those systems with human values.

Throughout history, companies and institutions have committed atrocious deeds—killing, poisoning, discriminating—sometimes intentionally, sometimes despite the best intentions of the individuals within them. These organizations were composed entirely of humans. There was no lack of human intelligence that could have recognized and tempered their misalignment.

Sometimes, misalignment was prevented. When it was, we might have called the people responsible heroes—or insubordinate. We might have awarded them medals, or they might have lost their lives.

Haven’t we all witnessed situations where a human, using a computer or acting within an organization, seemed unable to do the obvious right thing?

Yesterday, my flight to Philadelphia was delayed by a day. So I called the hotel I had booked to let them know I’d be arriving later.

The drama and the pain the front desk clerk went through!

“If you don’t show up today,” he told me, “your whole reservation will be canceled by the system. And we’re fully booked.”

“That’s why I’m calling. I am coming—just a day later. I’m not asking for a refund.”

“No, look, the system won’t let me just cancel one night. And I can’t create a new reservation. And if you don’t check in today, your booking will be canceled…”

And that was a minor issue. The clerk wanted to help. It is a classic case of the "Computer says no" sketch from Little Britain. And yet, more and more decisions are being made algorithmically, decisions far more consequential than whether I’ll have a hotel room for the night. Decisions about mortgages and university admissions. Decisions about medical procedures. Decisions about clemency and prison terms. All handled by systems that are becoming increasingly "intelligent" and increasingly opaque. Systems in which human oversight is diminishing, for better and for worse.

For millennia, organizations and institutions have exhibited superhuman capabilities—sometimes even superhuman intelligence. They accomplish things no individual human could achieve alone. Though we often tell history as a story of heroes and individuals, humanity’s greatest feats have been the work of institutions and societies. Even the individuals we celebrate typically succeeded because they lived in environments that provided the time, space, and resources to focus on their work.

Yet we have no reliable way of ensuring that these superhuman institutions—corporations, governments, bureaucracies—are aligned with the broader goals of humanity. We know that laissez-faire policies have allowed companies to do terrible things. We know that bureaucracies, over time, become self-serving, prioritizing their own growth over their original purpose. We know that organizations can produce outcomes directly opposed to their stated missions.

And these misalignments happen despite the fact that these organizations are made up of humans—beings with whom we are intimately familiar. If we can’t even align them, what hope do we have of aligning an alien, inhuman intelligence? Or even a superintelligence?

More troubling still: why should we accept a future in which only a handful of trillion-dollar companies—the dominant tech firms of the Western U.S.—control access to such powerful, unalignable systems? What have these corporations done to earn such an extraordinary level of trust in a technology that some fear could be catastrophic?

What am I arguing for? To stop alignment research? No, not at all. But I would love for us to shift our focus to the short- and mid-term effects of these technologies. Instead of debating whether we might have to fight Skynet, we should be considering how to prevent further concentration of wealth by 2030 and how to ensure a fairer distribution of the benefits these technologies bring to humanity. Instead of worrying about Roko’s basilisk, we should examine the impact of LLMs on employment markets—especially given the precarious state of unions and labor regulations in certain countries. Rather than fixating on hypothetical paperclip-maximizing AIs, we should focus on the real and immediate dangers of lethal autonomous weapons in warfare and terrorism.

The Editors

2 February 2025

I finished reading "The Editors" by Stephen Harrison, and I really enjoyed it. The novel follows some crucial moments of Infopendium, a free, editable online encyclopedia with mostly anonymous contributors. Infopendium is a fictionalized version of Wikipedia, and the story is set around the beginning of the COVID pandemic.

The author is a journalist who has covered Wikipedia before, and has now written a novel. It's not a roman à clef - the events described here did not happen on Wikipedia, even though some of the characters feel very much inspired by real Wikipedia contributors. I constantly had people I know playing the roles of DejaNu, Prospero, DocMirza, and Telos in my inner cinema. And as the book continued I found myself apologizing in my mind to the real people, because they would never act as the characters in the book do.

There were some later scenes for which I had a lot of trouble suspending disbelief, but it's hard to say which ones without spoiling too much. Also, I'm very glad that the real world Wikipedia is far more technically robust than Infopendium seems to be.

I recommend reading it. It offers a fictional entry point to ideas like edit wars, systemic bias, the pushback against it, anonymous collaboration, community values, sock puppets, conflict of interest, paid editing, and more, and I also found it a good yarn, with a richly woven plot. Thanks for the book!

AI and centralization

31 January 2025

We have a number of big American companies with a lot of influential connections which have literally spent billions of dollars on developing large models. And then another company comes in and releases a similar product for free.

Suddenly, trillions of dollars are on the line. With their connections they can call for regulation, designed to protect their investment. They could claim that the free system is unsafe and dangerous, as Microsoft and Oracle were doing in the 90s with regard to open source. They could try to use and extend copyright once they have benefitted from the loose regulations, as Disney was doing in the 60s to 90s. They could increase the regulatory hurdles to enter the market. They could finance scientific studies, philosophers and ethicists to publish about the dangers and benefits of having this technology widely available, another playbook tobacco and oil companies have been following for decades.

It's about trillions of dollars. Some technology giants are seeing that opportunity to make easy money dissipate. They would love it if everyone had to use their models, running on their cloud infrastructure. They would love it if every little app made many calls to their services, sending a constant stream of money to them, if every piece of value created had an effective AI "tax" they would collect. In the 90s and 00s Microsoft made huge amounts of money through the OS "tax", then Apple and Google and Microsoft made huge amounts of money through the app store "tax". Amazon and Microsoft and Google and OpenAI would love to have a repeat of that business model.

I would expect a lot of soft and hard power to be pushed around in the coming months. Many old playbooks will be reused, but new playbooks will also be introduced. Unimaginable amounts of value and money can and will be made, but how it will be distributed is an utterly non-transparent process. I don't know what an effective way would be to avoid a highly centralized world, to ensure that the fruits of all this work are distributed just a little bit more equally, to have a world in which we all have a bit of equity in the value being created.

To state it clearly: I'm not afraid of a superintelligent AI that will turn us all into paperclips. I'm afraid of a world where a handful of people have centralized extreme amounts of power and wealth, and where most of us struggle with living a good life in dignity. I'm afraid of a world where we don't have a say anymore in what happens. I'm afraid of a world where we have effectively lost democracy and individual agency.

There is enough to go around to allow everyone to live a good life. And AI has the opportunity to add even more value to the world. But this will come with huge disruptions. How we distribute the wealth, value and power in the world is going to be one of the major questions of the 21st century. Again.

Languages with the best lexicographic data coverage in Wikidata 2024

23 January 2025

Languages with the best coverage as of the end of 2024

1. English 93.1% (=, +0.2%)
2. Italian 92.6% (+7, +9.7%)
3. Danish 92.3% (+3, +5.4%)
4. Spanish 91.8% (-2, +0.5%)
5. Norwegian Bokmål 89.4% (-2, +0.3%)
6. Swedish 89.3% (-2, +0.4%)
7. French 87.6% (-2, +0.6%)
8. Latin 85.7% (-1, -0.1%)
9. Norwegian Nynorsk 81.8% (+1, +1.6%)
10. Estonian 81.3% (-1, +0.1%)
11. German 79.6% (=, +0.1%)
12. Malay 77.8% (+2, +4.7%)
13. Basque 75.9% (-1, =)
14. Portuguese 74.9% (-1, +0.1%)
15. Panjabi 73.3% (=, +2.3%)
16. Breton 71.1% (+1, +3.8%)
17. Czech 69.3% (NEW, +6.1%)
18. Slovak 67.8% (-2, =)
19. Igbo 67.8% (NEW, +2.0%)

What does the coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. The first number in the parentheses is the change in rank compared to last year, and the second number is the change in coverage compared to last year.

The list contains all languages where the data covers more than two thirds of the selected corpus.
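
As a rough illustration of the coverage measure, here is a minimal Python sketch that counts the share of word occurrences in a text that are found in a set of known forms. The tokenizer, the toy corpus and the form set are hypothetical stand-ins, not the actual pipeline or data behind the numbers in this post.

```python
import re
from collections import Counter


def token_coverage(text: str, known_forms: set[str]) -> float:
    """Share of word occurrences in `text` that are found in `known_forms`.

    Assumption: tokens are lowercased runs of letters; the real evaluation
    may tokenize and normalize differently.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(n for token, n in counts.items() if token in known_forms)
    return covered / len(tokens)


# Hypothetical toy corpus and form list, for illustration only.
corpus = "The cat sat on the mat. The dog sat too."
forms = {"the", "cat", "sat", "on", "dog"}

coverage = token_coverage(corpus, forms)
print(f"coverage: {coverage:.1%}")
print("in the top list" if coverage > 2 / 3 else "below the two-thirds threshold")
```

Because occurrences are counted rather than distinct words, a frequent word like "the" contributes far more to the coverage than a rare one.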

English managed to keep the lead, but the distance to second place melted from 1.6% last year to a mere 0.5% this year. Italian and Danish made huge jumps forward, Italian by increasing coverage by almost 10 percentage points and rising seven ranks to second place. Compared to last year, two new languages, Czech and Igbo, cracked the ⅔ limit and joined the top list, with Hindi just behind at 66.5%.

The complete data is available on Wikidata.

Progress in lexicographic data in Wikidata 2024

22 January 2025

Here are some highlights of the progress in lexicographic data in Wikidata in 2024:

Hausa: jumped from 1.5% coverage right to 40%
Danish: Danish also made another huge jump forward, increasing the number of forms from 170k to 570k, form coverage from 33% to 52%, and token coverage from 87% to 92%
Italian: Italian made another huge push, increasing the number of forms from 290k to 410k, and the coverage from 83% to 93%
Spanish: Spanish also kept pushing forward, increasing the number of forms from 440k to 560k, and the coverage from 91.3% to 91.8%
Norwegian (Nynorsk): increased the number of forms from 67k to 88k, and coverage from 80% to 82%
Czech: increased the coverage from 63% to 69%, the number of forms from 190k to 210k
Tamil: almost doubled the number of forms from 3800 to 6600, increasing coverage from 8% to 11%
Breton: added 1000 new forms, increasing the coverage from 67% to 71%
Croatian: increased from 4k to 5.5k forms, improving coverage from 45% to 48%

What does the coverage mean? Given a text (usually Wikipedia in that language, but in some cases a corpus from the Leipzig Corpora Collection), it is the share of word occurrences in that text that are already represented as forms in Wikidata's lexicographic data. Note that every additional percent gets much more difficult than the previous one: an increase from 1% to 2% usually needs much, much less work than one from 91% to 92%.
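
To get a feeling for why later percentage points are harder, here is a small sketch under a strong simplifying assumption: that word frequencies follow an idealized Zipf distribution, where the word at rank r occurs proportionally to 1/r. The vocabulary size of 100,000 is an arbitrary choice, and real corpora (and real forms in Wikidata) will deviate from this model.

```python
from itertools import accumulate


def forms_needed(target: float, vocab_size: int = 100_000) -> int:
    """Number of most frequent words needed to reach `target` coverage,
    assuming an idealized Zipf distribution (frequency of rank r ~ 1/r)."""
    weights = [1 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    for count, cumulative in enumerate(accumulate(weights), start=1):
        if cumulative / total >= target:
            return count
    return vocab_size


def extra_forms(start: float, end: float) -> int:
    """Additional words needed to move coverage from `start` to `end`."""
    return forms_needed(end) - forms_needed(start)


print("1% -> 2%:  ", extra_forms(0.01, 0.02), "additional words")
print("91% -> 92%:", extra_forms(0.91, 0.92), "additional words")
```

In this toy model the most frequent word alone already covers more than 2% of the text, so the first step needs no new words at all, while the later step needs thousands.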

Wikidata lexicographic data coverage for Croatian in 2024

21 January 2025

Last year I set myself an ambitious goal for growing the lexicographic data for Croatian in 2024. And, just like the year before, I missed it.

My goal was to grow the coverage to 50% - i.e. half of all the words in a Croatian corpus would be found in Wikidata. Instead, we grew from 45.5% to 47.9%. The number of forms grew from 4115 to 5506, more than a thousand new forms, a far bigger growth in forms than last year. So, even though the goal was missed, the speed of growth in Croatian is accelerating.

Part of that growth in forms is due to Google's Wordgraph release, a free dataset with words in about 40 languages which describe people - both demonyms and professions.

Do I want to set a goal again? After missing it twice, I am hesitant. Should I reduce the goal even further? But less than 50% sounds defeatist. And going back to 60% is obviously too much. So, yes, let's go for 50% again. Let's see where it will take us this time. We are only 2.1 percentage points of coverage away from 50%, so that should be doable.

Large Language Models, Knowledge Graphs and Search Engines

16 January 2025

How can Large Language Models (LLMs), Knowledge Graphs and Search Engines be combined to best serve users? What are the strengths and limitations of these technologies?

Aidan Hogan (Universidad de Chile, previously DERI, Linked data), Luna Dong (Meta, previously Amazon and Google), Gerhard Weikum (MPI, Yago), and I (Wikimedia, previously Google) have been invited to give keynotes on this topic in the last year or two, at different conferences. Now we have written a paper together to synthesise and capture some of the ideas we were presenting.

Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users' Questions, arxiv.org/abs/2501.06699

Translating Nazor: The Man Who Lost a Button

23 December 2024

The most famous child of the island of Brač is very likely Vladimir Nazor. His books are part of the canon for Croatian children, and, as fate would have it, he also became the first head of state of Croatia during and after World War II.

In 1924, exactly a hundred years ago, he published "Stories from childhood", a collection of short stories. I took one of his stories from that collection and translated it into English, to make some of his work more accessible to more readers, and to see how I would do with such a translation.

I welcome you to read "The Man Who Lost a Button". Feedback, comments, and reviews are very welcome. I am also planning to make a translation into German, but I don't know how long that will take.

2024 US election

10 November 2024

Some thoughts on the US election.

Wrong theory: 2024 was lost because Harris voters stayed home

I first believed that Harris lost just because people stayed home compared to 2020. But that, by itself, is an insufficient explanation.

At first glance, this seems to hold water: currently, we have 71 million votes reported for Harris, and 75 million votes reported for Trump, whereas last time Biden got 81 million votes and Trump 74 million votes. 10 million fewer votes is enough to lose an election, right?

There are two things that make this analysis insufficient: first, California is really slow at counting, and it is likely that both candidates will have a few million votes more when all is counted. Harris already has more votes than any candidate ever had, besides Biden and Trump.

Trump already has more votes than he got in the previous two elections. In 2020, more people voted for Trump than in 2016. In 2024, more people voted for Trump than in 2020.

Second, let’s look at the states that switched from Biden to Trump:

Wisconsin and Georgia: both Trump and Harris got more votes than Trump and Biden, respectively, did in 2020
Pennsylvania, Nevada and Michigan: Trump already has more votes in 2024 than Biden had in 2020. Even if Harris had received the same number of votes as Biden did in 2020, she would have lost these states.
Arizona still hasn’t counted a sixth of its votes, and it is unclear where the numbers will end up. If we just extrapolate linearly, Arizona will comfortably be in one of the two buckets above.

Result: There is no state where Biden’s 2020 turnout would have made a difference for Harris. (With the possible but unlikely exception of Arizona, where the counting is still lagging behind.)

Yes, 10 million fewer votes for Harris than for Biden looks terrible and like a sufficient explanation, but 1) this is not the final result and it will become much tighter, and 2) it wouldn’t have made a difference.

California is slow at counting

I was really confused: why had California only reported two thirds of its votes so far? I found the article below, which explains some of it, but it really seems to be a home-made mess, and one that the state should clean up.

https://www.berkeleyside.org/2024/11/08/alameda-county-election-results-slow-registrar

Voting results in PDF instead of JSON

Voting results in Alameda County will be released as PDF instead of JSON. The Registrar of Voters “recently told the Board of Supervisors that he’s following guidance from the California Secretary of State, which is recommending the PDF format to better safeguard the privacy of voters.”

This statement is wrong. PDF does not safeguard the privacy of voters any better than JSON does. This statement is not just wrong, it doesn’t even make sense.

In 2022, thanks to the availability of the JSON files, a third-party audit found an error in one Alameda election, resulting in the wrong person being certified. “Election advocates say the PDF format is almost impossible to analyze, which means outside organizations won’t be able to double-check [...] [I]f the registrar had released the cast vote record in PDF format in 2022, the wrong person would still be sitting in an OUSD board seat.”
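
To illustrate why a machine-readable format matters for outside checks, here is a minimal sketch of tallying one contest from a JSON cast vote record. The schema (a list of ballots, each mapping a contest name to the chosen candidate) and all names are invented for this example; real CVR exports are structured differently, and an actual audit involves much more than this simple count.

```python
import json
from collections import Counter

# Hypothetical, simplified cast vote record; not the real export format.
cvr_json = """
[
  {"ballot_id": 1, "contests": {"School Board": "Candidate A"}},
  {"ballot_id": 2, "contests": {"School Board": "Candidate B"}},
  {"ballot_id": 3, "contests": {"School Board": "Candidate A"}},
  {"ballot_id": 4, "contests": {}}
]
"""


def tally(cvr_text: str, contest: str) -> Counter:
    """Count the votes per candidate for one contest."""
    ballots = json.loads(cvr_text)
    return Counter(
        ballot["contests"][contest]
        for ballot in ballots
        if contest in ballot["contests"]
    )


print(tally(cvr_json, "School Board").most_common())
```

With structured data, a recount like this takes a few lines. With a locked PDF, the same check would first require recovering the data from page layouts, which is laborious and error-prone.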

The county registrar is just following the California Secretary of State. According to a letter by the registrar: “If a Registrar intends to produce the CVR [Cast Vote Record], it must be in a secure and locked PDF format. The Secretary of State views this as a directive that must be followed according to state law. I noted that this format does not allow for easy data analysis. The Secretary of State’s Office explained that they were aware of the limitations when they issued this directive. [...] San Francisco has historically produced its CVR in JSON format, contrary to the Secretary of State's directive. The Secretary of State’s office has informed me that they are in discussions with San Francisco to bring them into compliance”.

It was not a decisive win

There are many analyses of why Harris lost the election, and many are going far overboard, often for political reasons, with the aim of influencing the platform of the Democratic party for the next election. This wasn’t a decisive win.

I wanted to make the argument that 30k voters in Wisconsin, 80k voters in Michigan, and 140k voters in Pennsylvania would have made the difference. And that’s true. I wanted to compare that with other US elections, and show that this is tighter than usual.

But it’s not. US elections are just often very tight. There are exceptions; the first Obama election was one. But in general, American elections are tight (I’ll define a tight election as one where flipping fewer than 0.5% of the voters would have elected a different president).
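
As a back-of-the-envelope check of this definition against the 2024 numbers, here is a small sketch using only the rounded, preliminary figures quoted in this post. It assumes that the per-state numbers are the voters who would have had to switch, and it uses the votes reported so far for the two main candidates as the total, so the final share will differ somewhat.

```python
# Figures taken from the post (rounded, preliminary counts).
voters_to_flip = {"Wisconsin": 30_000, "Michigan": 80_000, "Pennsylvania": 140_000}
votes_reported = 71_000_000 + 75_000_000  # Harris + Trump, as reported so far

TIGHT_THRESHOLD = 0.005  # 0.5% of voters, as in the definition above


def flip_share(flips: dict[str, int], total_votes: int) -> float:
    """Share of all voters that would have had to vote differently."""
    return sum(flips.values()) / total_votes


share = flip_share(voters_to_flip, votes_reported)
print(f"{sum(voters_to_flip.values()):,} of {votes_reported:,} voters = {share:.2%}")
print("tight" if share < TIGHT_THRESHOLD else "not tight")
```

As more votes are counted the denominator grows, which makes the share smaller and does not change the classification.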

I don’t know how advisable it is to base big decisions on a basically random outcome.
