Running out of text
Many of the available text corpora have by now been used for training language models. One untapped corpus so far have been our private messages and emails.
How fortunate that none of the companies that train large language models have access to humongous logs of private chats and emails, often larger than any other corpus for many languages.
How fortunate that those who do have well working ethic boards established, who would make sure that such requests are evaluated.
How fortunate that we have laws in place to protect our privacy.
How fortunate that when new models are published also the corpora are being published on which the models are being trained.
What? Your telling me, "Open"AI is keeping the training corpus for GPT-4 secret? The company closely associated with Microsoft, who own Skype, Office, Hotmail? The same Microsoft who just fired an ethics team? Why would all that be worrisome?
P.S.: To make it clear: I don't think that OpenAI has used private chat logs and emails as training data for GPT-4. But by not disclosing their corpora, they might be checking if they can get away with not being transparent, so that maybe next time they might do it. No one would know, right? And no one would stop them. And hey, if it improves the metrics...
Oscar winning families
Yesterday, when Jamie Lee Curtis won her Academy Award, I learned that both her parents were also nominated for Academy Awards. Which lead to the question: who else?
I asked Wikidata, which lists four others:
- Laura Dern
- Liza Minnelli
- Nora Ephron
- Sean Astin
Only one of them belongs to the even more exclusive club of people who won an Academy Award, and where both parents also did: Liza Minnelli, daughter of Vincente Minelli and Judy Garland.
Also interesting: List of Academy Award-winning families
The place of birth of Ena Begović
I stumbled accidentally over a discrepancy regarding the place of birth of the Croatian actress Ena Begović, and noticed that if you ask Google for the place of birth, it answers Trpanj, whereas Wikipedia lists Split. I was curious where Google got Trpanj from, and how to fix it (especially now that I am not at Google anymore).
The original article in English Wikipedia was created in August 2005 by Raoul DMR. The article listed her as a "native of Split", which in September 2005 was turned into "born in Split".
In April 2018, Lole484, a user who gets blocked for sockpuppeting later, adds that she was born in "Trpanj near Split". There is no Trpanj near Split, but there is a Trpanj on Pelješac. Realzing that, they remove the "near Split" part. In 2019, Ivan Ladic - a sockpuppet of Lole484 - adds a reference to the city of birth being Trpanj, Večernji list, a well known Croatian news magazine.
In April 2020, an anonymous editor changes the place of birth back to Split, and adds a reference to the Croatian national encyclopedia. Today, I changed it back to Trpanj, accidentally while not being logged in (thus anonymously), to possibly encourage a discussion, after starting a conversation on the talk page on English and Croatian a few weeks ago that had one reply.
Interestingly, within a minute after changing the text, I went to Google and asked again for the date of birth, and Google again shows me Trpanj - but this time with the Wikipedia article and the updated snippet as a source. That is impressive.
When I asked Bing, Bing was saying Split for the last three weeks, since I started this adventure, whenever I checked. Today, it still kept saying Split, referencing two sources, one of them English Wikipedia, although I had already changed English Wikipedia. Not as fresh. Let's see how long this will stick. (Maybe folks at Bing should also talk with my colleagues at Wikimedia Enterprise to improve their freshness?)
The Croatian article was created in 2006 after the English one already stated Split, and Split was presumably copied over from the English version. Lole484 changed it to Trpanj in May 2018, and was later also blocked on Croatian Wikipedia, for unrelated reasons of vandalism. The same anonymous editor as on English Wikipedia changes it back to Split in April 2020.
Serbian and Serbocroatian started their articles in 2007, Russian in 2012, Ukrainian in 2016, Albanian and Bulgarian in 2017, Egyptian Arabic was created in October 2020. They all had Split from the beginning and throughout until today, presumably copied from English, directly or indirectly.
Amusingly, Serbian Wikipedia's opening sentence, which includes the place of birth being Split, receives a reference in January 2022 - but the reference actually states Trpanj.
None of the other language editions had their article started in the 2018-2019 window when English and Croatian stated the place of birth as Trpanj.
The only other Wikipedia language edition that saw a change of the place of birth was the Bosnian. The article on Bosnian Wikipedia started a few months after the Croatian, in 2006 (and thus being the third oldest article), and presumably also just copied from either Croatian or English. Lole484 changed it to Trpanj in April 2018, just like on the other Wikipedias. Here it was reverted the next day, but Lole484's sockpuppet Ivan Ladic reinstated that change in January 2019. When I started this adventure, the only Wikipedia that stated Trpanj was Bosnian, all other eight language editions with an article said Split.
On Wikidata, the item was created in 2012, shortly after the launch of the site, based on the existing six sitelinks. The place of birth being Split is added the following year, imported from the Russian Wikipedia.
After I stumbled upon the situation, I added Trpanj as second place of birth, and added sources to both Trpanj and Split.
What's the situation outside of Wikipedia? Both places have pretty solid references going for them:
- Večernji list, article from 2016
- Biografija stated Trpanj, no date, but after 2013 (Archive has the first copy from October 2020)
- tportal.hr has an article on a photography exhibition in Trpanj about Ena Begović, saying the place is chosen because it is her place of birth, published 2016
- Jutarnji list, a well known Croatian newspaper, has a long article about the actress, calling their house in Trpanj the 'rodna kuća', their birth home, of Ena and her sister Mia. This does not necessarily mean that it is literally the house they were born in. Published 2010
- HRT (Croatian national broadcaster), published 2021
- Dubrovački Vjesnik, local newspaper close to Trpanj, lists Trpanj, article from 2020
- Slobodna Dalmacija, a local newspaper from Split, writes Trpanj (but note that this is the same author as the previous article)
- Juarnji list, published 2020 (but note that this is the same author as the previous article)
- Geni.com says Trpanj, last updated 2022
- Croatian national encyclopedia, published in 2021
- Filmski leksikon from the same publisher, no date, but after 2008 (Archive has the first copy from 2022)
- Proleksis, but same publisher as above, as of 2013
- Gloria, a Croatian magazine, says she was born in Split, but Trpanj was her home, published 2020
- film.hr via Archive, from 2007
- Espreso.co.rs writes Split, published 2022
- Find-a-grave lists Split on the site, but more importantly, the photograph of her grave avoids a mention of the place, but only gives the date of her birth and death
- ČSFD (Czechoslovak film database)
- Prabook, from the World Biographical Encyclopedia
- Svensk Filmdatabas
- TMDB (The Movie Database)
24sata says she grew up in Trpanj, gives her date of birth, but avoids stating her place of birth.
Only very few of the sources predate the English Wikipedia article, most notably:
- net.hr published the news of her death and burial in 2000, and lists Split as the place of birth
- IMdb from April 2005 via Archive lists Split, and throughout (I guess?) until today
I also looked up her sister Mia and found her profile on Facebook and sent her a message, but I assume she never even saw this message request. At least I never received an answer (and I didn't expect to). For Mia, the situation is similar: her article originally stated Split, was changed by Lole484 and reverted by an anonymous user, both in English and Croatian, whereas the other languages just list Split throughout.
There were many other sources, and they were going one way or the other. Many of the sources probably just copied from each other. The fact that there were some sources, such as Večernji, that stated Trpanj before it ever made to Wikipedia, but after Split was listed in Wikipedia, was swaying me to think it is Trpanj. Also, it was not always the strongest sources (e.g. usually I would rank the national encyclopedia over Večernji) that said Trpanj, but it was the most in-depth articles, that looked like the authors actually took the time to do some research. Many of the sources looked like they were just bots copying from Wikipedia or Wikidata, or quick pieces taking the base data from Wikipedia.
But then, finally, I stumbled upon one more source: index.hr re-published in 2019 an 1989 interview by Kemal Mujičić with Ena and Mia Begović. Here's a quote from the interview:
Rođene su u Trpnju na Pelješcu.
Ena: Molim vas, to posebno naglasite: Svi misle da smo Dubrovkinje.
Mia: Zanimljivo je da smo u Trpnju rođene kao podstanarke. Roditelji su tek poslije sagradili onu kućicu.
They (Ena and Mia) are born in Trpanj on Pelješac.
Ena: Please put an emphasis on this: everyone thinks we are from Dubrovnik.
Mia: It is interesting that in Trpanj we were born as renters. Our parents built the little house (in which we lived) only later.
Ha! It is amusing to see that Ena's worry was that everyone thinks they are from Dubrovnik. I couldn't find a single source claiming that (but she went to high school (gimnazijum) in Dubrovnik, which is probably the source of that statement from 30 years ago). Also, so much for birth house.
Given all of that, I am going with Trpanj, and making the changes to the Wikipedia languages as much as I can (if someone can help with Arabic and Egyptian Arabic for Ena and Mia, that would be swell, I cannot edit that language edition). Let's see if it sticks.
So, why did Google know the correct answer, even though their usual sources, such as Wikidata and Wikipedia where saying Split? I mustn't say too much but it is due to the Google Knowledge Graph team and their quality processes. Seriously, congratulations to my former colleagues at Google for getting that right!
Just for fun, I also asked ChatGPT (on February 15). And the answer surprised me: when I asked in English, it gave me, unsurprisingly, Split (certainly what the Web seems to believe). But when I asked in Croatian, it gave me a different answer! And the answer was neither Split, nor Trpanj, and also not Dubrovnik - but Zagreb! It is interesting that something like the place of birth of an actress would lead to different answers depending on the language. I would have expected this knowledge to be in the 'world knowledge' of the LLM, not in the 'language knowledge'. I can't check out Bing's chat interface, as I have no access to it, but I would be curious what it says and how long it takes to update.
Thank you for going along on this rather nerdy ride of citogenesis.
Ah, only a few hours after this publication, Bing got updated. And they not only switched from Split to Trpanj, they use this very blogpost as one of the two authoritative references for Trpanj!
Ina Kramer (1948-2023)
1990 erschien die erste aventurische Regionalkarte "im 3D Effekt", wie es damals beworben wurde, "Das Bornland" im Abenteuer "Stromaufwärts" von Michelle Schwefel. Später im Jahr erschien dann die Spielhilfe "Das Königreich am Yaquir", in dem die Karte zum Lieblichen Feld war.
Ich habe stundenlang diese Karten angestarrt. Sie waren so unglaublich detailliert. So wunderschön. Ich war sprachlos, wie schön diese Karten waren. Ich kannte nichts was die Qualität dieser Karten hatte, nicht nur bezüglich Karten für Rollenspielwelten und Fantasywelten, sondern überhaupt.
Es war ein frecher Traum, sich vorzustellen, ganz Aventurien in diesem Format, eins zu einer million, zu haben, und dennoch, innerhalb eines guten Jahrzehnts war der Traum erfüllt, Box für Box, Publikation für Publikation.
Wir verdanken dieses Meisterwerk, Aventurien im Massstab von 1:1.000.000, der Autorin und Grafikerin Ina Kramer. Ina's Bilder und vor allem Porträts und Karten in den DSA Publikationen der späten 80er und den 90er haben für mich mein Bild von DSA und wie ich mir Aventurien vorstellte geprägt wie sonst nur Caryad. Ob das Porträt von Kaiser Hal, Haldana von Ilmenstein, Prinz Brin, so viele andere. Neben ihren Bildern schrieb sie auch vielerlei Texte, vor allem Romane.
Das Rad ist zerbrochen. Am 10. Februar 2023 ist Ina Kramer im Alter von 74 Jahren gestorben.
Ina, vielen Dank für Deine Werke. Ich durfte Ina ein paar Mal treffen, auf Konventen und manchen anderen Gelegenheiten. Ihre Werke haben für mich einen wichtigen Teil meines Lebens mit Bildern und Karten erfüllt. Ich glaube auch, dass Inas Karten mein lebenslanges Interesse an Landkarten weckte.
Connectionism and symbolism: The fall of the symbolists
The big tech layoffs happen, unfortunately and entirely by coincidence, at a time of incredibly elevated expectations regarding machine learned generative models: ChatGPT may not be the 'best' language model out there, but due to the hard work by OpenAI to turn it into an easy to use product, and the huge amount of resources made available for free so that a very large audience could play with it, has in a very short time managed to captured the imagination of many and the conversation. I would say, rightfully. The way ChatGPT was released led to a shock in the sense that we are right now dazed and confused about what effect this technology will have on the world.
And while we are still in the middle of processing this shock, large scale strategic decisions regarding many projects and people were made. Anyone in big tech who worked on symbolic approaches in natural language processing, knowledge representation and reasoning, and other fields of artificial intelligence had a hard time to keep their job. It feels right now like large language models will make all of these symbolic approaches superfluous (I think, this might be true, but is more likely to turn out to be mistaken).
It is always difficult to predict how events will be viewed historically. The advent of wide-spread deep learning approaches in the 2010s, culminating in the well-deserved recognition of Hinton, LeCun, and Bengio with the Turing Award show clearly what dominated the research agenda and the attention in AI in the last decade. But until now it felt like symbolic approaches still had some space left, that the growth in deep learning was in addition to other approaches. Symbolic approaches were ready to offer impulses and work on ideas for a field which might well be climbing towards a local maximum.
But a good number of the teams that were disbanded in the layoffs were exactly teams working with such symbolic approaches, and it feels like these parts of AI are now entering a bitter-cold winter.
A lot of knowledge is being lost right now, and many paths to innovative ideas are being buried. I have no doubt that there are still a lot of breakthroughs to be had in machine learning, and that there is immense value to be collected from the research results in machine learning from the last few years. And with immense I mean tens and hundreds of billions of dollars.
Nevertheless I expect that we will hit a wall. Reach a local maximum. Run into problems and limitations. And it would be good to keep a wider net to cast. To keep a larger search space alive. Alas, it seems it is not meant to be. In this abundance of capital and potential value, we seem to be on the way to starve research, optimise away alternatives, and to give everything to the mainstream ideas.
22 years of Wikipedia
I was just reading a long discussion regarding the differences between Open Street Maps and Wikipedia / Wikidata, and one of the mappers complained "Wiki* cares less about accuracy than the fact that there is something that can be cited", and calling Wikipedia / Wikidata contributions "armchair work" because we don't go out into the world to check a fact, but rely on references.
I understand the expressed frustration, but at the same time I'm having a hard time letting go of "reliability not truth" being a pillar of Wikipedia.
But this makes Wikipedia an inherently conservative project, because we don't reflect a change in the world or in our perception directly, but have to wait for reliable sources to put it in the record. There's something I was deeply uncomfortable with: so much of my life is devoted to a conservative project?
Wikipedia is a conservative project, but at the same time it's a revolutionary project. Making knowledge free and making knowledge production participatory is politically and socially a revolutionary act. How can this seeming contradiction be brought to a higher level of synthesis?
In the last few years, my discomfort with the idea of Wikipedia being conservative has considerably dissipated. One might think, sure, that happened because I'm getting older, and as we get older, we get more conservative (there's, by the way, unfortunate data questioning this premise: maybe the conservative ones simply live longer because of inequalities). Maybe. But I like to think that the meaning of the word "conservative" has changed. When I was young, the word conservative referred to right wing politicians who aimed to preserve the values and institutions of their days. An increasingly influential part of todays right wing though has turned into a movement that does not conserve and preserve values such as democracy, the environment, equality, freedoms, the scientific method. This is why I'm more comfortable with Wikipedia's conservative aspects than I used to be.
But at the same time, that can lead to a problematic stasis. We need to acknowledge that the sources and references Wikipedia has been built on, are biased due to historic and ongoing inequalities in the world, due to different values regarding the importance of certain types of references in the world. If we truly believe that Wikipedia aims to provide everyone with access to the sum of all human knowledge, we have to continue the conversations that have started about oral histories, about traditional knowledges, beyond the confines of academic publications. We have to continue and put this conversation and evolution further into the center of the movement.
Happy Birthday, Wikipedia! 22 years, while I'm 44 - half of my life (although I haven't joined until two years later). For an entire generation the world has always been a world with free knowledge that everyone can contribute to. I hope there is no going back from that achievement. But just as democracy and freedom, this is not a value that is automatically part of our world. It is a vision that has to be lived, that has to be defended, that has to be rediscovered and regained again and again, refined and redefined. We (the collective we) must wrest it from the gatekeepers of the past (including me) to allow it to remain a living, breathing, evolving, ever changing project, in order to not see only another twenty two years, but for us to understand this project as merely a foundation that will accompany us for centuries.
Good bye, kuna!
Now that the Croatian currency has died, they all come to the Gates of Heaven.
First goes the five kuna bill, and Saint Peter says "Come in, you're welcome!"
Then the ten kuna bill. "Come in, you're welcome!"
So does the twenty and fifty kuna bills. "Come in, you're welcome!"
Then comes the hundred kuna bill, expecting to walk in. Saint Peter looks up. "Where do you think you're going?"
"Well, to heaven!"
"No, not you. I've never seen you in mass."
(My brother sent me the joke)
Happy New Year, 2023!
For starting 2023, I will join the Bring Back Blogging challenge. The goal is to write three posts in January 2023.
Since I have been blogging on and off the last few years anyway, that shouldn't be too hard.
Another thing this year should bring is to launch Wikifunctions, the project I have been working on since 2020. It was a longer ride than initially hoped for, but here we are, closer to launch than ever. The Beta is available online, and even though not everything works yet, I was already able to impress my kid with the function to reverse a text.
Looking forward to this New Year 2023, a number that to me still sounds like it is from a science fiction novel.
Goal for Wikidata lexicographic data coverage 2023
At the beginning of 2022, Wikidata had 807 Croatian word forms, covering 5.8% of a Croatian language corpus (Croatian Wikipedia). One of my goals this year was to significantly increase the coverage, trying to add word forms to Wikidata from week to week. And together with a yet small number of contributors, we pushed coverage just in time for the end fo the year to 40%. With only 3,124 forms, we covered 40% of all occurrences of words in the Croatian Wikipedia, i.e. 11.4 Million word occurrences (tokens).
Since every percent is more and more difficult to add, for next year I aim for us to reach 60% coverage, or 5.7 Million more word occurrences. Below's a list of most frequent words in the corpus that are still missing. Let's see how many forms will be covered by the end of 2023! I think that's ambitious, even though it is, in coverage term only half of what we achieved this year. But as said, every subsequent percentage will become more difficult than the previous one.
Statistics and missing words for 55 languages: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
Current statistics for Croatian: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/hr/Statistics
Statistics as of end of year 2022: https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&oldid=1797161415
Statistics for end of year 2021: https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage/hr/Statistics&oldid=1551737937
List of most frequent missing forms in Croatian: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/hr/Missing
Using ChatGPT for fact checking
I wanted to explore the idea that we could use ChatGPT for fact checking. So I took a random article - it gave me the article on Jewish Ideas Daily, copied the text into ChatGPT with the following prompt:
Here's the text of a Wikipedia article. Please find the three statements that are most likely incorrect and rank them, in order to guide fact checking:
Followed by the Wikipedia article text. Here's the answer:
- "Jewish Ideas Daily was edited by Margot Lurie from its founding in January 2010 until December 2012." This statement is incorrect because according to the article, Margot Lurie was the editor of Jewish Ideas Daily "until December 2012."
- "Jewish Ideas Daily drew notice for controversial features including one accusing Christopher Hitchens of antisemitism." This statement is difficult to verify without more context, as it is not clear when this feature was published or what specifically it accused Hitchens of.
- "Like Jewish Review of Books, and Jewish Ideas Daily, Mosaic is funded by the Tikvah Fund." This statement is incorrect because according to the article, Mosaic succeeded Jewish Ideas Daily in June 2013 and is therefore not funded by the Tikvah Fund in the same way as Jewish Ideas Daily and Jewish Review of Books.
Regarding the first statement, GPT says that the sentence "Jewish Ideas Daily was edited by Margot Lurie from its founding in January 2010 until December 2012." is incorrect. I am not disagreeing, but that sentence does not appear in the Wikipedia article text.
Regarding the second statement, GPT says that there is no reference for that statement. In fact, there is, and Wikipedia provides it.
Regarding the third statement, I didn't check the statement, but the argumentation for why it should be checked is unconvincing: if Tikvah fund financed Jewish Ideas Daily, why would it not finance the successor Mosaic? It would be good to add a reference for these statements, but that's not the suggestion.
In short: the review by ChatGPT looks really good, but the suggestions in this case were not good.
The exercise was helpful insofar the article infobox and the text were disagreeing on the founding of the newspaper. I fixed that, but that's nothing ChatGPT pointed out (and couldn't, as I didn't copy and paste the infobox).