Main Page

From Simia
Jump to navigation Jump to search
Nur Deutsche Beiträge - English posts only - Other contents of Simia

An indigenous library

Great story about an indigenous library using their own categorization system instead of the Dewey Decimal System (which really doesn't work for indigenous topics - I mean it doesn't really work for the modern world as well, but that's another story).

What I am wondering though if if they're not going far enough. Dewey's system is eventually rooted in Aristotelian logic and categorization - with a good dash of practical concerns of running a physical library.

Today, these practical concerns can be overcome, and it is unlikely that indigenous approaches to knowledge representation would be rooted in Aristotelian logic. Yes, having your own categorization system is a great first step - but that's like writing your own anthem following the logic of European hymns or creating your own flag following the weird rules of European medieval heraldry. How would it look like if you were really going back to the principles and roots of the people represented in these libraries? Which novel alternatives to representing and categorizing knowledge could we uncover?

Via Jens Ohlig.


How much information is in a language?

About the paper "Humans store about 1.5 megabytes of information during language acquisition“, by Francis Mollica and Steven T. Piantadosi.

This is one of those papers that I both love - I find the idea is really worthy of investigation, having an answer to this question would be useful, and the paper is very readable - and can't stand, because the assumptions in the papers are so unconvincing.

The claim is that a natural language can be encoded in ~1.5MB - a little bit more than a floppy disk. And the largest part of this is the lexical semantics (in fact, without the lexical semantics, the rest is less than 62kb, far less than a short novel or book).

They introduce two methods about estimating how many bytes we need to encode the lexical semantics:

Method 1: let's assume 40,000 words in a language (languages have more words, but the assumptions in the paper is about how many words one learns before turning 18, and for that 40,000 is probably an Ok estimation although likely on the lower end). If there are 40,000 words, there must be 40,000 meanings in our heads, and lexical semantics is the mapping of words to meanings, and there are only so many possible mappings, and choosing one of those mappings requires 553,809 bits. That's their lower estimate.

Wow. I don't even know where to begin in commenting on this. The assumption that all the meanings of words just float in our head until they are anchored by actual word forms is so naiv, it's almost cute. Yes, that is likely true for some words. Mother, Father, in the naive sense of a child. Red. Blue. Water. Hot. Sweet. But for a large number of word meanings I think it is safe to assume that without a language those word meanings wouldn't exist. We need language to construct these meanings in the first place, and then to fill them with life. You can't simply attach a word form to that meaning, as the meaning doesn't exist yet, breaking down the assumptions of this first method.

Method 2: let's assume all possible meanings occupy a vector space. Now the question becomes: how big is that vector space, how do we address a single point in that vector space? And then the number of addresses multiplied with how many bits you need for a single address results in how many bits you need to understand the semantics of a whole language. There lower bound is that there are 300 dimensions, the upper bound is 500 dimensions. Their lower bound is that you either have a dimension or not, i.e. that only a single bit per dimension is needed, their upper bound is that you need 2 bits per dimension, so you can grade each dimension a little. I have read quite a few papers with this approach to lexical semantics. For example it defines "girl" as +female, -adult, "boy" as -female,-adult, "bachelor" as +adult,-married, etc.

So they get to 40,000 words x 300 dimensions x 1 bit = 12,000,000 bits, or 1.5MB, as the lower bound of Method 2 (which they then take as the best estimate because it is between the estimate of Method 1 and the upper bound of Method 2), or 40,0000 words x 500 dimensions x 2 bits = 40,000,000 bits, or 8MB.

Again, wow. Never mind that there is no place to store the dimensions - what are they, what do they mean? - probably the assumption is that they are, like the meanings in Method 1, stored prelinguistically in our brains and just need to be linked in as dimensions. But also the idea that all meanings expressible in language can fit in this simple vector space. I find that theory surprising.

Again, this reads like a rant, but really, I thoroughly enjoyed this paper, even if I entirely disagree with it. I hope it will inspire other papers with alternative approaches towards estimating these numbers, and I'm very much looking forward to reading them.


Milk consumption in China

Quiet disappointed by The Guardian. Here's a (rather) interesting article on the history of milk consumption in China. But the whole article is trying to paint how catastrophic this development might be: the Chinese are trying to triple their intake in milk! That means more cows! That's bad because cows fart us into a hot house!

The argumentation is solid - more cows are indeed problematic. But blaming it on milk consumption in China? Let's take a look at a few numbers omitted from the article, or stuffed into the very last paragraph.

  • On average, a European consumes six times as much milk as a Chinese. So, even if China achieves its goal and triples average milk consumption, they will drink only half as much as a European.
  • Europe has double the number of dairy cows than China has.
  • China is planning to increase their milk output by 300% but only increase resources for that by 30% according to the article. I have no idea how that works, but sounds like a great deal to me.
  • And why are we even talking about dairy cows? The number of beef cows in the US or in Europe each outnumber the dairy cows by a fair amount (unsurprisingly - a cow produces quite a lot of milk over a longer time, whereas its meat production is limited to a single event)
  • There are about 13 million dairy cows in China. The US have more than 94 million cattle, Brazil has more than 211 million, world wide it's more than 1.4 billion - but hey, it's the Chinese milk cows that are the problem.

Maybe the problem can be located more firmly in the consumption habits of people in the US and in Europe than the "unquenchable thirst of China".

The article is still interesting for a number of other reasons.



Shazam! was fun. And had more heart than many other superhero stories. I liked that, for the first time, a DC universe movie felt like it's organically part of that universe - with all the backpacks with Batman and Superman logos and stuff. That was really neat.

Since I saw him in the first trailer I was looking forward to see Steve Carell playing the villain. Turns out it was Mark Strong, not Steve Carell. Ah well.

I am not sure the film knew exactly at whom it was marketed. The theater was full with kids, and given the trailers it was clear that the intention was to get as many families into it as possible. But the horror sequences, the graphic violence, the expletives, and the strip club scenes were not exactly for that audience. PG-13 is an appropriate rating.

It was a joy to watch the protagonist and his buddy explore and discover his powers. Colorful, lively, fun. Easily the best scenes of the movie.

The foster family drama gave the movie it's heart, but the movie seemed a bit overwhelmed by it. I wish that part was executed a bit better. But then again, it's a superhero movie, and given that it was far better than many of the other movies of its genre. But as far as High School and family drama superheroes go, it doesn't get anywhere near Spiderman: Homecoming.

Mid credit scenes. A tradition that Marvel started and that DC keeps copying - but unlike Marvel DC hasn't really paid up to the teasers in their scenes. And regarding cameos - also something where DC could learn so much from Marvel. Also, what's up with being afraid of naming their heroes? Be it in Man of Steel with Superman or here with Billy, the hero doesn't figure out his name (until the next movie comes along and everybody refers to him as Superman as if it was obvious all the time).

All in all, an enjoyable movie while waiting for Avengers: Endgame, and hopefully a sign that DC is finally getting on the right path.


EMWCon 2018, Day 2

Today was the second day of the Enterprise MediaWiki Conference, EMWCon, in Daly City at the Genesys headquarters.

The day started with my keynote on Wikidata and the Abstract Wikipedia idea. The idea was received very friendly.

Today, the day was filled with stories from people building systems on top of MediaWiki, and in particularly Semantic MediaWiki, Cargo, and some Wikibase. This included SFMoma presenting their system to collaboratively document art, using Cargo and Lua on the League of Legends wiki, running a whole wiki farm for Finnish memory and language institutions, the Lost Plays database, and - what I found particularly impressive - an engineer at NASA who implemented a workflow for document approval including authorization, audibality, and a full Web interface within a mere week, and still thinking that it could have been done much faster.

A common theme was "how incredibly easy it was". Yes, almost everyone mentioned something they got stumped on, and this really points to the community needing maybe more usage on StackOverflow or IRC or something, but in so many use cases, people who were not developers were able to create pretty complex workflows and apps right there in their browsers. This also ties in with the second common theme, that a lot of the deployments of such wikis are often starting "under the radar".

There were also genuinely complex solutions that were using Semantic MediaWiki as a mere component: Matteo Busanelli was presenting a solution that included lifting external data sources, deploying ontologies, reasoning, and all the whistles and bells - a very impressive and powerful architecture.

The US government uses Semantic MediaWiki in many places, most notably Intellipedia used by more than 16 intelligence agencies, Diplopedia by the Department of State, and Powerpedia for the Department of Energy. EPA's Statipedia is no more, but new wikis are popping up in other agency, such as WikITA for the International Trade Administration, and for the Nuclear Regulatory Commission. Canada's GCpedia was mentioned with a lot of respect, and the wish that the US would have something similar.

NASA has a whole wiki farm: within mission control alone they had 12 different wikis after a short while, many grown bottom up. They noticed that it would make sense to merge them together - which wasn't easy, neither technically nor legally nor managerially. They found that a lot of their knowledge was misclassified - for example, they classified handbooks which can be bought by anyone on Amazon. One of the biggest changes the wiki caused at NASA was that the merged ISS wiki lead to opening more knowledge to more people, and drawing the circles larger. 20% of the people who have access to the wikis actively contribute to the wikis! This is truly impressive.

So far, no edit has been made from space - due to technical issues. But they are working on it.

The day ended with a panel, asking the question where MediaWiki is in the marketplace, and how to grow.

Again, thanks to Yaron Koren and Cindy Cicalese for organizing the conference, and Genesys for hosting us. All presentations are available on YouTube.

Semantic MediaWiki

EMWCon 2019, Day 1

Today was the first day of the Enterprise MediaWiki Conference, EMWCon, in Daly City. Among the attendees were people from NASA (6 or more people), UIC (International Union of Railways), the UK Ministry of Defence, the US radioactivity safety agencies, cancer research institutes, the Bureaus of Labour Statistics, PG&E, General Electric, and a number of companies providing services around MediaWiki, such as WikiTeq, Wikiworks, dokit, etc., with or without semantic extensions. The conference was located at the Headquarter of Genesys.

I'm not going to comment on all talks, and also I will not faithfully report on the talks - you can just go to YouTube to watch the talks themselves. The following is a personal, biased view of the first day.

NASA made an interesting comment early on: the discussion was about MediaWiki and its lack of fine-grained access control. You can set up a MediaWiki easily for a controlled group (so that not everyone in the world can access it), but it is not so easy to say "oh, this set of pages is available for people in this group, and managers in that org can access the pages with this markers", etc. So NASA, at first, set up a lot of wiki installations, each one for such specific groups - but eventually turned it all around and instead had a small number of well-defined groups and merged the wikis into them, tearing down barriers within the org and making knowledge wider available.

Evita Hollis from General Electric had an interesting point in her presentation on how GE does knowledge sharing: they use SharePoint and Yammer to connect people to people, and MediaWiki to connect people to Knowledge. MediaWiki has been not-exactly-great at allowing people to work together in real-time - it is a different flow, where you capture and massage knowledge slowly into it. There is a reason why Ops at Wikimedia do not use a wiki during an incident that much, but rather IRC. I think there is a lot of insight in her argument - and if we take that serious, we could actually really lift MediaWiki to a new level, and take Wikipedia there too.

Another interesting point is that SharePoint at General Electric had three developers, and MediaWiki had one. The question from the audience was, whether that reflect how difficult it is to work with SharePoint, or whether that reflected some bias of the company towards SharePoint. Hollis was adamant about how much she likes Sharepoint, but the reason for the imbalance was that MediaWiki, particularly Semantic MediaWiki, allows actually much more flexibility and power than SharePoint without having to touch a single line of wiki source code. It is a platform that allows for rapid experimentation by the end user (I am adding the Spiderman adage about great power coming with great responsibility).

Daren Welsh from NASA talked about many different forms of biases and how they can bubble up on your wiki. Very interesting was one effect: if knowledge from the wiki is becoming too readily availble, people may start to become dependent on it. They had tests where they took away the wiki randomly from flight controllers in training, in order to ensure they are resourceful enough to still figure out what to do - and some failed miserably.

Ike Hecht had a brilliant presentation on the kind of quick application development Semantic MediaWiki lends itself to. He presented a task manager, a news feed, and a file management system, calling them "Semantic Structures That Do Stuff" - which is basically a few pages for your wiki, instead of creating extensions for all of these. This also resonated with GE's statement about needling less developers. I think that this is wildly underutilized and there is a lot of value in this idea.

Thanks to Yaron Koren - who also gave an intro to the topic - and Cindy Cicalese for organizing the conference, and Genesys for hosting us. All presentations are available on YouTube.

Semantic MediaWiki

EMWCon Spring 2019

I'm honored to be invited to keynote the Enterprise MediaWiki conference in Daly City. The keynote is on Thursday, I will talk about Wikidata and beyond - towards an abstract Wikipedia.

The talk is planned to be recorded, so it should be available afterwards for everyone interested.

Semantic MediaWiki

Turing Award to Bengio, LeCun, and Hinton

Congratulations to Yoshua Bengio, Yann LeCun, and Geoffrey Hinton on being awarded the Turing Award, the most prestigious award in Computer Science.

Their work had revolutionized huge parts of computer science as it is used in research and industry, and has lead to the current impressive results in AI and ML. They were continuing to work on an area that was deemed unpromising, and has suddenly swept through whole industries and reshaped them.

Computer science

Something Positive in Deutsch wieder online

2005 und 2006 übersetzten Ralf Baumgartner und ich die ersten paar Something Positive comics von R. K. Milholland ins Deutsche. Die 80 Comics, die wir damals übersetzt haben, sind hiermit wieder online. Wir haben noch vier weitere Comics übersetzt, die in den nächsten Tagen auch nach und nach online kommen werden.

Viel Spass! Oh, und die Comics sind für Erwachsene.

Something Positive

DSA Erfolgswahrscheinlichkeiten

Ich fand es immer spannend, auszurechnen, wie hoch die Wahrscheinlichkeit ist, dass eine Talentprobe in DSA gelingt oder nicht. Ich konnte über die Jahre hinweg keine vernünftige, geschlossene Formel finden, und so blieb ich immer bei Überschlagsrechnungen. Dabei visualisierte ich mir im Kopf die drei Würfelwürfe als die drei Dimensionen eines Raumes, in dem ein Teil des Raumes gelungene Proben und der Rest des Raumes misslungene Proben darstellt.

Ich dachte lange darüber nach, dass es interessant ware, diesen Raum tatsächlich zu visualisieren. 2010 musste ich während eines Forschungsaufenthalts in Los Angeles ein paar Webtechniken erlernen - HTML Canvas, jQuery, Blueprint, etc. - und am besten lerne ich, indem ich ein kleines Projekt mache. Also nutzte ich diese Gelegenheit. Damals war DSA4 aktuell, und entsprechend machte ich das Projekt für die Regeln von DSA4.

2017 überarbeitete Hanno Müller-Kalthoff die Visualisierung und passte sie an die neuen Regeln von DSA5 an. Hier sind Links für beide Seiten und eine DSA5 App:

Das Schwarze Auge

... further results

Archive - Subcribe to feed