
Knowledge Graph Conference 2019, Day 1

On Tuesday, May 7, the first Knowledge Graph Conference began. Organized by François Scharffe and his colleagues at Columbia University, it took place in New York City. The conference runs for two days and aims at a much more industry-oriented crowd than conferences such as ISWC. This was reflected very prominently in the speaker line-up: finance in particular was very well represented (no surprise, with Wall Street being just downtown).

Speakers and participants from Goldman Sachs, Capital One, Wells Fargo, Mastercard, Bank of America, and others were in the room, but also people from companies in other industries, such as AstraZeneca, Amazon, Uber, or Airbnb. The speakers and participants were rather open about their work, often listing numbers of triples and entities (which really is a weird metric to cite, but since it is readily available it is often expected to be stated), and these were usually in the billions. More interesting than the sheer size of their respective KGs were their use cases; particularly in finance, a common one was ensuring compliance with insider trading rules and similar regulations.

I presented Wikidata and the idea of an Abstract Wikipedia as going beyond what a Knowledge Graph can easily express. I had the feeling the presentation was well received - it was obvious that many people in the audience were already fully aware of Wikidata and were actively using it or planning to use it. For others, the highlights were the SPARQL endpoint with its powerful visualization capabilities and federated queries, the external identifiers in Wikidata, and the approach to references for the claims in Wikidata. The proposal of an Abstract Wikipedia was very warmly received, and it was the first time no one called it out as a crazy idea. I guess the audience was very friendly, despite New York's reputation.

A second set of speakers were offering technologies and services - and I guess I belong to this second set, having spoken about Wikidata. Among them were people like Juan Sequeda of Capsenta, who gave an extremely engaging and well-substantiated talk on how to bridge the chasm towards more KG adoption; Pierre Haren of Causality Link, who offered an interesting personal history through KR land, from LISP to causal graphs; Dieter Fensel of OnLim, who made a number of really good points on the relation between intelligent assistants, their dialogue systems, and KGs; and representatives of Neo4j, Eccenca, and Diffbot.

A highlight for me was the astute and frequent observation by a number of the speakers from the first set that the most challenging problems with Knowledge Graphs were rarely technical. I guess graph serving systems and cloud infrastructure have improved so much that we don't have to worry about these parts anymore, unless we are dealing with extremely large graphs. The most frequently mentioned problems were social and organizational. Since Knowledge Graphs often pull together data sources from many different parts of an organization, under a common semantics, they trigger feelings of territoriality. Who gets to define the common ontology? If the data a team provides has problems or is used carelessly, who is at fault? What if others benefit from our data more than we did, even though we put in all the effort to clean it up? How do we get recognized for our work? The organizational questions were often about a lack of understanding, especially among engineers, of fundamental Knowledge Graph principles, and a lack of enthusiasm in the management chain - especially when the costs are being estimated and the social problems mentioned before become apparent. One particularly visible moment was when Bethany Sehon from Capital One was asked about the major challenges to standardizing vocabularies - and her first answer was basically "egos".

All speakers talked about the huge benefits they reaped from using Knowledge Graphs (such as detecting likely cliques of potential insider traders who later indeed got convicted) - but then again, this is to be expected, since conference participation is self-selecting and we wouldn't hear of failures in such a setting.

I had a great day at the inaugural Knowledge Graph Conference, and am sad that I have to miss the second day. Thanks to François Scharffe for organizing the conference, and thanks to the sponsors, OntoText, Collibra, and TigerGraph.

For more, see:

Golden

I'd say that Golden might be the most interesting competitor to Wikipedia I've seen in a while (which really doesn't mean that much - it's just that the others have been really terrible).

This one also has a few red flags:

  • closed source, as far as I can tell
  • aiming for ten billion topics in their first announcement, but lacking an article on Germany
  • obviously not understanding what the point of notability policies is - and no, it is not about server space

They also have a few features that, if they work, should be looked at and copied by Wikipedia - such as the editing assistants and some of the social features that are built into the platform.

Predictions:

  1. they will make a splash or two, with corresponding news cycles
  2. they will, at some point, make an effort to import or transclude Wikipedia content
  3. they will never make a dent in Wikipedia readership, and will say that they wouldn't want to anyway because they love Wikipedia (which I believe)
  4. they will make a press release of donating all their content to Wikipedia (even though that's already possible thanks to their license)
  5. and then, being a for-profit company, they will pivot to something else within a year or two.

May 2019 talks

I am honored to give the following three invited talks in the next few weeks:

The topics will all be on Wikidata, how the Wikipedias use it, and the Abstract Wikipedia idea.

AI and role playing

An article about AI and role playing games, and thus in the perfect intersection of my interests.

But the article is entirely devoid of any interesting content, and basically boils down to asking the question "could RPGs be a Turing test for AI?"

I mean, the answer is so painfully obviously "yes" that no one ever bothered to write it down. Turing basically framed the test as a role playing game in the first place!

Papaphobia

In a little knowledge engineering exercise, I was trying to add the causes of a phobia to the respective Wikidata items. There are currently about 160 phobias in Wikidata, and only a few listed in a structured way what they are the fear of. So I was going through them, trying to capture this in a structured way. Here's a list of the current state:
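Such a list can be generated directly from the Wikidata query service. As a minimal sketch of the kind of query involved - the modelling here is my assumption, treating phobias as subclasses of the item for phobia (Q175854) and using "has cause" (P828) for what the phobia refers to, and both IDs should be double-checked on Wikidata before relying on them - something along these lines would work:

    import requests

    # Sketch: list phobias in Wikidata together with their structured cause, if any.
    # Assumed modelling: phobias as subclasses of "phobia" (Q175854), and
    # "has cause" (P828) pointing to what the phobia is a fear of.
    QUERY = """
    SELECT ?phobia ?phobiaLabel ?causeLabel WHERE {
      ?phobia wdt:P279* wd:Q175854 .           # assumed: subclass of phobia
      OPTIONAL { ?phobia wdt:P828 ?cause . }   # assumed: cause = the feared thing
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "phobia-cause-listing/0.1 (example script)"},
    )
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        name = row["phobiaLabel"]["value"]
        cause = row.get("causeLabel", {}).get("value", "(no structured cause yet)")
        print(f"{name}: {cause}")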

Now, one of those phobias was Papaphobia - the fear of the pope. Is that really a thing? I don't know. The CDC does not seem to have an entry on it. On the Web, meanwhile, some pages have obviously taken to mining lists of phobias and creating advertising pages that "help" you with Papaphobia - such as this one:

This page is likely entirely auto-generated. I doubt that they have "clients for papaphobia in 70+ countries", whom they helped "in complete discretion" within a single day! "People with severe fears and phobias like papaphobia (which is in fact the formal diagnostic term for papaphobia) are held prisoners by their phobias."

This site offers more, uhm, useful information.

"Group psychotherapy can also help where individuals share their experience and, in the process, understand and recover from their phobia." Really? There are enough cases that we can even set up a group therapy?

Now, maybe I am entirely off here - maybe papaphobia really is a thing. Searching Scholar, I couldn't find any medical sources (the term is mentioned in a number of sociological and historical works, to express general sentiments in a population or government against the authority of the pope, but I could not find any mention of it in the actual medical literature).

Now, could those pages up there be benign jokes? Or are they trying to scam people with promises to heal their actual fears, and they just didn't curate the list of fears sufficiently - because, really, you wouldn't find this page unless you actually searched for this term?

And now what? What if we know these pages are made by scammers? Do we report them to the police? Do we send a tip to journalists? Or should we just do nothing, allowing them to scam people with actual fears? Well, by publishing this text, maybe I'll warn a few people, but it won't reach the people it has to reach at the right time, unfortunately.

Also, was it always so hard to figure out what is real and what is not? Does papaphobia exist? Such a simple question. How should we deal with it on Wikidata? How many cases are there, if it exists? Did it get worse for people with papaphobia now that there are two living people who have been made pope?

My assumption now is that someone was basically working on a corpus, looking for words ending in -phobia, in order to generate a list of phobias. The term papaphobia, from the sociological and historical literature, popped up and landed in some list, and was then repeated in other places - also because it is kind of a funny idea. A mixture of bad research and joking bubbled through, and rolled around on the Web for so long that it now looks like it is actually a thing, to the point that there are organizations who will gladly take your money (CTRN is not the only one) to treat you for papaphobia.

The world is weird.

An indigenous library

Great story about an indigenous library using their own categorization system instead of the Dewey Decimal System (which really doesn't work for indigenous topics - I mean, it doesn't really work for the modern world either, but that's another story).

What I am wondering, though, is whether they're going far enough. Dewey's system is ultimately rooted in Aristotelian logic and categorization - with a good dash of the practical concerns of running a physical library.

Today, these practical concerns can be overcome, and it is unlikely that indigenous approaches to knowledge representation would be rooted in Aristotelian logic. Yes, having your own categorization system is a great first step - but that's like writing your own anthem following the logic of European hymns, or creating your own flag following the weird rules of European medieval heraldry. What would it look like if you really went back to the principles and roots of the people represented in these libraries? Which novel alternatives to representing and categorizing knowledge could we uncover?

Via Jens Ohlig.

How much information is in a language?

About the paper "Humans store about 1.5 megabytes of information during language acquisition", by Francis Mollica and Steven T. Piantadosi.

This is one of those papers that I both love - I find the idea really worthy of investigation, having an answer to this question would be useful, and the paper is very readable - and can't stand, because the assumptions in the paper are so unconvincing.

The claim is that a natural language can be encoded in ~1.5MB - a little bit more than a floppy disk. And the largest part of this is the lexical semantics (in fact, without the lexical semantics, the rest is less than 62kb, far less than a short novel or book).

They introduce two methods for estimating how many bytes we need to encode the lexical semantics:

Method 1: let's assume 40,000 words in a language (languages have more words, but the assumption in the paper is about how many words one learns before turning 18, and for that 40,000 is probably an OK estimate, although likely on the lower end). If there are 40,000 words, there must be 40,000 meanings in our heads; lexical semantics is the mapping of words to meanings; there are only so many possible mappings, and choosing one of those mappings requires 553,809 bits. That's their lower estimate.
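To spell out the arithmetic behind that lower estimate (my own restatement, not code from the paper): the number of possible one-to-one mappings between 40,000 words and 40,000 meanings is 40,000!, so picking one particular mapping costs log2(40,000!) bits:

    import math

    words = 40_000
    # log2(40000!) computed via the log-gamma function, since lgamma(n + 1) == ln(n!)
    bits = math.lgamma(words + 1) / math.log(2)
    print(round(bits))      # ~553,809 bits
    print(round(bits / 8))  # ~69,226 bytes, i.e. roughly 69 kB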

Wow. I don't even know where to begin in commenting on this. The assumption that all the meanings of words just float in our heads until they are anchored by actual word forms is so naive, it's almost cute. Yes, that is likely true for some words: mother, father, in the naive sense of a child. Red. Blue. Water. Hot. Sweet. But for a large number of word meanings I think it is safe to assume that without a language those word meanings wouldn't exist. We need language to construct these meanings in the first place, and then to fill them with life. You can't simply attach a word form to such a meaning, as the meaning doesn't exist yet - which breaks the assumptions of this first method.

Method 2: let's assume all possible meanings occupy a vector space. Now the question becomes: how big is that vector space, and how do we address a single point in it? The number of addresses we need to store (one per word), multiplied by how many bits a single address takes, gives how many bits you need for the semantics of a whole language. Their lower bound is that there are 300 dimensions, the upper bound is 500 dimensions. Their lower bound further assumes that you either have a dimension or not, i.e. that only a single bit per dimension is needed; their upper bound is that you need 2 bits per dimension, so you can grade each dimension a little. I have read quite a few papers with this approach to lexical semantics. For example, it defines "girl" as +female, -adult, "boy" as -female, -adult, "bachelor" as +adult, -married, etc.

So they get to 40,000 words x 300 dimensions x 1 bit = 12,000,000 bits, or 1.5MB, as the lower bound of Method 2 (which they then take as the best estimate, because it is between the estimate of Method 1 and the upper bound of Method 2), or 40,000 words x 500 dimensions x 2 bits = 40,000,000 bits, or 5MB.
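Restating those bounds as plain arithmetic (again my own sketch, not the paper's code), together with a toy version of the binary-feature style of lexical semantics mentioned above:

    # Method 2 bounds: words x dimensions x bits per dimension.
    words = 40_000

    lower_bits = words * 300 * 1   # 12,000,000 bits
    upper_bits = words * 500 * 2   # 40,000,000 bits

    print(lower_bits / 8 / 1_000_000)   # 1.5 MB
    print(upper_bits / 8 / 1_000_000)   # 5.0 MB

    # Illustrative one-bit-per-dimension feature vectors (my toy features,
    # not the paper's actual dimensions):
    features = {
        "girl":     {"female": 1, "adult": 0},
        "boy":      {"female": 0, "adult": 0},
        "bachelor": {"adult": 1, "married": 0},
    }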

Again, wow. Never mind that there is no place to store the dimensions - what are they, what do they mean? Probably the assumption is that they are, like the meanings in Method 1, stored prelinguistically in our brains and just need to be linked in as dimensions. But the idea that all meanings expressible in language can fit into this simple vector space - I find that theory surprising.

Again, this reads like a rant, but really, I thoroughly enjoyed this paper, even if I entirely disagree with it. I hope it will inspire other papers with alternative approaches towards estimating these numbers, and I'm very much looking forward to reading them.

Milk consumption in China

Quite disappointed by The Guardian. Here's a (rather) interesting article on the history of milk consumption in China. But the whole article is trying to paint how catastrophic this development might be: the Chinese are trying to triple their intake of milk! That means more cows! That's bad because cows fart us into a hothouse!

The argumentation is solid - more cows are indeed problematic. But blaming it on milk consumption in China? Let's take a look at a few numbers omitted from the article, or stuffed into the very last paragraph.

  • On average, a European consumes six times as much milk as a Chinese person. So, even if China achieves its goal and triples average milk consumption, the average Chinese person will still drink only half as much as a European.
  • Europe has twice as many dairy cows as China.
  • China is planning to increase its milk output by 300% while increasing the resources for it by only 30%, according to the article. I have no idea how that works, but it sounds like a great deal to me.
  • And why are we even talking about dairy cows? Beef cattle in the US and in Europe each outnumber the dairy cows by a fair amount (unsurprisingly - a cow produces quite a lot of milk over a long time, whereas its meat production is limited to a single event).
  • There are about 13 million dairy cows in China. The US has more than 94 million cattle, Brazil has more than 211 million, and worldwide there are more than 1.4 billion - but hey, it's the Chinese milk cows that are the problem.

Maybe the problem can be located more firmly in the consumption habits of people in the US and in Europe than in the "unquenchable thirst of China".

The article is still interesting for a number of other reasons.

Shazam!

Shazam! was fun. And it had more heart than many other superhero stories. I liked that, for the first time, a DC universe movie felt like it was organically part of that universe - with all the backpacks with Batman and Superman logos and stuff. That was really neat.

Ever since I saw him in the first trailer I was looking forward to seeing Steve Carell play the villain. Turns out it was Mark Strong, not Steve Carell. Ah well.

I am not sure the film knew exactly at whom it was aimed. The theater was full of kids, and given the trailers it was clear that the intention was to get as many families in as possible. But the horror sequences, the graphic violence, the expletives, and the strip club scenes were not exactly for that audience. PG-13 is an appropriate rating.

It was a joy to watch the protagonist and his buddy explore and discover his powers. Colorful, lively, fun. Easily the best scenes of the movie.

The foster family drama gave the movie its heart, but the movie seemed a bit overwhelmed by it. I wish that part had been executed a bit better. But then again, it's a superhero movie, and as such it was far better than many of the other movies of its genre. But as far as high school and family drama superheroes go, it doesn't get anywhere near Spider-Man: Homecoming.

Mid-credit scenes: a tradition that Marvel started and that DC keeps copying - but unlike Marvel, DC hasn't really paid off the teasers in their scenes. And regarding cameos - also something where DC could learn so much from Marvel. Also, what's up with being afraid of naming their heroes? Be it in Man of Steel with Superman or here with Billy, the hero doesn't figure out his name (until the next movie comes along and everybody refers to him as Superman as if it had been obvious all the time).

All in all, an enjoyable movie while waiting for Avengers: Endgame, and hopefully a sign that DC is finally getting on the right path.

EMWCon 2018, Day 2

Today was the second day of the Enterprise MediaWiki Conference, EMWCon, in Daly City at the Genesys headquarters.

The day started with my keynote on Wikidata and the Abstract Wikipedia idea. The idea was received very warmly.

Today, the day was filled with stories from people building systems on top of MediaWiki, in particular Semantic MediaWiki, Cargo, and some Wikibase. This included SFMOMA presenting their system to collaboratively document art, the use of Cargo and Lua on the League of Legends wiki, a whole wiki farm run for Finnish memory and language institutions, the Lost Plays database, and - what I found particularly impressive - an engineer at NASA who implemented a workflow for document approval, including authorization, auditability, and a full Web interface, within a mere week, and who still thought it could have been done much faster.

A common theme was "how incredibly easy it was". Yes, almost everyone mentioned something they got stumped on, and this really points to the community maybe needing more presence on StackOverflow or IRC or something, but in so many use cases people who were not developers were able to create pretty complex workflows and apps right there in their browsers. This also ties in with the second common theme: that a lot of the deployments of such wikis often start "under the radar".

There were also genuinely complex solutions that were using Semantic MediaWiki as a mere component: Matteo Busanelli presented a solution that included lifting external data sources, deploying ontologies, reasoning, and all the bells and whistles - a very impressive and powerful architecture.

The US government uses Semantic MediaWiki in many places, most notably Intellipedia, used by more than 16 intelligence agencies, Diplopedia at the Department of State, and Powerpedia at the Department of Energy. EPA's Statipedia is no more, but new wikis are popping up in other agencies, such as WikITA for the International Trade Administration and one for the Nuclear Regulatory Commission. Canada's GCpedia was mentioned with a lot of respect, along with the wish that the US had something similar.

NASA has a whole wiki farm: within mission control alone they had 12 different wikis after a short while, many grown bottom-up. They noticed that it would make sense to merge them - which wasn't easy, technically, legally, or managerially. They found that a lot of their knowledge was misclassified - for example, they had classified handbooks that anyone can buy on Amazon. One of the biggest changes the wiki caused at NASA was that the merged ISS wiki led to opening more knowledge to more people, and to drawing the circles larger. 20% of the people who have access to the wikis actively contribute to them! This is truly impressive.

So far, no edit has been made from space - due to technical issues. But they are working on it.

The day ended with a panel asking where MediaWiki stands in the marketplace, and how to grow.

Again, thanks to Yaron Koren and Cindy Cicalese for organizing the conference, and Genesys for hosting us. All presentations are available on YouTube.