September 24 was the AKTS workshop - Advanced Knowledge Technologies for Science in a FAIR world - co-located with the eScience and Gateways conferences in San Diego. As usual with my trip reports, I won't write about every single talk, but offer only my own personal selection and view. This is not an official report on the workshop.
I had the honor of kicking off the day. I made the proposal of using Wikidata for describing datasets so that dataset catalogs can add these descriptions to their indexes. The standard way to do so is to use Schema.org annotations describing the datasets, but our idea here was to provide a fallback solution in case Schema.org cannot be applied for one reason or the other. Since the following talks would also be talking about Wikidata I used the talk to introduce Wikidata in a bit more depth. In parallel, I kicked the same conversation off on Wikidata as well. The idea was well received, but one good question was raised by Andrew Su: why not add Schema.org annotations to Wikidata instead?
After that, Daniel Garijo of USC's ISI presented WDPlus, Wikidata Plus, which presented a prototype for how to extend Wikidata with more data (particularly tabular data) from external data sources, such as censuses and statistical publications. The idea is to surround Wikidata with a layer of so-called satellites, which materialize statistical and other external data into Wikidata's schema. They implemented a mapping languages, T2WDML, that allows to grab CSV numbers and turn them into triples that are compatible with Wikidata's schema, and thus can be queried together. There seems to be huge potential in this idea, particularly if one can connect the idea of federated SPARQL querying with on-the-fly mappings, extending Wikidata to a virtual knowledge base that would be easily several times its current size.
Andrew Su from Scripps Research talked about using Wikidata as a knowledge graph in a FAIR world. He presented their brilliant Gene Wiki project, about adding knowledge about genes and proteins to Wikidata. He presented the idea of using Wikidata as a generalized back-end for customized frontend-applications - which is perfect. Wikidata's frontend is solid and functional, but in many domains there is a large potential to improve the UX for users in specific domains (and we are seeing some if flowering more around Lexemes, with Lucas Werkmeister's work on lexical forms). Su and his lab developed ChlamBase which allows the Chlamydia research community to look at the data they are interested in, and to easily add missing data. Another huge advantage of using Wikidata? Your data is going to live beyond the life of the grant. A great overview of the relevant data in Wikidata can be seen in this rich and huge and complex diagram.
The talks switched more to FAIR principles, first by Jeffrey Grethe of UCSD and then Mark Musen of Stanford. Mark was pointing out how quickly FAIR turned from a new idea to a meme that was pervasive everywhere, and the funding agencies now starting to require it. But data often has issues. One example: BioSample is the best metadata NIH has to offer. But 73% of the Boolean metadata values are not 'true' or 'false' but have values like "nonsmoker" or "recently quitted". 26% of the integers were not parseable. 68% of the entries from a controlled vocabulary were not. Having UX that helped with entering this data would be improving the quality considerably, such as CEDAR.
Carole Goble then talked about moving towards using Schema.org for FAIRer Life Sciences resources and defining a Schema.org profile that make datasets easier to use. The challenges in the field have been mostly social - there was a lot of confidence that we know how to solve the technical issues, but the social ones provide to be challenging. Carol named four of those explicitly:
- building consensus (it's harder than you think)
- the Schema.org Catch-22 (Schema.org won't take it if there is no usage, but people won't use it until it is in Schema.org)
- dedicated resources (people think you can do the social stuff in your spare time, but you can't)
Natasha Noy gave the keynote, talking about Google Dataset Search. The lessons learned from building it:
- Build an ecosystem first, be technically light-weight (a great lesson which was also true for Wikipedia and Wikidata)
- Use open, non-proprietary, standard solutions, don't ask people to build it just for Google (so in this case, use Schema.org for describing datasets)
- bootstrapping requires influencers (i.e. important players in the field, that need explicit outreach) and incentives (to increase numbers)
- semantics and the KG are critical ingredients (for quality assurance, to get the data in quickly, etc.)
At the same time, Natasha also reiterated one of Mark's points: no matter how simple the system is, people will get it wrong. The number of ways a date field can be written wrong is astounding. And often it is easier to make the ingester more accepting than try to get people to correct their metadata.
Chris Gorgolewski followed with a session on increasing findability for datasets, basically a session on SEO for dataset search: add generic descriptions, because people who need to find your dataset probably don't know your dataset and the exact terms (or they would already use it). Ensure people coming to your landing site have a pleasant experience. And the description is markup, so you can even use images.
I particularly enjoyed a trio of paper presentations by Daniel Garijo, Maria Stoica, Basel Shbita and Binh Vu. Daniel spoke about OntoSoft, an ontology to describe software workflows in sufficient detail to allow executing them, and also to create input and output definitions, describe the execution environment, etc. Close to those in- and output definition we find Maria's work on an ontology of variables. Maria presented a lot of work to identify the meaning of variables, based on linguistic, semantic, and ontological reasoning. Basel and Binh talked about understanding data catalogs deepers, being able to go deeper into the tables and understand the actual content in them. If one would connect the results of these three papers, one could potentially see how data from published tables and datasets could become alive and answer questions almost out of the box: extracting knowledge from tables, understanding their roles with regards to the input variables, and how to execute the scientific workflows.
Sure, science fiction, and the question is how well would each of the methods work, and how well would they work in concert, but hey, it's a workshop. It's meant for crazy ideas.
Ibrahim Burak Ozyurt presented an approach towards question answering in the bio-domain using Deep Learning, including Glove and BERT and all the other state of the art work. And it's all on Github! Go try it out.
The day closed with a panel with Mark Musen, Natasha Noy, and me, moderated by Yolanda Gil, discussing what we learned today. It quickly centered on the question how to ensure that people publishing datasets get appropriate credit. For most researchers, and particularly for universities, paper publications and impact factors are the main metric to evaluate researchers. So how do we ensure that people creating datasets (and I might add, tools, workflows, and social consensus) receive the fair share of credit?
Thanks to Yolanda Gil and Andrew Su for organizing the workshop! It was an exhausting, but lovely experience, and it is great to see the interest in this field.
Illuminati and Wikibase
Bring me to your leader!