Using logical constraints to validate statistical information about disease outbreaks in collaborative knowledge graphs: the case of COVID-19 epidemiology in Wikidata

PeerJ Computer Science
120,109 of 142,665 as of 4 February 2022: https://github.com/search?q=covid-19+OR+covid19+OR+coronavirus+OR+cord19+OR+cord-19
CC0 is a rights waiver similar to Creative Commons licenses, used to publish material into the public domain. It waives as much copyright as possible within a given jurisdiction. Further information can be found at https://creativecommons.org/publicdomain/zero/1.0/.
An openly licensed SPARQL textbook, available in multiple languages, can be found at https://en.wikibooks.org/wiki/SPARQL.
Detailed information about the data structure of Wikidata can be found in Turki et al. (2022).
For an updated list of available Wikidata properties, please see https://tools.wmflabs.org/hay/propbrowse/.
For an overview of the semi-automated editing tools for Wikidata, please see https://www.wikidata.org/wiki/Wikidata:Tools.
Further information about user rights and governance in Wikidata is available at https://www.wikidata.org/wiki/Wikidata:User_access_levels.
For further details about the language representation of COVID-19 knowledge in Wikidata, please refer to Turki et al. (2022), which has a figure and multiple tables on the subject.
A Wikidata-friendly format of a database is an edition of that resource where items and predicates of triples are replaced by their equivalents in Wikidata or in ontologies integrated with it.
Wikidata Integrator is a bot framework for automatically curating genetic information provided by Wikidata (https://github.com/SuLab/WikidataIntegrator). For Wikidata bots using this framework, refer to https://www.wikidata.org/wiki/Wikidata:WikiProject_Gene_Wiki#Bot_accounts. The framework has been adapted to various specific contexts, e.g., the curation of cell lines indexed in Cellosaurus, as per https://github.com/calipho-sib/cellosaurus-wikidata-bot.
RefB: Description at https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/RefB_(WikiCred), Source code at https://github.com/Data-Engineering-and-Semantics/refb/, Wikidata edits at https://www.wikidata.org/wiki/Special:Contributions/RefB_(WikiCred).
An Internationalized Resource Identifier (IRI) is a standardized character string (e.g., a URL) that identifies a given item in a semantic resource.
ShEx schemas can also be defined in RDF-based representations like Turtle or JSON-LD.
The data models for WikiProject COVID-19 are accessible via https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19/Data_models.
Competency questions: A set of requirements ensuring the consistency of a knowledge graph, i.e., constraints determining what knowledge is to be included in it (Wiśniewski et al., 2019).
For SPARQL-based visualizations of COVID-19 information in Wikidata, see https://speed.ieee.tn/, https://egonw.github.io/SARS-CoV-2-Queries/, https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19/Queries, and https://scholia.toolforge.org/topic/Q84263196.
We found the Wikidata properties reflecting epidemiological data about COVID-19 outbreaks using a specific SPARQL query available at https://w.wiki/5UsE. Please note that current results may include properties that did not exist as of August 8, 2020, such as Number of vaccinations (P9107).
As of August 8, 2020. For updated statistics, see https://w.wiki/Z5m.
For instance, the query SELECT (COUNT(*) AS ?c) WHERE {?s ?p ?o} currently gives 11857528152 results on the clone at https://wikidata.demo.openlinksw.com/sparql that was set up by Chalupsky et al. (2021), while the live Wikidata result as of 23 July 2022 is 14040950269.
Detailed information about string functions in SPARQL can be found at https://www.w3.org/TR/sparql11-query/#func-strings.
Systematized Nomenclature Of Medicine—Clinical Terms
This method can be adapted to meet the needs of the user. For instance, the SPARQL queries can be slightly adjusted to assess other patterns in collaborative ontologies such as the usage of classes.
This information can be represented in the form of RDF triples where the subject is the studied relation type and integrated into Wikidata.
Epidemiological data about the monkeypox epidemic have begun to be tracked, e.g. via the item Q112070734 for the 2022 monkeypox outbreak and similar entries with a more regional focus like Q112059351 for the 2022 monkeypox outbreak in the United Kingdom.

Main article text

 

Introduction

Wikidata as a collaborative knowledge graph

Knowledge graph validation of Wikidata

Constraint-driven heuristics-based validation of epidemiological data

Discussion

Conclusion

Supplemental Information

SPARQL queries for the heuristics-based validation of epidemiological counts in Wikidata.

The SPARQL queries used for the Tasks defined in Table 2 are meant to be run against the Wikidata Query Service available at https://query.wikidata.org/. Note that this query service predefines the Wikidata-specific prefixes, so they do not need to be restated in a query.

DOI: 10.7717/peerj-cs.1085/supp-1
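
As a minimal sketch of how such a query could be submitted programmatically, the following Python snippet builds a request URL for the public SPARQL endpoint of the Wikidata Query Service. The query shown is purely illustrative and is not one of the supplemental queries; the use of P1603 ("number of cases") and the helper names `build_request_url` and `has_prefix_declarations` are assumptions made for this example. Because the service predefines prefixes such as `wdt:`, the query contains no PREFIX declarations.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint behind https://query.wikidata.org/
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query only (an assumption, not taken from the supplement).
# P1603 is assumed here to be the "number of cases" property.
# No PREFIX lines are needed: the query service predefines wdt:, wd:, etc.
EXAMPLE_QUERY = """
SELECT ?item ?cases WHERE {
  ?item wdt:P1603 ?cases .
}
LIMIT 5
"""


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": query.strip(), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


def has_prefix_declarations(query: str) -> bool:
    """Report whether any line of the query starts with a PREFIX declaration."""
    return any(
        line.strip().upper().startswith("PREFIX ")
        for line in query.splitlines()
    )
```

The resulting URL can be fetched with any HTTP client. When doing so, it is advisable to send a descriptive User-Agent header and to keep request rates low.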

SPARQL queries for the validation of case fatality rate statements in Wikidata.

These SPARQL queries correspond to the Tasks M1, M2 and M3 that address heuristics concerning the case fatality rate m.

DOI: 10.7717/peerj-cs.1085/supp-2
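
The kind of constraint that such heuristics can express may be sketched in Python as follows. The exact definitions of Tasks M1, M2 and M3 are given in the supplemental file and are not reproduced here; the checks below are illustrative assumptions only. They assume that m is the ratio of the number of deaths to the number of cases, and the function name `check_case_fatality` and the `tolerance` parameter are hypothetical.

```python
from typing import List, Optional


def check_case_fatality(
    cases: int,
    deaths: int,
    stated_m: Optional[float] = None,
    tolerance: float = 0.001,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return a list of detected inconsistencies (empty if none are found)."""
    issues: List[str] = []

    # Counts cannot be negative; nothing else is checked in that case.
    if cases < 0 or deaths < 0:
        issues.append("negative count")
        return issues

    # With zero cases the ratio is undefined, so only flag stray deaths.
    if cases == 0:
        if deaths > 0:
            issues.append("deaths reported without cases")
        return issues

    # Deaths are a subset of cases, so the ratio cannot exceed 1.
    if deaths > cases:
        issues.append("deaths exceed cases")

    computed_m = deaths / cases  # assumed definition: m = deaths / cases

    if stated_m is not None:
        if not 0 <= stated_m <= 1:
            issues.append("stated m outside [0, 1]")
        if abs(stated_m - computed_m) > tolerance:
            issues.append("stated m differs from deaths / cases")

    return issues
```

For example, `check_case_fatality(1000, 25, 0.025)` reports no issue, whereas `check_case_fatality(100, 120)` flags that deaths exceed cases.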

Additional Information and Declarations

Competing Interests

All the co-authors of this paper except Eric Prud’hommeaux are active members of WikiProject Medicine, the community curating clinical knowledge in Wikidata, and of WikiProject COVID-19, the community developing multidisciplinary COVID-19 information in Wikidata. Dariusz Jemielniak is an unpaid volunteer member of the Board of Trustees of the Wikimedia Foundation, the non-profit publisher of Wikipedia and Wikidata. Eric Prud’hommeaux is a co-creator of SPARQL. Eric Prud’hommeaux and Jose E. Labra Gayo are co-creators of ShEx.

Author Contributions

Houcemeddine Turki conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Dariusz Jemielniak conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Mohamed A. Hadj Taieb conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Jose E. Labra Gayo conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Mohamed Ben Aouicha conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Mus’ab Banat conceived and designed the experiments, performed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Thomas Shafee conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Eric Prud’hommeaux conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Tiago Lubiana conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Diptanshu Das conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Daniel Mietchen conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

All the SPARQL queries used in this research work are available in the Appendices.

The Internet Archive URLs cited in this article are available at Wikidata: https://web.archive.org/save/https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox&oldid=1580603965.

Funding

The work done by Houcemeddine Turki, Mohamed Ali Hadj Taieb, and Mohamed Ben Aouicha was supported by the Ministry of Higher Education and Scientific Research in Tunisia (MoHESR) in the framework of Federated Research Project PRFCOV19-D1-P1, by the Wikimedia Foundation through a rapid grant, and by the WikiCred Grants Initiative of Craig Newmark Philanthropies, Facebook, and Microsoft. The work done by Jose Emilio Labra Gayo was funded by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). The work done by Daniel Mietchen was supported by the Alfred P. Sloan Foundation under grant numbers G-2019-11458 and G-2021-17106. The work done by Dariusz Jemielniak was funded by the Polish National Science Center (Grant No. 2019/35/B/HS6/01056). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
