Interoperability and FAIRness through a novel combination of Web technologies

Mark D Wilkinson; Ruben Verborgh; Luiz Olavo Bonino da Silva Santos; Tim Clark; Morris A Swertz; Fleur D.L. Kelpin; Alasdair J. G. Gray; Erik A. Schultes; Erik M. van Mulligen; Paolo Ciccarese; Arnold Kuzniar; Anand Gavai; Mark Thompson; Rajaram Kaliyaperumal; Jerven T. Bolleman; Michel Dumontier

doi:10.7287/peerj.preprints.2522v2

Mark D Wilkinson¹, Ruben Verborgh², Luiz Olavo Bonino da Silva Santos³, Tim Clark⁴,⁵, Morris A Swertz⁶, Fleur D.L. Kelpin⁶, Alasdair J. G. Gray⁷, Erik A. Schultes⁸, Erik M. van Mulligen⁹, Paolo Ciccarese¹⁰,¹¹, Arnold Kuzniar¹², Anand Gavai¹², Mark Thompson¹³, Rajaram Kaliyaperumal¹⁴, Jerven T. Bolleman¹⁵, Michel Dumontier¹⁶

1 Center for Plant Biotechnology and Genomics - UPM/INIA, Universidad Politécnica de Madrid, Madrid, Spain

2 imec, Ghent University, Ghent, Belgium

3 Dutch Techcenter for Life Sciences, Utrecht, The Netherlands

4 Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, USA

5 Department of Neurology, Harvard Medical School, Boston, United States

6 Genomics Coordination Center and Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands

7 Department of Computer Science, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, United Kingdom

8 FAIR Data, Dutch TechCenter for Life Science, Utrecht, The Netherlands

9 Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands

10 Department of Neurology, Harvard Medical School, Boston, United States of America

11 PerkinElmer Inc., Waltham, Massachusetts, United States

12 Netherlands eScience Center, Amsterdam, The Netherlands

13 Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands

14 Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands

15 Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland

16 Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, United States of America

Published: 2017-01-02
Accepted: 2017-01-02

Subject Areas: Bioinformatics, Data Science, Databases, Emerging Technologies, World Wide Web and Web Science
Keywords: FAIR Data, Interoperability, Data Integration, Semantic Web, Linked Data, REST, RML, Triple Pattern Fragments

Copyright: © 2017 Wilkinson et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Wilkinson MD, Verborgh R, Bonino da Silva Santos LO, Clark T, Swertz MA, Kelpin FDL, Gray AJG, Schultes EA, van Mulligen EM, Ciccarese P, Kuzniar A, Gavai A, Thompson M, Kaliyaperumal R, Bolleman JT, Dumontier M. 2017. Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Preprints 5:e2522v2 https://doi.org/10.7287/peerj.preprints.2522v2

Abstract

Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task that does not scale. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved at the level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs.

Author Comment

This manuscript describes a novel approach to interoperability for data published in any public or private repository, guided by a desire to maximize adherence to the FAIR Data Principles. It provides a means for data discovery through publication of FAIR Metadata at the repository level and, optionally, at the record level. It then proposes a novel, discoverable, and machine-actionable approach to the provision of data that has been transformed into RDF, such that interoperability can be achieved at the data level even over computationally opaque and/or non-interoperable data formats. The former is accomplished through layered metadata with a structure informed by the W3C's Linked Data Platform Container. The latter is accomplished through a combination of models of RDF data written using RML, and data provided via servers following the Triple Pattern Fragments design pattern. These three technologies, in combination, allow a high degree of discoverability and FAIRness without the need to create any novel standards or APIs. We provide two exemplar implementations of this approach: the first demonstrates the ability to make a Zenodo data archive FAIR, and the second demonstrates that FAIR data in UniProt can be transformed into a novel semantic framework, and more explicitly linked to its citation metadata, using the same approach. Thus, we show that this approach is applicable to both static and dynamic data sources, in a wide range of common repositories.

In this new version, we have reordered the presentation of the components of the solution for clarity; we have added a driving use-case to better frame the purpose of the approach; and we have added a second exemplar to show the breadth of utility of the proposed combination of technologies.

Supplemental Information

Figure 1. The two layers of the FAIR Accessor

Inspired by the LDP Container, there are two resources in the FAIR Accessor. The first resource is a Container, which responds to an HTTP GET request by providing FAIR metadata about a composite research object, and optionally a list of URLs representing MetaRecords that describe individual components within the collection. The MetaRecord resources resolve by HTTP GET to documents containing metadata about an individual data component and, optionally, a set of links structured as DCAT Distributions that lead to various representations of that data.
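The two-layer navigation described in this caption can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the authors' implementation: the example.org URLs, the titles, and the `resolve` helper are hypothetical, and a dictionary lookup stands in for an HTTP GET. Only the vocabulary terms (ldp:contains, dcat:distribution, dcat:downloadURL, dcat:mediaType, dcterms:title) come from the LDP, DCAT, and Dublin Core vocabularies.

```python
# Minimal in-memory sketch of the two Accessor layers (all example.org URLs are hypothetical).
LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"
DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution"
DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL"
DCAT_MEDIA_TYPE = "http://www.w3.org/ns/dcat#mediaType"
DCT_TITLE = "http://purl.org/dc/terms/title"

CONTAINER = "http://example.org/accessor"
META = CONTAINER + "/record1"
DIST_HTML = META + "#dist-html"
DIST_RDF = META + "#dist-rdf"

# Each "document" is the list of (subject, predicate, object) triples
# that a GET on that URL is assumed to return.
DOCUMENTS = {
    CONTAINER: [
        (CONTAINER, DCT_TITLE, "Example dataset"),
        (CONTAINER, LDP_CONTAINS, META),
    ],
    META: [
        (META, DCT_TITLE, "Example record 1"),
        (META, DCAT_DISTRIBUTION, DIST_HTML),
        (DIST_HTML, DCAT_MEDIA_TYPE, "text/html"),
        (DIST_HTML, DCAT_DOWNLOAD_URL, "http://example.org/data/record1.html"),
        (META, DCAT_DISTRIBUTION, DIST_RDF),
        (DIST_RDF, DCAT_MEDIA_TYPE, "application/rdf+xml"),
        (DIST_RDF, DCAT_DOWNLOAD_URL, "http://example.org/data/record1.rdf"),
    ],
}


def resolve(url):
    """Stand-in for HTTP GET: return the triples of the document at `url`."""
    return DOCUMENTS[url]


def objects(triples, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def list_distributions(container_url):
    """Walk Container -> MetaRecords -> Distributions.

    Returns a list of (MetaRecord URL, media type, download URL) tuples.
    """
    found = []
    container_doc = resolve(container_url)
    for meta_url in objects(container_doc, container_url, LDP_CONTAINS):
        meta_doc = resolve(meta_url)
        for dist in objects(meta_doc, meta_url, DCAT_DISTRIBUTION):
            media_type = objects(meta_doc, dist, DCAT_MEDIA_TYPE)[0]
            download_url = objects(meta_doc, dist, DCAT_DOWNLOAD_URL)[0]
            found.append((meta_url, media_type, download_url))
    return found
```

Calling `list_distributions(CONTAINER)` walks from the Container to its single MetaRecord and returns one tuple per Distribution. A real client would replace `resolve` with an HTTP GET followed by RDF parsing.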

DOI: 10.7287/peerj.preprints.2522v2/supp-1
Figure 2. Diagram of the structure of an exemplar Triple Descriptor representing a hypothetical record of a SNP in a patient’s genome

In this descriptor, the Subject will have the URL structure http://example.org/patient/{id}, and the Subject is of type PatientRecord. The Predicate is hasVariant, and the Object will have the URL structure http://identifiers.org/dbsnp/{snp}, with an rdf:type of Sequence Ontology term “0000694” (the concept of a “SNP”). The two nodes shaded green are of the same ontological type, showing the iterative nature of RML, and how individual RML Triple Descriptors will be concatenated into full FAIR Profiles. The three nodes shaded yellow are the nodes that define the subject type, predicate and object type of the triple being described.
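To make the role of the Triple Descriptor concrete, the sketch below expands the two URL templates from this caption into triples. It is a plain-Python stand-in, not RML syntax and not the authors' code. The namespace used for PatientRecord and hasVariant (http://example.org/vocab#) is hypothetical, as are the sample identifier values, and the OBO PURL form of the Sequence Ontology term is an assumption about how “0000694” would be written as a URI.

```python
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass(frozen=True)
class TripleDescriptor:
    """Describes the shape of one triple: typed subject, predicate, typed object."""
    subject_template: str
    subject_type: str
    predicate: str
    object_template: str
    object_type: str

    def triples(self, row):
        """Fill both URL templates from `row` and return the three resulting triples."""
        subject = self.subject_template.format(**row)
        obj = self.object_template.format(**row)
        return [
            (subject, RDF_TYPE, self.subject_type),
            (subject, self.predicate, obj),
            (obj, RDF_TYPE, self.object_type),
        ]


SNP_DESCRIPTOR = TripleDescriptor(
    subject_template="http://example.org/patient/{id}",
    subject_type="http://example.org/vocab#PatientRecord",   # hypothetical namespace
    predicate="http://example.org/vocab#hasVariant",         # hypothetical namespace
    object_template="http://identifiers.org/dbsnp/{snp}",
    object_type="http://purl.obolibrary.org/obo/SO_0000694",  # assumed URI form of SO 0000694
)


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(to_ntriples(SNP_DESCRIPTOR.triples({"id": "42", "snp": "rs123"})))
```

Because the descriptor states the subject type, predicate, and object type up front, a client can decide whether a Projector's output is useful before requesting any data.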

DOI: 10.7287/peerj.preprints.2522v2/supp-2
Figure 3. Integration of FAIR Projectors into the FAIR Accessor

Resolving the MetaRecord resource returns a metadata document containing multiple DCAT Distributions for a given record, as in Figure 1. When a FAIR Projector is available, additional DCAT Distributions are included in this metadata document. These Distributions contain a URL (purple text) representing a Projector, and a Triple Descriptor that describes, in RML, the structure and semantics of the Triple(s) that will be obtained from that Projector resource if it is resolved. These Triple Descriptors may be aggregated into FAIR Profiles, based on the Record that they are associated with (Record R, in the figure) to give a full mapping of all available representations of the data present in Record R.
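The aggregation step mentioned at the end of this caption amounts to grouping Triple Descriptors by the Record they belong to. The sketch below shows that grouping over a simplified, hypothetical view of the Distributions; the dictionary keys, URLs, and the hasAge/Age descriptor are invented for illustration and do not reflect the actual metadata structure.

```python
from collections import defaultdict

# Hypothetical, simplified view of the Distributions found in MetaRecord documents.
# A descriptor is reduced to (subject type, predicate, object type); plain
# Distributions (for example an HTML download) carry no Triple Descriptor.
DISTRIBUTIONS = [
    {"record": "R", "url": "http://example.org/data/R.html", "descriptor": None},
    {"record": "R", "url": "http://example.org/projector/a",
     "descriptor": ("PatientRecord", "hasVariant", "SNP")},
    {"record": "R", "url": "http://example.org/projector/b",
     "descriptor": ("PatientRecord", "hasAge", "Age")},
]


def build_profiles(distributions):
    """Group Triple Descriptors by record to form one profile per record."""
    profiles = defaultdict(list)
    for dist in distributions:
        descriptor = dist.get("descriptor")
        if descriptor is not None:
            profiles[dist["record"]].append(descriptor)
    return dict(profiles)
```

Here `build_profiles(DISTRIBUTIONS)` yields a single profile for Record R containing both descriptors, which is the sense in which a profile maps out the projected representations available for that record.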

DOI: 10.7287/peerj.preprints.2522v2/supp-3
Figure 4. A representative portion of the output from resolving the Container Resource of the FAIR Accessor, rendered into HTML by the Tabulator Firefox plugin

The three columns show the label of the Subject node of all RDF Triples (left), the label of the URI in the predicate position of each Triple (middle), and the value of the Object position (right), where blue text indicates that the value is a Resource, and black text indicates that the value is a literal.

DOI: 10.7287/peerj.preprints.2522v2/supp-4
Figure 5. A representative (incomplete) portion of the output from resolving the MetaRecord Resource of the FAIR Accessor for record C8V1L6 (at http://linkeddata.systems/Accessors/UniProtAccessor/C8V1L6), rendered into HTML by the Tabulator Firefox plugin

The columns have the same meaning as in Figure 4.

DOI: 10.7287/peerj.preprints.2522v2/supp-5
Figure 6. Turtle representation of the subset of triples from the MetaRecord metadata pertaining to the two DCAT Distributions

Each distribution specifies an available representation (media type), and a URL from which that representation can be downloaded.

DOI: 10.7287/peerj.preprints.2522v2/supp-6
Figure 7. A portion of the output from resolving the MetaRecord Resource of the FAIR Accessor for record C8UZX9, rendered into HTML by the Tabulator Firefox plugin

The columns have the same meaning as in Figure 4. Comparing the structure of this document to that in Figure 5 shows that there are now four values for the “distribution” predicate: an RDF and an HTML representation, as in Figure 5, and two additional distributions with URLs conforming to the TPF design pattern (highlighted).
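A URL that follows the TPF design pattern selects triples by fixing any combination of subject, predicate, and object. The sketch below builds such a URL with the Python standard library. The Projector base URL and the sample subject are hypothetical, and the query parameter names `subject`, `predicate`, and `object` are an assumption based on common TPF servers; a TPF client is expected to discover the actual names from the hypermedia controls in a server's response.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def tpf_url(base, subject=None, predicate=None, obj=None):
    """Build a triple-pattern request URL; components left as None are unbound."""
    pattern = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pattern if value is not None}
    if not params:
        return base
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    url = tpf_url(
        "http://example.org/projector",               # hypothetical Projector URL
        subject="http://example.org/record/C8UZX9",   # hypothetical subject URI
    )
    print(url)
    # Round-trip the query string to show which pattern components are bound.
    print(parse_qs(urlsplit(url).query))
```

Leaving all three components unbound returns the base URL, which in the TPF pattern corresponds to requesting every triple the server exposes.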

DOI: 10.7287/peerj.preprints.2522v2/supp-7
Figure 8. Turtle representation of the subset of triples from the MetaRecord metadata pertaining to one of the FAIR Projector DCAT Distributions of the MetaRecord shown in Figure 7

The text is colour-coded to assist in visual exploration of the RDF. The DCAT Distribution blocks of the two Projector distributions (black bold) have multiple media-type representations (red), and are connected by the hasMapping predicate to an RML Map (dark blue), a block of RML that semantically describes the subject, predicate, and object (green, orange, and purple respectively) of the Triple Descriptor for that Projector. This block of RML is schematically diagrammed in Figure 2. The three media-types (red) indicate that the URL will respond to HTTP Content Negotiation, and may return any of those three formats.
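HTTP Content Negotiation means that the client states its preferred formats in an Accept header and the server picks one it can produce. The sketch below shows a simplified server-side selection. The three offered media types are placeholders, since the caption does not name the formats shown in the figure, and the parser deliberately ignores partial wildcards such as `text/*` and media-type parameters other than `q`.

```python
# Placeholder list: the three media types in the figure are not named in the caption.
OFFERED = ["text/turtle", "application/n-triples", "application/ld+json"]


def parse_accept(header):
    """Return [(media_type, q)] sorted by descending q; ties keep header order."""
    preferences = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        preferences.append((fields[0].lower(), quality))
    # sorted() is stable, so equal q-values stay in header order.
    return sorted(preferences, key=lambda pref: -pref[1])


def negotiate(header, offered):
    """Pick the offered media type the client prefers most, or None if none match."""
    for media_type, quality in parse_accept(header):
        if quality <= 0:
            continue
        if media_type == "*/*":
            return offered[0]
        if media_type in offered:
            return media_type
    return None
```

A server following this sketch would answer a `None` result with a 406 Not Acceptable status, and otherwise serialize the triples in the selected format.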

DOI: 10.7287/peerj.preprints.2522v2/supp-8
Figure 9. Data before and after FAIR Projection

Bolded segments show how the URI structure and the semantics of the data were modified, according to the mapping defined in the Triple Descriptor (data_0896 = “Protein report” and data_1176 = “GO Concept ID”). URI structure transformations may be useful for integrative queries against datasets that utilize the Identifiers.org URI scheme, such as OpenLifeData (González et al., 2014). Semantic transformations allow integrative queries across datasets that utilize diverse and redundant ontologies for describing their data, and in this example, may also be used to add semantics where there were none before.
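The two kinds of transformation named in this caption, rewriting URI structure and attaching semantic types, can be illustrated with a small rewriting table. The sketch below is an assumption-laden illustration rather than a reproduction of the figure: the source URI patterns (purl.uniprot.org and OBO PURLs) and the Identifiers.org targets are guesses at plausible before and after forms, and only the two EDAM identifiers (data_0896, data_1176) come from the caption.

```python
import re

EDAM = "http://edamontology.org/"

# (source pattern, Identifiers.org template, EDAM type to attach); patterns are assumptions.
REWRITES = [
    (re.compile(r"^http://purl\.uniprot\.org/uniprot/(?P<id>[A-Z0-9]+)$"),
     "http://identifiers.org/uniprot/{id}",
     EDAM + "data_0896"),  # "Protein report"
    (re.compile(r"^http://purl\.obolibrary\.org/obo/GO_(?P<id>\d{7})$"),
     "http://identifiers.org/go/GO:{id}",
     EDAM + "data_1176"),  # "GO Concept ID"
]


def project_node(uri):
    """Return (rewritten URI, EDAM type), or (uri, None) when no rule applies."""
    for pattern, template, edam_type in REWRITES:
        match = pattern.match(uri)
        if match:
            return template.format(id=match.group("id")), edam_type
    return uri, None


if __name__ == "__main__":
    for source in (
        "http://purl.uniprot.org/uniprot/C8V1L6",
        "http://purl.obolibrary.org/obo/GO_0005515",
    ):
        print(source, "->", project_node(source))
```

URIs that match no rule pass through unchanged and untyped, so a projection of this kind adds semantics only where the mapping defines them.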

DOI: 10.7287/peerj.preprints.2522v2/supp-9