The Need for a Machine-actionable Rosetta Stone for (meta)data That Acts as an Interlingua

8 Mar 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;

(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;

(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.

The need for a machine-actionable Rosetta Stone for (meta)data that acts as an interlingua for specifying reference terms and reference schemata to support cognitive and semantic interoperability

Above, we discussed the role of terms and statement structures (i.e., syntax trees and (meta)data schemata) in reliably communicating the meaning, and thus the semantic content, of (meta)data statements. Statement structures specify syntactic positions or slots with semantic roles or constraint specifications for a given statement type. To achieve semantic interoperability, we therefore need two things. For FAIR terms and their terminological interoperability, we need controlled vocabularies (i.e., ontologies) together with ontological and referential term mappings across ontologies. For FAIR (meta)data statements and their schematic interoperability, we need (meta)data schemata together with ontological and referential schema crosswalks.

We also discussed why we think it is impossible to agree on a single best term for every possible type of entity and a single best schema for every possible type of statement, given varying frames of reference and operational priorities. We therefore need something like a machine-actionable Rosetta Stone to support semantic interoperability across different terms and different schemata for a given type of (meta)data statement. This Rosetta Stone needs to function like an interlingua with which term mappings and schema crosswalks can be easily specified and operationalized. The building blocks of the interlingua are reference terms, reference datatype specifications, and reference schemata. Each entity type must have a corresponding reference term, and each statement type must have a corresponding reference schema. Terms from controlled vocabularies can be mapped to their corresponding reference term, and schemata to their corresponding reference schema. Constraint specifications for slots of reference schemata must refer to reference terms in the case of resources, and to reference datatype specifications in the case of values. These three types of building blocks act as mediating connectors, so that it is no longer necessary to specify a schema crosswalk for every possible pair of schemata of a given type of (meta)data statement, or a term mapping for every possible pair of terms. This minimizes the number of schema crosswalks and term mappings that need to be specified to achieve schematic and terminological interoperability for a given type of statement (Fig. 4).
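The size of this reduction can be illustrated with a simple count. The following is a minimal sketch, not part of the original paper: it assumes that one crosswalk per pair of schemata covers both directions, and that each schema needs exactly one crosswalk to the reference schema. The function names and the schema labels are hypothetical.

```python
from itertools import combinations


def pairwise_crosswalks(n_schemata: int) -> int:
    """Crosswalks needed when every pair of schemata is linked directly."""
    return n_schemata * (n_schemata - 1) // 2


def hub_crosswalks(n_schemata: int) -> int:
    """Crosswalks needed when each schema is linked only to a reference schema.

    Assumption: a single crosswalk per schema covers both directions.
    """
    return n_schemata


# Eight hypothetical schemata for the same type of statement.
schemata = [f"schema_{i}" for i in range(1, 9)]

# Enumerate the pairs explicitly to cross-check the closed-form count.
direct_pairs = list(combinations(schemata, 2))
assert len(direct_pairs) == pairwise_crosswalks(len(schemata))

print(pairwise_crosswalks(len(schemata)))  # 28
print(hub_crosswalks(len(schemata)))       # 8
```

Under these assumptions, the direct approach grows quadratically with the number of schemata, whereas the reference-schema approach grows linearly. The same reasoning would apply to term mappings routed through reference terms.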

Ideally, a reference schema is based on a generic Rosetta modeling paradigm that allows the reconstruction of the natural language statement underlying the datum. At the same time, it should document this statement using a formalized structure to ensure its human- and machine-actionability. With respect to human-actionability, the Rosetta modeling paradigm should reflect as closely as possible the structure of natural language statements, favoring lean over complex models, with the aim of reducing overall modeling complexity and modeling burden. Many schemata are very complex and include positions with resources that do not directly align with any input slot (e.g., ‘scalar measurement datum’ and ‘scalar value specification’ in Fig. 2E). Such schemata are not suitable for use as reference schemata.
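To make the idea of a lean, slot-based reference schema more tangible, the sketch below models a schema as a list of slots plus a natural language template. It is only an illustration under stated assumptions: the class names, the `ref:` identifiers, the weight measurement statement type, and the example datum are all hypothetical and are not taken from the paper or from any existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str        # syntactic position in the statement
    kind: str        # "resource" or "value"
    constraint: str  # reference term (resources) or reference datatype (values)


@dataclass(frozen=True)
class ReferenceSchema:
    statement_type: str
    slots: tuple
    template: str    # natural language pattern with one placeholder per slot

    def validate(self, datum: dict) -> None:
        """Check that the datum fills exactly the slots the schema defines."""
        expected = {slot.name for slot in self.slots}
        if set(datum) != expected:
            raise ValueError(
                f"expected slots {sorted(expected)}, got {sorted(datum)}"
            )

    def render(self, datum: dict) -> str:
        """Reconstruct the natural language statement underlying the datum."""
        self.validate(datum)
        return self.template.format(**datum)


# Hypothetical reference schema: every slot corresponds directly to an input.
weight_measurement = ReferenceSchema(
    statement_type="weight measurement statement",
    slots=(
        Slot("object", "resource", "ref:material_object"),
        Slot("value", "value", "ref:float"),
        Slot("unit", "resource", "ref:mass_unit"),
    ),
    template="{object} has a weight of {value} {unit}",
)

datum = {"object": "This apple", "value": 204.56, "unit": "grams"}
print(weight_measurement.render(datum))
```

In this sketch every slot maps to something a data producer actually enters, which is the property that distinguishes a lean schema from one containing intermediate resources with no corresponding input slot. Checking slot values against their constraints is deliberately left out, since it would depend on how reference terms and datatypes are resolved.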

Schemata that conform to the Rosetta modeling paradigm should be easy to understand and to apply, allowing any producer of (meta)data to specify new reference schemata for types of statements that do not yet have a reference schema assigned to them, and allowing any application developer to readily use the resulting (meta)data. Neither task should require experience in semantics or knowledge engineering on the part of the data producer or the application developer.

Figure 4: Number of schema crosswalks required. Left) Achieving schematic interoperability between 8 different schemata requires a very high number of schema crosswalks (28), because each possible pair of schemata has its own

With regard to the structure of reference schemata, we need to consider the machine-actionability of the resulting (meta)data statements. In other words, we need to consider which operations are important for such a reference schema. While reasoning is important for domain knowledge, especially when developing ontologies, other types of operations such as searching and exploring are more important in the context of empirical (meta)data and (meta)data management in general. However, regardless of the choice of operations and associated tools, the application of reference schemata must result in FAIR (meta)data that are machine-readable and machine-interpretable in order to be machine-actionable.

As for reference terms, they should ideally be collected in a large, controlled cross-domain vocabulary and should be machine-readable and machine-interpretable to be machine-actionable.