This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;
(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;
(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.
Table of Links
- Abstract & Introduction
- Interoperability
- Semantic interoperability and what natural languages like English can teach us
- Requirements for successfully communicating terms and statements
- Parallels between the structure of natural language statements and data schemata with implications for semantic interoperability
- What makes a term a good term and a schema a good schema?
- The need for a machine-actionable Rosetta Stone for (meta)data that acts as an interlingua for specifying reference terms and reference schemata to support cognitive and semantic interoperability
- Rosetta Stone and machine-readability: UPRIs, XML Schema datatypes, and RDF for communicating terms, datatypes, and statements
- Rosetta Stone and machine-interpretability: Wikidata and a modeling paradigm for (meta)data statements based on English
- Rosetta Stone and semantic interoperability: Specifying term mappings and schema crosswalks
- Rosetta Stone and cognitive interoperability: Specifying display templates and using a query builder
- Discussion
- Related work
- Conclusion, Acknowledgements, & References
Rosetta Stone and machine-readability: UPRIs, XML Schema datatypes, and RDF for communicating terms, datatypes, and statements
We have discussed above that a dataset documented in an ASCII file already qualifies as machine-readable, and this readability extends to all the terms and statements contained in the file. However, the ASCII file alone does not enable a machine to identify exactly where a term, a value, or a statement begins and ends in the dataset. To enable a machine to make such distinctions, we would need to specify additional information in an application.
Because we want to communicate (meta)data efficiently and reliably across machines and between machines and humans, and because we want to disambiguate terms, values, and statements, we suggest using UPRIs to communicate terms, XML Schema datatypes to communicate datatypes, and RDF to communicate statements. We use UPRIs for:
- instance-terms: instance resources, including named individuals, that use proper names as their labels, refer to individual entities, and instantiate classes;
- class-terms: class resources that use general/kind terms as their labels and refer to the extension of their class, i.e., the set of instance resources that instantiate the class. Class-terms can also refer to types of statements and their defining verbs or predicates;
- property-terms: property resources that use verbs or predicates as their labels and refer to a specific type of action or attribute.
Machines can recognize unambiguously where a UPRI or a value (i.e., a literal) associated with an XML Schema datatype begins and where it ends. Both can be used in RDF to make statements using the RDF triple syntax of Subject-Predicate-Object, and a machine has no difficulty recognizing where a triple starts and where it ends.
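The delimiting role of UPRIs, typed literals, and triples can be illustrated with a minimal sketch in plain Python. It is not an implementation of any particular RDF library: the classes `Term`, `Literal`, and `Triple`, the `https://example.org/` namespace, and the resource names `tree-1`, `apple-1`, and `hasPart` are all hypothetical and chosen only for illustration. The serialization follows the general shape of N-Triples, where angle brackets delimit a UPRI, quotation marks plus a datatype UPRI delimit a literal, and a terminating dot delimits a triple.

```python
from dataclasses import dataclass
from typing import Union

XSD = "http://www.w3.org/2001/XMLSchema#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "https://example.org/"  # hypothetical namespace, for illustration only


@dataclass(frozen=True)
class Term:
    """An instance-, class-, or property-term identified by a UPRI."""
    upri: str

    def nt(self) -> str:
        # Angle brackets mark exactly where the identifier begins and ends.
        return f"<{self.upri}>"


@dataclass(frozen=True)
class Literal:
    """A value paired with an XML Schema datatype."""
    value: str
    datatype: str

    def nt(self) -> str:
        # Minimal escaping only (backslash and double quote).
        escaped = self.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"^^<{self.datatype}>'


@dataclass(frozen=True)
class Triple:
    """A binary Subject-Predicate-Object statement."""
    subject: Term
    predicate: Term
    obj: Union[Term, Literal]

    def nt(self) -> str:
        # The trailing dot marks where the triple ends.
        return f"{self.subject.nt()} {self.predicate.nt()} {self.obj.nt()} ."


tree = Term(EX + "tree-1")        # hypothetical instance-term
apple = Term(EX + "apple-1")      # hypothetical instance-term
has_part = Term(EX + "hasPart")   # hypothetical property-term

statements = [
    Triple(tree, has_part, apple),
    Triple(apple, Term(RDFS_LABEL), Literal("this apple", XSD + "string")),
]
lines = [t.nt() for t in statements]
for line in lines:
    print(line)
```

Each printed line is one complete statement, so a parser needs no application-specific knowledge to tell terms, values, and statements apart.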
Unfortunately, comparing the predicate-argument structure of triples with that of syntax trees shows that the two differ and therefore do not align properly: the Predicate of a triple is necessarily binary, i.e., a triple always has exactly one subject argument and one object argument. The predicates of natural language statements, however, are not always binary, as in the statement “This tree has part an apple”, but are often n-ary, as in “This apple has a weight of 212.45 grams” or “Anna travels by train from Berlin to Paris on the 21st of April 2023”. Therefore, ontology properties (i.e., the Predicate resources used in triples) do not map one-to-one to natural language predicates, and we often need to model a natural language statement using multiple triples (cf. Fig. 1 and Fig. 2E). This is possible in principle: if one triple uses a UPRI in its Subject position that another triple uses in its Object position, the two triples combine to form a semantic graph, so multiple triples can together model an n-ary natural language statement. However, semantic modeling of n-ary statements in RDF is often challenging, significantly increasing the semantification burden and usually resulting in graphs that are overly complex for humans to comprehend.
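As a rough illustration of how several binary triples combine into one n-ary statement, the following sketch models “This apple has a weight of 212.45 grams” in plain Python. The decomposition into a quality node and a measurement node, and all names under the hypothetical `https://example.org/` namespace (`hasQuality`, `hasMeasurement`, `hasValue`, `hasUnit`), are assumptions made for this example and are not the schema shown in the paper's figures. The helper `connected_triples` is likewise hypothetical: it follows the rule described above, chaining triples wherever the Object of one triple is the Subject of another.

```python
from collections import defaultdict, deque

EX = "https://example.org/"  # hypothetical namespace, for illustration only
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

# Each triple is (subject, predicate, object). Resources are UPRI strings;
# a literal is a (lexical value, datatype UPRI) tuple.
triples = [
    (EX + "tree-1", EX + "hasPart", EX + "apple-1"),
    (EX + "apple-1", EX + "hasQuality", EX + "weight-1"),
    (EX + "weight-1", EX + "hasMeasurement", EX + "measurement-1"),
    (EX + "measurement-1", EX + "hasValue", ("212.45", XSD_DECIMAL)),
    (EX + "measurement-1", EX + "hasUnit", EX + "gram"),
]


def connected_triples(start, triples):
    """Collect all triples reachable from `start` by following
    Object-to-Subject links, i.e. the subgraph rooted at `start`."""
    by_subject = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)

    seen = {start}
    queue = deque([start])
    result = []
    while queue:
        node = queue.popleft()
        for triple in by_subject.get(node, []):
            result.append(triple)
            obj = triple[2]
            # Only resources (strings) can act as the Subject of a further
            # triple; literals (tuples) end the chain.
            if isinstance(obj, str) and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return result


weight_statement = connected_triples(EX + "apple-1", triples)
print(len(weight_statement), "triples model the weight statement")
```

Under these assumptions, a statement that English expresses with one predicate and three arguments (the apple, the value, the unit) requires four triples and two intermediate nodes, while the unrelated binary statement about the tree stays outside the subgraph. This is the kind of overhead that makes n-ary modeling in RDF harder to author and to read.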
In the following, we will stay with the RDF framework, as we believe it is the most flexible, allowing the specification of both ontologies and (meta)data schemata, as well as term mappings across ontologies and schema crosswalks across schemata. However, we will introduce a modeling paradigm that reflects the structure of English and provides a generic pattern for modeling n-ary statements in RDF. In all of this, we are trying to take a pragmatic approach that may not satisfy all the requirements for knowledge management that one would wish for in an ideal world, but from which we hope to achieve practical improvements.