Rosetta Stone and machine-interpretability: Wikidata and a modeling paradigm

8 Mar 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;

(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;

(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.

Rosetta Stone and machine-interpretability: Wikidata and a modeling paradigm for (meta)data statements based on English

In the search for reference terms, we turn to ontologies and controlled vocabularies because they provide terms that carry meaning through ontological definitions (and ideally also recognition criteria for designation and recognition tasks), assign a UPRI to each term, and organize the terms in a taxonomy. They thus provide machine-interpretable terms. With respect to cognitive interoperability, Wikidata lends itself as a prime candidate for a repository of reference terms: it covers many terms, especially in English, which is the de facto lingua universalis in science and academia. For many terms, it provides labels in multiple languages. It also covers many terms from different ontologies, and it provides an interface and workflows so that anyone can add more terms to it. Since each term in Wikidata has its own UPRI, the problem of homonymy in natural languages is solved: even if two terms have identical labels but differ in meaning and/or referent, a machine can still distinguish them because they have different UPRIs.

While terms carry meaning through their ontological definitions, statements carry meaning through their terms and the syntactic positions in which they are placed. As discussed above, not all natural language statements are based on predicates with a binary valence, so their predicates do not map directly to ontology properties. Because RDF is highly expressive, a given type of natural language statement can be modeled in RDF in many different ways. The core of our Rosetta approach is therefore a particular modeling paradigm for types of statements. To meet the requirements of cognitive interoperability, we follow the basic idea that this modeling paradigm should be as generic and simple as possible, reflecting as much as possible structures that we are already familiar with from a natural language like English. In addition, the paradigm must support the specification of new reference schemata so that it becomes a straightforward task that does not require any background in semantics; it should allow the semantic modeling step to be automated. This can only be achieved if we come up with a very generic structure for the model. This structure must be applicable to any type of statement, regardless of its n-aryness. To be lean, it should store only what is necessary to recover the meaning of the statement, which should be the same as storing only the information a user needs to provide to create a new statement. For example, instead of creating the entire subgraph shown in Figure 2E for a weight measurement statement, it should be sufficient to store only the resources for the measured object and the quality, together with the value and the unit, with the main focus on always being able to reconstruct the original user input or data import for a given statement. Another criterion that the generic model must meet is that it must facilitate the seamless derivation of queries from it. Each one of these criteria is important because, in the end, the reference schemata must support semantically interoperable (meta)data statements with which not only machines but also humans can interact. But how do we get there?

For the time being, we limit ourselves to statements as the minimum information units, apart from terms. We can always extend the generic model later to create reference schemata that cover groups of (meta)data statements, and with semantic units (50) we have already provided a framework within which this should be straightforward.

Next, we need to abstract the structure of syntax trees from natural language statements to their syntactic positions and associated semantic roles. Since we do not need to be able to represent the full expressiveness of natural language statements when documenting (meta)data statements, we can restrict ourselves to statements with a rather simple structure of subject, transitive verb or predicate, and a number of objects. In this first attempt to develop a machine-actionable Rosetta Stone, we do not consider passive forms or tenses, and we do not need to distinguish between different possible syntactic alternations in which a verb or predicate can express its arguments. The generic model underlying our modeling paradigm is thus similar to a highly simplified frameset (see discussion above), specifying a subject-position and a number of required and optional object-positions, each with its associated semantic role in the form of a thematic label and a corresponding constraint specification. Its structure is thus an abstraction of the structure of a syntax tree.

Different types of statements can be distinguished on the basis of their underlying predicates (i.e., relations), resulting in a predicate-based classification of types of statements. For example, the apple weight measurement statement from Fig. 2A is of the type weight measurement statement.

As described above, following the predicate-argument structure, each predicate has a valence that determines the number and types of arguments that it requires to complete its meaning. For example, a has-part statement requires two arguments (a subject representing the entity that has some part and an object representing the part), resulting in a statement that can be modeled as a single triple following conventional triple modeling schemes, e.g., ‘SUBJECT has part OBJECT’. Without the specification of a subject and an object, the has-part predicate cannot complete its meaning. Adjuncts can additionally relate to the predicate; they are not necessary to complete its meaning but provide optional information, such as a timestamp specification in the has-part statement, e.g., ‘SUBJECT has part OBJECT at TIMESTAMP’, which would require multiple triples for modeling in RDF. Subject and object phrases are the most common arguments and adjuncts. So we can say that each statement relates a resource, which is the subject argument of its predicate, to one or more literals or resources, which are the object arguments and adjuncts of its predicate. The subject argument is what we call the subject of the statement, and the object argument(s) and adjunct(s) are its required and optional object(s). Every statement has such a subject and one or more objects.

According to this model, we can distinguish different types of predicate by the total number of subjects and objects they relate within a given statement. A statement like ‘Sarah met Bob’ is a statement with a binary relation, where we refer to ‘Sarah’ as the subject of the statement and ‘Bob’ as its object. If we add a date to the statement, such as in ‘Sarah met Bob on 4th of July 2021’, it becomes a ternary relation with two objects³. If we add a place, it even becomes a quaternary relation, as in ‘Sarah met Bob on 4th of July 2021 in New York City’. This is open-ended in principle, although it is limited by the dimensionality of the human reader’s ability to comprehend n-ary relations⁴. Regardless of this limitation, statements can be distinguished into binary, ternary, quaternary, and so on, based on the number of subjects and objects that their underlying predicates relate to. Furthermore, based on the distinction between arguments and adjuncts, we can distinguish objects that are necessary and thus required to complete the meaning of the statement’s predicate from objects that are optional.
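The distinction between required objects (arguments) and optional objects (adjuncts), and the resulting n-aryness, can be illustrated with a minimal Python sketch. The `Statement` class, its field names, and the `arity_name` helper are our own illustrative assumptions and are not part of the Rosetta specification.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    """Hypothetical container: one subject plus required and optional objects."""
    predicate: str
    subject: str
    required_objects: list = field(default_factory=list)  # arguments
    optional_objects: list = field(default_factory=list)  # adjuncts

    def arity(self) -> int:
        # One subject plus every object the predicate relates it to.
        return 1 + len(self.required_objects) + len(self.optional_objects)

ARITY_NAMES = {2: "binary", 3: "ternary", 4: "quaternary"}

def arity_name(stmt: Statement) -> str:
    """Name the relation by the number of subjects and objects it relates."""
    n = stmt.arity()
    return ARITY_NAMES.get(n, f"{n}-ary")

met = Statement("met", "Sarah", ["Bob"])
met_dated = Statement("met", "Sarah", ["Bob"], ["2021-07-04"])
met_full = Statement("met", "Sarah", ["Bob"], ["2021-07-04", "New York City"])

for s in (met, met_dated, met_full):
    print(arity_name(s))
```

Treating the date and the place as adjuncts here is a modeling choice made for the example; a reference schema could equally declare them as required.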

Looking again at the example above, if we were to model the statement ‘Sarah met Bob on 4th of July 2021’ in a knowledge graph, the objects ‘Bob’ and ‘4th of July 2021’ would be modeled differently. Whereas ‘Bob’ is likely to be modeled as a resource that instantiates a class ‘person’ (wikidata:Q215627), ‘4th of July 2021’ is likely to be modeled as a literal associated with the datatype xsd:date. Therefore, in addition to distinguishing arguments and adjuncts, each with their associated semantic roles and thematic labels, one can distinguish objects by their type into resources via their respective UPRIs and literals and specify class/datatype constraints based on their associated semantic roles. Resources, in turn, can be either named-individuals, classes, or properties (and, when following the semantic unit framework, some-instance and every-instance resources as well (50)). We shall refer to them as resource-objects and to literal-based objects as literal-objects. Statements can be distinguished by the number of resource-objects and the number of literal-objects they contain. With resource-subjects, resource-objects, and literal-objects, we now have the different elements that each reference schema must cover. The next step is to work out how best to relate them to each other and to the statement.

The light version of the Rosetta approach

The Rosetta modeling approach needs to relate the subject-resource to the different types of object-resources and object-literals of a given type of (meta)data statement. To avoid the problems of modeling statements that have n-ary predicates, and to reflect as closely as possible the structure of simple natural language statements in English that consist of only one verb or predicate, we classify all types of statements based on their predicate, and use instances of the respective classes to link the subject and object-resources, as well as the object-literals. So the statement ‘This apple has a weight of 212.45 grams’ would instantiate a ‘weight measurement statement’ class, and the corresponding reference schema would link an instance of ‘apple’ (wikidata:Q89) as the statement’s subject-resource via a ‘has subject’ property to an instance of this statement class. The schema would also require two additional arguments to be added: (i) a value of 212.45 with datatype xsd:float as the object-literal and (ii) a named-individual resource ‘gram’ (wikidata:Q41803) as the object-resource. The schema links the statement instance resource to these object-arguments via a ‘required object position’ property (Fig. 5). Comparing this schema with the weight measurement schemata from OBO and OBOE (cf. Fig. 3), it is immediately apparent that, on the one hand, fewer triples are required to model the statement (i.e., three instead of five or six) and, on the other hand, far fewer classes are required. The reference schema is simpler and contains only input slots and no additional positions such as ‘scalar measurement datum’ and ‘scalar value specification’ in the OBO schema or ‘observation’ and ‘measurement’ in the OBOE schema. A human reader is not interested in these additional positions and their resources; they only want to see the information from the input slots.
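As a rough illustration of the light version, the following Python sketch writes the apple weight statement as plain subject-predicate-object tuples and then recovers the original user input from them. All `ex:` identifiers (including the property names `ex:hasSubject` and `ex:requiredObjectPosition`) are placeholders we assume for the example; only the Wikidata identifiers come from the text. The typing triples are added for readability and are not counted among the three linking triples.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """A literal value together with its datatype."""
    value: object
    datatype: str

# Hypothetical identifiers, assumed for this sketch only.
STATEMENT_CLASS = "ex:WeightMeasurementStatement"
HAS_SUBJECT = "ex:hasSubject"
REQUIRED_OBJECT = "ex:requiredObjectPosition"

def weight_statement(stmt_id, subject, value, unit):
    """Return the light-version triples for one weight measurement statement."""
    return [
        (stmt_id, "rdf:type", STATEMENT_CLASS),
        (stmt_id, HAS_SUBJECT, subject),                          # subject-resource
        (stmt_id, REQUIRED_OBJECT, Literal(value, "xsd:float")),  # object-literal
        (stmt_id, REQUIRED_OBJECT, unit),                         # object-resource
    ]

def user_input(triples, stmt_id):
    """Recover the subject, object-literals and object-resources of a statement."""
    result = {"subject": None, "literals": [], "resources": []}
    for s, p, o in triples:
        if s != stmt_id:
            continue
        if p == HAS_SUBJECT:
            result["subject"] = o
        elif p == REQUIRED_OBJECT:
            key = "literals" if isinstance(o, Literal) else "resources"
            result[key].append(o)
    return result

triples = weight_statement("ex:stmt1", "ex:apple1", 212.45, "wikidata:Q41803")
triples.append(("ex:apple1", "rdf:type", "wikidata:Q89"))  # the apple instance

print(user_input(triples, "ex:stmt1"))
```

Because the statement instance `ex:stmt1` is a resource in its own right, provenance or licensing triples could be attached to it in the same way.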

Figure 5: From the structure of a natural language statement to the structure of a reference schema. A) A natural language statement with the predicate has_a_weight. B) The corresponding formalized statement, with the syntactic positions and their associated semantic roles highlighted in color. C) The reference schema for the weight measurement statement, following the Rosetta modeling paradigm.

The same modeling approach can be applied to any simple English statement consisting of a single verb or predicate. We therefore chose this modeling approach as the Rosetta modeling paradigm for reference schemata (see Figure 6).

As each instance of such a statement class represents the statement as a whole, including its verb or predicate, one can use this resource to make statements about that statement, including about (i) the provenance of the statement, such as creator, creation date, curator, imported from, etc., (ii) the UPRI of the reference schema that the statement instantiates, (iii) the copyright license for the statement, (iv) access/reading restrictions for specific user roles and rights for the statement, (v) whether the statement can be edited and by whom, (vi) a specification of the confidence level of the statement, which is very important, especially in the scientific context (53,54), where lack of it can cause problems such as citation distortion (55), (vii) a specification of the time interval for which the statement is valid, and (viii) references as source evidence for the statement, to name some possibilities. In other words, following the Rosetta modeling paradigm, one always gets statements, each represented by its own dedicated resource, so that one can make statements about each of these statements without having to apply RDF reification (56) or RDF-star (57,58), which are feasible for referring to individual triples but not for larger subgraphs such as a measurement datum with a 95% confidence interval (see Fig. 1, middle), for which the latter two approaches are inefficient and complicated to query. Named Graphs seem to be another solution for such larger subgraphs (56), and one could always organize all triples belonging to a statement into their own Named Graph using the UPRI of the statement instance resource as the UPRI of the Named Graph.

Figure 6: From a formalized natural language statement to the corresponding reference schema according to the light version of the Rosetta approach. A) A formalized statement with its syntactic positions and associated semantic roles

Each argument in a given reference schema can be understood as a particular syntactic position, which we model in the schema as a slot, for which we can specify the corresponding semantic role in the form of a constraint specification—either as XML Schema datatype specification for an object-literal, which can be supplemented with a specific pattern or range constraint, or as a Wikidata class specification for a subject or an object-resource, which restricts the type of resources that can be located in a particular slot to that class or any of its subclasses. Corresponding reference schemata can be specified as SHACL shapes, for example. Statements modeled according to the same shape are machine-interpretable statements.
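The slot constraints described above would normally be expressed as SHACL shapes. As a stand-in that needs no RDF tooling, the sketch below checks a statement against a schema written as a plain Python dictionary. The schema layout, the `ex:` classes, the toy `SUBCLASS_OF` and `TYPE_OF` lookups, and the `validate` function are all assumptions made for illustration; a real application would resolve class membership against Wikidata.

```python
import re

# Toy lookups standing in for Wikidata; every "ex:" name is hypothetical.
SUBCLASS_OF = {"ex:GrannySmith": "wikidata:Q89"}
TYPE_OF = {"ex:apple1": "ex:GrannySmith", "wikidata:Q41803": "ex:UnitOfMass"}

def is_subclass(cls, target):
    """True if cls equals target or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

APPLE_WEIGHT_SCHEMA = {
    "SUBJECT": {"kind": "resource", "class": "wikidata:Q89", "required": True},
    "VALUE": {"kind": "literal", "datatype": "xsd:float", "minimum": 0.0,
              "required": True},
    "UNIT": {"kind": "resource", "class": "ex:UnitOfMass", "required": True},
    # Optional adjunct: old statements without it remain valid.
    "TIMESTAMP": {"kind": "literal", "datatype": "xsd:date",
                  "pattern": r"\d{4}-\d{2}-\d{2}", "required": False},
}

def validate(schema, statement):
    """Return a list of constraint violations (empty if the statement conforms)."""
    errors = []
    for slot, spec in schema.items():
        if slot not in statement:
            if spec["required"]:
                errors.append(f"{slot}: missing required value")
            continue
        value = statement[slot]
        if spec["kind"] == "resource":
            if not is_subclass(TYPE_OF.get(value), spec["class"]):
                errors.append(f"{slot}: {value} is not an instance of {spec['class']}")
        elif spec["datatype"] == "xsd:float":
            if not isinstance(value, float):
                errors.append(f"{slot}: expected xsd:float")
            elif value < spec.get("minimum", float("-inf")):
                errors.append(f"{slot}: below minimum")
        elif "pattern" in spec and not re.fullmatch(spec["pattern"], str(value)):
            errors.append(f"{slot}: does not match pattern")
    return errors

ok = {"SUBJECT": "ex:apple1", "VALUE": 212.45, "UNIT": "wikidata:Q41803"}
bad = {"SUBJECT": "ex:apple1", "VALUE": -1.0, "TIMESTAMP": "yesterday"}
print(validate(APPLE_WEIGHT_SCHEMA, ok))
print(validate(APPLE_WEIGHT_SCHEMA, bad))
```

The optional `TIMESTAMP` slot also illustrates why adding adjuncts to a schema does not invalidate statements created before the slot existed.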

A given reference schema can be extended to include more object adjuncts. Adding new object adjuncts to a reference schema is not problematic because any statement instance that was created using an older version of the schema will still be compatible with the new schema, since object adjuncts are only optional objects and thus are not required to comply with a reference schema.

The full version of the Rosetta approach, supporting versioning and the tracking of an editing history

Some knowledge graphs are rapidly evolving, and their content is the result of collaborative or even crowdsourced editing, where any user can edit any statement in the graph, including statements created by other users. For such knowledge graphs in particular, it may be important to be able to track the editing history at the level of individual statements. And if the knowledge graph were to allow its users to cite one or more of its statements, thus making the knowledge graph a valuable resource for scholarly communication, it would also require a versioning mechanism that provides a citable resource for sustaining the cited content. With such a mechanism, the knowledge graph can evolve continuously through user input, while still being citable. The statement versioning mechanism of the full version of the Rosetta approach supports this, and it supports tracking the editing history for each individual statement and each particular object-position. To support this, the Rosetta modeling paradigm must be adapted.

In the modeling paradigm of the full version of the Rosetta approach (Fig. 7), only the subject-resource is directly linked to the statement instance. The object-resources and object-literals are only indirectly linked to the statement instance through instances of corresponding object-position classes. Each reference schema has an object-position class defined for each of its object arguments and adjuncts, so that each particular statement of a given statement type has, in addition to an instance of the statement class, an instance of each object-position class, to which the actual object-resources and object-literals are linked. The number of object-position classes that a given reference schema distinguishes depends on the n-aryness of the underlying statement type. The dependency of object-position classes on their corresponding statement is also documented at the class level: Each statement class links in its class axioms to its corresponding required and optional object-position classes.

Figure 7: Structure of a reference schema following the full version of the Rosetta approach. The reference schema for the statement from Figure 6 A), following the modeling paradigm of the full version of the Rosetta approach. Compared to the

Statements modeled according to the full version of the Rosetta modeling paradigm are downward compatible with those modeled using the light version, and can therefore be transferred to them. The reverse is not possible, as it requires the specification of corresponding object-position classes that are not required in the light version.
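To make the difference between the two versions concrete, the following sketch builds the full-version triples for the apple weight statement, with one object-position instance per object, and then collapses them into the light version. The `ex:` names, the `ex:hasObject` property linking a position instance to its object, and the identifier scheme for position instances are assumptions of this example; the paper does not prescribe them.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    value: object
    datatype: str

POSITION_PROPS = {"ex:requiredObjectPosition", "ex:optionalObjectPosition"}

def full_weight_statement(stmt_id, subject, value, unit):
    """Full version: objects hang off object-position instances."""
    value_pos = f"{stmt_id}/value-1"   # hypothetical position-instance IDs
    unit_pos = f"{stmt_id}/unit-1"
    return [
        (stmt_id, "rdf:type", "ex:WeightMeasurementStatement"),
        (stmt_id, "ex:hasSubject", subject),
        (stmt_id, "ex:requiredObjectPosition", value_pos),
        (value_pos, "rdf:type", "ex:WeightValuePosition"),
        (value_pos, "ex:hasObject", Literal(value, "xsd:float")),
        (stmt_id, "ex:requiredObjectPosition", unit_pos),
        (unit_pos, "rdf:type", "ex:WeightUnitPosition"),
        (unit_pos, "ex:hasObject", unit),
    ]

def to_light(triples):
    """Collapse position instances so objects link directly to the statement."""
    held = {s: o for s, p, o in triples if p == "ex:hasObject"}
    light = []
    for s, p, o in triples:
        if s in held:
            continue  # skip triples that describe a position instance
        if p in POSITION_PROPS and o in held:
            o = held[o]  # replace the position instance by its object
        light.append((s, p, o))
    return light

full = full_weight_statement("ex:stmt1", "ex:apple1", 212.45, "wikidata:Q41803")
for triple in to_light(full):
    print(triple)
```

The conversion only drops information (the position instances and anything attached to them), which is why it works in this direction without further input.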

In addition to the object-resource or object-literal, further information can be associated with and thus documented for each object-position instance, such as provenance metadata about the particular input event (e.g., creator, creation date, imported from). Thus, one can think of an object-position instance of a particular statement as documenting a particular input event associated with that position. One can also specify that a particular logical property of the verb or predicate of the statement applies to a particular object-position by using a corresponding Boolean annotation property (e.g., ‘transitive’) with the object-position instance. In this way, one can, for example, document that the transitivity of a has-part statement applies to the resource specified for the PART object-position.

The mechanism for statement versioning and for tracking the editing history uses these structural changes in the reference schema according to two basic ideas:

1) Soft-delete: no input can be deleted or modified after it has been added to the graph. Each editing step results in the addition of new triples to the graph of a corresponding statement, with the information provided being linked to a corresponding newly added object-position instance. In other words, the object-position instances belonging to a particular statement are not updated, but each editing step adds new instances to the graph. To identify the most recently added object-position instance, a Boolean ‘current’ property is used on the object-position instance and set to ‘true’. For example, if a user updated the VALUE position of a weight measurement statement, the ‘current’ property of the VALUE position instance holding the previously entered value would be set from ‘true’ to ‘false’, and a new VALUE position instance with ‘current’ set to ‘true’ would be added; its information is then displayed when a user navigates to the statement.

Each object-position instance added in this way instantiates the corresponding object-position class, so that a query can identify all instances of the same object-position of a given statement. All previously added object-position instances of the same object-position class are still in the knowledge graph and linked to the corresponding statement instance, and their information could be accessed at any time and sorted by their creation date if required, thus providing a detailed editing history for each object-position of a statement, but also for the statement as a whole. In the case of the weight measurement statement example from above, information about who entered which weight measurement value and when could be presented in the user interface (UI) of a knowledge graph. Each object-position holds this information. This information is valuable in the context of managing collaborative efforts to edit content in a knowledge graph, and also for crowdsourced knowledge graphs to identify who added/edited what and when.

The same approach is applied not only to individual object-positions, but also to individual statements. The respective statement instance also indicates its status through the ‘current’ property. If a user wants to delete a statement, instead of actually deleting the triples associated with the statement, the statement instance is set to ‘false’ via the ‘current’ property. By default, the application does not display information about resources that are set to ‘false’, and in the UI it looks as if the information has been deleted, although it is still available in the knowledge graph and all associated provenance metadata are still accessible (fulfilling principle A2. of FAIR).

2) Versioning: a user can create a version of any statement in the knowledge graph, which then cannot be edited and is therefore persistent over time. Each such version is represented by its own node in the graph. It has its own UPRI, which could take the form of a DOI. Consequently, such versions can be referenced and cited in publications. The version node also tracks metadata through its properties, like any other statement. Each version could, additionally, also have a content identifier (CID) that uses cryptographic hashing for decentralized content addressing. There may be several versions of a given statement. The node representing the latest version is linked to its statement instance by the ‘hasCurrentVersion’ (PAV:hasCurrentVersion) property, while all other version nodes are linked to it by the ‘hasVersion’ (PAV:hasVersion) property. The version nodes of a statement are linked to each other by a chain of ‘previousVersion’ (PAV:previousVersion) properties, starting with the latest version and going back to its respective previous versions.

All object-position instances belonging to the versioned statement are also updated to include the UPRI of the versioned statement as a value for their respective ‘versionID’ property. With this information, it is always possible to access the data belonging to that version by querying all object-position instances with the UPRI as a value for this property. The ‘versionID’ property can be applied multiple times to the same object-position instance to track more than one version UPRI. This is necessary, because not all object-positions of a given statement may have been edited between two versions, and thus a particular object-position instance may belong to more than one version.
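The two ideas above, soft-delete and versioning, can be sketched with a small in-memory model. This is not the code of the prototype mentioned in the paper; the class and method names are hypothetical, an increasing counter stands in for the creation date, and version UPRIs are plain strings. The `current` flag and the repeatable `versionID` values follow the description in the text.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class PositionInstance:
    position_class: str          # e.g. "VALUE" or "UNIT"
    value: object
    creator: str
    seq: int                     # stand-in for a creation timestamp
    current: bool = True
    version_ids: list = field(default_factory=list)  # 'versionID' values

class StatementRecord:
    def __init__(self, stmt_id):
        self.stmt_id = stmt_id
        self.current = True      # set to False on soft-delete
        self.positions = []      # never shrinks: full editing history
        self.versions = []       # version UPRIs, newest last
        self._seq = count(1)

    def set_value(self, position_class, value, creator):
        """Add a new position instance and flag the old one as not current."""
        for p in self.positions:
            if p.position_class == position_class and p.current:
                p.current = False
        self.positions.append(
            PositionInstance(position_class, value, creator, next(self._seq)))

    def current_values(self):
        return {p.position_class: p.value for p in self.positions if p.current}

    def history(self, position_class):
        return [(p.seq, p.creator, p.value) for p in self.positions
                if p.position_class == position_class]

    def soft_delete(self):
        self.current = False     # triples stay in the graph, hidden in the UI

    def create_version(self, version_id):
        """Tag all currently valid position instances with the version UPRI."""
        for p in self.positions:
            if p.current:
                p.version_ids.append(version_id)
        self.versions.append(version_id)
        return version_id

    def values_at(self, version_id):
        return {p.position_class: p.value for p in self.positions
                if version_id in p.version_ids}

stmt = StatementRecord("ex:stmt1")
stmt.set_value("VALUE", 212.45, "alice")
stmt.set_value("UNIT", "wikidata:Q41803", "alice")
v1 = stmt.create_version("ex:stmt1-v1")
stmt.set_value("VALUE", 213.0, "bob")
v2 = stmt.create_version("ex:stmt1-v2")

print(stmt.values_at(v1))
print(stmt.values_at(v2))
print(stmt.history("VALUE"))
```

In this run the UNIT position instance is never edited, so it carries both version identifiers, which is the case the text describes for object-positions that remain unchanged between two versions.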

Both aspects, the soft-delete and the versioning, have been demonstrated in a small Python-based prototype of a FAIR scholarly knowledge graph application that employs semantic units and knowledge graph building blocks and that uses Neo4j as the persistence-layer technology (62). It includes versioning of semantic units and automatic tracking of their editing history and provenance. The prototype is available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.

The Rosetta Framework primarily models statements rather than some human-independent reality

It is important to note that the Rosetta Framework does not claim to model and represent a human-independent reality, as other approaches to semantic modeling attempt to do, such as the Open Biological and Biomedical Ontology (OBO) Foundry, with the Basic Formal Ontology (BFO) (51) as its top-level ontology. Instead, it follows a pragmatic approach with a focus on the efficient and reliable communication of information of all kinds between humans and machines and across machines, including but not restricted to (meta)data statements. For the time being, the Framework is limited to terms and statements as meaning-carrying units of information, but can be extended to larger units in the future (see also the concept of semantic units to which it can be easily adapted (50)).

As each statement is represented in the graph with its own resource and as an instance of a particular statement class, it is straightforward to make statements about these statements. This includes, in addition to the classification of statements based on their verb or predicate already discussed, the ability to further classify any given statement as an instance of other types of classes, allowing sophisticated classification of statements across multiple independent classification criteria.

For example, one can distinguish between different types of statements regarding their truth function or reference. When we talk about the referent of a term that is a proper name, we are referring to a particular entity that is known by that name, and when the term is a kind term, we are referring to the set of entities that are instances and thus referents of the respective class. With statements, we can do something similar with respect to the subject of the statement, and we can distinguish the following types of statements (for a discussion, see also (50)): (1) The referent of an assertional statement is the particular known individual entity (i.e., named individual) that takes the subject-position of the statement. Assertional statements are taken to be true for that particular individual entity. Instance graphs and ABox expressions are examples of assertional statements (60). (2) The referent of a contingent statement is some instance of a class which takes the subject-position of the statement. Contingent statements are taken to be true for at least one instance of that class. (3) The referents of a prototypical statement are most instances of a particular class, indicating that the statement is true for most, but not necessarily all, instances of the class. As such, prototypical statements are a special kind of contingent statement. (4) The referents of a universal statement are all instances of a class. A universal statement is necessarily true for every instance of that class. Class axioms of ontology classes, and thus TBox expressions, are examples of universal statements (60).

Another possibility is to classify a given statement instance as an instance of the ‘negation’ class, indicating that its semantic content is negated. This can be used to express typical negations, but also absence statements, which are often needed when describing specific objects, situations, or events. In OWL, following the Open World assumption, modeling negations, including absences such as ‘This head has no antenna’, requires the specification of appropriate class axioms and blank nodes. By classifying statement resources as instances of a class ‘negation’, one can even model assertional, contingent, and prototypical statements with negations without having to model them as TBox expressions. The same applies to statements with cardinality constraints (e.g., ‘This head has exactly 3 eyes’). Modeling negations and cardinality constraints in this way is much simpler than modeling them as TBox expressions, and would thus increase their cognitive interoperability (see (50) for more details).

Many more classification criteria can be applied, leading to a sophisticated classification of different types of statements based on their meaning, their epistemic value and function, their referents, and their context (e.g., assumptions, ontological definitions, epistemological recognition criteria, questions, answers, logical arguments, commands/instructions, descriptions, hypotheses, knowledge, known facts).

A Rosetta Editor for creating reference schemata

The use of the light and full versions of the Rosetta modeling paradigm results in reference schemata that are easily understood by humans because the modeling paradigm reflects the structure of natural languages such as English. In addition, specifying reference schemata for new types of statements is not as demanding when following this modeling paradigm. Not only developers, but also researchers (and anyone else) who want to use knowledge graph applications and who are experts in domains other than semantics and knowledge modeling in particular, or computer science in general, will be able to specify new reference schemata if supported by an intuitively usable low-code Rosetta Editor. With the Editor, the need to develop semantic models to establish FAIR (meta)data within the RDF framework (i.e., the semantification burden) will no longer be a barrier, and we expect that related tools will be easier to adapt, increasing the overall cognitive interoperability of the RDF framework for developers and users alike.

When specifying a new reference schema, users of the Editor will be guided through a list of questions, the answers to which allow the Editor to create the reference schema and its associated statement class (and object-position classes), without having to make any semantic modeling decisions:

1. Provide some example statements.

Example statements help users understand what type of statement the reference schema models and which statements it can cover. In addition, Large Language Models can use the example statements to suggest answers to the following questions, which the user then only needs to check for correctness.

2. What is the predicate or verb of the statement?

The answer will be used to create the statement class label.

3. Give a description or characterization of the type of statement you want to add. What kind of statement will it cover?

Provides a human-readable definition of the newly created statement class. It could also take the form of a definition of the predicate of the statement.

4. Indicate the number of object-positions the statement should cover.

5. Give the subject-position and each object-position a short and meaningful label.

Each label acts as a thematic label, characterizing the semantic role associated with each position (cf. (45)). It is used for various purposes, including as a placeholder label for input fields and as a placeholder label for variables for the textual and graphical display templates (see below). It also provides the label for the corresponding object-position class for the full version.

6. Specify which of these object-positions are required (i.e., arguments) to form a semantically meaningful statement with the subject and the predicate, with all other object-positions being optional (i.e., adjuncts).

The ‘requiredObjectPosition’ property is used to specify which object-positions the statement instance will bind to, while the ‘optionalObjectPosition’ property is used to bind to all other object-positions. In the Editor UI, this could be done using checkboxes.

7. For each object-position, provide a brief description of the type of objects covered by the object-position and give some typical examples.

This is the human-readable definition for the corresponding object-position class. This text can also be used as a tooltip for the corresponding input field for adding or editing corresponding statements.

8. For each object-position, decide whether the object must be represented in the form of a resource or a literal. In the case of a resource, choose a Wikidata class that best specifies the type of entity that is allowed for the object-position, and in the case of a literal, choose the datatype and define datatype-specific constraints if necessary.

This provides input constraints for each object-position, which are documented in the shape specification of the reference schema.

9. Decide whether any logical properties apply to the predicate of the statement (e.g., transitivity) and, if so, which object-position is affected by them.

This step is optional, but it is required if you want to be able to automatically translate the statement graph resulting from instantiating a reference schema into an OWL graph over which reasoning can be applied. The Editor allows this information to be added to the relevant object-position class via an appropriate Boolean annotation property (e.g., ‘transitive’).

10. Write a human-readable statement using the thematic labels for the subject, the different object-positions, and the predicate.

Provides the dynamic label pattern for this type of statement (see display templates below).

The Rosetta Editor will process the input and create a reference schema along with a statement class and, if the full version of the Rosetta approach is followed, any required object-position classes. The schema itself could be specified in a YAML file, following the notation of LinkML, which can be translated into, e.g., SHACL shapes. The schema specifies slots/attributes with constraints that can be used by the frontend and a knowledge graph application for validation and input control purposes. Constraints include specifying datatypes as ranges such as ‘xsd:float’, max-min, or pattern constraints, and class constraints for resources. Slots belonging to required object-positions would be declared as ‘required: true’, indicating that this slot must have a value. LinkML is a general-purpose, platform-agnostic, object-oriented data modeling language and framework that aims to bring Semantic Web standards to the masses and that makes it easy to store, validate, and distribute FAIR data (61). It can be used with knowledge graph applications to schematize a wide variety of data types, from simple flat checklist standards to complex interrelated normalized data using inheritance. LinkML fits nicely into frameworks common to most developers and database engineers, such as JSON files, relational databases, document stores, and Python object models, while providing a solid semantic foundation by mapping all elements to RDF U(P)RIs.
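How the answers to the Editor's questions could be turned into a schema can be sketched as follows. The output is a plain Python dictionary whose layout is only loosely modeled on a LinkML class definition (a class with attributes that carry `range`, `required`, and value constraints); we have not validated it against the LinkML metamodel, and the `label_pattern` key, the structure of `answers`, and both function names are assumptions made for this example.

```python
import json

def build_schema(answers):
    """Assemble a schema dictionary from the answers given in the Editor."""
    required = set(answers["required_positions"])
    subject = answers["subject"]
    attributes = {
        subject["label"]: {
            "description": subject["description"],
            "range": subject["range"],
            "required": True,          # every statement has a subject
        }
    }
    for pos in answers["object_positions"]:
        slot = {
            "description": pos["description"],
            "range": pos["range"],     # Wikidata class or XSD datatype
            "required": pos["label"] in required,
        }
        slot.update(pos.get("constraints", {}))
        attributes[pos["label"]] = slot
    class_name = f"{answers['predicate']} statement"
    return {"classes": {class_name: {
        "description": answers["description"],
        "attributes": attributes,
        "label_pattern": answers["label_pattern"],  # hypothetical key
    }}}

def render_label(pattern, values):
    """Fill the dynamic label pattern with the thematic-label values."""
    return pattern.format(**values)

answers = {
    "predicate": "weight measurement",
    "description": "States the weight of a material object.",
    "subject": {"label": "OBJECT", "range": "wikidata:Q89",
                "description": "The object that was weighed."},
    "object_positions": [
        {"label": "VALUE", "range": "xsd:float",
         "description": "The measured value.",
         "constraints": {"minimum_value": 0}},
        {"label": "UNIT", "range": "ex:UnitOfMass",
         "description": "The unit of the measured value."},
        {"label": "TIMESTAMP", "range": "xsd:date",
         "description": "When the measurement was taken."},
    ],
    "required_positions": ["VALUE", "UNIT"],
    "label_pattern": "{OBJECT} has a weight of {VALUE} {UNIT}",
}

schema = build_schema(answers)
print(json.dumps(schema, indent=2))
print(render_label(answers["label_pattern"],
                   {"OBJECT": "This apple", "VALUE": 212.45, "UNIT": "grams"}))
```

The user only supplies labels, descriptions, and constraints; every structural decision (one class per statement type, one attribute per position) is fixed by the builder, which is the sense in which the semantic modeling step is automated.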

LinkML also comes with several tools. Generators provide automatic translation from the YAML schema into various formats, including JSON schema, JSON-LD/RDF, SPARQL, OWL, SQL DDL, ShEx, GraphQL, Python data classes, Markdown, and UML diagrams. Loaders and dumpers convert instances of a schema between these formats. Generators thus allow integration with tools provided by other technical stacks. The SPARQL generator allows a set of SPARQL queries to be generated from a schema, and the Excel Spreadsheet Generator allows a spreadsheet representation of a schema to be generated. LinkML currently supports four different data validation strategies: 1) validation via Python object instantiation; 2) validation via JSON schema; 3) validation of triples in a triple store or RDF file via generation of SPARQL constraints; or 4) validation of RDF via generation of ShEx or SHACL.