16.17.16. Fast Approximate RDF Graph Diff and Patch
Two algorithms described below resemble "unified diff" and "patch by unified diff" but they work on RDF graphs, not on plain texts.
They work reasonably for graphs composed from CBDs (concise bounded descriptions) of some subjects, if these subjects are either "named" IRIs or can be identified by values of their inverse functional properties.
Many sorts of commonly used graphs match these restrictions, including all graphs without blank nodes, most of FOAF files, graphs that can be "pretty-printed" in JSON, most of dumps of relational databases etc.
The basic idea is as simple as zipper:
-
Place one graph at the left and one to the right,
-
Find a retainer box at the right and a matching pin at the left,
-
Join them
-
Pull the slider as long as possible.
-
Repeat this while there are pins and boxes that can be matched and sliders that can be moved.
An IRI in left graph (say, G1)
matches to same IRI in right graph (G2)
as pin to box. The same is true for literals too.
Functional and inverse functional properties are teeth that form chains, algorithm "moves sliders" along these chains, incrementally connecting more and more nodes.
If there is a match of this sort (O1 in G1 matches O2 in G2)
and the matched nodes
are values of same inverse functional property P
(there are { S1 P O1 }
in
G1
and { S2 P O2 }
in G2
) then we guess that S1
matches S2
.
If S1
in G1
matches S2
in G2
and the matched
nodes are subjects of same functional property P
( there are { S1 P N1 }
in
G1
and { S2 P N2 }
in G2
) then we guess that N1
matches N2
, now it's possible to try same interaction on triples where N1
and
N2
are in subject position, that's how slides move. A typical example of a long zipper
is closed list with matched heads.
Collect Functional And Inverse Functional Properties
Lists of functional properties can be retrieved from an ontology graph by query like:
SPARQL define output:valmode "LONG" SELECT (<LONG::sql:VECTOR_AGG(?s)) FROM <my-ontology-graph> WHERE { ?s a owl:functionalProperty }
Inverse functional properties could be retrieved by a similar query, but unfortunately the ontology may mention so called NULL values that can be property values for many subjects. Current implementation of diff and patch does not recognize NULL values so they can cause patch with "false alarm" errors. The workaround is to retrieve only properties that have no NULL values declared:
SPARQL define output:valmode "LONG" SELECT (<LONG::sql:VECTOR_AGG(?s)) FROM <my-ontology-graph> WHERE { ?s a owl:inverseFunctionalProperty . OPTIONAL { ?s owl:nullIFPValue ?v } FILTER (!Bound(?v)) }
If no ontology is available then appropriate predicates can be obtained from sample graphs using DB.DBA.RDF_GRAPH_COLLECT_FP_LIST .
Implementation-Specific Extensions of GUO Ontology
Note : This section contains implementation details that are needed only if you want to write your own patch or diff procedure, you don't have to worry about internals if you want to use existing procedures.
Basic GUO ontology is not expressive enough to work with blank nodes, so some custom extensions $ are needed.
In the rest of the description:
@prefix guo: <http://webr3.org/owl/guo#>
is assumed.
The diff contains one node of rdf:type guo:diff
.
For debugging purpose it has properties guo:graph1
and guo:graph2
that
corespond to gfrom
and gto
arguments of DB.DBA.RDF_SUO_DIFF_TTL
.
The diff also contains zero or more nodes of rdf:type guo:UpdateInstruction
. These
nodes are as described in basic GUO ontology, but guo:target_graph
is now optional,
guo:target_subject
can be a blank node and objects of predicates "inside" values of
guo:insert
and guo:delete
can also be blank nodes. These blank nodes are
"placeholders" for values, calculated according to the most important GUO extension - rule nodes.
There are eight sorts of rule nodes, four for gfrom
side of diff and four similar for
gto
side. Out of four sorts related to one side, two are for functional properties and
two similar are for inverse functional properties. Thus rdf:type-s
of these nodes are:
guo:from-rule-FP0, guo:from-rule-FP1, guo:from-rule-IFP0, guo:from-rule-IFP1
and
guo:to-rule-FP0, guo:to-rule-FP1, guo:to-rule-IFP , guo:to-rule-IFP1 .
Each rule node has property guo:order
that is an non-negative integer.
These integers enumerate all guo:from-rule-
... nodes, starting from zero.
When patch procedure works, these rules are used in this order, the result of each rule is a blank node that either exists in the graph or just created.
All results are remembered for use in the rest of the patch procedure.
Similarly, other sequence of these integers enumerate all guo:to-rule-
... nodes,
also starting from zero.
Consider a sequence of guo:from-rule-
... nodes, because guo:to-rule-
nodes have identical properties.
A rule node can have zero or more values of guo:dep
property, each value is a
bnode that is rule node that should be calculated before the current one.
Every rule has exactly one predicate guo:path
that is a blank node. Each property
of this blank node describes one possible "move of slider": predicate to follow is in predicate
position and a node to start from is in object position. An IRI or a literal in object position is
used as is, a blank node in object position should be of type guo:from-rule-
... and
have smaller guo:order
so it refers to already calculated result bnode of some
preceding rule.
Rule of form:
R a guo:from-rule-IFP1 ; guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
searches for a unique blank node _:Rres
that is a common subject of triples:
_:Rres P1 O1 _:Rres P2 O2 . . . _:Rres Pn On
in the gfrom graph.
If subjects differ in these triples or some triples are not found or the subject is not a
blank node then an appropriate error is logged and rule fails, otherwise _:Rres
is remembered as the result of the rule.
Similarly, rule of form:
R a guo:from-rule-FP1 ; guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
searches for a unique blank node _:Rres
that is a common object of triples:
O1 P1 _:Rres O2 P2 _:Rres . . . On Pn _:Rres
in the gfrom graph.
Rule of form:
R a guo:from-rule-IFP0 ; guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
ensures that the gfrom
graph does not contain any triple like:
_:Rres P1 O1 _:Rres P2 O2
or
_:Rres Pn On
It is an error if something exists. If nothing found then the result of the rule is newly created unique blank node. That's how patch procedure creates new blank nodes when it inserts "totally new" data.
Similarly, rule of form:
R a guo:from-rule-IFP0 ; guo:path [ P1 O1 ; P2 O2 ; ... ; Pn On ] .
ensures that the gfrom
graph does not contain any triple like:
O1 P1 _:Rres O2 P2 _:Rres
or
On Pn _:Rres
Current version of patch procedure does not use rules guo:to-rule-
... ,
however they can be used by custom procedure of few sorts. First, these rules can be used to
produce a "reversed diff". Next, these rules can be used to validate the result of the patch -
if the patch can not be reverted then the result is "suspicious".