16.17.2. RDF Index Scheme
Starting with version 6.00.3126, the default RDF index scheme consists of 2 full indices over RDF quads plus 3 partial indices. This index scheme is generally adapted to all kinds of workloads, regardless of whether queries generally specify a graph. As indicated the default index scheme in Virtuoso is almost always applicable as is, whether one has a RDF database with very large numbers of small graphs or just one or a few large graphs. With Virtuoso 7 the indices are column-wise by default, which results in them to consuming usually about 1/3 of the space the equivalent row-wise structures would consume.
Alternate indexing schemes are possible but will not be generally needed. For upgrading old databases with a different index scheme see the corresponding documentation.
The index scheme consists of the following indices:
-
PSOG
- primary key
-
POGS
- bitmap index for lookups on object value.
-
SP
- partial index for cases where only S is specified.
-
OP
- partial index for cases where only O is specified.
-
GS
- partial index for cases where only G is specified.
This index scheme is created by the following statements:
CREATE TABLE DB.DBA.RDF_QUAD ( G IRI_ID_8, S IRI_ID_8, P IRI_ID_8, O ANY, PRIMARY KEY (P, S, O, G) ) ALTER INDEX RDF_QUAD ON DB.DBA.RDF_QUAD PARTITION (S INT (0hexffff00)); CREATE DISTINCT NO PRIMARY KEY REF BITMAP INDEX RDF_QUAD_SP ON RDF_QUAD (S, P) PARTITION (S INT (0hexffff00)); CREATE BITMAP INDEX RDF_QUAD_POGS ON RDF_QUAD (P, O, G, S) PARTITION (O VARCHAR (-1, 0hexffff)); CREATE DISTINCT NO PRIMARY KEY REF BITMAP INDEX RDF_QUAD_GS ON RDF_QUAD (G, S) PARTITION (S INT (0hexffff00)); CREATE DISTINCT NO PRIMARY KEY REF INDEX RDF_QUAD_OP ON RDF_QUAD (O, P) PARTITION (O VARCHAR (-1, 0hexffff));
The idea is to favor queries where the predicate is specified in triple patterns. The entire quad can
be efficiently accessed when P
and at least one of S
and O
are known.
This has the advantage of clustering data by the predicate which improves working set. A page read from disk
will only have entries pertaining to the same predicate; chances of accessing other entries of the page are
thus higher than if the page held values for arbitrary predicates. For less frequent cases where only
S
is known, as in DESCRIBE
, the distinct P
s of the S
are
found in the SP
index. These SP
pairs are then used for accessing the PSOG
index to get the O
and G
. For cases where only the G
is known, as when
dropping a graph, the distinct S
s of the G
are found in the GS
index.
The P
s of the S
are then found in the SP
index. After this, the whole
quad is found in the PSOG
index.
The SP
, OP
, and GS
indices do not store duplicates. If an
S
has many values of the P
, there is only one entry. Entries are not deleted from
SP
, OP
, or GS
. This does not lead to erroneous results since a full
index (that is, either POSG
or PSOG
) is always consulted in order to know if a quad
actually exists. When updating data, most often a graph is entirely dropped and a substantially similar graph
inserted in its place. The SP
, OP
, and GS
indices get to stay
relatively unaffected.
Still, over time, especially if there are frequent updates and values do not repeat between consecutive
states, the SP
, OP
, and GS
indices will get polluted, which may affect
performance. Dropping and recreating the index will remedy this situation.
In cases where this is not practical, the index scheme should only have full indices; i.e., each key
holds all columns of the primary key of the quad. This will be the case if the
DISTINCT NO PRIMARY KEY REF
options are not specified in the CREATE INDEX
statement.
In such cases, all indices remain in strict sync across deletes.
Many RDF workloads have bulk-load and read-intensive access patterns with few deletes. The default index scheme is optimized for these. With these situations, this scheme offers significant space savings, resulting in better working set. Typically, this layout takes 60-70% of the space of a layout with 4 full indices.