15.1. Overview

SPARQL support in Virtuoso consists of two "layers": RDF support in the SQL engine, and a SPARQL front-end compiler that translates SPARQL queries to SQL.

At its core, Virtuoso extends relational storage and the SQL language with the datatypes and language constructs required to handle RDF data like any "traditional" SQL datatype. There are datatypes for RDF references (IRIs and blank nodes) and for RDF literals with types and languages. These datatypes are widely supported by built-in functions, and there are convenient conversions between RDF literals and other SQL datatypes.

The SQL query language is extended with constructs that are convenient for the "very heterogeneous" data of an RDF graph. In traditional SQL queries, the data types of retrieved values are mostly known in advance from the database schema, via explicitly declared column types and the return types of functions. An RDF graph contains a mix of literals of all types, so cast errors are very frequent, and a typical query should not terminate after an occasional type error in a huge data set.
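The effect of tolerating cast errors can be illustrated with a small Python sketch. It is not Virtuoso code: the helper name `tolerant_int` and the choice to return `None` (as a stand-in for an SQL NULL) on a failed cast are assumptions made for illustration only.

```python
from typing import Any, List, Optional


def tolerant_int(value: Any) -> Optional[int]:
    """Cast a value to int; return None instead of raising on failure.

    Hypothetical helper: None stands in for an SQL NULL here.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Object values of one predicate, as they might appear in a heterogeneous graph.
objects: List[Any] = [42, "17", "unknown", 3.9, None]

# A strict int() would abort the whole scan at "unknown";
# the tolerant cast marks the bad values and keeps going.
ages = [tolerant_int(o) for o in objects]
valid_ages = [a for a in ages if a is not None]

print(ages)        # [42, 17, None, 3, None]
print(valid_ages)  # [42, 17, 3]
```

The scan completes even though two of the five values cannot be cast, which is the behaviour a query over mixed RDF literals needs.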

The SQL "user-defined aggregate" feature is used to support SPARQL DESCRIBE and CONSTRUCT statements, so many RDF triples about a subject can be converted into a single "dictionary of triples" or written into a single RDF/XML or Turtle document.

The SQL query language is also extended with the BREAKUP construct, which is somewhat "inverse" to aggregate functions. While aggregate functions convert a large number of table rows into one or a few rows of aggregated data, BREAKUP turns each "wide" and complete row of a relational table into many "narrow" rows of an RDF result set.
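The idea of breaking one wide row into many narrow ones can be sketched in Python as follows. This shows the concept only and does not reproduce the BREAKUP syntax. The function `break_up`, the `person:` and `schema:` prefixes, and the sample row are all hypothetical, as is the choice to skip NULL columns.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, object]


def break_up(row: Dict[str, object], key: str) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) row per non-key column."""
    subject = f"person:{row[key]}"
    for column, value in row.items():
        # The key column becomes the subject; NULL columns yield no triple.
        if column == key or value is None:
            continue
        yield (subject, f"schema:{column}", value)


# One "wide" relational row ...
row = {"id": 7, "name": "Alice", "city": "Oslo", "phone": None}

# ... becomes several "narrow" rows, one per populated column.
triples = list(break_up(row, key="id"))
for t in triples:
    print(t)
# ('person:7', 'schema:name', 'Alice')
# ('person:7', 'schema:city', 'Oslo')
```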

The listed features form a solid foundation for the second part of the implementation: a preprocessor that is called as soon as the input of the SQL compiler contains the SPARQL keyword. The preprocessor expects the well-parenthesized fragment after the SPARQL keyword to be a SPARQL query (or SPARUL statement). The SPARQL front-end compiler creates the text of an SQL SELECT statement that replaces the original SPARQL fragment, and the SQL compiler continues to read its input without seeing any difference between the replaced part and any other SQL SELECT. As a result, one can not only send SPARQL queries over all supported protocols, such as ODBC and JDBC, but also place them inside Virtuoso/PL stored procedures or use them inside SQL SELECT statements wherever the syntax allows subqueries.

RDF data can be added to the storage by parsing the text of RDF documents (functions DB.DBA.RDF_LOAD_RDFXML, DB.DBA.RDF_LOAD_RDFXML_MT, DB.DBA.TTLP, DB.DBA.TTLP_MT), by loading remote RDF resources with the SPARUL LOAD statement, by extracting and storing metadata while crawling non-RDF resources, and by many other RDF insert methods. After bulk loading, RDF data can be edited using the SPARUL update language.

The most important feature of Virtuoso SPARQL is that relational data can stay unchanged, without being loaded into RDF storage, and still be accessed by SPARQL queries once appropriate RDF Views have been created. This makes RDF Views the best tool for adding RDF capabilities to existing database applications.

Virtuoso is a "quad store", not a "triple store". Instead of storing the subject-predicate-object triples of each individual graph in a separate storage (one that must be explicitly created before first use), Virtuoso stores graph-subject-predicate-object quads in one common table (DB.DBA.RDF_QUAD). This simplifies querying when the "interesting" graphs are not known in advance. For example, while a SPARQL query is running, Virtuoso Sponger can iteratively download additional data in order to provide as complete an answer as possible.
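The benefit of a single quad table can be illustrated with a toy model built on Python's standard `sqlite3` module. This is a sketch under stated assumptions, not the real layout of DB.DBA.RDF_QUAD: the table name `rdf_quad`, the plain-text columns, and the sample IRIs are all made up for the example.

```python
import sqlite3

# Toy quad table: every graph shares one table, distinguished by column g.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rdf_quad (g TEXT, s TEXT, p TEXT, o TEXT)")

quads = [
    ("http://example.com/graph/hr", "ex:alice", "ex:worksFor", "ex:acme"),
    ("http://example.com/graph/crm", "ex:alice", "ex:email", "alice@example.com"),
    ("http://example.com/graph/crm", "ex:bob", "ex:email", "bob@example.com"),
]
conn.executemany("INSERT INTO rdf_quad VALUES (?, ?, ?, ?)", quads)

# Everything known about one subject, without naming any graph in advance.
rows = conn.execute(
    "SELECT g, p, o FROM rdf_quad WHERE s = ? ORDER BY g",
    ("ex:alice",),
).fetchall()

for g, p, o in rows:
    print(g, p, o)
```

With one storage per graph, the same question would require knowing, and querying, every graph's storage separately. Here the graph is just another column that can be filtered on or left unconstrained.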

Quads stored in DB.DBA.RDF_QUAD are usually referred to as "physical quads", as opposed to "mapped quads", which are not stored in any table at all but are still accessible to SPARQL queries via RDF views. The SPARQL processor sees no great difference between querying physical and mapped quads, so a query may operate on a mix of data of all sorts. Thus, an application can easily provide both access to relational data for SPARQL clients (HTTP and WSDL via the SPARQL protocol endpoint) and access to RDF data for traditional RDBMS clients (by passing SPARQL queries via ODBC, JDBC and the like).
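The distinction between physical and mapped quads can be sketched in Python as a stored list plus a generator computed from relational rows. This is a conceptual model only and says nothing about how RDF views are actually compiled. All names (`physical_quads`, `mapped_quads`, `match`, the `employees` rows and the `g:`/`ex:` prefixes) are hypothetical.

```python
from itertools import chain
from typing import Iterator, List, Optional, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)

# Physical quads: actually stored.
physical_quads: List[Quad] = [("g:web", "ex:alice", "ex:knows", "ex:bob")]

# Relational data that stays where it is.
employees = [{"id": "alice", "dept": "sales"}, {"id": "bob", "dept": "ops"}]


def mapped_quads() -> Iterator[Quad]:
    """Mapped quads: computed on demand from relational rows, never stored."""
    for row in employees:
        yield ("g:hr", f"ex:{row['id']}", "ex:dept", row["dept"])


def match(s: Optional[str] = None, p: Optional[str] = None) -> Iterator[Quad]:
    """Match a pattern against both kinds of quads through one interface."""
    for quad in chain(physical_quads, mapped_quads()):
        if s is not None and quad[1] != s:
            continue
        if p is not None and quad[2] != p:
            continue
        yield quad


about_alice = list(match(s="ex:alice"))
print(about_alice)
# [('g:web', 'ex:alice', 'ex:knows', 'ex:bob'), ('g:hr', 'ex:alice', 'ex:dept', 'sales')]
```

The caller of `match` cannot tell which results were stored and which were derived from the `employees` rows, which mirrors the point that a query may mix both sources.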

RDF is for data integration. This implies the need to resolve both inconsistencies in RDF data that comes from various sources of different nature, and problems with queries that work fine on sample data from a local data warehouse but fail on massive and/or unexpected data from third-party sources. RDF inference helps to support a variety of synonyms when different data sources use different names for the same classes, properties and subjects. Queries can be debugged using all the methods that are traditional for SQL (what is executed is SQL), while paying some attention to RDF-specific performance tuning.

The whole purpose of SPARQL is to let application developers write queries faster than they could write the equivalent SQL code. A short SPARQL query on a mix of physical quads and RDF views of a number of applications may sometimes replace pages of SQL text. On the other hand, SQL has more expressive power than SPARQL, so complicated business intelligence queries should be written in a mixed style, as an SQL query with SPARQL subqueries. Debugging mixed-style queries is inconvenient and reduces the overall benefit in development time, so Business Intelligence Extensions for SPARQL can be used to greatly reduce the need for wrapping SQL around SPARQL. Further, SPARQL protocol clients get access to this functionality without needing SQL privileges. Full-text search in SPARQL eliminates one more reason for mixing SQL and SPARQL. In addition, extending SPARQL lets SPARQL clients delegate more business intelligence calculations to the server without any changes in the infrastructure; this is especially important for web applications that use AJAX tools (say, OAT, the OpenLink AJAX Toolkit).