A short SPARQL query can be compiled into a long SQL statement, especially if data comes from many quad map petterns. Middle-size application with 50 tables and 10 columns per table may create thousand of quad map patterns for subjects of hundred different types. An attempt to "select everything" from RDF view of that complexity may easily create 5000 lines of SQL code. Thus it is known in advance that some queries will be rejected even if same queries work fine on default storage of "physical quads".
In addition, an SQL compiler signals an error if table or column name is unknown, efficiently catching typos. SPARQL uses IRIs that are long and sometimes unreadable, but there is no "closed world" schema of the data so a typo in an IRI is not an error, it is simply some other IRI. So typo in a IRI or in a namespace prefix cause missing bindings of some triple patterns of the query and incomplete result, but no errors are usually reported. Typo in graph or predicate IRI may cause the SPARQL compiler to generate code that access deafult storage instead of relational source or to generate empty code that accesses nothing.
The SQL compiler does not signal casting errors when it runs the statement generated from SPARQL, because the generated SQL code contains option (QUIETCAST). This means that mismatches between expected and actual datatypes of values stay invisible and may cause rounding errors (say, integer division instead of floating-point) and even empty joins (due to join conditions that silently return NULL instead of return comparison error).
In two words, SPARQL queries are so laconic that there is no room for details that let the compiler distinguish between intention and a bug. This masquerades query complexity, misuse of names and mismatches in types. One may make debugging easier by making queries longer.
Two most helpful debugging tools are automatic void variable recognition and plain old code inspection. "Automatic" means "cheap" so the very first step of debugging is to ensure that every triple pattern of the query may in principle return something. This helps in finding typos when the query gets data from RDF views, this also helps when a query tries to join two disjoint sorts of subjects. If define sql:signal-void-variables 1 directive is placed at the preambl e of the SPARQL query, compiler will signal an error if if finds any triple pattern that can not bind variables or any variable that is proved to be always unbound. This is especially useful when data are supposed to come from option (exclusive) or option (soft exclusive) quad map. Without one of that options SPARQL compiler will usually bind variables using "physical quad"; the table of physical quads may contain any rows that match any given triple pattern; thus many errors will remain undiscovered. If the name of quad map pattern is known then it is possible to force the SPARQL compiler to use only that quad map for a whole query or for a part. This is possible by using the following syntax:
QUAD MAP quad-map-name { group-pattern }
If some triple pattern inside group-pattern can not be bound using quad-map-name or one of its descendants then define sql:signal-void-variables 1 will force the compiler to signal the error.
It is technically possible to use QUAD MAP to improve the performance of query that tries to access redundant RDF Views but it is much better to achieve the same effect by providing a more restrictive query or by changing/extending the RDF View. If an application needs this trick then interoperable SPARQL clients will experience problems -- they can not use Virtuoso-specific extensions.
If the automated query checking gives nothing, function sparql_to_sql_text can be used in order to get the SQL text generated from the given query. Its only argument is the text of SPARQL query to compile, without leading SPARQL keyword and semicolon at the end; the returned value is the SQL text. The output may be long but that is the most authoritative source of diagnostic data.
When called from ISQL or other ODBC client, return value of sparql_to_sql_text may be transferred as a BLOB so ISQL needs "set blobs on" instruction to avoid data truncation. Even better, the SQL text can be saved into a file:
string_to_file ('debug.sql', sparql_to_sql_text ('select * where { graph ?g { ?s a ?type }}'), -2);
(-2 is to overwrite the previous version of the file, as there will be probably many runs of the statement).
It is inconvenient to edit query text in order to replace every single quote with two single quotes to make it string constant for sparql_to_sql_text. It is much more convenient to use double quotes in SPARQL queries and replace nothing.
As an example, let's find out why the query
sparql
prefix northwind: <http://demo.openlinksw.com/schemas/northwind#>
select distinct ?emp
from <http://myhost.example.com/Northwind>
where {
?order1 northwind:has_salesrep ?emp ; northwind:shipCountry ?country1 .
?order2 northwind:has_salesrep ?emp ; northwind:shipCountry ?country2 .
filter (?country1 != ?country2) }
is much slower than similar SQL statement. The call of sparql_to_sql_text returns equivalent SQL statement:
SELECT DISTINCT sprintf_iri ( 'http://myhost.example.com/Northwind/Employee/%U%U%d#this' ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~0" /* emp */ /*]retval*/ ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~1" /*]retval*/ ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~2" /*]retval*/ ) AS /*tmpl*/ "emp"
FROM (SELECT "s-6-1-t0-int~orders"."OrderID" AS /*tmpl*/ "20ffecc~OrderID",
"s-6-1-t0-int~employees"."FirstName" AS /*as-name-N*/ "b067b7d~FirstName~0",
"s-6-1-t0-int~employees"."LastName" AS /*as-name-N*/ "b067b7d~FirstName~1",
"s-6-1-t0-int~employees"."EmployeeID" AS /*as-name-N*/ "b067b7d~FirstName~2"
FROM Demo.demo.Employees AS "s-6-1-t0-int~employees", Demo.demo.Orders AS "s-6-1-t0-int~orders"
WHERE /* inter-alias join cond */
"s-6-1-t0-int~orders".EmployeeID = "s-6-1-t0-int~employees".EmployeeID) AS "s-6-1-t0",
(SELECT "s-6-1-t1-int~orders"."OrderID" AS /*tmpl*/ "20ffecc~OrderID",
"s-6-1-t1-int~orders"."ShipCountry" AS /*tmpl*/ "e45a7f~ShipCountry"
FROM Demo.demo.Orders AS "s-6-1-t1-int~orders") AS "s-6-1-t1",
(SELECT "s-6-1-t2-int~orders"."OrderID" AS /*tmpl*/ "20ffecc~OrderID",
"s-6-1-t2-int~employees"."FirstName" AS /*as-name-N*/ "b067b7d~FirstName~0",
"s-6-1-t2-int~employees"."LastName" AS /*as-name-N*/ "b067b7d~FirstName~1",
"s-6-1-t2-int~employees"."EmployeeID" AS /*as-name-N*/ "b067b7d~FirstName~2"
FROM Demo.demo.Employees AS "s-6-1-t2-int~employees", Demo.demo.Orders AS "s-6-1-t2-int~orders"
WHERE /* inter-alias join cond */
"s-6-1-t2-int~orders".EmployeeID = "s-6-1-t2-int~employees".EmployeeID) AS "s-6-1-t2",
(SELECT "s-6-1-t3-int~orders"."OrderID" AS /*tmpl*/ "20ffecc~OrderID",
"s-6-1-t3-int~orders"."ShipCountry" AS /*tmpl*/ "e45a7f~ShipCountry"
FROM Demo.demo.Orders AS "s-6-1-t3-int~orders") AS "s-6-1-t3"
WHERE /* two fields belong to same equiv */
/*retval[*/ "s-6-1-t0"."20ffecc~OrderID" /* order1 */ /*]retval*/ =
/*retval[*/ "s-6-1-t1"."20ffecc~OrderID" /* order1 */ /*]retval*/
AND /* two fields belong to same equiv */
sprintf_iri ( 'http://myhost.example.com/Northwind/Employee/%U%U%d#this' ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~0" /* emp */ /*]retval*/ ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~1" /*]retval*/ ,
/*retval[*/ "s-6-1-t0"."b067b7d~FirstName~2" /*]retval*/ ) =
sprintf_iri ( 'http://myhost.example.com/Northwind/Employee/%U%U%d#this' ,
/*retval[*/ "s-6-1-t2"."b067b7d~FirstName~0" /* emp */ /*]retval*/ ,
/*retval[*/ "s-6-1-t2"."b067b7d~FirstName~1" /*]retval*/ ,
/*retval[*/ "s-6-1-t2"."b067b7d~FirstName~2" /*]retval*/ )
AND /* two fields belong to same equiv */
/*retval[*/ "s-6-1-t2"."20ffecc~OrderID" /* order2 */ /*]retval*/ =
/*retval[*/ "s-6-1-t3"."20ffecc~OrderID" /* order2 */ /*]retval*/
AND /* filter */
( /*retval[*/ "s-6-1-t1"."e45a7f~ShipCountry" /* country1 */ /*]retval*/ <>
/*retval[*/ "s-6-1-t3"."e45a7f~ShipCountry" /* country2 */ /*]retval*/ )
OPTION (QUIETCAST)
The query is next to unreadable but some comments split it into meaningful expressions. Every triple (or list of similar triples) become a subquery that return fields needed to build values of bound variables. That fields are printed outside with comments like /*retval[*/ expression /* original variable name */ /*]retval*/. Names like"s-6-1-t0" contain source line number where a group pattern begin (6) and serial number of triple (0). Comment /* inter-alias join cond */ means that the expression after that is the condition as it is written in the declaration of quad map pattern. Comment /* filter */ is before expressions for FILTER expressions in the source SPARQL. The word "equiv" means "equivalence class", i.e. group of occurencies of variables in the source query that all occurencies are bound to same value. E.g. when name repeates in many triples of a group, all its occurencies form an equivalence class. In sone cases compiler can prove that two variables are always equal even if names differ -- that variables are also placed into one "equiv".
Looking at this query, one may notice equality like sprintf_iri (...) = sprintf_iri (...). That is bad because it means that no index will be used to optimize the join and that there will be one function call per row. When variable ?emp appears in two different triples it means that the value of the variable is same in both triples. The query compares IRIs instead of comparing arguments of sprintf_iri because the format string is not proven to be a bijection. Indeed it can not be bjection for arbitrary strings, but the database is about real world. If it is known that real names of persons never start with digit so digits of %d%U fragment will always be distinguishable from name then the IRI class can be declared as a bijection even if it is not true for arbitrary strings; the script can include "suspicious" option (bijection) as follows:
create iri class sample:Employee "http://example.com/Employee/%d%U#this" (in employee_id integer not null, in employee_lastname varchar not null) option (bijection) .
Unfortunately, the attempt of making same trick with the declaration from Northwind example will fail:
create iri class northwind:Employee "http://^{URIQADefaultHost}^/Northwind/Employee/%U%U%d#this"
(in employee_firstname varchar not null, in employee_lastname varchar not null, in employee_id integer not null)
option (bijection) .
Bijection will allow the parsing but it will never give proper result, because first %U will read the whole concatenation of %U%U%d, leave nothing before #this for second %U (that is wrong) and leave nothing for %d (that is explicit parse error, as integer notation can not be empty)
.The string parser will process the string from left to right so it will be unable to parse the string. The compiler might sometimes report an error if it can prove that the format string is not appropriate for bijection.
The proper way of improving the Northwind example is to make true and reliable bijection by adding strong delimiters:
create iri class northwind:Employee "http://^{URIQADefaultHost}^/Northwind/Employee/%U/%U/%d#this"
(in employee_firstname varchar not null, in employee_lastname varchar not null, in employee_id integer not null)
option (bijection) .
After running the updated script, the query contains three comparisons of fields that were arguments of sprintf_iri in the previous version.
|
Previous
Business Intelligence Extensions for SPARQL |
Chapter Contents |
Next
Virtuoso RDF Performance Tuning |