www.openlinksw.com
docs.openlinksw.com

Book Home

Contents
Preface

RDF Database and SPARQL

Overview
Data Representation
RDF and SPARQL API and SQL
SPARUL -- an Update Language For RDF Graphs
RDF Insert Methods in Virtuoso
Virtuoso Sponger
Dereferencable IRIs and RDF Linked Data
RDF Views -- Mapping Relational Data to RDF
RDF Inference in Virtuoso
Using Full Text Search in SPARQL
Virtuoso SPARQL Query Service
Business Intelligence Extensions for SPARQL
Aggregates in SPARQL Note on Aggregates and Inference Pointer Operator (+> and *>) Subqueries in SPARQL Expressions in Triple Patterns
Debugging SPARQL queries
Virtuoso RDF Performance Tuning
RDF Store Benchmarks
SPARQL Implementation Details
Native RDF Storage Providers
SPARQL predicates usage

15.12. Business Intelligence Extensions for SPARQL

Virtuoso extends SPARQL with expressions in results, subqueries, aggregates and grouping. These extensions allow a straightforward translation of arbitrary SQL queries to SPARQL. This extension is called "SPARQL BI", because the primary objective is to match needs of Business Intelligence. The extended features apply equally to querying physical quads or relational tables mapped through RDF views.

Note:

In this section, many examples use TPC-H namespace. You may test them on your local demo database. They use data from TPC-H dataset that is mapped into graph with IRI like http://example.com/tpch , but you should replace fake host name "example.com" with host name of your own installation, verbatim, that is specified as "DefaultHost" parameter in [URIQA] section of Virtuoso configuration file.

15.12.1. Aggregates in SPARQL

Virtuoso extends SPARQL with SQL like aggregate and group by functionality. This functionality is also available through embedding SPARQL text inside SQL but the SPARQL extension syntax has the benefit of working also over the SPARQL protocol and of looking more SPARQL-like.

The supported aggregates are count, min, max, avg and sum. These can take an optional distinct keyword. These are permitted only in the selection part of a select query. If a selection list consists of a mix of variables and aggregates, the non-aggregate selected items are considered to be grouping columns and a group by over them is implicitly added at the end of the generated SQL query. There is no explicit syntax for group by or having in Virtuoso SPARQL.

If a selection consists of aggregates exclusively, the result set has one row with the values of the aggregates. If there are aggregates and variables in the selection, the result set as has many rows as there are distinct combinations of the variables and the aggregates are calculated over each such distinct combination, as if there were a SQL group by over all non-aggregates.

With the count aggregate the argument may be either *, meaning counting all rows or a variable name, meaning counting all the rows where this variable is bound. If there is no implicit group by, there can be an optional distinct keyword before the variable that is the argument of an aggregate.

There is a special syntax for counting distinct combinations of selected variables. This is:

select count distinct ?v1 ... ?vn
  from ....

User-defined aggregate functions are not supported in current version of SPARQL compiler.

15.12.1.1. Examples

sparql
select count (*)
  from <http://mygraph.com>
 where {?s ?p ?o}

-- Returns the count of physical triples in http://mygraph.com.
sparql define input:inference 'http://mygraph.com'
select ?p count (?o)
  from <http://mygraph.com>
 where {?s ?p ?o};

-- Returns the count of O's for each distinct P.
sparql define input:inference 'http://mygraph.com'
select count (?p) count (?o) count (distinct ?o)
 from <http://mygraph.com>
where {?s ?p ?o};

-- returns the count of triples, including inferred triples and the count of distinct O values.
sparql define input:inference 'http://mygraph.com'
select count distinct ?s ?p ?o
  from <http://mygraph.com>
 where {?s ?p ?o};

-- Returns the number of distinct bindings of ?s ?p ?o.
sparql
prefix tpch: <http://www.openlinksw.com/schemas/tpch#>
select ?status count(*) sum(?extendedprice)
from <http://localhost.localdomain:8310/tpch>
where {
    ?l a tpch:lineitem ;
      tpch:lineextendedprice ?extendedprice ;
      tpch:linestatus ?status .
  };

-- Get counts and total prices of ordered items, grouped by item status.

Example with dataset of people, some duplicated:

Suppose there is a dataset with many people, some of them duplicated. To list execute the query:

sparql
SELECT DISTINCT
 (?name) ?person ?mail
 WHERE {
   ?person rdf:type foaf:Person .
   ?person foaf:name ?name .
   ?person foaf:mbox_sha1sum ?mail
 };

Note the newline after DISTINCT to read the query.

SELECT DISTINCT is as in plain SPARQL, return values are expression ?name and two variables ?person and ?mail. Unfortunately, there's no such thing as DISTINCT for a part of the result set row. If there's a regular need in such feature and duplicates are seldom then we may introduce an aggregate called, say, SPECIMEN, that will return the very first of aggregated values, unchanged:

sparql
SELECT ?name (specimen(?person)) (specimen(?mail))
WHERE
  {
    ?person rdf:type foaf:Person .
    ?person foaf:name ?name .
    ?person foaf:mbox_sha1sum ?mail
  };

If duplicates are seldom and there's no requirement that ?person should correspond to ?mail (so the result should contains some person node and some mail but they don't have to be connected by foaf:mbox_sha1sum), then MIN aggregate will work:

sparql
SELECT ?name (min(?person)) (min(?mail))
WHERE
  {
    ?person rdf:type foaf:Person .
    ?person foaf:name ?name .
    ?person foaf:mbox_sha1sum ?mail
  };

Otherwise, a complicated query is needed:

sparql
SELECT
 ?name
 ((select (min (?person3))
     where {
         ?person3 rdf:type foaf:Person .
         ?person3 foaf:name ?name .
         ?person3 foaf:mbox_sha1sum ?mail } )) as ?person
 ?mail
 WHERE {
     { select distinct ?name
       where {
           ?person1 rdf:type foaf:Person .
           ?person1 foaf:name ?name .
           ?person1 foaf:mbox_sha1sum ?mail1 } }
     { select ?name (min(?mail2)) as ?mail
       where {
           ?person2 rdf:type foaf:Person .
           ?person2 foaf:name ?name .
           ?person2 foaf:mbox_sha1sum ?mail2 } }
 };


15.12.2. Note on Aggregates and Inference

Inference is added to a SPARQL query only for the variables whose value is actually used. Thus,

select count (*)
 from <http://mygraph.com>
where {?s ?p ?o}

will not make inferred values since s, p, and o are actually not used. Instead,

sparql
select count (?s) count (?p) count (?o)
 from <http://mygraph.com>
where {?s ?p ?o};

will get all the inferred triples also.

This may be confusing and will likely be corrected in the future.


15.12.3. Pointer Operator (+> and *>)

When expressions occur in result sets, many variables are introduced only to pass a value from triple pattern to the result expression. This is inconvenient because many triple patterns are trivial. Big number of variable names masquerade "interesting" variables that are used more than one in pattern part of the query and establish logical relationshops between parts of the query. As a solution we introduce pointer operators.

The +> (pointer) operator allows referring to a property value without naming it as a variable and explicit writing triple pattern. We can shorten the example above as

sparql
prefix tpch: <http://www.openlinksw.com/schemas/tpch#>
select ?l+>tpch:linestatus count(*) sum(?l+>tpch:lineextendedprice)
from <http://localhost.localdomain:8310/tpch>
where { ?l a tpch:lineitem }

The ?subject+>propertyname notation is exactly equivalent to having a triple pattern ?subject propertyname ?aux_var binding an auxiliary variable to the mentioned property of the subject within the group pattern enclosing the reference. For a SELECT, the enclosing group pattern is considered to be the top level pattern of the where clause or in the event of a union, the top level of each term of the union. Each distinct pointer adds exactly one triple pattern to the enclosing group pattern, thus multiple uses of +> with same arguments do not each add a triple pattern. Having multiple copies of an identical pattern might lead to changes in cardinality if multiple input graphs were being considered. If a lineitem had multiple discounts or extended prices, then we would get the cartesian product of both.

If a property referenced via +> is absent, the variable at left side of the operator is not bound in the enclosing group pattern because it should be bound in all triple patterns where it appears as a field, including implicitly added patterns. The ?subject*>propertyname notation is introduced in order to access optional property values. It adds an OPTIONAL group OPTIONAL { ?subject propertyname ?aux_var }, not plain triple pattern, so the binding of ?subject is not changed even if the object variable is not bound. If the property is set for all subjects in question then the results of *> and +> are the same. All other equal, +> operator produces better SQL code than *> so use *> only when it is really needed.


15.12.4. Subqueries in SPARQL

Pure SPARQL does not allow binding a value that is not retrieved through a triple pattern. We lift this restriction by allowing expressions in the result set and providing names for result columns. We also let a sparql SELECT statement to appear in other sparql statement in any place where a group pattern may appear, so names of result columns become names of variables bound using values from returned rows. This resembles derived tables in SQL.

For instance, the following statement finds prices of 1000 order lines with biggest discounts.

sparql
define sql:signal-void-variables 1
prefix tpch: <http://www.openlinksw.com/schemas/tpch#>
select ?line ?discount (?extendedprice * (1 - ?discount)) as ?finalprice
from <http://localhost.localdomain:8310/tpch>
where {
  ?line a tpch:lineitem ;
    tpch:lineextendedprice ?extendedprice ;
    tpch:linediscount ?discount . }
order by desc (?extendedprice * ?discount)
limit 1000;

As soon as we ensure that the query works fine, we can use it in answering more complex question. Imagine that we want to find how big are customers who has got biggest discounts ever made by company.

sparql
define sql:signal-void-variables 1
prefix tpch: <http://www.openlinksw.com/schemas/tpch#>
select ?cust sum(?extendedprice2 * (1 - ?discount2)) max (?bigdiscount)
from <http://localhost.localdomain:8310/tpch>
where
  {
    {
      select ?line (?extendedprice * ?discount) as ?bigdiscount
      where {
        ?line a tpch:lineitem ;
          tpch:lineextendedprice ?extendedprice ;
          tpch:linediscount ?discount . }
      order by desc (?extendedprice * ?discount)
      limit 1000
    }
    ?line tpch:has_order ?order .
    ?order tpch:has_customer ?cust .
    ?cust tpch:customer_of ?order2 .
    ?order2 tpch:order_of ?line2 .
    ?line2 tpch:lineextendedprice ?extendedprice2 ;
      tpch:linediscount ?discount2 .
  }
order by (sum(?extendedprice2 * (1 - ?discount2)) / max (?bigdiscount));

Looking at the result we can see that customers in question are really big so even big 10% discounts on individual items looks like additional 0.5% to 1.5% discounts from sums of total purchases.

The inner select finds 1000 biggest (in absolute value) discounts and their order lines. For each line we find order of that line and the customer. For every found customers we find all orders he made and all lines of that orders (variable ?line2). The rest is trivial.

Note that the inner select does not contains FROM clauses. It is not required because inner select inherits access permissions of all outer queries. It is also important that internal variable bindings of subquery are not visible in outer query, only result set variables are bound, similarly, variables bound in outer query are not accessible inside.

Note also the declaration define sql:signal-void-variables 1 that forces SPARQL compiler to signal errors if some variables can not be bound due to typos in names or attempts of making joins of disjoint domains. This diagnostics is especially important when the query is long.


15.12.5. Expressions in Triple Patterns

In addition to expressions in filters and result sets, Virtuoso allows the use of expressions in triples of CONSTRUCT pattern or WHERE pattern: an expression can be used instead of constant or variable name for subject, predicate or object. In this case, an expression is surrounded by backquotes.

Example with WHERE Clause:

The following return all distinct 'fragment' parts of all subjects in all graphs that have some predicate whose value is equal to 2+2.

SQL>sparql select distinct (bif:subseq (?s, bif:strchr (?s, '#')))
   where {
     graph ?g {
       ?s ?p `2+2` .
       filter (! bif:isnull (bif:strchr (?s, '#') ) )
     } };

callret
VARCHAR
----------
#four

Inside WHERE part, every expression in a triple pattern is replaced with new variable, a filter expression is added to the enclosing group. The new filter is an equality of new variable and the expression. Hence the sample above is identical to

sparql
select distinct (bif:subseq (?s, bif:strchr (?s, '#')))
   where {
     graph ?g {
       ?s ?p ?newvariable .
       filter (! bif:isnull (bif:strchr (?s, '#') ) )
       filter (?newvariable = (2+2)) .
     } };

Example with CONSTRUCT:

CONSTRUCT {
  <http://bio2rdf.org/interpro:IPR000181>
<http://bio2rdf.org/ns/bio2rdf#hasLinkCount>
`(select (count(?s)) as ?countS
   where { ?s ?p <http://bio2rdf.org/interpro:IPR000181> })` }
where { ?s1 ?p1 ?o1 } limit 1

The result should be:

<http://bio2rdf.org/interpro:IPR000181> <http://bio2rdf.org/ns/bio2rdf#hasLinkCount> "0"^^<http://www.w3.org/2001/XMLSchema#integer> .

Example create graph:

SQL>sparql insert into graph <http://MyNewGraph.com/> {
<http://bio2rdf.org/interpro:IPR000181>
<http://bio2rdf.org/ns/bio2rdf#hasLinkCount>
  `(select (count(?s)) as ?countS
    where { ?s ?p <http://bio2rdf.org/interpro:IPR000181> })` }
 where { ?s1 ?p1 ?o1 } limit 1  ;

callret-0
VARCHAR
_______________________________________________________________________________

Insert into <http://MyNewGraph.com/>, 1 triples -- done

1 Rows. -- 30 msec.