10.6.1. SQL Optimization and Dpipe

Calls to SQL functions in queries can be translated into dpipe operations if a dpipe equivalent is declared. Consider a select statement returning the values of a user-defined function applied to a column selected in the query. If this function itself accesses the database, especially if the data it accesses is partitioned, calling the function for each row and waiting for the result before processing the next row would have a catastrophic performance impact. The dpipe mechanism allows the function calls to be partitioned by the SQL run time and sent to the appropriate cluster nodes in large batches. This applies to any operation that can be expressed as a sequence of steps, each of which runs within a single partition.

We take a trivial example:

create table ct (row_no int primary key, dt varchar);
alter index CT on CT partition (ROW_NO int (255));

create procedure CT_DT_PIPE (in n int)
{
  -- table option (no cluster) keeps the lookup local to the partition
  -- on which the dpipe invokes this procedure.
  return vector ((select dt from ct table option (no cluster) where row_no = n), 1);
}

dpipe_define ('CT_DT', 'DB.DBA.CT', 'CT', 'DB.DBA.CT_DT_PIPE', 0);
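
As used in this example, the arguments of dpipe_define appear to name the dpipe operation, the partitioned table and key that determine where each call is sent, and the procedure run on the owning node. The call below restates the declaration above with these roles annotated; the meaning of the final flags argument is not covered here and is simply left at 0.

dpipe_define (
    'CT_DT',              -- dpipe operation name; calls to ct_dt () in SQL map to it
    'DB.DBA.CT',          -- partitioned table whose layout decides where each call runs
    'CT',                 -- key (here the primary key index of CT) used to compute the partition
    'DB.DBA.CT_DT_PIPE',  -- procedure invoked on the node owning that partition
    0);                   -- flags; not discussed here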

We can now use statements like:

select sum (length (ct_dt (row_no))) from ct;

This is equivalent to:

select sum (length (b.dt))
  from ct a, ct b
  where a.row_no = b.row_no;

The point is that the function call is mapped to the dpipe operation of the same name declared in the dpipe_define. This means that a large batch of row_no values is retrieved and, for each value, the partition in which to invoke the dpipe function is computed. The function calls are then sent in concurrent batches to these partitions, running the function in parallel, each on the node that holds the data of interest. Getting the value of a column of a partitioned table is a trivial example, but any complex computation could be substituted for it, including ones that update state on the node, with the proviso that they must not access data outside the host on which they run.
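
To illustrate substituting a different computation, the sketch below pushes the length calculation itself to the owning node, so that only integers travel between hosts instead of the dt strings. The operation and procedure names are invented for this illustration; the body follows the same return convention and the same no cluster restriction as CT_DT_PIPE above.

create procedure CT_DT_LEN_PIPE (in n int)
{
  -- Runs on the node holding the partition for row_no = n; the no cluster
  -- option keeps the lookup local to that partition.
  return vector ((select length (dt) from ct table option (no cluster) where row_no = n), 1);
}

dpipe_define ('CT_DT_LEN', 'DB.DBA.CT', 'CT', 'DB.DBA.CT_DT_LEN_PIPE', 0);

With this in place, a statement like select sum (ct_dt_len (row_no)) from ct would compute the lengths where the data resides and ship back only the per-row results to be summed.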

Further, because the results are summed into an aggregate, they can be processed in whatever order they arrive, which improves parallelism.