1.5.32. What is the best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint?

The best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint is decimation in its original sense, that is, keeping roughly one triple in ten:

SELECT ?s ?p ?o
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) )
  }

By changing the first argument of bif:rnd() and the left side of the inequality, you can adjust the decimation ratio from 1/10 to the desired value. Note that the SQL optimizer is entitled to execute bif:rnd (10) only once, at the beginning of the query. The three additional arguments can be known only when a table row is fetched, so bif:rnd (10, ?s, ?p, ?o) is calculated for every row, and any given row is either returned or ignored independently of the others.
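
As a sketch of that adjustment, assuming bif:rnd (N, ...) returns an integer from 0 to N-1 with equal probability, the following variant keeps roughly 3 of every 100 triples instead of 1 of every 10. The graph name is the same placeholder as above:

SELECT ?s ?p ?o
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    # true when the random value is 0, 1 or 2 out of 0..99
    FILTER ( 3 > bif:rnd (100, ?s, ?p, ?o) )
  }

Because each row is tested independently, the number of rows returned varies from run to run around the expected 3% of the graph.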

However, bif:rnd (10, ?s, ?p, ?o) contains a subtle inefficiency. In the RDF store, graph nodes are stored as numeric IRI IDs, and literal objects can be stored in a separate table. A call to an SQL function needs arguments of traditional SQL datatypes, so the query processor will extract the text of the IRI for each node and the full value of each literal object. That is a significant waste of time. The workaround is:

SPARQL
SELECT ?s ?p ?o
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o) )
  }

This tells the SPARQL front-end to omit redundant conversions of values.
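
One way to sanity-check the decimation ratio is to count the sampled rows rather than fetch them. This is only a sketch: it assumes the endpoint accepts the SPARQL 1.1 COUNT(*) aggregate, and ?sampled is a variable name chosen for illustration:

SPARQL
SELECT (COUNT(*) AS ?sampled)
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    # same per-row filter as above; only the projection changes
    FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o) )
  }

With the 1/10 ratio, the result should be close to one tenth of the number of triples in the graph, differing slightly between runs.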

Prefix	IRI
n4	http://docs.openlinksw.com/virtuoso/rndsalltr/
schema	http://schema.org/
n5	http://creativecommons.org/licenses/by/4.0/
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
n3	http://www.openlinksw.com/#
xsdh	http://www.w3.org/2001/XMLSchema#

Subject	Predicate	Object
n4:	rdf:type	schema:TechArticle
n4:	rdf:type	schema:APIReference
n4:	schema:name	1.5.32. What is best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint?
n4:	schema:copyrightHolder	_:vb81548
n4:	schema:datePublished	2016-09-09 16:16:54
n4:	schema:headline	1.5.32. What is best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint?
n4:	schema:keywords	OpenLink,Virtuoso,database,RDBMS,relational,SQL,RDF,triple store,linked data,linked open data,Big Data,SPARQL
n4:	schema:license	n5:deed.en_US
n4:	schema:publisher	_:vb81547
n4:	schema:url	n4:
_:vb81547	rdf:type	schema:Organization
_:vb81547	schema:name	OpenLink Software
_:vb81547	schema:url	n3:this
_:vb81548	rdf:type	schema:Organization
_:vb81548	schema:name	OpenLink Software
_:vb81548	schema:url	n3:this
