15.6. Virtuoso Sponger

The Virtuoso Sponger is a middleware component of Virtuoso that generates RDF Linked Data from a variety of data sources. The sponger is transparently integrated into the Virtuoso SPARQL Query Processor, where it serves as part of the URI/IRI dereferencing functionality. It is also optionally used by the Virtuoso Content Crawler.

Most of the world's data currently resides in non-RDF form. The Sponger is middleware that accelerates the bootstrapping of the Semantic Data Web by unobtrusively generating RDF from non-RDF data sources.

When an RDF-aware client requests data from a network-accessible resource via the Sponger, the following events occur:

The imported data forms a local cache, and its invalidation rules conform to those of traditional HTTP clients (Web browsers). Expiration is evaluated on subsequent fetches of the same resource by comparing the current time with the expiration time stored in the local cache (note: the first data load records the 'expires' header). If the source data server does not return an HTTP 'expires' header, the Sponger derives its own invalidation time frame by evaluating the 'date' and 'last-modified' HTTP headers. Irrespective of the path taken, local cache invalidation is driven by an assessment of the current time relative to the recorded expiration time.

Designed with a pluggable architecture, the Sponger's core functionality is provided by Cartridges. Each cartridge includes Data Extractors, which extract data from one or more data sources, and Ontology Mappers, which map the extracted data to one or more ontologies/schemas en route to producing RDF Linked Data.

The Schema Mappers are typically XSLT based (e.g. GRDDL and other OpenLink Mapping Schemes) or Virtuoso PL based. The Metadata Extractors may be developed in Virtuoso PL, C/C++, Java, or any other language that can be integrated into Virtuoso via its server extension APIs.

The Sponger also includes a pluggable name resolution mechanism that enables the development of Custom Resolvers for naming schemes (e.g. URNs) associated with protocols beyond HTTP. Examples of custom resolvers include:

15.6.1. Virtuoso Cartridge-Supported Data Sources


15.6.2. Virtuoso Sponger Cartridge RDF Extractor

The Cartridge RDF Extractor is used to extract RDF from a Web data source. It consumes services from RDF extractors based on Virtuoso PL, C/C++, or Java.

The RDF mappers provide a way to extract metadata from non-RDF documents such as HTML pages, images, Office documents, etc., and to pass it to the SPARQL sponger (the crawler which retrieves missing source graphs). For brevity, the rest of this section refers to an "RDF mapper" simply as a "mapper".

A mapper consists of a PL procedure (the hook) and an extractor. The extractor itself can be built using PL, C, or any external language supported by the Virtuoso server.

Once the mapper is developed, it must be plugged into the SPARQL engine by adding a record to the table DB.DBA.SYS_RDF_MAPPERS.

If a SPARQL query instructs the SPARQL processor to retrieve the target graph into local storage, the SPARQL sponger is invoked. If the target graph IRI represents a dereferenceable URL, the content is retrieved using content negotiation. The next step is to detect the content type:

15.6.2.1. Virtuoso Sponger Cartridge RDF Extractor PL Requirements

PL hook requirements:

Every PL function used to plug a mapper into the SPARQL engine must have the following parameters, in this order:

Note: the names of the parameters are not important, but their order and presence are!

Example Implementation:

In the example script below, we implement a basic mapper that maps the text/plain MIME type to an imaginary ontology. The ontology extends the class Document from FOAF with the properties 'txt:UniqueWords' and 'txt:Chars', where the prefix 'txt:' stands for 'urn:txt:v0.0:'.

use DB;

create procedure DB.DBA.RDF_LOAD_TXT_META
 (
  in graph_iri varchar,
  in new_origin_uri varchar,
  in dest varchar,
  inout ret_body any,
  inout aq any,
  inout ps any,
  inout ser_key any
  )
{
  declare words, chars int;
  declare vtb, arr, subj, ses, str any;
  -- on any error, return 0 to signal that this mapper extracted nothing
  declare exit handler for sqlstate '*'
    {
      return 0;
    };
  subj := coalesce (dest, new_origin_uri);
  vtb := vt_batch ();
  chars := length (ret_body);

  -- using the text index procedures we get a list of words
  vt_batch_feed (vtb, ret_body, 1);
  arr := vt_batch_strings_array (vtb);

  -- the list has 'word' and positions array, so we must divide by 2
  words := length (arr) / 2;
  ses := string_output ();

  -- we compose a N3 literal
  http (sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj), ses);
  http (sprintf ('<%s> <urn:txt:v0.0:UniqueWords> "%d" .\n', subj, words), ses);
  http (sprintf ('<%s> <urn:txt:v0.0:Chars> "%d" .\n', subj, chars), ses);
  str := string_output_string (ses);

  -- we push the N3 text into the local store
  DB.DBA.TTLP (str, new_origin_uri, subj);
  return 1;
};

delete from DB.DBA.SYS_RDF_MAPPERS where RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

insert soft DB.DBA.SYS_RDF_MAPPERS (RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION)
values ('(text/plain)', 'MIME', 'DB.DBA.RDF_LOAD_TXT_META', null, 'Text Files (demo)');

-- set the order to a large number so that existing mappers are not disturbed
update DB.DBA.SYS_RDF_MAPPERS set RM_ID = 2000 where RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

To test the mapper, use the /sparql endpoint with the option 'Retrieve remote RDF data for all missing source graphs' to execute:

select * from <URL-of-a-txt-file> where { ?s ?p ?o }
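The same check can also be run from isql rather than from the /sparql form. The following is a minimal sketch: it assumes that the `define get:soft "soft"` pragma is the query-level counterpart of the endpoint option named above, and the URL is a placeholder that does not point at a real file.

```sql
-- Sketch only: sponge a plain-text resource and list the triples the mapper produced.
-- http://example.com/notes.txt is a placeholder; substitute the URL of a real text file.
sparql
define get:soft "soft"
select *
from <http://example.com/notes.txt>
where { ?s ?p ?o };
```

If the mapper is registered and enabled, the result should contain the rdf:type, txt:UniqueWords and txt:Chars triples written by DB.DBA.RDF_LOAD_TXT_META.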

The SPARQL_UPDATE role must be granted to the "SPARQL" account in order to allow the local repository to be updated via the sponge feature.
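As a sketch, the grant can be issued from isql by a DBA user. The account name is taken from the text above; whether granting this role is appropriate for a publicly reachable endpoint is a deployment decision that this example does not address.

```sql
-- Allow the "SPARQL" account to write sponged triples into the local store.
grant SPARQL_UPDATE to "SPARQL";
```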

Authentication in Sponger

To enable user-defined authentication, the /proxy/rdf and /sparql endpoints accept additional parameters. To use it, the RDF browser and iSPARQL should send the following URL parameters:


15.6.2.2. RDF Cartridges Use Cases

This section contains examples of Web resources which can be transformed by RDF Cartridges. It also states where additional setup is needed for a given cartridge, e.g. keys, account names, etc.

Service based:

GRDDL

URN handlers



15.6.3. Extending SPARQL IRI Dereferencing with RDF Mappers

The Virtuoso SPARQL engine (called simply SPARQL below, for brevity) supports IRI dereferencing. However, it understands only RDF data: it can retrieve only files containing RDF serialized as RDF/XML, Turtle, or N3. If the format is unknown, it tries mapping with the built-in WebDAV metadata extractor. To extend dereferencing to web or file resources that do not naturally contain RDF data (PDF or JPEG files, for example), the SPARQL engine provides a special mechanism called RDF mappers, which translate non-RDF data files to RDF.

For SPARQL to call an RDF mapper, the mapper must be registered for a given URL or MIME type pattern. In other words, when SPARQL receives a format it does not recognize while dereferencing a URL, it consults a special registry (a table) and matches either the MIME type or the IRI against a regular expression. If a match is found, the mapper function is called.

15.6.3.1. Registry

The table DB.DBA.SYS_RDF_MAPPERS serves as the registry of RDF mappers.

create table DB.DBA.SYS_RDF_MAPPERS (
    RM_ID integer identity,         -- mapper ID, designates the order of execution
    RM_PATTERN varchar,             -- a REGEX pattern to match the URL or MIME type
    RM_TYPE varchar default 'MIME', -- which property of the current resource to match: MIME or URL are supported at present
    RM_HOOK varchar,                -- fully qualified PL function name, e.g. DB.DBA.MY_MAPPER_FUNCTION
    RM_KEY  long varchar,           -- API specific key to use
    RM_DESCRIPTION long varchar,    -- mapper description, free text
    RM_ENABLED integer default 1,   -- integer flag, 0 or 1, to exclude or include the given mapper in the processing chain
    primary key (RM_TYPE, RM_PATTERN))
;

At present, a mapper is registered, updated, or unregistered with a plain DML statement, i.e. INSERT, UPDATE, or DELETE.
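The statements below sketch the registry life cycle for a mapper that is selected by URL rather than by MIME type. The hook name DB.DBA.MY_MAPPER_FUNCTION, the example.com pattern, and the description are illustrative assumptions; only the table and its columns come from the definition above.

```sql
-- Register a hypothetical mapper for URLs under http://example.com/items/ .
-- [.] is used so that the dot is matched literally by the regular expression.
insert soft DB.DBA.SYS_RDF_MAPPERS (RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION)
  values ('http://example[.]com/items/.*', 'URL', 'DB.DBA.MY_MAPPER_FUNCTION', null, 'Example items (demo)');

-- Take the mapper out of the processing chain without deleting its record.
update DB.DBA.SYS_RDF_MAPPERS set RM_ENABLED = 0
  where RM_HOOK = 'DB.DBA.MY_MAPPER_FUNCTION';

-- Put it back into the chain.
update DB.DBA.SYS_RDF_MAPPERS set RM_ENABLED = 1
  where RM_HOOK = 'DB.DBA.MY_MAPPER_FUNCTION';

-- Unregister the mapper.
delete from DB.DBA.SYS_RDF_MAPPERS
  where RM_HOOK = 'DB.DBA.MY_MAPPER_FUNCTION';
```

Because the primary key is (RM_TYPE, RM_PATTERN), `insert soft` leaves an existing record with the same type and pattern untouched rather than raising an error.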


15.6.3.2. Execution order and processing

When SPARQL retrieves a resource with unknown content, it looks in the mapper registry and loops over every record whose RM_ENABLED flag is true. The look-up sequence is ordered by the RM_ID column. For every record, it tries to match either the MIME type or the URL against the RM_PATTERN value; if there is a match, the function specified in the RM_HOOK column is called. If the function does not exist or signals an error, SPARQL moves on to the next record.

When does it stop looking? It stops if the value returned by the mapper function is a positive or negative number. A negative return value stops processing and means that no RDF was supplied; a positive return value means that RDF data was extracted. If zero is returned, SPARQL looks for the next mapper. A mapper function can therefore also return zero when the next mapper in the chain is expected to supply more RDF data.

If none of the mappers matches the signature (neither MIME type nor URL), the built-in WebDAV metadata extractor is called.
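The processing order described above can be inspected directly in the registry. The queries below are a sketch: the first lists the enabled mappers in the order in which they are consulted, and the second previews which MIME-type mappers would be candidates for a text/plain resource. Using `regexp_match` for the preview is an assumption; it approximates, but is not guaranteed to be identical to, the matching that the SPARQL engine performs internally.

```sql
-- Enabled mappers, in the order in which they are consulted.
select RM_ID, RM_TYPE, RM_PATTERN, RM_HOOK
  from DB.DBA.SYS_RDF_MAPPERS
 where RM_ENABLED = 1
 order by RM_ID;

-- Preview: enabled MIME-type mappers whose pattern matches 'text/plain'.
select RM_ID, RM_HOOK
  from DB.DBA.SYS_RDF_MAPPERS
 where RM_ENABLED = 1
   and RM_TYPE = 'MIME'
   and regexp_match (RM_PATTERN, 'text/plain') is not null
 order by RM_ID;
```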


15.6.3.3. Extension function

The mapper function is a PL stored procedure with the following signature:

THE_MAPPER_FUNCTION_NAME (
        in graph_iri varchar,
        in origin_uri varchar,
        in destination_uri varchar,
        inout content varchar,
        inout async_notification_queue any,
        inout ping_service any,
        inout keys any
        )
{
   -- do processing here
   -- return -1, 0 or 1 (as explained above in Execution order and processing section)
}
;

Parameters

Return value
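To make the three return values concrete, here is a minimal sketch of a mapper function. The name DB.DBA.MY_MAPPER_FUNCTION is hypothetical, and the policy it applies (return -1 for an empty body, 0 on any error, 1 after writing a single triple) is an assumption chosen for illustration rather than prescribed behavior. It loads data with DB.DBA.TTLP in the same way as the text/plain example earlier in this chapter.

```sql
create procedure DB.DBA.MY_MAPPER_FUNCTION (
        in graph_iri varchar,
        in origin_uri varchar,
        in destination_uri varchar,
        inout content varchar,
        inout async_notification_queue any,
        inout ping_service any,
        inout keys any
        )
{
  declare subj, ttl varchar;

  -- On any error return 0, so that the next mapper in the chain is tried.
  declare exit handler for sqlstate '*'
    {
      return 0;
    };

  -- Nothing to extract: a negative value stops processing with no RDF supplied.
  if (content is null or length (content) = 0)
    return -1;

  subj := coalesce (destination_uri, origin_uri);

  -- Describe the resource as a FOAF Document and load the triple into the local store.
  ttl := sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj);
  DB.DBA.TTLP (ttl, origin_uri, subj);

  -- A positive value reports that RDF data was extracted.
  return 1;
}
;
```

A mapper that should only contribute data, leaving the final say to later mappers, would return 0 instead of 1 at the end.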


15.6.3.4. RDF Mappers package content

Virtuoso supplies the rdf_mappers_dav VAD package, which contains cartridges for extracting RDF data from certain popular Web resources and file types. If it is not already installed, it can be installed using the VAD_INSTALL function; see the VAD chapter in the documentation for details.
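As a rough sketch, installation from isql might look like the call below. The file name, its location, and the meaning of the second argument (assumed here to select the file system rather than DAV as the source) are assumptions; the VAD chapter remains the authoritative description.

```sql
-- Assumed file name and path; adjust to where the package resides on the server.
-- The second argument 0 is assumed to mean "read the package from the file system".
DB.DBA.VAD_INSTALL ('rdf_mappers_dav.vad', 0);
```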

HTTP-in-RDF

Maps the HTTP request and response to the HTTP Vocabulary in RDF, see http://www.w3.org/2006/http#.

This mapper is disabled by default. If it is enabled, it must be first in the order of execution.

It always returns 0, which means that subsequent mappers are expected to supply more data.

HTML

This mapper is composite. It looks for metadata which can be specified in an HTML page as follows:

The HTML page mapper looks for RDF data in the order listed above. It tries to extract metadata at each step and returns a positive flag if any of the steps yields RDF data. If the page URL matches one of the other RDF mappers listed in the registry, it returns 0 so that the next mapper can extract more data. To function properly, this mapper must be executed before any other specific mappers.

Flickr URLs

This mapper extracts metadata for Flickr images using the Flickr REST API. To function properly, it must have a key configured. The Flickr mapper expresses the metadata using the following vocabularies: CC license, Dublin Core, Dublin Core Metadata Terms, GeoURL, FOAF, and the EXIF ontology (http://www.w3.org/2003/12/exif/ns/).

Amazon URLs

This mapper extracts metadata for Amazon articles using the Amazon REST API. It needs an Amazon API key in order to function.
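For mappers that need an API key, one plausible way to supply it is through the RM_KEY column of the registry, which the table definition describes as the "API specific key to use". The sketch below assumes that the cartridge reads its key from that column; the hook name and the key value are placeholders, not the names used by the installed package.

```sql
-- List the registered mappers to find the hook of the one that needs a key.
select RM_ID, RM_HOOK, RM_ENABLED
  from DB.DBA.SYS_RDF_MAPPERS
 order by RM_ID;

-- Store the key for that mapper.
-- 'DB.DBA.HOOK_NAME_FROM_THE_LISTING' and 'YOUR-API-KEY' are placeholders.
update DB.DBA.SYS_RDF_MAPPERS
   set RM_KEY = 'YOUR-API-KEY'
 where RM_HOOK = 'DB.DBA.HOOK_NAME_FROM_THE_LISTING';
```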

eBay URLs

Uses the eBay REST API to extract metadata for eBay articles. It needs a key and a user name to be configured in order to work.

Open Office (OO) documents

OO documents contain metadata which can be extracted using UNZIP, so this extractor needs the Virtuoso unzip plugin to be configured on the server.

Yahoo traffic data URLs

Transforms the results of Yahoo traffic data to RDF.

iCal files

Transforms iCal files to RDF as per http://www.w3.org/2002/12/cal/ical# .

Binary content, PDF, PowerPoint

Unknown binary content, PDF files, and MS PowerPoint files can be transformed to RDF using the Aperture framework (http://aperture.sourceforge.net/). To work, this mapper needs Virtuoso with Java hosting support, the Aperture framework, and MetaExtractor.class installed on the host system.

The Aperture framework and MetaExtractor.class must be installed on the system before the RDF mappers package is installed. If the package is already installed, this mapper can be activated by re-installing the VAD.

Setting-up Virtuoso with Java hosting to run Aperture framework

Important: the above is verified to work with aperture-2006.1-alpha-3 on a Linux system. A different version of Aperture or a different operating system may need some adjustments, e.g. re-building MetaExtractor.class, changes to CLASSPATH, etc.

Examples & tutorials

To learn how to write your own RDF mapper, see the Virtuoso tutorial on this subject: http://demo.openlinksw.com/tutorial/rdf/rd_s_1/rd_s_1.vsp .


15.6.3.5. Sponger Proxy service

Sponger functionality is also exposed via Virtuoso's "/proxy/rdf/" endpoint, a built-in REST style Web service available in any standard Virtuoso installation. This Web service takes a target URL and either returns the content "as is" or tries to transform it to RDF by sponging. Thus, the proxy service can be used as a 'pipe' that lets RDF browsers browse non-RDF sources.

For more information see RDF Sponger Proxy service