16.10. RDFizer Middleware (Sponger)

16.10.1. What Is The Sponger?

The Virtuoso Sponger is the Linked Data middleware component of Virtuoso. It generates Linked Data from a variety of data sources and supports many data representation and serialization formats. The Sponger is transparently integrated into Virtuoso's SPARQL Query Processor, where it delivers URI de-referencing within SPARQL query patterns, across disparate data spaces. It also delivers configurable smart HTTP caching services. Optionally, it can be used by the Virtuoso Content Crawler to periodically populate and replenish data within the native RDF Quad Store.

The Sponger is a fully fledged HTTP proxy service that is also directly accessible via SOAP or REST interfaces.

As depicted below, OpenLink's broad portfolio of Linked-Data-aware products supports a number of routes for creating or consuming Linked Data. The Sponger provides a key platform for developers to generate quality data meshes from unstructured or semi-structured data sources.

Figure: 16.10.1.1. OpenLink Linked Data generation options

Architecturally, the Sponger comprises two types of cartridges: Extractor Cartridges and Meta Cartridges. Extractor Cartridges focus on data extraction and transformation services, while Meta Cartridges provide lookups and joins across other linked data spaces and Web 2.0 APIs. Both cartridge types are themselves composed of data extractor and RDF Schema/Ontology Mapper components.

Cartridges are highly customizable. Custom cartridges can be developed using any language supported by the Virtuoso Server Extensions API, enabling structured linked data generation from resource types not covered by the default Sponger Cartridge collection, which is bundled as part of the Virtuoso Cartridges VAD package.

Figure: 16.10.1.2. Virtuoso metadata extraction & RDF structured data generation

16.10.2. Why is it Important?

The majority of the world's data currently resides in non-RDF form. The Sponger delivers middleware that accelerates the bootstrap of the Semantic Data Web by unobtrusively generating RDF from non-RDF data sources. This "Swiss army knife" for on-the-fly Linked Data generation provides a bridge between the traditional Document Web and the Linked Data Web ("Data Web").

Sponging data from non-RDF Web sources and converting it to RDF exposes the data in a canonical form for querying and inference, and enables fast and easy construction of linked data driven mesh-ups as opposed to code driven Web 2.0 mash-ups.

The RDF extraction and instance data generation products that offer functionality demonstrated by the Sponger are also commonly referred to as RDFizers.


16.10.3. How Does It Work?

When an RDF aware client requests data from a network accessible resource via the Sponger the following events occur:

The imported data forms a local cache whose invalidation rules conform to those of traditional HTTP clients (Web browsers). The first data load records the 'expires' header; on each subsequent fetch of the same resource, the current time is compared with the expiration time stored in the local cache. If the source data server does not return an HTTP 'expires' header, the Sponger derives its own invalidation time frame by evaluating the 'date' and 'last-modified' HTTP headers. Irrespective of the path taken, local cache invalidation is driven by an assessment of current time relative to recorded expiration time.

To manage the cache expiration, set the MinExpiration parameter in your Virtuoso.ini file.

See the [SPARQL] ini section for a full description of the parameter.
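
The cache rules described above can be illustrated with a minimal Python sketch. It is not the Sponger's implementation: the function names are hypothetical, the "one tenth of the document age" heuristic is borrowed from common HTTP cache practice because the text does not say how 'date' and 'last-modified' are combined, and treating `min_expiration` as a lower bound in seconds is an assumed reading of MinExpiration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def _parse_http_date(value):
    """Parse an HTTP date header; return None if it is missing or malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def expiration_time(headers, fetched_at, min_expiration=0, heuristic_divisor=10):
    """Return the moment after which a cached copy should count as stale.

    headers           dict of response headers with lower-case names
    fetched_at        timezone-aware datetime of the fetch
    min_expiration    assumed lower bound on the cache lifetime, in seconds
    heuristic_divisor assumed fraction (1/10) of the document age to trust
    """
    floor = fetched_at + timedelta(seconds=min_expiration)
    expires = _parse_http_date(headers.get("expires"))
    if expires is None:
        served = _parse_http_date(headers.get("date"))
        modified = _parse_http_date(headers.get("last-modified"))
        if served is not None and modified is not None:
            # No 'expires' header: derive a lifetime from how old the document is.
            age = max(served - modified, timedelta(0))
            expires = served + age / heuristic_divisor
        else:
            # Nothing to go on: treat the copy as expiring immediately.
            expires = fetched_at
    return max(expires, floor)


def is_stale(expires_at, now):
    """Cache invalidation compares current time with the recorded expiration."""
    return now >= expires_at


if __name__ == "__main__":
    fetched = datetime(2015, 10, 21, 7, 0, tzinfo=timezone.utc)
    headers = {"date": "Wed, 21 Oct 2015 07:00:00 GMT",
               "last-modified": "Tue, 20 Oct 2015 21:00:00 GMT"}
    print(expiration_time(headers, fetched))
```

The sketch assumes header dates carry an explicit GMT zone, so every value it compares is timezone-aware.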

Designed with a pluggable architecture, the Sponger's core functionality is provided by Cartridges. Each cartridge includes Data Extractors, which extract data from one or more data sources, and Ontology Mappers, which map the extracted data to one or more ontologies/schemas en route to producing RDF Linked Data.

The Schema Mappers are typically XSLT based (e.g. GRDDL and other OpenLink Mapping Schemas) or Virtuoso PL based. The Metadata Extractors may be developed in Virtuoso PL, C/C++, Java, or any other language that can be integrated into Virtuoso via its server extensions APIs.
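
The extractor/mapper split can be pictured with a small Python sketch. Real cartridges are written in Virtuoso PL, XSLT or another server-extension language; the `Cartridge` class, `extract_title` and `map_to_dc` below are hypothetical names used only to show the two stages working together.

```python
import re
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

DC_TITLE = "http://purl.org/dc/elements/1.1/title"


class Cartridge:
    """Pairs a data extractor with an ontology mapper (illustrative only)."""

    def __init__(self,
                 extractor: Callable[[str], Dict[str, str]],
                 mapper: Callable[[str, Dict[str, str]], List[Triple]]):
        self.extractor = extractor
        self.mapper = mapper

    def run(self, url: str, content: str) -> List[Triple]:
        record = self.extractor(content)   # stage 1: pull raw fields from the source
        return self.mapper(url, record)    # stage 2: express them in a vocabulary


def extract_title(content: str) -> Dict[str, str]:
    """Naive extractor: read the <title> element of an HTML page."""
    match = re.search(r"<title>(.*?)</title>", content, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip()} if match else {}


def map_to_dc(url: str, record: Dict[str, str]) -> List[Triple]:
    """Mapper: describe the extracted title with the Dublin Core title property."""
    return [(url, DC_TITLE, record["title"])] if "title" in record else []


html_cartridge = Cartridge(extract_title, map_to_dc)
```

Keeping the two stages separate means the same extractor can feed different vocabularies simply by swapping the mapper.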

The Sponger also includes a pluggable name resolution mechanism that enables the development of Custom Resolvers for naming schemes (e.g. URNs) associated with protocols beyond HTTP. Examples of custom resolvers include:


16.10.4. Installation Steps

  1. Download the Cartridges VAD package.
  2. Install the cartridges_dav.vad package by using the Conductor UI or by using iSQL:
    SQL> DB.DBA.VAD_INSTALL('tmp/cartridges_dav.vad',0);
    SQL_STATE  SQL_MESSAGE
    VARCHAR  VARCHAR
    _______________________________________________________________________________
    
    00000    No errors detected
    00000    Installation of "Linked Data Cartridges" is complete.
    00000    Now making a final checkpoint.
    00000    Final checkpoint is made.
    00000    SUCCESS
    
    
    6 Rows. -- 1078 msec.
    
  3. Cartridge Configuration


16.10.5. Using The Sponger

The Sponger can be invoked via the following mechanisms:

  1. Virtuoso SPARQL query processor
  2. RDF Proxy Service
  3. OpenLink RDF client applications
  4. ODS-Briefcase (Virtuoso WebDAV) - a component of the OpenLink Data Spaces distributed collaborative application platform
  5. Directly via Virtuoso PL

16.10.5.1. SPARQL Query Processor IRI Dereferencing

The Sponger is transparently integrated into the Virtuoso SPARQL query processor, where it supports IRI dereferencing.

Virtuoso extends the SPARQL Query Language such that it is possible to download RDF resources from a given IRI, parse, and then store the resulting triples in a graph, with all three operations performed during the SPARQL query-execution process. The IRI/URI of the graph used to store the triples is usually equal to the URL where the resources are downloaded from, consequently the feature is known as "IRI/URI dereferencing". If a SPARQL query instructs the SPARQL processor to retrieve the target graph into local storage, then the SPARQL sponger will be invoked.

The SPARQL extensions for IRI dereferencing are described below. Essentially these enable downloading and local storage of selected triples either from one or more named graphs, or based on a proximity search from a starting URI for entities matching the select criteria and also related by the specified predicates, up to a given depth. For full details see section Linked Data - IRI Dereferencing.

Note: For brevity, any reference to URI/IRIs above or in subsequent sections implies an HTTP URI/IRI, where IRI is an internationalized URI. Similarly, in the context of the Sponger, the term IRI in the Virtuoso reference documentation should be taken to mean an HTTP IRI.

16.10.5.1.1. SPARQL Extensions for IRI Dereferencing of FROM Clauses

In addition to the "define get:..." SPARQL extensions for IRI dereferencing in FROM clauses, Virtuoso supports dereferencing SPARQL IRIs used in the WHERE clause (graph patterns) of a SPARQL query via a set of "define input:grab-..." pragmas.

Consider an RDF resource which describes a member of a contact list, user1, and also contains statements about other users, user2 and user3, known to him. Resource user3 in turn contains statements about user4, and so on. If all the data relating to these users were loaded into Virtuoso's RDF database, the query to retrieve the details of all the users could be quite simple, e.g.:

sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?id ?firstname ?nick
where
  {
    graph ?g
      {
        ?id rdf:type foaf:Person .
        ?id foaf:firstName ?firstname .
        ?id foaf:knows ?fn .
        ?fn foaf:nick ?nick .
      }
   }
limit 10;

But what if some or all of these resources were not present in Virtuoso's quad store? The distributed nature of the Linked Data Web makes it highly likely that these interlinked resources would be spread across several data spaces. Virtuoso's 'input:grab-...' extensions to SPARQL enable IRI dereferencing in such a way that all appropriate Network resources are loaded (i.e. fetched) during query execution, even if some of the Network resources are not known beforehand. For any particular resource matched, and if necessary downloaded, by the query, it is possible to download related resources via one or more designated predicate paths to a specifiable depth, i.e. number of 'hops', distance, or degrees of separation (in effect, computing Transitive Closures in SPARQL).

Using Virtuoso's 'input:grab-' pragmas to enable sponging, the above query might be recast to:

sparql
define input:grab-var "?more"
define input:grab-depth 10
define input:grab-limit 100
define input:grab-base "http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1300"
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?id ?firstname ?nick
where {
    graph ?g {
               ?id rdf:type foaf:Person .
               ?id foaf:firstName ?firstname .
               ?id foaf:knows ?fn .
               ?fn foaf:nick ?nick .
               optional { ?id rdfs:seeAlso ?more }
            }
}
limit 10;

Another example showing a designated predicate traversal path via the input:grab-seealso extension is:

sparql
define input:grab-iri <http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/sioc.ttl>
define input:grab-var "id"
define input:grab-depth 10
define input:grab-limit 100
define input:grab-base "http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1300"
define input:grab-seealso <http://xmlns.com/foaf/0.1/maker>
prefix foaf: <http://xmlns.com/foaf/0.1/>

select ?id
where
  {
    graph ?g
      {
        ?id a foaf:Person .
      }
  }
limit 10;
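
The effect of a depth and a limit on this kind of traversal can be sketched in Python as a breadth-first walk. This is a conceptual model only, not Virtuoso's algorithm: `grab`, `fetch` and the in-memory `WEB` dictionary are hypothetical, and the exact semantics of input:grab-depth and input:grab-limit in the server may differ.

```python
from collections import deque

# Hypothetical stand-in for the Web: IRI -> triples found at that IRI.
WEB = {
    "user1": [("user1", "knows", "user2"), ("user1", "knows", "user3")],
    "user2": [],
    "user3": [("user3", "knows", "user4")],
    "user4": [("user4", "knows", "user5")],
    "user5": [],
}


def grab(start_iris, fetch, follow_predicates, depth, limit):
    """Fetch resources reachable from start_iris via the given predicates.

    depth bounds the number of hops from a starting IRI; limit bounds the
    total number of resources fetched.
    """
    fetched = []                      # IRIs in the order they were loaded
    seen = set()
    triples = []
    queue = deque((iri, 0) for iri in start_iris)
    while queue and len(fetched) < limit:
        iri, hops = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        fetched.append(iri)
        found = fetch(iri)
        triples.extend(found)
        if hops < depth:
            for _subject, predicate, obj in found:
                if predicate in follow_predicates and obj not in seen:
                    queue.append((obj, hops + 1))
    return fetched, triples


def fetch_from_web(iri):
    return WEB.get(iri, [])
```

With `depth=1` only user1's direct contacts are loaded; raising the depth lets the walk continue to user4 and user5 until the limit is reached.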

A list of the input:grab pragmas is given below:


16.10.5.1.2. SPARQL Processor Usage Example

Network Resource Fetch can be performed directly from within the SPARQL processor.

After logging into Virtuoso's Conductor interface, the following query can be issued from the Interactive SQL (iSQL) panel:

sparql
define get:uri "http://www.ivan-herman.net/foaf.html"
define get:soft "soft"
select * from <http://mygraph> where {?s ?p ?o}

Here the sparql keyword invokes the SPARQL processor from the SQL interface and the RDF data fetched from page http://www.ivan-herman.net/foaf.html is loaded into the local RDF quad store as graph http://mygraph .

The new graph can then be queried using the basic SPARQL client normally available in a default Virtuoso installation at http://localhost:8890/sparql/. e.g.:

select * from <http://mygraph> where {?s ?p ?o}
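
For clients outside the Conductor, the same query can be sent to the endpoint over HTTP. The sketch below builds (but does not send) a GET request using the standard SPARQL Protocol `query` parameter; the function name and the default Accept value are illustrative assumptions, and the endpoint address is the default one mentioned above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_get_request(endpoint, query, accept="application/sparql-results+json"):
    """Build a SPARQL Protocol GET request for the given endpoint and query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": accept})


request = sparql_get_request(
    "http://localhost:8890/sparql/",
    "select * from <http://mygraph> where {?s ?p ?o}",
)
# To execute it against a running server:
#     import urllib.request
#     body = urllib.request.urlopen(request).read()
```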


16.10.5.2. RDF Proxy Service

The Sponger's functionality is also exposed via an in-built REST style Web service. This web service takes a target URL and either returns the content "as is" or tries to transform (by sponging) to RDF. Thus, the proxy service can be used as a 'pipe' for RDF browsers to browse non-RDF sources.

When the cartridges package is installed, Virtuoso reserves the path '/about/[id|data|rdf|html]/http/' for Sponger Proxy URI Service. For example, if a Virtuoso installation on host example.com listens for HTTP requests on port 8080 then client applications should use a 'service endpoint' string equal to 'http://example.com:8080/about/[id|data|rdf|html]/http/'. If the cartridges package is not installed, then the service uses the path '/proxy/rdf/'.

Note: The old Sponger Proxy URI Service pattern '/proxy/' is now deprecated.
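
A client can assemble a proxy URI from the service endpoint and the target URL. The Python sketch below follows the path pattern given above; the helper name is hypothetical, and the handling of the target's scheme is inferred from the '/about/.../http/' pattern rather than taken from a specification. Query strings and fragments on the target URL are ignored here.

```python
from urllib.parse import urlsplit

MODES = ("id", "data", "rdf", "html")


def sponger_proxy_url(endpoint, target_url, mode="html"):
    """Return the /about/<mode>/<scheme>/<host><path> URI for target_url."""
    if mode not in MODES:
        raise ValueError("mode must be one of %s" % ", ".join(MODES))
    parts = urlsplit(target_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("target_url must be an absolute URL")
    return "%s/about/%s/%s/%s%s" % (
        endpoint.rstrip("/"), mode, parts.scheme, parts.netloc, parts.path)
```

For the installation described above, `sponger_proxy_url("http://example.com:8080", "http://www.ivan-herman.net/foaf.html", "rdf")` yields a URI under 'http://example.com:8080/about/rdf/http/'.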

16.10.5.2.1. Example 1

The following URLs return information about musician John Cale, gleaned from the MusicBrainz music metadatabase, rendered as RDF or HTML respectively. (The Network Resource fetched data is available in the HTML rendering through the foaf:primaryTopic property.)


16.10.5.2.2. Example 2

The file http://www.ivan-herman.net/foaf.html contains a short profile of the W3C Semantic Web Activity Lead Ivan Herman. This XHTML file contains RDF embedded as RDFa. Running the file through the Sponger via Virtuoso's RDF proxy service extracts the embedded FOAF data as pure RDF, as can be seen by executing:

$ curl -L -H "Accept:application/rdf+xml" http://linkeddata.uriburner.com/about/id/entity/http/www.ivan-herman.net/foaf.html
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://linkeddata.uriburner.com/about/id/http/www.ivan-herman.net/foaf.html#Person1Stat"><scovo:dimension xmlns:scovo="http://purl.org/NET/scovo#" rdf:resource="http://rdfs.org/ns/void#numberOfResources"/></rdf:Description>
  <rdf:Description rdf:nodeID="b145981159"><rdf:rest rdf:nodeID="b145981158"/></rdf:Description>
  <rdf:Description rdf:about="http://linkeddata.uriburner.com/about/id/entity/http/www.mendeley.com/profiles/ivan-herman"><foaf:accountName xmlns:foaf="http://xmlns.com/foaf/0.1/">ivan-herman</foaf:accountName></rdf:Description>
  etc ..
  <rdf:Description rdf:nodeID="b145981130"><http-voc:elementName xmlns:http-voc="http://www.w3.org/2006/http#">text/html</http-voc:elementName></rdf:Description>
</rdf:RDF>

(linkeddata.uriburner.com hosts a public Virtuoso instance.) Though this example demonstrates the action of the /about/id/entity/ service quite transparently, it is a basic and unwieldy way to view RDF. As described earlier, the OpenLink Data Explorer uses the same proxy service to provide a more polished means to extract and view fetched RDF data.


16.10.5.2.3. Usage of the Sponger Middleware via REST patterns

Delegation and proxies are part of the Internet and Web's federated architecture. Thus, developers of RESTful applications benefit immensely from the ability to leverage Sponger functionality via delegation to it as a proxy.

The following table presents list of the supported URL parameters:

Parameter | Value | Description | Example
refresh | clean | Usage: for overwriting. The 'clean' value explicitly clears the graph, i.e. it causes the Sponger to drop the cache even if the cache is marked as being fetched. Thus, if the fetched cache is left in an inconsistent state for some reason, such as a shutdown during Network Resource fetching, 'clean' is required because it does not check the cache state. Note: use with caution, as other threads may be fetching network resources at the same time. | Explicitly clear the graph
sponger:get | add | Usage: add new triples to named graphs, progressively. This is the default value for the parameter sponger:get. May be used together with refresh=<seconds> to overwrite the expiration in the cache. | Add new triples and refresh every 10 seconds
sponger:get | soft | Usage: Network Resource Fetch of data, subject to the cache invalidation mode and associated rules of the instance. May be used together with refresh=<seconds> to overwrite the expiration in the cache. | Network Resource Fetch data with option soft and refresh every 10 seconds
sponger:get | replace | Usage: replace, subject to the cache invalidation mode and rules, but coverage includes non-fetched triples if such exist in a given named graph. May be used together with refresh=<seconds> to overwrite the expiration in the cache. | Replace data and refresh every 10 seconds
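
These parameters are passed as an ordinary query string on the proxy URI. The following Python sketch appends them to a proxy URL; the function name is hypothetical, and the validation simply mirrors the values listed in the table.

```python
from urllib.parse import urlencode

GET_MODES = ("add", "soft", "replace")


def with_sponger_options(proxy_url, get=None, refresh=None):
    """Append sponger:get and/or refresh parameters to a Sponger proxy URL.

    refresh is either the string "clean" or a number of seconds.
    """
    params = {}
    if get is not None:
        if get not in GET_MODES:
            raise ValueError("sponger:get must be add, soft or replace")
        params["sponger:get"] = get
    if refresh is not None:
        if refresh != "clean" and not (isinstance(refresh, int) and refresh >= 0):
            raise ValueError("refresh must be 'clean' or a number of seconds")
        params["refresh"] = refresh
    if not params:
        return proxy_url
    # Keep the colon in "sponger:get" readable instead of percent-encoding it.
    return proxy_url + "?" + urlencode(params, safe=":")
```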



16.10.5.3. OpenLink RDF Client Applications

OpenLink currently provides two main RDF client applications:

ODE is a Linked Data explorer packaged as a Firefox plugin (support for other browsers is planned). iSPARQL is an interactive AJAX-based SPARQL query builder with support for SPARQL QBE, bundled as part of the OpenLink Ajax Toolkit (OAT). Both RIA clients utilise sponging extensively.

The ODE plugin is dual faceted - RDF data can be viewed and explored natively, through its integral RDF browser, or, as described above, rendered as HTML through ODE's 'View Page Metadata' option. The screenshots below show ODE's RDF browser being launched through the 'View Linked Data Sources' popup menu.

Figure: 16.10.5.3.1. Launching ODE's RDF browser

The RDF browser then displays RDF data fetched via the Crunchbase cartridge.

Figure: 16.10.5.3.2. ODE RDF browser displaying Crunchbase network resource fetched data

iSPARQL directs queries to the configured SPARQL endpoint. When targeting a Virtuoso /sparql service, Virtuoso-specific sponging options can be enabled through the 'Preferences' dialog box.

The iSPARQL sponger settings are appended to SPARQL queries through the 'should-sponge' query parameter. These are translated to IRI dereferencing pragmas on the server as follows:

iSPARQL sponging setting | /sparql endpoint "should-sponge" query parameter value | SPARQL processor directives
Use only local data | N/A | N/A
Retrieve remote RDF data for all missing source graphs | soft | define get:soft "soft"
Retrieve all missing remote RDF data that might be useful | grab-all | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 100
Retrieve all missing remote RDF data that might be useful including seeAlso references | grab-seealso | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 200 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso>
Try to download all referenced resources | grab-everything | define input:grab-all "yes" define input:grab-intermediate "yes" define input:grab-depth 5 define input:grab-limit 500 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso>
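
The translation in the table amounts to prepending a fixed set of pragmas to the query text. The Python sketch below reproduces that mapping on the client side for illustration; the names `PRAGMAS` and `apply_should_sponge` are hypothetical, and the server's own translation code is not shown in this document.

```python
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
FOAF_SEEALSO = "http://xmlns.com/foaf/0.1/seeAlso"


def _grab(limit, seealso=False, intermediate=False):
    """Build the input:grab-... pragma lines for one should-sponge value."""
    lines = ['define input:grab-all "yes"']
    if intermediate:
        lines.append('define input:grab-intermediate "yes"')
    lines += ["define input:grab-depth 5", "define input:grab-limit %d" % limit]
    if seealso:
        lines += ["define input:grab-seealso <%s>" % RDFS_SEEALSO,
                  "define input:grab-seealso <%s>" % FOAF_SEEALSO]
    return lines


# should-sponge value -> SPARQL processor directives, as listed in the table.
PRAGMAS = {
    "soft": ['define get:soft "soft"'],
    "grab-all": _grab(100),
    "grab-seealso": _grab(200, seealso=True),
    "grab-everything": _grab(500, seealso=True, intermediate=True),
}


def apply_should_sponge(query, value=None):
    """Prefix a query with the pragmas for a should-sponge value (None = local data only)."""
    if value is None:
        return query
    if value not in PRAGMAS:
        raise ValueError("unknown should-sponge value: %r" % value)
    return "\n".join(PRAGMAS[value] + [query])
```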


16.10.5.4. ODS-Briefcase (Virtuoso WebDAV)

ODS-Briefcase is a component of OpenLink Data Spaces (ODS), a new generation distributed collaborative application platform for creating Semantic Web presence via Data Spaces derived from weblogs, wikis, feed aggregators, photo galleries, shared bookmarks, discussion forums and more. It is also a high level interface to the Virtuoso WebDAV repository.

ODS-Briefcase offers file-sharing functionality that includes the following features:

When resources or documents are put into the ODS Briefcase and are made publicly readable (via a Unix-style +r permission or ACL setting) and the resource in question is of a supported content type, metadata is automatically extracted at file upload time.

Note: ODS-Briefcase extracts metadata from a wide array of file formats, automatically.

The extracted metadata is available in two forms, pure WebDAV and RDF (with RDF/XML or N3/Turtle serialization options), that is optionally synchronized with the underlying Virtuoso Quad Store.

All publicly readable resources in WebDAV have their owner, creation time, update time, size and tags published, plus associated content-type-dependent metadata. This WebDAV metadata is also available in RDF form as a SPARQL-queryable graph, accessible via the SPARQL protocol endpoint using the WebDAV location as the RDF data set URI (graph or data source URI).

You can also use a special RDF_Sink folder to automate the process of uploading RDF resource files into the Virtuoso Quad Store via WebDAV or raw HTTP. The properties of the special folder control whether sponging (RDFization) occurs. By default, this feature is enabled across all Virtuoso and ODS installations (with an ODS-Briefcase Data Space instance enabled).

16.10.5.4.1. Raw HTTP Example for Extracting Metadata using CURL

Username: demo
Password: demo
Source File: wine.rdf
Destination Folder: http://demo.openlinksw.com/DAV/home/demo/rdf_sink/
Content Type: application/rdf+xml

$ curl -v -T wine.rdf -H content-type:application/rdf+xml http://demo.openlinksw.com/DAV/home/demo/rdf_sink/ -u demo:demo
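
The same upload can be prepared from a script. The Python sketch below builds (but does not send) an HTTP PUT request equivalent to the curl command; the function name is hypothetical, and the use of HTTP Basic authentication is an assumption, since the server may negotiate a different scheme with curl.

```python
import base64
from urllib.request import Request


def rdf_sink_put(folder_url, file_name, body, user, password,
                 content_type="application/rdf+xml"):
    """Build a PUT request that uploads body as file_name into an rdf_sink folder."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8")).decode("ascii")
    return Request(
        folder_url.rstrip("/") + "/" + file_name,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type,
                 "Authorization": "Basic " + token},
    )


request = rdf_sink_put(
    "http://demo.openlinksw.com/DAV/home/demo/rdf_sink/",
    "wine.rdf",
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
    "demo",
    "demo",
)
# To send it: urllib.request.urlopen(request)
```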

Finally, you can also get RDF data into Virtuoso's Quad Store via WebDAV using the Virtuoso Web Crawler utility (configurable via the Virtuoso Conductor UI). This feature also provides the ability to enable or disable Sponging as depicted below.


16.10.5.4.2. Sponger and ODS-Briefcase Structured Data Extractor Interrelationship

As the Sponger and ODS-Briefcase both extract structured data, what is the relationship between these two facilities?

The principal difference between the two is that the Sponger is an RDF data crawler & generator, whereas Briefcase's structured data extractor is a WebDAV resource filter. The Briefcase structured data extractor is aimed at providing RDF data from WebDAV resources. Thus, if none of the available Sponger cartridges is able to extract metadata and produce RDF structured data, the Sponger calls upon the Briefcase extractor as the last resort in the RDF structured data generation pipeline.
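
The "last resort" relationship can be expressed as a short Python sketch. All names here are hypothetical, and the assumption that every cartridge's output is collected before the fallback is considered is a simplification of the real pipeline.

```python
def generate_rdf(url, content, cartridges, briefcase_extractor):
    """Run every cartridge; fall back to the Briefcase extractor if none yields RDF."""
    triples = []
    for cartridge in cartridges:
        triples.extend(cartridge(url, content))
    if not triples:
        # No cartridge produced anything: use the WebDAV resource filter.
        triples = list(briefcase_extractor(url, content))
    return triples


def vcard_cartridge(url, content):
    """Toy cartridge that only recognises vCard text."""
    if content.startswith("BEGIN:VCARD"):
        return [(url, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                 "http://www.w3.org/2006/vcard/ns#VCard")]
    return []


def briefcase_fallback(url, content):
    """Toy fallback that records only the size of the resource."""
    return [(url, "http://example.org/vocab#size", str(len(content)))]
```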

Figure: 16.10.5.4.2.1. Conductor's content import configuration panel

Figure: 16.10.5.4.2.2. Conductor's content import configuration panel

Figure: 16.10.5.4.2.3. Conductor's content import configuration panel

Figure: 16.10.5.4.2.4. Conductor's content import configuration panel


16.10.5.5. Directly via Virtuoso PL

Sponger cartridges are invoked through a cartridge hook which provides a Virtuoso PL entry point to the packaged functionality. Should you wish to utilize the Sponger from your own Virtuoso PL procedures, you can do so by calling these hook routines directly. Full details of the hook function prototype and how to define your own cartridges are presented here.



16.10.6. Consuming the Generated RDF Structured Data

The generated RDF-based structured data (RDF) can be consumed in a number of ways, depending on whether or not the data is persisted in Virtuoso's RDF Quad Store.

If the data is persisted, it can be queried through the Virtuoso SPARQL endpoint associated with any Virtuoso instance: /sparql. The RDF is exposed in a graph typically identified using a URL matching the source resource URL from which the RDF data was generated. Naturally, any SQL query can also access this, since SPARQL can be freely intermixed with SQL via Virtuoso's SPASQL (SPARQL inside SQL) functionality. RDF data is also accessible through Virtuoso's implementation of the URIQA protocol.

If not persisted, as is the case with the RDF Proxy Service, the data can be consumed by an RDF aware Web client, e.g. an RDF browser such as the OpenLink Data Explorer (ODE).


16.10.7. RDF Cartridges Use Cases

This section contains examples of Web resources which can be transformed by RDF Cartridges. It also states where additional setup is needed for a given cartridge, i.e. keys, account names, etc.

Service based:

GRDDL

URN handlers

Table: 16.10.7.1. URN handlers List
URN handler | Sample URI | Resource Description | Linked Data View | Linked Data Graph | Needs
LSID | urn:lsid:ubio.org:namebank:12292 | HTML Representation | Linked Data View | Data Explorer View | none
DOI | doi:10.1038/35057062 | HTML Representation | Linked Data View | Data Explorer View | Needs hslookup plugin, relevant html, pdf, xml etc. mappers enabled.
OAI | oai:dcmi.ischool.washington.edu:article/8 | HTML Representation | Linked Data View | Data Explorer View | none

16.10.7.2. SPARQL IRI Dereferencing

The Virtuoso SPARQL engine (called just SPARQL below, for brevity) supports IRI dereferencing. However, it understands only RDF data: it can retrieve only files containing RDF serialized as RDF/XML, Turtle or N3. If the format is unknown, it tries mapping with the built-in WebDAV metadata extractor. To extend this feature to Web or file resources that do not naturally contain RDF data (PDF or JPEG files, for example), the SPARQL engine provides a special mechanism called RDF mappers, which translate non-RDF data files to RDF.

For SPARQL to call an RDF mapper, the mapper must be registered for a given URL or MIME type pattern. In other words, when SPARQL receives a format it does not know during URL dereferencing, it looks in a special registry (a table) for an entry whose regular expression matches either the MIME type or the IRI. If a match is found, the mapper function is called.
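
The registry lookup can be modelled in a few lines of Python. This is an illustrative sketch, not the layout of Virtuoso's registry table: `MapperEntry`, `find_mapper` and the two hook functions are hypothetical, and the rule that the first matching entry wins is an assumption.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class MapperEntry(NamedTuple):
    pattern: str                          # regular expression
    kind: str                             # "URL" or "MIME": what the pattern is matched against
    hook: Callable[[str, bytes], list]    # mapper function producing triples


def find_mapper(registry: List[MapperEntry], url: str,
                mime_type: str) -> Optional[MapperEntry]:
    """Return the first entry whose pattern matches the URL or the MIME type."""
    for entry in registry:
        subject = url if entry.kind == "URL" else mime_type
        if re.search(entry.pattern, subject):
            return entry
    return None


def pdf_hook(url, content):
    return [(url, "http://purl.org/dc/elements/1.1/format", "application/pdf")]


def photo_hook(url, content):
    return [(url, "http://purl.org/dc/elements/1.1/format", "image/jpeg")]


REGISTRY = [
    MapperEntry(r"^application/pdf$", "MIME", pdf_hook),
    MapperEntry(r"^https?://example\.org/photos/.*\.jpe?g$", "URL", photo_hook),
]
```

A caller would look up the entry for the dereferenced resource and, if one is found, pass the URL and content to its hook.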

16.10.7.2.1. Sponger Proxy service

Sponger functionality is also exposed via Virtuoso's "/proxy/rdf/" endpoint, as an in-built REST style Web service available in any Virtuoso standard installation. This web service takes a target URL and either returns the content "as is" or tries to transform (by sponging) to RDF. Thus, the proxy service can be used as a 'pipe' for RDF browsers to browse non-RDF sources.

For more information see RDF Sponger Proxy service


16.10.7.2.2. Cache Invalidation

To clear the cache for any value of HS_LOCAL_IRI in the SYS_HTTP_SPONGE table, clear the corresponding named graph:

SPARQL clear graph <A-Named-Graph>;



16.10.8. Cartridge Architecture

16.10.8.1. What is a Cartridge?

See full description here


16.10.8.2. Extractor Cartridges

An Extractor Cartridge processes a Resource of a given format, extracting RDF according to rules appropriate to that format. External data does not come into play; only the content of the Resource fed to the Sponger.

16.10.8.2.1. Supported Standard Non-RDF Data Formats

These Cartridges handle open formats - typically community-developed, openly-documented, and freely-licensed data structures.

Cartridge | Sample URI | Resource Description | Linked Data Graph
AB Meta | example | HTML Representation | Data Explorer View
Atom | example | HTML Representation | Data Explorer View
CSV | example | HTML Representation | Data Explorer View
DC | example | HTML Representation | Data Explorer View
eRDF | example | HTML Representation | Data Explorer View
hAudio | example | HTML Representation | Data Explorer View
hCalendar | example | HTML Representation | Data Explorer View
hCard | example | HTML Representation | Data Explorer View
hListing | example | HTML Representation | Data Explorer View
hNews | example | HTML Representation | Data Explorer View
hProduct | example | HTML Representation | Data Explorer View
HR-XML | example | HTML Representation | Data Explorer View
hRecipe | example | HTML Representation | Data Explorer View
hResume | example | HTML Representation | Data Explorer View
hReview | example | HTML Representation | Data Explorer View
HTTP in RDF | example | HTML Representation | Data Explorer View
iCalendar | example | HTML Representation | Data Explorer View
Microsoft Word 2003 XML Document | example | HTML Representation | Data Explorer View
Microsoft XML Spreadsheet 2003 | example | HTML Representation | Data Explorer View
Microsoft Documents | example | HTML Representation | Data Explorer View
OData | example | HTML Representation | Data Explorer View
OO document | example | HTML Representation | Data Explorer View
OPML | example | HTML Representation | Data Explorer View
PPTX | example | HTML Representation | Data Explorer View
RDFa | example | HTML Representation | Data Explorer View
RSS | example | HTML Representation | Data Explorer View
Slidy | example | HTML Representation | Data Explorer View
vCalendar | example | HTML Representation | Data Explorer View
vCard | example | HTML Representation | Data Explorer View
WebDAV Metadata | example | HTML Representation | Data Explorer View
XBRL | example | HTML Representation | Data Explorer View
XFN Profile | example | HTML Representation | Data Explorer View
XFN Profile2 | example | HTML Representation | Data Explorer View
xHTML | example | HTML Representation | Data Explorer View
XHTML | example | HTML Representation | Data Explorer View


16.10.8.2.2. Supported Vendor-specific Non-RDF Data Formats

These Cartridges handle closed formats - typically proprietary; sometimes undocumented; possibly licensed to no-one except the format originator. Sometimes data may not be parsed as desired or expected, as many of these Cartridges have required reverse-engineering of the data format in question.

Cartridge | Needs | Sample URI | Resource Description | Linked Data Graph
Amazon | API Key | example | HTML Representation | Data Explorer View
BestBuy | API Key | example | HTML Representation | Data Explorer View
Bing | none | example | HTML Representation | Data Explorer View
Bugzillas | none | example | HTML Representation | Data Explorer View
CNET | API Key | example | HTML Representation | Data Explorer View
CrunchBase | none | example | HTML Representation | Data Explorer View
Delicious | none | example | HTML Representation | Data Explorer View
Digg | none | example | HTML Representation | Data Explorer View
Discogs | php plugin, DBpedia Extractor | example | HTML Representation | Data Explorer View
Disqus | API Key, API Account | example | HTML Representation | Data Explorer View
DOI | hslookup plugin; relevant html-, pdf-, xml-, etc., -mappers enabled | example | HTML Representation | Data Explorer View
Dublin Core | none | example | HTML Representation | Data Explorer View
eBay | account, API Key | example | HTML Representation | Data Explorer View
Evri | none | example | HTML Representation | Data Explorer View
Facebook | API key and secret, OAuth token (See details) | example | HTML Representation | Data Explorer View
Flickr | API Key | example | HTML Representation | Data Explorer View
Freebase | none | example | HTML Representation | Data Explorer View
Geonames | none | example | HTML Representation | Data Explorer View
geoURL | none | example | HTML Representation | Data Explorer View
Get Satisfaction | none | example | HTML Representation | Data Explorer View
Google+ | API key (See details) | example | HTML Representation | Data Explorer View
Google Base | none | example | HTML Representation | Data Explorer View
Google Book | none | example | HTML Representation | Data Explorer View
Google Document | none | example | HTML Representation | Data Explorer View
Google Social Graph | none | example | HTML Representation | Data Explorer View
Google Spreadsheet | none | example | HTML Representation | Data Explorer View
Hoovers | none | example | HTML Representation | Data Explorer View
ISBN | API Key | example | HTML Representation | Data Explorer View
LastFM | API Key | example | HTML Representation | Data Explorer View
LibraryThing | API Key | example | HTML Representation | Data Explorer View
LinkedIn | API key and secret, OAuth token (See details) | example | HTML Representation | Data Explorer View
LSID | none | example | HTML Representation | Data Explorer View
Meetup | API Key | example | HTML Representation | Data Explorer View
MusicBrainz | none | example | HTML Representation | Data Explorer View
Ning Metadata | none | example | HTML Representation | Data Explorer View
OAI | none | example | HTML Representation | Data Explorer View
Open Social | none | example | HTML Representation | Data Explorer View
OpenLibrary | none | example | HTML Representation | Data Explorer View
OpenStreetMap | none | example | HTML Representation | Data Explorer View
oReilly | none | example | HTML Representation | Data Explorer View
Picasa | none | example | HTML Representation | Data Explorer View
Radio Pop | none | example | HTML Representation | Data Explorer View
relLicense | none | example | HTML Representation | Data Explorer View
Revyu | none | example | HTML Representation | Data Explorer View
Rhapsody | none | example | HTML Representation | Data Explorer View
SalesForce.com | API Key, user login | example | HTML Representation | Data Explorer View
SlideShare | API Key, SharedSecret | example | HTML Representation | Data Explorer View
SlideSix | none | example | HTML Representation | Data Explorer View
SVG | none | example | HTML Representation | Data Explorer View
Tesco | none | example | HTML Representation | Data Explorer View
Tumblr | none | example | HTML Representation | Data Explorer View
TWFY (theyworkforyou) | API Key | example | HTML Representation | Data Explorer View
Twitter | API key and secret, OAuth token (See details) | example | HTML Representation | Data Explorer View
Ustream | none | example | HTML Representation | Data Explorer View
Web Resource CC (Licenses) | none | example | HTML Representation | Data Explorer View
Wikipedia | none | example | HTML Representation | Data Explorer View
xFolk | none | example | HTML Representation | Data Explorer View
Yahoo! Finance | none | example | HTML Representation | Data Explorer View
Yahoo! SearchMonkey | none | example | HTML Representation | Data Explorer View
Yahoo! Traffic Data | none | example | HTML Representation | Data Explorer View
Yahoo! Weather | none | example | HTML Representation | Data Explorer View
Yelp | none | example | HTML Representation | Data Explorer View
Youtube | none | example | HTML Representation | Data Explorer View
Zillow | none | example | HTML Representation | Data Explorer View



16.10.8.3. Meta Cartridges

A Meta Cartridge submits a Resource to a third-party Web Service for processing. The returned RDF supplements the RDF generated by Extractor Cartridges and other Meta Cartridges. Locally generated RDF may also be submitted to the third-party service, instead of or in addition to the original Resource itself.

Default Sponger behavior is for all installed Meta Cartridges to be brought to bear on all submitted Resources:

Cartridge Needs Sample URI Resource Description Linked Data Graph Notes
Alchemy API Key example HTML Representation Data Explorer View
Amazon Search for products API Key, secret example HTML Representation Data Explorer View
BBC Links example HTML Representation Data Explorer View
BestBuy Search for products API Key example HTML Representation Data Explorer View
Bing API Key example HTML Representation Data Explorer View
Bit.ly example HTML Representation Data Explorer View
CNET Search for products API Key example HTML Representation Data Explorer View
Collecta example HTML Representation Data Explorer View Check rdfs:seeAlso links found for microsoft.
Crunchbase example HTML Representation Data Explorer View
Dapper Search example HTML Representation Data Explorer View
DBpedia Meta example HTML Representation Data Explorer View
Delicious Meta User Login example HTML Representation Data Explorer View
Digg.com example HTML Representation Data Explorer View Check rdfs:seeAlso the links from Digg.com .
Discogs API Key, php plugin, DBpedia Extractor example HTML Representation Data Explorer View
Document Links example HTML Representation Data Explorer View
eBay Search for products account, API Key example HTML Representation Data Explorer View
Evri Meta example HTML Representation Data Explorer View
Facebook API Key, secret, persistent-session-id. See details example HTML Representation Data Explorer View
Flickr Search for photos API Key example HTML Representation Data Explorer View
FOAF-Search example HTML Representation Data Explorer View Check rdfs:seeAlso at: link1, link2, link3
Foursquare example HTML Representation Data Explorer View
Freebase NYTC API Key example HTML Representation Data Explorer View
Freebase NYTCF API Key example HTML Representation Data Explorer View
FriendFeed example HTML Representation Data Explorer View
Geonames Meta example HTML Representation Data Explorer View
Geopoints example HTML Representation Data Explorer View
Gowalla example HTML Representation Data Explorer View
Get Glue Meta User Login example HTML Representation Data Explorer View
Get Glue example HTML Representation Data Explorer View
Google Buzz example HTML Representation Data Explorer View
Google Book example HTML Representation Data Explorer View Check rdfs:seeAlso links like this one
Google Plus example HTML Representation Data Explorer View
Google Places example HTML Representation Data Explorer View
Google Search example HTML Representation Data Explorer View
Google Social Graph example HTML Representation Data Explorer View
Guardian API Key example HTML Representation Data Explorer View
Hoovers API Key example HTML Representation Data Explorer View
Jigsaw (company) example HTML Representation Data Explorer View Check the c:location link
Jigsaw (person) example HTML Representation Data Explorer View Check several Jigsaw search seeAlso link.
Journalisted example HTML Representation Data Explorer View
Last.FM API Key example HTML Representation Data Explorer View
LinkedIn API Key and Session Key; See details example HTML Representation Data Explorer View
Local Search example HTML Representation Data Explorer View
LOD example HTML Representation Data Explorer View
MIME Type example HTML Representation Data Explorer View
New York Times API Key example HTML Representation Data Explorer View
NPR Meta API Key example HTML Representation Data Explorer View Check rdfs:seeAlso links like: link1; link2; link3.
NYT: The Article Search example HTML Representation Data Explorer View Check the rdfs:seeAlso: link.
NYT: The TimesTags example HTML Representation Data Explorer View
OpenCalais any html page example HTML Representation Data Explorer View
Oreilly Search for products example HTML Representation Data Explorer View
Primal example HTML Representation Data Explorer View Check the set of sioc:topic and scot:hasScot.
ProgrammableWeb example HTML Representation Data Explorer View
Provenance example HTML Representation Data Explorer View
Punkt example HTML Representation Data Explorer View
RapLeaf example HTML Representation Data Explorer View
SameAs.org example HTML Representation Data Explorer View
Sindice example HTML Representation Data Explorer View
SimpleGeo example HTML Representation Data Explorer View
Technorati API Key example HTML Representation Data Explorer View
Tesco Product Search User Login, DeveloperKey, ApplicationKey example HTML Representation Data Explorer View Check set of Tesco rdfs:seeAlso links like this one.
Topsy example HTML Representation Data Explorer View Check the rdfs:seeAlso from topsy.com.
TrueKnowledge example HTML Representation Data Explorer View Check set of rdfs:seeAlso links like: link1; link2; link3.
Tweetme example HTML Representation Data Explorer View Check the rdfs:seeAlso link.
Twitter Meta User Login example HTML Representation Data Explorer View
uClassify example HTML Representation Data Explorer View
Uclassify example HTML Representation Data Explorer View Check diff langs uc:class: link
UMBEL min-score, max-results example HTML Representation Data Explorer View
USA Today Best-Selling Books example HTML Representation Data Explorer View
Ustream example HTML Representation Data Explorer View Check rdfs:seeAlso links like: this one.
Virtuoso Faceted Web Service example HTML Representation Data Explorer View
voID Statistics example HTML Representation Data Explorer View
whoisi? example HTML Representation Data Explorer View
World Bank API Key example HTML Representation Data Explorer View
WorldCat Basic Search example HTML Representation Data Explorer View Check seeAlso links like this one.
xISBN API Key example HTML Representation Data Explorer View Check set of owl:sameAs links: link1; link2.
XRD example HTML Representation Data Explorer View
Yahoo BOSS API Key example HTML Representation Data Explorer View
Yahoo Geocode API Key example HTML Representation Data Explorer View
Yelp Search for business API Key example HTML Representation Data Explorer View
Zemanta API Key, min-score, max-results example HTML Representation Data Explorer View
Zillow API Key example HTML Representation Data Explorer View

16.10.8.3.2. Meta Cartridge Usage via REST Request

Description.vsp underlies the /about/html/ page, and accepts the parameters described below.

@Lookup@: The type of lookup. The "Lookup" name is chosen to distinguish between parameters belonging to the URL being processed and parameters for the Sponger. Accepted values:

  No value (@Lookup@=): Everything works as if the parameter were not present. Example: refresh the graph with all current cartridges, of either type.
  0: NLP meta only. Example: execute only NLP meta extraction.
  -2: Keywords-based only. Example: execute only keywords-based meta extraction.
  x,y...: A list of meta cartridges to be executed, identified by their unique IDs. The ID column can be found in Conductor -> Linked Data -> Sponger -> Meta Cartridges. Example: execute only the CNET (ID=19) and NYT: The TimesTags (ID=22) meta cartridges.

refresh=0,1,2 etc.: Used for cache invalidation. When set to 1 or a larger number N, it adds get:refresh "N" (an explicit refresh interval in seconds) as a directive to the Sponger. A refresh of zero ("0") seconds will make a new graph on the next lookup with the '@Lookup@' parameter value. Example: refresh the graph with all current cartridges.

refresh=clean: Used for overwriting. The 'clean' value explicitly clears the graph, i.e. it causes the Sponger to drop the cache even if the cache is marked as currently being fetched. Thus, if the cache for a fetched network resource has for some reason been left in an inconsistent state (for example, by a shutdown during fetching), 'clean' is required because it does not check the cache state. Note: it must be used with caution, as other threads may be fetching the same network resource at the same time.
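As a rough illustration of how these parameters combine, the Python sketch below assembles a request URL for the /about/html/ page. The host name (borrowed from the URIBurner example link that appears later in this guide), the helper name sponger_url, and the way the resource URL is appended to /about/html/ are assumptions; check them against your own instance.

```python
def sponger_url(resource, lookup=None, refresh=None,
                host="http://linkeddata.uriburner.com"):
    """Build a description.vsp request URL (hypothetical helper).

    lookup  -- None (omit the parameter), "" (no value), an integer such
               as 0 or -2, or a list of meta cartridge IDs such as [19, 22]
    refresh -- None (omit the parameter), a number of seconds, or "clean"
    """
    params = []
    if lookup is not None:
        if isinstance(lookup, (list, tuple)):
            lookup = ",".join(str(i) for i in lookup)
        params.append("@Lookup@=%s" % lookup)
    if refresh is not None:
        params.append("refresh=%s" % refresh)
    url = "%s/about/html/%s" % (host, resource)
    if params:
        url += "?" + "&".join(params)
    return url


# Only CNET (ID=19) and NYT: The TimesTags (ID=22), with a refresh of 0 seconds
print(sponger_url("http://www.news.com", lookup=[19, 22], refresh=0))
```

The resulting string can be passed to any HTTP client, such as cURL or a browser.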


16.10.8.3.3. Meta Cartridges Parametrized Examples

All examples in the table below start from the same Resource, http://www.news.com, and submit it to the Sponger for processing with the single listed Meta Cartridge.

It can be informative to start by seeing what the results would be with no Meta Cartridges at all.

If you have a lot of time to spare, you may want to see what the results would be with all Meta Cartridges combined. As may be obvious, this must wait for each of the above services to respond, so it may take quite some time to return.

Cartridge URL Pattern Example
Alchemy @Lookup@=8&refresh=0 cURL example
Amazon Search for products @Lookup@=13&refresh=0 cURL example
BBC @Lookup@=1665&refresh=0 cURL example
BestBuy Search for products @Lookup@=14&refresh=0 cURL example
Bing @Lookup@=11&refresh=0 cURL example
Bit.ly @Lookup@=915&refresh=0 cURL example
CNET @Lookup@=19&refresh=0 cURL example
Crunchbase @Lookup@=839&refresh=0 cURL example
Dapper @Lookup@=243&refresh=0 cURL example
DBpedia @Lookup@=26&refresh=0 cURL example
Delicious Meta @Lookup@=23&refresh=0 cURL example
Discogs @Lookup@=840&refresh=0 cURL example
Document Links @Lookup@=34&refresh=0 cURL example
eBay @Lookup@=18&refresh=0 cURL example
Evri Meta @Lookup@=3966&refresh=0 cURL example
Flickr Search for photos @Lookup@=16&refresh=0 cURL example
Freebase NYTC @Lookup@=5&refresh=0 cURL example
Freebase NYTCF @Lookup@=4&refresh=0 cURL example
Geonames Meta @Lookup@=24&refresh=0 cURL example
Geopoints @Lookup@=3731&refresh=0 cURL example
Get Glue Meta @Lookup@=25&refresh=0 cURL example
Google Search @Lookup@=1382&refresh=0 cURL example
Google Social Graph @Lookup@=30&refresh=0 cURL example
Guardian @Lookup@=28&refresh=0 cURL example
Hoovers @Lookup@=2&refresh=0 cURL example
Journalisted @Lookup@=3174&refresh=0 cURL example
Local Search @Lookup@=15&refresh=0 cURL example
LOD @Lookup@=21&refresh=0 cURL example
MIME Type @Lookup@=1029&refresh=0 cURL example
New York Times @Lookup@=22&refresh=0 cURL example
NPR Meta @Lookup@=29&refresh=0 cURL example
NYT: The Article Search @Lookup@=9&refresh=0 cURL example
NYT: The TimesTags @Lookup@=22&refresh=0 cURL example
OpenCalais @Lookup@=1&refresh=0 cURL example
Oreilly Search for products @Lookup@=17&refresh=0 cURL example
RapLeaf @Lookup@=2745&refresh=0 cURL example
SameAs.org @Lookup@=3257&refresh=0 cURL example
Sindice @Lookup@=12&refresh=0 cURL example
Technorati @Lookup@=27&refresh=0 cURL example
Tesco @Lookup@=31&refresh=0 cURL example
TrueKnowledge @Lookup@=3967&refresh=0 cURL example
Twitter @Lookup@=4020&refresh=0 cURL example
uClassify @Lookup@=3086&refresh=0 cURL example
UMBEL @Lookup@=6&refresh=0 cURL example
Ustream @Lookup@=3902&refresh=0 cURL example
Virtuoso Faceted Web Service @Lookup@=21&refresh=0 cURL example
voID Statistics @Lookup@=35&refresh=0 cURL example
whoisi? @Lookup@=3052&refresh=0 cURL example
World Bank @Lookup@=3&refresh=0 cURL example
XRD @Lookup@=3650&refresh=0 cURL example
Yahoo BOSS @Lookup@=10&refresh=0 cURL example
Yahoo Geocode @Lookup@=2855&refresh=0 cURL example
Yelp Search for business @Lookup@=20&refresh=0 cURL example
Zemanta @Lookup@=7&refresh=0 cURL example
Zillow @Lookup@=32&refresh=0 cURL example




16.10.9. Sponger Programmers Guide

The Sponger forms part of the extensible RDF framework built into Virtuoso Universal Server. A key component of the Sponger's pluggable architecture is its support for Sponger Cartridges, each of which consists of an Entity Extractor and an Ontology Mapper. Virtuoso bundles numerous pre-written cartridges for RDF data extraction from a wide range of data sources. However, developers are free to develop their own custom cartridges. This programmer's guide describes how.

The guide is a companion to the Virtuoso Sponger whitepaper. The latter describes the Sponger in depth, its architecture, configuration, use and integration with other Virtuoso facilities such as the Open Data Services (ODS) application framework. This guide focuses solely on custom cartridge development.

16.10.9.1. Configuration of CURIEs used by the Sponger

To configure the CURIEs used by the Sponger, which are exposed via Sponger clients such as "description.vsp" (the VSP-based information resource description utility), use the xml_set_ns_decl function.

The following example adds a CURIE pattern:

-- Example link: http://linkeddata.uriburner.com/about/rdf/http://twitter.com/guykawasaki/status/1144945513#this
XML_SET_NS_DECL ('uriburner',
                 'http://linkeddata.uriburner.com/about/rdf/http://',
                 2);
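The effect of such a declaration can be pictured as a simple prefix substitution. The Python sketch below is only a model of that idea, not Virtuoso's implementation: the NAMESPACES dictionary and the to_curie helper are hypothetical, and the meaning of the third argument to XML_SET_NS_DECL is not modelled.

```python
# Hypothetical prefix table mirroring the XML_SET_NS_DECL call above
NAMESPACES = {
    "uriburner": "http://linkeddata.uriburner.com/about/rdf/http://",
}


def to_curie(iri, namespaces=NAMESPACES):
    """Return prefix:local for the longest matching namespace, else the IRI."""
    best = None
    for prefix, ns in namespaces.items():
        if iri.startswith(ns) and (best is None or len(ns) > len(best[1])):
            best = (prefix, ns)
    if best is None:
        return iri
    return best[0] + ":" + iri[len(best[1]):]


link = ("http://linkeddata.uriburner.com/about/rdf/"
        "http://twitter.com/guykawasaki/status/1144945513#this")
print(to_curie(link))
```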

16.10.9.2. Cartridge Architecture

The Sponger is composed of cartridges, each of which consists of an entity extractor and an ontology mapper. Entities extracted from non-RDF resources are used as the basis for generating structured data by mapping them to a suitable ontology. A cartridge is invoked through its cartridge hook, a Virtuoso/PL procedure that serves as the entry point and binds to the cartridge's entity extractor and ontology mapper.

Entity Extractor

When an RDF-aware client requests data from a network-accessible resource via the Sponger, the following events occur:

Extraction Pipeline

Depending on the file or format type detected at ingest, the Sponger applies the appropriate entity extractor. Detection occurs at the time of content negotiation instigated by the retrieval user agent. The normal extraction pipeline processing is as follows:

Ontology Mapper

Sponger ontology mappers perform the task of generating RDF instance data from extracted (non-RDF) entities using ontologies associated with a given data source type. They are typically based on XSLT (using GRDDL or an in-built Virtuoso mapping scheme) or on Virtuoso/PL. Virtuoso comes preconfigured with a large range of ontology mappers contained in one or more Sponger cartridges.

Cartridge Registry

To be recognized by the SPARQL engine, a Sponger cartridge must be registered in the Cartridge Registry by adding a record to the table DB.DBA.SYS_RDF_MAPPERS, either manually via DML, or more easily through Conductor, Virtuoso's browser-based administration console, which provides a UI for adding your own cartridges. (Sponger configuration using Conductor is described in detail later.) The SYS_RDF_MAPPERS table definition is as follows:

create table "DB"."DBA"."SYS_RDF_MAPPERS"
(
"RM_ID" INTEGER IDENTITY,  -- cartridge ID. Determines the order of the cartridge's invocation in the Sponger processing chain
"RM_PATTERN" VARCHAR,  -- a REGEX pattern to match the resource URL or MIME type
"RM_TYPE" VARCHAR,  -- which property of the current resource to match: "MIME" or "URL"
"RM_HOOK" VARCHAR,  -- fully qualified Virtuoso/PL function name
"RM_KEY" LONG VARCHAR,  -- API specific key to use
"RM_DESCRIPTION" LONG VARCHAR,  -- cartridge description (free text)
"RM_ENABLED" INTEGER,  -- a 0 or 1 integer flag to exclude or include the cartridge from the Sponger processing chain
"RM_OPTIONS" ANY,  -- cartridge specific options
"RM_PID" INTEGER IDENTITY,
PRIMARY KEY ("RM_PATTERN", "RM_TYPE")
);
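To make the roles of RM_ID, RM_PATTERN, RM_TYPE and RM_ENABLED more concrete, the following Python sketch models how candidate cartridges could be selected for a resource. It is an illustration under assumptions, not the Sponger's actual dispatch code: the two patterns are quoted from this guide, but the RM_ID values, the MusicBrainz hook name and the candidates helper are invented for the example.

```python
import re
from collections import namedtuple

Mapper = namedtuple("Mapper", "rm_id rm_pattern rm_type rm_hook rm_enabled")

# Illustrative rows only; RM_ID values and the MusicBrainz hook name are assumed
REGISTRY = [
    Mapper(10, r"http://musicbrainz.org/([^/]*)/([^.]*)", "URL",
           "DB.DBA.RDF_LOAD_MBZ", 1),
    Mapper(20, r"(text/html)|(text/xml)|(application/xml)|(application/rdf.xml)",
           "MIME", "DB.DBA.RDF_LOAD_HTML_RESPONSE", 1),
]


def candidates(url, mime, registry=REGISTRY):
    """Return hooks of enabled cartridges whose pattern matches, in RM_ID order."""
    hooks = []
    for m in sorted(registry, key=lambda row: row.rm_id):
        if not m.rm_enabled:
            continue
        # RM_TYPE selects which property of the resource the pattern is applied to
        subject = url if m.rm_type == "URL" else mime
        if re.search(m.rm_pattern, subject):
            hooks.append(m.rm_hook)
    return hooks


print(candidates("http://musicbrainz.org/artist/4d5447d7.html", "text/html"))
```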

16.10.9.3. Cartridge Invocation

The Virtuoso SPARQL processor supports IRI dereferencing via the Sponger. If a SPARQL query references non-default graph URIs, the Sponger goes out (via HTTP) to fetch the network resources identified by those URIs and inserts the extracted RDF data into the local RDF quad store. The Sponger invokes the appropriate cartridge for the data source type to produce RDF instance data. If none of the registered cartridges is capable of handling the received content type, the Sponger will attempt to obtain RDF instance data via the in-built WebDAV metadata extractor.

Sponger cartridges are invoked as follows:

When the SPARQL processor dereferences a URI, it plays the role of an HTTP user agent (client) that makes a content type specific request to an HTTP server via the HTTP request's Accept headers. The following then occurs:

Meta-Cartridges

The above describes the RDF generation process for 'primary' Sponger cartridges. Virtuoso also supports another cartridge type - a 'meta-cartridge'. Meta-cartridges act as post-processors in the cartridge pipeline, augmenting entity descriptions in an RDF graph with additional information gleaned from 'lookup' data sources and web services. Meta-cartridges are described in more detail in a later section.

Figure: 16.10.9.3.1. Meta-Cartridges

16.10.9.4. Cartridges Bundled with Virtuoso

16.10.9.4.1. Cartridges VAD

Virtuoso supplies a number of prewritten cartridges for extracting RDF data from a variety of popular Web resources and file types. The cartridges are bundled as part of the cartridges_dav VAD (Virtuoso Application Distribution).

To see which cartridges are available, look at the 'Linked Data' screen in Conductor. This can be reached through the Linked Data -> Sponger -> Extractor Cartridges and Meta Cartridges menu items.

Figure: 16.10.9.4.1.1. RDF Cartridges

To check which version of the cartridges VAD is installed, or to upgrade it, refer to Conductor's 'VAD Packages' screen, reachable through the 'System Admin' > 'Packages' menu items.

The latest VADs for the closed source releases of Virtuoso can be downloaded from the downloads area on the OpenLink website. Select either the 'DBMS (WebDAV) Hosted' or 'File System Hosted' product format from the 'Distributed Collaborative Applications' section, depending on whether you want the Virtuoso application to use WebDAV or native filesystem storage. VADs for Virtuoso Open Source edition (VOS) are available for download from the VOS Wiki.


16.10.9.4.2. Example Source Code

For developers wanting example cartridge code, the most authoritative reference is the cartridges VAD source code itself. This is included as part of the VOS distribution. After downloading and unpacking the sources, the script used to create the cartridges, and the associated stylesheets can be found in:

Alternatively, you can look at the actual cartridge implementations installed in your Virtuoso instance by inspecting the cartridge hook function used by a particular cartridge. This is easily identified from the 'Cartridge name' field of Conductor's 'RDF Cartridges' screen, after selecting the cartridge of interest. The hook function code can be viewed from the 'Schema Objects' screen under the 'Database' menu, by locating the function in the 'DB' > 'Procedures' folder. Stylesheets used by the cartridges are installed in the WebDAV folder DAV/VAD/cartridges/xslt. This can be explored using Conductor's WebDAV interface. The actual rdf_mappers.sql file installed with your system can also be found in the DAV/VAD/cartridges folder.



16.10.9.5. Custom Cartridge

Virtuoso comes well supplied with a variety of Sponger cartridges and GRDDL filters. When, then, is it necessary to write your own cartridge?

In the main, writing a new cartridge should only be necessary to generate RDF from a REST-style Web service not supported by an existing cartridge, or to customize the output from an existing cartridge to your own requirements. Apart from these circumstances, the existing Sponger infrastructure should meet most of your needs. This is particularly the case for document resources.

16.10.9.5.1. Document Resources

We use the term document resource to identify content which is not being returned from a Web service. Normally it can broadly be conceived as some form of document, be it a text based entity or some form of file, for instance an image file.

In these cases, the document either contains RDF, which can be extracted directly, or it holds metadata in a supported format which can be transformed to RDF using an existing filter.

The following cases should all be covered by the existing Sponger cartridges:


16.10.9.5.2. GRDDL

GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a mechanism for deriving RDF data from XML documents and in particular XHTML pages. Document authors may associate transformation algorithms, typically expressed in XSLT, with their documents to transform embedded metadata into RDF.

The cartridges VAD installs a number of GRDDL filters for transforming popular microformats (such as RDFa, eRDF or hCalendar) into RDF. The available filters can be viewed, or configured, in Conductor's 'GRDDL Filters for XHTML' screen. Navigate to the 'RDF Cartridges' screen using the 'RDF' > 'RDF Cartridges' menu items, then select the 'GRDDL Mappings' tab to display the 'GRDDL Filters for XHTML' screen. GRDDL filters are held in the WebDAV folder /DAV/VAD/rdf_cartridges/xslt/ alongside other XSLT templates. The Conductor interface allows you to add new GRDDL filters should you so wish.

For an introduction to GRDDL, try the GRDDL Primer. To underline GRDDL's utility, the primer includes an example of transforming Excel spreadsheet data, saved as XML, into RDF.

A comprehensive list of stylesheets for transforming HTML and non-HTML XML dialects is maintained on the ESW Wiki. The list covers a range of microformats, syndication formats and feedlists.


To see which Web Services are already catered for, view the list of cartridges in Conductor's 'RDF Cartridges' screen.


16.10.9.6. Creating Custom Cartridges

The Sponger is fully extensible by virtue of its pluggable cartridge architecture. New data formats can be fetched by creating new cartridges. While OpenLink is active in adding cartridges for new data sources, you are free to develop your own custom cartridges. Entity extractors can be built using Virtuoso PL, C/C++, Java or any other external language supported by Virtuoso's Server Extension API. Of course, Virtuoso's own entity extractors are written in Virtuoso PL.

16.10.9.6.1. The Anatomy of a Cartridge

Cartridge Hook Prototype

Every Virtuoso PL hook function used to plug a custom Sponger cartridge into the Virtuoso SPARQL engine must have a parameter list with the following parameters (the names of the parameters are not important, but their order and presence are):

Return Value

If the hook procedure returns zero the next cartridge will be tried. If the result is negative the sponging process stops, instructing the SPARQL engine that nothing was retrieved. If the result is positive the process stops, this time instructing the SPARQL engine that RDF data was successfully retrieved.
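The return-value convention can be modelled in a few lines. The Python sketch below is a simplified illustration, not the SPARQL engine's code: run_chain and the stand-in hooks are hypothetical, and what happens once every hook has declined is not modelled.

```python
def run_chain(hooks, resource):
    """Call hooks in order; return (retrieved, name of the deciding hook)."""
    for name, hook in hooks:
        result = hook(resource)
        if result == 0:
            continue                 # zero: try the next cartridge
        # positive: RDF was retrieved; negative: stop, nothing retrieved
        return (result > 0, name)
    return (False, None)             # every hook declined


# Stand-in hooks (hypothetical); real hooks are Virtuoso/PL procedures
hooks = [
    ("declines", lambda r: 0),
    ("succeeds", lambda r: 1),
    ("never_reached", lambda r: -1),
]
print(run_chain(hooks, "http://www.news.com"))
```

In this model the third hook is never called, because the second one returns a positive value and ends the chain.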

If your cartridge should need to test whether other cartridges are configured to handle a particular data source, the following extract taken from the RDF_LOAD_CALAIS hook procedure illustrates how you might do this:

if (xd is not null)
{
  -- Sponging successful. Load the fetched network resource data into the Virtuoso Quad Store:
  DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
  flag := 1;
}

declare ord any;
ord := (SELECT RM_ID FROM DB.DBA.SYS_RDF_MAPPERS WHERE
	  RM_HOOK = 'DB.DBA.RDF_LOAD_CALAIS');
for SELECT RM_PATTERN FROM DB.DBA.SYS_RDF_MAPPERS WHERE
  RM_ID > ord and RM_TYPE = 'URL' and RM_ENABLED = 1 ORDER BY RM_ID do
{
  if (regexp_match (RM_PATTERN, new_origin_uri) is not null)
    -- try next candidate cartridge
    flag := 0;
}
return flag;

Specifying the Target Graph

Two cartridge hook function parameters contain graph IRIs, graph_iri and dest. graph_iri identifies an input graph being crawled. dest holds the IRI specified in any input:grab-destination pragma defined to control the SPARQL processor's IRI dereferencing. The pragma overrides the default behaviour and forces all retrieved triples to be stored in a single graph, irrespective of their graph of origin.

So, under some circumstances depending on how the Sponger has been invoked and whether it is being used to crawl an existing RDF graph, or derive RDF data from a non-RDF data source, dest may be null.

Consequently, when loading fetched network resource data as RDF into the quad store, cartridges typically specify the graph to receive the data using the coalesce function, which returns the first non-null parameter. For example:

DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));

Here xd is an RDF/XML string holding the fetched RDF.
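The choice of target graph amounts to "use dest if it is set, otherwise graph_iri". A minimal Python model of that rule follows; the coalesce function here is a stand-in written for the example, and the IRIs are placeholders.

```python
def coalesce(*values):
    """Return the first argument that is not None (model of SQL coalesce)."""
    for v in values:
        if v is not None:
            return v
    return None


graph_iri = "http://www.news.com"
print(coalesce(None, graph_iri))                    # dest is null
print(coalesce("http://example.org/g", graph_iri))  # dest set by the pragma
```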

Specifying & Retrieving Cartridge Specific Options

The hook function prototype allows cartridge specific data to be passed to a cartridge through the RM_OPTIONS parameter, a Virtuoso/PL vector which acts as a heterogeneous array.

In the following example, two options are passed, 'add-html-meta' and 'get-feeds' with both values set to 'no'.

insert soft DB.DBA.SYS_RDF_MAPPERS (
  RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION, RM_OPTIONS
)
values (
  '(text/html)|(text/xml)|(application/xml)|(application/rdf.xml)',
  'MIME', 'DB.DBA.RDF_LOAD_HTML_RESPONSE', null, 'xHTML',
  vector ('add-html-meta', 'no', 'get-feeds', 'no')
);

The RM_OPTIONS vector can be handled as an array of key-value pairs using the get_keyword function. get_keyword performs a case sensitive search for the given keyword at every even index of the given array. It returns the element following the keyword, i.e. the keyword value.

Using get_keyword, any options passed to the cartridge can be retrieved using an approach similar to that below:

create procedure DB.DBA.RDF_LOAD_HTML_RESPONSE (
  in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
  inout ret_body any, inout aq any, inout ps any, inout _key any,
  inout opts any )
{
  declare get_feeds, add_html_meta;
  ...
  get_feeds := add_html_meta := 0;
  if (isarray (opts) and 0 = mod (length(opts), 2))
  {
    if (get_keyword ('get-feeds', opts) = 'yes')
      get_feeds := 1;
    if (get_keyword ('add-html-meta', opts) = 'yes')
      add_html_meta := 1;
  }
  ...
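For readers less familiar with Virtuoso/PL vectors, the lookup described above can be modelled in Python as follows. This get_keyword is a stand-in written for the example (its default argument is part of the model), and the opts list mirrors the RM_OPTIONS vector inserted earlier.

```python
def get_keyword(keyword, searched_array, default=None):
    """Scan the even indexes for keyword and return the element that follows."""
    for i in range(0, len(searched_array) - 1, 2):
        if searched_array[i] == keyword:   # case-sensitive comparison
            return searched_array[i + 1]
    return default


opts = ["add-html-meta", "no", "get-feeds", "no"]

get_feeds = add_html_meta = 0
if isinstance(opts, list) and len(opts) % 2 == 0:
    if get_keyword("get-feeds", opts) == "yes":
        get_feeds = 1
    if get_keyword("add-html-meta", opts) == "yes":
        add_html_meta = 1
print(get_feeds, add_html_meta)
```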

XSLT - The Fulcrum

XSLT is the fulcrum of all OpenLink supplied cartridges. It provides the most convenient means of converting structured data extracted from web content by a cartridge's Entity Extractor into RDF.

Virtuoso's XML Infrastructure & Tools

Virtuoso's XML support and XSLT support are covered in detail in the on-line documentation. Virtuoso includes a highly capable XML parser and supports XPath, XQuery, XSLT and XML Schema validation.

Virtuoso supports extraction of XML documents from SQL datasets. A SQL long varchar, long xml or xmltype column in a database table can contain XML data as text or in a binary serialized format. A string representing a well-formed XML entity can be converted into an entity object representing the root node.

While Sponger cartridges will not normally concern themselves with handling XML extracted from SQL data, the ability to convert a string into an in-memory XML document is used extensively. The function xtree_doc(string) converts a string into such a document and returns a reference to the document's root. This document together with an appropriate stylesheet forms the input for the transformation of the extracted entities to RDF using XSLT. The input string to xtree_doc generally contains structured content derived from a web service.

Virtuoso XSLT Support

Virtuoso implements XSLT 1.0 transformations as SQL callable functions. The xslt() Virtuoso/PL function applies a given stylesheet to a given source XML document and returns the transformed document. Virtuoso provides a way to extend the abilities of the XSLT processor by creating user defined XPath functions. The functions xpf_extension() and xpf_extension_remove() allow addition and removal of XPath extension functions.

General Cartridge Pipeline

The broad pipeline outlined here reflects the steps common to most cartridges:

The MusicBrainz cartridge typifies this approach. MusicBrainz is a community music metadatabase which captures information about artists, their recorded works, and the relationships between them. Artists always have a unique ID, so the URL http://musicbrainz.org/artist/4d5447d7-c61c-4120-ba1b-d7f471d385b9.html takes you directly to entries for John Lennon.

If you were to look at this page in your browser, you would see that the information about the artist contains no RDF data. However, the cartridge is configured to intercept requests to URLs of the form http://musicbrainz.org/([^/]*)/([^.]*) and redirect them to the cartridge, which fetches all the available information on the given artist, release, track or label.

The cartridge extracts entities by redirecting to the MusicBrainz XML Web Service using as the basis for the initial query the item ID, e.g. an artist or label ID, extracted from the original URL. Stripped to its essentials, the core of the cartridge is:

webservice_uri := sprintf ('http://musicbrainz.org/ws/1/%s/%s?type=xml&inc=%U',
					kind, id, inc);
content := RDF_HTTP_URL_GET (webservice_uri, '', hdr, 'GET', 'Accept: */*');
xt := xtree_doc (content);
...
xd := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/mbz2rdf.xsl', xt);
...
xd := serialize_to_UTF8_xml (xd);
DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));

In the above outline, RDF_HTTP_URL_GET sends a query to the MusicBrainz web service, using query parameters appropriate for the original request, and retrieves the response using Network Resource Fetch.

The returned XML is parsed into an in-memory parse tree by xtree_doc. The Virtuoso/PL function RDF_MAPPER_XSLT is a simple wrapper around the function xslt which sets the current user to dba before returning an XML document transformed by an XSLT stylesheet, in this case mbz2rdf.xsl. The function serialize_to_UTF8_xml changes the character set of the in-memory XML document to UTF-8. Finally, RM_RDF_LOAD_RDFXML is a wrapper around RDF_LOAD_RDFXML which parses the content of an RDF/XML string into a sequence of RDF triples and loads them into the quad store. XSLT stylesheets are usually held in the DAV/VAD/cartridges/xslt folder of Virtuoso's WebDAV store. registry_get ('_cartridges_path_') returns the Cartridges VAD path, 'DAV/VAD/cartridges', from the Virtuoso registry.
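The first step of this pipeline, turning the requested page URL into a web service URL, can be sketched in Python. The URL pattern and the format string are those quoted above; the webservice_uri helper and the "aliases" value passed for inc are made up for the example, and quote() is assumed to be a reasonable stand-in for the %U conversion.

```python
import re
from urllib.parse import quote

PATTERN = re.compile(r"http://musicbrainz.org/([^/]*)/([^.]*)")


def webservice_uri(url, inc):
    """Map a MusicBrainz page URL to the web service URL for the same item."""
    m = PATTERN.match(url)
    if m is None:
        return None                      # not a URL this cartridge handles
    kind, item_id = m.group(1), m.group(2)
    # quote() stands in for the URL-escaping %U conversion of sprintf
    return "http://musicbrainz.org/ws/1/%s/%s?type=xml&inc=%s" % (
        kind, item_id, quote(inc, safe=""))


print(webservice_uri(
    "http://musicbrainz.org/artist/4d5447d7-c61c-4120-ba1b-d7f471d385b9.html",
    "aliases"))
```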

Error Handling with Exit Handlers

Virtuoso condition handlers determine the behaviour of a Virtuoso/PL procedure when a condition occurs. You can declare one or more condition handlers in a Virtuoso/PL procedure for general SQL conditions or specific SQLSTATE values. If a statement in your procedure raises an SQLEXCEPTION condition and you declared a handler for the specific SQLSTATE or SQLEXCEPTION condition the server passes control to that handler. If a statement in your Virtuoso/PL procedure raises an SQLEXCEPTION condition, and you have not declared a handler for the specific SQLSTATE or the SQLEXCEPTION condition, the server passes the exception to the calling procedure (if any). If the procedure call is at the top-level, then the exception is signaled to the calling client.

A number of different condition handler types can be declared (see the Virtuoso reference documentation for more details). Of these, exit handlers are probably all you will need. The example below handles any SQLSTATE. The commented-out debug statement outputs the message describing the SQLSTATE.

create procedure DB.DBA.RDF_LOAD_SOCIALGRAPH (in graph_iri varchar, ...)
{
  declare qr, path, hdr any;
  ...
  declare exit handler for sqlstate '*'
  {
    -- dbg_printf ('%s', __SQL_MESSAGE);
    return 0;
  };
  ...
  -- data extraction and mapping successful
  return 1;
}

Exit handlers are used extensively in the Virtuoso supplied cartridges. They are useful for ensuring graceful failure when trying to convert content which may not conform to your expectations. The RDF_LOAD_FEED_SIOC procedure (which is used internally by several cartridges) shown below uses this approach:

-- /* convert the feed in rss 1.0 format to sioc */
create procedure DB.DBA.RDF_LOAD_FEED_SIOC (in content any, in iri varchar, in graph_iri varchar, in is_disc int := '')
{
  declare xt, xd any;
  declare exit handler for sqlstate '*'
    {
      goto no_sioc;
    };
  xt := xtree_doc (content);
  xd := DB.DBA.RDF_MAPPER_XSLT (
      registry_get ('_cartridges_path_') || 'xslt/feed2sioc.xsl', xt,
      vector ('base', graph_iri, 'isDiscussion', is_disc));
  xd := serialize_to_UTF8_xml (xd);
  DB.DBA.RM_RDF_LOAD_RDFXML (xd, iri, graph_iri);
  return 1;
no_sioc:
  return 0;
}

Loading RDF into the Quad Store

RDF_LOAD_RDFXML & TTLP

The two main Virtuoso/PL functions used by the cartridges for loading RDF data into the Virtuoso quad store are DB.DBA.TTLP and DB.DBA.RDF_LOAD_RDFXML. Multithreaded versions of these functions, DB.DBA.TTLP_MT and DB.DBA.RDF_LOAD_RDFXML_MT, are also available.

RDF_LOAD_RDFXML parses the content of an RDF/XML string as a sequence of RDF triples and loads them into the quad store. TTLP parses TTL (Turtle or N3) and places its triples into quad storage. Ordinarily, cartridges use RDF_LOAD_RDFXML. However, there may be occasions where you want to insert statements written as TTL rather than RDF/XML, in which case you should use TTLP.

Attribution

Many of the OpenLink supplied cartridges actually use RM_RDF_LOAD_RDFXML to load data into the quad store. This is a thin wrapper around RDF_LOAD_RDFXML which includes in the generated graph an indication of the external ontologies being used. The attribution takes the form:

<ontologyURI> a opl:DataSource .
<spongedResourceURI> rdfs:isDefinedBy <ontologyURI> .
<ontologyURI> opl:hasNamespacePrefix "<ontologyPrefix>" .

where prefix opl: denotes the ontology http://www.openlinksw.com/schema/attribution#.
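The shape of the attribution statements can be illustrated outside Virtuoso. The following is a minimal Python sketch, not the code of RM_RDF_LOAD_RDFXML: the function name attribution_triples is hypothetical, and the sketch only reproduces the three-statement pattern shown above.

```python
OPL = "http://www.openlinksw.com/schema/attribution#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def attribution_triples(resource_uri, ontology_uri, ontology_prefix):
    """Return the three attribution statements as (subject, predicate, object) tuples.

    Hypothetical helper: it mirrors the pattern in the text, nothing more.
    """
    return [
        (ontology_uri, RDF_TYPE, OPL + "DataSource"),
        (resource_uri, RDFS + "isDefinedBy", ontology_uri),
        (ontology_uri, OPL + "hasNamespacePrefix", ontology_prefix),
    ]


triples = attribution_triples(
    "http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html",
    "http://purl.org/ontology/mo/",
    "mo",
)
for s, p, o in triples:
    print(s, p, o)
```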

Deleting Existing Graphs

Before loading sponged (network-resource-fetched) RDF data into a graph, you may want to delete any existing graph with the same URI. To do so, select 'RDF' > 'List of Graphs' in Conductor, then use the 'Delete' command for the appropriate graph. Alternatively, you can use one of the following SQL commands:

SPARQL CLEAR GRAPH <graph_iri>
-- or
DELETE FROM DB.DBA.RDF_QUAD WHERE G = DB.DBA.RDF_MAKE_IID_OF_QNAME (graph_iri)

Proxy Service Data Expiration

When the Proxy Service is invoked by a user agent, the Sponger records the expiry date of the imported data in the table DB.DBA.SYS_HTTP_SPONGE. The data invalidation rules conform to those of traditional HTTP clients (Web browsers). The data expiration time is determined based on subsequent data fetches of the same resource. The first data retrieval records the 'expires' header. On subsequent fetches, the current time is compared to the expiration time stored in the local cache. If HTTP 'expires' header data isn't returned by the source data server, the Sponger will derive its own expiration time by evaluating the 'date' header and 'last-modified' HTTP headers.
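The expiration logic described above can be sketched in Python. This is an illustration under stated assumptions, not the Sponger's implementation: the function name derive_expiry is hypothetical, and the fallback rule used when no 'expires' header is present (treat the copy as fresh for 10% of the interval between 'last-modified' and 'date', a heuristic common among HTTP caches) is an assumption, since the text does not state the formula the Sponger uses.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def derive_expiry(headers, fetched_at):
    """Return the time after which a cached copy should be considered stale.

    headers: dict of lower-cased HTTP response header names to values.
    fetched_at: time of the fetch, used when no header gives any guidance.
    """
    expires = headers.get("expires")
    if expires:
        try:
            return parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            pass  # malformed 'expires': fall through to the heuristic
    date = headers.get("date")
    last_modified = headers.get("last-modified")
    if date and last_modified:
        served = parsedate_to_datetime(date)
        modified = parsedate_to_datetime(last_modified)
        age = max(served - modified, timedelta(0))
        # Assumed heuristic: fresh for one tenth of the resource's age
        return served + age / 10
    return fetched_at  # nothing to go on: stale immediately


fetched = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(derive_expiry({"date": "Mon, 01 Jan 2024 00:00:00 GMT",
                     "last-modified": "Fri, 22 Dec 2023 00:00:00 GMT"}, fetched))
```

On a later fetch, the cached copy would be reused only while the current time is earlier than the value returned here.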


16.10.9.6.2. Ontology Mapping

After extracting entities from a web resource and converting them to an in-memory XML document, the entities must be transformed to the target ontology using XSLT and an appropriate stylesheet. A typical call sequence would be:

xt := xtree_doc (content);
...
xd := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/mbz2rdf.xsl', xt);

Because of the wide variation in the data mapped by cartridges, it is not possible to present a typical XSL stylesheet outline. The Examples section presented later includes detailed extracts from the MusicBrainz cartridge's stylesheet which provide a good example of how to map to an ontology. Rather than attempting to be an XSLT tutorial, the material which follows offers some general guidelines.

Passing Parameters to the XSLT Processor

Virtuoso's XSLT processor will accept default values for global parameters from the optional third argument of the xslt() function. This argument, if specified, must be a vector of parameter names and values of the form vector(name1, value1,... nameN, valueN), where name1 ... nameN must be of type varchar, and value1 ... valueN may be of any Virtuoso datatype, but may not be null.

This extract from the Crunchbase cartridge shows how parameters may be passed to the XSLT processor. The function RDF_MAPPER_XSLT (in xslt varchar, inout xt any, in params any := null) passes the parameters vector directly to xslt().

xt := DB.DBA.RDF_MAPPER_XSLT (
registry_get ('_cartridges_path_') || 'xslt/crunchbase2rdf.xsl', xt,
vector ('baseUri', coalesce (dest, graph_iri), 'base', base, 'suffix', suffix)
);

The corresponding stylesheet crunchbase2rdf.xsl retrieves the parameters baseUri, base and suffix as follows:

...
<xsl:output method="xml" indent="yes" />
  <xsl:variable name="ns">http://www.crunchbase.com/</xsl:variable>
  <xsl:param name="baseUri" />
  <xsl:param name="base"/>
  <xsl:param name="suffix"/>
  <xsl:template name="space-name">
...

An RDF Description Template

Defining A Generic Resource Description Wrapper

Many of the OpenLink cartridges create a resource description conforming to a common "wrapper" template which describes the relationship between the (usually) non-RDF source network resource being fetched and the RDF description generated by the Sponger. The wrapper is appropriate for resources which can broadly be conceived as documents. It provides a generic minimal description of the source document, but also links to the much more detailed description provided by the Sponger. So, instead of just emitting a resource description, the Sponger factors the container into the generated graph constituting the RDF description.

The template is depicted below:

Template

Figure: 16.10.9.6.2.1. Template

To generate an RDF description corresponding to the wrapper template, a stylesheet containing the following block of instructions is used. This extract is taken from the eBay cartridge's stylesheet, ebay2rdf.xsl. Many of the OpenLink cartridges follow a similar pattern.

    <xsl:param name="baseUri"/>
    ...
    <xsl:variable name="resourceURL">
	<xsl:value-of select="$baseUri"/>
    </xsl:variable>
    ...
    <xsl:template match="/">
	<rdf:RDF>
	    <rdf:Description rdf:about="{$resourceURL}">
		<rdf:type rdf:resource="&foaf;Document"/>
		<rdf:type rdf:resource="&bibo;Document"/>
		<rdf:type rdf:resource="&sioc;Container"/>
		<sioc:container_of rdf:resource="{vi:proxyIRI ($resourceURL)}"/>
		<foaf:primaryTopic rdf:resource="{vi:proxyIRI ($resourceURL)}"/>
		<dcterms:subject rdf:resource="{vi:proxyIRI ($resourceURL)}"/>
	    </rdf:Description>
	    <rdf:Description rdf:about="{vi:proxyIRI ($resourceURL)}">
		<rdf:type rdf:resource="&sioc;Item"/>
		<sioc:has_container rdf:resource="{$resourceURL}"/>
		<xsl:apply-templates/>
	    </rdf:Description>
	</rdf:RDF>
    </xsl:template>
    ...

Using SIOC as a Generic Container Model

The generic resource description wrapper just described uses SIOC to establish the container/contained relationship between the source resource and the generated graph. Although the most important classes for the generic wrapper are obviously Container and Item, SIOC provides a generic data model of containers, items, item types, and associations between items which can be combined with other vocabularies such as FOAF and Dublin Core.

SIOC defines a number of other classes, such as User, UserGroup, Role, Site, Forum and Post. A separate SIOC types module (T-SIOC) extends the SIOC Core ontology by defining subclasses and subproperties of SIOC terms. Subclasses include AddressBook, BookmarkFolder, Briefcase, EventCalendar, ImageGallery, Wiki, Weblog and BlogPost, plus many others.

OpenLink Data Spaces (ODS) uses SIOC extensively as a data space "glue" ontology to describe the base data and containment hierarchy of all the items managed by ODS applications (Data Spaces). For example, ODS-Weblog is an application of type sioc:Forum. Each ODS-Weblog application instance contains blogs of type sioct:Weblog. Each blog is a sioc:container_of posts of type sioc:Post.

Generally, when deciding how to describe resources handled by your own custom cartridge, SIOC provides a useful framework for the description which complements the SIOC-based container model adopted throughout the ODS framework.

Naming Conventions for Sponger Generated Descriptions

As can be seen from the stylesheet extract just shown, the URI of the resource description generated by the Sponger to describe the network resource being fetched is given by the function {vi:proxyIRI ($resourceURL)}, where resourceURL is the URL of the original network resource being fetched. proxyIRI is an XPath extension function defined in rdf_mappers.sql as

xpf_extension ('http://www.openlinksw.com/virtuoso/xslt/:proxyIRI', 'DB.DBA.RDF_SPONGE_PROXY_IRI');

which maps to the Virtuoso/PL procedure DB.DBA.RDF_SPONGE_PROXY_IRI. This procedure in turn generates a resource description URI which typically takes the form: http://<hostName:port>/about/html/http/<resourceURL>#this


16.10.9.6.3. Registering & Configuring Cartridges

Once you have developed a cartridge, you must register it in the Cartridge Registry to have the SPARQL processor recognize and use it. You should have compiled your cartridge hook function first by issuing a "create procedure DB.DBA.RDF_LOAD_xxx ..." command through one of Virtuoso's SQL interfaces. You can create the required Cartridge Registry entry either by adding a row to the SYS_RDF_MAPPERS table directly using SQL, or by using the Conductor UI.

Using SQL

If you choose to register your cartridge using SQL, possibly as part of a Virtuoso/PL script, the required SQL will typically mirror one of the following INSERT commands.

Below, a cartridge for OpenCalais is being installed which will be tried when the MIME type of the network resource data being fetched is one of text/plain, text/xml or text/html. (The definition of the SYS_RDF_MAPPERS table was introduced earlier in section 'Cartridge Registry'.)

insert soft DB.DBA.SYS_RDF_MAPPERS (
  RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION, RM_ENABLED)
values (
  '(text/plain)|(text/xml)|(text/html)', 'MIME', 'DB.DBA.RDF_LOAD_CALAIS',
  null, 'Opencalais', 1);

As an alternative to matching on the content's MIME type, candidate cartridges to be tried in the conversion pipeline can be identified by matching the data source URL against a URL pattern stored in the cartridge's entry in the Cartridge Registry.

insert soft DB.DBA.SYS_RDF_MAPPERS (
  RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION, RM_OPTIONS)
values (
  '(http://api.crunchbase.com/v/1/.*)|(http://www.crunchbase.com/.*)', 'URL',
  'DB.DBA.RDF_LOAD_CRUNCHBASE', null, 'CrunchBase', null);

The value of RM_ID to set depends on where in the cartridge invocation order you want to position a particular cartridge. RM_ID should be set lower than 10028 to ensure the cartridge is tried before the ODS-Briefcase (WebDAV) metadata extractor, which is always the last mapper to be tried if no preceding cartridge has been successful.

UPDATE DB.DBA.SYS_RDF_MAPPERS
SET RM_ID = 1000
WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_BIN_DOCUMENT';

Using Conductor

Cartridges can be added manually using the 'Add' panel of the 'RDF Cartridges' screen.

RDF Cartridges

Figure: 16.10.9.6.3.1. RDF Cartridges

Installing Stylesheets

Although you could place your cartridge stylesheets in any folder configured to be accessible by Virtuoso, the simplest option is to upload them to the DAV/VAD/cartridges/xslt folder using the WebDAV browser accessible from the Conductor UI.

WebDAV browser

Figure: 16.10.9.6.3.1. WebDAV browser

Should you wish to locate your stylesheets elsewhere, ensure that the DirsAllowed setting in the virtuoso.ini file is configured appropriately.


Setting API Key

Some cartridges require an API account and/or API key to be provided for accessing the required service. This can be done from the Linked Data -> Sponger tab of the Conductor by selecting the cartridge from the list provided, entering the API Account and API Key in the dialog at the bottom of the page and clicking Update to save, as indicated in the screenshot below:

Registering API Key

Figure: 16.10.9.6.3.1. Registering API Key

For example, to use the Flickr service, developers must register to obtain a key. See http://developer.yahoo.com/flickr/. In order to cater for services which require an application key, the Cartridge Registry SYS_RDF_MAPPERS table includes an RM_KEY column to store any key required for a particular service. This value is passed to the service's cartridge through the _key parameter of the cartridge hook function.

Alternatively a cartridge can store a key value in the virtuoso.ini configuration file and retrieve it in the hook function.


Flickr Cartridge

This example shows an extract from the Flickr cartridge hook function DB.DBA.RDF_LOAD_FLICKR_IMG and the use of an API key. Also, commented out, is a call to cfg_item_value() which illustrates how the API key could instead be stored and retrieved from the SPARQL section of the virtuoso.ini file.

create procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
inout _ret_body any, inout aq any, inout ps any, inout _key any,
inout opts any )
{
declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
declare exit handler for sqlstate '*'
{
 return 0;
};
tmp := sprintf_inverse (new_origin_uri,
  'http://farm%s.static.flickr.com/%s/%s_%s.%s', 0);
img_id := tmp[2];
api_key := _key;
--cfg_item_value (virtuoso_ini_path (), 'SPARQL', 'FlickrAPIkey');
if (tmp is null or length (tmp) <> 5 or not isstring (api_key))
  return 0;
url :=  sprintf('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s',img_id, api_key);
tmp := http_get (url, hdr);
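The URL parsing step in the extract above can be mimicked outside Virtuoso. The Python sketch below is an approximation of what the sprintf_inverse call achieves, using a regular expression that follows the same 'http://farm%s.static.flickr.com/%s/%s_%s.%s' layout. The function name flickr_photo_id and the sample URL are hypothetical. As in the hook function, the third captured field (index 2 in the Virtuoso vector) is taken as the photo ID.

```python
import re

# One capture group per %s in the sprintf_inverse format string
FLICKR_STATIC = re.compile(
    r"http://farm([^.]+)\.static\.flickr\.com/([^/]+)/([^_]+)_([^.]+)\.(.+)"
)


def flickr_photo_id(url):
    """Return the photo ID embedded in a Flickr static image URL, or None."""
    m = FLICKR_STATIC.fullmatch(url)
    if m is None:
        return None  # not a Flickr static URL: the cartridge would return 0 here
    return m.group(3)  # third field, i.e. tmp[2] in the hook function


print(flickr_photo_id("http://farm4.static.flickr.com/3110/2567654312_7b0a1c2d3e.jpg"))
```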


16.10.9.6.4. MusicBrainz Example: A Music Metadatabase

To illustrate some of the material presented so far, we'll delve deeper into the MusicBrainz cartridge mentioned earlier.

MusicBrainz XML Web Service

The cartridge extracts data through the MusicBrainz XML Web Service using, as the basis for the initial query, an item type and MBID (MusicBrainz ID) extracted from the original URI submitted to the RDF proxy. A range of item types are supported including artist, release and track.

Using the album "Imagine" by John Lennon as an example, a standard HTML description of the album (which has an MBID of f237e6a0-4b0e-4722-8172-66f4930198bc) can be retrieved direct from MusicBrainz using the URL:

http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html

Alternatively, information can be extracted in XML form through the web service. A description of the tracks on the album can be obtained with the query:

http://musicbrainz.org/ws/1/release/f237e6a0-4b0e-4722-8172-66f4930198bc?type=xml&inc=tracks

The XML returned by the web service is shown below (only the first two tracks are shown for brevity):

<?xml version="1.0" encoding="UTF-8"?>
  <metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#"
   xmlns:ext="http://musicbrainz.org/ns/ext-1.0#">
    <release id="f237e6a0-4b0e-4722-8172-66f4930198bc" type="Album Official" >
      <title>Imagine</title>
        <text-representation language="ENG" script="Latn"/>
        <asin>B0000457L2</asin>
        <track-list>
          <track id="b88bdafd-e675-4c6a-9681-5ea85ab99446">
            <title>Imagine</title>
            <duration>182933</duration>
          </track>
          <track id="b38ce90d-3c47-4ccd-bea2-4718c4d34b0d">
            <title>Crippled Inside</title>
            <duration>227906</duration>
          </track>
	  . . .
        </track-list>
      </release>
  </metadata>

Although, as shown above, MusicBrainz defines its own XML Metadata Format to represent music metadata, the MusicBrainz sponger converts the raw data to a subset of the Music Ontology, an RDF vocabulary which aims to provide a set of core classes and properties for describing music on the Semantic Web. Part of the subset used is depicted in the following RDF graph (representing in this case a John Cale album).

RDF graph

Figure: 16.10.9.6.4.1. RDF graph

With the prefix mo: denoting the Music Ontology at http://purl.org/ontology/mo/, it can be seen that artists are represented by instances of class mo:Artist, their albums, records etc. by instances of class mo:Release and tracks on these releases by class mo:Track. The property foaf:made links an artist and his/her releases. The property mo:track links a release with the tracks it contains.

RDF Output

An RDF description of the album can be obtained by sponging the same URL, i.e. by submitting it to the Sponger's proxy interface using the URL:

http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html

The extract below shows part of the (reorganized) RDF output returned by the Sponger for "Imagine". Only the album's title track is included.

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

<rdf:Description
 rdf:about="http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
</rdf:Description>

<rdf:Description
 rdf:about="http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html">
  <foaf:primaryTopic xmlns:foaf="http://xmlns.com/foaf/0.1/"
   rdf:resource="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this"/>
</rdf:Description>

<rdf:Description rdf:about="http://purl.org/ontology/mo/">
  <rdf:type rdf:resource="http://www.openlinksw.com/schema/attribution#DataSource"/>
</rdf:Description>
...
<rdf:Description
 rdf:about="http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html">
  <rdfs:isDefinedBy rdf:resource="http://purl.org/ontology/mo/"/>

</rdf:Description>
...
<!-- Record description -->
<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this">
  <rdf:type rdf:resource="http://purl.org/ontology/mo/Record"/>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this">
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Imagine</dc:title>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this">
  <mo:release_status xmlns:mo="http://purl.org/ontology/mo/" rdf:resource="http://purl.org/ontology/mo/official"/>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this">
  <mo:release_type xmlns:mo="http://purl.org/ontology/mo/"
   rdf:resource="http://purl.org/ontology/mo/album"/>
</rdf:Description>
<!-- Title track description -->
<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html#this">
  <mo:track xmlns:mo="http://purl.org/ontology/mo/"
   rdf:resource="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/track/b88bdafd-e675-4c6a-9681-5ea85ab99446.html#this"/>
</rdf:Description>
<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/track/b88bdafd-e675-4c6a-9681-5ea85ab99446.html#this">
  <rdf:type rdf:resource="http://purl.org/ontology/mo/Track"/>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/track/b88bdafd-e675-4c6a-9681-5ea85ab99446.html#this">
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Imagine</dc:title>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/track/b88bdafd-e675-4c6a-9681-5ea85ab99446.html#this">
  <mo:track_number xmlns:mo="http://purl.org/ontology/mo/">1</mo:track_number>
</rdf:Description>

<rdf:Description
 rdf:about="http://demo.openlinksw.com/about/rdf/http://musicbrainz.org/track/b88bdafd-e675-4c6a-9681-5ea85ab99446.html#this">
  <mo:duration xmlns:mo="http://purl.org/ontology/mo/" rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">182933</mo:duration>
</rdf:Description>
</rdf:RDF>

Cartridge Hook Function

The cartridge's hook function is listed below. It is important to note that MusicBrainz supports a variety of query types, each of which returns a different set of information, depending on the item type being queried. Full details can be found on the MusicBrainz site. The sponger cartridge is capable of handling all the query types supported by MusicBrainz and is intended to be used in a drill-down scenario, as would be the case when using an RDF browser such as the OpenLink Data Explorer (ODE). This example focuses primarily on the types release and track.

create procedure DB.DBA.RDF_LOAD_MBZ (
  in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
  inout _ret_body any, inout aq any, inout ps any, inout _key any,
  inout opts any)
{
  declare kind, id varchar;
  declare tmp, incs any;
  declare uri, cnt, hdr, inc, xd, xt varchar;
  tmp := regexp_parse ('http://musicbrainz.org/([^/]*)/([^\.]+)', new_origin_uri, 0);
  declare exit handler for sqlstate '*'
  {
    -- dbg_printf ('%s', __SQL_MESSAGE);
    return 0;
  };
  if (length (tmp) < 6)
    return 0;

  kind := subseq (new_origin_uri, tmp[2], tmp[3]);
  id :=   subseq (new_origin_uri, tmp[4], tmp[5]);
  incs := vector ();
  if (kind = 'artist')
    {
      inc := 'aliases artist-rels label-rels release-rels track-rels url-rels';
      incs :=
      	vector (
	'sa-Album', 'sa-Single', 'sa-EP', 'sa-Compilation', 'sa-Soundtrack',
	'sa-Spokenword', 'sa-Interview', 'sa-Audiobook', 'sa-Live', 'sa-Remix', 'sa-Other',
	'va-Album', 'va-Single', 'va-EP', 'va-Compilation', 'va-Soundtrack',
	'va-Spokenword', 'va-Interview', 'va-Audiobook', 'va-Live', 'va-Remix', 'va-Other'
	);
    }
  else if (kind = 'release')
    inc := 'artist counts release-events discs tracks artist-rels label-rels release-rels track-rels url-rels track-level-rels labels';
  else if (kind = 'track')
    inc := 'artist releases puids artist-rels label-rels release-rels track-rels url-rels';
  else if (kind = 'label')
    inc := 'aliases artist-rels label-rels release-rels track-rels url-rels';
  else
    return 0;
  if (dest is null)
    DELETE FROM DB.DBA.RDF_QUAD WHERE G = DB.DBA.RDF_MAKE_IID_OF_QNAME (graph_iri);
  DB.DBA.RDF_LOAD_MBZ_1 (graph_iri, new_origin_uri, dest, kind, id, inc);
  DB.DBA.TTLP (sprintf ('<%S> <http://xmlns.com/foaf/0.1/primaryTopic> <%S> .\n<%S> a <http://xmlns.com/foaf/0.1/Document> .',
  	new_origin_uri, DB.DBA.RDF_SPONGE_PROXY_IRI (new_origin_uri), new_origin_uri),
  	'', graph_iri);
  foreach (any inc1 in incs) do
    {
      DB.DBA.RDF_LOAD_MBZ_1 (graph_iri, new_origin_uri, dest, kind, id, inc1);
    }
  return 1;
};
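The first step of the hook function, splitting the sponged URL into an item type and an MBID, can be tried out with the same regular expression in Python. This is an illustrative sketch only. The names split_mbz_url and ws_url are hypothetical, and quote() merely stands in for the %U escaping applied by Virtuoso's sprintf, whose exact behaviour is not reproduced here.

```python
import re
from urllib.parse import quote

# Same pattern as the regexp_parse call in RDF_LOAD_MBZ
MBZ_URL = re.compile(r"http://musicbrainz.org/([^/]*)/([^\.]+)")


def split_mbz_url(url):
    """Return (kind, mbid) for a MusicBrainz URL, or None if it does not match."""
    m = MBZ_URL.match(url)
    return (m.group(1), m.group(2)) if m else None


def ws_url(kind, mbid, inc):
    """Build a MusicBrainz XML Web Service query URL from its three parts."""
    # quote() is an assumption standing in for Virtuoso's %U format specifier
    return "http://musicbrainz.org/ws/1/%s/%s?type=xml&inc=%s" % (kind, mbid, quote(inc))


kind, mbid = split_mbz_url(
    "http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc.html")
print(kind, mbid)
print(ws_url(kind, mbid, "tracks"))
```

For the "Imagine" release URL, this yields the item type release, the MBID quoted earlier and, with inc set to tracks, the web service query shown in the 'MusicBrainz XML Web Service' section.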

The hook function uses a subordinate procedure RDF_LOAD_MBZ_1:

create procedure DB.DBA.RDF_LOAD_MBZ_1 (in graph_iri varchar, in new_origin_uri varchar,
   in dest varchar, in kind varchar, in id varchar, in inc varchar)
{
  declare uri, cnt, xt, xd, hdr any;
  uri := sprintf ('http://musicbrainz.org/ws/1/%s/%s?type=xml&inc=%U', kind, id, inc);
  cnt := RDF_HTTP_URL_GET (uri, '', hdr, 'GET', 'Accept: */*');
  xt := xtree_doc (cnt);
  xd := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/mbz2rdf.xsl', xt,
        vector ('baseUri', new_origin_uri));
  xd := serialize_to_UTF8_xml (xd);
  DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
};

XSLT Stylesheet

The key sections of the MusicBrainz XSLT template relevant to this example are listed below. Only the sections relating to an artist, his releases, or the tracks on those releases, are shown.

<!DOCTYPE xsl:stylesheet [
<!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
<!ENTITY mo "http://purl.org/ontology/mo/">
<!ENTITY foaf "http://xmlns.com/foaf/0.1/">
<!ENTITY mmd "http://musicbrainz.org/ns/mmd-1.0#">
<!ENTITY dc "http://purl.org/dc/elements/1.1/">
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:vi="http://www.openlinksw.com/virtuoso/xslt/"
    xmlns:rdf="&rdf;"
    xmlns:rdfs="&rdfs;"
    xmlns:foaf="&foaf;"
    xmlns:mo="&mo;"
    xmlns:mmd="&mmd;"
    xmlns:dc="&dc;"
    >

    <xsl:output method="xml" indent="yes" />
    <xsl:variable name="base" select="'http://musicbrainz.org/'"/>
    <xsl:variable name="uc">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>
    <xsl:variable name="lc">abcdefghijklmnopqrstuvwxyz</xsl:variable>
    <xsl:template match="/mmd:metadata">
	<rdf:RDF>
	    <xsl:apply-templates />
	</rdf:RDF>
    </xsl:template>

    ...

    <xsl:template match="mmd:artist[@type='Person']">
	<mo:MusicArtist rdf:about="{vi:proxyIRI (concat($base,'artist/',@id,'.html'))}">
	    <foaf:name><xsl:value-of select="mmd:name"/></foaf:name>
	    <xsl:for-each select="mmd:release-list/mmd:release|mmd:relation-list[@target-type='Release']/mmd:relation/mmd:release">
		<foaf:made rdf:resource="{vi:proxyIRI (concat($base,'release/',@id,'.html'))}"/>
	    </xsl:for-each>
	</mo:MusicArtist>
	<xsl:apply-templates />
    </xsl:template>

    <xsl:template match="mmd:release">
	<mo:Record rdf:about="{vi:proxyIRI (concat($base,'release/',@id,'.html'))}">
	    <dc:title><xsl:value-of select="mmd:title"/></dc:title>
	    <mo:release_type rdf:resource="&mo;{translate (substring-before (@type, ' '),
                                                          $uc, $lc)}"/>
	    <mo:release_status rdf:resource="&mo;{translate (substring-after (@type, ' '), $uc,
                                                  $lc)}"/>
	    <xsl:for-each select="mmd:track-list/mmd:track">
		<mo:track rdf:resource="{vi:proxyIRI (concat($base,'track/',@id,'.html'))}"/>

	    </xsl:for-each>
	</mo:Record>
	<xsl:apply-templates select="mmd:track-list/mmd:track"/>
    </xsl:template>

    <xsl:template match="mmd:track">
	<mo:Track rdf:about="{vi:proxyIRI (concat($base,'track/',@id,'.html'))}">
	    <dc:title><xsl:value-of select="mmd:title"/></dc:title>
	    <mo:track_number><xsl:value-of select="position()"/></mo:track_number>
	    <mo:duration rdf:datatype="&xsd;integer">
             <xsl:value-of select="mmd:duration"/>
           </mo:duration>
	    <xsl:if test="artist[@id]">
		<foaf:maker rdf:resource="{vi:proxyIRI (concat ($base, 'artist/',
                                          artist/@id, '.html'))}"/>
	    </xsl:if>
	    <mo:musicbrainz rdf:resource="{vi:proxyIRI (concat ($base, 'track/', @id, '.html'))}"/>
	</mo:Track>
    </xsl:template>

    ...

    <xsl:template match="text()"/>
</xsl:stylesheet>
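The mmd:release template derives two Music Ontology terms from the single type attribute of a release, for example "Album Official". The Python sketch below mirrors the XPath calls substring-before, substring-after and translate to show how that attribute yields the mo:release_type and mo:release_status values seen in the RDF output earlier. The function name release_terms is hypothetical, and the namespace prefix is taken from that RDF output.

```python
import string

MO = "http://purl.org/ontology/mo/"

# Equivalent of translate(..., $uc, $lc): ASCII upper case to lower case only
TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def release_terms(type_attr):
    """Return (release_type, release_status) URIs for a MusicBrainz type attribute."""
    before, sep, after = type_attr.partition(" ")
    if not sep:
        # XPath substring-before and substring-after both return '' when the
        # separator is absent, whereas partition() returns the whole string first
        before = ""
    return MO + before.translate(TO_LOWER), MO + after.translate(TO_LOWER)


print(release_terms("Album Official"))
```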

16.10.9.6.5. Entity Extractor & Mapper Component

The Virtuoso Sponger Cartridge RDF Extractor is used to extract RDF from a Web data source. It consumes services from Virtuoso/PL, C/C++ or Java based RDF extractors.

RDF mappers provide a way to extract metadata from non-RDF documents such as HTML pages, images, Office documents etc. and pass it to the SPARQL sponger (the crawler which retrieves missing source graphs). For brevity, the rest of this section refers to an "RDF mapper" simply as a "mapper".

A mapper consists of a PL procedure (the hook) and an extractor. The extractor itself can be built using PL, C or any external language supported by the Virtuoso server.

Once the mapper is developed it must be plugged into the SPARQL engine by adding a record in the table DB.DBA.SYS_RDF_MAPPERS.

If a SPARQL query instructs the SPARQL processor to retrieve a target graph into local storage, the SPARQL sponger is invoked. If the target graph IRI represents a dereferenceable URL, the content is retrieved using content negotiation. The next step is to detect the content type.

Virtuoso/PL based Extractors

PL hook requirements:

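Every PL function used to plug a mapper into the SPARQL engine must take the following parameters, in this order: the graph IRI, the origin URI, the destination URI, the retrieved content, the asynchronous notification queue, the ping service and the keys value. The full signature is listed under 'Extension function' below.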

Note: the names of the parameters are not important, but their order and presence are!

Example Implementation:

The example script below implements a basic mapper which maps the text/plain MIME type to an imaginary ontology. The ontology extends the class Document from FOAF with the properties 'txt:UniqueWords' and 'txt:Chars', where the prefix 'txt:' stands for 'urn:txt:v0.0:'.

use DB;

create procedure DB.DBA.RDF_LOAD_TXT_META
 (
  in graph_iri varchar,
  in new_origin_uri varchar,
  in dest varchar,
  inout ret_body any,
  inout aq any,
  inout ps any,
  inout ser_key any
  )
{
  declare words, chars int;
  declare vtb, arr, subj, ses, str any;
  -- if any error we just say nothing can be done
  declare exit handler for sqlstate '*'
    {
      return 0;
    };
  subj := coalesce (dest, new_origin_uri);
  vtb := vt_batch ();
  chars := length (ret_body);

  -- using the text index procedures we get a list of words
  vt_batch_feed (vtb, ret_body, 1);
  arr := vt_batch_strings_array (vtb);

  -- the list has 'word' and positions array, so we must divide by 2
  words := length (arr) / 2;
  ses := string_output ();

  -- we compose a N3 literal
  http (sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj), ses);
  http (sprintf ('<%s> <urn:txt:v0.0:UniqueWords> "%d" .\n', subj, words), ses);
  http (sprintf ('<%s> <urn:txt:v0.0:Chars> "%d" .\n', subj, chars), ses);
  str := string_output_string (ses);

  -- we push the N3 text into the local store
  DB.DBA.TTLP (str, new_origin_uri, subj);
  return 1;
};

DELETE FROM DB.DBA.SYS_RDF_MAPPERS WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

INSERT SOFT DB.DBA.SYS_RDF_MAPPERS (RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION)
VALUES ('(text/plain)', 'MIME', 'DB.DBA.RDF_LOAD_TXT_META', null, 'Text Files (demo)');

-- here we set the order to some large number so as not to break existing mappers
update DB.DBA.SYS_RDF_MAPPERS
SET RM_ID = 2000
WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

To test the mapper, use the /sparql endpoint with the option 'Retrieve remote RDF data for all missing source graphs' to execute:

SELECT *
FROM <URL-of-a-txt-file>
WHERE { ?s ?p ?o }

It is important that the SPARQL_UPDATE role is granted to the "SPARQL" account in order to allow the local repository to be updated via the Network Resource Fetch feature.
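To make the output of the RDF_LOAD_TXT_META example easier to picture, the following Python sketch produces the same three statements for a piece of text. It is an approximation under stated assumptions: the function name text_metadata_triples is hypothetical, and the simple regular-expression tokenizer is a stand-in for Virtuoso's text index batch functions, which may split and normalize words differently.

```python
import re


def text_metadata_triples(subject, body):
    """Return N3 text describing a plain-text document: type, unique words, chars."""
    # Assumed tokenizer: case-insensitive runs of word characters
    unique_words = {w.lower() for w in re.findall(r"\w+", body)}
    lines = [
        "<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        "<http://xmlns.com/foaf/0.1/Document> ." % subject,
        '<%s> <urn:txt:v0.0:UniqueWords> "%d" .' % (subject, len(unique_words)),
        '<%s> <urn:txt:v0.0:Chars> "%d" .' % (subject, len(body)),
    ]
    return "\n".join(lines) + "\n"


print(text_metadata_triples("http://example.org/a.txt", "the cat and the hat"))
```

Running the SPARQL query above against a graph loaded by the mapper should return three triples of this shape for the text file's URL.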

Authentication in Sponger

To support user-defined authentication, additional parameters have been added to the /proxy/rdf and /sparql endpoints. To use this feature, the RDF browser and iSPARQL should send the following URL parameters:


Registry

The table DB.DBA.SYS_RDF_MAPPERS is used as registry for registering RDF mappers.

create table DB.DBA.SYS_RDF_MAPPERS (
    RM_ID integer identity,         -- mapper ID, designates order of execution
    RM_PATTERN varchar,             -- a REGEX pattern to match URL or MIME type
    RM_TYPE varchar default 'MIME', -- what property of the current resource to match: MIME or URL are supported at present
    RM_HOOK varchar,                -- fully qualified PL function name e.g. DB.DBA.MY_MAPPER_FUNCTION
    RM_KEY  long varchar,           -- API specific key to use
    RM_DESCRIPTION long varchar,    -- Mapper description, free text
    RM_ENABLED integer default 1,   -- a flag 0 or 1 integer to include or exclude the given mapper from processing chain
    primary key (RM_TYPE, RM_PATTERN))
;

Mappers are currently registered, updated and unregistered with plain DML statements, i.e. INSERT, UPDATE and DELETE.


Execution order and processing

When SPARQL retrieves a resource with unknown content, it consults the mapper registry and loops over every record whose RM_ENABLED flag is set, in ascending RM_ID order. For each record it matches either the MIME type or the URL against the RM_PATTERN value; on a match, the function named in the RM_HOOK column is called. If the function does not exist or signals an error, SPARQL moves on to the next record.

Processing stops when a mapper function returns a non-zero number. A positive value means that RDF data was extracted; a negative value means that processing stops with no RDF supplied. If zero is returned, SPARQL tries the next mapper, so a mapper can also return zero when the next mapper in the chain is expected to contribute more RDF data.

If none of the mappers matches the resource (by MIME type or by URL), the built-in WebDAV metadata extractor is called.
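
The dispatch rules above can be modelled outside Virtuoso. The following Python sketch is an illustration only, not the Sponger implementation: the `Mapper` class, the `run_mappers` function, the two-argument hook shape and the use of `re.search` for pattern matching are all assumptions made for the example. It also assumes that the fallback extractor runs only when no registry entry matched at all.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical hook shape for this sketch: (url, mime) -> -1, 0 or 1
Hook = Callable[[str, str], int]


@dataclass
class Mapper:
    """One row of the registry, reduced to the columns that drive dispatch."""
    rm_id: int
    rm_pattern: str
    rm_type: str                 # 'MIME' or 'URL'
    rm_hook: Optional[Hook]      # None models a hook function that does not exist
    rm_enabled: int = 1


def run_mappers(mappers: Iterable[Mapper], url: str, mime: str, fallback: Hook) -> int:
    """Walk enabled mappers in RM_ID order until one returns a non-zero value."""
    matched = False
    for m in sorted(mappers, key=lambda m: m.rm_id):
        if not m.rm_enabled:
            continue
        subject = mime if m.rm_type == "MIME" else url
        if re.search(m.rm_pattern, subject) is None:
            continue
        matched = True
        if m.rm_hook is None:
            continue                 # missing function: try the next record
        try:
            rc = m.rm_hook(url, mime)
        except Exception:
            continue                 # hook signalled an error: try the next record
        if rc != 0:
            return rc                # positive: RDF extracted; negative: stop, no RDF
    if not matched:
        # Assumption: the built-in extractor is used only when nothing matched.
        return fallback(url, mime)
    return 0
```

A mapper that returns 0 or fails simply hands over to the next matching record, which is why a general-purpose mapper can sit early in the chain without blocking more specific ones.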


Extension function

The mapper function is a PL stored procedure with the following signature:

THE_MAPPER_FUNCTION_NAME (
        in graph_iri varchar,
        in origin_uri varchar,
        in destination_uri varchar,
        inout content varchar,
        inout async_notification_queue any,
        inout ping_service any,
        inout keys any
        )
{
   -- do processing here
   -- return -1, 0 or 1 (as explained above in Execution order and processing section)
}
;

Parameters

Return value


Cartridges package content

Virtuoso supplies the cartridges as a VAD package, cartridges_dav.vad, for extracting RDF data from certain popular Web resources and file types. If it is not already installed, it can be installed with the VAD_INSTALL function; see the VAD chapter of the documentation for details.

HTTP-in-RDF

Maps the HTTP request and response to the HTTP Vocabulary in RDF, see http://www.w3.org/2006/http#.

This mapper is disabled by default. If it is enabled, it must be first in the order of execution.

It always returns 0, which means that other mappers are expected to push more data.

HTML

This mapper is composite: it looks for metadata that can be specified in an HTML page as follows:

The HTML page mapper looks for RDF data in the order listed above. It tries to extract metadata at each step and returns a positive flag if any step yields RDF data. If the page URL also matches another RDF mapper listed in the registry, it returns 0 so that the next mapper can extract more data. To function properly, this mapper must be executed before any other specific mappers.

Flickr URLs

This mapper extracts metadata for Flickr images using the Flickr REST API. To function properly, it must have an API key configured. The Flickr mapper expresses the metadata using the CC license, Dublin Core, Dublin Core Metadata Terms, GeoURL, FOAF and EXIF (http://www.w3.org/2003/12/exif/ns/) ontologies.

Amazon URLs

This mapper extracts metadata for Amazon articles using the Amazon REST API. It needs an Amazon API key in order to function.

eBay URLs

Uses the eBay REST API to extract metadata for eBay articles. It needs a key and a user name to be configured in order to work.

Open Office (OO) documents

OO documents contain metadata that can be extracted using UNZIP, so this extractor needs the Virtuoso unzip plugin to be configured on the server.

Yahoo traffic data URLs

Transforms the result of Yahoo traffic data to RDF.

iCal files

Transforms iCal files to RDF as per http://www.w3.org/2002/12/cal/ical# .

Binary content, PDF, PowerPoint

Unknown binary content, PDF files and MS PowerPoint files can be transformed to RDF using the Aperture framework (http://aperture.sourceforge.net/). This mapper needs Virtuoso with Java hosting support, the Aperture framework and MetaExtractor.class installed on the host system in order to work.

The Aperture framework and MetaExtractor.class must be installed on the system before the Cartridges VAD package is installed. If the package is already installed, re-installing the VAD activates this mapper.

Setting-up Virtuoso with Java hosting to run Aperture framework

To check the cartridge has been configured, connect with Virtuoso's ISQL tool:

You should now be able to fetch all the network resource document types supported by the Aperture framework (using one of the standard Sponger invocation mechanisms, for instance with a URL of the form http://localhost:8890/about/rdf/http://targethost/targetfile.pdf), subject to the MIME type pattern filters configured for the cartridge in the Conductor UI. By default the Aperture cartridge is registered to match the MIME types (application/octet-stream)|(application/pdf)|(application/mspowerpoint). To fetch all the MIME types Aperture is capable of handling, change the MIME type pattern to 'application/.*'.

Important: The installation guidelines presented above have been verified on Mac OS X with Aperture 1.2.0. Some adjustment may be needed for different operating systems or versions of Aperture.

Examples & tutorials

To learn how to write your own RDF mapper, see the Virtuoso tutorial on this subject: http://demo.openlinksw.com/tutorial/rdf/rd_s_1/rd_s_1.vsp .




16.10.9.7. Meta-Cartridges

So far the discussion has centered on 'primary' cartridges. However, Virtuoso supports an alternative type of cartridge, a 'meta-cartridge'. A meta-cartridge operates in essentially the same way as a primary cartridge: it has a cartridge hook function with the same signature, and it inserts data into the quad store through entity extraction and ontology mapping as before. Where meta-cartridges differ from primary cartridges is in their intent and their position in the cartridge invocation pipeline.

The purpose of meta-cartridges is to enrich graphs produced by other (primary) cartridges. They serve as general post-processors to add additional information about selected entities in an RDF graph. For instance, a particular meta-cartridge might be designed to search for entities of type 'umbel:Country' in a given graph, and then add additional statements about each country it finds, where the information contained in these statements is retrieved from the web service targeted by the meta-cartridge. One such example might be a 'World Bank' meta-cartridge which adds information relating to a country's GDP, its exports of goods and services as a percentage of GDP etc; retrieved using the World Bank web service API. In order to benefit from the World Bank meta-cartridge, any primary cartridge which might generate instance data relating to countries should ensure that each country instance it handles is also described as being of rdf:type 'umbel:Country'. Here, the UMBEL (Upper Mapping and Binding Exchange Layer) ontology is used as a data-source-agnostic classification system. It provides a core set of 20,000+ subject concepts which act as "a fixed set of reference points in a global knowledge space". The use of UMBEL in this way serves to decouple meta-cartridges from primary cartridges and data source specific ontologies.

Virtuoso includes two default meta-cartridges which use UMBEL and OpenCalais to augment source graphs.

Registration

Meta-cartridges must be registered in the RDF_META_CARTRIDGES table, which fulfills a role similar to the SYS_RDF_MAPPERS table used by primary cartridges. The structure of the table, and the meaning and use of its columns, are similar to SYS_RDF_MAPPERS. The meta-cartridge hook function signature is identical to that for primary cartridges.

The RDF_META_CARTRIDGES table definition is as follows:

create table DB.DBA.RDF_META_CARTRIDGES (
MC_ID INTEGER IDENTITY,   -- meta-cartridge ID. Determines the order of the
                          -- meta-cartridge's invocation in the Sponger
                          -- processing chain
MC_SEQ INTEGER IDENTITY,
MC_HOOK VARCHAR,          -- fully qualified Virtuoso/PL function name
MC_TYPE VARCHAR,
MC_PATTERN VARCHAR,       -- a REGEX pattern to match resource URL or
                          -- MIME type
MC_KEY VARCHAR,           -- API specific key to use
MC_OPTIONS ANY,           -- meta-cartridge specific options
MC_DESC LONG VARCHAR,     -- meta-cartridge description (free text)
MC_ENABLED INTEGER        -- a 0 or 1 integer flag to exclude or include
                          -- meta-cartridge from Sponger processing chain
);

(At the time of writing there is no Conductor UI for registering meta-cartridges; they must be registered using SQL. A Conductor interface for this task will be added in due course.)

Invocation

Meta-cartridges are invoked through the post-processing hook procedure RDF_LOAD_POST_PROCESS which is called, for every document retrieved, after RDF_LOAD_RDFXML loads fetched data into the Quad Store.

Cartridges in the meta-cartridge registry (RDF_META_CARTRIDGES) are configured to match a given MIME type or URI pattern. Matching meta-cartridges are invoked in order of their MC_SEQ value. Ordinarily a meta-cartridge should return 0, in which case the next meta-cartridge in the post-processing chain will be invoked. If it returns 1 or -1, the post-processing stops and no further meta-cartridges are invoked.

The order of processing by the Sponger cartridge pipeline is thus:

  1. Try to get RDF in the form of TTL or RDF/XML. If RDF is retrieved, go to step 3
  2. Try generating RDF through the Sponger primary cartridges as before
  3. Post-process the RDF using meta-cartridges in order of their MC_SEQ value. If a meta-cartridge returns 1 or -1, stop the post-processing chain.

Notice that meta-cartridges may be invoked even if primary cartridges are not.
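
As a rough illustration of the three pipeline steps, the following Python sketch models the control flow only. The function `run_pipeline` and its callback parameters are hypothetical names introduced for this example; the real processing happens inside Virtuoso and is not reproduced here.

```python
from typing import Callable, List, Tuple, Union

Step = Union[str, Tuple[str, int]]


def run_pipeline(
    fetch_rdf: Callable[[], bool],
    run_primary_cartridges: Callable[[], None],
    meta_cartridges: List[Tuple[int, Callable[[], int]]],
) -> List[Step]:
    """Return a trace of the steps taken, in order.

    fetch_rdf              -- models step 1; returns True when TTL or RDF/XML was retrieved
    run_primary_cartridges -- models step 2
    meta_cartridges        -- (MC_SEQ, hook) pairs; each hook returns -1, 0 or 1
    """
    trace: List[Step] = []
    if fetch_rdf():
        trace.append("rdf")                  # step 1 succeeded, skip step 2
    else:
        run_primary_cartridges()
        trace.append("primary")              # step 2
    # Step 3: post-process in MC_SEQ order.
    for seq, hook in sorted(meta_cartridges, key=lambda pair: pair[0]):
        rc = hook()
        trace.append(("meta", seq))
        if rc in (1, -1):
            break                            # 1 or -1 ends the post-processing chain
    return trace
```

The trace makes the final remark above concrete: when step 1 already yields RDF, the meta-cartridges still run even though no primary cartridge was invoked.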

16.10.9.7.1. Example - A Campaign Finance Meta-Cartridge for Freebase

Note

The example which follows builds on a Freebase Sponger cartridge developed prior to the announcement of Freebase's support for generating Linked Data through the endpoint http://rdf.freebase.com/ . The OpenLink cartridge has since evolved to reflect these changes. A snapshot of the Freebase cartridge and stylesheet compatible with this example can be found here.

Freebase is an open community database of the world's information which serves facts and statistics rather than articles. Its designers see this difference in emphasis from article-oriented databases as beneficial for developers wanting to use Freebase facts in other websites and applications.

Virtuoso includes a Freebase cartridge in the cartridges VAD. The aim of the example cartridge presented here is to provide a lightweight meta-cartridge that is used to conditionally add triples to graphs generated by the Freebase cartridge, if Freebase is describing a U.S. senator.

New York Times Campaign Finance (NYTCF) API

The New York Times Campaign Finance (NYTCF) API allows you to retrieve contribution and expenditure data based on United States Federal Election Commission filings. You can retrieve totals for a particular presidential candidate, see aggregates by ZIP code or state, or get details on a particular donor.

The API supports a number of query types. To keep this example from being overly long, the meta-cartridge supports just one of these - a query for the candidate details. An example query and the resulting output follow:

Query:

http://api.nytimes.com/svc/elections/us/v2/president/2008/finances/candidates/obama,barack.xml?api-key=xxxx

Result:

<result_set>
 <status>OK</status>
 <copyright>
  Copyright (c) 2008 The New York Times Company. All Rights Reserved.
 </copyright>
 <results>
  <candidate>
    <candidate_name>Obama, Barack</candidate_name>
    <committee_id>C00431445</committee_id>
    <party>D</party>
    <total_receipts>468841844</total_receipts>
    <total_disbursements>391437723.5</total_disbursements>
    <cash_on_hand>77404120</cash_on_hand>
    <net_individual_contributions>426902994</net_individual_contributions>
    <net_party_contributions>150</net_party_contributions>
    <net_pac_contributions>450</net_pac_contributions>
    <net_candidate_contributions>0</net_candidate_contributions>
    <federal_funds>0</federal_funds>
    <total_contributions_less_than_200>222694981.5</total_contributions_less_than_200>
    <total_contributions_2300>76623262</total_contributions_2300>
    <net_primary_contributions>46444638.81</net_primary_contributions>
    <net_general_contributions>30959481.19</net_general_contributions>
    <total_refunds>2058240.92</total_refunds>
    <date_coverage_from>2007-01-01</date_coverage_from>
    <date_coverage_to>2008-08-31</date_coverage_to>
  </candidate>
 </results>
</result_set>

Sponging Freebase

Using OpenLink Data Explorer

The following instructions assume you have the OpenLink Data Explorer (ODE) browser extension installed in your browser.

An HTML description of Barack Obama can be obtained directly from Freebase by pasting the following URL into your browser: http://www.freebase.com/view/en/barack_obama

To view RDF data fetched from this page, select 'Linked Data Sources' from the browser's 'View' menu. An OpenLink Data Explorer interface will load in a new tab.

Clicking on the 'Barack Obama' link under the 'Person' category displayed by ODE fetches RDF data using the Freebase cartridge. Click the 'down arrow' adjacent to the 'Barack Obama' link to explore the retrieved data.

Assuming your Virtuoso instance is running on port 8890 on localhost, the list of data caches displayed by ODE should include: http://localhost:8890/about/html/http/www.freebase.com/view/en/barack_obama#this

The information displayed in the rest of the page relates to the entity instance identified by this URI. The prefix http://localhost:8890/about/html/http/ prepended to the original URI indicates that the Sponger Proxy Service has been invoked. The Sponger creates an associated entity instance (identified by the above URI with the #this suffix) which holds the information fetched about the original entity.

Using the Command Line

As an alternative to ODE, you can perform Network Resource Fetch from the command line with the command:

curl -H "Accept: text/xml" "http://localhost:8890/about/html/http/www.freebase.com/view/en/barack_obama"

To view the results, you can use Conductor's browser-based SPARQL interface (e.g. http://localhost:8890/sparql) to query the resulting graph generated by the Sponger, http://www.freebase.com/view/en/barack_obama.

Installing the Meta-Cartridge

To register the meta-cartridge, a procedure similar to the following can be used:

create procedure INSTALL_RDF_LOAD_NYTCF ()
{
  -- delete any previous NYTCF cartridge installed as a primary cartridge
  DELETE FROM SYS_RDF_MAPPERS WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_NYTCF';
  -- register in the meta-cartridge post-processing chain
  INSERT SOFT DB.DBA.RDF_META_CARTRIDGES (MC_PATTERN, MC_TYPE, MC_HOOK,
    MC_KEY, MC_DESC, MC_OPTIONS)
    VALUES (
    'http://www.freebase.com/view/.*',
    'URL', 'DB.DBA.RDF_LOAD_NYTCF_META', '2c1d95a62e5fxxxxx', 'Freebase NYTCF',
    vector ());
};

Looking at the list of cartridges in Conductor's 'RDF Cartridges' screen, you will see that the Freebase cartridge is configured by default to perform Network Resource Fetch of URIs which match the pattern "http://www.freebase.com/view/.*". The meta-cartridge is configured to match on the same URI pattern.

To use the Campaign Finance API, you must register and request an API key. The script above shows an invalid key. Replace it with your own key before executing the procedure.

NYTCF Meta-Cartridge Functions

The meta-cartridge function definitions are listed below. They can be executed by pasting them into Conductor's iSQL interface.

-- New York Times: Campaign Finance Web Service
-- See http://developer.nytimes.com/docs/campaign_finance_api

-- DB.DBA.RDF_NYTCF_LOOKUP is in effect a lightweight lookup cartridge that is used
-- to conditionally add triples to graphs generated by the Wikipedia and
-- Freebase cartridges. These cartridges call on RDF_NYTCF_LOOKUP when
-- handling an entity of rdf:type yago:Congressman109955781. The NYTCF lookup
-- cartridge (aka a metacartridge) is used to return campaign finance data
-- for the candidate in question retrieved from the New York Times Campaign
-- Finance web service.
create procedure DB.DBA.RDF_NYTCF_LOOKUP(
  in candidate_id any, 		-- id of candidate
  in graph_iri varchar,		-- graph into which the additional campaign finance triples should be loaded
  in api_key varchar		-- NYT finance API key
)
{
  declare version, campaign_type, year any;
  declare nyt_url, hdr, tmp any;
  declare xt, xd any;

  -- Common parameters - The NYT API only supports the following values at present:
  version := 'v2';
  campaign_type := 'president';
  year := '2008';

  -- Candidate summaries
  -- nyt_url := sprintf('http://api.nytimes.com/svc/elections/us/%s/%s/%s/finances/totals.xml?api-key=%s',
  --	version, campaign_type, year, api_key);

  -- Candidate details
  nyt_url := sprintf('http://api.nytimes.com/svc/elections/us/%s/%s/%s/finances/candidates/%s.xml?api-key=%s',
  	version, campaign_type, year, candidate_id, api_key);

  tmp := http_client_ext (nyt_url, headers=>hdr, proxy=>connection_get ('sparql-get:proxy'));
  if (hdr[0] not like 'HTTP/1._ 200 %')
    signal ('22023', trim(hdr[0], '\r\n'), 'DB.DBA.RDF_LOAD_NYTCF_LOOKUP');
  xd := xtree_doc (tmp);

  -- baseUri specifies what the generated RDF description is about
  -- <rdf:Description rdf:about="{baseUri}">
  -- Example baseUri's:
  -- http://localhost:8890/about/rdf/http://www.freebase.com/view/en/barack_obama#this
  -- http://localhost:8890/about/rdf/http://www.freebase.com/view/en/hillary_rodham_clinton#this
  declare path any;
  declare lang, k, base_uri varchar;

  if (graph_iri like 'http://rdf.freebase.com/ns/%.%')
    base_uri := graph_iri;
  else
    {
      path := split_and_decode (graph_iri, 0, '%\0/');
      k := path [length(path) - 1];
      lang := path [length(path) - 2];

      base_uri := sprintf ('http://rdf.freebase.com/ns/%U.%U', lang, k);
    }

  xt := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/nytcf2rdf.xsl', xd,
      	vector ('baseUri', base_uri));
  xd := serialize_to_UTF8_xml (xt);
  DB.DBA.RDF_LOAD_RDFXML (xd, '', graph_iri);
}
;

create procedure DB.DBA.RDF_MQL_RESOURCE_IS_SENATOR (
  in fb_graph_uri varchar	-- URI of graph containing Freebase resource
)
{
  -- Check if the resource described by Freebase is a U.S. senator. Only then does it make sense to query for campaign finance
  -- data from the NYT data space.
  --
  -- To test for senators, we start by looking for two statements in the Freebase cartridge output, similar to:
  --
  -- <rdf:Description rdf:about="http://localhost:8890/about/rdf/http://www.freebase.com/view/en/hillary_rodham_clinton#this">
  --   <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  --   <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Hillary_Rodham_Clinton"/>
  --   ...
  -- where the graph generated by the Sponger will be <http://www.freebase.com/view/en/hillary_rodham_clinton>
  --
  -- To test whether a resource is a senator:
  -- 1) Check whether the Freebase resource is of rdf:type foaf:Person
  -- 2) Extract the person_name from the Wikipedia URI referenced by rdfs:seeAlso
  -- 3) Use the extracted person_name to build a URI to DBpedia's description of the person.
  -- 4) Query the DBpedia description to see if the person is of rdf:type yago:Senator110578471
  declare xp, xt, tmp any;
  declare qry varchar;			-- SPARQL query
  declare qry_uri varchar;		-- query URI
  declare qry_res varchar;		-- query result
  declare dbp_resource_name varchar;	-- Equivalent resource name in DBpedia
  declare fb_resource_uri varchar; 	-- Freebase resource URI
  declare path any;
  declare lang, k varchar;

  declare exit handler for sqlstate '*' {
    return 0;
  };

  if (fb_graph_uri like 'http://rdf.freebase.com/ns/%.%')
    fb_resource_uri := fb_graph_uri;
  else
    {
      path := split_and_decode (fb_graph_uri, 0, '%\0/');
      if (length (path) < 2)
	return 0;

      k := path [length(path) - 1];
      lang := path [length(path) - 2];

      fb_resource_uri := sprintf ('http://rdf.freebase.com/ns/%U.%U', lang, k);
    }

  -- 1) Check whether the Freebase resource is a politician from united_states
  {
    declare stat, msg varchar;
    declare mdata, rset any;

    qry := sprintf ('sparql ask from <%s> where { <%s> <http://rdf.freebase.com/ns/people.person.profession> <http://rdf.freebase.com/ns/en.politician> ; <http://rdf.freebase.com/ns/people.person.nationality> <http://rdf.freebase.com/ns/en.united_states> . }', fb_graph_uri, fb_resource_uri);
    exec (qry, stat, msg, vector(), 1, mdata, rset);
    if (length(rset) = 0 or rset[0][0] <> 1)
      return 0;
  }

  return 1;
}
;

create procedure DB.DBA.RDF_LOAD_NYTCF_META (in graph_iri varchar, in new_origin_uri varchar,  in dest varchar,
    inout _ret_body any, inout aq any, inout ps any, inout _key any, inout opts any)
{
  declare candidate_id, candidate_name any;
  declare api_key any;
  declare indx, tmp any;
  declare ord int;

  declare exit handler for sqlstate '*'
  {
    return 0;
  };

  if (not DB.DBA.RDF_MQL_RESOURCE_IS_SENATOR (new_origin_uri))
    return 0;

  -- TO DO: hardcoded for now
  -- Need a mechanism to specify API key for meta-cartridges
  -- Could retrieve from virtuoso.ini?
  api_key := _key;

  -- NYT API supports a candidate_id in one of two forms:
  -- candidate_id ::= {candidate_ID} | {last_name [,first_name]}
  -- first_name is optional. If included, there should be no space after the comma.
  --
  -- However, because this meta cartridge supplies additional triples for the
  -- Wikipedia or Freebase cartridges, only the second form of candidate_id is
  -- supported. i.e. We extract the candidate name, rather than a numeric
  -- candidate_ID (FEC committee ID) from the Wikipedia or Freebase URL.
  --
  -- It's assumed that the source URI includes the candidate's first name.
  -- If it is omitted, the NYT API will return information about *all* candidates
  -- with that last name - something we don't want.

  indx := strstr(graph_iri, 'www.freebase.com/view/en/');
  if (indx is not null)
  {
    -- extract candidate_id from Freebase URI
    tmp := sprintf_inverse(subseq(graph_iri, indx), 'www.freebase.com/view/en/%s', 0);
    if (length(tmp) <> 1)
      return 0;
    candidate_name := tmp[0];
  }
  else
  {
    indx := strstr(graph_iri, 'wikipedia.org/wiki/');
    if (indx is not null)
    {
      -- extract candidate_id from Wikipedia URI
      tmp := sprintf_inverse(subseq(graph_iri, indx), 'wikipedia.org/%s', 0);
      if (length(tmp) <> 1)
        return 0;
      candidate_name := tmp[0];
    }
    else
      {
	tmp := sprintf_inverse(graph_iri, 'http://%s.freebase.com/ns/%s/%s', 0);
	if (length (tmp) <> 3)
	  tmp := sprintf_inverse(graph_iri, 'http://%s.freebase.com/ns/%s.%s', 0);
	if (length (tmp) <> 3)
	  return 0;
	candidate_name := tmp[2];
      }
  }


  -- split candidate_name into its component parts
  --   candidate_name is assumed to be firstname_[middlename_]*lastname
  --   e.g. hillary_rodham_clinton (Freebase), Hillary_clinton (Wikipedia)
  {
    declare i, _end, len int;
    declare names, tmp_name varchar;

    names := vector ();
    tmp_name := candidate_name;
    len := length (tmp_name);
    while (1)
    {
      _end := strchr(tmp_name, '_');
      if (_end is not null)
      {
        names := vector_concat (names, vector(subseq(tmp_name, 0, _end)));
        tmp_name := subseq(tmp_name, _end + 1);
      }
      else
      {
        names := vector_concat(names, vector(tmp_name));
        goto done;
      }
    }
done:
    if (length(names) < 2)
      return 0;
    -- candidate_id ::= lastname,firstname
    candidate_id := sprintf('%s,%s', names[length(names)-1], names[0]);
  }

  DB.DBA.RDF_NYTCF_LOOKUP(candidate_id, coalesce (dest, graph_iri), api_key);
  return 0;
}
;
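
The name-splitting step near the end of DB.DBA.RDF_LOAD_NYTCF_META can be checked in isolation. The Python sketch below mirrors only that step; `candidate_id_from_name` is a hypothetical helper name, and returning `None` stands in for the procedure's early `return 0`.

```python
from typing import Optional


def candidate_id_from_name(candidate_name: str) -> Optional[str]:
    """Build 'lastname,firstname' from 'firstname_[middlename_]*lastname'.

    Returns None when the name has fewer than two parts, the case in which
    the Virtuoso/PL procedure gives up and returns 0.
    """
    names = candidate_name.split("_")
    if len(names) < 2:
        return None
    # Middle names are dropped; there is no space after the comma.
    return f"{names[-1]},{names[0]}"
```

For example, `hillary_rodham_clinton` yields `clinton,hillary`, and `barack_obama` yields `obama,barack`, the form used in the sample query URL shown earlier.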

NYTCF Meta-Cartridge Stylesheet

The XSLT stylesheet, nytcf2rdf.xsl, used by the meta-cartridge to transform the base Campaign Finance web service output to RDF is shown below. RDF_NYTCF_LOOKUP() assumes the stylesheet is located alongside the other stylesheets provided by the cartridges VAD in the Virtuoso WebDAV folder DAV/VAD/cartridges/xslt. You should create nytcf2rdf.xsl here from the following listing. The WebDAV Browser interface in Conductor provides the easiest means to upload the stylesheet.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY nyt "http://www.nytimes.com/">
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:vi="http://www.openlinksw.com/virtuoso/xslt/"
    xmlns:rdf="&rdf;"
    xmlns:nyt="&nyt;"
    >
    <xsl:output method="xml" indent="yes" />
    <xsl:template match="/result_set/status">
      <xsl:if test="text() = 'OK'">
        <xsl:apply-templates mode="ok" select="/result_set/results/candidate"/>
      </xsl:if>
    </xsl:template>

    <xsl:template match="candidate" mode="ok">
      <rdf:Description rdf:about="{vi:proxyIRI($baseUri)}">
	  <nyt:candidate_name><xsl:value-of select="candidate_name"/></nyt:candidate_name>
	  <nyt:committee_id><xsl:value-of select="committee_id"/></nyt:committee_id>
	  <nyt:party><xsl:value-of select="party"/></nyt:party>
	  <nyt:total_receipts><xsl:value-of select="total_receipts"/></nyt:total_receipts>
	  <nyt:total_disbursements>
	    <xsl:value-of select="total_disbursements"/>
	  </nyt:total_disbursements>
	  <nyt:cash_on_hand><xsl:value-of select="cash_on_hand"/></nyt:cash_on_hand>
	  <nyt:net_individual_contributions>
	    <xsl:value-of select="net_individual_contributions"/>
         </nyt:net_individual_contributions>
	  <nyt:net_party_contributions>
	    <xsl:value-of select="net_party_contributions"/>
	  </nyt:net_party_contributions>
	  <nyt:net_pac_contributions>
	    <xsl:value-of select="net_pac_contributions"/>
	  </nyt:net_pac_contributions>
	  <nyt:net_candidate_contributions>
	    <xsl:value-of select="net_candidate_contributions"/>
	  </nyt:net_candidate_contributions>
	  <nyt:federal_funds><xsl:value-of select="federal_funds"/></nyt:federal_funds>
	  <nyt:total_contributions_less_than_200>
	    <xsl:value-of select="total_contributions_less_than_200"/>
	  </nyt:total_contributions_less_than_200>
	  <nyt:total_contributions_2300>
	    <xsl:value-of select="total_contributions_2300"/>
	  </nyt:total_contributions_2300>
	  <nyt:net_primary_contributions>
	    <xsl:value-of select="net_primary_contributions"/>
	  </nyt:net_primary_contributions>
	  <nyt:net_general_contributions>
	    <xsl:value-of select="net_general_contributions"/>
	  </nyt:net_general_contributions>
	  <nyt:total_refunds><xsl:value-of select="total_refunds"/></nyt:total_refunds>
	  <nyt:date_coverage_from rdf:datatype="date">
	    <xsl:value-of select="date_coverage_from"/>
	  </nyt:date_coverage_from>
	  <nyt:date_coverage_to rdf:datatype="date">
           <xsl:value-of select="date_coverage_to"/>
          </nyt:date_coverage_to>
      </rdf:Description>
    </xsl:template>
    <xsl:template match="text()|@*"/>
</xsl:stylesheet>

The stylesheet uses the prefix nyt: (http://www.nytimes.com/) for the predicates of the augmenting triples. This has been used purely for illustration - you may prefer to define your own ontology for RDF data derived from New York Times APIs.

Testing the Meta-Cartridge

After creating the required Virtuoso/PL functions and installing the stylesheet, you should be able to test the meta-cartridge by sponging a Freebase page as described earlier using ODE or the command line. For instance:

You should see campaign finance data added to the graph created by the Sponger in the form of triples with predicates starting http://www.nytimes.com/xxx, e.g. http://www.nytimes.com/net_primary_contributions.

How The Meta-Cartridge Works

The comments in the meta-cartridge code detail how the cartridge works. In brief:

Given the URI of the graph being created by the Freebase cartridge, RDF_MQL_RESOURCE_IS_SENATOR checks if the resource described by Freebase is a U.S. senator. Only then does it make sense to query for campaign finance data from the NYTCF data space.

To test for senators, the procedure starts by looking for two statements in the Freebase cartridge output similar to:

<rdf:Description rdf:about="http://localhost:8890/about/rdf/http://www.freebase.com/view/en/barack_obama#this">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Barack_Obama"/>
   ...

where the graph generated by the Sponger will be

<http://www.freebase.com/view/en/barack_obama>

To test whether a resource is a senator, RDF_MQL_RESOURCE_IS_SENATOR

Only if this is the case is the RDF_NYTCF_LOOKUP routine called to query for and return campaign finance data for the candidate. The form of the query and the resulting XML output from the Campaign Finance service were presented earlier.



16.10.9.8. Sponger Queue API

16.10.9.8.1. Functions

16.10.9.8.2. REST Web service

The Sponger REST Web service has the following characteristics:

The service returns a JSON-encoded result giving the number of items added, for example:

{ "result":2 }

In case of error, a JSON document containing the error text is returned with HTTP status 500.

cURL example
  1. Assume a file, file.txt, that contains a URL-encoded JSON string:
    uris=%7B%20%22uris%22%3A%5B%22http%3A%2F%2Fwww.amazon.co.uk%2FHama-Stylus-Input-Apple-iPad%2Fdp%2FB003O0OM0C%22%2C%20%22http%3A%2F%2Fwww.amazon.co.uk%2FKrusell-GAIA-Case-Apple-iPad%2Fdp%2FB003QHXWWC%22%20%5D%20%7D
    
  2. Execute the following command:
    curl -i -d@file.txt http://cname/about/service?op=add
    HTTP/1.1 200 OK
    Server: Virtuoso/06.02.3129 (Darwin) i686-apple-darwin10.0.0  VDB
    Connection: Keep-Alive
    Date: Thu, 05 May 2011 12:06:24 GMT
    Accept-Ranges: bytes
    Content-Type: applcation/json; charset="UTF-8"
    Content-Length: 14
    
    { "result":2 }
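
The request body used in step 1 can be produced programmatically. The following Python sketch is one possible way to build it, assuming the service accepts any valid URL-encoded JSON object of the form shown in file.txt; `build_add_body` is a hypothetical helper, and the whitespace inside the generated JSON differs slightly from the hand-written sample.

```python
import json
from typing import Iterable
from urllib.parse import quote


def build_add_body(uris: Iterable[str]) -> str:
    """Return a form body 'uris=<url-encoded JSON>' for the op=add request."""
    payload = json.dumps({"uris": list(uris)})
    # safe="" makes quote() percent-encode '/' and ':' as well, as in the sample file.
    return "uris=" + quote(payload, safe="")


if __name__ == "__main__":
    body = build_add_body([
        "http://www.amazon.co.uk/Hama-Stylus-Input-Apple-iPad/dp/B003O0OM0C",
        "http://www.amazon.co.uk/Krusell-GAIA-Case-Apple-iPad/dp/B003QHXWWC",
    ])
    # Write the body to file.txt so it can be sent with: curl -i -d@file.txt ...
    with open("file.txt", "w", encoding="ascii") as fh:
        fh.write(body)
```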
    


16.10.9.9. Virtuoso functions usage examples

16.10.9.9.1. String Functions

sprintf_inverse

tmp := sprintf_inverse (new_origin_uri, 'http://farm%s.static.flickr.com/%s/%s_%s.%s', 0);
img_id := tmp[2];
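
To see which piece of the URL ends up in tmp[2], the format can be approximated with a regular expression. The Python sketch below is only an analogue for the %s placeholders used in this example; it does not reproduce the full behaviour of Virtuoso's sprintf_inverse, and both the function name `sprintf_inverse_sketch` and the sample URL in the usage note are made up for illustration.

```python
import re
from typing import List


def sprintf_inverse_sketch(text: str, fmt: str) -> List[str]:
    """Recover the %s fields of fmt from text; return [] when text does not fit."""
    literal_parts = fmt.split("%s")
    # Each %s becomes a lazy capture group between the escaped literal parts.
    pattern = "(.*?)".join(re.escape(part) for part in literal_parts)
    match = re.fullmatch(pattern, text)
    return list(match.groups()) if match else []
```

Applied to a URL such as http://farm4.static.flickr.com/3127/2680352177_9f1c7a2b5d.jpg with the format above, the third field (index 2) is the image id, 2680352177.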

split_and_decode

request_hdr := headers[0];
response_hdr := headers[1];
host := http_request_header (request, 'Host');
tmp := split_and_decode (request_hdr[0], 0, '\0\0 ');

http_method := tmp[0];
url := tmp[1];
protocol_version := substring (tmp[2], 6, 8);
tmp := rtrim (response_hdr[0], '\r\n');
tmp := split_and_decode (response_hdr[0], 0, '\0\0 ');

16.10.9.9.2. Retrieving URLs

http_get

url := sprintf('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s',
	img_id, api_key);
tmp := http_get (url, hdr);
if (hdr[0] not like 'HTTP/1._ 200 %')
  signal ('22023', trim(hdr[0], '\r\n'), 'RDFXX');
xd := xtree_doc (tmp);

DB.DBA.RDF_HTTP_URL_GET

A wrapper around http_get. Retrieves a URL using the specified HTTP method (defaults to GET). The function can handle proxies, redirects (up to fifteen) and HTTPS.

uri := sprintf ('http://musicbrainz.org/ws/1/%s/%s?type=xml&inc=%U',
	kind, id, inc);
cnt := RDF_HTTP_URL_GET (uri, '', hdr, 'GET', 'Accept: */*');
xt := xtree_doc (cnt);
xd := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/mbz2rdf.xsl', xt, vector ('baseUri', new_origin_uri));

http_request_header

content := RDF_HTTP_URL_GET (rdf_url, new_origin_uri, hdr, 'GET',
		'Accept: application/rdf+xml, text/rdf+n3, */*');
ret_content_type := http_request_header (hdr, 'Content-Type', null, null);

16.10.9.9.3. Handling Non-XML Response Content

json_parse: Parses JSON content into a tree.

url := sprintf ('http://www.freebase.com/api/service/mqlread?queries=%U', qr);
  content := http_get (url, hdr);
  tree := json_parse (content);
  tree := get_keyword ('ROOT', tree);
  tree := get_keyword ('result', tree);
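
The two get_keyword calls walk from the parsed response down to its result member. In Python the same navigation can be sketched with dictionary lookups. The response text below is a made-up stand-in for a real service reply, and the guard against a missing ROOT member is an addition for illustration.

```python
import json

# Hypothetical response body; a real reply would come from the HTTP call
content = '{"ROOT": {"code": "/api/status/ok", "result": [{"type": ["/people/person"]}]}}'

tree = json.loads(content)
root = tree.get("ROOT")
# Stop early when the envelope is missing instead of failing on a lookup
result = root.get("result") if isinstance(root, dict) else None
print(result)
```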

16.10.9.9.4. Writing Arbitrarily Long Text

http

-- Writing N3 to a string output stream using function http(), parsing the N3 into a graph, then loading the graph into the quad store.
ses := string_output ();
http ('@prefix opl: <http://www.openlinksw.com/schema/attribution#> .\n', ses);
http ('@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n', ses);
...
DB.DBA.TTLP (ses, base, graph);
DB.DBA.RDF_LOAD_RDFXML (strg, base, graph);

string_output

ses := string_output ();
cnt := http_get (sprintf ('http://download.finance.yahoo.com/d/quotes.csv?s=%U&f=nsbavophg&e=.csv',
    symbol));
arr := rdfm_yq_parse_csv (cnt);
http ('<quote stock="NASDAQ">', ses);
foreach (any q in arr) do
  {
    http_value (q[0], 'company', ses);
    http_value (q[1], 'symbol', ses);
    ...
  }
  http ('</quote>', ses);
  content := string_output_string (ses);
  xt := xtree_doc (content);
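
The pattern above (open a string stream, write escaped element values into it, read the whole string back, parse it) can be sketched in Python as follows. This is only an analogy: write_value is a hypothetical stand-in for http_value, and the row data is invented.

```python
from io import StringIO
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def write_value(value, tag, out):
    """Write <tag>value</tag> to the stream, escaping XML special characters."""
    out.write("<%s>%s</%s>" % (tag, escape(str(value)), tag))

# Invented sample rows: (company, symbol)
rows = [("AT&T Inc.", "T")]

ses = StringIO()
ses.write('<quote stock="NASDAQ">')
for company, symbol in rows:
    write_value(company, "company", ses)
    write_value(symbol, "symbol", ses)
ses.write("</quote>")

content = ses.getvalue()        # counterpart of string_output_string
xt = ET.fromstring(content)     # counterpart of xtree_doc
print(content)
```

Escaping each value before it is written keeps the accumulated text well formed, so the final parse does not fail on characters such as "&".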

string_output_string


16.10.9.9.5. XML & XSLT

xtree_doc

content := RDF_HTTP_URL_GET (uri, '', hdr, 'GET', 'Accept: */*');
xt := xtree_doc (content);

xpath_eval

profile := cast (xpath_eval ('/html/head/@profile', xt) as varchar);
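
For comparison, the same attribute lookup can be sketched with Python's xml.etree.ElementTree. The sample document and its profile URI are assumptions, and unlike xtree_doc this parser only accepts well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page
doc = (
    '<html><head profile="http://www.w3.org/2003/g/data-view">'
    "<title>Sample</title></head><body/></html>"
)

root = ET.fromstring(doc)
head = root.find("head")
# An empty string stands in for "no profile attribute present"
profile = head.get("profile", "") if head is not None else ""
print(profile)
```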

DB.DBA.RDF_MAPPER_XSLT

tmp := http_get (url);
xd := xtree_doc (tmp);
xt := DB.DBA.RDF_MAPPER_XSLT (
	registry_get ('_cartridges_path_') || 'xslt/atom2rdf.xsl',
	xd, vector ('baseUri', coalesce (dest, graph_iri)));

16.10.9.9.6. Character Set Conversion

serialize_to_UTF8_xml

xt := DB.DBA.RDF_MAPPER_XSLT (
	registry_get ('_cartridges_path_') || 'xslt/crunchbase2rdf.xsl',
	xt, vector ('baseUri', coalesce (dest, graph_iri), 'base', base,
	'suffix', suffix));
xd := serialize_to_UTF8_xml (xt);
DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));

16.10.9.9.7. Loading Data Into the Quad Store

DB.DBA.RDF_LOAD_RDFXML

content := RDF_HTTP_URL_GET (uri, '', hdr, 'GET', 'Accept: */*');
xt := xtree_doc (content);
xd := DB.DBA.RDF_MAPPER_XSLT (
	registry_get ('_cartridges_path_') || 'xslt/mbz2rdf.xsl',
	xt, vector ('baseUri', new_origin_uri));
xd := serialize_to_UTF8_xml (xd);
DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));

DB.DBA.TTLP

sess := string_output ();
...
http (sprintf ('<http://dbpedia.org/resource/%s>
	<http://xbrlontology.com/ontology/finance/stock_market#hasCompetitor>
	<http://dbpedia.org/resource/%s> .\n',
	symbol, x), sess);
http (sprintf ('<http://dbpedia.org/resource/%s>
	<http://www.w3.org/2000/01/rdf-schema#isDefinedBy>
	<http://finance.yahoo.com/q?s=%s> .\n',
	 x, x), sess);
content := string_output_string (sess);
DB.DBA.TTLP (content, new_origin_uri, coalesce (dest, graph_iri));

16.10.9.9.8. Debug Output

dbg_obj_print

dbg_obj_print ('try all grddl mappings here');


16.10.9.10. References

16.10.9.10.1. PingTheSemanticWeb RDF Notification Service

PingtheSemanticWeb (PTSW) is a repository for RDF documents. The PTSW web service archives the location of recently created or updated RDF documents on the Web. It is intended for use by crawlers or other types of software agents which need to know when and where the latest updated RDF documents can be found. They can request a list of recently updated documents as a starting location to crawl the Semantic Web.

You may find this service useful for publicizing your own RDF content. Content authors can notify PTSW that an RDF document has been created or updated by pinging the service with the URL of the document. The Sponger supports this facility through the async_queue and ping_service parameters of the cartridge hook function, where the ping_service parameter contains the ping service URL as configured in the SPARQL section of the virtuoso.ini file:

[SPARQL]
...
PingService = http://rpc.pingthesemanticweb.com/
...

The configured ping service can be called using an asynchronous request and the RDF_SW_PING procedure as illustrated below.

create procedure DB.DBA.RDF_LOAD_HTML_RESPONSE (
  in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
  inout ret_body any, inout async_queue any, inout ping_service any,
  inout _key any, inout opts any )
{
  ...
  if ( ... and async_queue is not null)
    aq_request (async_queue, 'DB.DBA.RDF_SW_PING',
                vector (ping_service, new_origin_uri));

For more details, refer to the section "Asynchronous Execution and Multithreading in Virtuoso/PL".


16.10.9.10.2. Main Namespaces used by OpenLink Cartridges

A list of the main namespaces / ontologies used by OpenLink-provided Sponger cartridges is given below. Some of these ontologies may prove useful when creating your own cartridges.


16.10.9.10.3. Freebase Cartridge & Stylesheet

Snapshots of the Freebase cartridge and stylesheet compatible with the meta-cartridge example presented earlier in this document can be found below.

DB.DBA.RDF_LOAD_MQL:

--no_c_escapes-
create procedure DB.DBA.RDF_LOAD_MQL (in graph_iri varchar, in new_origin_uri varchar,  in dest varchar,
    inout _ret_body any, inout aq any, inout ps any, inout _key any, inout opts any)
{
  declare qr, path, hdr any;
  declare tree, xt, xd, types any;
  declare k, cnt, url, sa varchar;

  hdr := null;
  sa := '';
  declare exit handler for sqlstate '*'
    {
      --dbg_printf ('%s', __SQL_MESSAGE);
      return 0;
    };

  path := split_and_decode (new_origin_uri, 0, '%\0/');
  if (length (path) < 1)
    return 0;
  k := path [length(path) - 1];
  if (path [length(path) - 2] = 'guid')
    k := sprintf ('"id":"/guid/%s"', k);
  else
    {
      if (k like '#%')
        k := sprintf ('"id":"%s"', k);
      else
        {
          sa := DB.DBA.RDF_MQL_GET_WIKI_URI (k);
          k := sprintf ('"key":"%s"', k);
        }
    }
  qr := sprintf ('{"ROOT":{"query":[{%s, "type":[]}]}}', k);
  url := sprintf ('http://www.freebase.com/api/service/mqlread?queries=%U', qr);
  cnt := http_get (url, hdr);
  tree := json_parse (cnt);
  xt := get_keyword ('ROOT', tree);
  if (not isarray (xt))
    return 0;
  xt := get_keyword ('result', xt);
  types := vector ();
  foreach (any tp in xt) do
    {
      declare tmp any;
      tmp := get_keyword ('type', tp);
      types := vector_concat (types, tmp);
    }
  --types := get_keyword ('type', xt);
  DELETE FROM DB.DBA.RDF_QUAD WHERE g =  iri_to_id(new_origin_uri);
  foreach (any tp in types) do
    {
      qr := sprintf ('{"ROOT":{"query":{%s, "type":"%s", "*":[]}}}', k, tp);
      url := sprintf ('http://www.freebase.com/api/service/mqlread?queries=%U', qr);
      cnt := http_get (url, hdr);
      --dbg_printf ('%s', cnt);
      tree := json_parse (cnt);
      xt := get_keyword ('ROOT', tree);
      xt := DB.DBA.MQL_TREE_TO_XML (tree);
      --dbg_obj_print (xt);
      xt := DB.DBA.RDF_MAPPER_XSLT (registry_get ('_cartridges_path_') || 'xslt/mql2rdf.xsl', xt,
      	vector ('baseUri', coalesce (dest, graph_iri), 'wpUri', sa));
      sa := '';
      xd := serialize_to_UTF8_xml (xt);
--      dbg_printf ('%s', xd);
      DB.DBA.RM_RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
    }
  return 1;
}
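
The first part of DB.DBA.RDF_LOAD_MQL derives an MQL lookup clause from the last segments of the source URI and wraps it in a URL-encoded query. The Python sketch below restates that logic for readers who find it easier to follow outside Virtuoso/PL. The function names and sample URIs are assumptions, and the Wikipedia lookup performed by DB.DBA.RDF_MQL_GET_WIKI_URI is left out.

```python
from urllib.parse import quote, unquote

def mql_key_clause(new_origin_uri):
    """Choose the MQL lookup clause from the last segments of the URI."""
    path = [unquote(part) for part in new_origin_uri.split("/")]
    k = path[-1]
    if len(path) >= 2 and path[-2] == "guid":
        return '"id":"/guid/%s"' % k   # .../guid/<hex id>
    if k.startswith("#"):
        return '"id":"%s"' % k         # identifier starting with '#'
    return '"key":"%s"' % k            # plain key, e.g. a topic name

def mql_types_url(new_origin_uri):
    """Build the request URL that asks for the types of the entity."""
    qr = '{"ROOT":{"query":[{%s, "type":[]}]}}' % mql_key_clause(new_origin_uri)
    return "http://www.freebase.com/api/service/mqlread?queries=" + quote(qr, safe="")

# Invented sample URIs
print(mql_key_clause("http://www.freebase.com/view/en/london"))
print(mql_types_url("http://www.freebase.com/view/guid/9202a8c04000641f8000000000028aa7"))
```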

mql2rdf.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<!--
 -
 -  $Id$
 -
 -  This file is part of the OpenLink Software Virtuoso Open-Source (VOS)
 -  project.
 -
 -  Copyright (C) 1998-2014 OpenLink Software
 -
 -  This project is free software; you can redistribute it and/or modify it
 -  under the terms of the GNU General Public License as published by the
 -  Free Software Foundation; only version 2 of the License, dated June 1991.
 -
 -  This program is distributed in the hope that it will be useful, but
 -  WITHOUT ANY WARRANTY; without even the implied warranty of
 -  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 -  General Public License for more details.
 -
 -  You should have received a copy of the GNU General Public License along
 -  with this program; if not, write to the Free Software Foundation, Inc.,
 -  51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
-->
<!DOCTYPE xsl:stylesheet [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY bibo "http://purl.org/ontology/bibo/">
<!ENTITY xsd  "http://www.w3.org/2001/XMLSchema#">
<!ENTITY foaf "http://xmlns.com/foaf/0.1/">
<!ENTITY sioc "http://rdfs.org/sioc/ns#">
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:vi="http://www.openlinksw.com/virtuoso/xslt/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:sioc="&sioc;"
    xmlns:bibo="&bibo;"
    xmlns:foaf="&foaf;"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:dcterms= "http://purl.org/dc/terms/"
    xmlns:mql="http://www.freebase.com/">

    <xsl:output method="xml" indent="yes" />

    <xsl:param name="baseUri" />
    <xsl:param name="wpUri" />

    <xsl:variable name="ns">http://www.freebase.com/</xsl:variable>

    <xsl:template match="/">
	<rdf:RDF>
	    <xsl:if test="/results/ROOT/result/*">
		<rdf:Description rdf:about="{$baseUri}">
		    <rdf:type rdf:resource="&foaf;Document"/>
		    <rdf:type rdf:resource="&bibo;Document"/>
		    <rdf:type rdf:resource="&sioc;Container"/>
		    <sioc:container_of rdf:resource="{vi:proxyIRI($baseUri)}"/>
		    <foaf:primaryTopic rdf:resource="{vi:proxyIRI($baseUri)}"/>
		    <dcterms:subject rdf:resource="{vi:proxyIRI($baseUri)}"/>
		</rdf:Description>
		<rdf:Description rdf:about="{vi:proxyIRI($baseUri)}">
		    <rdf:type rdf:resource="&sioc;Item"/>
		    <sioc:has_container rdf:resource="{$baseUri}"/>
		    <xsl:apply-templates select="/results/ROOT/result/*"/>
		    <xsl:if test="$wpUri != ''">
			<rdfs:seeAlso rdf:resource="{$wpUri}"/>
		    </xsl:if>
		</rdf:Description>
	    </xsl:if>
	</rdf:RDF>
    </xsl:template>

    <xsl:template match="*[starts-with(.,'http://') or starts-with(.,'urn:')]">
	<xsl:element namespace="{$ns}" name="{name()}">
	    <xsl:attribute name="rdf:resource">
		<xsl:value-of select="vi:proxyIRI (.)"/>
	    </xsl:attribute>
	</xsl:element>
    </xsl:template>

    <xsl:template match="*[starts-with(.,'/')]">
	<xsl:if test="local-name () = 'type' and . like '%/person'">
	    <rdf:type rdf:resource="&foaf;Person"/>
	</xsl:if>
	<xsl:if test="local-name () = 'type'">
	    <sioc:topic>
		<skos:Concept rdf:about="{vi:proxyIRI (concat ($ns, 'view', .))}"/>
	    </sioc:topic>
	</xsl:if>

	<xsl:element namespace="{$ns}" name="{name()}">
	    <xsl:attribute name="rdf:resource">
		<xsl:value-of select="vi:proxyIRI(concat ($ns, 'view', .))"/>
	    </xsl:attribute>
	</xsl:element>
    </xsl:template>

    <xsl:template match="*[* and ../../*]">
	<xsl:element namespace="{$ns}" name="{name()}">
	    <xsl:attribute name="rdf:parseType">Resource</xsl:attribute>
	    <xsl:apply-templates select="@*|node()"/>
	</xsl:element>
    </xsl:template>

    <xsl:template match="*">
	<xsl:if test="* or . != ''">
		<xsl:choose>
		    <xsl:when test="name()='image'">
			<foaf:depiction rdf:resource="{vi:mql-image-by-name (.)}"/>
		    </xsl:when>
		    <xsl:otherwise>
			<xsl:element namespace="{$ns}" name="{name()}">
			    <xsl:if test="name() like 'date_%'">
				<xsl:attribute name="rdf:datatype">&xsd;dateTime</xsl:attribute>
			    </xsl:if>
			    <xsl:apply-templates select="@*|node()"/>
			</xsl:element>
		    </xsl:otherwise>
		</xsl:choose>
	</xsl:if>
    </xsl:template>
</xsl:stylesheet>


16.10.9.11. Using Python to perform Virtuoso Sponging

This section outlines the general steps for extending the Virtuoso Sponger with Python.

  1. Build the latest Python hosting module. It introduces a new function, python_exec ().

    The parameters of python_exec are:

    • a string containing Python code, which should define one or more functions (see the remarks below)
    • a string containing the name of the function to call
    • the list of parameters for that function

    For example:

    python_exec (file_to_string ('sponge.py'), 'rdf4uri', 'http://url..', 'http://base...');
    

    The call above invokes the function rdf4uri ('http://url..', 'http://base...') defined in the file sponge.py. Note that python_exec is restricted to the DBA group, and that the Python source should either contain no __main__ block or guard it so that it is not executed when the code is loaded. Anything printed to stdout or stderr goes to the server console when the server runs in the foreground; this is useful for debugging but not for production work.

    The function must return a single string. Do not try to return multiple results; this does not work in this revision.

  2. Set up the Virtuoso server INI file to load the Python module:
    ...
    [Plugins]
    LoadPath = ../lib
    Load1    = Hosting, hosting_python.so
    ...
    
  3. Download and install the rdflib package from http://www.rdflib.net/. Before building, disable the Zope interface in rdflib, as it does not work correctly with the C API; alternatively, make sure that Python has no Zope interfaces installed. To disable Zope in rdflib, comment out the following lines in <rdflibhome>/rdflib/__init__.py:
    #from rdflib.interfaces import IIdentifier, classImplements
    #classImplements(URIRef, IIdentifier)
    #classImplements(BNode, IIdentifier)
    #classImplements(Literal, IIdentifier)
    

    Then do:

    python setup.py build
    python setup.py install --user
    
  4. Get an example of Python code for the Sponger, such as http://www.ebusiness-unibw.org/wiki/Python4Spongers, and make sure you disable the last lines, which are not suitable for calling inside the Sponger:
    ...
    #if __name__ == '__main__':
    #	rdf_xml = rdf4uri(uri='http://www.amazon.com/Apple-touch-Generation-NEWEST-MODEL/dp/B002M3SOBU/')
    #	print rdf_xml
    

    Store the Python code as sponge.py in the server working directory. Make sure that this directory is readable according to the DirsAllowed INI setting.

  5. Create a procedure and register it with the Sponger:
    -- THIS IS FOR DEMO PURPOSE ONLY
    
    -- for demo purposes we delete all other cartridges registrations to see effect from only this cartridge
    delete from DB.DBA.SYS_RDF_MAPPERS;
    delete from DB.DBA.RDF_META_CARTRIDGES;
    
    -- register cartridge
    insert soft DB.DBA.SYS_RDF_MAPPERS (RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION)
    	values ('(http://.*amazon.[^/]+/[^/]+/dp/[^/]+(/.*)?)', 'URL', 'DB.DBA.RDF_LOAD_PYTHON_AMAZON_ARTICLE', null, 'Amazon articles');
    
    -- the cartridge stored procedure itself
    create procedure DB.DBA.RDF_LOAD_PYTHON_AMAZON_ARTICLE (in graph_iri varchar, in new_origin_uri varchar,  in dest varchar,
        inout _ret_body any, inout aq any, inout ps any, inout _key any, inout opts any)
    {
      declare result any;
      -- we check first python hosting is capable to run code
      if (__proc_exists ('python_exec', 2) is null)
        return 0;
      -- handle any error
      declare exit handler for sqlstate '*'
        {
          -- log the error
          DB.DBA.RM_RDF_SPONGE_ERROR (current_proc_name (), graph_iri, dest, __SQL_MESSAGE);
          return 0;
        };
      -- call the python code
      result := python_exec (file_to_string ('sponge.py'), 'rdf4uri', new_origin_uri);
      -- in case of python error we will get integer zero, so we check
      if (not isstring (result))
        return 0;
      -- for demo purpose we delete all from this graph
      delete from DB.DBA.RDF_QUAD where G = DB.DBA.RDF_MAKE_IID_OF_QNAME (graph_iri);
      -- load the results
      DB.DBA.RDF_LOAD_RDFXML (result, new_origin_uri, coalesce (dest, graph_iri), 0);
      return 1;
    }
    ;
    
  6. Test the Sponger code like this:
    sparql define get:soft "soft" select * from <http://www.amazon.com/Apple-touch-Generation-NEWEST-MODEL/dp/B002M3SOBU/> { ?s ?p ?o };
    
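The constraints listed in step 1 (only function definitions, no code run at load time, a single string as the return value) suggest a module shaped like the sketch below. It is a placeholder for experimentation, not the Python4Spongers code: the function body does no fetching or parsing and returns a minimal RDF/XML document, and the label text is invented. Only the name rdf4uri is taken from the steps above.

```python
from xml.sax.saxutils import escape, quoteattr

def rdf4uri(uri, base=None):
    """Return one RDF/XML string describing uri.

    Placeholder logic only: a real cartridge would fetch and parse the
    page here. The optional base argument is accepted but unused.
    """
    label = "Placeholder description of %s" % uri   # invented text
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\n'
        "  <rdf:Description rdf:about=%s>\n"
        "    <rdfs:label>%s</rdfs:label>\n"
        "  </rdf:Description>\n"
        "</rdf:RDF>\n"
    ) % (quoteattr(uri), escape(label))

# Deliberately no __main__ block and no top-level calls: the module is
# meant to be loaded and have rdf4uri called by the hosting environment.
```

Saved as sponge.py, a module of this shape could be passed to python_exec as shown in step 1, with the returned string handed to DB.DBA.RDF_LOAD_RDFXML as in step 5.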

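Before registering a cartridge, it can help to check which URLs its RM_PATTERN value would accept. The sketch below tries the pattern from step 5 with Python's re module, which is assumed here to behave like Virtuoso's regular expression engine for this particular pattern. The function name matches_cartridge and the second test URL are invented.

```python
import re

# Pattern copied from the RM_PATTERN value in step 5
RM_PATTERN = r"(http://.*amazon.[^/]+/[^/]+/dp/[^/]+(/.*)?)"

def matches_cartridge(url):
    """Return True when the URL would be routed to the Amazon cartridge."""
    return re.match(RM_PATTERN, url) is not None

print(matches_cartridge(
    "http://www.amazon.com/Apple-touch-Generation-NEWEST-MODEL/dp/B002M3SOBU/"))
print(matches_cartridge("http://www.amazon.com/gp/help"))
```

The dot after "amazon" is unescaped in the pattern, so it matches any character rather than only a literal dot; that is harmless for the demo but worth knowing when adapting the pattern.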

16.10.10. Sponger Usage Examples