16.9.8. Using Virtuoso Crawler

The Virtuoso Crawler includes the Sponger options, so you can crawl non-RDF content, have it transformed into RDF, and store the result in the Quad Store.
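For context, the following is a minimal sketch (run from iSQL) of what the crawler effectively does for each page it visits: the page is fetched and loaded into a named graph, with the Sponger transforming non-RDF content into RDF when the relevant cartridges are enabled. The graph IRI below is an arbitrary choice for illustration.

-- Sponge a single page into the Quad Store; with cartridges enabled,
-- the Sponger converts non-RDF content (e.g. embedded RDFa) into RDF.
SPARQL LOAD <http://www.w3.org/People/Berners-Lee/> INTO GRAPH <http://www.w3.org/People/Berners-Lee/>;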

Example:

  1. Go to the Conductor UI, for example at http://example.com/conductor :

    Figure 16.86. Using Virtuoso Crawler

  2. Enter the admin user credentials:

    Figure 16.87. Using Virtuoso Crawler

  3. Go to the Web Application Server tab:

    Figure 16.88. Using Virtuoso Crawler

  4. Go to the Content Imports tab:

    Figure 16.89. Using Virtuoso Crawler

  5. Click the "New Target" button:

    Figure 16.90. Using Virtuoso Crawler

  6. In the form shown, set the following:

    1. "Target description": Tim Berners-Lee's electronic Business Card

    2. "Target URL": http://www.w3.org/People/Berners-Lee/ ;

    3. "Copy to local DAV collection " for ex.: /DAV/home/demo/my-crawling/ ;

    4. From the "Local resources owner" list, choose: demo ;

    5. Leave the "Store documents locally" check-box checked (its default state). Note: if "Store documents locally" is not checked, no documents will be saved as DAV resources and the DAV collection specified above will not be used ;

    6. Check the check-box labeled "Store metadata" ;

    7. Specify which cartridges should be involved by checking their check-boxes ;

    8. Note: when "Convert Link" is selected, all HREFs in the locally stored content will be made relative.

    Figure 16.91. Using Virtuoso Crawler

    Figure 16.92. Using Virtuoso Crawler

  7. Click the "Create" button:

    Figure 16.93. Using Virtuoso Crawler

  8. Click the "Import Queues" button:

    Figure 16.94. Using Virtuoso Crawler

  9. For "Robot target" with label "Tim Berners-Lee's electronic Business Card" click "Run".

  10. As a result, the number of pages retrieved should be shown; a query to verify the imported data is sketched after this list.

    Figure 16.95. Using Virtuoso Crawler
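Once the crawl has completed, the presence of the imported RDF in the Quad Store can be checked from iSQL. The query below is a minimal sketch; it assumes the crawled metadata was stored in a graph named after the target URL.

SPARQL
SELECT ?s ?p ?o
  FROM <http://www.w3.org/People/Berners-Lee/>
 WHERE { ?s ?p ?o }
 LIMIT 10;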

Example: Use the scheduler to interface the Virtuoso Quad Store with PTSW (Ping the Semantic Web) by means of the following program:


-- Fetch the list of recently pinged RDF documents from PTSW and
-- load each of them into the Quad Store.
create procedure PTSW_CRAWL ()
{
  declare xt, xp any;
  declare content, headers any;

  -- Retrieve the PTSW export feed (an XML list of pinged documents)
  content := http_get ('http://pingthesemanticweb.com/export/', headers);
  xt := xtree_doc (content);
  -- Extract the URL of every listed RDF document
  xp := xpath_eval ('//rdfdocument/@url', xt, 0);
  foreach (any x in xp) do
    {
      x := cast (x as varchar);
      dbg_obj_print (x);
      {
        -- Log and skip documents that cannot be loaded
        declare exit handler for sqlstate '*' {
          log_message (sprintf ('PTSW crawler can not load : %s', x));
        };
        -- Load the document into a graph named after its URL
        sparql load ?:x into graph ?:x;
        -- Reset the sponge cache entry so the document can be re-fetched later
        update DB.DBA.SYS_HTTP_SPONGE set HS_LOCAL_IRI = x, HS_EXPIRATION = null WHERE HS_LOCAL_IRI = 'destMD5=' || md5 (x) || '&graphMD5=' || md5 (x);
        commit work;
      }
    }
}
;

-- Schedule the crawl to run every 60 minutes
insert soft SYS_SCHEDULED_EVENT (SE_SQL, SE_START, SE_INTERVAL, SE_NAME)
        values ('DB.DBA.PTSW_CRAWL ()', cast (stringtime ('0:0') as DATETIME), 60, 'PTSW Crawling');
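
The scheduled job can afterwards be inspected, or removed when no longer needed, through the same system table; for example:

-- Confirm the crawl job is registered with the scheduler
select SE_NAME, SE_START, SE_INTERVAL
  from SYS_SCHEDULED_EVENT
 where SE_NAME = 'PTSW Crawling';

-- Remove the job when it is no longer needed
-- delete from SYS_SCHEDULED_EVENT where SE_NAME = 'PTSW Crawling';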