The Virtuoso WebRobot (WebCopy) retrieves Internet web sites and stores them in a local WebDAV repository. Once retrieved, the local copy in the WebDAV repository can be exported to the local filesystem or to another WebDAV-enabled server. The common features and usage are demonstrated in the WebCopy User Interface of the Visual Server Administration Interface. This document describes the underlying APIs and techniques, which are useful for other implementations.

A new web server target is created by inserting a row into the WS.WS.VFS_SITE table and then a row into the WS.WS.VFS_QUEUE table.

[Tip] See Also:

Web Robot System Tables for table definitions

Example 14.2. Creating a new target

This example creates a new target pointing to the site, with instructions to follow foreign links, delete a local page when it is found to have been removed from the remote site, retrieve images, and walk the entire site using the HTTP GET method. The content will be stored in the /DAV/sites/www_foo_com collection in the local WebDAV repository.

  1. Create target for

    insert into WS.WS.VFS_SITE
      values ('My first test', '', '/help/', 1, 'sites/www_foo_com', '1990-01-01',
          'checked', '/%;', '', 'checked', null, null, 'checked');

  2. Create start queue entry

    insert into WS.WS.VFS_QUEUE
      values ('', now(), '/help/', 'sites/www_foo_com', 'waiting', null);
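
To check that the start entry is in place before the robot runs, the queue can be inspected directly. The following is a minimal sketch that relies only on the column names used by the queue hook example in this section (VQ_URL, VQ_STAT, VQ_TS, VQ_ROOT) and assumes that VQ_ROOT holds the collection path; consult the system table definitions for the full column list.

```sql
-- List the queue entries for the collection, oldest first.
select VQ_URL, VQ_STAT, VQ_TS
  from WS.WS.VFS_QUEUE
  where VQ_ROOT = 'sites/www_foo_com'
  order by VQ_TS;
```

If the queue held no earlier entries for this collection, this would be expected to return the single /help/ entry in the 'waiting' state.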

A custom queue hook can be used to extract the next entry from the robot's queue according to a custom algorithm. The following example takes the oldest waiting entry, checks its host against the my_data array (a list of undesirable sites), and returns 1 with the URL if the entry is to be processed, or 0 when the queue is exhausted.

Example 14.3. Creating A Custom Robot Queue Hook

create procedure
  DB.DBA.my_hook (
    in host varchar, in collection varchar, out url varchar, in my_data any
  )
{
  declare next_url varchar;
  whenever not found goto done;

  -- try to extract the oldest waiting entry
  declare cr cursor for select VQ_URL from WS.WS.VFS_QUEUE
      where VQ_HOST = host and VQ_ROOT = collection and VQ_STAT = 'waiting'
      order by VQ_HOST, VQ_ROOT, VQ_TS for update;

  open cr;
  while (1)
    {
      fetch cr into next_url;
      -- process the entry only if the host is not in the blacklist
      if (get_keyword (host, my_data, null) is null)
        {
          -- mark the entry as taken and hand its URL back to the robot
          update WS.WS.VFS_QUEUE set VQ_STAT = 'pending'
              where VQ_HOST = host and VQ_ROOT = collection and VQ_URL = next_url;
          url := next_url;
          close cr;
          return 1;
        }
      else
        {
          -- otherwise mark the entry as done and continue with the next one
          update WS.WS.VFS_QUEUE set VQ_STAT = 'retrieved'
              where VQ_HOST = host and VQ_ROOT = collection and VQ_URL = next_url;
        }
    }
  -- the queue is exhausted: return false to stop processing
done:
  close cr;
  return 0;
}
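
The hook looks the host up in my_data with get_keyword, which searches an array of alternating keys and values. A minimal sketch of how such an array behaves, using hypothetical host names that do not come from this document:

```sql
-- Hypothetical list: each host is a key, followed by a placeholder value.
-- A listed host yields its paired value (here 1):
select get_keyword ('www.skip.me', vector ('www.skip.me', 1, 'ads.example.com', 1), null);

-- A host that is not listed yields the default, which is null here:
select get_keyword ('www.foo.com', vector ('www.skip.me', 1, 'ads.example.com', 1), null);
```

Because the lookup is by key and the hook only tests the result against null, the value paired with each host only needs to be non-null.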

[Note] Note:

The default function returns the oldest entry from the queue without any restriction. The follow/not-follow restrictions are applied to the path on the target before a new queue entry is inserted.

Site retrieval can be performed with the WS.WS.SERV_QUEUE_TOP PL function, which is built into the Virtuoso server.