16.17.12. Loading RDF

There are many functions for loading RDF text, whether serialized as RDF/XML or Turtle.

For loading RDF/XML, the best way is to split the data to be loaded into multiple streams and load these in parallel using DB.DBA.RDF_LOAD_RDFXML(). To avoid running out of rollback space with large files, and so that multiple concurrent loads do not interfere with each other, row-by-row autocommit mode should be enabled.

For example,

log_enable (2);
-- switch row-by-row autocommit on and logging off for this session
DB.DBA.RDF_LOAD_RDFXML (file_to_string_output ('file.xml'), 'base_uri', 'target_graph');
-- more files here ...
checkpoint;

Running a script like the above with isql loads the data. Since transaction logging is off, make a manual checkpoint at the end to ensure that the data is persisted across a server restart, as there is no roll-forward log.

If large amounts of data are to be loaded, run multiple such streams in parallel, for example 6 streams for 4 cores. This way, even if up to two threads are waiting for disk, there is still work for all cores.

Having substantially more threads than processors or disks is not particularly useful.

There are also multithreaded load functions which load a single file on multiple threads: DB.DBA.TTLP_MT() and DB.DBA.RDF_LOAD_RDFXML_MT(). Experience shows, however, that loading multiple files with one thread per file is better.
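
For reference, a single multithreaded load of one file looks like this (a sketch; the trailing log-mode and thread-count arguments follow the pattern of the UniProt example later in this section):

DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('file.xml'), 'base_uri', 'target_graph', 1, 3);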

For loading Turtle, some platforms may have a non-reentrant Turtle parser, which means that only one load may run at a time. One can test this by calling ttlp() from two sessions at the same time. If these do not execute concurrently, the best option may be to try ttlp_mt() and see whether it runs faster than a single-threaded ttlp() call.
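
For example, one might test this as follows (a sketch; the file and graph names are placeholders):

-- isql session 1
DB.DBA.TTLP (file_to_string_output ('a.ttl'), '', 'http://example.org/g1');

-- isql session 2, started at the same time
DB.DBA.TTLP (file_to_string_output ('b.ttl'), '', 'http://example.org/g2');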

RDF Bulk Load Utility

The RDF loader utility facilitates parallel bulk loading of multiple RDF files. It maintains a database table containing the list of files to load and the status of each file: not loaded, loaded, or loaded with error. The table also records load start and end times.

Using the utility requires a dba group login, and the file access control list in virtuoso.ini (the DirsAllowed parameter) must be set up so that the Virtuoso server can open the files to be loaded.
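
For example, the [Parameters] section of virtuoso.ini might be extended as follows (a sketch using the directory from the examples below; adjust paths to your installation):

[Parameters]
DirsAllowed = ., /data8/2848260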

Files are added to the load list with the ld_dir function:

ld_dir (in dir_path varchar, in file_mask varchar, in target_graph varchar);

The file mask is a SQL LIKE pattern matched against the files in the directory. For example:

ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');

would register for loading all files ending in .gz from the directory given as the first parameter. The RDF will be loaded into the http://bsbm.org graph.

If NULL is given for the graph, each file may go to a different graph specified in a separate file with the name of the RDF source file plus the extension .graph.

A .graph file contains the target graph URI without any other content or whitespace.
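
For example (a sketch; the file names are placeholders), given /data8/2848260/data1.nt with a companion file /data8/2848260/data1.nt.graph containing just the line http://example.org/graph1, the call

ld_dir ('/data8/2848260', '%.nt', NULL);

registers data1.nt for loading into the http://example.org/graph1 graph.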

The layout of the load_list table is as follows:

create table DB.DBA.LOAD_LIST (
  ll_file varchar,
  ll_graph varchar,
  ll_state int default 0, -- 0 not started, 1 going, 2 done
  ll_started datetime,
  ll_done datetime,
  ll_host int,
  ll_work_time integer,
  ll_error varchar,
  primary key (ll_file))
alter index LOAD_LIST on DB.DBA.LOAD_LIST partition (ll_file varchar)
create index LL_STATE on DB.DBA.LOAD_LIST (ll_state, ll_file, ll_graph) partition (ll_state int)
;

This table may be checked at any time during a bulk load to track progress. ll_state is 1 for files being loaded and 2 for files whose loading has finished. ll_error is NULL if the load finished without error; otherwise it contains the error message.
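
For example, progress and failures can be checked with straightforward queries against this table:

-- files not yet finished
select ll_file, ll_state, ll_started from DB.DBA.LOAD_LIST where ll_state <> 2;

-- files that finished with an error
select ll_file, ll_error from DB.DBA.LOAD_LIST where ll_error is not null;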

In order to load data from the files in load_list, run as dba:

DB.DBA.rdf_loader_run ();

One may run several of these commands on parallel sessions for better throughput.

On a cluster one can do:

cl_exec ('rdf_ld_srv ()');

This will start one rdf_loader_run() on each node of the cluster. Note that in such a setting all the server processes must see the same files at the same path.

From an isql session one may execute rdf_loader_run () & several times; each such command forks a new isql process, much as a Unix shell would.

Because this load is non-transactional and non-logged, one must do an explicit checkpoint after the load to guarantee a persistent state.

On a single server do:

checkpoint;

On a cluster do:

cl_exec ('checkpoint');

The server(s) are online and can process queries and transactions while a bulk load is in progress. Periodic checkpoints may occur during the load but the state is guaranteed to be consistent only after running a checkpoint after all the bulk load threads have finished.

A bulk load should not be forcibly stopped. To make a controlled stop, run:

rdf_load_stop ();

This causes the files being loaded at the time to finish loading, but no new loads will start until explicitly restarted with rdf_loader_run().
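
Assuming files that have not started keep ll_state 0 and are picked up by the next loader run, a controlled stop and later resume is simply:

rdf_load_stop ();
-- ... later
rdf_loader_run ();
checkpoint;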

Note especially that on a cluster the database will be inconsistent if one server process does a checkpoint and another does not, so a checkpoint must be guaranteed on all of them. This is easily done with an isql script such as the following:

ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];

rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &
rdf_loader_run () &

wait_for_children;
checkpoint;

-- Record CPU time
select getrusage ()[0] + getrusage ()[1];

For a cluster, the equivalent is:

ld_dir ('/data8/2848260', '%.gz', 'http://bsbm.org');

cl_exec ('DB.DBA.RDF_LD_SRV (2)');

cl_exec ('checkpoint');

rdf_loader_run() recognizes several file types, including .ttl, .nt, .xml, .rdf, .owl, .nq, .n4, and others. Internally the function uses DB.DBA.ttlp() or DB.DBA.rdf_load_rdfxml(), as appropriate.
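
If some files failed, one way to retry them is to reset their status in load_list and run the loader again (a sketch, assuming the loader picks up rows with ll_state 0):

update DB.DBA.LOAD_LIST set ll_state = 0, ll_error = null where ll_error is not null;
rdf_loader_run ();
checkpoint;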

See the next section for a detailed description of the rdf_loader_run() function.

Loading LOD RDF data

To load RDF data into a LOD instance, perform the following steps:

  • Configure and start the cluster

  • Execute the following script:

    --
    --  Alternate RDF index scheme for cases where G unspecified
    --
    
    drop index RDF_QUAD_OGPS;
    checkpoint;
    create table R2 (G iri_id_8, S iri_id_8, P iri_id_8, O any, primary key (S, P, O, G))
    alter index R2 on R2 partition (S int (0hexffff00));
    
    log_enable (2);
    insert into R2 (G, S, P, O) SELECT G, S, P, O from rdf_quad;
    
    drop table RDF_QUAD;
    alter table r2 rename RDF_QUAD;
    checkpoint;
    create bitmap index RDF_QUAD_OPGS on RDF_QUAD (O, P, G, S) partition (O varchar (-1, 0hexffff));
    create bitmap index RDF_QUAD_POGS on RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
    create bitmap index RDF_QUAD_GPOS on RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));
    
    checkpoint;
    
  • Execute:

    SQL>cl_exec ('checkpoint');
    
  • Execute ld_dir ('directory', 'mask', 'graph'), for example:

    SQL>ld_dir ('/dbs/data', '%.gz', 'http://dbpedia.org');
    
  • Execute on every node, each with a separate client:

    SQL>rdf_loader_run();
    

Loading UniProt RDF data

To load the UniProt data, create a function such as:

create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
{
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename1'),'http://base_uri_1', 'destination_graph_1', log_mode, 3);
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename2'),'http://base_uri_2', 'destination_graph_2', log_mode, 3);
  ...
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename9'),'http://base_uri_9', 'destination_graph_9', log_mode, 3);
}

If you are starting from a blank database and can drop and re-create it should an error be signaled, use it this way:

checkpoint;
checkpoint_interval(6000);
DB.DBA.UNIPROT_LOAD (0);
checkpoint;
checkpoint_interval(60);

If the database already contains important data and there is no way to stop it and take a backup before the load, then use:

checkpoint;
checkpoint_interval(6000);
DB.DBA.UNIPROT_LOAD ();
checkpoint;
checkpoint_interval(60);

Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML_MT() refers to the threads used to process data from the file; an extra thread reads and parses the text, so for 4 CPU cores there is no need for a value greater than 3. Three processing threads per parsing thread is usually a good ratio, because parsing is roughly three times faster than the rest of the loading, so the CPU load stays well balanced. If, for example, you are using a 2 x quad-core Xeon machine, you can choose between 8 single-threaded parsers or 2 parsers with 3 processing threads each. With 4 cores you may simply load file after file with 3 processing threads. The most important performance tuning is the [Parameters] section of the virtuoso.ini configuration file:

NumberOfBuffers = 1000000
MaxDirtyBuffers = 800000
MaxCheckpointRemap = 1000000
DefaultIsolation = 2

Note: these numbers are reasonable for a 16 GB RAM Linux box. When no such massive operation as loading a huge database is underway, you can set the values as:

NumberOfBuffers = 1500000
MaxDirtyBuffers = 1200000
MaxCheckpointRemap = 1500000
DefaultIsolation = 2
Tip:

After loading all the data you may therefore wish to shut down the server, adjust these settings, and start it again. If you have an ext2 or ext3 filesystem, it is also better to keep enough free space on disk so that it does not become more than 80% full. When the disk is almost full, the database file may be allocated badly, resulting in a measurable loss of disk access speed. This is not specific to Virtuoso, but a common hint for all database-like applications doing random access to big files.

Here is an example awk script for splitting a big RDF/XML file into smaller ones:

# Split an RDF/XML file into parts of file_part rdf:Description blocks each.
# Assumes beg.txt contains the XML declaration and opening <rdf:RDF ...> tag
# (prepended to every part after the first) and that the res/ directory exists.
BEGIN {
        file_part=1000          # descriptions per output part
        e_line = "</rdf:RDF>"   # closing tag appended to each part
        cur=0                   # total descriptions seen
        cur_o=0
        file=0                  # index of the current output part
        part=file_part          # descriptions remaining in the current part
      }
        {
            res_file_i="res/"FILENAME
            line=$0
            s=$1
            res_file=res_file_i"_"file".rdf"

            if (index (s, "</rdf:Description>") == 1)   # a description just ended
            {
                cur=cur+1
                part=part-1
            }

            if (part > 0)
            {
                print line >> res_file
            }

            # current part is full: write the closing tag, then start the
            # next part from the beg.txt header
            if (part == 0)
            {
                print line >> res_file
                print e_line >> res_file
                close (res_file)
                file=file+1
                part=file_part
                res_file=res_file_i"_"file".rdf"
                system ("cp beg.txt " res_file)
            }
        }
END { }
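
Assuming the script is saved as split.awk, an invocation might look like this (beg.txt must contain the XML declaration and opening <rdf:RDF ...> tag, and the res/ directory must exist):

mkdir -p res
gawk -f split.awk big.rdf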

Loading DBpedia RDF data

You can use the following script as an example for loading DBpedia RDF data in Virtuoso:

#!/bin/sh

PORT=$1
USER=$2
PASS=$3
file=$4
g=$5
LOGF=`basename $0`.log

if [ -z "$PORT" -o -z "$USER" -o -z "$PASS" -o -z "$file" -o -z "$g" ]
then
  echo "Usage: `basename $0` [DSN] [user] [password] [ttl-file] [graph-iri]"
  exit
fi

if [ ! -f "$file" -a ! -d "$file" ]
then
    echo "$file does not exists"
    exit 1
fi

mkdir READY 2>/dev/null
rm -f $LOGF $LOGF.*

echo "Starting..."
echo "Logging into: $LOGF"

DOSQL ()
{
    # output is redirected by the caller
    isql $PORT $USER $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="$1"
}

LOAD_FILE ()
{
    f=$1
    g=$2
    echo "Loading $f (`cat $f | wc -l` lines) `date \"+%H:%M:%S\"`" | tee -a $LOG

    DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF

    if [ $? != 0 ]
    then
        echo "An error occurred, please check $LOGF"
        exit 1
    fi

    line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
    newf=$f.part
    inx=1
    while [ ! -z "$line_no" ]
    do
        cat $f |  awk "BEGIN { i = 1 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }"  >> bad.nt
        line_no=`expr $line_no + 1`
        echo "Retrying from line $line_no"
        echo "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ." > tmp.nt
        cat $f |  awk "BEGIN { i = 1 } { if (i>=$line_no) print \$0; i = i + 1 }"  >> tmp.nt
        mv tmp.nt $newf
        f=$newf
        mv $LOGF $LOGF.$inx
        DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF

        if [ $? != 0 ]
        then
            echo "An error occurred, please check $LOGF"
            exit 1
        fi
        line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
        inx=`expr $inx + 1`
    done
    rm -f $newf 2>/dev/null
    echo "Loaded.  "
}

echo "======================================="
echo "Loading started."
echo "======================================="

if [ -f "$file" ]
then
    LOAD_FILE $file $g
    mv $file READY 2>> /dev/null
elif [ -d "$file" ]
then
    for ff in `find $file -name '*.nt'`
    do
        LOAD_FILE $ff $g
        mv $ff READY 2>> /dev/null
    done
else
   echo "The input is not file or directory"
fi
echo "======================================="
echo "Final checkpoint."
DOSQL "checkpoint;" > temp.res
echo "======================================="
echo "Check bad.nt file for skipped triples."
echo "======================================="

exit 0
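
Assuming the script is saved as load_dbpedia.sh, a run might look like this (argument values are placeholders):

chmod +x load_dbpedia.sh
./load_dbpedia.sh 1111 dba dba dbpedia_infobox.nt 'http://dbpedia.org'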

Loading Bio2RDF data

The shell script below was used to import files in N3 notation into the OpenLink Virtuoso RDF store.

When a syntax error is encountered, it cuts the content from the next line onward and retries. This was used on Ubuntu Linux to import the Bio2RDF and Freebase dumps.

Note that it uses gawk, so gawk must be available on the system where it is run. Also, recovery needs additional disk space, at most the size of the original file.

#!/bin/bash

PASS=$1
f=$2
g=$3

# Usage
if [ -z "$PASS" -o -z "$f" -o -z "$g" ]
then
  echo "Usage: $0 [password] [ttl-file] [graph-iri]"
  exit
fi

if [ ! -f "$f" ]
then
    echo "$f does not exists"
    exit
fi

# Your port here
PORT=1111  #`inifile -f dbpedia.ini -s Parameters -k ServerPort`
if test -z "$PORT"
then
    echo "Cannot find INI and inifile command"
    exit
fi

# Initial run
isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log

# If disconnect etc.
if [ $? != 0 ]
then
    echo "An error occurred, please check $0.log"
    exit
fi

# Check for error
line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
newf=$f.part
inx=1

# Error recovery
while [ ! -z "$line_no" ]
do
    cat $f |  awk "BEGIN { i = 0 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }"  >> bad.nt
    line_no=`expr $line_no + 1`
    echo "Retrying from line $line_no"
    cat $f |  awk "BEGIN { i = 0 } { if (i>=$line_no) print \$0; i = i + 1 }"  > tmp.nt
    mv tmp.nt $newf
    f=$newf
    mv $0.log $0.log.$inx
    # Run the recovered part
    isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log

    if [ $? != 0 ]
    then
        echo "An error occurred, please check $0.log"
        exit
    fi
    line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
    inx=`expr $inx + 1`
done
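
Assuming the script is saved as load_bio2rdf.sh, a run might look like this (values are placeholders):

chmod +x load_bio2rdf.sh
./load_bio2rdf.sh dba bio2rdf_dump.nt 'http://bio2rdf.org'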