6.1.7.Performance Tuning

I/O

Optimizing Disk I/O

Virtuoso allows splitting a database over several files that may be on different devices. By allocating database fragments onto independent disks I/O performance in both random and sequential database operations can be greatly enhanced.

The basic unit of a database is the segment. A segment consists of an integer number of 8K pages. A segment may consist of one or more files called stripes. If a segment has multiple stripes these will be of identical length and the segment size will be an integer multiple of the stripe size.

The size limit on individual database files is platform dependent, but 64 bit file offsets are used where available. For large databases use of multiple disks and segments is recommended for reasons of parallelism even though a single database file can get very large. A database can in principle grow up to 32TB (32-bit page number with 8KB per page).

When a segment is striped each logically consecutive page resides in a different file, thus for a segment of 2 stripes the first stripe will contain all even numbered pages and the second all the odd numbered pages. The stripes of a segment should always be located on independent disks.

In serving multiple clients that do random access to tables the server can perform multiple disk operations in parallel, taking advantage of the independent devices. Striping guarantees a statistically uniform access frequency to all devices holding stripes of a segment.

The random access advantages of striping are available without any specific configuring besides that of the stripes themselves.

Configuring I/O queues

Striping is also useful for a single client doing long sequential read operations. The server can detect the serial nature of an operation, for example a count of all rows in a table and can intelligently prefetch rows.

If the table is located in a striped segment then the server will read all relevant disks in parallel if these disks are allocated to different I/O queues.

All stripes of different segments on one device should form an I/O queue. The idea is that the database stripes that benefit from being sequentially read form a separate queue. All queues are then read and written independently, each on a different thread. This means that a thread will be allocated per queue. If no queues are declared all database files, even if located on different disks share one queue.

A queue is declared in the striping section by specifying a stripe id after the path of the file, separated by an equal sign.

[Striping]
Segment1 = 200M, disk1/db-seg1-1.db = iq1, disk2/db-seg1-2.db = iq2
Segment2 = 200M, disk1/db-seg2-1.db = iq1, (disk2/db-seg2-2.db = iq2

In the above example the first stripes of the segments form one queue and the second stripes form another. This makes sense because now all database files on /disk1 are in iq1 and all on /disk2 are on iq2.

This configuration could have resulted from originally planning a 200 MB database split on 2 disks and subsequently expanding that by another 200 MB.

The I/O queue identifier can be an arbitrary string. As many background I/O threads will be made as there are distinct I/O queues.

Striping and using I/O queues can multiply sequential read rates by a factor almost equal to the number of independent disks. On the other hand assigning stripes on one disk to different queues can have a very detrimental effect. The rule is that all that is physically accessed sequentially will be on the same queue.

Schema Design Considerations

Data Organization

One should keep the following in mind when designing a schema for maximum efficiency.

Index Usage

A select from a table using a non-primary key will need to retrieve the main row if there are search criteria on columns appearing on the main row or output columns that have to be fetched from the main row. Operations are noticeably faster if they can be completed without fetching the main row if the driving key is a non-primary key. This is the case when search criteria and output columns are on the secondary key parts or primary key parts. Note that all secondary keys contain a copy of the primary key. For this purpose it may be useful to add trailing key parts to a secondary key. Indeed, a secondary key can hold all the columns of a row as trailing key parts. This slows insert and update but makes reference to the main row unnecessary when selecting using the secondary key.

A sequential read on the primary key is always fastest. A sequential search with few hits can be faster on a secondary key if the criterion can be evaluated without reference to the main row. This is because a short key can have more entries per page.

Space Consumption

Each column takes the space 'naturally' required by its value. No field lengths are preallocated. Space consumption for columns is the following:

Table6.2.Data type Space Consumption

Data Bytes
Integer below 128 1
Smallint 2
long 4
float 4
timestamp 10
double 8
string 2 + characters
NULL data length for fixed length column, as value of 0 length for variable length column.
BLOB 88 on row + n x 8K (see note below)

If a BLOB fits in the remaining free bytes on a row after non-LOBs are stored, it is stored inline and consumes only 3 bytes + BLOB length.

Each index entry has an overhead of 4 bytes. This applies to the primary key as well as any other keys. The length of the concatenation of the key parts is added to this. For the primary key the length of all columns are summed. For any other key the lengths of the key parts plus any primary key parts not on the secondary key are summed. The maximum length of a row is 4076 bytes.

In light of these points primary keys should generally be short.

Page Allocation

For data inserted in random order pages tend to be 3/4 full. For data inserted in ascending order pages will be about 90% full due to a different splitting point for a history of rising inserts.

Efficient Use of SQL - SQL Execution profiling

Virtuoso offers an execution profiling mechanism that keeps track of the relative time consumption and response times of different SQL statements.

Profiling can be turned on or off with the prof_enable function. When profiling is on, the real time between the start and end of each SQL statement execute call is logged on the server. When prof_enable is called for the second time the statistics gathered between the last call to prof_enable and this call are dumped to the virtprof.out file in the server's working directory.

Profiling is off by default. Profiling can be turned on with the statement:

prof_enable (1);

The virtprof.out file will be generated when prof_enable is called for the second time, e.g.

prof_enable (0);

will write the file and turn profiling back off.

Below is a sample profile output file:

Query Profile (msec)
Real 183685, client wait 2099294, avg conc 11.428772 n_execs 26148 avg exec  80

99 % under 1 s
99 % under 2 s
99 % under 5 s
100 % under 10 s
100 % under 30 s

23 stmts compiled 26 msec, 99 % prepared reused.

 %  total n-times n-errors
49 % 1041791  7952     53     new_order
28 % 602789   8374     490   delivery_1
12 % 259833   8203     296   payment
5  % 123162   821      35    slevel
2  % 54182    785      0     ostat (?, ?, ?, ?)
0  % 11614    4        0     checkpoint
0  % 2790     2        1     select no_d_id, count (*) from new_orde
0  % 2457     3        1     select count (*) from new_order
0  % 662      2        0     status ()
0  % 11       1        1     set autocommit on
0  % 3        1        0     select * from district

This file was produced by profiling the TPC C sample application for 3 minutes. The numbers have the following meanings:

The real time is the real time interval of the measurement, that is the space in time between the prof_enable call that started the profiling and the call that wrote the report. The client wait time is the time cumulatively spent inside the execute call server side, only calls completely processed between profiling start and end are counted. The average concurrency is the exec time divided by real time and indicates how many requests were on the average concurrently pending during the measurement interval. The count of executes and their average duration is also shown.

The next section shows the percentage of executes that finished under 1, 2, 5, 10 and 30 seconds of real time.

The compilation section indicates how many statements were compiled during the interval. These will be SQLPrepare calls or SQLExecDirect calls where the text of the statement was not previously known to the server. The real time spent compiling the statements is added up. The percentage of prepared statement reuses, that is, executes not involving compiling over all executes is shown. This is an important performance metric, considering that it is always better to use prepared statements with parameters than statements with parameters as literals.

The next section shows individual statements executed during the measurement interval sorted by descending cumulative real time between start and completion of the execute call.

The table shows the percentage of real time spent during the calls to the statement as a percentage of the total real time spent in all execute calls. Note that these real times can be higher than the measurement interval since real times on all threads are added up.

The second column shows the total execute real time, the third column shows the count of executes of the statement during the measurement interval. The fourth column is the count of executes that resulted in an error. The error count can be used for instance to spot statements that often produce deadlocks.

Statements are distinguished for profiling purposes using the 40 first characters of their text. Two distinct statements that do not differ in their first 40 characters will be considered the same for profiling. If a statement is a procedure call then only the name of the procedure will be considered, not possibly differing literal parameters.

The profiler will automatically write the report after having 10000 distinct statements in the measurement interval. This is done so as to have a maximum on the profiling memory consumption for applications that always compile a new statement with different literals, resulting in a potentially infinitely long list of statements each executed once. It is obvious that such a profile will not be very useful.

It cannot be overemphasized that if an application does any sort of repetitive processing then this should be done with prepared statements and parameters, for both reasons of performance and profilability.

[Note] Note:

Note that measurements are made with a one millisecond precision. Percentages are rounded to 2 digits. Timing of fast operations, under a few milliseconds, will be imprecise as a result of the 1 ms resolution. Also the cumulative compilation time may be substantially off, since the compilation may often take less than 1 ms at a time. Also note that the precision may also vary between platforms.

Reading a Query profile

A query plan is essentially a pipeline of operations that can be read from top to bottom. The operator above produces the input for the operator below. Nesting is indicated by braces. Operators enclosed in braces must completely process their input before continuing with operators after the section in braces. A case in point is aggregation which is demoted by for { ... } ... . The section in braces generates the rows to be aggregated and one it is fully evaluated the execution proceeds with the operators after the fork. These will typically read the temporary table filled by the operators inside the fork.

Query variables are demoted with <$v...>. A reference to a query variable set by an operator that is not immediately the preceding one is demoted by <r $...> via ...>. The position of the variable in the query state follows the $ sign.

After this is given the symbolic name of the variable, e.g. a column name or the name of the function of which this is the return value.

If this is a reference to a query variable assigned earlier in the plan with one or more operators between the reference and the assignment, the reference contains a via .... section which lists the operators that are between the reference and the assignment.

The operators are identified by a unique number. Internally this is the location in the query state where the operator keeps the mapping between its rows of input and output.

With vectored execution, an operator may receive any number of rows of input which it may reorder at its convenience. For this it must record for each row of output which row of input this corresponds to. Thus when referencing a query variable set earlier in the plan one must take into account the reordering and addition/dropping of rows in affected by the operators between the reference and the initial assignment. This is the meaning of the via section in the <r $xx > notation.

The last part of the query variable reference is a two letter indication of the type. If the first letter is followed by n this means that the variable is non-null. The type letters are a for any, r for IRI, i for integer, d for double, f for float, t for datetime, N for wide string, s for string.

If the plan is a profile, i.e. it is annotated with actual times and cardinalities from running the query, each operator is preceded by a passage that indicates its percentage of total CPU time of the plan, its fanout and the number of rows of input. The fanout is the number of rows of Output divided by the number of rows of input.

The principal operators in a plan are:

  • Cluster outer seq start

    -- Queries and subqueries begin with this operator. It assigns numbers to the sets of parameter bindings the query gets as input. An optional section also begins with this. An optional section is the table or derived table on the right hand side of a left outer join.

  • From

    -- Accessing an index

  • Sort

    -- Group by, order by, distinct or hash join build side

  • group by read

    -- Accessing the results of a group by filled in by a previous sort operator.

  • Top oby

    -- Accessing a previously filled top k order by temporary space

  • Select

    -- At the end of a plan, send the rows of input to the client. If Subq select return the rows of input as rows of output of the enclosing subquery.

  • Subquery

    -- Indicates a SQL derived table, i.e. select in a from clause. The input is the output of the previous node.

  • Cluster ocation fragment

    -- A set of operators partitioned and shipped to a partition in a cluster. There are single part fragments and distributed fragments (DFG). A DFG is a query fragment which begins with a stage operator. The stage is a partitioning exchange operator which routes its rows of input to the next operator in the appropriate partition based on the partitioningh columns of the row of input.

  • Outer seq end

    -- This demotes the end of an optional section, i.e. the table/derived table on the right side of left outer join. This has the effect of making aligned vectors of all the vector query variables assigned in the optional section. When producing rows for which there was no match in the optional section, these will be set to null. References to these variables downstream in the plan will be to the shadowing variables, listed in the shadow clause following this.

Some operators such as index access (From) have a section called vector param casts. This is used when the vector that serves as parameter is not aligned or not of the type required by the operator. For example, a vector of column values in an index lookup might not be of the type of the column but might be of a type castable to this. In any case, the para,key values for index lookup are made into solid aligned, materialized vectors. After this there is a sort step that orders the input according to key order, so that matches are generated from left to right.

In the event of a cluster partitioned operation, i.e. stage node or non-DFG location fragment (QF), the first table source (From) after the stage or fragment serves to both align and cast all inputs needed downstream in the DFG or QF. The operator first partitions the rows of input. Each partition then gets solid, aligned vectors of input values which are directly suitable for input to the first index lookup (From).

We consider an example: The setup is a cluster of 4 server processes with each 8 slices of data. There is one table, ct_c, which is joined to itself:

profile ('SELECT COUNT (*) from ct_c a, ct_c b WHERE a.k2 = b.k1 option (loop, order)');

This returns a result set in which is found the below. This is also in the ql_plan column of the sys_query_log view:

{
-- In the below excerpt some lines are omitted. Comments are inline.

time   1.3e-06% fanout         1          1 rows in

s# 98 cluster outer seq start, set no <V $40 set_ctr in>
save ctx:()

-- the section inside the fork is executed to the end
-- before the section after the fork is started.
s# 161 Fork 42
{
wait time   0.00012% of exec real time, fanout         0

-- The below introduces the 2 stage DFG which does most of the work.
s# 108   { Cluster location fragment 51 0 unordered
   Params: ()
Output: ()

-- First partitioning step. Each of the 32 slices gets the
-- identical plan to execute, this is a flood for all is this
-- first step does not specify a partitioning key value.

time     8e-05% fanout         0          0 rows in
s# 115 Stage 1: Params: (<V $112 set_ctr in>)

-- This is the initial table scan. The operator is executed
-- 32 times, once for each slice of data.

time      0.33% fanout   3.1e+06         32 rows in
s# 121 from DB.DBA.ct_c by ct_c    1.7e+08 rows
Key ct_c  ASC  (<V $28 a.k2 in>)
 [copies params]
vector param casts: <V $40 set_ctr in>-> <V $112 set_ctr in>

-- This is the second partitioning step, each of the rows of
-- output of the previous is sent to the slice corresponding to K2.

time        31% fanout       0.9      2e+08 rows in
s# 131 Stage 2: Params: (<v $134 q_a.k2 S136 in>, <V $138 set_ctr in>)

-- This is the 2nd index lookup, by a.k2 = b.k1 There are two
-- parameters given in the vector param cast section, the first
-- is a.k2 and the second is the set number. This is constant
-- since the query is run on a single row of input but if the
-- query were run on multiple rows of input this would specify
-- which row of input the a.k2 comes from so that this would be
-- aggregated in the right place. Many aggregates can be produced
-- in a single invocation with multiple rows of input.

time        68% fanout         0      2e+08 rows in
s# 144 from DB.DBA.ct_c by ct_c  unq         1 rows
Key ct_c  ASC  ()
 inlined  k1 = <v $134 q_a.k2 S136 in>
vector param casts: <V $28 a.k2 in>-> <v $134 q_a.k2 S136 in>, <r $112 set_ctr via  S121>-> <V $138 set_ctr in>

-- This is the aggregation.

After code:
      0:  sum <V $43 count i> 1 set no <r $138 set_ctr via  S144>
      5: BReturn 0

}
}

-- The below returns the aggregate to the client. The combining of
-- aggregates from different slices is implicit in the for operator
-- whenever this ends with a cluster location fragment.

time   5.4e-07% fanout         0          1 rows in
s# 164 Select (<V $43 count i>)
  set no: <V $40 set_ctr in>
}

Tuning Vectored Execution and Hash Joins

Note : This section applies only to versions 7.00 and up.

Query evaluation performance is significantly affected by parallelization, the vector size and the use of hash joins and fast in-memory hash tables for group by and distinct.

This section explains the configuration parameters and event counters that allow optimizing these factors.

The SQL function cl_sys_stat (in name varchar, in clr int := 0) allows reading and optionally resetting these counters. In the case of a cluster, the value returned is the sum of the values of the metric gathered from all server processes, for a single server this is the local value. In a cluster, to get individual counter values, use sys_stat instead when connected to the process of interest.

To do TPC H at scale 100G on a 32 thread machine, the virtuoso.ini should have the following settings. Only settings at non-default values are mentioned:

[Parameters]
ServerThreads              = 100
NumberOfBuffers            = 5000000

;  The working set of the 100G database when stored column-wise is 38G,
;  so configure this much for database buffers:
; 5000000 / 128            = 39G

;  Read committed
DefaultIsolation           = 2

;  SQL optimizer work space
MaxMemPoolSize             = 100000000

;  Use large vectors when appropriate
AdjustVectorSize           = 1

;  Split queries in up to 32 independently evaluable fragments,
;  so up to 32 threads per query.
ThreadsPerQuery            = 32

;  Thread pool has 32 worker threads divided across all queries
;  in addition to the client thread taken by each query.
AsyncQueueMaxThreads       = 32

;  All queries can collectively take up to 30G without space
;  saving measures being applied. Space saving measures are
;  running on small vector size when a large size would be
;  faster and the use of partitioned (multiple pass) hash join.
MaxQueryMem                = 30G

;  Of the MaxQueryMem, hash join hash tables can take up to 30G,
;  i.e. all of it. A single hash table will only take half of the
;  remaining space, though. So if 2 queries that both need a 30G
;  hash table run at the same time, the first will do 2 passes,
;  taking a hash table of 15G at a time, This leaves 20G space.
;  The second will have 15G left, of which it will take half,
;  7.5G. This will require 4 passes over the data.
HashJoinSpace              = 30G

To analyze the performance of a query workload:

  1. Turn on query logging in sys_query_log. This view contains most metrics and the full text of the query plan with the per operator timings and cardinalities:

    SQL> prof_enable (1);
    
  2. Print long result columns without truncating:

    SQL> set blobs on;
    
  3. Run the query.

  4. Use the profile function for a convenient overview of query execution. For example:

    SQL> profile ('SELECT COUNT (*) FROM orders, lineitem WHERE l_orderkey = o_orderkey');
    

    See summary of execution time, CPU%, compilation time and IO

  5. Read the relevant event counters, resetting the count for the next query. For example:

    SQL> select cl_sys_stat ('tc_qp_thread', clr => 1);
    

    The relevant counters are:

    • tc_qp_thread -- How many threads were started for query parallelization. This is not the number of parallel threads as not all of these threads needed to be running at the same time.

    • tc_part_hash_join -- If a hash join is partitioned, i.e. needs to make multiple passes on over the data, this is the count of passes. This is incremented by 2 if the hash join does 2 passes and not incremented if the hash join goes in a single pass. Normally this should stay at 0 or the hash join space (HashJoinSpace in init, see above) should be increased.

    • tc_no_mem_for_longer_batch -- This is the count of times the execution engine did not switch to large vectors because there was not enough space. This should normally be 0, if this is not so, increase MaxQueryMem in the ini file.

    • tc_slow_temp_insert -- If a distinct or group by temporary space grows over the available query memory, a another data structure will be used so that the hash table can be paged out to disk. This is tens of times less efficient than the memory only structure. This counter is the count of rows inserted into a page-able group by or distinct hash table. This should be 0, if not, increase MaxQueryMem.

    • mp_max_large_in_use -- This is the maximum amount of query memory that has been allocated to date. Reset this before the query of interest, run and read the counter. This is the peak simultaneous memory use by the query.

    • mp_large_in_use -- This is the current amount of query memory in use. Do not reset this.

    • c_max_large_vec -- This is the MaxQueryMem init setting in bytes. This can be altered at run time with __dbf_set .

High Cardinality Group By And Distinct

There are multiple implementations of group by and distinct used in different circumstances by different versions of Virtuoso. Versions of Virtuoso 7 prior to 7.5 have a memory based cuckoo hash based group by and hash join. 7.5 and onwards have a linear hash based implementation of same. Additionally, all version 7's have a pageable, i.e. potentially disk based implementation of group by that may get used if there is no space in memory.

Partitioned Group By

Versions 7.5 and onwards have a choice of a partitioned or re-aggregated parallel group by. The group by with re-aggregation produces an intermediate result on each thread and then adds these up after the completion of each thread into a final result. This is efficient for low-cardinality cases (few distinct values of grouping columns) but inefficient for high cardinality. In the high cardinality case, it is better to use a partitioned group by operator. In the partitioned case, a hash is computed from the grouping columns and this decides which thread gets to do the aggregation. There is information exchange between threads (or processes in a scale-out setting) but there is no need for re-aggregation. This is significantly more performant with many groups. The peak memory utilization is the same.

The c_setp_partition_threshold setting in the Flags section of virtuoso.ini defines when to use a partitioned group by. The default threshold is 100000. If the cost model estimates more distinct values than this threshold, a partitioned group by will be produced. In a query plan with explain of profile, a partitioned group by is present when there is a stage <n> operator in front of the sort operator for the group by. In a cluster plan, the partitioned group by may be colocated with the table right before it, so the stage operator will be before this.

Ordered Group By

For version 7.5 onwards, a special case for ordered group by is recognized. In an ordered group by, the system takes advantage of physical data ordering to limit the memory footprint of the group by: If it is known that grouping keys will not reoccur past a certain window, then all grouping keys that are processed can be emitted as results in a continuous stream so that only a window of groups need be kept at a time.

This is recognized in a query plan by the word "streaming" occurring on the line of the sort operator for the group by.

Ordered group by is rare to non-existent with RDF, where data orderings with an application meaning do not occur, due to having no multipart keys. With a SQL schema grouping on a set of columns including the primary ordering column of the outermost scan in a plan, an ordered group by will typically be generated. An ordered group by is usually better than a partitioned or re-aggregated group by. The exception may be cases where the outermost ordering column has low cardinality, or a very biased value distribution e.g. the P in the RDF PSOG.

Memory for Group By

In Virtuoso 7 prior to 7.5, the memory for a group by is (3 + n_keys + n_aggregates) * 8 bytes per group plus the natural length of any dates or strings in the keys or aggregates. In 7.5 onwards, the memory is 8 bytes less per entry.

Slow Group By

The use of the disk-based group by is easily 100x less efficient than the vectored memory based hash tables. Falling back to this can be considered an error in configuration and can be detected with the tc_slow_temp_insert counter readable with sys_stat . If this increases then there is not enough memory for group by/distinct as used by the application. The monitor may print a message about this in the server messages log.

The maximum size of a memory based group by hash table is by default 1/10 of the MaxQueryMem setting in Parameters in the virtuoso.ini file. If a single hash table exceeds this or if the hash table plus the amount of large query memory blocks (e.g. vectors) allocated across the process exceeds MaxQueryMem, then the switch to the slow pageable hash table takes place. This can be avoided by explicitly setting a larger MaxQueryMem or by changing the cha_max_gby_bytes setting in the Flags section of the ini file. We note that for parallel query plans the first condition to trigger is the condition on mp_large_in_use + the hash table memory exceeding MaxQueryMem. A query will have more than 10 threads and the hash tables for memory based hashes are included in mp_large_in_use .

To track the general memory consumption in large blocks, use the mp_large_in_use and mp_max_large_in_use meters accessible with sys_stat and __dbf_set . Large blocks are for example vectors, group by's and hash join build sides.

To interactively try different memory settings, one can modify c_max_large_vec with __dbf_set . This is the number of bytes of outstanding large blocks after which pageable group by's will be used and vector sizes will no longer be adjusted upward.

Tuning Parameters for Multiuser Workloads

This section describes parameters and event counters that affect memory and parallelism and are specially relevant for high concurrency situations. Different switches and event counters are described. Setting these requires dba privileges. The event counters are accessed with sys_stat and set with __dbf_set . For example:

select sys_stat ('tc_qp_thread');

or

__dbf_set ('dc_batch_sz', 10000);

Some of the parameters have corresponding INI file settings. All of these can be set in an INI file in the [Flags] section with a stanza like:

dc_batch_sz = 10000

First, make sure that the ServerThreads in either [Parameters] ini section (if application with connected clients) or in [HTTPServer] ini section (if over web protocols), is larger than the number of actual connections.

Next, the workload should be profiled to see its memory utilization and intrinsic use of parallelism. Splitting a query into independently executable parallel fragments increases its memory utilization, Switching to larger vector sizes also increases a query's memory utilization while providing increased locality of data access. Together these techniques can result in a 100+ fold difference in the transient memory consumption of a query. Use of hash joins may also increase the transient memory consumption since in a hash join one half of the join must be materialized.

With relatively small numbers of concurrent queries, e.g. 10, these techniques are nearly always beneficial. With hundreds of queries the memory overhead may become a limiting factor.

The MaxQueryMem INI file setting controls how much memory is kept for query execution. This amount of space is maintained allocated. Transient memory use may exceed this but then the memory above this amount is unmapped when no longer needed. Mapping and unmapping memory takes time. Concurrent mmap and munmap calls on different threads will serialize and bring down the CPU utilization of a process even when there is enough parallelism for full platform utiliization.

The mp_mmap_clocks counter is the cumulative amount of real time in CPU clocks spent in mmap or munmap systtem calls. This can increase faster than real time if multiple threads are involved.

If this value grows fast, e.g. at over 10% of real time, the monitor prints a message in the server event log. This is an indication that memory utilization may have to be tuned.

For analyzing the behavior of a workload as concerns memory, do the following:

  • Run the workload once for warmup.

  • Clear the following counters:

    • mp_max_large_in_use -- Highest to date amount of concurrently used memory

    • tc_qp_thread -- count of times a query made a parallel executive fragment. Not increased if running single threaded.

    • tc_adjust_batch_sz -- count of times vector size was switched to larger

    • mp_mmap_clocks -- cumulative time inside mmap or munmap.

    For example:

    __dbf_set ('mp_max_large_in_use', 0);
    
  • Run the workload on a single thread with default settings.

  • Observe the values.

Example Scenario
  1. Clear the counters.

  2. Set enable_dyn_batch_sz to 0, causing queries to only use the initial vector size, dc_batch_sz or VectorSize in the INI file.

  3. Set enable_qp to 1, causing queries to run single threaded.

  4. Run the workload on a single thread. The difference in elapsed real time will show how much benefit the workload has from intra-query parallelization and long vector sizes. If the workload used any of the memory consuming techniques the peak memory in the second case will be lower. How much lower is entirely workload dependent.

    • Note : One may additionally try the workload on a single thread with hash_join_enable set to 0 and 2, respectively. 0 means no hash join plans are made, 2 means that hash join is used when appropriate for either SQL or SPARQL queries. The peak memory utilization and run times may be significantly affected.

  5. Having completed these metrics, one may move to the multi-user case. Note that MaxQueryMem , ( c_max_large_vec setting with __dbf_set ) should be set to a reasonable value, e.g. the peak consumption with the chosen settings times the number of parallel sessions.

  6. Expected Results -- Experience demonstrates for example that with a 128 concurrent users setting enable_qp to 1 and enable_dyn_batch_sz to 0 increased the throughput of the sample workload by a factor of 2.5. The workload derived, even in single user mode, little benefit from dynamic vector size or multithreading, under 30%. In the case at hand, the difference in performance was mostly accounted for by mmap , see the mp_mmap_clocks counter.

    • If more concurrent queries are than CPU cores are expected, there is little point in intra-query parallelism, controlled by ThreadsPerQuery in the INI file or enable_qp in __dbf_set .

    • If mp_mmap_clocks continues to grow fast during the execution one may increase c_max_large_vec . This will cause more mmap 's to be kept in reserve, thus in principle decreasing the frequency of mmap and munmap system calls.

    • If running with hash join enabled, there is a possibility of partitioned hash joins where the query executes in multiple passes in order to build smaller hash tables. This is given by tc_part_hash_join , which is increased by 1 for each non-first pass over a hash join. If the counter increases the HashJoinSpace setting in the INI file should be increased. With __dbf_set this is chash_space_avail .

  7. Notes:

    • The results from above may not happen with a single user but happen all the time with multiple users. The monitor will print a warning message about this in the message log.

    • One may also try a different default vector size, specially if very pressed on memory. The default of 10000 values is generally a good small value but smaller may be possible, however not under 1000.

Query Logging

As of version 7.00 Virtuoso offers optional server side query logging. This records a large number of execution statistics as well as the full query text and execution plan with per-operator timing and cardinality information.

This feature is enabled in the Parameters section of virtuoso.ini:

; virtuoso.ini
...
[Parameters]
QueryLog = filename

At run time, this may be enabled or disabled with prof_enable (), overriding the specification of the ini file. The default file name for the query log is virtuoso.qrl in the server's working directory, if not otherwise specified in the ini file.

The file is in binary format and is not conveniently readable as such. It is most easily accessed via the DB.DBA.SYS_QUERY_LOG system view. This view has parameters for specifying a non-default file path for the query log file as well as a datetime range for selecting the entries of interest.

For example:

SELECT *
  FROM sys_query_log
 WHERE qrl_file = 'other.qrl'
   AND qrl_start_dt = cast ('2011-10-1' as datetime)
   AND qrl_end_dt = cast ('2011-10-2' as datetime)

This will select entries from the file other.qrl that are between the given dates. Note that the qrl* columns are not actual result column of the view but are considered as parameters, hence the use of = instead of a range condition.

All statements executed over a SQL client connection (ODBC/JDBC etc) are logged, as well as any statements executed on an SPARQL end point. DAV and web services requests are not logged unless they perform an exec function call. CPU usage for committing transactions or for background data organization such as autocompact or automatic checkpoints are not logged.

A select statement is logged at the time it produces its last row of output, not at the time this row of output is fetched by a client.

Some of the columns are in units of clocks, whose meaning varies from system to system. On Intel this corresponds to the value returned by the RDTSC instruction. All values are intervals. The relation of this to real time may vary in function of automatic variation of clock frequency.

The columns of SYS_QUERY_LOG are as follows:

  • ql_id

    -- Serial number of the log entry. If the server was started many times with the same file these will not be unique. Combine with

    ql_start_dt

    for unique identification.

  • ql_start_dt

    -- datetime of the start of the query

  • ql_rt_msec

    -- real time elapsed in milliseconds between the start and the logging of the query

  • ql_rt_clocks

    -- Clock cycles of real time spent running the query, not including time between consecutive fetches from a client if the query was a cursor fetched in multiple chunks by a client. This is the number of clocks during which there was at least one thread running on behalf of the query. The average CPU% of the query is given by:

    100 * ql_thread_clocks / ql_rt_clocks
    
  • ql_client_ip

    -- Requesting client IP as dotted decimal.

  • ql_user

    -- User account on behalf of which the query was executed.

  • ql_sqlstate

    -- SQL state if query terminated with error, NULL otherwise.

  • ql_error

    -- Error message if query terminated with error, NULL otherwise.

  • ql_swap

    -- Cumulative count of major page faults since startup of this Virtuoso process.

  • ql_user_cpu

    -- Cumulative user CPU in milliseconds for this server process.

  • ql_sys_cpu

    -- Cumulative system CPU in milliseconds for this server process.

  • ql_text

    -- Source text of the query

  • ql_params

    -- NULL.

  • ql_plan_hash

    -- Hash number calculated from the execution plan, ignoring literals. Can be used for grouping executions of the same query with differing literals together. If difference of literals causes a different plan, this number will be different.

  • ql_c_clocks

    -- CPU clocks of real time used for query compilation. This will be 0 if the query is separately prepared of if the query compilation comes from a cache of recently compiled queries. This is likely if the query is parametrized and executed multiple times.

  • ql_c_msec

    -- Real time used for query compilation in milliseconds.

  • ql_c_disk

    -- Count of disk reads done on behalf of the query compilation, this stands for index sampling initiated by the compilation and does not include any speculative read possibly triggered by the sampling.

  • ql_c_disk_wait

    -- Count of clocks the compilation was blocked waiting for disk.

  • ql_c_cl_wait

    -- Count of clocks the compilation was waiting for information from cluster peers. Such waiting indicates sampling done on remote partitions. If the run time of the query is small, this may be a significant factor in query execution real time.

  • ql_cl_messages

    -- Count of distinct cluster messages sent on behalf of the compilation. These are all related to sampling. Many samples can be combined into one message in some situations. Samples are also cached on the requesting server so repeatedly compiling the same statement will send the messages only the first time in unless the cache has timed out in the meantime.

  • ql_c_rnd_rows

    -- Count of rows retrieved as part of compile time sampling.

The below columns correspond directly to the fields returned by db_activity (). These are summed over all the threads in all the hosts that have done something on behalf of the logged query.

  • ql_rnd_rows

    -- Count of random row lookups. Each sequential lookup begins with one random one for each partition concerned.

  • ql_seq_rows

    -- Sequential rows fetched, each non-first row which is selected counts as one. A row that is looked at but does not satisfy applicable query criteria does not count.

  • ql_same_seg

    -- Count of times the next random lookup in a vectored lookup falls on the same column-wise segment as the previous.

  • ql_same_page

    -- Ibid, for the next lookup falling on the same row-wise leaf page.

  • ql_same_parent

    -- Ibid, for the case of the next lookup falling on a sibling page of the row-wise leaf page of the previous lookup.

For column-wise indices, all the three above counters can be non-zero since these consist of multi-row segments each under a row on a row-wise leaf page. For a row-wise index the same seg counter is always 0.

  • ql_thread_clocks

    -- Sum of clocks spent on any thread for the logged query. Time spent waiting for other threads, for disk or for replies from cluster peers is not counted, thus only running cycles count. These are added up across all hosts in a cluster.

  • ql_disk_wait_clocks

    -- Total clocks any thread spends waiting for disk on behalf of the query. If two threads wait for the same page which is fetched once the wait is counted double. This is not the same as the total read time of the pages since read ahead can fetch pages before they are needed, thus involving no wait.

  • ql_cl_wait_clocks

    -- Total clocks a thread running on behalf of the query spends waiting for a cluster reply. This may be zero if the coordinating thread has work until any cluster replies arrive, in which case there will be no wait. If this is high then the workload is bound by interconnect or is unevenly distributed across a cluster.

  • ql_pg_wait_clocks

    -- Count get page wait.

  • ql_disk_reads

    -- Count disc reads.

  • ql_spec_disk_reads

    -- Count of speculative disk reads triggered on behalf of the query. Any read ahead or any upgrading of a single page read into a read of a whole extent counts towards this, only allocated pages that are read are counted but merging near-adjacent reads may cause actually more disk IO to take place.

  • ql_messages

    -- Count of distinct cluster messages sent on behalf of the query. Any message is counted once. Client-server messages are not counted.

  • ql_message_bytes

    -- Total bytes sent in all cluster messages sent on behalf of the query.

  • ql_qp_threads

    -- Count of times an extra thread is created for parallelizing work on the query in question. The count may be high since consecutively launched threads are counted, this is not a maximum degree of concurrency.

  • ql_vec_bytes

    -- reserved

  • ql_vec_bytes_max

    -- reserved

  • ql_lock_waits

    -- Count of times a thread has waited for a lock on behalf of the query.

  • ql_lock_wait_msec

    -- Total milliseconds any thread has spent waiting for a lock on behalf of the query. This may be longer than real time since many threads may wait at the same time.

  • ql_plan

    -- Text representation of the execution plan, annotated with CPU time percentages and fanouts for the different operators. Fanout is the count of output rows divided by the count of input rows.

  • ql_node_stat

    -- reserved

  • ql_c_memory

    -- Count of bytes allocated for compiling the query. This is the peak size of the memory pool for query compilation.

  • ql_rows_affected

    -- Count of inserted/updated/deleted rows. If the query was a select with a top and an order by, this is is the count of produced rows before the top restriction was applied.

Meters & System Views

DB.DBA.SYS_K_STAT, DB.DBA.SYS_D_STAT, DB.DBA.SYS_L_STAT view

These views provide statistics on the database engine

create view SYS_K_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'n_landings') as landed,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'total_last_page_hits') as consec,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'page_end_inserts') as right_edge,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'page_end_inserts') as lock_esc
        from SYS_KEYS;
create view SYS_L_STAT as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') as locks,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') as waits,
        (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_waits') * 100)
          / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_set') + 1) as wait_pct,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'deadlocks') as deadlocks,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'lock_escalations') as lock_esc
        from SYS_KEYS;
create view sys_d_stat as
  select KEY_TABLE, name_part (KEY_NAME, 2) as index_name,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') as touches,
        key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') as reads,
        (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'reads') * 100)

 > / (key_stat (KEY_TABLE, name_part (KEY_NAME, 2), 'touches') + 1) as read_pct
        from SYS_KEYS;

These views offer detailed statistics on index access locality, lock contention and disk usage.

'reset' specified as the stat name will reset all counts for the key in question.

SYS_K_STAT - Key statistics
  • KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES

  • INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.

  • LANDED The count of random accesses, including inserts. Any insert or select, whether empty, single line or multi- line counts as one. Updates and deletes do not count, as they imply a select in the same or previous statement.

  • CONSEC The number of times a random access falls on the same page as the previous random access. This is always less than LANDED. For repetitive access to the same place or an ascending insert, this will be near LANDED. For a totally random access pattern this will be near 0.

  • RIGHT_EDGE The number of times an insert has added a row to the right edge of the page where the insert was made.

  • LOCK_ESC The count of lock escalations, see SYS_L_STAT.

SYS_L_STAT
  • KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES

  • INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.

  • LOCKS The number of times a lock has been set on the index. Making a new row or page lock counts as one. Entering a row or page lock either after a wait or without wait (for a shared lock) counts as one.

  • WAITS The number of times a cursor reading this index waited for a lock. Note that this can be higher than the number of locks set, e.g. a 'read committed' cursor may wait for a lock but will never make one.

  • WAIT_PCT The percentage of lock set events that involved a wait.

  • DEADLOCKS The number of times a deadlock was detected when trying to wait for a lock on this index. Note that one deadlock event may involve locks on several indices. Each deadlock detection counts as one.

  • LOCK_ESC The number of times the set of row locks on a page of this index where escalated into one page lock or a page lock was set initially. This is always less than LOCKS. This value will be near LOCKS when there are many sequential selects which switch to page lock mode. This happens when a cursor has performed over 2 lock escalations and the page being entered has no locks, i.e. the lock can be set over the entire page.

SYS_D_STAT
  • KEY_TABLE The fully qualified table name, e.g. DB.DBA.SYS_PROCEDURES

  • INDEX_NAME The name of the index. This will be equal to the table name for the table's primary key.

  • TOUCHES The number of times a row is located on the index. Every row retrieved by a select or inserted counts as one. All rows scanned by an select count or other aggregate counts as one.

  • READS The number of times a disk read was caused by a read operation on this index. This may theoretically be higher than TOUCHES, since several levels of the index tree may have to be read to get to a leaf.

  • READ_PCT The percentage of READS in TOUCHES.

Example6.14.Examples:

select index_name, locks, waits, wait_pct, deadlocks
    from sys_l_stat order by 2 desc;

Get lock data, indices in descending order of lock count.

select index_name, touches, reads, read_pct
    from sys_d_stat order by 3 desc;

Get disk read counts, index with most reads first.

select index_name, (consec * 100) / (landed + 1)
    from sys_k_stat where landed > 1000  order by 2;

Get the percentage of consecutive page access on indices with over 1000 accesses so far, most randomly accessed first.

status SQL function - status ();

This function returns a summary of the database status as a result set. The result set has one varchar column, which has consecutive lines of text. The lines can be up to several hundred characters.

The contents of the status summary are described in the Administrator's Guide .

Virtuoso db file usage detailed info

All data in a virtuoso database are logically stored as database key rows. Thus the primary key for a table holds the entire row (including the dependent part) and the secondary keys just hold their respective key parts. So the space that the table occupies is the sum of the space occupied by it's primary key and all the secondary keys.

The main physical unit of allocation in a virtuoso db file is the database page (about 8k in virtuoso 3.x). So the server needs to map the key rows and outline blobs to database pages.

Virtuoso will store as many rows in a db page as it can, so usually one DB page will contain more than 1 row of a given key. No page contains rows from more than one key. However blobs (when not inlined on the row) will be placed in consecutive DB pages (up to their size). In addition to the blob and key pages the Virtuoso DB will hold a number of pages containing internal data. So the sum of the pages occupied by the key rows and the blobs is leas then the amount of occupied pages (as reported by the status() BIF).

To provide detailed information about the space consumption of each key there's a system view:

DB.DBA.SYS_INDEX_SPACE_STATS
    ISS_KEY_TABLE       varchar -- name of the table
    ISS_KEY_NAME        varchar -- name of the key
    ISS_KEY_ID          integer -- id of the key (corresponding to KEY_ID from DB.DBA.SYS_KEYS)
    ISS_NROWS           integer -- number of rows in the table
    ISS_ROW_BYTES       integer -- sum of the byte lengths of all the rows in the table
    ISS_BLOB_PAGES      integer -- sum of the blob pages occupied by the outline blobs on all the rows of the table
    ISS_ROW_PAGES       integer -- sum of all the db pages containing rows of this key
    ISS_PAGES           integer -- = ISS_BLOB_PAGES + ISS_ROW_PAGES (for convenience).

Each select on that view causes the server to go over all the db pages in the db file (similarly to how the crash dump operates) and collect the statistics above. The pages are traversed 1 time per select, but still on large database files that may take some time.

Transaction Metrics, Diagnostics and Optimization

Bad design and implementation of transactions affects applications in the following ways:

  • Performance is lost by having to needlessly retry transactions that are aborted by deadlock.

  • Concurrency is lost by having rows stay locked for too long.

  • Memory is transiently consumed, adversely affecting working set, by keeping data structures for too many simultaneous locks, rollback records and uncommitted roll forward logs.

The following rules should be observed when writing transactions:

  • Do not lock needlessly. For example, any report transaction that reads the data once can always be done as read committed instead of repeatable read without affecting semantics. Even if some data is read multiple times, the repeatable read semantic is typically not relevant for reports.

  • Lock for what is needed. If you mean to update later, do the initial read with exclusive locks. See the for update clause in select, for example.

  • Lock in constant order. If you must lock different resources in one transaction, lock them always in the same order. When updating stock for an order, update the quantity on hand in increasing order of item number, for instance.

  • Lock the item with the least contention first. For example, update the detail before updating the summary. Update the quantity in stock for the ordered items before updating the orders count of the whole warehouse.

  • Keep transactions short. Use stored procedures. Use the explicit commit work statement.

  • For each transaction in a stored procedure, make sure that if it is deadlocked, the deadlocked transaction gets retried. For example, have a "declare exit handler for sqlstate 40001" for every transaction context. Make sure that a deadlocking transaction is never retried endlessly. Two mutually deadlocking transactions can keep retrying and again deadlocking each other endlessly. To avoid this, have a maximum count of retries and a random delay before restarting. The restart delay should be between 0 and the expected duration of the transaction.

  • Always break batch updates into multiple transactions. Update a few thousand or tens of thousands of rows per transaction, never more than that. Failing to do this makes for prohibitive cost of retry with deadlocks and can cause swapping by keeping tens or hundreds of megabytes in rollback state, locks and other transaction temporary structures. This happens if one inserts, updates, deletes several million rows in a single transaction. If this is really needed and concurrency is no issue, use the atomic mode, effectively making the server single user for the transaction, thus having no locking or rollback. See the use of the

    __atomic()

    function in the Cluster Administration API .

Programming Virtuoso/PL

The isolation level is set in Virtuoso/PL with the

set isolation := level;

statement, where level is one of 'serializable', 'repeatable', 'committed', 'uncommitted'. Example :

set isolation = 'serializable';

The standard SQL syntax is also supported :

SET TRANSACTION ISOLATION LEVEL <isolation_level>
isolation level : READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE

Example :

SET TRANSACTION ISOLATION LEVEL READ COMMITTED

The effect is for the rest of the procedure and any procedure called from this procedure. The effect stops when the procedure having executed the set isolation statement returns.

Transaction Parallelism

From Virtuoso 7.1 onwards, data manipulation statements can be multithreaded. The virtuoso.ini [Flags] settings enable_mt_txn and enable_mt_transact control transaction parallelism. These are 1 for enable and 0 for disable. Defaults are version dependent, off in 7.1 and 7.2 and on in subsequent. The sys_stat() and __dbf_set() functions can be used for querying and altering these.

DML statements, e.g. insert, delete, update, are automatically multithreaded according to enable_qp (ThreadsPerQuery in ini [Parameters] when enable_mt_txn is on).

RDF load (DB.DBA.ttlp() and related functions) when in transactional mode (log_enable 0 or 1) are also multithreaded when enable_mt_txn is on. Otherwise these are multithreaded when in non-transactional mode (log_enable 2 or 3) and single threaded in transactional mode.

The enable_mt_transact setting controls the parallelism of a commit or rollback. Specially for column store deletes, the commit may take a long time, as this is the point where the physical deletion takes place. To accelerate this, enable_mt_transact should be on. This is however not worthwhile for short row store transactions, as the overhead of making threads is greater than the time it takes to finalize a transaction om a page. Again, large, scattered row store deletes may justify use of enable_mt_transact .

Sample Deadlock Handler

The text for a deadlock handler is

  declare retry_count int;
  retry_count := 0;
retry:
  {
    declare exit handler for sqlstate '40001'
      {
        rollback work;
        ...
        delay (rnd (2.5));  --- if 2.5 seconds is the expected duration of
the transaction.
        ...

        retry_count := retry_count + 1;
        if (retry_count > 5)
          signal ("xxxxx", "Too many deadlock retries in xxxxx.");
        goto retry;
      }
   -- do the operations. The actual working code here.

    commit work;
  }

An exclusive read is done with

select s_quantity from stock where s_i_id = 111 for update;
ODBC

For the Virtuoso ODBC driver the isolation is set by :

  • connection option (in either SQLSetConnectAttr () or SQLSetConnectOption () )

       rc = SQLSetConnectOption (hdbc, SQL_TXN_ISOLATION, SQL_TXN_READ_COMMITTED);
    

    or

      rc = SQLSetConnectAttr (hdbc, SQL_TXN_ISOLATION, SQL_TXN_READ_COMMITTED, NULL);
    

    Constants are : SQL_TXN_READ_UNCOMMITTED, SQL_TXN_READ_COMMITTED, SQL_TXN_REPEATABLE_READ, SQL_TXN_SERIALIZABLE

  • ODBC setup dialog option : In Windows there is a drop-down combo box to set the default transaction isolation level for a connection.

  • connection string element : You may specify the default transaction isolation level for a given connection in it's connect string (passed to the SQLDriverConnect () ODBC API). Example :

       SQLDriverConnect (hdbc, hwnd,
           "DSN=MyDSN;IsolationLevel=Repeatable Read;UID=dba;PWD=dbapwd", SQL_NTS,
           NULL, 0,
           NULL,
           SQL_DRIVER_NOPROMPT).
    

    The possible options for the connection string are : "Read Uncommitted", "Read Committed", "Repeatable Read" and "Serializable"

JDBC

In the Virtuoso JDBC driver the isolation is set by the java.sql.Connection.setTransactionIsolation() JDBC API.

  conn.setTransactionIsolation (java.sql.Connection.TRANSACTION_SERIALIZABLE)

The constants are described in the Java Docs

.Net

In the VirtuosoClient.NET provider the isolation is set by the System.Data.IDbConnection.BeginTransaction Method (IsolationLevel) function.

  System.Data.IDBTransaction trx = conn.BeginTransaction (System.Data.IsolationLevel.ReadCommitted)

The constants are described here

Metrics and Diagnostics

Metrics are presented at the server and the table level.

The first metric to check is the output of status ('');

The paragraph titled transaction status contains the following:

  • Count of deadlocks since server startup. There is a total number of deadlocks and the number of 2r1w deadlocks. The latter is a special case where two transactions both hold a shared lock on a resource and one tries to convert the lock to exclusive. This situation can kill the transaction attempting to write. Such deadlocks are essentially always needless. These are avoided by reading for update when first reading the resource.

  • Count of waits since server startup. This is incremented every time some operation waits for a lock, except if this wait leads into a deadlock. If the number of deadlocks is high, let's say over 5% of the number of waits, transactions are likely badly designed and deadlock too often, either because of not locking for write at the get go (2r1w) or because of locking resources in varying order.

  • Count of threads running. This is the count of all threads that are somehow occupied, whether running or waiting. This count minus the count of waiting minus the count of threads in vdb is the count of threads that in principle could be on CPU.

  • Count of threads waiting. This is the count of threads that are waiting for a lock. If this is a high percentage of the count of threads running, say over 30%, resources are likely locked inefficiently, keeping too many locked.

  • Count of threads in vdb. This is the count of threads that are at the time waiting for I/O either from a remote database or any sort of network operation, including access to web services on other serbers, access to web pages on other hosts etc.

The system view db.dba.sys_l_stat is used for locating bottlenecks.

The columns are:

  • index - The index being locked. Note that when reading on non-primary key, the lock is set on the index first, only then on the pk, that is if the pk is accessed at all. For all updates however, the pk will always be accessed. *locks - The count of times a lock was set on this index.

  • waits - The number of times there was a wait on this index. There can be more waits than locks because a read committed cursor can wait but will not lock, thus all waits do not result in locks.

  • wait_pct - The percentage of times setting a lock waited. (100 * waits) / locks

  • deadlocks - The number of times a deadlock was signalled when attempting to wait for a lock on this index.

  • lock_esc - The number of times a set of row locks was converted into a page lock on this index.

  • wait_msecs - The total amount of real time spent by some thread waiting for a lock on this index. This may be greater than elapsed time because many threads can wait at the same time.

All counts and times are cumulative from server startup.

The interpretation is as follows:

If deadlocks is high in relation to waits or locks, i.e. over 5%, there are many deadlocks and the transaction profiles may have to be adjusted. The table where deadlocks is incremented is the table where the deadlock was detected but the deadlock may involve any number of tables. So, if A and B are locked in the order A, B half of the time and B, a the rest of the time, then the deadlocks of the tables of A and B will be about the same, half of the deadlocks being detected when locking A, the other half when locking B.

If waits is high in relation to locks, for example 20%, then there is probably needless contention. Things are kept locked needlessly. Use read committed or make shorter transactions or lock items with the less contention first.

Because transaction duration varies, the place with the highest count of waits is not necessarily the place with the heaviest contention if the waits are short. Use wait_msecs in addition to waits for determining where the waiting takes place.

To get a general picture, use the Conductor's Statistics page or simply do:

select top 5 *
  from sys_l_statt
 order by wait_msecs desc;

to get a quick view of where time is spent. You can also sort by waits desc or locks desc.

SQL Issues

It is possible to get bad locking behavior if the SQL compiler decides to make linear scans of tables or indices and the isolation is greater than read committed. The presence of a linear scan on an index with locking is often indicated by having a large number of lock escalations. If lock_esc is near locks then a large part of the activity is likely sequential reads.

The general remedy is to do any long report type transactions as read committed unless there are necessary reasons to do otherwise.

To see how a specific query is compiled, one can use the explain () function. To change how a query is compiled, one can use the table option or option SQL clauses.

Observation of Dynamics

Deadlocks and contention do not occur uniformly across time. The occurrences will sharply increase after a certain application dependent load threshold is exceeded.

Also, deadlocks will occur in batches. Several transactions will first wait for each other and then retry maybe several times, maybe only one succeeding at every round. In such worst cases, there will be many more deadlocks than successfully completed transactions. Optimize locking order and make the transactions smaller.

Looking at how counts change, specially if they change in bursts is useful.

Tracing and Debugging

The most detailed picture of a system's behavior, including deadlocks nd other exceptions cn be obtained with profiling. If the application is in C, Java or some other compiled language, one can use the language's test coverage facility to see execution counts for various branches of the code.

For client applications, using the Virtuoso trace() function can be useful for seeing which statements signal errors. Also the profiling report produced by prof_enable () can give useful hints on execution times and error frequencies. See Profiling and prof_enable () .

For PL applications, Virtuoso provides a profiler and test coverage facility. This is activated by setting PLDebug = 2 in the Parameters section of the ini file and starting the server. The functions cov_store () and cov_report are used for obtaining a report of the code execution profile thus far. See the documentation on "Branch Coverage" for this. The execution counts of the lines for exception handler code will show the frequency of exceptions. If in a linear block of code a specific line has an execution count lower than that of the line above it, then this means that the line with the higher count has signalled as many exceptions as the difference of the higher and lower count indicates.

The times indicated in the flat and call graph profile reports for PL procedures are totals across all threads and include time spent waiting for locks. Thus a procedure that consumes almost no CPU can appear high on the list if it waits for locks, specially if this is so on multiple threads concurrently. The times are real times measured on threads separately and can thus exceed any elapsed real tiime.

Single stepping is not generally useful for debugging locking since locking issues are timing sensitive.

Client Level Resource Accounting

Starting with version 6, Virtuoso keeps track of the count of basic database operations performed on behalf of each connected client. The resource use statistics are incremented as work on the connection proceeds. The db_activity () function can be called to return the accumulated operation counts and to optionally reset these.

The db_activity built-in function has one optional argument. The possible values are:

  • 0 - (default) - Return human readable string and reset the counts.

  • 1 - return an array of numbers and reset the counts.

  • 2 - return a human readable string and leave the counts.

  • 3 - return an array of numbers and leave the counts.

The human readable string is of the form:

   22.56MR rnd  1.102GR seq     10P disk  1.341GB /  102.7K messages
  

The postfixes K, M, G, T mean 10^3 to 10^15, except when applied to bytes, where these mean consecutive powers of 1024.

The numbers, left to right are the count of random row lookups, sequential row lookups, disk page reads, cluster inter-node traffic in bytes and cluster inter-node message count as an integer number of messages. If the configuration is a single server, the two last are 0.

The random and sequential lookup counts are incremented regardless of whether the row was found or not or whether it matched search conditions.

If the information is retrieved as an array, the array contains integer numbers representing these plus some more metrics.

Index - Meaning

  • 0 - Random lookups

  • 1 - sequential lookups

  • 3 - lock waits

  • 4 - total msec spent in lock wait on some thread. In a cluster situation, this may be more than elapsed real time.

  • 5 - Disk reads

  • 6 - Speculative disk reads. These are also counted in disk reads. A speculative read is a prefetch made on the basis of a read history of a disk extent.

  • 7 - Cluster inter node message count

  • 8 - Total bytes sent in cluster inter node traffic. A message is counted once, when sent.

If the thread calling db_activity is a web server thread, the totals are automatically reset when beginning the processing of the current web request.