20.2.2. Choosing An Application Specific Document ID

The free text index conceptually works by making an index entry for each distinct word of each indexed column value; each entry refers back to the row containing the indexed data. The table must therefore have a unique ID that is stored with each distinct word of the indexed column in the text index. For space efficiency this ID should be as short as possible. If nothing else is specified, the CREATE TEXT INDEX statement adds such a unique column and fills it automatically from a sequence producing unique numbers. If a single-part integer primary key exists, that key is used as the free text document ID. Note, however, that values <= 0 are prohibited. These defaults are not always optimal, so the application may specify which column identifies the row for text indexing. Such a unique column is referred to as the Free Text Document ID.
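The idea of an index entry per distinct word that points back to a row can be illustrated with a toy inverted index. This is only a conceptual sketch in Python, not Virtuoso's storage format; the function name `build_text_index` and the sample rows are made up for the illustration.

```python
from collections import defaultdict

def build_text_index(rows):
    """Map each distinct word to the sorted list of document IDs containing it.

    `rows` is an iterable of (doc_id, text) pairs. Each doc_id is assumed to be
    unique and, following the rule stated above, must be greater than zero.
    """
    postings = defaultdict(set)
    for doc_id, text in rows:
        if doc_id <= 0:
            raise ValueError("document IDs must be greater than zero")
        # One entry per distinct word of the value, each pointing back to the row.
        for word in set(text.lower().split()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical sample data: (document ID, indexed column value)
rows = [(1, "virtuoso text index"), (2, "text search"), (3, "index tuning")]
index = build_text_index(rows)
```

Because the document ID is repeated once for every distinct word of every row, its size has a direct effect on the size of the index.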

Suppose that a table contains news articles that should most frequently be retrieved latest first, in descending order of a datetime field. This can be achieved by simply selecting the matching articles and sorting them with a SQL ORDER BY clause, but this can be very inefficient: all hits must first be found and then sorted, and only then can the first hit be returned to the user. Further, the sort key must be retrieved from the table, causing a random access for each text hit. The sorting can be avoided entirely if the document ID that is used to refer to the table from the index is itself ordered by date. This has the following advantages:

To retrieve the n latest articles, one just takes the first n hits produced by the contains search; no sorting is required.
To get the next n hits, one repeats the search, now specifying the ID of the last row of the previous set as the start ID. No sorting and no scrollable cursors are required, and the first hits can be returned before all hits have been generated. This is especially useful if the search criteria match many articles.
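The two retrieval steps above can be sketched over a single word's list of document IDs. This is a minimal Python illustration, assuming the IDs are kept in ascending date order and that the start ID itself is excluded from the next page; the function `latest_hits` is hypothetical and does not reproduce the contains predicate.

```python
from bisect import bisect_left

def latest_hits(posting, n, start_id=None):
    """Return up to n document IDs from `posting`, newest first.

    `posting` is an ascending list of document IDs whose order follows the
    article date. If `start_id` is given, only IDs older than it are returned
    (assumption: the start ID itself is not repeated on the next page).
    """
    end = len(posting) if start_id is None else bisect_left(posting, start_id)
    stop = max(end - 1 - n, -1)
    # Walk backwards from the cut-off: nothing is sorted, and at most n
    # entries are touched regardless of how many hits exist in total.
    return [posting[i] for i in range(end - 1, stop, -1)]

posting = [10, 20, 30, 40, 50]          # hypothetical date-ordered IDs
page1 = latest_hits(posting, 2)          # two latest articles
page2 = latest_hits(posting, 2, start_id=page1[-1])
```

Each page costs one binary search plus n steps, which is why no scrollable cursor or intermediate result set is needed.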

The disadvantage is that a longer document ID must be stored for each distinct word of each distinct article. This may result in a 60% increase in the index size, but the cost is largely offset by avoiding the penalties of sorting. One should, however, take the utmost care to make this ID as short as possible. The maximum length of the ID is 30 bytes, but at that length storage is extremely wasteful; an ID whose fields add up to some 10 bytes is much better.

Note that the document ID can be an aggregate of several scalars. In the news article example, it could be a (datetime, ID number) pair. The pair is needed because the datetime alone typically would not be unique, whereas the document ID itself must be unique.

However, rather than storing the datetime and an integer article number, it is advisable to compress the datetime into a number, e.g. a count of minutes since a given date. This preserves the temporal order to within a minute and takes less than half the space taken by the datetime with all its fractions, time zones, etc.
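As an illustration of the compression described above, the following Python sketch converts a timestamp to a count of whole minutes since a reference date and back. The reference date and the function names are arbitrary choices made for this example.

```python
from datetime import datetime, timedelta

BASE_DATE = datetime(2000, 1, 1)  # hypothetical reference date

def minutes_since_base(ts):
    """Compress a datetime into whole minutes elapsed since BASE_DATE."""
    return int((ts - BASE_DATE).total_seconds() // 60)

def from_minutes(minutes):
    """Recover the datetime, truncated to the minute."""
    return BASE_DATE + timedelta(minutes=minutes)

stamp = datetime(2000, 1, 1, 1, 30, 45)
compressed = minutes_since_base(stamp)
```

Seconds and fractions are discarded, so two articles published within the same minute get the same number. This is why the number is paired with a unique article number rather than used alone.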

For handling multi-part IDs as if they were scalars there is a special data type, composite. Thus, if an application-specific document ID is not an integer, it must be a composite totaling less than 30 bytes of content divided among its members.
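To get a feel for the size budget, the sketch below packs a (minutes, article number) pair into a fixed-width byte string and checks it against the 30-byte limit. It is a Python analogy only: the layout of two 4-byte big-endian integers is an assumption made for the example and is not the encoding used by the composite type.

```python
import struct

MAX_ID_BYTES = 30  # limit quoted in the text above

def pack_doc_id(minutes, article_id):
    """Pack (minutes, article_id) into 8 bytes.

    Big-endian unsigned integers are used so that comparing the packed bytes
    gives the same order as comparing the (minutes, article_id) tuples.
    """
    key = struct.pack(">II", minutes, article_id)
    if len(key) >= MAX_ID_BYTES:
        raise ValueError("document ID exceeds the size limit")
    return key

ids = [(91, 1), (90, 8), (90, 7)]        # hypothetical (minutes, article) pairs
packed = sorted(pack_doc_id(m, a) for m, a in ids)
```

At 8 bytes this stays within the roughly 10 bytes suggested earlier, while still ordering first by time and then by article number.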

Prefix	IRI
schema	http://schema.org/
n5	http://creativecommons.org/licenses/by/4.0/
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
n4	http://www.openlinksw.com/#
xsdh	http://www.w3.org/2001/XMLSchema#
n2	http://docs.openlinksw.com/virtuoso/appspecificdocid/

Subject	Predicate	Object
n2:	rdf:type	schema:TechArticle
n2:	schema:name	20.2.2. Choosing An Application Specific Document ID
n2:	schema:copyrightHolder	_:vb78514
n2:	schema:datePublished	2016-09-09 16:16:54
n2:	schema:headline	20.2.2. Choosing An Application Specific Document ID
n2:	schema:keywords	OpenLink,Virtuoso,database,RDBMS,relational,SQL,RDF,triple store,linked data,linked open data,Big Data
n2:	schema:license	n5:deed.en_US
n2:	schema:publisher	_:vb78513
n2:	schema:url	n2:
_:vb78513	rdf:type	schema:Organization
_:vb78513	schema:name	OpenLink Software
_:vb78513	schema:url	n4:this
_:vb78514	rdf:type	schema:Organization
_:vb78514	schema:name	OpenLink Software
_:vb78514	schema:url	n4:this
