20.2.5. Pre-processing and Extending the Content Being Indexed
Let us consider the news application. Assume now a many to many relationship between articles and numbered news channels.
create table article_channel ( a_id integer references article, c_id integer, primary key (a_id, c_id);
Assume further that free text search criteria be combined to channel membership tests.
This could be expressed as follows
select * from article where contains (description, 'sample') and exists (select 1 from article_channel where a_id = id and c_id = ?);
We have a random access per each hit to a table with at least the population of the article table for each hit. The situation is yet worse if there is an OR of multiple channel id's to which the article may belong.
To optimize this, we may choose to add an extra word for each channel in which the article appears. Likewise, we may add the text of the title of the article to the text being index.
This can be done with the index hook feature.
create text index on article (description) with key dtid clustered with (pub_date, author_name) using function ;
We then define the hook functions
create procedure article_description_index_hook (inout vtb any, inout d_id any) { for (select c_id from article_channel where a_id = composite_ref (d_id, 0)) do { vt_batch_feed (vtb, sprintf ('ch%d', c_id), 0); } vt_batch_feed (vtb, coalesce ((select title from article where dtid = d_id), ''), 0); return 0; }
This function gets all channel id's where the article appears and adds the word ch<nnnn> where <nnnnn> is the channel id. Thus to look for 'xx' on channel 1 or 10 one can use the text expression xx and (ch1 or ch10). Additionally, the text of the title is added to the text being indexed. Note that the d_id supplied is the free text document id and that the second part of it is the article id.
To reverse the effect, the unindex function works as follows:
create procedure article_description_unindex_hook (inout vtb any, inout d_id any) { for (select c_id from article_channel where a_id = composite_ref (d_id, 0)) do { vt_batch_feed (vtb, sprintf ('ch%d', c_id), 1); } vt_batch_feed (vtb, coalesce ((select title from article where dtid = d_id), ''), 1); return 0; }
These hooks accept 2 inout parameters, the so called vt batch and the free text document ID of the row at hand.
The function returns 1 to indicate that it has processed all words of the row to be indexed and 0 to indicate that it expects the default text to be processed by the caller as normally. Returning 1 is useful for example if extra word normalization is applied by the hook.
If the resulting index is used by xcontains()
special
predicate then hook functions should not alter the indexing of XML documents.
xcontains()
reads both free-text index and the
actual document in order to locate particular fragments and it may miss
search hits or get false hits if free-text index of a column does not match to the
actual content of the column.
It is still safe to call vt_batch_feed
more than once
during a single call of a hook function: first call for an
unmodified XML document in the column plus calls for additional data.
If offband columns are declared then any call of a hook function should
either return 0 or call vt_batch_feed_offband()
before returning a non-zero value. If this condition is violated for a document
then NULL is returned instead of correct offband value for the document without
signalling any error.
The vt batch is an opaque data structure that accumulates words that will
be added to the text index entry for a given row. The vt_batch_feed
function adds words to the batch, the first argument is the vt batch, the
second is the text and the third is a flag 0 for insert and 1 for delete.
The text to be associated to the d_id in the index is the concatenation of
all the text supplied by successive calls to vt_batch_feed
. Word proximity
is defined as if all text were a single string in the order of calling
vt_batch_feed
.
It is possible to partially alter the rtext associated with an existing document. This could be done when adding channels to an article which already exists. This could be done with the procedure:
create procedure ch_add_article (in cid integer, in aid integer) { declare vtb, _dtid, cname any; if (exists (select 1 from article_channel where c_id = cid and a_id = aid)) return; insert into article_channel (c_id, a_id) values (cid, aid); select dtid into _dtid from article where id = aid; vtb := vt_batch (1); vt_batch_d_id (vtb, _dtid); vt_batch_feed (vtb, sprintf ('ch%d', cid), 0); vt_batch_process_db_dba_article (vtb); }
This first checks if the article is already on the channel, and if not, it makes a vt batch, gets the free text document id of the article, associates it to the vt batch and then adds a single word, ch<nnnn>. The vt batch is applied by calling the generated procedure vt_batch_process_<table>_<column> with the vt batch as only argument.
If multiple documents should be processed in one batch, it is possible to call vt_batch_d_id multiple times to feed data about multiple documents. In this case the successive document ids must be given in ascending order. The batch can be processed (applied to the words table) by calling T_BATCH_PROCESS_<table>_<column>, generated by the index creation.
The sensitive columns of the UPDATE trigger generated are the free text document id, the CLUSTERED WITH columns and the main text column. If more columns are needed for hook functions etc., the triggers should be manually edited.
If an explicit integer document ID column is specified, its value may not be 0 or negative.
See Also | |
---|---|
vt_batch , vt_batch_feed , vt_batch_feed_offband , vt_batch_d_id . |