20.2.10. Noise Words

Noise words are often occurring words which can be skipped to save space in the indexing, such as 'the', 'of', and' etc. These are ignored when they occur in queries or documents to be indexed. The set of noise words is configurable and is read from the noise.txt file, in the server's working directory, at server start up. Words mentioned in that file will be ignored for both indexing and querying.

The file noise.txt consists of control lines and text lines. A text line is just a string of one or more words to be declared as noise. Please keep them shorter than 1000 characters.

Control lines are those starting with "Language:" or "Encoding:" (case is important) "Language: lang-id" tells the system to use rules for language "lang-id" for subsequent text lines, until either another "Language:" control line or end of file. Similarly, "Encoding: enc-id" tells the system to use rules for encoding "enc-id". Control lines are always in plain ASCII, no matter which encoding is active for text lines. By default, the server default language and "UTF-8" encoding will be used.

The simplest way of composing noise.txt is to place every word on a separate line and save the file in UTF-8 encoding; this will work fine for most European languages.

Example 20.3. Example

AND
OF
THIS
THE

Noise words seem to be case-insensitive, but this is not so. If you enter a word on a text line, up to four noise words will be registered:

the word exactly as it was entered;

an uppercased form of this word, if it is defined for active language;

an lowercased form of this word, if it is defined for active language;

a capitalized form, with one (or more) first chars in upper case and the rest in lower case.

An error is signalled for a free text query consisting exclusively of noise words.

It is important to understand that changes in noise.txt may invalidate free text indexes that were filled with the previous version of noise.txt. For example, if a text contains a phrase 'A B C' and after indexing the word 'B' is added to the noise.txt then 'contains' predicate will properly search for words 'A' and 'C' but will fail to find the phrase 'A B C' or 'A C' due to differences in counting of word positions. The 'xcontains' predicate is even more sensitive to changes in word positions, because any change in word counting will corrupt the index for element names. In addition, "persistent XML" documents may contain pre-calculated word positions for all elements and these positions may become out of sync with positions in free text index, so it is best not to change noise.txt if the database contains any free text indexes on persistent XMLs.

Prefix	IRI
schema	http://schema.org/
n5	http://creativecommons.org/licenses/by/4.0/
n2	http://docs.openlinksw.com/virtuoso/noisewords/
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
n4	http://www.openlinksw.com/#
xsdh	http://www.w3.org/2001/XMLSchema#

Prefix	URI
xmlns:schema	http://schema.org/
xmlns:n5	http://creativecommons.org/licenses/by/4.0/
xmlns:n2	http://docs.openlinksw.com/virtuoso/noisewords/
xmlns:rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
xmlns:n4	http://www.openlinksw.com/#
xmlns:xsdh	http://www.w3.org/2001/XMLSchema#

Prefix

URI

xmlns:schema

http://schema.org/

xmlns:n5

http://creativecommons.org/licenses/by/4.0/

xmlns:n2

http://docs.openlinksw.com/virtuoso/noisewords/

xmlns:rdf

http://www.w3.org/1999/02/22-rdf-syntax-ns#

xmlns:n4

http://www.openlinksw.com/#

xmlns:xsdh

http://www.w3.org/2001/XMLSchema#

Subject	Predicate	Object
n2:	rdf:type	schema:APIReference
n2:	rdf:type	schema:TechArticle
n2:	schema:name	20.2.10.ÃÂ Noise Words
n2:	schema:copyrightHolder	_:vb80998
n2:	schema:datePublished	2016-09-09 16:16:54
n2:	schema:headline	20.2.10.ÃÂ Noise Words
n2:	schema:keywords	OpenLink,Virtuoso,database,RDBMS,relational,SQL,RDF,triple store,linked data,linked open data,Big Data
n2:	schema:license	n5:deed.en_US
n2:	schema:publisher	_:vb80997
n2:	schema:url	n2:
_:vb80997	rdf:type	schema:Organization
_:vb80997	schema:name	OpenLink Software
_:vb80997	schema:url	n4:this
_:vb80998	rdf:type	schema:Organization
_:vb80998	schema:name	OpenLink Software
_:vb80998	schema:url	n4:this

Subject

Predicate

Object

n2:

rdf:type

schema:APIReference

n2:

rdf:type

schema:TechArticle

n2:

schema:name

20.2.10.ÃÂ Noise Words

n2:

schema:copyrightHolder

_:vb80998

n2:

schema:datePublished

2016-09-09 16:16:54

n2:

schema:headline

20.2.10.ÃÂ Noise Words

n2:

schema:keywords

OpenLink,Virtuoso,database,RDBMS,relational,SQL,RDF,triple store,linked data,linked open data,Big Data

n2:

schema:license

n5:deed.en_US

n2:

schema:publisher

_:vb80997

n2:

schema:url

n2:

_:vb80997

rdf:type

schema:Organization

_:vb80997

schema:name

OpenLink Software

_:vb80997

schema:url

n4:this

_:vb80998

rdf:type

schema:Organization

_:vb80998

schema:name

OpenLink Software

_:vb80998

schema:url

n4:this

Prev	Up	Next
20.2.9. Order of Hits	Home	20.3. Querying Free Text Indexes

20.2.10. Noise Words

Namespace Prefixes

Statements

Namespace Prefixes

Statements