15.4.8. XML Free Text Indexing Rules
XML documents are inserted into the free text index as follows:
The process works on the parsed XML tree; therefore character and local entity references are expanded.
Whole words of text content, bounded by delimiters used for free text, are each assigned an ordinal number. Noise words defined in the noise.txt file used by free text indexing are not counted.
Attribute names and values are not indexed.
Element start and end tags are indexed using the expanded names - that is, prefixed with the namespace URI + ':'.
An element start tag's ordinal number is one less than the ordinal number of the first whole word in the text value.
A close tag's ordinal number is one greater than that of the last word in the text value.
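The first and fourth rules can be illustrated outside the server. The following is a minimal Python sketch, not the indexer itself: it uses the standard xml.etree.ElementTree parser to show that text read from a parsed tree already has its references expanded, and it builds a tag key as namespace URI + ':' + local name. The sample document, the http://example.com/ns URI, and the helper index_name are hypothetical, and leaving a name without a namespace unprefixed is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one local entity, one character reference,
# one predefined entity, and one namespaced element.
SRC = (
    '<!DOCTYPE doc [<!ENTITY co "Acme Corp">]>'
    '<doc xmlns:x="http://example.com/ns">'
    '<x:item>&co; caf&#233; &amp; bar</x:item>'
    '</doc>'
)

def index_name(tag):
    """Turn ElementTree's '{uri}local' form into 'uri:local'.

    Names without a namespace are returned unchanged (an assumption).
    """
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri + ":" + local
    return tag

root = ET.fromstring(SRC)
item = root[0]

# The parsed tree holds expanded text, not the raw references.
print(item.text)             # Acme Corp café & bar
print(index_name(item.tag))  # http://example.com/ns:item
print(index_name(root.tag))  # doc
```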
It follows from these rules that the document:
<html> <body> <title>Title of Document</title> <p>Some <b>bold</b> text </p> </body> </html>
will be indexed as follows:
    <html>      0
    <body>      0
    <title>     0
    Title       1
    of          - no number, noise word
    Document    2
    </title>    3
    <p>         3
    Some        4
    <b>         4
    bold        5
    </b>        6
    text        6
    </p>        6
    </body>     6
    </html>     6
As a result, the phrase "some bold text" is the string value
of the <p> tag and will match the free text expression
"some bold text" even though there is mark-up in it.
Conversely, the phrase "Document some bold" does not match.
Words are not considered adjacent if a mix of opening and closing
tags separates them. They are considered adjacent only if the tags
between them are all opening tags or all closing tags. This
restriction can be circumvented by using the NEAR connective
instead of the phrase construct.
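The adjacency rule can be modelled with a short Python sketch. This is a simplified illustration, not the server's implementation: it numbers words only, uses a regular expression in place of a real XML parser, and assumes a noise list containing just "of". It also assumes that any mix of opening and closing tags between two words costs exactly one extra ordinal, which is what the example above shows for </title> followed by <p>. The names word_ordinals and phrase_matches are hypothetical.

```python
import re

NOISE = {"of"}  # assumed stand-in for the contents of noise.txt

# A tag (group 1 is "/" for a close tag, "" for a start tag) or a word (group 2).
TOKEN = re.compile(r"<(/?)[^<>]+>|(\w+)")

def word_ordinals(xml_text, noise=NOISE):
    """Return (word, ordinal) pairs for the non-noise words of xml_text."""
    out = []
    last = 0
    saw_open = saw_close = False
    for m in TOKEN.finditer(xml_text):
        slash, word = m.groups()
        if word is None:  # a tag: remember which kinds were seen
            if slash:
                saw_close = True
            else:
                saw_open = True
            continue
        if word.lower() in noise:
            continue  # noise words get no ordinal
        # A mix of close and start tags since the previous word leaves a gap.
        last += 2 if (saw_open and saw_close) else 1
        out.append((word.lower(), last))
        saw_open = saw_close = False
    return out

def phrase_matches(xml_text, phrase, noise=NOISE):
    """True if the non-noise words of phrase occur at consecutive ordinals."""
    indexed = word_ordinals(xml_text, noise)
    wanted = [w for w in re.findall(r"\w+", phrase.lower()) if w not in noise]
    if not wanted:
        return False
    for i in range(len(indexed) - len(wanted) + 1):
        window = indexed[i:i + len(wanted)]
        same_words = [w for w, _ in window] == wanted
        consecutive = all(b[1] - a[1] == 1 for a, b in zip(window, window[1:]))
        if same_words and consecutive:
            return True
    return False

DOC = ("<html> <body> <title>Title of Document</title> "
       "<p>Some <b>bold</b> text </p> </body> </html>")

print(word_ordinals(DOC))
# [('title', 1), ('document', 2), ('some', 4), ('bold', 5), ('text', 6)]
print(phrase_matches(DOC, "some bold text"))      # True
print(phrase_matches(DOC, "Document some bold"))  # False
```

In this model "Document" and "Some" receive ordinals 2 and 4, so no phrase can span them, while "Some", "bold" and "text" stay consecutive because only start tags or only close tags lie between them.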
A free text condition will only be true of an element if all the words needed to satisfy the condition are part of the element's string value. This string value includes text children of descendants.
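The scoping rule can be checked against the same example with a small Python sketch. It assumes the condition is a plain conjunction of words and uses xml.etree.ElementTree only to compute string values; the helper names string_value and elements_satisfying are hypothetical and are not part of any server API.

```python
import re
import xml.etree.ElementTree as ET

DOC = ("<html> <body> <title>Title of Document</title> "
       "<p>Some <b>bold</b> text </p> </body> </html>")

def string_value(elem):
    """Concatenate the text of elem and of all its descendants."""
    return "".join(elem.itertext())

def elements_satisfying(root, words):
    """Tags of the elements whose string value contains every given word."""
    wanted = {w.lower() for w in words}
    hits = []
    for elem in root.iter():  # document order, root first
        present = set(re.findall(r"\w+", string_value(elem).lower()))
        if wanted <= present:
            hits.append(elem.tag)
    return hits

root = ET.fromstring(DOC)

print(elements_satisfying(root, ["bold", "text"]))      # ['html', 'body', 'p']
print(elements_satisfying(root, ["document", "bold"]))  # ['html', 'body']
```

The second query is true of <html> and <body> only, because no smaller element has both words in its string value.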