15.4.8.XML Free Text Indexing Rules

XML documents are inserted into the free text index as follows:

The process works on the parsed XML tree; therefore character and local entity references are expanded.
Whole words of text content, bounded by delimiters used for free text, are each assigned an ordinal number. Noise words defined in the noise.txt file used by free text indexing are not counted.
Attribute names and values are not indexed.
Element start and end tags are indexed using the expanded names - that is, prefixed with the namespace URI + ':'.
An element start tag's ordinal number is one less than the ordinal number of the first whole word in the text value.
A close tag's ordinal number is one greater than that of the last word in the text value.

From these rules follows that:

<html>
  <body>
   <title>Title of Document</title>
   <p>Some <b>bold</b> text </p>
  </body>
</html>

will be indexed as follows:

<html>            0
<body>            0
<title>           0
Title           1
of              - no number, noise word
Document                2
</title>          3
   <p>            3
Some            4
 <b>              4
bold            5
</b>              6
 text           6
</p>              6
  </body>         6
</html>           6

As a result, the phrase "some bold text" is the string value of the <p> tag and will match the free text expression "some bold text" even though there is mark-up in it. Conversely, the phrase "Document some bold" does not match. Words will not considered adjacent if there is a mix of opening and closing tags. They will only be considered adjacent if there are solely one or more either opening or closing tags between them. This can be circumvented by using the NEAR connective instead of the phrase construct.

A free text condition will only be true of an element if all the words needed to satisfy the condition are part of the element's string value. This string value includes text children of descendants.