9.33.2. Phrase Set Configuration API

  • DB.DBA.ANN_PHRASE_CLASS_ADD

  • DB.DBA.ANN_PHRASE_CLASS_DEL

  • AP_BUILD_MATCH_LIST: The report R is a vector of 6 elements:

    1. R[0] is a vector of all distinct phrase classes for the phrase sets of found phrases; every pair of items represents one phrase class: the first item is the integer APC_ID of the class, the second item is a description of the phrase class as a vector of APC_NAME, APC_CALLBACK and APC_APP_ENV;

    2. R[1] is a vector of all distinct phrase sets of found phrases; every pair of items represents one phrase set: the first item is the integer APS_ID of the set, the second item is a vector of APS_NAME, APS_APC_ID, the index of the phrase class description in R[0] and APS_APP_ENV;

    3. R[2] is a vector of all distinct found phrases; every item represents a phrase as a vector of AP_APS_ID, the index of the phrase set description in R[1], AP_TEXT and application-specific data from AP_LINK_DATA or AP_LINK_DATA_LONG;

    4. R[3] is a vector of all composed arrows for the text; every item represents one place in the text, as an "arrow" described below;

    5. R[4] is a vector of indexes of arrows that point to words in the text; every item is an integer index into R[3];

    6. R[5] is a vector of descriptions of occurrences of annotation phrases in the text; every item represents one occurrence as a vector of the index of the first word in R[3], the index of the last word in R[3], the index of the found phrase in R[2], and the index of the previous occurrence of the same phrase in R[5].

    7. Every "arrow" A is a vector of length 5 or 6; it has 6 elements when the arrow points inside an occurrence of some annotation phrase.

      1. A[0] integer that indicates type of text fragment:

        • 0 is for plain word (only this type occurs in reports for plain text),

        • 1 is for text of opening tag,

        • 2 is for text of closing tag,

        • 3 is for something exceptional, such as an unrecoverable HTML syntax error

      2. A[1] integer offset of the first byte of a fragment in the text

      3. A[2] integer offset of the first byte after the end of a fragment

      4. A[3] integer that is a bit-mask of opened but not yet closed tags

      5. A[4] integer index of the arrow of the innermost tag that is opened but not yet closed at the point where the arrow begins

      6. A[5] may be absent; if present, it is a vector of indexes in R[2] of all containing phrases.

    8. The bit mask of opened but not yet closed tags (A[3]) consists of the following bits:

      0x00000001      PCDATA containers (such as OPTION, TEXTAREA, XBODY, XHEAD)
      0x00000002      Inlined highlight tags (such as ABBR, ACRONYM, B, BDO, BIG, CITE, CODE, DFN, EM, FONT, I, KBD, Q, S, SAMP, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT, U)
      0x00000004      Tag A
      0x00000008      Tag LABEL
      
      0x00000010      Inlined content (such as ADDRESS, APPLET, H1-H6, LABEL, LEGEND, P, PRE, and all blocks of content except MAP)
      0x00000020      Blocks (such as BLOCKQUOTE, BUTTON, DD, DIV, DL, DT, FIELDSET, FORM, IFRAME, LI, NOFRAMES, NOSCRIPT, OBJECT, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, XBODY, XHEAD)
      
      0x00000100      Tags of list and ordered list (MENU, OL, UL)
      0x00000200      Tag LI
      0x00000400      Tag DL
      0x00000800      Tags DD and DT
      
      0x00001000      Tag FORM
      0x00002000      Tag SELECT
      0x00004000      Tag OPTGROUP
      0x00008000      Tag BUTTON
      
      0x00010000      Tag TABLE
      0x00020000      Tags inside TABLE but outside table rows (such as TBODY, TFOOT, THEAD)
      0x00040000      Tag TR
      0x00080000      Tags TH and TD
      
      0x00FFFFFF      Tags XBODY and XHEAD
      
      0x01000000      Tag HEAD
      0x02000000      Tag FRAMESET
      0x04000000      Tag NOFRAMES
      
      0x10000000      Tag HTML
      0x20000000      Tag BODY
      0x40000000      Tags INS and DEL
      0x80000000      Tag XMP
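
As an illustration of how an application might interpret the A[3] mask, here is a minimal Python sketch. The bit values are copied from the table above; the short labels and the function name open_tag_groups are invented for this example and are not part of the API. The 0x00FFFFFF entry (XBODY and XHEAD) is left out because it is not a single bit.

```python
# Bit values come from the table above; the labels are shortened by hand.
TAG_BITS = {
    0x00000001: "PCDATA container",
    0x00000002: "inline highlight",
    0x00000004: "A",
    0x00000008: "LABEL",
    0x00000010: "inline content",
    0x00000020: "block",
    0x00000100: "MENU/OL/UL",
    0x00000200: "LI",
    0x00000400: "DL",
    0x00000800: "DD/DT",
    0x00001000: "FORM",
    0x00002000: "SELECT",
    0x00004000: "OPTGROUP",
    0x00008000: "BUTTON",
    0x00010000: "TABLE",
    0x00020000: "TBODY/TFOOT/THEAD",
    0x00040000: "TR",
    0x00080000: "TH/TD",
    0x01000000: "HEAD",
    0x02000000: "FRAMESET",
    0x04000000: "NOFRAMES",
    0x10000000: "HTML",
    0x20000000: "BODY",
    0x40000000: "INS/DEL",
    0x80000000: "XMP",
}


def open_tag_groups(mask):
    """Return the labels of all single-bit table entries set in an A[3] mask."""
    # Dicts keep insertion order, so labels come out in table order.
    return [label for bit, label in TAG_BITS.items() if mask & bit]


# A word inside <html><body><div><a>: bits for A, block, HTML and BODY.
print(open_tag_groups(0x10000000 | 0x20000000 | 0x00000020 | 0x00000004))
```

For example, a link filter could skip any word whose mask has bit 0x00000004 set, since that word is already inside an A tag.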
      

      For a long document the report may become too large, especially vectors R[3] and R[4]. A simple application may not need the location of every tag and every word of the document. The report_flags argument is a bitmask, and some of its bits control the size of the report. If bit 1 is set, closing tags are excluded from the report. If bit 2 is set, only words that belong to found phrases are placed in the report; all other words are excluded.
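
To make the report layout described above more concrete, the following Python sketch walks a hand-built report for a plain-text input. The sample report is hypothetical: it was written by hand to follow the layout of R[0] to R[5] and was not produced by AP_BUILD_MATCH_LIST. The IDs, names, the None used for "no previous occurrence", and the 0 used for A[4] are placeholders, because the text does not specify those values.

```python
# Byte string, because arrow offsets A[1] and A[2] are byte offsets.
TEXT = b"visit New York today"

# Hand-made report following the documented layout (all values hypothetical).
REPORT = [
    [1, ["places", None, None]],         # R[0]: APC_ID, class description
    [10, ["cities", 1, 0, None]],        # R[1]: APS_ID, set description
    [[10, 0, "New York", "link-data"]],  # R[2]: found phrases
    [                                    # R[3]: arrows
        [0, 0, 5, 0, 0],                 # "visit"
        [0, 6, 9, 0, 0, [0]],            # "New", inside phrase R[2][0]
        [0, 10, 14, 0, 0, [0]],          # "York", inside phrase R[2][0]
        [0, 15, 20, 0, 0],               # "today"
    ],
    [0, 1, 2, 3],                        # R[4]: indexes of word arrows in R[3]
    [[1, 2, 0, None]],                   # R[5]: first, last, phrase, previous
]


def words(text, report):
    """Return the text of every word, using R[4] to pick arrows from R[3]."""
    arrows = report[3]
    return [text[arrows[i][1]:arrows[i][2]] for i in report[4]]


def occurrences(text, report):
    """Return (AP_TEXT, start, end, matched bytes) for each item of R[5]."""
    phrases, arrows = report[2], report[3]
    found = []
    for first, last, phrase_idx, _previous in report[5]:
        start = arrows[first][1]  # A[1] of the first word of the occurrence
        end = arrows[last][2]     # A[2] of the last word of the occurrence
        found.append((phrases[phrase_idx][2], start, end, text[start:end]))
    return found


print(words(TEXT, REPORT))
print(occurrences(TEXT, REPORT))
```

The sketch reads only A[1], A[2], R[2], R[4] and R[5]. An application that also needs tag context would additionally inspect A[0], A[3] and A[4] of each arrow.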