

FTS Parser

 
The FTS parser operates on text fields and splits their content into separate words. Words are then upshifted and added to the index.

By default, each character that is neither a letter nor a digit is considered a word separator. Exceptions are:

  • A negative sign is not considered a separator if it is the first character in a word and is followed by a decimal point or a digit. This allows negative numbers as keywords.

  • A decimal point is not considered a separator if it is enclosed by digits. This allows decimal numbers as keywords.

  • A decimal grouping character is not considered a separator if it is enclosed by digits and occurs at a 3-digit position (relative to the end of the number). This allows numbers with thousands separators to be recognized.

A word is recognized as a number if it starts with a digit, or with a decimal point or a negative sign followed by a digit. Decimal numbers are recognized only if the decimal point character is defined. Thousands grouping is recognized (and removed) when the grouping character is defined and the grouping occurs at a 3-digit position (relative to the end of the number).
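
The number rules above can be illustrated with a small Python sketch. It is not the Eloquence implementation: the function name normalize_number is hypothetical, and as a simplifying assumption it accepts a word only when the entire word is numeric, which is stricter than the "starts with" rule in the text. The dec argument follows the "dec" setting (decimal point first, then grouping character).

```python
import re

def normalize_number(word, dec=".,"):
    """Return the word without grouping characters if it is numeric, else None.

    Hypothetical helper for illustration only; not part of any FTS API.
    """
    point = dec[0] if len(dec) > 0 else None
    group = dec[1] if len(dec) > 1 else None

    if group:
        # Either 1-3 digits followed by groups of exactly three digits,
        # or a plain run of digits without grouping.
        int_part = r"(?:\d{1,3}(?:" + re.escape(group) + r"\d{3})+|\d+)"
    else:
        int_part = r"\d+"

    if point:
        p = re.escape(point)
        # A decimal point must be followed by at least one digit.
        body = "(?:" + int_part + "(?:" + p + r"\d+)?|" + p + r"\d+)"
    else:
        body = int_part

    # An optional negative sign is allowed only as the first character.
    if re.fullmatch("-?" + body, word) is None:
        return None
    return word.replace(group, "") if group else word
```

With the default dec value, a grouping character that is not at a 3-digit position (as in "12,34") causes the word to be rejected as a number, and with dec set to "." alone no thousands grouping is recognized at all.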

The maximum length of a word in an index depends on the FTS field configuration. It defaults to 12 characters. Any additional characters are truncated.

Usually, word separators are discarded. However, a "csep" character is ignored at the beginning of a word but is considered a regular character after a word has started. By default, the percent sign is considered a "csep" separator. So the word "15%" may be distinguished from the value "15".
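
As a rough illustration of the "csep" behavior, the following Python sketch splits text into upshifted words. The function name split_words is hypothetical, and the sketch deliberately ignores the number rules, "multi" separators, and word length limits; it only shows that a "csep" character is dropped before a word starts and kept once a word has started.

```python
def split_words(text, csep="%"):
    """Split text into upshifted words, honoring "csep" characters.

    Hypothetical, simplified sketch; not the actual FTS parser.
    """
    words = []
    current = ""
    for ch in text:
        # Letters and digits always belong to a word. A csep character
        # belongs to a word only if the word has already started.
        if ch.isalnum() or (ch in csep and current):
            current += ch
        else:
            if current:
                words.append(current.upper())
            current = ""
    if current:
        words.append(current.upper())
    return words
```

Under these assumptions, split_words("15% off, %15 left") keeps "15%" as one word but reduces "%15" to "15", so the two remain distinguishable in the index.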

Words separated by "multi" separators are added to the index multiple times. By default, the hyphen is considered a "multi" separator. For example, "TIC-TAC-TOE" could be located with TIC-TAC-TOE, TIC, TAC or TOE. This multiple-indexing feature may be configured per field.
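
The multiple-indexing behavior might be sketched in Python as follows. The function name expand_multi is hypothetical, and the order of the returned keywords is an assumption made for readability; the text only states which keywords can be used to locate the word.

```python
import re

def expand_multi(word, multi="-"):
    """Return the keywords generated for a word containing "multi" separators.

    Hypothetical sketch; not the actual FTS parser.
    """
    if not multi:
        # No multi separators configured: the word is kept as is.
        return [word]
    # Split on any of the configured multi separator characters.
    pieces = [p for p in re.split("[" + re.escape(multi) + "]", word) if p]
    if len(pieces) <= 1:
        return pieces
    # Keep the full word and add each component as its own keyword.
    return [word] + pieces
```

For the example in the text, expand_multi("TIC-TAC-TOE") yields the full word plus TIC, TAC and TOE.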

Parser Configuration

The parser characteristics may be customized (per field) with the following settings:
  • "nsep" specifies characters that would otherwise be considered separators and are instead treated like letters.

  • "csep" specifies characters that would otherwise be considered separators and are instead treated like letters after a word has started.

  • "multi" specifies characters that function as separators but cause multiple keywords to be kept (the full word and each of its parts).

  • "dec" specifies the decimal point and grouping characters (in this order). If the decimal point is not specified then decimal numbers are not recognized. If the grouping character is not specified then the thousands grouping of numbers is not recognized.

Default Parser Configuration

Unless specified otherwise, the default parser configuration (P0) applies to a text field. It uses the following settings:
nsep  = ""
csep  = "%"
multi = "-"
dec   = ".,"

Additional parser configurations may be defined. They default to the following settings:
nsep  = ""
csep  = ""
multi = ""
dec   = "."
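
The two sets of defaults can be modeled in Python as shown below. This is only a sketch for reasoning about the settings: the class name ParserConfig, the constant P0, and the decimal_point and grouping properties are hypothetical and do not correspond to an actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical model of a parser configuration.

    Field defaults mirror the defaults of additional parser configurations.
    """
    nsep: str = ""
    csep: str = ""
    multi: str = ""
    dec: str = "."

    @property
    def decimal_point(self):
        # First "dec" character; None means decimal numbers are not recognized.
        return self.dec[0] if len(self.dec) > 0 else None

    @property
    def grouping(self):
        # Second "dec" character; None means no thousands grouping.
        return self.dec[1] if len(self.dec) > 1 else None

# Default parser configuration (P0) as listed above.
P0 = ParserConfig(nsep="", csep="%", multi="-", dec=".,")
```

In this model, P0 recognizes both decimal numbers and thousands grouping, while a newly defined configuration recognizes decimal numbers only.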

FTS field options

Field-specific configuration options control how the content of a field is parsed and indexed:
  • The "NP" (no parse) option indicates that the field is not parsed. Instead, the content is indexed literally. This option may only be specified for text fields.

  • The "NT" (no translate) option indicates that the field is not "translated". Unless it is set, any words extracted from this field are upshifted and "translated" to ASCII equivalents.

  • The "NE" (no exclude) option indicates that words are not filtered through an exclusion list (stopword list).

  • The "NM" (no multi) option indicates that multiple keywords are not handled separately. Any "multi" separator is treated as a regular separator.

  • The "SX" (soundex) option indicates that a soundex (sound alike) lookup for any words obtained from this field should be supported. When enabled an additional soundex representation of the keyword is maintained. Note that a soundex code is not added for words that are recognized as numbers.

  • The "NR" (numeric range) option indicates that a numeric range search should also be supported for a text field. When enabled, any numbers in parsed text fields (subject to the exclusion list and the min. keyword length) are also added to the numeric keyword index.

  • The EXCL=n or "En" (exclusion) option specifies the exclusion list for this field. "n" represents the exclusion list id. If not specified, the default exclusion list (E0) is used.

  • The PCFG=n or "Pn" (parser config) option specifies the parser configuration when parsing the content of the field. "n" represents the parser configuration id. If not specified, the default parser configuration (P0) is used.

  • The MAX=n and MIN=n options may be used to specify the maximum and minimum length of a word. The maximum word length specifies the maximum number of characters included in a keyword. Any excess characters are truncated. The minimum word length specifies the minimum size of a word. Shorter words are excluded from the index. If not specified, the maximum word length defaults to 12. Any value up to the internal limit of 32 characters may be used. The minimum keyword length defaults to 2.
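
The effect of the MIN and MAX options can be sketched in Python as follows. The function name apply_length_limits is hypothetical. Checking the minimum length before truncating is an assumption; the text does not specify the order, and the result is the same whenever the minimum does not exceed the maximum.

```python
def apply_length_limits(words, min_len=2, max_len=12):
    """Drop words shorter than min_len and truncate words to max_len characters.

    Hypothetical sketch; defaults follow the documented MIN and MAX defaults.
    """
    if not 1 <= max_len <= 32:
        # 32 characters is the documented internal limit.
        raise ValueError("max_len must be between 1 and 32")
    return [w[:max_len] for w in words if len(w) >= min_len]
```

With the defaults, a single-character word is dropped and a word longer than 12 characters is cut to its first 12 characters.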

FTS Stop-Word list

A stopword list (aka excluded word list or exclusion list) lists the words that are excluded from indexing.

Words occur with different frequencies. Some are more common than others. The best approach to the most common words (such as THE) is to not index them at all by adding them to the exclusion list. This saves disk space and improves performance without affecting the search quality: a very common word is not a good differentiator for selecting records.

A stopword list may be maintained in a file and copied to the FTS database when the index is defined.

Exclusion list file format:

  • A stopword file (aka exclude file) is a text file
  • Lines are separated by LF or CR LF
  • A line with a leading hash (#) is considered a comment
  • Leading and trailing spaces are ignored
  • One word per line
  • Words are not case sensitive
  • Words may be quoted to retain spaces (double quotes)

For more examples, including languages other than English, the following web site may be of interest: http://www.ranks.nl/resources/stopwords.html
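
The exclusion list file format described above could be read with a few lines of Python. This is a sketch rather than the loader used by the FTS database: the function name parse_stopwords is hypothetical, upshifting is an assumption used to make the comparison case-insensitive, and a "#" is treated as a comment marker only after leading spaces have been removed.

```python
def parse_stopwords(text):
    """Parse the content of a stopword file into a set of upshifted words.

    Hypothetical sketch of the documented file format.
    """
    words = set()
    # splitlines() accepts both LF and CR LF line endings.
    for line in text.splitlines():
        line = line.strip()  # leading and trailing spaces are ignored
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            line = line[1:-1]  # double quotes retain spaces inside the word
        words.add(line.upper())  # words are not case sensitive
    return words
```

A file would typically be read with open(path).read() and the resulting set used to filter keywords before they are added to the index.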


 
 
Revision: 2012-09-06
  Copyright © 2011 Marxmeier Software AG