The FTS parser operates on text fields and splits their content into
separate words. Words are then upshifted (converted to upper case) and
added to the index.
By default, each character that is neither a letter nor a digit is
considered a word separator. Exceptions are:
-
A negative sign is not considered a separator if it is the first
character in a word and is followed by a decimal point or a digit.
This allows negative numbers as keywords.
-
A decimal point is not considered a separator if enclosed by digits.
This allows decimal numbers as keywords.
-
A decimal grouping character is not considered a separator
if it is enclosed by digits and occurs at a 3-digit
position (relative to the end of the number). This allows
numbers with thousands separators to be recognized.
A word is recognized as a number if it starts with a digit, or with a
decimal point or a negative sign that is followed by a digit.
Decimal numbers are recognized if the decimal point character
is defined. Thousands grouping is recognized (and removed) when
the grouping character is defined and the grouping happens on
a 3 digit position (relative to the end of the number).
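The number rules above can be illustrated with a minimal Python sketch. It is not the FTS implementation: the function names are hypothetical, the "dec" string is assumed to hold the decimal point followed by the grouping character, and the grouping check is assumed to apply to the integer part of the number only.

```python
def is_number(word: str, dec: str = ".,") -> bool:
    """One reading of the rule: a digit first, or a decimal point or
    negative sign first that is directly followed by a digit."""
    point = dec[0] if dec else None  # no decimal point defined if dec is empty
    if not word:
        return False
    if word[0].isdigit():
        return True
    starts_like_number = word[0] == "-" or (point is not None and word[0] == point)
    return starts_like_number and len(word) > 1 and word[1].isdigit()


def strip_grouping(word: str, dec: str = ".,") -> str:
    """Remove thousands grouping if every grouping character sits at a
    3-digit position; otherwise return the word unchanged."""
    point = dec[0] if len(dec) > 0 else None
    group = dec[1] if len(dec) > 1 else None
    if group is None:
        return word  # grouping character not defined
    sign = "-" if word.startswith("-") else ""
    body = word[len(sign):]
    if point is not None and point in body:
        integer, _, fraction = body.partition(point)
    else:
        integer, fraction = body, None
    parts = integer.split(group)
    valid = (
        len(parts) > 1
        and parts[0].isdigit()
        and all(p.isdigit() and len(p) == 3 for p in parts[1:])
        and (fraction is None or fraction.isdigit())
    )
    if not valid:
        return word
    result = sign + "".join(parts)
    if fraction is not None:
        result += point + fraction
    return result
```

With the default "dec" value, strip_grouping("1,234,567.89") yields "1234567.89", while "12,34" is left unchanged because the grouping character is not at a 3-digit position.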
The maximum length of a word in an index depends on the FTS field
configuration. It defaults to 12 characters. Any additional
characters are truncated.
Usually, word separators are discarded. However, a "csep" character
is ignored at the beginning of a word but is considered a regular
character after a word has started.
By default, the percent sign is considered a "csep" separator.
So the word "15%" may be distinguished from the value "15".
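The "csep" behavior can be sketched in a few lines of Python. This is an illustration only, not the actual parser: the function name split_words is hypothetical, and the sketch ignores the number, "multi" and word length rules described elsewhere.

```python
def split_words(text: str, csep: str = "%") -> list[str]:
    """Split text into upshifted words, treating every character that is
    neither a letter nor a digit as a separator, except that a "csep"
    character is kept once a word has started."""
    words = []
    current = ""
    for ch in text:
        if ch.isalnum() or (ch in csep and current):
            current += ch          # regular character, or csep inside a word
        else:
            if current:
                words.append(current.upper())
            current = ""           # separator (or leading csep) is discarded
    if current:
        words.append(current.upper())
    return words
```

In this sketch, split_words("Save 15% or %off 15") returns ['SAVE', '15%', 'OR', 'OFF', '15']: the trailing percent sign stays part of "15%", while the leading one in "%off" is dropped.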
Words separated by "multi" separators are added to the index multiple times.
By default, the hyphen is considered a "multi" separator.
For example, "TIC-TAC-TOE" could be located with TIC-TAC-TOE, TIC, TAC or TOE.
This multiple-indexing feature may be configured per field.
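As an illustration of the multiple-indexing idea, the following Python sketch lists the keywords under which a token could be found. The function name expand_multi is hypothetical, and how the FTS index stores the combined form internally is an assumption.

```python
def expand_multi(token: str, multi: str = "-") -> list[str]:
    """Return the upshifted keywords for a token: the full token plus each
    part between "multi" separators (only the token itself if it has a
    single part)."""
    parts = []
    current = ""
    for ch in token:
        if ch in multi:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    if len(parts) <= 1:
        return [p.upper() for p in parts]
    return [token.upper()] + [p.upper() for p in parts]
```

Here expand_multi("Tic-Tac-Toe") returns ['TIC-TAC-TOE', 'TIC', 'TAC', 'TOE'], matching the example above.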
Parser Configuration
The parser characteristics may be customized (per field) with
the following settings:
-
"nsep" specifies characters that would otherwise be considered
separators and are instead treated like letters.
-
"csep" specifies characters that would otherwise be considered
separators and are instead treated like letters after a word has started.
-
"multi" specifies characters that function as separators
but result in multiple keywords being kept.
-
"dec" specifies the decimal point and grouping characters
(in this order). If the decimal point is not specified
then decimal numbers are not recognized. If the grouping
character is not specified then the thousands grouping
of numbers is not recognized.
Default Parser Configuration
Unless otherwise specified, the default parser configuration
(parser config P0) applies to a text field. It defaults to the
following settings:
nsep = ""
csep = "%"
multi = "-"
dec = ".,"
Additional parser configurations may be defined. Additional
parser configurations default to the following settings:
nsep = ""
csep = ""
multi = ""
dec = "."
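The two sets of defaults can be modeled as a small data structure. The Python sketch below is only a way to reason about the settings; the class name ParserConfig and its helper properties are hypothetical and not part of any FTS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParserConfig:
    # Field defaults mirror the settings of an additional parser configuration.
    nsep: str = ""
    csep: str = ""
    multi: str = ""
    dec: str = "."

    @property
    def decimal_point(self) -> Optional[str]:
        """First "dec" character; None means decimal numbers are not recognized."""
        return self.dec[0] if len(self.dec) >= 1 else None

    @property
    def grouping_char(self) -> Optional[str]:
        """Second "dec" character; None means thousands grouping is not recognized."""
        return self.dec[1] if len(self.dec) >= 2 else None


# Default parser configuration (P0) as listed above.
P0 = ParserConfig(nsep="", csep="%", multi="-", dec=".,")

# An additional parser configuration with untouched defaults.
P1 = ParserConfig()
```

In this model, P0 recognizes "," as the grouping character, while an additional configuration recognizes decimal numbers but no thousands grouping until "dec" is extended.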
FTS field options
Field-specific configuration options control how the FTS index system
parses and indexes a field:
-
The "NP" (no parse) option indicates to not parse a field.
Instead the content is indexed literally. This option may only be
specified for text fields.
-
The "NT" (no translate) option indicates to not "translate"
a field. Unless set, any words extracted from this field are upshifted and
"translated" to ASCII equivalents.
-
The "NE" (no exclude) option indicates to not filter words
through an exclusion list (stopword list).
-
The "NM" (no multi) option indicates to not handle multiple
keywords separately. Any multi-separator is treated as a regular separator.
-
The "SX" (soundex) option indicates that a soundex
(sound alike) lookup for any words obtained from this field should be supported.
When enabled, an additional soundex representation of the keyword is maintained.
Note that a soundex code is not added for words that are recognized as numbers.
-
The "NR" (numeric range) option indicates that a numeric range
search should also be supported for a text field. When enabled, any numbers
in parsed text fields (subject to the exclusion list and the min. keyword length)
are also added to the numeric keyword index.
-
The EXCL=n or "En" (exclusion) option
specifies the exclusion list for this field. "n" represents the exclusion list id. If not
specified, the default exclusion list (E0) is used.
-
The PCFG=n or "Pn" (parser config) option
specifies the parser configuration when parsing the content of the field. "n" represents
the parser configuration id. If not specified, the default parser
configuration (P0) is used.
-
The MAX=n and MIN=n options may be used
to specify the max. and min. length of a word. The max. word length specifies the max.
number of characters included in a keyword. Any excess characters are truncated.
The min. word length specifies the min. size of a word. Shorter words are excluded from
the index.
If not specified, the max. word length defaults to 12. Any value up
to the internal limit of 32 characters may be used. The min.
keyword length defaults to 2.
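The effect of the word length limits can be sketched as follows. This Python fragment is an illustration under assumptions: the function name apply_length_limits is hypothetical, and the min. length is assumed to be checked against the word before truncation.

```python
def apply_length_limits(words: list[str], min_len: int = 2, max_len: int = 12) -> list[str]:
    """Drop words shorter than min_len and truncate the rest to max_len."""
    if not 1 <= max_len <= 32:
        raise ValueError("max. word length must be between 1 and 32")
    return [w[:max_len] for w in words if len(w) >= min_len]
```

With the defaults, apply_length_limits(["A", "TO", "INTERNATIONALIZATION"]) returns ['TO', 'INTERNATIONA']: the one-character word is excluded and the long word is cut to 12 characters.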
FTS Stop-Word list
A stopword list (aka excluded word list or exclusion list) lists the
words that are excluded from indexing.
Words occur with different frequencies; some are more common than
others. The best approach to the most common words (such as THE) is
to not index them at all by adding them to the exclusion list.
This saves disk space and improves performance without affecting the
search quality. A common word is not a good differentiator to
choose records by.
A stopword list may be maintained in a file and copied to the FTS
database when the index is defined.
Exclusion list file format:
- A stopword file (aka exclude file) is a text file
- Lines are separated by LF or CR LF
- A line with a leading hash (#) is considered a comment
- Leading and trailing spaces are ignored
- One word per line
- Words are not case sensitive
- Words may be quoted to retain spaces (double quotes)
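The file format rules above translate into a short reader. The Python sketch below is an illustration, not the FTS loader: the function name parse_stopwords is hypothetical, blank lines are assumed to be skipped, surrounding whitespace is assumed to be removed before the comment check, and words are upshifted to reflect that they are not case sensitive.

```python
def parse_stopwords(text: str) -> set[str]:
    """Parse the content of a stopword file into a set of upshifted words."""
    words = set()
    # Lines are separated by LF or CR LF.
    for line in text.replace("\r\n", "\n").split("\n"):
        line = line.strip()                  # ignore leading and trailing spaces
        if not line or line.startswith("#"):
            continue                         # skip blank lines and comments
        if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            line = line[1:-1]                # quoted word: keep inner spaces
        words.add(line.upper())              # words are not case sensitive
    return words
```

For example, a file containing the lines `# common words`, `the`, `  And  ` and `" a "` yields the three entries THE, AND and " A " (with its spaces retained).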
For example stopword lists, including lists for languages other than
English, the following web site may be of interest:
http://www.ranks.nl/resources/stopwords.html