This document explains the dbutil syntax to configure FTS indexing.
UPGRADE DATABASE
Using FTS indexes requires additional database catalog tables to maintain
the FTS structural information. Databases created with previous Eloquence
versions must be upgraded to use FTS indexing.
All commands in this document are only valid inside a database context:
DATABASE "example";
The UPGRADE DATABASE command upgrades the database catalog for databases
created on older Eloquence versions:
UPGRADE DATABASE;
Please note that upgrading the database catalog may affect compatibility
with older Eloquence releases and there is no "downgrade database" command
in dbutil.
DELETE ALL FTS
The DELETE ALL FTS command removes all FTS resources for a database.
DELETE ALL FTS;
This deletes any FTS fields, the FTS parser config and FTS exclusion lists.
Parser Configuration
The FTS parser operates on text fields and splits them into separate words.
Its operation depends on the parser configuration and the FTS field
options.
Eloquence supports configuring the parsing of text fields globally or per
field. By default the parser config id P0 is used.
Multiple parser configurations may be defined and individual FTS fields may be
associated with a specific parser configuration.
Parser configurations are maintained in dbutil with the
following syntax:
CREATE FTS PARSER CONFIG id
option = "string" [, ...];
CHANGE FTS PARSER CONFIG id
option = "string" [, ...];
DELETE FTS PARSER CONFIG id;
The CREATE FTS PARSER command creates a new parser config.
The CHANGE FTS PARSER statement may be used to change an
existing parser configuration. The DELETE FTS PARSER
statement deletes a parser configuration.
Id is an integer number that identifies the parser configuration.
It may be prepended with the letter "P". The parser configuration
P0 (or 0) specifies the default parser configuration.
The following parser options may be configured:
- NSEP - word separator exception list
-
"nsep" specifies characters that would otherwise be considered a separator to be
treated like a letter.
By default, each character that is neither a letter nor a digit is considered
a word separator. Characters listed in the NSEP option
are no longer considered a word separator.
- CSEP - conditional separator list
-
"csep" specifies characters that would otherwise considered a separator to be
treated like a letter when appearing inside a word.
Usually, characters that are neither letters or numbers are word separators and
discarded. However, a "csep" character is ignored at the beginning of a word
but is considered a regular character after a word has started.
By default, the percent sign is considered a "csep" separator. So the word "15%"
may be distinguished from the value "15".
- MULTI - multi separator list
-
"multi" specifies characters that function as a separator but result in multiple
keywords to be indexed.
Words separated by "multi" separators are added to the index multiple times.
By default, the hyphen is considered a "multi" separator.
For example, "TIC-TAC-TOE" could be located with TIC-TAC-TOE, TIC, TAC or TOE.
- DEC - decimal point and grouping
-
"dec" specifies the decimal point and grouping characters (in this order).
If the decimal point is not specified ("dec" is an ampty string) then decimal
numbers are not recognized.
If the grouping character is not specified then the thousands grouping of
numbers is not recognized.
The default parser configuration always exist. It may either be configured or
a default parser configuration is used.
The default parser configuration P0 defaults to the following settings:
nsep = ""
csep = "%"
multi = "-"
dec = ".,"
Parser configurations other than the default parser default to the
following settings:
nsep = ""
csep = ""
multi = ""
dec = "."
For example:
change fts parser config P0
nsep = "_", dec=".,";
This example changes the default parser configuration to no longer
consider an underline character ("_") as a word separator and
specifies the decimal point and grouping.
Note:
Changing an FTS parser configuration will cause rebuilding any FTS
indexes using this parser configuration.
FTS Stop-Word Lists
Eloquence supports configuring exclusion lists (aka stop-word lists).
By default the exclusion list E0 is used.
Multiple exclusion lists may be defined and FTS fields may be
associated with a specific exclusion list.
Exclusion lists are maintained in dbutil with the
following syntax:
CREATE FTS EXCLUSION LIST id
FROM "file name";
CREATE FTS EXCLUSION LIST id
ADD string [,...] [...];
CHANGE FTS EXCLUSION LIST id
FROM "file name";
CHANGE FTS EXCLUSION LIST id
{ADD|DELETE} string [,...] [...];
DELETE FTS EXCLUSION LIST id;
The CREATE FTS EXCLUSION LIST command creates a new exclusion list.
The CHANGE FTS EXCLUSION LIST statement may be used to change an
existing exclusion list. The DELETE FTS EXCLUSION LIST statement
deletes an exclusion list.
Id is an integer number that identifies the exclusion list.
It may be prepended with the letter "E". The exclusion list E0 (or 0)
specifies the default exclusion list.
An exclude file may be referenced in the dbutil syntax and is then
uploaded to the database catalog. As an alternative, individual
words may be added to or deleted from an exclusion list.
An exclude file is a text file with the following format:
- A stopword file (aka exclude file) is a text file
- Lines are separated by LF or CR LF
- A leading hash character (#) denotes a comment line
- Leading and trailing spaces are ignored
- One word per line
- Words are not case sensitive
- Words may be quoted to retain spaces (double quotes)
- Character set must match the Eloquence platform default (HP-ROMAN8 on
HP-UX or ISO-8859-1 on Windows or Linux) and is translated as needed when
uploaded to the server.
For example:
create fts exclusion list E0
from "mandb.excludes";
This example defines a default exlusion list from the
specified text file.
change fts exclusion list E1
add "more","words"
del "nolonger";
This example adds the words "more" and "words" to the exclusion
list E1 and removes the word "nolonger".
Note:
Changing an FTS exclusion list will currently not rebuild the affected
FTS indexes. The database server will ignore any excluded words in subsequent
searches.
FTS fields
FTS index fields are maintained in dbutil with the following
syntax:
CREATE FTS INDEX [field_name] in set_name
ON item [:[start.]length] [, item ...]
[OPTION op [,...];
CHANGE FTS INDEX field_name in set_name
[NAME field_name]
[ON item [:[start.]length] [, item ...]]
[OPTION op [,...];
DELETE FTS INDEX field_name in set_name;
The CREATE FTS INDEX command creates a new FTS field.
The CHANGE FTS INDEX statement may be used to change the
definition of an existing FTS index field.
The DELETE FTS INDEX statement deletes a FTS field.
set_name specifies the name of a data set.
field_name may be used to specify an alias name for the
FTS field. For simple FTS fields (using a single item),
the item name is used to identify the FTS field if no
alias name was specified.
A field_name must be used for any composite FTS fields
(fields concatenated from more than a single item).
The ON clause is used to specify one or more items to
be indexed in this field. Optionally, a starting position
in the item and a different item size may be specified.
The options clause specifies a comma separated list of field
options. The following options are supported:
- NP - no parse
-
The "NP" (no parse) option indicates to not parse a field.
Instead the content is indexed literally. This option may only be
specified for text fields.
- NT - no translate
-
The "NT" (no translate) option indicates to not "translate"
a field. Unless set, any words extracted from this field are upshifted and
"translated" to ASCII equivalents.
- NE - no exclude
-
The "NE" (no exclude) option indicates to not filter words
through an exclusion list (stopword list).
- NM - no multi
-
The "NM" (no multi) option indicates to not handle multiple
keywords seperately. Any multi-separator is treated as a separator.
- SX - soundex
-
The "SX" (soundex) option indicates that a soundex
(sound alike) lookup for any words obtained from this field should be supported.
When enabled an additional soundex representation of the keyword is maintained.
Note that a soundex code is not added for words that are recognized as numbers.
- NR - numeric range
-
The "NR" (numeric range) option indicates that a numeric range
search should also be supported for a text field. This is the default for
numeric fields. When enabled, any numbers in parsed text fields (subject to
the exclusion list and the min. keyword length) are also added to the numeric
keyword index. If not specified, numeric range searches in text fields will
use a text comparision rather than comparing the numeric range.
For example, 01:10 would not include the value 5.
- GROUP=n - field group
-
The GROUP=n or "Gn" (field group)
option indicates that the field is maintained as a group with any other fields that
share the same group number. "n" represents the group number and may
not be zero. This allows to combine keywords from multiple fields
(that might be in different sets of the same set-group) in one
query.
- EXCL=n - exclusion list
-
The EXCL=n or "En" (exclusion) option
specifies the exclusion list for this field. "n" represents the exclusion list id. If not
specified, the default exclusion list (E0) is used.
- PCFG=n - parser config
-
The PCFG=n or "Pn" (parser config) option
specifies the parser configuration when parsing the content of the field. "n" represents
the parser configuration id. If not specified, the default parser
configuration (P0) is used.
- MAX=n - max. keyword length
-
The MAX=n option may be used to specify the max. number of
characters included in a keyword. Any excessive characters are truncated.
The max. word length defaults to 12. Any value up to the internal limit of 32 characters
may be used.
- MIN=n - min. keyword length
-
The MIN=n option may be used to specify min. length of a word.
Shorter words are excluded from the index.
The min. keyword length defaults to 2 and must be smaller than the max.
keyword length.
- PATH=item - chain for linked detail set
-
The PATH=item option may be used to designate a chain
in a detail data set. When specified, keywords reference the master set in addition
to the detail (or instead of the detail set if the AGG option is used).
- AGG - aggregated keyword references
-
The AGG option may be used with the PATH= option to specify
keyword references are aggregated by a master set.
For example:
create fts index in CUSTOMERS on CITY option GROUP=1;
create fts index in CUSTOMERS on ZIP option GROUP=1;
The example defines the FTS fields on CITY and ZIP, placing
both in the field group 1.
create fts index in ORDER-DETAIL
on LINE-COMMENT option GROUP=1,path=ORDER-NO,agg;
This example defines a new FTS field in the data set
ORDER-DETAIL on the item LINE-COMMENT. The field group 1 is used and
the FTS index is aggregated by the master set for the ORDER-NO item.
Backwards Compatibility
Configuring and using FTS indexing requires internal database catalog
tables that did not exist in previous versions of Eloquence.
For databases created on older Eloquence versions, it is
necessary to upgrade the catalog tables with the "UPGRADE DATABASE"
command in dbutil before configuring any FTS indexing is possible.
After upgrading the database catalog tables, such databases may no
longer be fully compatible with older Eloquence versions. Using the
database volume files with an older Eloquence version or transferring
such a database to an older Eloquence version with dbstore/dbrestore
results in the FTS indexing tables being "ignored" by the older server
version (and the keyword indexing is corrupted if the database is
changed).
Furthermore, the old dbdumpcat and dbutil programs may not be able to
work with those databases in all cases.
Please note that there is no "downgrade database" command in dbutil.
|