.
contact contact

B.08.20 / Release Notes / Database / FTS / dbutil syntax

FTS dbutil syntax

 
.
  This document explains the dbutil syntax to configure FTS indexing.


UPGRADE DATABASE

Using FTS indexes requires additional database catalog tables to maintain the FTS structural information. Databases created with previous Eloquence versions must be upgraded to use FTS indexing.

All commands in this document are only valid inside a database context:

DATABASE "example";
The UPGRADE DATABASE command upgrades the database catalog for databases created on older Eloquence versions:
UPGRADE DATABASE;
Please note that upgrading the database catalog may affect compatibility with older Eloquence releases and there is no "downgrade database" command in dbutil.


DELETE ALL FTS

The DELETE ALL FTS command removes all FTS resources for a database.
DELETE ALL FTS;
This deletes any FTS fields, the FTS parser config and FTS exclusion lists.


Parser Configuration

The FTS parser operates on text fields and splits them into separate words. Its operation depends on the parser configuration and the FTS field options.

Eloquence supports configuring the parsing of text fields globally or per field. By default the parser config id P0 is used. Multiple parser configurations may be defined and individual FTS fields may be associated with a specific parser configuration.

Parser configurations are maintained in dbutil with the following syntax:

CREATE FTS PARSER CONFIG id
option = "string" [, ...];

CHANGE FTS PARSER CONFIG id
option = "string" [, ...];

DELETE FTS PARSER CONFIG id;
The CREATE FTS PARSER command creates a new parser config. The CHANGE FTS PARSER statement may be used to change an existing parser configuration. The DELETE FTS PARSER statement deletes a parser configuration.

Id is an integer number that identifies the parser configuration. It may be prepended with the letter "P". The parser configuration P0 (or 0) specifies the default parser configuration.

The following parser options may be configured:

NSEP - word separator exception list
"nsep" specifies characters that would otherwise be considered a separator to be treated like a letter.
By default, each character that is neither a letter nor a digit is considered a word separator. Characters listed in the NSEP option are no longer considered a word separator.

CSEP - conditional separator list
"csep" specifies characters that would otherwise considered a separator to be treated like a letter when appearing inside a word.
Usually, characters that are neither letters or numbers are word separators and discarded. However, a "csep" character is ignored at the beginning of a word but is considered a regular character after a word has started. By default, the percent sign is considered a "csep" separator. So the word "15%" may be distinguished from the value "15".

MULTI - multi separator list
"multi" specifies characters that function as a separator but result in multiple keywords to be indexed.
Words separated by "multi" separators are added to the index multiple times. By default, the hyphen is considered a "multi" separator. For example, "TIC-TAC-TOE" could be located with TIC-TAC-TOE, TIC, TAC or TOE.

DEC - decimal point and grouping
"dec" specifies the decimal point and grouping characters (in this order). If the decimal point is not specified ("dec" is an ampty string) then decimal numbers are not recognized. If the grouping character is not specified then the thousands grouping of numbers is not recognized.
The default parser configuration always exist. It may either be configured or a default parser configuration is used.

The default parser configuration P0 defaults to the following settings:

nsep  = ""
csep  = "%"
multi = "-"
dec   = ".,"
Parser configurations other than the default parser default to the following settings:
nsep  = ""
csep  = ""
multi = ""
dec   = "."
For example:
change fts parser config P0
nsep = "_", dec=".,";
This example changes the default parser configuration to no longer consider an underline character ("_") as a word separator and specifies the decimal point and grouping.

Note: Changing an FTS parser configuration will cause rebuilding any FTS indexes using this parser configuration.


FTS Stop-Word Lists

Eloquence supports configuring exclusion lists (aka stop-word lists). By default the exclusion list E0 is used. Multiple exclusion lists may be defined and FTS fields may be associated with a specific exclusion list.

Exclusion lists are maintained in dbutil with the following syntax:

CREATE FTS EXCLUSION LIST id
FROM "file name";

CREATE FTS EXCLUSION LIST id
ADD string [,...] [...];

CHANGE FTS EXCLUSION LIST id
FROM "file name";

CHANGE FTS EXCLUSION LIST id
{ADD|DELETE} string [,...] [...];

DELETE FTS EXCLUSION LIST id;
The CREATE FTS EXCLUSION LIST command creates a new exclusion list. The CHANGE FTS EXCLUSION LIST statement may be used to change an existing exclusion list. The DELETE FTS EXCLUSION LIST statement deletes an exclusion list.

Id is an integer number that identifies the exclusion list. It may be prepended with the letter "E". The exclusion list E0 (or 0) specifies the default exclusion list.

An exclude file may be referenced in the dbutil syntax and is then uploaded to the database catalog. As an alternative, individual words may be added to or deleted from an exclusion list.

An exclude file is a text file with the following format:

  • A stopword file (aka exclude file) is a text file
  • Lines are separated by LF or CR LF
  • A leading hash character (#) denotes a comment line
  • Leading and trailing spaces are ignored
  • One word per line
  • Words are not case sensitive
  • Words may be quoted to retain spaces (double quotes)
  • Character set must match the Eloquence platform default (HP-ROMAN8 on HP-UX or ISO-8859-1 on Windows or Linux) and is translated as needed when uploaded to the server.

For example:

create fts exclusion list E0
from "mandb.excludes";
This example defines a default exlusion list from the specified text file.
change fts exclusion list E1
add "more","words"
del "nolonger";
This example adds the words "more" and "words" to the exclusion list E1 and removes the word "nolonger".

Note: Changing an FTS exclusion list will currently not rebuild the affected FTS indexes. The database server will ignore any excluded words in subsequent searches.


FTS fields

FTS index fields are maintained in dbutil with the following syntax:
CREATE FTS INDEX [field_name] in set_name
ON item [:[start.]length] [, item ...]
[OPTION op [,...];

CHANGE FTS INDEX field_name in set_name
[NAME field_name]
[ON item [:[start.]length] [, item ...]]
[OPTION op [,...];

DELETE FTS INDEX field_name in set_name;
The CREATE FTS INDEX command creates a new FTS field. The CHANGE FTS INDEX statement may be used to change the definition of an existing FTS index field. The DELETE FTS INDEX statement deletes a FTS field.

set_name specifies the name of a data set. field_name may be used to specify an alias name for the FTS field. For simple FTS fields (using a single item), the item name is used to identify the FTS field if no alias name was specified. A field_name must be used for any composite FTS fields (fields concatenated from more than a single item).

The ON clause is used to specify one or more items to be indexed in this field. Optionally, a starting position in the item and a different item size may be specified.

The options clause specifies a comma separated list of field options. The following options are supported:

NP - no parse
The "NP" (no parse) option indicates to not parse a field. Instead the content is indexed literally. This option may only be specified for text fields.

NT - no translate
The "NT" (no translate) option indicates to not "translate" a field. Unless set, any words extracted from this field are upshifted and "translated" to ASCII equivalents.

NE - no exclude
The "NE" (no exclude) option indicates to not filter words through an exclusion list (stopword list).

NM - no multi
The "NM" (no multi) option indicates to not handle multiple keywords seperately. Any multi-separator is treated as a separator.

SX - soundex
The "SX" (soundex) option indicates that a soundex (sound alike) lookup for any words obtained from this field should be supported. When enabled an additional soundex representation of the keyword is maintained. Note that a soundex code is not added for words that are recognized as numbers.

NR - numeric range
The "NR" (numeric range) option indicates that a numeric range search should also be supported for a text field. This is the default for numeric fields. When enabled, any numbers in parsed text fields (subject to the exclusion list and the min. keyword length) are also added to the numeric keyword index. If not specified, numeric range searches in text fields will use a text comparision rather than comparing the numeric range. For example, 01:10 would not include the value 5.

GROUP=n - field group
The GROUP=n or "Gn" (field group) option indicates that the field is maintained as a group with any other fields that share the same group number. "n" represents the group number and may not be zero. This allows to combine keywords from multiple fields (that might be in different sets of the same set-group) in one query.

EXCL=n - exclusion list
The EXCL=n or "En" (exclusion) option specifies the exclusion list for this field. "n" represents the exclusion list id. If not specified, the default exclusion list (E0) is used.

PCFG=n - parser config
The PCFG=n or "Pn" (parser config) option specifies the parser configuration when parsing the content of the field. "n" represents the parser configuration id. If not specified, the default parser configuration (P0) is used.

MAX=n - max. keyword length
The MAX=n option may be used to specify the max. number of characters included in a keyword. Any excessive characters are truncated. The max. word length defaults to 12. Any value up to the internal limit of 32 characters may be used.

MIN=n - min. keyword length
The MIN=n option may be used to specify min. length of a word. Shorter words are excluded from the index. The min. keyword length defaults to 2 and must be smaller than the max. keyword length.

PATH=item - chain for linked detail set
The PATH=item option may be used to designate a chain in a detail data set. When specified, keywords reference the master set in addition to the detail (or instead of the detail set if the AGG option is used).

AGG - aggregated keyword references
The AGG option may be used with the PATH= option to specify keyword references are aggregated by a master set.
For example:
create fts index in CUSTOMERS on CITY option GROUP=1;
create fts index in CUSTOMERS on ZIP  option GROUP=1;
The example defines the FTS fields on CITY and ZIP, placing both in the field group 1.
create fts index in ORDER-DETAIL
  on LINE-COMMENT option GROUP=1,path=ORDER-NO,agg;
This example defines a new FTS field in the data set ORDER-DETAIL on the item LINE-COMMENT. The field group 1 is used and the FTS index is aggregated by the master set for the ORDER-NO item.


Backwards Compatibility

Configuring and using FTS indexing requires internal database catalog tables that did not exist in previous versions of Eloquence. For databases created on older Eloquence versions, it is necessary to upgrade the catalog tables with the "UPGRADE DATABASE" command in dbutil before configuring any FTS indexing is possible.

After upgrading the database catalog tables, such databases may no longer be fully compatible with older Eloquence versions. Using the database volume files with an older Eloquence version or transferring such a database to an older Eloquence version with dbstore/dbrestore results in the FTS indexing tables being "ignored" by the older server version (and the keyword indexing is corrupted if the database is changed). Furthermore, the old dbdumpcat and dbutil programs may not be able to work with those databases in all cases.

Please note that there is no "downgrade database" command in dbutil.


 
 
.
 
 
  Privacy | Webmaster | Terms of use | Impressum Revision:  2013-03-28  
  Copyright © 2011 Marxmeier Software AG