Thesaurus dictionary

Status: Committed to CVS HEAD

Funded by Georgia Public Library Service and LibLime, Inc.

A thesaurus is a collection of words that includes information about the relationships between words and phrases: broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

  • Synonyms are words (or phrases) that describe the same concept.
  • Preferred term (PT) is the term selected among synonyms to be used for indexing and retrieval. It is also called the authorized term.
  • Non-preferred term (NPT) is a synonym of a preferred term that has an equivalent meaning to the PT but is not used for indexing. The thesaurus directs you to the PT.

Tsearch2's thesaurus dictionary is an extension of the synonym dictionary that adds support for phrases. Basically, the thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally, preserves them for indexing. Preserving NPTs makes it possible to use relationships (BT, NT) at query time. The thesaurus is used at indexing time, so any change to the thesaurus requires reindexing. (Do not confuse this with query rewriting, which is applied at the query stage and whose rules can be changed online without reindexing.)

Configuration

Thesaurus

A thesaurus is a plain text file in the following format:

input word(s) : indexed word(s)
...............................

The colon (:) symbol is used as the delimiter.

Note: the thesaurus dictionary looks for the longest match!
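
The file format and the longest-match rule can be illustrated with a small Python model. This is only a sketch of the idea, not tsearch2's actual implementation: the function names parse_thesaurus and apply_thesaurus are hypothetical, and normalization by a subdictionary (stemming, stop words) is left out on purpose. The sample rules are the ones used in the Usage section.

```python
def parse_thesaurus(text):
    """Parse 'input word(s) : indexed word(s)' lines into a phrase -> replacement map."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        left, right = line.split(":", 1)
        rules[tuple(left.split())] = right.split()
    return rules


def apply_thesaurus(words, rules):
    """Replace phrases left to right, always trying the longest candidate first."""
    max_len = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in rules:
                out.extend(rules[phrase])
                i += n
                break
        else:
            out.append(words[i])  # no rule matched; keep the word unchanged
            i += 1
    return out


# Sample rules taken from the Usage section of this page.
RULES = parse_thesaurus("""
one : 1
two : 2
one two : 12
""")

print(apply_thesaurus("one two day".split(), RULES))  # ['12', 'day']
print(apply_thesaurus("one day".split(), RULES))      # ['1', 'day']
```

Because the two-word rule is tried before the one-word rules, "one two" becomes 12 rather than 1 followed by 2.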

tsearch2

tsearch2 comes with a thesaurus template, which can be used to define a new dictionary:

INSERT INTO pg_ts_dict
               (SELECT 'tz_simple', dict_init,
                        'DictFile="/path/to/tz_simple.txt",'
                        'Dictionary="en_stem"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'thesaurus_template');

Here:

  • tz_simple is the dictionary name.
  • DictFile="/path/to/tz_simple.txt" is the location of the thesaurus file.
  • Dictionary="en_stem" defines the dictionary (the snowball English stemmer) to use for thesaurus normalization. Note that the en_stem dictionary has its own configuration (stop words, for example). Only one dictionary can be specified here.

Now tz_simple can be used in pg_ts_cfgmap, for example:

update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
tok_alias in ('lhword', 'lword', 'lpart_hword');
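
The dict_name list is ordered: a token is offered to each dictionary in turn until one recognizes it. The Python sketch below models that chain under simplifying assumptions. The functions tz_simple, en_stem and lexize are illustrative stand-ins, not tsearch2 APIs; the stop-word list is invented for the example; phrase handling and real stemming are omitted.

```python
STOP_WORDS = {"is", "a", "the"}  # assumed list, for illustration only


def tz_simple(token):
    """Stand-in for the thesaurus: single-word rules only (phrases omitted)."""
    return {"one": ["1"], "two": ["2"]}.get(token)  # None means "not recognized"


def en_stem(token):
    """Stand-in for the stemmer: drops stop words, passes other words through."""
    if token in STOP_WORDS:
        return []  # recognized as a stop word, so nothing is indexed
    return [token]  # a real stemmer would also reduce the word to its stem


def lexize(token, chain):
    """Offer the token to each dictionary in order; the first one that answers wins."""
    for dictionary in chain:
        result = dictionary(token)
        if result is not None:
            return result
    return []  # no dictionary recognized the token


chain = [tz_simple, en_stem]  # same order as '{tz_simple,en_stem}'
lexemes = [lex for tok in "one day is oneday".split() for lex in lexize(tok, chain)]
print(lexemes)  # ['1', 'day', 'oneday']
```

Listing tz_simple first matters: if en_stem came first, it would accept every word and the thesaurus would never be consulted.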

Usage

tz_simple:

one : 1
two : 2
one two : 12

To see how the thesaurus works, use the to_tsvector, to_tsquery or plainto_tsquery functions:

=# select plainto_tsquery('default_russian',' one day is oneday');
    plainto_tsquery
------------------------
 '1' & 'day' & 'oneday'

=# select plainto_tsquery('default_russian','one two day is oneday');
     plainto_tsquery
-------------------------
 '12' & 'day' & 'oneday'