Thesaurus dictionary

Status: Committed to CVS HEAD

Funded by Georgia Public Library Service and LibLime, Inc.

A thesaurus is a collection of words that includes information about the relationships between words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

  • Synonyms are words (or phrases) that describe the same concept.
  • Preferred term (PT) is the term selected among synonyms to be used for indexing and retrieval. It is also called the authorized term.
  • Non-preferred term (NPT) is a synonym of a preferred term that has the same meaning as the PT but is not used for indexing. The thesaurus directs you to the PT.

Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally, preserves the originals for indexing. Preserving NPTs makes it possible to use relationships (BT, NT) at query time. The thesaurus is applied during indexing, so any change to it requires reindexing. Do not confuse this with query rewriting, which is applied at the query stage and whose rules can be changed online without reindexing.

Tsearch2's thesaurus dictionary (TZ) is an extension of the synonym dictionary that adds support for phrases. We were able to introduce a new dictionary type while preserving compatibility with the old interface. Technically, TZ maintains its state and interacts with the parser. TZ uses a subdictionary (which should be defined in the tsearch2 configuration) to normalize the thesaurus text. Only one subdictionary can be defined. Note that the subdictionary produces an error if it cannot recognize a word. In that case, either remove the definition line containing this word or teach the subdictionary to know it. Some dictionaries ('simple', stemmers) recognize any word.

Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder': the position of the word is stored, while the exact stop-word is not important. To break possible ties, the thesaurus applies the last definition. For example, consider thesaurus rules (with the simple subdictionary) that share the pattern 'swsw', where 's' designates a stop-word and 'w' a known word:

a one the two : swsw
the one a two : swsw2

The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary. The thesaurus considers the texts 'the one the two' and 'that one then two' equal and uses the definition 'swsw2'.
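
The placeholder and tie-breaking behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the tsearch2 implementation: the names STOP_WORDS, normalize and compile_rules are hypothetical, and the stop-word list is assumed to contain 'that' and 'then' in addition to 'a' and 'the'.

```python
# Illustrative sketch only; not the tsearch2 implementation.
# Assumed stop-word list of the subdictionary (hypothetical).
STOP_WORDS = {"a", "the", "that", "then"}

PLACEHOLDER = None  # stands for "any stop-word at this position"


def normalize(text):
    """Split the text and replace every stop-word with the placeholder."""
    return tuple(PLACEHOLDER if w in STOP_WORDS else w
                 for w in text.lower().split())


def compile_rules(lines):
    """Map each normalized pattern to its substitute; the last rule wins."""
    rules = {}
    for line in lines:
        sample, substitute = line.split(":", 1)
        rules[normalize(sample)] = substitute.strip()
    return rules


rules = compile_rules(["a one the two : swsw", "the one a two : swsw2"])
print(len(rules))                            # 1: both rules share one pattern
print(rules[normalize("the one the two")])   # swsw2
print(rules[normalize("that one then two")]) # swsw2
```

Because both rules normalize to the same pattern, the second definition simply overwrites the first, which is one simple way to model "the last definition applies".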

Like a normal dictionary, TZ should be assigned to specific lexeme types. Since TZ can recognize phrases, it must remember its state and interact with the parser. TZ uses these assignments to decide whether it should handle the next word or stop accumulating. Whoever compiles the thesaurus should take care to configure it properly to avoid confusion. For example, if TZ is assigned to handle only the lword lexeme type, then a TZ definition like 'one 1 : 11' will not work, since the digit lexeme type is not assigned to TZ.

Configuration

Thesaurus

A thesaurus is a plain text file in the following format:

# this is a comment 
sample word(s) : indexed word(s)
...............................

  • The colon (:) symbol is used as a delimiter.
  • Use an asterisk (*) at the beginning of an indexed word to skip the subdictionary. The sample words must still be known to the subdictionary.
  • The thesaurus dictionary looks for the longest match.
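
As a rough illustration of the file format above, here is a minimal Python sketch of a parser. It is not the tsearch2 reader: the function name parse_thesaurus and the sample rules are hypothetical, and it assumes that comments occupy whole lines.

```python
# Illustrative sketch only; parse_thesaurus is a hypothetical helper.
def parse_thesaurus(text):
    """Return a list of (sample_words, [(indexed_word, skip_subdictionary)])."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        sample, sep, indexed = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' delimiter in line: {raw!r}")
        targets = []
        for word in indexed.split():
            skip = word.startswith("*")  # '*' means: bypass the subdictionary
            targets.append((word[1:] if skip else word, skip))
        rules.append((tuple(sample.split()), targets))
    return rules


example = """
# this is a comment
one two : 12
supernovae stars : *sn
"""
parsed = parse_thesaurus(example)
for sample, targets in parsed:
    print(sample, targets)
```

The sketch only records the asterisk as a flag; what the real dictionary does with a word that bypasses the subdictionary is outside its scope.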

tsearch2

tsearch2 comes with a thesaurus template that can be used to define a new dictionary:

INSERT INTO pg_ts_dict
               (SELECT 'tz_simple', dict_init,
                        'DictFile="/path/to/tz_simple.txt",'
                        'Dictionary="en_stem"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'thesaurus_template');

Here:

  • tz_simple is the dictionary name.
  • DictFile="/path/to/tz_simple.txt" is the location of the thesaurus file.
  • Dictionary="en_stem" defines the dictionary (the Snowball English stemmer) used for thesaurus normalization. Note that the en_stem dictionary has its own configuration (stop-words, for example).

Now, it's possible to use tz_simple in pg_ts_cfgmap, for example:

update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
tok_alias in ('lhword', 'lword', 'lpart_hword');

Examples

tz_simple:

one : 1
two : 2
one two : 12
the one : 1
one 1 : 11

To see how the thesaurus works, use the to_tsvector, to_tsquery or plainto_tsquery functions:

=# select plainto_tsquery('default_russian',' one day is oneday');
    plainto_tsquery
------------------------
 '1' & 'day' & 'oneday'

=# select plainto_tsquery('default_russian','one two day is oneday');
     plainto_tsquery
-------------------------
 '12' & 'day' & 'oneday'

=# select plainto_tsquery('default_russian','the one');
NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
 plainto_tsquery
-----------------
 '1'
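
The longest-match behaviour seen in these queries can be approximated with the following Python sketch. It is a simplified model, not the tsearch2 code: the lexize function is hypothetical, the stop-word set is an assumption, stemming is omitted, and 'the' is matched literally rather than through a stop-word placeholder.

```python
# Illustrative sketch only; a simplified model of longest-match lookup.
RULES = {
    ("one",): "1",
    ("two",): "2",
    ("one", "two"): "12",
    ("the", "one"): "1",
    ("one", "1"): "11",
}
STOP_WORDS = {"the", "is"}  # assumed stop-words of the fallback dictionary
MAX_LEN = max(len(sample) for sample in RULES)


def lexize(text):
    """Replace the longest matching phrase at each position."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in RULES:
                out.append(RULES[phrase])
                i += n
                break
        else:
            # No rule matched: fall through, dropping stop-words.
            if words[i] not in STOP_WORDS:
                out.append(words[i])
            i += 1
    return out


print(lexize("one day is oneday"))      # ['1', 'day', 'oneday']
print(lexize("one two day is oneday"))  # ['12', 'day', 'oneday']
print(lexize("the one"))                # ['1']
```

Trying longer phrases before shorter ones is what makes 'one two' resolve to '12' instead of '1' followed by '2'.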

If you add the NPT alongside the PT in a TZ definition, the resulting tsvector contains the NPT as well as the PT, so it can also be used to search for the NPT, not just the PT, using a different tsearch2 configuration.

supernovae stars : supernovae stars SN
crab nebulae : crab nebulae SN

Searching for 'supernovae stars' or 'crab nebulae' with TZ support finds the same set of documents, since both queries are converted to 'SN'. At the same time, a configuration without TZ support can use the same tsvector to search specifically for 'supernovae stars'. This looks cumbersome, but various tools can be used to build a TZ from scratch easily.
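
To make the idea of preserving NPTs concrete, here is a small Python sketch. It is an assumption-laden model rather than tsearch2 itself: the names THESAURUS, to_lexemes and search, and the two sample documents, are hypothetical. The text does not spell out how the query side is reduced to 'SN', so the sketch searches for the PT directly.

```python
# Illustrative sketch only; models indexing that keeps NPTs next to the PT.
THESAURUS = {
    ("supernovae", "stars"): ["supernovae", "stars", "SN"],
    ("crab", "nebulae"): ["crab", "nebulae", "SN"],
}


def to_lexemes(text, use_thesaurus=True):
    """Return the set of lexemes, optionally applying two-word rules."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if use_thesaurus and pair in THESAURUS:
            out.extend(THESAURUS[pair])  # NPT words plus the PT
            i += 2
        else:
            out.append(words[i])
            i += 1
    return set(out)


# Hypothetical documents, indexed once with the thesaurus enabled.
DOCS = {1: "observations of crab nebulae", 2: "a catalogue of supernovae stars"}
INDEX = {doc_id: to_lexemes(text) for doc_id, text in DOCS.items()}


def search(required):
    """Return ids of documents containing every required lexeme."""
    return sorted(d for d, lexemes in INDEX.items() if required <= lexemes)


print(search({"SN"}))  # [1, 2]: both phrases share the preferred term
print(search(to_lexemes("supernovae stars", use_thesaurus=False)))  # [2]
```

The same index answers both the broad query on the preferred term and the narrow query on the original words, which is the point of keeping the NPT in the definition.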