2.8. Dictionaries

CREATE FUNCTION lexize({ oid | dict_name text }, lexeme text) RETURNS text[]

Returns an array of lexemes if the input lexeme is known to the dictionary dict_name, an empty array if the lexeme is known to the dictionary but is a stop-word, or NULL if the word is unknown.

=# select lexize('en_stem', 'stars');
 lexize
--------
 {star}
=# select lexize('en_stem', 'a');
 lexize
--------
 {}
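The three possible results can be modelled outside the database. The following is a minimal Python sketch, not the actual implementation: the word list, the stop-word set and the function toy_lexize are hypothetical, and Python's None stands in for SQL NULL.

```python
from typing import List, Optional

# Hypothetical toy dictionary: known words mapped to their normalized lexemes.
KNOWN_WORDS = {"stars": ["star"], "star": ["star"]}
# Hypothetical stop-word list.
STOP_WORDS = {"a", "the"}


def toy_lexize(lexeme: str) -> Optional[List[str]]:
    """Return lexemes, an empty list for a stop-word, or None if unknown."""
    word = lexeme.lower()
    if word in STOP_WORDS:
        return []                 # known, but a stop-word: empty array
    if word in KNOWN_WORDS:
        return KNOWN_WORDS[word]  # known: array of lexemes
    return None                   # unknown word: NULL


print(toy_lexize("stars"))             # known word
print(toy_lexize("a"))                 # stop-word
print(toy_lexize("supernovae stars"))  # not parsed, looked up as one lexeme
```

The last call treats the whole string as a single lexeme, so a dictionary of this kind cannot recognize it.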

Note: the lexize function expects a lexeme, not text! Below is an illustrative example:

apod=# select lexize('tz_astro','supernovae stars') is null;
 ?column?
----------
 t

The thesaurus dictionary tz_astro does know the phrase supernovae stars, but lexize fails because it does not parse the input text and treats it as a single lexeme. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries.

apod=# select plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'

There are several predefined dictionaries and templates. Templates are used to create new dictionaries that override the default values of parameters. FTS Reference Part I describes the SQL commands (CREATE FULLTEXT DICTIONARY, DROP FULLTEXT DICTIONARY, ALTER FULLTEXT DICTIONARY) for managing dictionaries.

2.8.1. Simple dictionary

This dictionary returns the lowercased input word, or an empty array if the word is a stop-word. The following example shows how to specify the location of the stop-word file.

=# CREATE FULLTEXT DICTIONARY public.my_simple
   OPTION 'english.stop'
   LIKE pg_catalog.simple;

Relative paths in OPTION are resolved relative to $PGROOT/share. Now we can test our dictionary:

=# select lexize('public.my_simple','YeS');
 lexize
--------
 {yes}
=# select lexize('public.my_simple','The');
 lexize
--------
 {}

2.8.2. Ispell dictionary

The Ispell template dictionary for FTS allows the creation of morphological dictionaries based on Ispell, which supports a large number of languages. This dictionary tries to reduce an input word to its infinitive form. More modern spelling dictionaries are also supported: MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki.

An Ispell dictionary allows searching without worrying about the different linguistic forms of a word. For example, a search on bank would return hits for all declensions and conjugations of the search term bank: banking, banked, banks, banks', bank's, etc.

=# select lexize('en_ispell','banking');
 lexize
--------
 {bank}
=# select lexize('en_ispell','bank''s');
 lexize
--------
 {bank}
=# select lexize('en_ispell','banked');
 lexize
--------
 {bank}

To create an ispell dictionary, use the built-in ispell_template dictionary and specify several parameters.

CREATE FULLTEXT DICTIONARY en_ispell 
OPTION 'DictFile="/usr/local/share/dicts/ispell/english.dict",
        AffFile="/usr/local/share/dicts/ispell/english.aff",
        StopFile="/usr/local/share/dicts/ispell/english.stop"'
LIKE ispell_template;

Here, DictFile, AffFile and StopFile are the locations of the dictionary file, the affix file and the stop-word file.

Relative paths in OPTION are resolved relative to $PGROOT/share/dicts_data.

CREATE FULLTEXT DICTIONARY en_ispell 
OPTION 'DictFile="ispell/english.dict",
        AffFile="ispell/english.aff",
        StopFile="english.stop"'
LIKE ispell_template;

An Ispell dictionary usually recognizes a restricted set of words, so it should be used in conjunction with another, "broader" dictionary, for example a stemming dictionary, which recognizes "everything".

The Ispell dictionary supports splitting compound words based on an ispell dictionary. This is a nice feature and FTS in PostgreSQL supports it. Note that the affix file must specify a special flag with the compoundwords controlled statement; this flag is used in the dictionary to mark words that can participate in compound formation.

compoundwords  controlled z

Several examples for the Norwegian language:

=# select lexize('norwegian_ispell','overbuljongterningpakkmesterassistent');
 {over,buljong,terning,pakk,mester,assistent}
=# select lexize('norwegian_ispell','sjokoladefabrikk');
 {sjokoladefabrikk,sjokolade,fabrikk}

Note: MySpell does not support compound words, whereas Hunspell has sophisticated support for them. At present, FTS implements only the basic compound word operations of Hunspell.

2.8.3. Snowball stemming dictionary

The Snowball template dictionary is based on the project of Martin Porter, inventor of the popular Porter stemming algorithm for the English language. Snowball now supports many languages (see the Snowball site for more information). FTS contains a large number of stemmers for many languages. The only option that a snowball stemmer accepts is the location of a file with stop-words. It can be defined using the ALTER FULLTEXT DICTIONARY command.

ALTER FULLTEXT DICTIONARY en_stem 
OPTION '/usr/local/share/dicts/ispell/english-utf8.stop';

Relative paths in OPTION are resolved relative to $PGROOT/share/dicts_data.

ALTER FULLTEXT DICTIONARY en_stem OPTION 'english.stop';

A Snowball dictionary recognizes everything, so the best practice is to place it at the end of the dictionary stack. It is useless to place it before any other dictionary, because a lexeme will never pass through the stemmer to the dictionaries that follow it.
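The effect of ordering can be illustrated with a small Python sketch. It assumes that a stack is consulted in order and that the first dictionary returning a non-NULL result wins; the dictionaries toy_ispell and toy_stemmer and the helper run_stack are hypothetical stand-ins, not real FTS code.

```python
from typing import Callable, List, Optional, Tuple

Dictionary = Callable[[str], Optional[List[str]]]


def toy_ispell(word: str) -> Optional[List[str]]:
    """Hypothetical restricted dictionary: knows only a few word forms."""
    known = {"banking": ["bank"], "banked": ["bank"]}
    return known.get(word.lower())  # None for unknown words


def toy_stemmer(word: str) -> Optional[List[str]]:
    """Hypothetical stemmer: recognizes everything, so it never returns None."""
    w = word.lower()
    return [w[:-1]] if len(w) > 1 and w.endswith("s") else [w]


def run_stack(stack: List[Tuple[str, Dictionary]], word: str):
    """Return (dictionary name, lexemes) from the first dictionary that answers."""
    for name, dictionary in stack:
        result = dictionary(word)
        if result is not None:
            return name, result
    return None, None


good = [("toy_ispell", toy_ispell), ("toy_stemmer", toy_stemmer)]
bad = [("toy_stemmer", toy_stemmer), ("toy_ispell", toy_ispell)]

print(run_stack(good, "banking"))  # answered by toy_ispell
print(run_stack(good, "stars"))    # falls through to toy_stemmer
print(run_stack(bad, "banking"))   # toy_ispell is never reached
```

With the stemmer first, the restricted dictionary is never consulted, which is why the stemmer belongs at the end of the stack.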

2.8.4. Synonym dictionary

This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported; use the thesaurus dictionary (Section 2.8.5) if you need them. A synonym dictionary can be used to overcome linguistic problems, for example to prevent an English stemmer dictionary from reducing the word 'Paris' to 'pari'. In that case, it is enough to have a Paris paris line in the synonym dictionary and to put it before the en_stem dictionary.

=# select * from ts_debug('english','Paris');
 Alias | Description | Token |      Dicts list      |       Lexized token
-------+-------------+-------+----------------------+----------------------------
 lword | Latin word  | Paris | {pg_catalog.en_stem} | pg_catalog.en_stem: {pari}
(1 row)
=# alter fulltext mapping on english for lword with synonym,en_stem;
ALTER FULLTEXT MAPPING
Time: 340.867 ms
postgres=# select * from ts_debug('english','Paris');
 Alias | Description | Token |               Dicts list                |        Lexized token
-------+-------------+-------+-----------------------------------------+-----------------------------
 lword | Latin word  | Paris | {pg_catalog.synonym,pg_catalog.en_stem} | pg_catalog.synonym: {paris}
(1 row)

2.8.5. Thesaurus dictionary

A thesaurus is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally, preserves them for indexing. The thesaurus is used during indexing, so any change to the thesaurus requires reindexing. The current implementation of the thesaurus dictionary (TZ) is an extension of the synonym dictionary with phrase support. A thesaurus is a plain file of the following format:

# this is a comment 
sample word(s) : indexed word(s)
...............................

where the colon (:) symbol is a delimiter.

TZ uses a subdictionary (which should be defined in the FTS configuration) to normalize the thesaurus text. Only one subdictionary can be defined. Note that the subdictionary produces an error if it cannot recognize a word. In that case, you should remove the definition line containing this word or teach the subdictionary to know it. Use an asterisk (*) at the beginning of an indexed word to skip the subdictionary. The sample words must still be known to it.

The thesaurus dictionary looks for the longest match.
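Longest-match substitution can be sketched in Python as follows. This is an illustration under assumptions, not the real matching code: the rule table, the single-word rule with its sn_short replacement, and the function substitute are hypothetical, and unmatched tokens are simply copied through, whereas in FTS they would be passed to the next dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: sample phrase (as a token tuple) -> indexed words.
RULES: Dict[Tuple[str, ...], List[str]] = {
    ("supernovae",): ["sn_short"],    # hypothetical shorter rule
    ("supernovae", "stars"): ["sn"],
    ("crab", "nebulae"): ["crab"],
}


def substitute(tokens: List[str]) -> List[str]:
    """Replace the longest matching phrase at each position."""
    longest = max(len(key) for key in RULES)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.extend(RULES[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no rule starts here: copy the token
            i += 1
    return out


print(substitute(["supernovae", "stars", "near", "crab", "nebulae"]))
print(substitute(["supernovae"]))
```

In the first call the two-word rule is preferred over the one-word rule that matches the same starting token.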

Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e., only their position is important. To break possible ties, the thesaurus applies the last definition. To illustrate this, consider thesaurus rules (with a simple subdictionary) with the pattern 'swsw', where 's' designates any stop-word and 'w' any known word:

a one the two : swsw
the one a two : swsw2

The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary. The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition 'swsw2'.
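The placeholder and tie-breaking rules can be modelled with a short Python sketch. It assumes that 'that' and 'then' are also in the subdictionary's stop-word list; the names STOP_WORDS, normalize and compile_rules are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed stop-word list of the subdictionary.
STOP_WORDS = {"a", "the", "that", "then"}

Pattern = Tuple[Optional[str], ...]


def normalize(phrase: str) -> Pattern:
    """Replace each stop-word by a placeholder (None), keeping its position."""
    return tuple(None if w in STOP_WORDS else w for w in phrase.lower().split())


def compile_rules(lines: List[str]) -> Dict[Pattern, List[str]]:
    """Build the rule table; a later definition overwrites an equal earlier one."""
    rules: Dict[Pattern, List[str]] = {}
    for line in lines:
        sample, indexed = line.split(":", 1)
        rules[normalize(sample)] = indexed.split()
    return rules


rules = compile_rules(["a one the two : swsw", "the one a two : swsw2"])

print(len(rules))                          # both rules share one pattern
print(rules[normalize("the one the two")])
print(rules[normalize("that one then two")])
```

Both sample phrases normalize to the same pattern, so only the last definition survives in the table.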

Like a normal dictionary, it should be assigned to specific lexeme types. Since TZ can recognize phrases, it must remember its state and interact with the parser. TZ uses these assignments to check whether it should handle the next word or stop accumulating. The compiler of a TZ should take care to configure it properly to avoid confusion. For example, if TZ is assigned to handle only the lword lexeme type, then a TZ definition like ' one 1:11' will not work, since the lexeme type digit is not assigned to the TZ.

2.8.5.1. Thesaurus configuration

To define a new thesaurus dictionary, one can use the thesaurus template, for example:

CREATE FULLTEXT DICTIONARY tz_simple
OPTION 'DictFile="dicts_data/thesaurus.txt.sample", Dictionary="en_stem"'
LIKE thesaurus_template;

Here:

  • tz_simple is the thesaurus dictionary name

  • DictFile="/path/to/tz_simple.txt" is the location of the thesaurus file

  • Dictionary="en_stem" defines the dictionary (the Snowball English stemmer) to use for thesaurus normalization. Note that the en_stem dictionary has its own configuration (stop-words, for example).

Now it is possible to bind the thesaurus dictionary tz_simple to selected token types, for example:

ALTER FULLTEXT MAPPING ON russian_utf8 FOR lword,lhword,lpart_hword WITH tz_simple;

2.8.5.2. Thesaurus examples

Let's consider a simple astronomical thesaurus, tz_astro, which contains some astronomical word combinations:

supernovae stars : sn
crab nebulae : crab

Below, we create the dictionary and bind some token types to the astronomical thesaurus and the English stemmer.

=# CREATE FULLTEXT DICTIONARY tz_astro OPTION 
    'DictFile="dicts_data/tz_astro.txt", Dictionary="en_stem"' 
   LIKE thesaurus_template;
=# ALTER FULLTEXT MAPPING ON russian_utf8 FOR lword,lhword,lpart_hword 
   WITH tz_astro,en_stem;

Now we can see how it works. Note that lexize cannot be used for testing a thesaurus (see the description of lexize), so we use the plainto_tsquery and to_tsvector functions, which accept a text argument, not a lexeme.

=# select plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'
=# select to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1

In principle, one can use to_tsquery if the argument is quoted.

=# select to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'

Note that supernova star matches supernovae stars in tz_astro because we specified the en_stem stemmer in the thesaurus definition.

To keep the original phrase in the full-text index, just add it to the right-hand side of the definition:

supernovae stars : sn supernovae stars
--------------------------------------
=# select plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'
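The role of the subdictionary in these examples can be modelled with a rough Python sketch. It assumes that both sides of each definition and the input text are normalized by the same subdictionary; toy_stem is a hypothetical stand-in that handles only the words used here and is not the real en_stem stemmer, and compile_thesaurus and lookup are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for the subdictionary (not the real en_stem)."""
    w = word.lower()
    if w.endswith("ae"):
        return w[:-1]  # supernovae -> supernova
    if w.endswith("s"):
        return w[:-1]  # stars -> star
    return w


def compile_thesaurus(lines: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Normalize both sides of each 'sample : indexed' definition."""
    rules: Dict[Tuple[str, ...], List[str]] = {}
    for line in lines:
        sample, indexed = line.split(":", 1)
        key = tuple(toy_stem(w) for w in sample.split())
        rules[key] = [toy_stem(w) for w in indexed.split()]
    return rules


def lookup(rules: Dict[Tuple[str, ...], List[str]], phrase: str) -> Optional[List[str]]:
    """Normalize the input phrase the same way and look it up as a whole."""
    return rules.get(tuple(toy_stem(w) for w in phrase.split()))


tz_astro = compile_thesaurus(["supernovae stars : sn", "crab nebulae : crab"])
tz_keep = compile_thesaurus(["supernovae stars : sn supernovae stars"])

print(lookup(tz_astro, "supernova star"))  # matches via normalization
print(lookup(tz_keep, "supernova star"))   # original phrase is preserved
```

Because the sample phrase and the query are reduced to the same stems, the singular and plural forms meet in the rule table.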