Tsearch2WhatsNew

Tsearch2 - full text search extension for PostgreSQL

This document describes the new features introduced during PostgreSQL 8.2 development (April 2006). These features and the backport release for PostgreSQL 8.1.X were sponsored by the University of Mannheim. The thesaurus dictionary was funded by Georgia Public Library Service and LibLime, Inc. GIN support was sponsored by jfg:networks (http://www.jfg-networks.com/)

Online version of this document is available from http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch2WhatsNew

New features in the 8.2 release

  • Tsearch2 now has full UTF-8 support (fully rewritten parser, ispell and snowball interfaces)
  • Query rewriting support
  • The extent-based ranking function (rank_cd) now supports complex logical operators in queries and different weights for lexeme classes.
  • Fixed a number of bugs in the ispell code and improved compound word support
  • Indexing speed was greatly improved (up to 20 times; thanks to Tom Lane)
  • Thesaurus support
  • The ispell dictionary now recognizes the MySpell format used by OpenOffice. Download dictionaries from the OO site. OO dictionaries don't support compound words; if you need that (e.g. for Norwegian or German), it's recommended to use the original ispell format (my2ispell script).
  • Tsearch2 now supports GIN indexes. You can create a GIN index instead of a GiST one:
=# create index fts_idx on titles using gin(fts_index);

UTF-8 support

UTF-8 support was tested for the Russian and Greek languages. Please check the UTF-8 Greek test page for more information.

Important steps:

  • Don't forget to initdb the cluster with the correct UTF-8 locale!
initdb -D /usr/local/pgsql-dev/data.el_utf8 --locale=el_GR.utf8
  • Convert the ispell dictionary, affix file, and stop words to UTF-8:
iconv -f iso-8859-7 -t utf-8 ......
  • Set the correct client encoding in psql for testing:
set client_encoding='ISO-8859-7';
  • Download stem_UTF_8_LANG.{c,h} from libstemmer and use Gendict to produce a UTF-8 snowball stemmer
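The iconv conversion step above can be checked with a self-contained round trip before touching the real files (the sample string and /tmp path here are only illustrative):

```shell
# encode a Greek sample string to ISO-8859-7, then convert it back to UTF-8
s='καλημέρα'
printf '%s\n' "$s" | iconv -f utf-8 -t iso-8859-7 > /tmp/greek.iso
iconv -f iso-8859-7 -t utf-8 /tmp/greek.iso
```

If the round trip prints the original string, the same iconv invocation can be applied to the actual dictionary, affix, and stop word files.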

As always, don't forget the configuration. Below is the setup for the Greek language we used for testing:

begin;
delete from pg_ts_cfg where ts_name = 'utf8';
insert into pg_ts_cfg values('utf8','default','el_GR.utf8');

-- dictionaries

DELETE FROM pg_ts_dict WHERE dict_name = 'el_ispell';
INSERT INTO pg_ts_dict
               (SELECT 'el_ispell', dict_init,
                       'DictFile="/tmp/greek.utf8/el_GR_utf8.dict",'
                       'AffFile="/tmp/greek.utf8/el.u8.aff"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'ispell_template');

-- tokens to index
delete from pg_ts_cfgmap where ts_name = 'utf8';
insert into pg_ts_cfgmap values('utf8','nlhword','{el_ispell}');
insert into pg_ts_cfgmap values('utf8','nlword','{el_ispell}');
insert into pg_ts_cfgmap values('utf8','nlpart_hword','{el_ispell}');
end;
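Once the configuration is committed, lexizing can be sanity-checked directly from psql. A minimal check, assuming the setup above (the Greek sample word is illustrative; the exact output depends on the installed dictionary):

select to_tsvector('utf8', 'καλημέρα');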

Extents based ranking for tsearch2

  • An extent is the shortest non-nested sequence of words that satisfies a query.
  • The weight of an extent is inversely proportional to the number of non-query words it contains.
  • A query is a sequence of words and boolean operations, for example 'a & b'.

We don't normalize document weights to 1, since we have no global information about the document collection.

Current interface of rank_cd:

rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,normalization)

The array argument supplies weights for the lexeme classes {D, C, B, A}. normalization is used to select normalization method(s) and is an OR-ed value of the following methods:

  • 0 - no normalization (default)
  • 1 - log (document length)
  • 2 - document length
  • 4 - mean harmonic distance between extents
  • 8 - the number of unique words in document
  • 16 - log (the number of unique words in document)

Example: normalize the document rank by the logarithm of document length and take extent density into account.

rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,1|4)

Examples:

contrib_regression=# select rank_cd('{1, 1, 1, 1}','1:1,20 2:2 3:3 4:4','1&2'::tsquery,0);
 rank_cd 
---------
 1.05556
(1 row)
contrib_regression=# select rank_cd('{1, 1, 1, 1}','1:1,20 2:2 3:3 4:4','1&2'::tsquery,1);
 rank_cd  
----------
 0.589117
(1 row)

Query rewriting

Query rewriting is a set of functions and operators for the tsquery type.

FUNCTIONS:

Query rewriting is flexible. It can be invoked as:

  • rewrite (tsquery, tsquery, tsquery), or
  • rewrite (ARRAY[tsquery, tsquery, tsquery]), or, table driven,
  • rewrite (tsquery, 'select tsquery, tsquery from test'::text)

OPERATORS:

Operators can be used to speed up query rewriting, for example by filtering out non-candidate tuples.

  • tsquery @ tsquery - TRUE if the right argument might be contained in the left argument
  • tsquery ~ tsquery - TRUE if the left argument might be contained in the right argument
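For example, the containment test can be tried directly on tsquery literals (the values are illustrative):

select 'new & york & city'::tsquery @ 'new & york'::tsquery;
-- a true result means the right-hand query might be contained in the left-hand one

Note that these operators are lossy ("might be contained"), which is why they are suitable as a fast pre-filter before the exact rewrite step.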

INDEX SUPPORT:

It's possible to create an index to speed up the @ and ~ operators.

create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);

EXAMPLES: See sql/tsearch2.sql for more examples.

contrib_regression=# \d test_tsquery 
   Table "public.test_tsquery"
   Column   |  Type   | Modifiers 
------------+---------+-----------
 txtkeyword | text    | 
 txtsample  | text    | 
 keyword    | tsquery | 
 sample     | tsquery | 
Indexes:
    "bt_tsq" UNIQUE, btree (keyword)
    "qq" gist (keyword)

Replace 'new & york' with its synonyms:

contrib_regression=# select rewrite('foo & bar & qq & new & york',  'new & york'::tsquery, 'big & apple | nyc | new & york & city');
                                     rewrite                                      
----------------------------------------------------------------------------------
 'qq' & 'foo' & 'bar' & ( 'city' & 'york' & 'new' | ( 'nyc' | 'apple' & 'big' ) )
(1 row)

Example with synonyms taken from a table:

contrib_regression=# select rewrite('moscow & hotel', 'select keyword, sample from test_tsquery'::text );
              rewrite              
-----------------------------------
 ( 'moskva' | 'moscow' ) & 'hotel'
(1 row)

Filtering:

contrib_regression=# select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('default', 'moscow') as query where keyword ~ query;
       rewrite       
---------------------
 'moskva' | 'moscow'
(1 row)