Gentle introduction to the Full Text Search with Tsearch2

  • Modern information systems are all database driven.
  • DB stores objects metadata (structured data and plain text)
  • There is a need in IR (Information Retrieval) style full text search inside database with full conformance to DB principles (ACID).
Talk plan
  • Part I: Tsearch2 fundamentals
  • Part II: Tsearch2 basics
  • Part III: Getting results
  • Part IV: Index support
  • Acknowledgements
  • ToDo

Part I: Tsearch2 fundamentals

1. What is FTS ?

Find documents, which satisfy query and optionally return them in some order. Most common case: Find documents containing all query terms and return them in order of their similarity to the query.

1.1. What is a document ?

Document can be:

  • any text attribute
  • combination of text attributes from one or many tables.

Document must be identified by some unique key.

~,~*, LIKE, ILIKE for FTS

Text search operators exist for years.

  • No linguistics (even in english, regular expressions are not enough {satisfy,satisfies})
  • No ranking
  • Tends to be slow (no index support)
Improve FTS

The idea is rather simple - preprocess document at index time to save time at search stage.


  • document parsing - (token, token type)
  • lingustic - normalize lexeme (depending on token type)
  • storage (sorted list of lexemes with positional information)
Tsearch2 comes

Tsearch2 - is the full text engine for PostgreSQL, fully integrated into database. Main features (new in 8.2 are bolded):

  • It is mature (5 years of development)
  • Supports multiple table driven configurations
  • flexible and rich linguistic support (dictionaries, stop words), thesaurus
  • fully UTF-8 compatible
  • Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd)
  • Index support (GiST and Gin) with concurrency and recovery support
  • Rich query language with query rewriting support

Tsearch2, in a nutshell, provides FTS operator (contains) for two new data types, which represent document and query.

Tsvector, tsquery

We introduced two new data types and operator for FTS

tsvector @@ tsquery 
tsquery  @@ tsvector

@@ operator returns TRUE if tsvector contains tsquery.

=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
  • data type, which represents document, optimized for FTS.
  • Since it's a sorted list of lexemes - search is faster than standard ~,LIKE operators.
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'

Notice, that space is also lexeme !

=# select 'space ''    '' is a lexeme'::tsvector;
 'a' 'is' '    ' 'space' 'lexeme'
  • Lexeme, optionally, could have positional information, which used for proximity ranking.
=# select 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
  • Each position of a lexeme could have label ('A','B','C','D'), where 'D' is default.
    • Labels used for ranking, actual weights assigned at search time.
    • Labels could indicate different importance of lexeme, structure of document, etc.
    • Labels could be used for restricting search region
  • Concatenation operator: tsvector || tsvector, used to "build" document from several parts. Order is important if tsvector contains positional information.
=# select 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector;
 'cat':2 'fat':1,3 'rat':4
=# select 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector;
 'cat':4 'fat':1,3 'rat':2
(1 row)
  • data type for textual queries with support of boolean operators - & (AND), | (OR), parenthesis. Tsquery consists of lexemes (optionally labelled by letter[s]) with boolean operators between.
=# select 'fat & cat'::tsquery;
 'fat' & 'cat'
=# select 'fat:ab & cat'::tsquery;
 'fat':AB & 'cat'

tsqueries could be concatenated

tsquery && (ANDed) tsquery

test=# select 'a&b'::tsquery && 'c|d'::tsquery;
 'a' & 'b' & ( 'c' | 'd' )

tsquery || (ORed) tsquery

test=# select 'a&b'::tsquery || 'c|d'::tsquery;
 'a' & 'b' | ( 'c' | 'd' )
  • Length of lexeme < 2K
  • Length of tsvector (lexemes + positions) < 1Mb
  • The number of lexemes < 4^32
  • 0< Positional information < 16383
  • No more than 256 positions per lexeme
  • The number of nodes ( lexemes + operations) in tsquery < 32768

PostgreSQL 8.1 documentation:

  • total 335420 words
  • 10441 unique words
  • 'postgresql' mentioned 6127 times in 655 documents

PostgreSQL mailing list archive:

  • total 57,491,343 lexemes in 461020 msgs
  • 910989 uniquelexemes

Part II: Tsearch2 basic

Full text indexing - obtaining tsvector

Document - to_tsvector → tsvector

=# select to_tsvector('a fat  cat sat on a mat - it ate a fat rats');
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
  • missed 'a','on','it' - common (stop) words
  • 'rats' → 'rat'
  • punctuation ('-') ignored
to_tsvector magic
  • Document parsed using parser to (token, token_type)
  • Each token depending on their token_type processed by stack of dictionaries. Lexeme indexed if it recognized by dictionaries and it's not a stop-word.
tsvector - functions
  • setweight(tsvector, letter), letter could be on of 'A','B','C','D' (default) - label all lexemes in tsvector, which later (at query time) could be used for structural ranking.
=# select setweight( to_tsvector('story title'),'A') ||
  setweight( to_tsvector('story abstract'),'B') || 
  setweight( to_tsvector('story body'),'D');
 'bodi':6 'titl':2A 'stori':1A,3B,5 'abstract':4B
  • strip(tsvector) - remove positional information from tsvector.
=# select strip( '''bodi'':6 ''titl'':2A ''stori'':1A,3B,5 ''abstract'':4B');
 'bodi' 'titl' 'stori' 'abstract'
  • tsearch2 trigger - automatically update tsvector
    • tsearch2(tsvector,text)
    • tsearch2(tsvector, my_process, text] - preprocess text using function my_process
CREATE FUNCTION dropatsymbol(text) RETURNS text 
AS 'select replace($1, ''@'', '' '');'
tsearch2(tsvector,dropatsymbol, strMessage);
Four tables control FTS

We want to control document → tsvector convertation and be able to do that in various ways. That means, we should be able to specify how to parse document, what lexemes to index and how to process them.

Tsearch2 provides flexible table driven configuration support.

  • pg_ts_cfg - configurations
  • pg_ts_dict - dictionaries
  • pg_ts_parser - document parsers
  • pg_ts_cfgmap - map configurations, dictionaries and tokens
Tsearch2 configurations - pg_ts_cfg

Each configuration has unique name (ts_name), parser (prs_name) and locale name (server locale, see show all), which used to identify default configuration.

=# select * from pg_ts_cfg;
     ts_name     | prs_name |    locale
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 utf8_russian    | default  | ru_RU.UTF-8
 simple          | default  |
Tsearch2 configurations - functions
  • show_curcfg() - returns oid of current confguration
select show_curcfg();

More friendly:

=# select * from pg_ts_cfg where oid=show_curcfg();
     ts_name     | prs_name |    locale
 default_russian | default  | ru_RU.KOI8-R
  • set_curcfg(ts_name) - specify current configuration
  • reset_tsearch() - clear tsearch2 cache

Dictionary is a program, which accepts lexeme(s) on input and returns:

  • array of lexeme(s) if input lexeme is known to the dictionary
  • void array - dictionary knows lexeme, but it's stop word.
  • NULL - dictionary doesn't recognized input lexeme

WARNING: Data files, used by dictionaries, should be in server_encoding to avoid possible problems !

Usually, dictionaries used for normalization of words ('rats'->'rat')

  • compression
  • not bother which word form use in query
Dictionaries - pg_ts_dict

pg_ts_dict is a dictionaries registry. Tsearch2 provides templates for several dictionaries to simplify registering of new dictionaries

=# \d pg_ts_dict
         Table "public.pg_ts_dict"
     Column      |     Type     | Modifiers
 dict_name       | text         | not null
 dict_init       | regprocedure |
 dict_initoption | text         |
 dict_lexize     | regprocedure | not null
 dict_comment    | text         |
    "pg_ts_dict_pkey" PRIMARY KEY, btree (dict_name)
Dictionaries templates


  • simple - returns lowercased lexeme ( recognize everything )
  • ispell - returns normalized lexeme(s) - morphology, compound words support (German, Norwegian, etc)
  • snowball stemmer - returns lexeme stem ( recognize everything )
  • synonym - simple lexeme-to-lexeme replacement
  • thesaurus - phrase-to-phrase replacement
Using dictionaries template

Define new dictionary 'en_ispell', which uses ispell template and has their data in '/usr/local/share/dicts/ispell/' directory:

INSERT INTO pg_ts_dict
               (SELECT 'en_ispell', dict_init,
                FROM pg_ts_dict
                WHERE dict_name = 'ispell_template');
Dictionary - stop words

Ispell and snowball stemmers treat stop words differently:

  • ispell - normalize word and then lookup normalized form in stop-word file
  • snowball stemmer - lookup word in stop-word file and then does it job. The reason - to minimize possible 'noise'.
Dictionary - functions
  • lexize (dict_name, lexeme)
=# select lexize('en_stem','rats');
Where to get dictionaries ?

Dictionaries (synonym, thesaurus, ispell, snowball stemmer), which have predefined templates need only data.

  • Get ispell data from the net. Tsearch2 transparently supports original ispell files and myspell, which comes with OpenOffice.
  • snowball stemmers available from snowball.tartarus.org

Also, you may develop dictionary from scratch. Gendict utility, which comes with tsearch2, generates template file for dictionary program. Gendict has built-in support for snowball stemmers.

Parser - pg_ts_parser
This table specifies parser which used to parse document to lexemes. Parser outputs a lexeme and it's type.
       Table "public.pg_ts_parser"
    Column     |     Type     | Modifiers
 prs_name      | text         | not null
 prs_start     | regprocedure | not null
 prs_nexttoken | regprocedure | not null
 prs_end       | regprocedure | not null
 prs_headline  | regprocedure | not null
 prs_lextype   | regprocedure | not null
 prs_comment   | text         |
    "pg_ts_parser_pkey" PRIMARY KEY, btree (prs_name)

Tsearch2 comes with default parser which recognized 23 following tokens:

=# select * from token_type();
 tokid |    alias     |               descr
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
(23 rows)

Notice, space symbol is also lexem.

parser - functions
  • parse([parser_name], text)
=# select * from parse('<b>Fat</b> cat');
 tokid | token
    13 | <b>
     1 | Fat
    13 | </b>
    12 |
     1 | cat
(5 rows)

Default parser could be defined using function set_curprs(parser_name).


For each lexeme class (token_type) there is a configurable dictionary queue (in pg_ts_cfgmap). Lexeme passes through this queue until it recognized by some dictionary (currently, there is no possibility to recognize lexeme and pass it to the next dictionary). It's tenable to begin from very specific dictionary (topic related, synonym), and finish queue with most common dictionary like 'simple' or 'stemmer', which recognize everything :)

An excerption from pg_ts_cfgmap:

    ts_name     |  tok_alias   |   dict_name
default_russian | lword        | {en_ispell,en_stem}
default_russian | nlword       | {ru_ispell,ru_stem_koi8}

If lexeme type doesn't specified in pg_ts_cfgmap or dict_name is NULL, them lexeme will not indexed. For example, we don't want to index uri' in default_russian configuration: we may delete corresponding row, but better leave it and set dict_name as NULL:

update pg_ts_cfgmap set dict_name=NULL where 
ts_name='default_russian' and tok_alias='uri';
Obtaining tsquery

Tsquery obtained at search time, using the same configuration table as tsvector.

  • to_tsquery
=# select to_tsquery('fat & rats');
 'fat' & 'rat'
  • plainto_tsquery
=# select plainto_tsquery('fat rats');
 'fat' & 'rat'
Query - restricted search
  • It's possible to use labels, stored in tsvector, to limit search region.
  • Most often, labels used to mark different parts of document.
  • Flexibility - Several searches using one tsvector
    • 'supernovae & stars'::tsquery - search everewhere
    • 'supernovae:a & stars'::tsquery - search only titles
    • 'supernovae:ab & stars'::tsquery - search titles and abstracts
Query rewriting

Query rewriting is a set of functions and operators for tsquery type. It allows to control search at query time without reindexing (opposite to thesaurus).

  • expand search using synonyms (new york → big apple, nyc, gotham)
  • help search popular topic: submarine "Kursk" went down August 12, 2000 year, people were searching for 'kursk'. To get right results, instead of 'city Kursk', we directed them to 'submarine kursk'.
Query rewriting - functions

Query rewriting is flexible.

  • rewrite (tsquery, tsquery, tsquery)
  • rewrite (ARRAY[tsquery,tsquery,tsquery])
  • rewrite (tsquery,'select tsquery,tsquery from test'::text) - table driven query rewriting
Query rewriting - examples

Replace new york by their synonyms.

=# select rewrite('foo & bar & qq & new & york',  'new & york'::tsquery, 
'big & apple | nyc | new & york & city');
 'qq' & 'foo' & 'bar' & ( 'city' & 'york' & 'new' | ( 'nyc' | 'apple' & 'big' ) )
(1 row)

Example with synonyms from table:

=# select rewrite('moscow & hotel', 'select keyword, sample from test_tsquery'::text );
 ( 'moskva' | 'moscow' ) & 'hotel'
Query rewriting - operators

Operators could be used to speedup query rewriting, for example, filtering non-candidate tuples when search big lookup table.

  • tsquery @ tsquery - TRUE if right agrument might contained in left argument
  • tsquery ~ tsquery - TRUE if left agrument might contained in right argument
contrib_regression=# select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('default', 'moscow') as query where keyword ~ query;
 'moskva' | 'moscow'
(1 row)
Query rewriting - index support

It's possible to create index to speedup operators @, ~.

create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
Combining everything together

Using standard SQL commands it's possible (in transaction) to create your own configurations. Configuration name (ts_name) could be specified explicitly in functions to_tsvector, to_tsquery, plainto_tsquery, which provides a big flexibility for developing advanced search applications.

Default configuration (specified by server's locale) knows stop-words:

=# select to_tsvector('a fat cat');
 'cat':3 'fat':2

But simple configurations accepts all words without bothering about stop-words:

=# select to_tsvector('simple','a fat cat');
 'a':1 'cat':3 'fat':2

PART III: Getting results

To materialize search results you need to return documents in some order and show them to help user to decide what document he really wants. Ordering is an application specific function and, in principle, it doesn't belongs to the tsearch2 core.


Ranking attempts to measure how documents are relevant to particular query. Concept of relevancy is vague, since it's very application dependent. Tsearch2 provides two ranking functions rank,rank_cd, which use different algorithms, but both attempt to take into account lexical (frequency), proximity and structural information. Optionally, ranking functions could normalize rank, that specifies whether a document's length should impact its rank.

Ranking - functions
  • rank, rank_cd ('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery, normalization)

Here, '{0.1, 0.2, 0.4, 1.0}' - is an array of weights for 'D','C','B','A' zones, which stores in tsvector.

  • get_covers(tsvector, tsquery) - returns extents, which are a shortest and non-nested sequences of words, which satisfy a query. Extents (covers) used in rank_cd algorithm for fast calculation of proximity ranking.
=# select get_covers (to_tsvector('a fat  cat sat on a mat - it ate a fat rats'), to_tsquery('fat & cat'));
 {1 fat {2 cat }1 sat mat ate fat }2 rat
Ranking - normalization

Notice, ranking functions use only local information (for given document), since it's very expensive to get global statistics. That means, we can't do real normalization of rank, for example, 0 < rank < 1, but you can always cheat a bit and use transformation like rank/(rank+1).

Ranking - examples
=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) as rnk 
from apod, to_tsquery('neutrino|(dark & matter)') query 
where query @@ fts order by rnk desc limit 10;
                     title                     |   rnk
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

Now, normalize rank:

=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query)/
(rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) + 1) as rnk from
 apod, to_tsquery('neutrino|(dark & matter)') query where 
query @@ fts order by rnk desc limit 10;
                     title                     |        rnk
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481

Headline is a fragment of document with query terms. They are useful for visual representation of search results. Usually, query terms are bolded.

  • headline([ts_name], document, tsquery, options)
=# select headline('a fat cat sat on a mat and ate a fat rat','fat & cat'::tsquery);
 a <b>fat</b> <b>cat</b> sat on a mat and ate a <b>fat</b> rat
=# select headline('a fat cat sat on a mat and ate a fat rat',
'fat & cat'::tsquery, 'StartSel='', StopSel=''');
 a 'fat' 'cat' sat on a mat and ate a 'fat' rat

There are several options, which define how to bold query terms, min,max length of headline and etc.

Headline - subselect
  • Naive (bad) query
select id,headline(body,q),rank(ti,q) as rank
from apod, to_tsquery('stars') q 
where ti @@ q order by rank desc limit 10;
  • The same, but with subselect:
select id,headline(body,q),rank
from ( select id,body,q, rank(ti,q) as rank from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10) as foo;

2nd query could be order of magnitude faster than 1st one ! This is because of slow headline() function, which processes a document (and so needs to read document from disk), not just an index, so 1st query has to produce as many headlines as the number of search results, while 2nd query needs to read only 10 documents. That makes the difference !

  • ts_debug
=# select * from ts_debug('a fat rats');
     ts_name     | tok_type | description | token | dict_name | tsvector
 default_russian | lword    | Latin word  | a     | {en_stem} |
 default_russian | lword    | Latin word  | fat   | {en_stem} | 'fat'
 default_russian | lword    | Latin word  | rats  | {en_stem} | 'rat'
(3 rows)
  • stat It's usefull to see words statistics, for example, to check how good your dictionaries work or how did you configure pg_ts_cfgmap. Also, you may notice probable stop words relevant for your collection. Tsearch provides stat() function, for example, for PostgreSQL 8.1 documentation we have top 10 most frequent words:
=# select  *  from stat('select fts_index from a_pages') 
order by ndoc desc, nentry desc,word limit 10;
    word    | ndoc | nentry
 postgresql |  655 |   6127
 document   |  655 |   3486
 &nbsp;     |  655 |   2876
 manual     |  655 |   2046
 8.1        |  655 |   2034
 rarr       |  655 |   1965
 text       |  655 |   1424
 home       |  655 |   1315
 copi       |  655 |   1138
 group      |  655 |   1076

Part IV: Indexes

Tsearch2 supports indexed access to tsvector in order to further speedup FTS. Notice, indexes are not mandatory for FTS ! Indices support concurrency and recovery.

  • RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree)
=# create index fts_idx on apod using gist(fts);
  • Gin - Generalized Inverted Index
=# create index fts_idx on apod using gin(fts);
GiST index
  • Document represented as a bit string with '1' in positions to which words are hashed.
  • These bit strings are stored in RD-tree, invented by Hellerstein, where parent is 'OR'-ed bit-strings of all children.
  • This index is lossy (because of hashing), so we need to check results to eliminate false drops. Checking for false drops could be very expensive, because we should read tuple from disk.
  • good for online indexing
  • support multicolumn indices
  • not well scaled with the number of distinct words and the number of documents
Gin - Generalized inverted index

Generalized means that the index does not know which operation it accelerates. It works with custom strategies, defined for specific data types. Gin is similar to the GiST and differs from btree indices, which have predefined, comparison-based operations.

An inverted index is an index structure storing a set of (key, posting list) pairs, where 'posting list' is a set of document id in which the key occurs.

  • weak dependence of the size of vocabulary, good scalability
  • fast bulk indexing
  • slow update
  • used maintenance_work_mem
Indexes - practice

Use table inheritance with CE to have fast online FTS, based on GiST index and archive tables (mostly read-only) with Gin index.


  • ABC Startsiden - compound words support
  • PostGIS community - GiST Concurrency and Recovery
  • JFG Networks - Gin
  • University of Mannheim - UTF-8
  • Georgia Public Library Service and LibLime, Inc. - Thesaurus support
  • Russian Foundation for Basic Research


  • phrase search
  • exact search
  • wildcard search
  • better tsquery parsing error message
  • configurable number of zones in tsvector
  • built-in support pg_trgm ( It can be used with GIN )
  • configurable length of signature in GiST index
  • control of lexemes weights by parser and dictionaries (html-parser)
  • built-in support of more languages
  • Better interaction of GiST with optimizer and planner, developing cost functions for several popular extensions (intarray, ltree, tsearch2,,,)
  • Extend GiST interface to support SP-GIST
  • Increase the number of strategies. Currently – only one (full match) entries B-tree: <,<=,>,>=,prefix
  • Extend Gin to support new data types.
    • replace entries B-tree by GiST similar tree (requres support of unique values in GiST).
    • This further increase the number of possible strategies.
  • Optimize insert operations (background index insertion)
  • Extend pgsql's interface to use GIN for ranking:
    • store position information in GIN
    • propagate ranking from index to final sort of tuples
  • Better interaction of GIN with optimizer and planner, developing cost functions (tsearch2, built-in support for array,,)