Full-Text Search in PostgreSQL: A Gentle Introduction
To implement a full-text search engine we need functions to obtain a tsvector from a document and a tsquery from a user's query. We also need to return results in some order, i.e., we need a function that compares documents with respect to their relevance to the tsquery. FTS in PostgreSQL provides all of these functions, introduced in this section.
FTS in PostgreSQL provides the function to_tsvector, which transforms a document to the tsvector data type. More details are available in Section 2.2, but for now we consider a simple example.
=# select to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                     to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain a, on, it, the word rats became rat, and the punctuation sign - was ignored.
The to_tsvector function internally calls a parser function which breaks the document (a fat cat sat on a mat - it ate a fat rats) into words with corresponding types. The default parser recognizes 23 types; see Section 2.4 for details. Each word, depending on its type, passes through a stack of dictionaries (Section 1.3.5). At the end of this step we obtain what we call a lexeme. For example, rats became rat because one of the dictionaries recognized that the word rats is the plural form of rat.
Some words are treated as "stop words" (Section 1.3.6) and ignored, since they are too frequent and have no informational value. In our example these are a, on, it. The punctuation sign - was also ignored, because its type (Space symbols) is forbidden for indexing. All information about the parser, dictionaries, and which types of lexemes to index is contained in the full-text configuration (Section 2.9). It is possible to have many configurations, and many predefined system configurations are available for different languages. In our example we used the default configuration english for the English language.
To make things clear, below is output from the ts_debug function (Section 2.10), which shows all details of the FTS machinery.
=# select * from ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
 Alias |  Description  | Token |      Dicts list      |       Lexized token
-------+---------------+-------+----------------------+---------------------------
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | fat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {fat}
 blank | Space symbols |       |                      |
 lword | Latin word    | cat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {cat}
 blank | Space symbols |       |                      |
 lword | Latin word    | sat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {sat}
 blank | Space symbols |       |                      |
 lword | Latin word    | on    | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | mat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {mat}
 blank | Space symbols |       |                      |
 blank | Space symbols | -     |                      |
 lword | Latin word    | it    | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | ate   | {pg_catalog.en_stem} | pg_catalog.en_stem: {ate}
 blank | Space symbols |       |                      |
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | fat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {fat}
 blank | Space symbols |       |                      |
 lword | Latin word    | rats  | {pg_catalog.en_stem} | pg_catalog.en_stem: {rat}
(24 rows)
The setweight() function is used to label a tsvector. Typical usage is to mark out the different parts of a document (say, by importance). Later, these labels can be used for ranking search results in addition to the positional information (distance between query terms). If no ranking is required, positional information can be removed from a tsvector using the strip() function to save some space.
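For example, stripping the tsvector from the earlier example removes all positional information (a sketch using the default english configuration; the exact display format may vary between versions):

```sql
=# select strip(to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'));
                strip
-------------------------------------
 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
```

Note that a stripped tsvector can still be matched with the @@ operator, but it can no longer contribute proximity information to ranking.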
Since to_tsvector(NULL) produces NULL, it is recommended to use coalesce to avoid unexpected results. Here is a safe method of obtaining the tsvector of a structured document.
test=# update tt set ti=
           setweight( to_tsvector(coalesce(title,'')), 'A' )    || ' ' ||
           setweight( to_tsvector(coalesce(keyword,'')), 'B' )  || ' ' ||
           setweight( to_tsvector(coalesce(abstract,'')), 'C' ) || ' ' ||
           setweight( to_tsvector(coalesce(body,'')), 'D' );
FTS provides two functions for obtaining a tsquery: to_tsquery and plainto_tsquery (Section 2.3.2).
=# select to_tsquery('english', 'fat & rats');
  to_tsquery
---------------
 'fat' & 'rat'

=# select plainto_tsquery('english', 'fat rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
The tsquery data type is obtained at search time, in the same way as the tsvector (Section 1.3.1).
There is a powerful technique for rewriting a query online, called Query Rewriting (Section 2.3.1). It allows searches to be managed according to application semantics. Typical usage is synonym expansion, or steering a query in a desired direction. A nice feature of Query Rewriting is that it doesn't require reindexing, in contrast to using a thesaurus dictionary (Section 2.8.5). Also, Query Rewriting is table-driven, so it can be configured online.
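As a sketch of synonym expansion, assuming the rewrite(query, target, substitute) form described in Section 2.3.1 and a hypothetical abbreviation sn for supernovae, a query mentioning sn could be expanded on the fly:

```sql
=# select rewrite(to_tsquery('sn & star'),       -- the user's query
                  to_tsquery('sn'),              -- the subquery to replace
                  to_tsquery('sn | supernovae'));-- its replacement
```

Every occurrence of 'sn' in the query is replaced by the disjunction 'sn' | 'supernova', so both spellings match, without touching the index.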
Ranking of search results is a de-facto standard feature of all search engines, and PostgreSQL FTS provides two predefined ranking functions, which attempt to produce a measure of how relevant a document is to the query. Although the concept of relevancy is vague and very application-specific, these functions try to take into account lexical, proximity, and structural information. A detailed description is available in Section 2.5. Different applications may require additional information for ranking, for example, the document modification time.
The lexical part of ranking reflects how often the query terms appear in the document, proximity reflects how close together the query terms are in the document, and structural reflects in what part of the document they occur.
Since a longer document has a bigger chance of containing a query, it is reasonable to take document size into account. FTS provides several options for that.
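One sketch of such an option, assuming rank_cd accepts an optional integer normalization argument as described in Section 2.5 (here the hypothetical value 2, dividing the rank by the document length):

```sql
=# select title, rank_cd(fts, query, 2) as rnk   -- 2: normalize by document length
   from apod, to_tsquery('neutrino') query
   where query @@ fts order by rnk desc limit 5;
```

With this normalization, a short document that mentions neutrino is no longer automatically outranked by a long one that happens to mention it more often.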
It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1 or 100%, as is sometimes required. However, a simple technique like rank/(rank+1) can be applied. Of course, this is just a cosmetic change, i.e., the ordering of the search results will not change.
Several examples are shown below. Notice that the second example uses the normalized rank.
=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) as rnk
   from apod, to_tsquery('neutrino|(dark & matter)') query
   where query @@ fts order by rnk desc limit 10;
                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query)/
          (rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) + 1) as rnk
   from apod, to_tsquery('neutrino|(dark & matter)') query
   where query @@ fts order by rnk desc limit 10;
                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
The first argument of rank_cd ('{0.1, 0.2, 0.4, 1.0}') is an optional parameter which specifies the actual weights for the labels D, C, B, A used in the setweight function. These default values show that lexemes labeled A are 10 times more important than ones labeled D.
Ranking can be expensive, since it requires consulting the tsvector of every matching document, which is I/O bound and slow. Unfortunately, this is almost impossible to avoid, since FTS in databases should work without an index; moreover, an index can be lossy (a GiST index, for example), so documents must be checked to avoid false hits. External search engines don't suffer from this, because the ranking information is usually contained in the index itself and there is no need to read the documents.
To present search results it is desirable to show the part(s) of documents which somehow identify their context and how they relate to the query. Usually, search engines show fragments of documents with the search terms marked. FTS provides the headline() function for this (see details in Section 2.6). It uses the original document, not the tsvector, so it is rather slow and should be used with care. A typical mistake is to call headline() for all found documents, while usually one needs only 10 or so to show. SQL subselects help here; below is an example.
SELECT id, headline(body,q), rank
FROM (SELECT id, body, q, rank_cd(ti,q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC
      LIMIT 10) AS foo;
A dictionary is a program which accepts lexeme(s) on input and returns:
- an array of lexeme(s) if the input lexeme is known to the dictionary
- a void array if the dictionary knows the lexeme, but it is a stop word
- NULL if the dictionary does not recognize the input lexeme
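The stop-word case can be checked directly with lexize: as the ts_debug output earlier showed, en_stem treats a as a stop word and returns a void array:

```sql
=# select lexize('en_stem', 'a');
 lexize
--------
 {}
(1 row)
```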
WARNING: Data files used by dictionaries should be in server_encoding to avoid possible problems!
Usually, dictionaries are used for normalization of words, allowing the user not to bother about which word form to use in the query. Also, normalization can reduce the size of the tsvector. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:
Linguistic - ispell dictionaries try to reduce an input word to its infinitive form, stemmer dictionaries remove word endings.
URL normalization - the following URLs all point to the same http server resource and could be reduced to one lexeme:
http://www.pgsql.ru/db/mw/index.html
http://www.pgsql.ru/db/mw/
http://www.pgsql.ru/db/../db/mw/index.html
Colour names substituted by their hexadecimal values: red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
Cutting the fractional part to reduce the number of possible distinct numbers, so 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits after the decimal point are kept. See the dictionary for integers (Appendix C) for more details.
FTS provides several predefined dictionaries (Section 2.8), available for many languages, and SQL commands to manipulate them online (Part I). Besides this, it is possible to develop custom dictionaries using the API; see the dictionary for integers (Appendix C) for an example.
The CREATE FULLTEXT MAPPING command ( CREATE FULLTEXT MAPPING ) binds a specific type of lexeme to a set of dictionaries that process it. A lexeme passes through the stack of dictionaries until some dictionary identifies it as a known word or finds that it is a stop word. If no dictionary recognizes the lexeme, it is discarded and not indexed. A general rule for configuring a stack of dictionaries is to place the narrowest, most specific dictionary first, then a more general dictionary, and finish with a very general dictionary, like a snowball stemmer or simple, which recognizes everything. For example, for astronomy-specific search (the astro_en configuration) one could bind lword (Latin word) to a synonym dictionary of astronomical terms, a general English dictionary, and a snowball English stemmer.
=# CREATE FULLTEXT MAPPING ON astro_en FOR lword WITH astrosyn, en_ispell, en_stem;
The lexize function can be used to test a dictionary, for example:
=# select lexize('en_stem', 'stars');
 lexize
--------
 {star}
(1 row)
Also, the ts_debug function (Section 2.10) is very useful.
Stop words are words which are too common, appear in almost every document, and have no discrimination value, so they can be ignored in a full-text index. For example, every English text contains the word a, and it is useless to have it in the index. However, stop words do affect the positions in the tsvector, which in turn affect ranking.
=# select to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
The gaps between positions 1-3 and 3-5 are due to stop words, so ranks calculated for documents with and without stop words are quite different!
=# select rank_cd ('{1,1,1,1}', to_tsvector('english','in the list of stop words'),
                   to_tsquery('list & stop'));
 rank_cd
---------
     0.5

postgres=# select rank_cd ('{1,1,1,1}', to_tsvector('english','list stop words'),
                           to_tsquery('list & stop'));
 rank_cd
---------
       1
It is up to the specific dictionary how to treat stop words. For example, ispell dictionaries first normalize the word and then look it up in the list of stop words, while stemmers first look up the input word in the stop word list. The reason for this different behaviour is an attempt to decrease possible noise.