Tsearch V2 Notes

Miscellaneous tsearch v2 notes by the authors

Tsearch2 Tips

tsquery concatenations

tsquery AND tsquery

test=# select 'a&b'::tsquery && 'c|d'::tsquery;
         ?column?          
---------------------------
 'a' & 'b' & ( 'c' | 'd' )

tsquery OR tsquery

test=# select 'a&b'::tsquery || 'c|d'::tsquery;
         ?column?          
---------------------------
 'a' & 'b' | ( 'c' | 'd' )
Custom tsearch2 installation
Sometimes it's necessary to install a custom tsearch2. For example, one could decrease the signature length in gistidx.h to decrease index size and, consequently, improve performance. Unfortunately, we currently cannot pass a parameter (signature size) to the CREATE INDEX command, so we cannot mix several signature sizes in one database, but we can install the modified tsearch2 into a different location and use it in a separate database.
sed -e 's|$libdir|/path/to/new/tsearch2/installation/|g' < tsearch2.sql|psql your_DB
Show current configuration
select show_curcfg();
  show_curcfg
 -------------
     2407575
(1 row)

The human-readable configuration (instead of an OID) can be obtained using:

select * from pg_ts_cfg where oid=show_curcfg();
     ts_name     | prs_name |    locale    
-----------------+----------+--------------
 default_russian | default  | ru_RU.KOI8-R
(1 row)
could not find tsearch config by locale
This is a famous error message that novice users are often confused by. The error arises when the server's locale does not match any pg_ts_cfg.locale and the tsearch2 configuration is specified neither explicitly (as a function argument) nor implicitly (as the current configuration set by set_curcfg()). The default configuration of tsearch2 looks like:
# select * from pg_ts_cfg;
     ts_name     | prs_name |    locale    
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  | 
(3 rows)

The server's locale can be found using the SHOW ALL command (see the values of lc_collate and lc_ctype).
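
For example (the value shown here is just illustrative):

# show lc_ctype;
 lc_ctype 
----------
 C
(1 row)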

The current configuration (valid for the current session only) used by tsearch2 functions can be specified using set_curcfg(pg_ts_cfg.ts_name):

# select set_curcfg('default');
 set_curcfg 
------------
 
(1 row)

Also, you may explicitly specify the configuration name:

test=# select to_tsquery('default','busy&men');
   to_tsquery   
----------------
 'busi' & 'men'
(1 row)
test=# select to_tsquery('simple','busy&men');
   to_tsquery   
----------------
 'busy' & 'men'
(1 row)
Indexing structured documents
It's possible to weight different parts of a document by specifying a part designator (D, C, B, A) in the setweight function when indexing the document. Suppose we have a table tt with four text fields title, keyword, abstract, body and ti, which is of tsvector type.
test=# update tt set ti=
test-# setweight( to_tsvector(title), 'A' )    ||
test-# setweight( to_tsvector(keyword), 'B' )  ||
test-# setweight( to_tsvector(abstract), 'C' ) ||
test-# setweight( to_tsvector(body), 'D' );

Later, you may specify in the ranking function the actual weights to be used to sort search results. The default weights are {0.1, 0.2, 0.4, 1.0} (for D, C, B, A respectively), so title is the most important part (A) of the document.

Notice:

  • weights must be positive and not greater than 1
  • part D is used by default.

You may explicitly specify weights when calling ranking function:

test=# select title,rank('{0.1,0.2,0.4,1.0}',ti,
test-# to_tsquery('supernovae & stars')) as rank from apod
test-# where ti @@ to_tsquery('supernovae & stars') order by rank desc;
Document length normalization
It's possible to normalize the rank of a document by document length (the total number of indexed words). To do so, you need to specify the last argument of the ranking function - rank(tsvector, query, len_norm), where len_norm is:
  • 0 (default) - no normalization
  • 1 - normalized by log (document length)
  • 2 - normalized by document length itself

If you want something else, you can use length(tsvector) as the document length, as in the sketch below.
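
A minimal sketch (using the apod table that appears elsewhere in these notes): the first query uses the built-in logarithmic normalization (len_norm = 1), the second rolls its own normalization by dividing the rank by the square root of the document length:

apod=# select title, rank(ti, q, 1) as rank
apod-# from apod, to_tsquery('stars') q
apod-# where ti @@ q order by rank desc limit 10;

apod=# select title, rank(ti, q)/sqrt(length(ti)) as rank
apod-# from apod, to_tsquery('stars') q
apod-# where ti @@ q order by rank desc limit 10;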

Restricted search
You may specify a part identifier (A, B, C, D) for each word in a query to restrict the search to the corresponding part of the document. For example, I created the tsvector ti as a combination of title and body, and I assigned title to be more important than body.
apod=# update apod set ti=setweight( to_tsvector(title), 'A' ) || to_tsvector(body);

apod=# \d apod
      Table "public.apod"
 Column |   Type   | Modifiers 
--------+----------+-----------
 id     | integer  | 
 title  | text     | 
 body   | text     | 
 ti     | tsvector | 
Indexes: tiidx gist (ti)

To search only in titles I used the select below:

apod=# select title from apod where ti @@ to_tsquery('supernovae:a & stars');

It's possible to specify several parts of the document to be searched. Suppose we have constructed a tsvector from 4 parts - a, b, c, d. To search in parts 'a' and 'b' use:

to_tsquery('supernovae:ab & stars');
Use subselects
1st query:
select id,headline(body,q),rank(ti,q) as rank
from apod, to_tsquery('stars') q 
where ti @@ q order by rank desc limit 10;

2nd query with subselect:

select id,headline(body,q),rank
from ( select id,body,q, rank(ti,q) as rank from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10) as foo;

The 2nd query can be an order of magnitude faster than the 1st one! This is because of the slow headline() function, which processes a document (and so needs to read the document from disk), not just an index, so the 1st query has to produce as many headlines as there are search results, while the 2nd query needs to read only 10 documents. That makes the difference!

Keep the number of unique words small
  • carefully choose the types of lexemes you want to be indexed and, correspondingly, searched. Here are examples of useful selects:
    • Specify current configuration:
     apod=#  select set_curcfg('default');
    • Show current configuration:
     apod=# select * from pg_ts_cfg where oid=show_curcfg();
     ts_name | prs_name | locale 
     ---------+----------+--------
     default | default  | C
    • Test your configuration: The ts_debug function allows easy testing of your current configuration. You may always test another configuration using the set_curcfg function.
apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |  tsvector    
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} | 
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'

Here:

      • ts_name - configuration name
      • tok_type - token type
      • description - human-readable name of tok_type
      • token - parser's token
      • dict_name - the dictionary that will be used for the token
      • tsvector - the final result
  • use dictionaries

General rule: For each lexeme type it's possible to specify which dictionaries, and in what order, will be used to process the lexeme. This information is stored in the table pg_ts_cfgmap. For example:

 ......skip.........................................
 default_russian | lword        | {astrosyn,en_stem}
 default_russian | lpart_hword  | {astrosyn,en_stem}
 default_russian | lhword       | {astrosyn,en_stem}

Note that once a lexeme gets recognized by one dictionary, it will not be passed to the next dictionary! So it's meaningless to specify {simple, en_stem} or {en_stem, astrosyn}, because 'simple' and 'en_stem' recognize any lexeme by definition. You can check what a given dictionary returns for a lexeme, as shown below.
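
A quick way to see what a dictionary returns for a lexeme is the lexize() function; the output below is what one would expect from the standard en_stem dictionary:

apod=# select lexize('en_stem','stars');
 lexize
--------
 {star}
(1 row)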

    • create custom dictionaries specific for topic of documents collection
    • use synonym dictionaries

A synonym dictionary is a file containing 'word synonym' pairs. For example, I want to change the words 'supernovae', 'supernova', 'sne' to 'sn'. I created a file 'astrosyn':

supernovae sn
supernova  sn
sne        sn

Then I registered the synonym dictionary using the existing template:

apod=# insert into pg_ts_dict
apod-# (select 'astrosyn', dict_init, 'path', dict_lexize, 'AstroSyn' from pg_ts_dict where dict_name='synonym');

and specified the dictionaries to be used for English words (I use the 'default_russian' configuration):

apod=# update pg_ts_cfgmap set dict_name='{astrosyn, en_stem}' where
apod-# ts_name='default_russian' and tok_alias in
apod-# ('lword', 'lpart_hword', 'lhword');

English words will first be passed to the 'astrosyn' dictionary and then, only if not recognized, passed to the 'en_stem' dictionary.

That's all:

apod=# select to_tsvector('default_russian','supernovae stars');
   to_tsvector   
-----------------
 'sn':1 'star':2
(1 row)
  • use stop words
  • if you don't want relevance ranking, use the strip function (see the sketch below).
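
A minimal sketch of strip(), which removes position (and weight) information from a tsvector and thus makes it smaller; the words and the output shown are illustrative:

apod=# select strip(to_tsvector('default','busy busy men'));
    strip     
--------------
 'busi' 'men'
(1 row)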
Check word statistics
It's useful to see tsvector statistics, for example, to check how well your dictionaries work or how you configured pg_ts_cfgmap. Also, you may find stop-word candidates relevant for your collection. Tsearch provides the stat() function:
apod=# \d apod
      Table "public.apod"
 Column |   Type   | Modifiers
--------+----------+-----------
 id     | integer  |
 title  | text     |
 body   | text     |
 ti     | tsvector |
Indexes: tiidx gist (ti)
apod=# select count(*) from apod;
 count
-------
  4227
(1 row)

Top 10 most frequent words:

apod=# select * from stat('select ti from apod') order by ndoc desc, nentry
apod-# desc, word limit 10;
  word  | ndoc | nentry
--------+------+--------
 year   | 1283 |   2064
 star   | 1273 |   3908
 imag   | 1267 |   1967
 light  | 1232 |   2206
 pictur | 1177 |   1442
 earth  | 1059 |   1798
 visibl |  992 |   1259
 bright |  936 |   1335
 котор  |  903 |   1441
 эт     |  879 |   1397
(10 rows)

The word 'year' appears 2064 times in 1283 documents. This select is quite slow, so you may want to save the result into a separate table using SELECT INTO for further analysis.
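
For example (the target table name word_stat is just an illustration):

apod=# select * into word_stat from stat('select ti from apod');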

It's possible to obtain word statistics on specific parts of documents by passing a second parameter to the 'stat' function. The above example, but for words in titles:

apod=# select * from stat('select ti from apod','a') order by ndoc desc, nentry desc,word limit 10;

To get stats on several parts, use concatenation ('ad'), as in the example below.
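
For example, the previous query over both the 'a' and 'd' parts:

apod=# select * from stat('select ti from apod','ad') order by ndoc desc, nentry desc, word limit 10;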

Token types
I've been asked how to find out which token types are supported by the parser. There is a function token_type(parser), so you can just use:
select * from token_type();

 tokid |    alias     |               descr               
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | http         | HTTP head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
(23 rows)
How tsearch2 dictionaries work with stop words
Below is an excerpt from my message to Markus Wollny and Andrew J. Kopciuch.
Hi guys,

I'm CCing Andrew, because he is working on improving the tsearch2
documentation and I think he would be interested in some details.

Markus, I agree the docs on stop words are a bit vague, but you could look at the default
configuration (see tsearch2.sql) - it already specifies stop-word lists
for the English and Russian languages.

insert into pg_ts_dict select
        'en_stem',
        'snb_en_init(text)',
        '/usr/local/pgsql/share/contrib/english.stop',
        'snb_lexize(internal,internal,int4)',
        'English Stemmer. Snowball.'
;

Andrew, could you please mention this in the docs.

Now, I'm going to describe how dictionaries treat stop words - they do it
in *different* ways and this is a source of confusion:

   ispell -   normalizes the word and then looks up the normalized form in the stop-word file,
   stemmer -  looks up the word in the stop-word file and then does its job

I think you have enough information to decide what you have to do :)
Andrew, this should be described in detail with examples:
 a) only stemmer
 b) only ispell
 c) ispell,stemmer

In principle, you have to use different stop-word files for different dictionaries. Below is a simple shell script which shows how to get stop words for the English stemmer from ispell's stop-word file. Hint: the database SOMEDB should have the en_stem_no_stopwords dictionary installed (the insert below creates it).

insert into pg_ts_dict
              (SELECT 'en_stem_no_stopwords', dict_init,
                       '',
                       dict_lexize,
                       'english stemmer without stop words'
                FROM pg_ts_dict
                WHERE dict_name = 'en_stem');


cat stop-words-file-for-ispell |\
awk " { print \"select lexize('en_stem_no_stopwords','\"\$1\"');\" }" |\
 psql SOMEDB -P 'tuples_only'|\
 grep -v '^$' |\
 tr -d ' {}' > stop-words-file-for-stemmer
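
Once you have the stemmer's stop-word file, one plausible way to use it (a sketch: the path is a placeholder, and it assumes the stop-word file path lives in the dict_initoption column of pg_ts_dict, as in the en_stem entry shown above) is to point the dictionary at it:

update pg_ts_dict set dict_initoption='/path/to/stop-words-file-for-stemmer'
 where dict_name='en_stem';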

Do you need to index them?

From my posting to -hackers list:

Hmm, the default configuration is too eager; you index every lexeme using the simple dictionary! Probably it's too much. Here is what I have for my Russian configuration in my dictionary database:

 default_russian | lword        | {en_ispell,en_stem}
 default_russian | lpart_hword  | {en_ispell,en_stem}
 default_russian | lhword       | {en_ispell,en_stem}
 default_russian | nlword       | {ru_ispell,ru_stem}
 default_russian | nlpart_hword | {ru_ispell,ru_stem}
 default_russian | nlhword      | {ru_ispell,ru_stem}

Notice that I index only Russian and English words, no numbers, URLs, etc. You may just delete unwanted rows in pg_ts_cfgmap for your configuration, but I'd recommend updating them instead, setting dict_name to NULL. For example, to avoid indexing integers:

voc=# update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian'
and tok_alias='int';

voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have +7000 bucks');
 token  |      dict_name      | tok_type | tsvector 
--------+---------------------+----------+----------
 Do     | {en_ispell,en_stem} | lword    | 
 you    | {en_ispell,en_stem} | lword    | 
 have   | {en_ispell,en_stem} | lword    | 
 +7000  |                     | int      | 
 bucks  | {en_ispell,en_stem} | lword    | 'buck'

Only 'bucks' gets indexed :)

Scaling up tsearch2

Since the built-in ranking functions don't use global information, it's very easy to scale full-text search by distributing the database over several computers, using dblink to execute selects, unioning the search results and ordering them by rank.

=# select dblink_connect('apod','dbname=apod');
=# select * from (
select * from dblink('apod', 'select title, rank_cd(fts,q) as rank from apod, to_tsquery(''crab'') q
where fts @@ q order by rank desc limit 10' ) as t1(title text, rank real)
 union all
select title, rank_cd(fts,q) as rank from apod,
to_tsquery('crab') q
where fts @@ q order by rank desc limit 10
) as foo
order by rank desc limit 10;
Debugging tsearch2 index
If you really want to understand the reason why your search is slow, you need to know tsearch2 internals. Also, you need tsearch2 from 8.02 and the gevel module.
=# select * from gist_print('a_gist_key') as t(level int, a gtsvector) order by level asc;
 level |                a                
-------+---------------------------------
     1 | 2016 true bits, 0 false bits
     1 | 2016 true bits, 0 false bits
     1 | 2016 true bits, 0 false bits
     2 | 2009 true bits, 7 false bits
     2 | 2004 true bits, 12 false bits
     2 | 2010 true bits, 6 false bits
     2 | 2003 true bits, 13 false bits
     2 | 2008 true bits, 8 false bits
     2 | 2003 true bits, 13 false bits
     2 | 2005 true bits, 11 false bits
     2 | 2005 true bits, 11 false bits
     ...................................
     4 | 23 unique words
     4 | 41 unique words
     4 | 1092 true bits, 924 false bits
     4 | 1003 true bits, 1013 false bits
     4 | 982 true bits, 1034 false bits
     4 | 57 unique words
     4 | 62 unique words
     4 | 537 true bits, 1479 false bits

This is an example of a bad tree: there are many very long documents (books), so many signatures are degenerate and not useful. The root's signatures are degenerate and the signatures on the 2nd level are also not very good; the 4th level is the leaf level. A document is represented as a bit signature, but sometimes we use a space optimization and store the words themselves, like '23 unique words'.

Phrase search
This tip is by Mike Rylander.
To do phrase searching just add an additional WHERE clause to your query:

SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
AND text_col ~* '.*history\\s+lesson.*';

The full-text index will still be used, and the regex will be used to
prune the results afterwards.

Notes about dictionaries

A short note about using dictionaries with tsearch2.

The purpose of dictionaries in search engines is to help people search for words without bothering about different forms (declension, inflection, ...). Dictionaries can be used to process the query as well as during indexing. You may store the original form of a word and/or its stem. The most complete index stores both variants and can provide exact search, but at the cost of index size. Historically, since tsearch2 was based on GiST storage, which is quite sensitive to the number of unique words, we store only stems. This might change in the future, since we can now use an inverted index with tsearch2.

An ispell dictionary is an (open-source) way to find a word's stem(s); their quality varies a lot between languages. We use the Russian ispell dictionary and found it rather useful. Of course, since a real language is much more complex than ispell rules, there are errors, which produce "noise" in search results. An ispell dictionary can return several normal forms for one word; for example, 'booking' has two normal forms - 'booking' and 'book'.
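
You can check this with lexize(), assuming an ispell dictionary named 'en_ispell' is installed (as in the configuration shown earlier in these notes); the output below is what one would expect:

=# select lexize('en_ispell','booking');
     lexize
----------------
 {booking,book}
(1 row)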

Ispell dictionaries support many ways of word building, but they are difficult to build and support. That's why various stemming algorithms became popular; read http://snowball.tartarus.org/texts/introduction.html for a good introduction. We chose the snowball stemmer since it's open-source and written/supported by the well-known Martin Porter.

For each lexeme class there is a configurable dictionary queue (in pg_ts_cfgmap). A lexeme passes through this queue until it is recognized by some dictionary (currently, there is no possibility to recognize a lexeme and still pass it to the next dictionary). It's reasonable to begin with very specific dictionaries (topic related, synonym), and finish the queue with the most common dictionary, like 'simple' or a stemmer, which recognizes everything :)

The specific configuration depends very much on the language, the availability of good dictionaries and the goals of the search engine. The snowball stemmer works well for the English language, since word formation is mostly suffix-oriented (I might be wrong here!), so lacking a good ispell dictionary, one could use just the snowball stemmer. On the other hand, for the Russian language we have a good ispell dictionary, which is actively developed and supported, and Russian word building is quite complex, so we definitely recommend using the ispell dictionary before the snowball stemmer.

It's quite difficult to index a mix of several languages which share common characters, since there is no possibility to recognize the language. I'd definitely warn you against using a stemmer anywhere except at the very end of the queue, since it recognizes everything and no dictionaries after it will be utilized. Hopefully, any useful text should have only one main language. If, for example, the main language is French and the second one is English, I'd use French ispell, English ispell, French stemmer.