Full-Text Search in PostgreSQL: A Gentle Introduction
To implement a full-text search engine we need functions to obtain a tsvector from a document and a tsquery from a user's query. We also need to return results in some order, i.e., we need a function that compares documents with respect to their relevance to the tsquery. FTS in PostgreSQL provides all of these functions, introduced in this section.
FTS in PostgreSQL provides the function to_tsvector, which transforms a document to the tsvector data type. More details are available in Section 2.2, but for now we consider a simple example.
=# select to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                     to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain a, on, it, the word rats became rat, and the punctuation sign - was ignored.
The to_tsvector function internally calls a parser function which breaks the document (a fat cat sat on a mat - it ate a fat rats) into words with corresponding types. The default parser recognizes 23 types; see Section 2.4 for details. Each word, depending on its type, passes through a stack of dictionaries (Section 1.3.5). At the end of this step we obtain what we call a lexeme. For example, rats became rat because one of the dictionaries recognized that the word rats is the plural form of rat.
Some words are treated as "stop words" (Section 1.3.6) and ignored, since they are too frequent and have no informational value. In our example these are a, on, it. The punctuation sign - was also ignored, because its type (Space symbols) is forbidden for indexing. All information about the parser, dictionaries, and which types of lexemes to index is contained in the full-text configuration (Section 2.9). It is possible to have many configurations, and many predefined system configurations are available for different languages. In our example we used the default configuration english for the English language.
To make things clear, below is output from the ts_debug function (Section 2.10), which shows all details of the FTS machinery.
=# select * from ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
 Alias |  Description  | Token |      Dicts list      |       Lexized token
-------+---------------+-------+----------------------+---------------------------
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | fat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {fat}
 blank | Space symbols |       |                      |
 lword | Latin word    | cat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {cat}
 blank | Space symbols |       |                      |
 lword | Latin word    | sat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {sat}
 blank | Space symbols |       |                      |
 lword | Latin word    | on    | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | mat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {mat}
 blank | Space symbols |       |                      |
 blank | Space symbols | -     |                      |
 lword | Latin word    | it    | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | ate   | {pg_catalog.en_stem} | pg_catalog.en_stem: {ate}
 blank | Space symbols |       |                      |
 lword | Latin word    | a     | {pg_catalog.en_stem} | pg_catalog.en_stem: {}
 blank | Space symbols |       |                      |
 lword | Latin word    | fat   | {pg_catalog.en_stem} | pg_catalog.en_stem: {fat}
 blank | Space symbols |       |                      |
 lword | Latin word    | rats  | {pg_catalog.en_stem} | pg_catalog.en_stem: {rat}
(24 rows)
The setweight() function is used to label a tsvector. Typical usage is to mark out the different parts of a document (say, by importance). Later, these labels can be used for ranking search results in addition to the positional information (distance between query terms). If no ranking is required, positional information can be removed from a tsvector using the strip() function to save some space.
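For example, stripping the tsvector from the earlier example removes all positional information (a sketch using the default english configuration; the exact display format may vary between versions):

```sql
=# select strip(to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'));
                strip
-------------------------------------
 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
```

Note that a stripped tsvector can still be matched with the @@ operator, but it can no longer contribute proximity information to ranking.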
Since to_tsvector(NULL) produces NULL, it is recommended to use coalesce to avoid unexpected results. Here is a safe method of obtaining the tsvector of a structured document.
test=# update tt set ti=
           setweight( to_tsvector(coalesce(title,'')), 'A' )    || ' ' ||
           setweight( to_tsvector(coalesce(keyword,'')), 'B' )  || ' ' ||
           setweight( to_tsvector(coalesce(abstract,'')), 'C' ) || ' ' ||
           setweight( to_tsvector(coalesce(body,'')), 'D' );
FTS provides two functions for obtaining a tsquery: to_tsquery and plainto_tsquery (Section 2.3.2).
=# select to_tsquery('english', 'fat & rats');
  to_tsquery
---------------
 'fat' & 'rat'

=# select plainto_tsquery('english', 'fat rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
The tsquery data type is obtained at search time, in the same way as the tsvector (Section 1.3.1).
There is a powerful technique for rewriting a query online, called Query Rewriting (Section 2.3.1). It allows searches to be managed according to application semantics. Typical usage is synonym expansion, or steering a query in a desired direction. A nice feature of Query Rewriting is that it doesn't require reindexing, in contrast to using a thesaurus dictionary (Section 2.8.5). Also, Query Rewriting is table-driven, so it can be configured online.
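As a sketch of synonym expansion, assuming the rewrite(query, target, substitute) form described in Section 2.3.1 and a hypothetical abbreviation sn for supernovae, a query mentioning sn could be expanded on the fly:

```sql
=# select rewrite(to_tsquery('sn & star'),       -- the user's query
                  to_tsquery('sn'),              -- the subquery to replace
                  to_tsquery('sn | supernovae'));-- its replacement
```

Every occurrence of 'sn' in the query is replaced by the disjunction 'sn' | 'supernova', so both spellings match, without touching the index.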
Ranking of search results is a de-facto standard feature of all search engines, and PostgreSQL FTS provides two predefined ranking functions, which attempt to produce a measure of how relevant a document is to the query. Although the concept of relevancy is vague and very application-specific, these functions try to take into account lexical, proximity, and structural information. A detailed description is available in Section 2.5. Different applications may require additional information for ranking, for example, the document modification time.
The lexical part of ranking reflects how often the query terms appear in the document, proximity reflects how close together the query terms are in the document, and structural reflects in what part of the document they occur.
Since a longer document has a bigger chance of containing a query, it is reasonable to take document size into account. FTS provides several options for that.
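One sketch of such an option, assuming rank_cd accepts an optional integer normalization argument as described in Section 2.5 (here the hypothetical value 2, dividing the rank by the document length):

```sql
=# select title, rank_cd(fts, query, 2) as rnk   -- 2: normalize by document length
   from apod, to_tsquery('neutrino') query
   where query @@ fts order by rnk desc limit 5;
```

With this normalization, a short document that mentions neutrino is no longer automatically outranked by a long one that happens to mention it more often.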
It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1 or 100%, as is sometimes required. However, a simple technique like rank/(rank+1) can be applied. Of course, this is just a cosmetic change, i.e., the ordering of the search results will not change.
Several examples are shown below. Notice that the second example uses the normalized rank.
=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) as rnk
   from apod, to_tsquery('neutrino|(dark & matter)') query
   where query @@ fts order by rnk desc limit 10;
                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query)/
          (rank_cd('{0.1, 0.2, 0.4, 1.0}',fts, query) + 1) as rnk
   from apod, to_tsquery('neutrino|(dark & matter)') query
   where query @@ fts order by rnk desc limit 10;
                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
The first argument of rank_cd ('{0.1, 0.2, 0.4, 1.0}') is an optional parameter which specifies the actual weights for the labels D, C, B, A used in the setweight function. These default values show that lexemes labeled A are 10 times more important than ones labeled D.
Ranking can be expensive, since it requires consulting the tsvector of every matching document, which is I/O bound and slow. Unfortunately, this is almost impossible to avoid, since FTS in databases should work without an index; moreover, an index can be lossy (a GiST index, for example), so documents must be checked to avoid false hits. External search engines don't suffer from this, because the ranking information is usually contained in the index itself and there is no need to read the documents.
To present search results it is desirable to show the part(s) of documents which somehow identify their context and how they relate to the query. Usually, search engines show fragments of documents with the search terms marked. FTS provides the headline() function for this (see details in Section 2.6). It uses the original document, not the tsvector, so it is rather slow and should be used with care. A typical mistake is to call headline() for all found documents, while usually one needs only 10 or so to show. SQL subselects help here; below is an example.
SELECT id, headline(body,q), rank
FROM (SELECT id, body, q, rank_cd(ti,q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC
      LIMIT 10) AS foo;
A dictionary is a program which accepts lexeme(s) on input and returns:
- an array of lexeme(s) if the input lexeme is known to the dictionary
- a void array if the dictionary knows the lexeme, but it is a stop word
- NULL if the dictionary does not recognize the input lexeme
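The stop-word case can be checked directly with lexize: as the ts_debug output earlier showed, en_stem treats a as a stop word and returns a void array:

```sql
=# select lexize('en_stem', 'a');
 lexize
--------
 {}
(1 row)
```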
WARNING: Data files used by dictionaries should be in server_encoding to avoid possible problems!
Usually, dictionaries are used for normalization of words, allowing the user not to bother about which word form to use in the query. Also, normalization can reduce the size of the tsvector. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:
Linguistic - ispell dictionaries try to reduce an input word to its infinitive form, stemmer dictionaries remove word endings.
URL normalization - the following URLs all point to the same http server resource and could be reduced to one lexeme:
http://www.pgsql.ru/db/mw/index.html
http://www.pgsql.ru/db/mw/
http://www.pgsql.ru/db/../db/mw/index.html
Colour names substituted by their hexadecimal values: red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
Cutting the fractional part to reduce the number of possible distinct numbers, so 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits after the decimal point are kept. See the dictionary for integers (Appendix C) for more details.
FTS provides several predefined dictionaries (Section 2.8), available for many languages, and SQL commands to manipulate them online (Part I). Besides this, it is possible to develop custom dictionaries using the API; see the dictionary for integers (Appendix C) for an example.
The CREATE FULLTEXT MAPPING command ( CREATE FULLTEXT MAPPING ) binds a specific type of lexeme to a set of dictionaries that process it. A lexeme passes through the stack of dictionaries until some dictionary identifies it as a known word or finds that it is a stop word. If no dictionary recognizes the lexeme, it is discarded and not indexed. A general rule for configuring a stack of dictionaries is to place the narrowest, most specific dictionary first, then a more general dictionary, and finish with a very general dictionary, like a snowball stemmer or simple, which recognizes everything. For example, for astronomy-specific search (the astro_en configuration) one could bind lword (Latin word) to a synonym dictionary of astronomical terms, a general English dictionary, and a snowball English stemmer.
=# CREATE FULLTEXT MAPPING ON astro_en FOR lword WITH astrosyn, en_ispell, en_stem;
The lexize function can be used to test a dictionary, for example:
=# select lexize('en_stem', 'stars');
 lexize
--------
 {star}
(1 row)
Also, the ts_debug function (Section 2.10) is very useful.
Stop words are words which are too common, appear in almost every document, and have no discrimination value, so they can be ignored in a full-text index. For example, every English text contains the word a, and it is useless to have it in the index. However, stop words do affect the positions in the tsvector, which in turn affect ranking.
=# select to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
The gaps between positions 1-3 and 3-5 are due to stop words, so ranks calculated for documents with and without stop words are quite different!
=# select rank_cd ('{1,1,1,1}', to_tsvector('english','in the list of stop words'),
                   to_tsquery('list & stop'));
 rank_cd
---------
     0.5

postgres=# select rank_cd ('{1,1,1,1}', to_tsvector('english','list stop words'),
                           to_tsquery('list & stop'));
 rank_cd
---------
       1
It is up to the specific dictionary how to treat stop words. For example, ispell dictionaries first normalize the word and then look it up in the list of stop words, while stemmers first look up the input word in the stop word list. The reason for this different behaviour is an attempt to decrease possible noise.