2.7. Full-text indexes

There are two kinds of indexes which can be used to speedup FTS operator ( Section 2.1 ). Notice, indexes are not mandatory for FTS !

CREATE INDEX name ON table USING gist(column);

Creates GiST (The Generalized Search Tree) based index.

CREATE INDEX name ON table USING gin(column);

Creates GIN (The Generalized Inverted Index) based index. column is one of the TSVECTOR, or TEXT, or VARCHAR types.

GiST index is lossy, which means its' required to consult heap to check results for false hits. PostgreSQL does this automatically, Filter: in an example below.

=# explain select * from apod where fts @@ to_tsquery('supernovae');
                               QUERY PLAN                                
 Index Scan using fts_gidx on apod  (cost=0.00..12.29 rows=2 width=1469)
   Index Cond: (fts @@ '''supernova'''::tsquery)
   Filter: (fts @@ '''supernova'''::tsquery)

Lossiness is the result of two factors - we represent a document by its fixed-length signature. We obtain signature in the following way - we hash (crc32) each word into random bit in a n-bit length strings and their superposition produces n-bit document signature. Because of hashing, there is a chance, that some words hashed to the same position and it could be resulted in false hit. Signatures, calculated for each document in collection, are stored in RD-tree (Russian Doll tree), invented by Hellerstein, which is an adaptation of the R-tree to sets. In our case transitive containment relation realized with superimposed coding (Knuth,1973) of signatures - parent is 'OR'-ed bit-strings of all children. This is a second factor of lossiness. It's clear, that parents tend to be full of '1' (degenerates) and become quite useless because of it's little selectivity. Searching performs as a bit comparison of a signature represented query and RD-tree entry. If all '1' of both signatures are in the same position we say that this branch probably contains query, but if there is even one discrepancy we could definitely reject this branch. Lossiness causes serious performance degradation, since random accessing of heap records is slow and limits applicability of GiST index. Probability of false drops is depends on several factors and the number of unique words is one of them, so using dictionaries to reduce this number is practically mandatory.

Actually, it's not the whole story. GiST index has optimization for storing small tsvectors (< TOAST_INDEX_TARGET bytes, 512 bytes). On leaf pages small tsvectors stored as is, while longer one are represented by their signatures, which introduce some losiness. Unfortunately, existing index API doesn't allow to say index that it found an exact values (tsvector) or results need to be checked. That's why GiST index currently is marked as lossy. We hope in future to overcome this issue.

Contrary, GIN index isn't lossy and it's performance depends logarithmically on the number of unique words.

There is one side-effect of "non-lossiness" of GIN index and using queries with lexemes and weights, like 'supernovae:a'. Since information about these labels stored in heap only and GIN index is not lossy, there is no necessity to access heap, one should use special FTS operator @@@, which forces using of heap to get information about labels. GiST index is lossy, so it reads heap anyway and there is no need in special operator. In example below, fts_idx is a GIN index.

=# explain select * from apod where fts @@@ to_tsquery('supernovae:a');
                               QUERY PLAN                               
 Index Scan using fts_idx on apod  (cost=0.00..12.30 rows=2 width=1469)
   Index Cond: (fts @@@ '''supernova'':A'::tsquery)
   Filter: (fts @@@ '''supernova'':A'::tsquery)

Experiments lead to the following observations:

Overall, GiST index is very good for online update and fast for collections with the number of unique words about 100,000, but is not as scalable as Gin index, which in turn isn't good for updates. Both indexes support concurrency and recovery.

Partitioning of big collections and proper use of GiST and GIN indexes allow implementation of very fast search with online update. Partitioning can be done on database level using table inheritance and Constraint Exclusion, or distributing documents over servers and collecting search results using contrib/dblink extension module. The latter is possible, because ranking functions use only local information.