GiST - Introduction

Introduction to GiST

Update: We wrote a paper GiST programming tutorial, in Russian.

Hellerstein et al. [HNP95] introduced an index structure, called a Generalized Search Tree (GiST), which is a generalized form of an R-tree [Gut84]. The GiST provides a possibility to create custom data types with indexed access methods and extensible set of queries for specific domain experts not a database one.

GiST was implemented in an early version of PostgreSQL by J. Hellerstein and P.Aoki, more details is available from "The GiST Indexing Project" (http://gist.cs.berkeley.edu/) at Berkeley. As an "university" project it has a limited number of features and was in rare use. Since 7.1 version of PostgreSQL the GiST was taken up by Oleg Bartunov and Teodor Sigaev, see "GiST development" (http://www.sai.msu.su/~postgres/gist) for information about current development.

Current implementation of GiST supports:

Variable length keys
Composite keys (multi-key)
provides NULL-safe interface to GiST core

There are several contrib modules developed using GiST and distributed with PostgreSQL. For example:

rtree_gist, btree_gist - GiST implementation of R-Tree and B-Tree
intarray - index support for one-dimensional array of int4's
tsearch - a searchable (full text) data type with indexed access
ltree - data types, indexed access methods and queries for data organized as a tree-like structures
cube - data type, representing multidimensional cubes

GiST is used in several public projects, like PosGIS project, which adds support for geographic objects to the PostgreSQL, OpenFTS - Open Source Full Text Search engine, which provides online indexing of data and relevance ranking for database searching.

Definitions

GiST is a height-balanced tree structure with tree nodes containing (p, ptr) pairs, where p is a predicate (attribute),which is used as a search key, and ptr is the pointer to data for a leaf node or a pointer to another tree node for a non-leaf node.

GiST has the following properties:

Every node contains between min and max index entries unless it is the root
For each index entry (p,ptr) in a leaf node, p is true when instantiated with the values from the indicated tuple (i.e., p holds for the tuple).
For each index entry (p, ptr) in a non-leaf node, p is true when instantiated with the values of any tuple reachable from ptr.
The root has at least two children unless it is a leaf
All leaves appear on the same level

In principle, the keys of a GiST may be any arbitrary predicates that hold for each datum below the key. In practice, the keys come from a user-defined object class, which provides a set of six methods required by the GiST:

Consistent(E,q):: given an entry E = (p, ptr), and a query predicate q, returns false if p ^ q can be guaranteed unsatisfiable, and true otherwise.
Union(P ):: given a set P of entries (p1,ptr1), ... (pn,ptrn), returns some predicate r that holds for all tuples stored below ptr1 through ptrn.
Compress(E):: given an entry E = (p, ptr), returns an entry (pp, ptr) where pp is a compressed representation of p.
Decompress(E):: given a compressed representation E = (pp, ptr), returns an entry (r, ptr) where r is a decompressed representation of pp.
Penalty(E1,E2):: given two entries E1 = (p1, ptr1), E2 = (p2, ptr2), returns a domain specific penalty for inserting E2 into the subtree rooted at E1, which is used to aid the splitting process of the insertion operation.
PickSplit(P ):: given a set P of M+1 entries (p, ptr), splits P into two sets of entries P1 and P2, each of size kM , where k is the minimum fill factor.

The three methods of GiST provide algorithms for search, insertion and deletion operations:

SEARCH: can be used to search any dataset with any query predicate by traversing as much of the tree as necessary to satisfy the query.
The general search algorithm is controlled by the user-specified Consistent method, which returns 'TRUE' if the predicate is satisfiable and 'FALSE' in vice versa. The same Consistent method applies to both index node and leaf node. But usually the satisfiable conditions are different for the index node and leaf node, no matter if it is an exact matching search or a range/window query. So further checking is needed after leaf entries are fetched according to the algorithm. Another restriction is that only one type of query can be conducted in one program because function Consistent is used in the search instead of a pointer to a function.
INSERT: guarantees that the GiST remains balanced. User defined key method Penalty is used for choosing a subtree to insert; method PickSplit is used for the node splitting algorithm; method Union is used for propagating changes upward to maintain the tree properties.
DELETE: determines a leaf with specified key and removes entry (p, ptr) and, if required, maintains the balance of the tree using method Union (as in INSERT).

Literature:

[Gut84] Antonin Guttman. R-trees: a dynamic index structure for spatial searching. In ACM SIGMOD International Conference on Management of Data, pages 47-54, 1984.
[HNP95] J. M. Hellerstein, J. F. Naughton, and Avi Pfeffer. Generalized search trees for database systems. In Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.