|Full-Text Search in PostgreSQL: A Gentle Introduction|
|Prev||Fast Backward||Chapter 1. FTS Introduction||Fast Forward||Next|
This is a historical notes about full-text search in PostgreSQL by authors of FTS Oleg Bartunov and Teodor Sigaev.
Development of full-text search in PostgreSQL began from OpenFTS in 2000 after realizing that we need a search engine optimized for online updates with access to metadata from the database. This is essential for online news agencies, web portals, digital libraries, etc. Most search engines available at that time utilize an inverted index which is very fast for searching but very slow for online updates. Incremental updates of an inverted index is a complex engineering task while we needed something light, free and with the ability to access metadata from the database. The last requirement was very important because in a real life search application should always consult metadata ( topic, permissions, date range, version, etc.).
We extensively use PostgreSQL as a database backend and have no intention to move from it, so the problem was to find a data structure and a fast way to access it. PostgreSQL has rather unique data type for storing sets (think about words) - arrays, but lacks index access to them. During our research we found a paper of Joseph Hellerstein, who introduced an interesting data structure suitable for sets - RD-tree (Russian Doll tree). Further research lead us to the idea to use GiST for implementing RD-tree, but at that time the GiST code was untouched for a long time and contained several bugs. After work on improving GiST for version 7.0.3 of PostgreSQL was done, we were able to implement RD-Tree and use it for index access to arrays of integers. This implementation was ideally suited for small arrays and eliminated complex joins, but was practically useless for indexing large arrays. The next improvement came from an idea to represent a document by a single bit-signature, a so-called superimposed signature (see "Index Structures for Databases Containing Data Items with Set-valued Attributes", 1997, Sven Helmer for details). We developed the contrib/intarray module and used it for full text indexing.
It was inconvenient to use integer id's instead of words, so we introduced a new data type txtidx - a searchable data type (textual) with indexed access. This was a first step of our work on an implementation of a built-in PostgreSQL full-text search engine. Even though tsearch v1 had many features of a search engine it lacked configuration support and relevance ranking. People were encouraged to use OpenFTS, which provided relevance ranking based on positional information and flexible configuration. OpenFTS v.0.34 was the last version based on tsearch v1.
People recognized tsearch as a powerful tool for full text searching and insisted on adding ranking support, better configurability, etc. We already thought about moving most of the features of OpenFTS to tsearch, and in the early 2003 we decided to work on a new version of tsearch. We abandoned auxiliary index tables,used by OpenFTS to store positional information, and modified the txtidx type to store them internally. We added table-driven configuration, support of ispell dictionaries, snowball stemmers and the ability to specify which types of lexemes to index. Now, it's possible to generate headlines of documents with highlighted search terms. These changes make tsearch user friendly and turn it into a really powerful full text search engine. For consistency, tsearch functions were renamed, txtidx type became tsvector. To allow users of tsearch v1 smooth upgrade, we named the module as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
PostgreSQL version 8.2 contains a major upgrade of tsearch v2 - multibyte and GIN (A Generalized Inverted Index) support. Multibyte support provides full UTF-8 support and GIN scales tsearch v2 to millions of documents. Both indices (GiST and GiN) are concurrent and recoverable. All these improvements bring out FTS to enterprise level.
Since PostgreSQL 8.3 release, there is no need to compile and install contrib/tsearch2 module, it's already installed in your system with PostgreSQL. Most important new features are:
A set of SQL commands, which controls creation, modification and dropping of FTS objects. This allow to keep dependencies and correct dumping and dropping.
Many FTS configurations already predefined for different languages with snowball stemmers are available.
FTS objects now have ownership and namespace support like other postgresql's objects.
Current FTS configuration could be defined using GUC variable tsearch_conf_name.
Default FTS configuration is now schema specific.