|Full-Text Search in PostgreSQL: A Gentle Introduction|
|Prev||Fast Backward||Chapter 1. FTS Introduction||Fast Forward||Next|
Full-Text Search ( FTS ) is a search for the documents, which satisfy query and, optionally, return them in some order. Most usual case is to find documents containing all query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on specific applications. The simplest search machine considers query as a set of words and similarity - as how frequent are query words in the document.
Ordinary full text search engines operate with collection of documents where document is considered as a "bag of words", i.e., there is a minimal knowledge about the document structure and its metadata. Big search machines make use of sophisticated heuristics to get some metadata, such as title, author(s), modification date, but their knowledge is limited by web site owner policy. But, even if you have a full access to the documents, very often, document itself, as it shown to the visitor, depends on many factors, which makes indexing of such dynamical documents practically impossible and actually, search engines fail here ("The Hidden Web" phenomena). Moreover, modern information systems are all database driven and there is a need in IR (Information Retrieval) style full text search inside database with full conformance to the database principles (ACID). That's why, many databases have built-in full text search engines, which allow to combine text searching and additional metadata, stored in various tables and available through powerful and standard SQL language.
Document, in usual meaning, is a text file, that one could open, read and modify. Search machines parse text files and store associations of lexemes (words) with their parent document. Later, these associations used to search documents, which contain query words. In databases, notion of document is much complex, it could be any textual attribute or their combination ( concatenation ), which in turn may be stored in various tables or obtained on-fly. In other words, document looks as it were constructed from different pieces (of various importance) for a moment of indexing and it might be not existed as a whole. For example,
SELECT title || ' ' || author || ' ' || abstract || ' ' || body as document FROM messages WHERE mid = 12; SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body as document FROM messages m, docs d WHERE mid = did and mid = 12;
Document can be ordinary file, stored in filesystem, but accessible through database. In that case, database used as a storage for full text index and executor for searches. Document processed outside of database using external programs. In any cases, it's important, that document must be somehow uniquely identified.
Actually, in previous examples we should use coalesce function to prevent document to be NULL if some of its part is NULL.