Intoductory guide to contrib/tsearch

----------------------------------------------------------
USING TSEARCH AND POSTGRESQL FOR A WEB BASED SEARCH ENGINE
----------------------------------------------------------

This documentation is provided as a short guide on how to quickly get up and 
running with tsearch and PostgreSQL, for those who want to implement a full 
text indexed based search engine. It is not meant to be a complete in-depth 
guide into the full ins and outs of the contrib/tsearch module, and is 
primarily aimed at beginners who want to speed up searching of large text 
fields, or those migrating from other database systems such as MS-SQL.

A much more detailed (and technical) description of the tsearch module can be 
found in the proper README.tsearch distributed with the tsearch module.


--------
OVERVIEW
--------

MS-SQL provides a full text indexing (FTI) system which enables the fast 
searching of text based fields, very useful for websites (and other 
applications) that require a results set based on key words. PostgreSQL ships 
with a contributed module called tsearch, which implements a special type of 
index that can also be used for full text indexing. Further more, unlike MS' 
offering which requires regular incremental rebuilds of the text indexes 
themselves, tsearch indexes are always up-to-date and keeping them so induces 
very little overhead.

Before we get into the details, it is recommended that you have installed and 
tested PostgreSQL, are reasonably familiar with databases, the SQL query 
language and also understand the basics of connecting to PostgreSQL from the 
local shell. This document isn't intended for the complete PostgreSQL newbie, 
but anyone with a reasonable grasp of the basics should be able to follow it.


------------
INSTALLATION
------------

tsearch is very easy to install, just change the current directory to 
PGSQL_SRC/contrib/tsearch and type:

	gmake
	gmake install
	gmake installcheck

That is pretty much all you have to do, unless of course you get errors. 
However if you get those, you better go check with the mailing lists over at 
http://www.postgresql.org since its never failed for me.

Note that in the default installation, some common words such as "the" and 
"who" are not indexed at all. If you wish them to be, or want to add further 
words that should not be indexed, you will need to create a custom dictionary 
and recompile and reinstall tsearch. See the section on changing non-indexed 
words later in this document.


-------------------------------------------
ADDING TSEARCH FUNCTIONAILITY TO A DATABASE
-------------------------------------------

If you thought installation was easy, this next bit is even easier. Change to 
the PGSQL_SRC/contrib/tsearch directory and type:

	psql DATABASE < tsearch.sql

This should populate the database specified with all the functions, types and 
operators required to use txtidx FTI indexes.


--------------------------
INDEXING FIELDS IN A TABLE
--------------------------

The next stage is to add a text index to an existing table. In this example we 
already have a table defined as follows:

	CREATE TABLE tblMessages {
		intIndex	int4,
		strTopic	varchar(100),
		strMessage	text,
	};

The next stage is to create a special text index which we will use for FTI, so 
we can search our table of messages for words or a phrase. We do this using the 
SQL command:

	ALTER TABLE tblMessages ADD idxFTI txtidx;

Note that unlike traditional indexes, this is actually a new field in the same 
table, which is then used (through the magic of the tsearch operators and 
functions) by a special index we will create in a moment.

The next thing to do is to populate the new field with indexed data, which we 
do as follows:

	UPDATE tblTitles SET idxFTI=txt2txtidx(strMessage);

Note that this only indexes the field strMessage, so if you want to also add 
strTopic to the same index, you should instead do the following, which 
effectively concates the two fields into one field before being fed into the 
indexing system:

	UPDATE tblTitles SET idxFTI=txt2txtidx(strTopic || '  ' || strMessage);

Two final things you need to do, are create a special index on your newly 
created field, which you do as follows:

	CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);

And to set up a trigger so every time a row in this table is changed, the text 
index is automatically updated. This is easily done using:

	CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages
 		FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strMessage);

Or if you are indexing both strMessage and strTopic you should instead do:

	CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages
 		FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strTopic, 
strMessage);

Before you ask, the tsearch function accepts multiple fields as arguments so 
there is no need to concate the two into one like we did before.


----------------
QUERYING A TABLE
----------------

So you have your indexed data, how do you query it? Well the README.tsearch has 
some very detailed descriptions of everything you can do, but here are some 
examples to get you going.

If you wanted to search the above example table for all entries which contain 
the word "minority", you could use the query:

	SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 
'minority'

Note that searching this way isn't case sensitive, so using "MiNoRiTy" in place 
of the above will return the same results. A good thing for those wanting to 
use these as search engines on websites!

Now suppose you want to search for more than one word, say "minority" and 
"report", then you would use the following query instead:

	SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 
'minority&report'

Note that spaces are a bad thing in these sorts of queries, so you better get 
your regexps out and filter these away. Also note that the & has special 
meaning here too, so you can't search for anything with "&" in it either.

The last example here relates to searching for a phrase, for example "minority 
report". This poses a problem with regard to tsearch, as it doesn't index 
phrases, only words. But there is a way around which doesn't appear to have a 
significant impact on query time, and that is to use a query such as the 
following:

	SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 
'minority&report' AND strMessage ILIKE '%minority report%'

Of course if your indexing both strTopic and strMessage, and want to search for 
this phrase on both, then you will have to get out the brackets and extend this 
query a little more.


--------------------------
CHANGING NON-INDEXED WORDS
--------------------------

Some words such as "and", "the", and "who" are automatically not indexed, since 
they belong to a pre-existing dictionary of Stop Words which tsearch does not 
perform indexing on. If someone needs to search for "The Who" in your database, 
they are going to have a tough time coming up with any results, since both are 
ignored in the indexes. But there is a solution, and that is to build your own 
dictionary file for tsearch.

This is pretty easy to do, just change to the 
PGSQL_SRC/contrib/tsearch/makedict directory, and create yourself a text file 
with a list of words (one per line) that you wish to be ignored. Example files 
for English and Russian languages are at 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/ if you need somewhere to 
start.

When you have all the files you need, run the command:

	./makedict.pl -l 'C' -s stopwords.txt  > ../dict/mydict.dct

You can also create a list of word endings in another text file, which will be 
stripped from any relevant words before being indexed. For example, if you have 
the word "ending" in your table, it will be indexed as "end" and the "ing" is 
stripped out. If you choose to have word endings included in your dictionary, 
then you can create your new dictionary file using the following command 
instead:
	
	./makedict.pl -l 'C' -e endwords.txt -s stopwords.txt  > 
../dict/mydict.dct

Note that there is no dictionary of words which do appear in the index, only a 
dictionary of words and word endings that will not.

The next stage is to change back to the PGSQL_SRC/contrib/tsearch and edit the 
source header file dict.h, so it contains the new line:

	#include "dict/mydict.dct"

immediately before the default dictionary, which should be commented out as 
shown here:

	//#include "dict/porter_english.dct"
	
The whole dict.h file should now look something like this:

	#define TABLE_DICT_START        ,{
	#define TABLE_DICT_END          }
	
	#include "dict/mydict.dct"
	//#include "dict/porter_english.dct"
	#ifdef USE_LOCALE
	#include "dict/russian_stemming.dct"
	#endif
	
	#undef TABLE_DICT_START
	#undef TABLE_DICT_END

After you've done this, making sure you are in the PGSQL_SRC/contrib/tsearch 
directory, perform:

	gmake clean
	gmake
	gmake install

At this point you will have to restart PostgreSQL, and rebuild any relevant 
indexes. The latter is done merely by using the same UPDATE statement you used 
to populate the index in the first place.


-------------------------------------------------------
BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH
-------------------------------------------------------

Believe it or not, this isn't as straight forward as it should be, and you will 
have problems trying to backup and restore any database which uses tsearch 
unless you take the steps shown below. And before you ask using pg_dumpall will 
result in failure every time. These took a lot of trial and error to get 
working, but the process as laid down below has been used a dozen times now in 
live production environments so it should work fine.

HOWEVER never rely on anyone elses instructions to backup and restore a 
database system, always develop and understand your own methodology, and test 
it numerous times before you need to do it for real.


To Backup a PostgreSQL database that uses the tsearch module:

1) Backup any global database objects such as users and groups (this step is 
usually only necessary when you will be restoring to a virgin system)

	pg_dumpall -g > GLOBALobjects.sql

2) Backup the full database schema using pg_dump

	pg_dump -s DATABASE > DATABASEschema.sql
	
3) Backup the full database using pg_dump

	pg_dump -Fc DATABASE > DATABASEdata.tar


To Restore a PostgreSQL database that uses the tsearch module:

1) Create the blank database

	createdb DATABASE

2) Restore any global database objects such as users and groups (this step is 
usually only necessary when you will be restoring to a virgin system)

	psql DATABASE < GLOBALobjects.sql

3) Create the tsearch objects, functions and operators

	psql DATABASE < tsearch.sql

4) Edit the backed up database schema and delete all SQL commands which create 
tsearch related functions, operators and data types, BUT NOT fields in table 
definitions that specify txtidx types. If your not sure what these are, they 
are the ones listed in tsearch.sql. Then restore the edited schema to the 
database

	psql DATABASE < DATABASEschema.sql

5) Restore the data for the database

	pg_restore -N -a -d DATABASE DATABASEdata.tar

If you get any errors in step 4, it will most likely be because you forgot to 
remove an object that was created in tsearch.sql. Any errors in step 5 will 
mean the database schema was probably restored wrongly.


----------------
ACKNOWLEDGEMENTS
----------------

Thanks to Oleg Bartunov for taking the time to answer many of my questions 
regarding this module, and also to Teodor Sigaev for clearing up the process of 
making your own dictionaries. Plus of course a big thanks to the pair of them 
for writing this module in the first place!