----------------------------------------------------------
USING TSEARCH AND POSTGRESQL FOR A WEB BASED SEARCH ENGINE
----------------------------------------------------------
This documentation is provided as a short guide on how to quickly get up and
running with tsearch and PostgreSQL, for those who want to implement a full
text indexed based search engine. It is not meant to be a complete in-depth
guide into the full ins and outs of the contrib/tsearch module, and is
primarily aimed at beginners who want to speed up searching of large text
fields, or those migrating from other database systems such as MS-SQL.
A much more detailed (and technical) description of the tsearch module can be
found in the proper README.tsearch distributed with the tsearch module.
--------
OVERVIEW
--------
MS-SQL provides a full text indexing (FTI) system which enables the fast
searching of text based fields, very useful for websites (and other
applications) that require a results set based on key words. PostgreSQL ships
with a contributed module called tsearch, which implements a special type of
index that can also be used for full text indexing. Further more, unlike MS'
offering which requires regular incremental rebuilds of the text indexes
themselves, tsearch indexes are always up-to-date and keeping them so induces
very little overhead.
Before we get into the details, it is recommended that you have installed and
tested PostgreSQL, are reasonably familiar with databases, the SQL query
language and also understand the basics of connecting to PostgreSQL from the
local shell. This document isn't intended for the complete PostgreSQL newbie,
but anyone with a reasonable grasp of the basics should be able to follow it.
------------
INSTALLATION
------------
tsearch is very easy to install, just change the current directory to
PGSQL_SRC/contrib/tsearch and type:
gmake
gmake install
gmake installcheck
That is pretty much all you have to do, unless of course you get errors.
However if you get those, you better go check with the mailing lists over at
http://www.postgresql.org since its never failed for me.
Note that in the default installation, some common words such as "the" and
"who" are not indexed at all. If you wish them to be, or want to add further
words that should not be indexed, you will need to create a custom dictionary
and recompile and reinstall tsearch. See the section on changing non-indexed
words later in this document.
-------------------------------------------
ADDING TSEARCH FUNCTIONAILITY TO A DATABASE
-------------------------------------------
If you thought installation was easy, this next bit is even easier. Change to
the PGSQL_SRC/contrib/tsearch directory and type:
psql DATABASE < tsearch.sql
This should populate the database specified with all the functions, types and
operators required to use txtidx FTI indexes.
--------------------------
INDEXING FIELDS IN A TABLE
--------------------------
The next stage is to add a text index to an existing table. In this example we
already have a table defined as follows:
CREATE TABLE tblMessages {
intIndex int4,
strTopic varchar(100),
strMessage text,
};
The next stage is to create a special text index which we will use for FTI, so
we can search our table of messages for words or a phrase. We do this using the
SQL command:
ALTER TABLE tblMessages ADD idxFTI txtidx;
Note that unlike traditional indexes, this is actually a new field in the same
table, which is then used (through the magic of the tsearch operators and
functions) by a special index we will create in a moment.
The next thing to do is to populate the new field with indexed data, which we
do as follows:
UPDATE tblTitles SET idxFTI=txt2txtidx(strMessage);
Note that this only indexes the field strMessage, so if you want to also add
strTopic to the same index, you should instead do the following, which
effectively concates the two fields into one field before being fed into the
indexing system:
UPDATE tblTitles SET idxFTI=txt2txtidx(strTopic || ' ' || strMessage);
Two final things you need to do, are create a special index on your newly
created field, which you do as follows:
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
And to set up a trigger so every time a row in this table is changed, the text
index is automatically updated. This is easily done using:
CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strMessage);
Or if you are indexing both strMessage and strTopic you should instead do:
CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strTopic,
strMessage);
Before you ask, the tsearch function accepts multiple fields as arguments so
there is no need to concate the two into one like we did before.
----------------
QUERYING A TABLE
----------------
So you have your indexed data, how do you query it? Well the README.tsearch has
some very detailed descriptions of everything you can do, but here are some
examples to get you going.
If you wanted to search the above example table for all entries which contain
the word "minority", you could use the query:
SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ##
'minority'
Note that searching this way isn't case sensitive, so using "MiNoRiTy" in place
of the above will return the same results. A good thing for those wanting to
use these as search engines on websites!
Now suppose you want to search for more than one word, say "minority" and
"report", then you would use the following query instead:
SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ##
'minority&report'
Note that spaces are a bad thing in these sorts of queries, so you better get
your regexps out and filter these away. Also note that the & has special
meaning here too, so you can't search for anything with "&" in it either.
The last example here relates to searching for a phrase, for example "minority
report". This poses a problem with regard to tsearch, as it doesn't index
phrases, only words. But there is a way around which doesn't appear to have a
significant impact on query time, and that is to use a query such as the
following:
SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ##
'minority&report' AND strMessage ILIKE '%minority report%'
Of course if your indexing both strTopic and strMessage, and want to search for
this phrase on both, then you will have to get out the brackets and extend this
query a little more.
--------------------------
CHANGING NON-INDEXED WORDS
--------------------------
Some words such as "and", "the", and "who" are automatically not indexed, since
they belong to a pre-existing dictionary of Stop Words which tsearch does not
perform indexing on. If someone needs to search for "The Who" in your database,
they are going to have a tough time coming up with any results, since both are
ignored in the indexes. But there is a solution, and that is to build your own
dictionary file for tsearch.
This is pretty easy to do, just change to the
PGSQL_SRC/contrib/tsearch/makedict directory, and create yourself a text file
with a list of words (one per line) that you wish to be ignored. Example files
for English and Russian languages are at
http://www.sai.msu.su/~megera/postgres/gist/tsearch/ if you need somewhere to
start.
When you have all the files you need, run the command:
./makedict.pl -l 'C' -s stopwords.txt > ../dict/mydict.dct
You can also create a list of word endings in another text file, which will be
stripped from any relevant words before being indexed. For example, if you have
the word "ending" in your table, it will be indexed as "end" and the "ing" is
stripped out. If you choose to have word endings included in your dictionary,
then you can create your new dictionary file using the following command
instead:
./makedict.pl -l 'C' -e endwords.txt -s stopwords.txt >
../dict/mydict.dct
Note that there is no dictionary of words which do appear in the index, only a
dictionary of words and word endings that will not.
The next stage is to change back to the PGSQL_SRC/contrib/tsearch and edit the
source header file dict.h, so it contains the new line:
#include "dict/mydict.dct"
immediately before the default dictionary, which should be commented out as
shown here:
//#include "dict/porter_english.dct"
The whole dict.h file should now look something like this:
#define TABLE_DICT_START ,{
#define TABLE_DICT_END }
#include "dict/mydict.dct"
//#include "dict/porter_english.dct"
#ifdef USE_LOCALE
#include "dict/russian_stemming.dct"
#endif
#undef TABLE_DICT_START
#undef TABLE_DICT_END
After you've done this, making sure you are in the PGSQL_SRC/contrib/tsearch
directory, perform:
gmake clean
gmake
gmake install
At this point you will have to restart PostgreSQL, and rebuild any relevant
indexes. The latter is done merely by using the same UPDATE statement you used
to populate the index in the first place.
-------------------------------------------------------
BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH
-------------------------------------------------------
Believe it or not, this isn't as straight forward as it should be, and you will
have problems trying to backup and restore any database which uses tsearch
unless you take the steps shown below. And before you ask using pg_dumpall will
result in failure every time. These took a lot of trial and error to get
working, but the process as laid down below has been used a dozen times now in
live production environments so it should work fine.
HOWEVER never rely on anyone elses instructions to backup and restore a
database system, always develop and understand your own methodology, and test
it numerous times before you need to do it for real.
To Backup a PostgreSQL database that uses the tsearch module:
1) Backup any global database objects such as users and groups (this step is
usually only necessary when you will be restoring to a virgin system)
pg_dumpall -g > GLOBALobjects.sql
2) Backup the full database schema using pg_dump
pg_dump -s DATABASE > DATABASEschema.sql
3) Backup the full database using pg_dump
pg_dump -Fc DATABASE > DATABASEdata.tar
To Restore a PostgreSQL database that uses the tsearch module:
1) Create the blank database
createdb DATABASE
2) Restore any global database objects such as users and groups (this step is
usually only necessary when you will be restoring to a virgin system)
psql DATABASE < GLOBALobjects.sql
3) Create the tsearch objects, functions and operators
psql DATABASE < tsearch.sql
4) Edit the backed up database schema and delete all SQL commands which create
tsearch related functions, operators and data types, BUT NOT fields in table
definitions that specify txtidx types. If your not sure what these are, they
are the ones listed in tsearch.sql. Then restore the edited schema to the
database
psql DATABASE < DATABASEschema.sql
5) Restore the data for the database
pg_restore -N -a -d DATABASE DATABASEdata.tar
If you get any errors in step 4, it will most likely be because you forgot to
remove an object that was created in tsearch.sql. Any errors in step 5 will
mean the database schema was probably restored wrongly.
----------------
ACKNOWLEDGEMENTS
----------------
Thanks to Oleg Bartunov for taking the time to answer many of my questions
regarding this module, and also to Teodor Sigaev for clearing up the process of
making your own dictionaries. Plus of course a big thanks to the pair of them
for writing this module in the first place!