The Crash-course to OpenFTS
---------------------------

MOTIVATION:

  This document is intended for novices who want to install OpenFTS
  quickly, test it and play around to get a feel for it. Assuming you
  already have all prerequisites installed, the whole process should
  take about 2 minutes.

  OpenFTS is based on quite complex algorithms from Information
  Retrieval and Database theory. It is intended to be flexible.
  The OpenFTS Primer describes installation, running and the API, and
  should be used when writing your own search applications.

  After completing the tests you are welcome to read README.INSIDE for
  comments on the example scripts.

PREREQUISITES:

  PostgreSQL 7.4 + contrib/tsearch2 module (7.3.X also works)

  OpenFTS v.0.35 - available from http://openfts.sourceforge.net,
  currently downloadable from CVS only.

  Perl modules: DBI, DBD-Pg, Time::HiRes - available from CPAN
  (http://www.cpan.org)

     DBI         - http://search.cpan.org/search?dist=DBI
     DBD::Pg     - http://search.cpan.org/search?dist=DBD-Pg
     Time::HiRes - http://search.cpan.org/search?dist=Time-HiRes

  A test collection of documents is available for download from
  http://openfts.sourceforge.net/test-collections/apod-en.tar.gz

  Download and install the collection somewhere:

     cd /path/to/test-collection/
     tar xzvf apod-en.tar.gz

  Now you should have the test documents in the
  /path/to/test-collection/apod directory.

  APOD stands for the Astronomy Picture of the Day
  ( http://antwrp.gsfc.nasa.gov/apod/ ). The authors have kindly granted
  permission to use the texts for testing and non-commercial purposes
  within the framework of the OpenFTS project.

  The APOD collection consists of 1757 articles (about 7 MB) and is
  ideally suited for OpenFTS. Indexing took about 29 seconds on my
  IBM ThinkPad T21 notebook (Linux 2.4.17, 256 MB RAM, 20 GB IDE HD).
  The total number of lexemes is 131310, while the number of unique
  lexemes is only 8,806 (using Porter's stemmer).

  A demo is available at http://xware.astronet.ru/db/apod.html

  Make sure you have sufficient rights to create a database.
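
  As a quick sanity check before indexing, the number of unpacked files
  can be compared with the 1757 articles mentioned above. The following
  is only a minimal sketch and not part of OpenFTS: the function name
  count_docs is hypothetical, and the path in the example is the same
  placeholder used above.

```shell
#!/bin/sh
# count_docs DIR: print the number of regular files found below DIR.
# Hypothetical helper for illustration; it is not shipped with OpenFTS.
count_docs() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example (placeholder path from the text above):
#   count_docs /path/to/test-collection/apod
```

  A count far below the number of articles quoted above suggests an
  incomplete download or extraction.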

Now you may note the time!

RUNNING:

  1. createdb openfts

     Create the test database.

  2. psql openfts < /path/to/share/contrib/tsearch2.sql

     Load the functions. Usually, if your PostgreSQL is installed in the
     /usr/local/pgsql directory, the contrib SQL files should be in the
     /usr/local/pgsql/share/contrib directory.

  4. ./init.pl openfts drop

     Drop the previous openfts instance, if any.

  5. ./init.pl openfts

     Create the openfts instance (tables) in the database.

  6. find /path/to/test-collection/apod -type f | ./index.pl openfts

     Index the APOD collection.
     The resulting database occupies about 21 MB on my notebook.

  7. ./search.pl -p openfts supernovae stars

     The output should look like a string of document identifiers
     separated by semicolons:

     Found documents:118
     573;1241;1419;828;879;1629;553;795;740;1533;....

  8. ./search.pl -p openfts -h5 supernovae stars

     Show text fragments of the first 5 matched files with the query
     terms highlighted.
     ( It's possible to specify offset and limit in the form
       -h offset-limit, e.g., -h 5-10 )

  9. Benchmarking.

     ./search.pl -p openfts -b 100 supernovae stars

     Found documents:118, total time (100 runs):4.19, average time: 0.042 sec

     (Keep in mind these numbers are for my notebook; your mileage may vary.)

--------------------------------------------------------------------
PS.

  1) A list of the unique lexemes indexed, with their frequencies, can be
     obtained using the following command:

     psql -d openfts -qt -c "select * from stat('select fts_index from txt')\
          order by ndoc desc, nentry desc,word"

     Total number of lexemes:

     psql -d openfts -qt -c "select count(*) from stat('select fts_index from txt')"

  2) We use Porter's stemming algorithm in this example, but I highly
     recommend the Snowball algorithm (http://snowball.tartarus.org).
     You'll need to install our Perl interface to Snowball, which can be
     downloaded from http://openfts.sourceforge.net/contributions.shtml
     The Snowball stemmer is a high-quality stemmer and is available for
     many languages.
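
  The semicolon-separated list printed by ./search.pl -p is easy to
  post-process with standard tools. Below is a minimal sketch, assuming
  only the identifier line (not the "Found documents" line) is fed to it
  and that it looks like the sample output above. The helper name
  split_ids is hypothetical and not part of OpenFTS.

```shell
#!/bin/sh
# split_ids: read a semicolon-separated list of document identifiers
# on stdin and print one identifier per line, dropping empty fields
# (such as the one produced by a trailing semicolon).
# Hypothetical helper for illustration; it is not shipped with OpenFTS.
split_ids() {
    tr ';' '\n' | sed '/^$/d'
}

# Example with the first identifiers from the sample output above:
#   printf '573;1241;1419;\n' | split_ids
```

  One identifier per line is convenient for piping into tools such as
  head, sort or wc.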

--------------------------------------------------------------------
Sat Aug  2 23:08:10 MSD 2003
Comments to Oleg Bartunov