OverBlogFts

OverBlog full text search project

Introduction

There is a large, constantly changing collection of documents in an external database that needs to be indexed and searched. The challenge is to use the full text search extension for PostgreSQL (contrib/tsearch2) as a scalable solution. The database holds about 5 million documents, and about 20,000 new documents arrive every day. The system must serve about 100,000 search requests per day and support sophisticated ranking based on a variety of criteria.
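For a sense of scale, the daily figures above can be converted into average per-second rates. This is only a back-of-the-envelope sketch in C; the helper names per_second and print_load_estimate are made up for illustration, and averages say nothing about peak load, which is presumably higher.

```c
#include <stdio.h>

#define SECONDS_PER_DAY 86400.0

/* Convert a per-day count into an average per-second rate. */
double per_second(double per_day)
{
    return per_day / SECONDS_PER_DAY;
}

/* Print the average rates implied by the figures in the introduction:
 * 100,000 searches per day and 20,000 incoming documents per day. */
void print_load_estimate(void)
{
    printf("searches:           %.2f req/s\n", per_second(100000.0));
    printf("incoming documents: %.2f docs/s\n", per_second(20000.0));
}
```

On average this is slightly more than one search per second and roughly one new document every four to five seconds.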

To meet this requirement we designed a search daemon that accepts queries, passes them to the PostgreSQL database, and stores the results in a cache organized as an LRU buffer in shared memory. Clients and the search daemon communicate according to the described API. The search daemon is implemented as an fcgi program (C language) invoked by the lighttpd server, which is responsible for communication with clients. We developed Gin (Generalized Inverted Index) to scale the tsearch2 module.

The system was tested on two (actually, three) machines using parallel tester scripts. Input queries were chosen at random from the ranked list of words collected from all documents (about 50,000 unique words, each occurring in more than 100 documents). We tested direct full text searches in the database as well as indirect searches through the search daemon. Unfortunately, we were not able to get real-life query statistics to simulate a more realistic workload. On the server with 8 GB of RAM (3 GB for the PostgreSQL buffer) we were able to serve about 1 million requests per day.

The rule of thumb for choosing a good database server is more RAM, for both the database and the system, and a good RAID (10) array for disk storage. RAM is used to cache disk blocks, which greatly increases performance.
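To illustrate the idea of the result cache, here is a minimal sketch of an LRU buffer in C. It is not the daemon's actual code: the names (LruCache, lru_get, lru_put), the slot count, and the field sizes are all assumptions made for the example. The structure is a fixed-size array with no pointers inside it, so a layout like this could in principle be placed in a shared memory segment; locking between processes is left out entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4     /* hypothetical capacity, kept tiny for illustration */
#define KEY_LEN     64    /* hypothetical maximum query length, incl. '\0' */
#define VAL_LEN     256   /* hypothetical maximum result length, incl. '\0' */

typedef struct {
    int           used;          /* 0 = free slot */
    unsigned long last_used;     /* logical clock value of the last access */
    char          key[KEY_LEN];  /* query text */
    char          val[VAL_LEN];  /* serialized search result */
} CacheEntry;

typedef struct {
    unsigned long clock;         /* incremented on every hit or store */
    CacheEntry    slots[CACHE_SLOTS];
} LruCache;

void lru_init(LruCache *c)
{
    memset(c, 0, sizeof(*c));
}

/* Return the cached result for key, or NULL on a miss.
 * A hit marks the entry as most recently used. */
const char *lru_get(LruCache *c, const char *key)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        CacheEntry *e = &c->slots[i];
        if (e->used && strcmp(e->key, key) == 0) {
            e->last_used = ++c->clock;
            return e->val;
        }
    }
    return NULL;
}

/* Store a result. Overlong keys and values are truncated by snprintf,
 * so a truncated key will not match later lookups of the full query. */
void lru_put(LruCache *c, const char *key, const char *val)
{
    CacheEntry *target = NULL;

    /* Pass 1: reuse the slot that already holds this key, if any. */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        CacheEntry *e = &c->slots[i];
        if (e->used && strcmp(e->key, key) == 0) {
            target = e;
            break;
        }
    }

    /* Pass 2: otherwise take a free slot, or evict the least recently used. */
    if (target == NULL) {
        target = &c->slots[0];
        for (int i = 0; i < CACHE_SLOTS; i++) {
            CacheEntry *e = &c->slots[i];
            if (!e->used) {
                target = e;
                break;
            }
            if (e->last_used < target->last_used)
                target = e;
        }
    }

    assert(target != NULL);
    target->used = 1;
    target->last_used = ++c->clock;
    snprintf(target->key, KEY_LEN, "%s", key);
    snprintf(target->val, VAL_LEN, "%s", val);
}
```

A daemon using such a cache would call lru_get with the query text first and query the database only on a miss, then store the result with lru_put. Lookup here is a linear scan, which is fine for a sketch; a real cache of useful size would presumably add a hash table on top.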

Components

Base directory is /home/megera/app/over

Database

  • The source is fts/dump/pgsql/cvs-8.1.tar.gz, a patched version of the stable 8.1 branch that contains Gin, a patched tsearch2, and the French Snowball stemmer
  • Installation is easy: ./configure --prefix=/usr/local/pgsql && make && make install
  • Example postgresql.conf files are postgresql-test.conf for the test server and big/postgresql.conf for the big one.

Several things to know:

  • You need a postgres user on your system
    • /usr/sbin/useradd -g users postgres
  • You need a web user on your system and in the database for the search daemon:
    • /usr/sbin/groupadd web
    • /usr/sbin/useradd -g web web
    • createuser -SDRI web

Getting everything working together

General considerations