Our new TODO list
- GiST - Generalized Search Tree
GiST makes it possible for domain experts, rather than database experts, to create custom data types with indexed access methods and an extensible set of queries. It can be considered an ancestor of "Relational Extenders" in DB2, Cartridges in Oracle, and DataBlades in Informix.
- extending and improving the GiST interfaces, to make developing GiST-based extensions easier and to allow implementation of ADTs not supported by the current GiST core, for example quad trees, digital trees, and S-trees (similarity trees). Several papers have proposed changes to the GiST interface for feature and performance reasons:
- "Generalizing 'Search' in Generalized Search Trees" by Paul Aoki,
- "High-Performance Extensible Indexing", by Marcel Kornacker and
- "A Framework for Supporting the Class of Space Partitioning Trees", by Walid G.Aref and Ihab F. Ilyas.
- Improve NULL handling in GiST. The current interface is not NULL-safe, which can cause a serious problem when clustering a table (CLUSTER command) on an index whose first column contains NULLs: those records will be lost!
- Better interaction of GiST with the optimizer and planner; develop cost functions for several popular extensions (intarray, ltree, tsearch2, ...)
- Figure out how to configure parameters of GiST index support functions, for example, changing the signature size SIGLENINT for a tsearch2 index, which could greatly improve performance. The syntax of the CREATE INDEX command doesn't allow this, so we need to store the value somewhere in the index.
- Better documentation of GiST programming interface and tutorials.
A GiST programming tutorial is available in Russian.
- Better node split algorithm for the R-tree; see, for example, the cR-tree, which makes use of clustering instead of splitting. The idea is to relax the constraint of splitting an overflow node into exactly two nodes and to use clustering to create more than two clusters, producing a better R-tree structure. An implementation of the cR-tree based on a modified libgist exists.
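The clustering-based split above can be sketched as follows. This is a hypothetical illustration using a plain k-means over entry centers, not the actual cR-tree or libgist code; the function names and the choice of k are our own.

```python
import random

def mbr(rects):
    """Minimum bounding rectangle of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def center(r):
    return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)

def cluster_split(rects, k=3, iters=10):
    """Split an overflowed node's entries into up to k clusters (k may be
    greater than two, as in the cR-tree) by k-means over entry centers."""
    pts = [center(r) for r in rects]
    seeds = random.sample(pts, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r, p in zip(rects, pts):
            # assign each entry to the nearest cluster seed
            j = min(range(k),
                    key=lambda i: (p[0] - seeds[i][0]) ** 2
                                  + (p[1] - seeds[i][1]) ** 2)
            groups[j].append(r)
        # recompute seeds as centers of each cluster's MBR
        seeds = [center(mbr(g)) if g else seeds[i] for i, g in enumerate(groups)]
    return [(g, mbr(g)) for g in groups if g]
```

The point of allowing k > 2 is that well-separated entries need not be forced into two overlapping bounding boxes; each cluster becomes its own node with a tight MBR.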
- Inverted index with byte compression in PostgreSQL
- This is a well-known ADT used in IR (information retrieval) and search engines. It has good scalability, and many compression techniques have been developed to balance index size and performance. We have good experience using inverted indices outside the database, for example in GTSearch, the search engine for www.pgsql.ru. See Gin for the status of the development and GinTest for current results. UPDATE: GIN is available as a core feature since the 8.2 release!
We still have several todo items:
- efficient full scan
- wildcard search
- support for non-B-tree operations
- fillfactor support
- storing additional information, such as word positions
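The byte compression mentioned for the inverted index usually means variable-byte coding of document-ID gaps in posting lists: sorted docids are stored as deltas, and each delta is packed into as few bytes as possible. A minimal sketch of the idea follows; this is not GIN's actual on-disk format, and the function names are illustrative.

```python
def vbyte_encode(nums):
    """Variable-byte encode positive integers: 7 data bits per byte,
    high bit set on the terminating byte of each number."""
    out = bytearray()
    for n in nums:
        chunk = []
        while True:
            chunk.insert(0, n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[-1] |= 0x80            # mark the last byte of this number
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        if b & 0x80:                 # terminating byte
            nums.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return nums

def compress_postings(docids):
    """Store sorted docids as gaps, then variable-byte encode the gaps."""
    if not docids:
        return b""
    gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    docids, cur = [], 0
    for g in vbyte_decode(data):
        cur += g
        docids.append(cur)
    return docids
```

Gap coding is what makes this effective: in a dense posting list most gaps are small, so most docids fit in a single byte instead of four.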
- GiST modules
- add UTF-8 support to pg_trgm module
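For context, pg_trgm compares strings by their sets of 3-grams, and UTF-8 support essentially means extracting trigrams over characters rather than bytes. A rough sketch of the similarity measure follows; the C module's exact padding and normalization differ, so treat this as an assumption-laden illustration.

```python
def trigrams(word):
    """Set of 3-grams of a word, padded roughly as pg_trgm does
    (blanks added so word boundaries produce trigrams too)."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style similarity of trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Iterating over a Python str already yields characters, which is why the UTF-8 problem disappears in this sketch; in the C module it is presumably the byte-oriented trigram extraction that needs to become character-aware.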
- Full text search engine
- We've received many requests from users of tsearch2, our full-text search extension, about improving scalability and adding new features. Tsearch2 is the de-facto standard full-text search engine for PostgreSQL; it is included in the distribution and has many users worldwide. Tsearch2 is also used in the OpenFTS search engine.
Issues/TODO for moving tsearch2 into core: DONE for 8.3 (sponsored by EnterpriseDB). See [text search documentation] for more information.
- memory management. Dictionaries and tsearch2 itself cache a lot of data in memory allocated by malloc/palloc(TopContext), and not all of it is correctly freed by the reset_tsearch() function.
- tsearch2 doesn't automatically reinitialize dictionaries/configurations after configuration changes
- dictionaries don't automatically recompile their files when those files change
- It would be nice to store some information (such as a dictionary's structure) in shared memory. For example, the ispell dictionary takes about 1-5 seconds to compile its files; sharing the result should improve performance.
- To provide support for end users, snowball stemmers should be compiled into postgresql or into separate *.so libraries. The sources of all snowball files take ~1 MB on disk; ispell dictionary files take 1-3 MB per language.
- Use syscache instead of built-in cache
Other tasks (brief list):
- DONE (8.2) full UTF-8 support
- DONE (8.2, see Gin) adding support for different storages. Currently we have RD-tree storage, which is good for online indexing but not scalable; read tsearch2 internals for details. We propose an inverted index for indexing static collections, which has proven scalability but is not good for online indexing. Balancing these storages, and using common techniques for document and query processing and for working with dictionaries, should scale tsearch2 to tens of millions of documents. We used a sophisticated technique for inverted index construction in GTSearch, a generic text search engine (see www.pgsql.ru for an example).
- improve the page splitting algorithm for RD-tree storage, which is currently very CPU-bound
- DONE (8.2) query rewriting support (thesaurus search)
- DONE (8.2) improved dictionary support - we'd like to recognize "phrases", for example, "supernovae stars" could be replaced by "SN".
- DONE (8.2) improving the ranking function by extending the "cover density" approach (cd_rank) to handle complex logical operators in queries and to take into account the different weights of lexeme classes.
- better and faster headline generation. A headline is the part of a document relevant to the query, with search terms highlighted. In short, we want a headline to be a compact sentence with a high density of search terms and maximum dispersion.
- fuzzy search support using language independent trigram statistics (see pg_trgm module for details)
- better tsquery parsing error messages
- phrase search, through a new operator or by adding word ordering to tsquery.
- utilize the GiST-index optimization which stores small tsvectors (<512b) as-is in the leaves instead of signatures; a hit on such a tsvector is then an exact hit, and there is no need to check the heap for a "false drop".
- store the name of the FTS configuration somewhere in the index
- PROTOTYPE Add system-independent collating support, i.e., develop a user-defined type using IBM's ICU library. Current behaviour depends on the system locale.
- PROTOTYPE Improve the optimizer to recognize the a=b OR a>b case
- PROTOTYPE Add StreamBitmapScan
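Both the cover-density ranking and the headline-generation items in the list above revolve around finding "covers": minimal spans of the document that contain all query terms. A hedged sketch of that core search, using the standard two-pointer sliding window rather than the tsearch2 implementation, with invented names:

```python
def min_cover(doc_words, query_terms):
    """Return (start, end) indices of the smallest window of doc_words
    containing every term in query_terms, or None if no window exists."""
    need = set(query_terms)
    count = {t: 0 for t in need}
    missing = len(need)
    best = None
    left = 0
    for right, w in enumerate(doc_words):
        if w in count:
            if count[w] == 0:
                missing -= 1
            count[w] += 1
        while missing == 0:          # window covers all terms: try to shrink it
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            lw = doc_words[left]
            if lw in count:
                count[lw] -= 1
                if count[lw] == 0:
                    missing += 1
            left += 1
    return best
```

A compact headline then means preferring short covers with many term occurrences; a ranking function can score a document by the lengths and weights of its covers.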
- XML storage and access
The key idea of all approaches to handling XML in an RDBMS is to invent a numbering scheme and map it to relations.
- Native storage - current implementations use string storage
- XPath, XQuery support
- Full text search support
There are several approaches to handling XML in flat relational databases.
Also, we have GiST modules which could be used for XML handling in PostgreSQL: ltree for tree numbering (labelling) and hstore as a storage for node attributes.
Certainly, we have to do some research before writing proposals and building an engine prototype. Personally, I incline toward the staircase join approach, because Grust et al. propose the "XPath Accelerator" index structure and a novel staircase join for efficient evaluation of XPath expressions. Moreover, source code for PostgreSQL 7.3.3 is available.
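The XPath Accelerator labels each node with its preorder and postorder ranks, so the major XPath axes reduce to range predicates on those two numbers, which ordinary indexes can answer. A simplified sketch of the numbering (the tree representation and names here are our own):

```python
def number_tree(root):
    """Assign (pre, post) ranks to every node of a tree given as
    (tag, [children]) tuples, as in the XPath Accelerator encoding.
    Returns one relational (tag, pre, post) row per node."""
    rows = []
    pre = post = 0
    def walk(node):
        nonlocal pre, post
        tag, children = node
        entry = [tag, pre, None]     # preorder rank assigned on the way down
        pre += 1
        rows.append(entry)
        for child in children:
            walk(child)
        entry[2] = post              # postorder rank assigned on the way up
        post += 1
    walk(root)
    return [tuple(e) for e in rows]

def is_descendant(node_row, anc_row):
    """The descendant axis becomes: pre(n) > pre(a) and post(n) < post(a)."""
    _, pre_n, post_n = node_row
    _, pre_a, post_a = anc_row
    return pre_n > pre_a and post_n < post_a
```

Stored as a plain table with B-tree indexes on pre and post, such rows make descendant/ancestor queries into two-sided range scans, which is exactly what the staircase join then evaluates efficiently.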
Also, see [Pgxml thoughts] by Nickolay Samokhvalov.
- phppgadmin - php-based generic web interface to postgresql databases
UPDATES: We got SoC-2007 project for this. I'm mentoring Ivan Zolotukhin.
It's good for quick'n'dirty administration work. Personally, I don't know PHP and don't even want to learn it, but there is one idea I'd like to share which should be easy to implement and is much needed: data version support, not just DDL. It's very easy to implement using our hstore extension. Any volunteers?
For current test site see vo.astronet.ru
- Q3C (quad tree cube) algorithm
This is a PostgreSQL extension for fast access to very large data sets with spherical attributes, for example astronomical catalogs. Some information about the idea and the current implementation is available on the SkyPixelization page. Our experiments with the USNO-A1 catalogue (>500,000,000 stars) are very promising. We have implemented support for the cone search and xmatch queries needed for a full skynode, and we plan to include it in the Astrogrid software. Sergey will present Q3C at the ADASS 2005 meeting in Madrid. Q3C can be downloaded from Sourceforge.
Note that we have already developed the pg_sphere ADT for data with spherical attributes using GiST. Q3C is fully B-tree based and should scale much better than pg_sphere.
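The Q3C scheme can be sketched as projecting a sky position onto one face of a cube and interleaving the bits of the two in-face coordinates, giving a quadtree (Z-order) cell number that a plain B-tree can index; a cone search then becomes a small set of range scans over those numbers. The following is a simplified, hypothetical illustration; the real Q3C face numbering and coordinate scaling differ.

```python
import math

def morton(x, y, nbits):
    """Interleave the bits of x and y: the quadtree (Z-order) cell number."""
    code = 0
    for i in range(nbits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def q3c_ipix(ra_deg, dec_deg, nbits=15):
    """Toy Q3C-style pixel number: find which cube face the unit vector
    hits, project onto that face, quantize the two in-face coordinates,
    and interleave their bits. Face numbering and scaling are invented."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    v = (math.cos(dec) * math.cos(ra),
         math.cos(dec) * math.sin(ra),
         math.sin(dec))
    face = max(range(3), key=lambda i: abs(v[i]))   # dominant axis
    a, b = (v[i] for i in range(3) if i != face)
    x, y = a / v[face], b / v[face]                 # gnomonic projection, in [-1, 1]
    n = 1 << nbits
    xi = min(int((x + 1) / 2 * n), n - 1)
    yi = min(int((y + 1) / 2 * n), n - 1)
    face_id = face * 2 + (0 if v[face] > 0 else 1)  # six faces: 0..5
    return (face_id << (2 * nbits)) | morton(xi, yi, nbits)
```

Because nearby positions on a face share a Z-order prefix, neighbouring sky regions map to a few contiguous integer intervals, which is why an ordinary B-tree suffices and no GiST opclass is needed.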
- ADQL to SQL transformation using XSLT
ADQL stands for Astronomical Data Query Language, an XML-based language (ADQLx) used for querying VO-ready storage. ADQL is a high-level query language behind VOQuery (the highest abstraction, not yet even proposed), which abstracts away the specific SQL dialects supported by different DBMSes. It enables intelligent agents to query astronomical databases via web services.
- ADQLx to SQL (PostgreSQL SQL92)
- ADQLx to SQL (PostgreSQL with Q3C)
- developing an ADQLx web service and CGI interface
- integrating with Astrogrid software
Check ADQL short introduction and current status.
- VO Discoverer
An algorithm for discovering new (variable) objects using multiwavelength data and a set of criteria. Basically, it's a distributed XMATCH query over tables, each of which could be located in a different skynode.
I'm the head of the Astronet project, a knowledge database on astronomy (in Russian). We have already applied many IR techniques there, and below I list some others I'd like to learn/develop:
- Focused crawling
- Similar image search - clustering based on feature extraction technique
- Document clustering - similar document search based on keyword extraction and keyword statistics
- query expansion using thesauri