Update: Download spots data (900,000 spots, 25 Mb compressed).
After our first announcement in -hackers mailing list of knngist we received positive feedback and recommendation to make GiST smarter. This night I tested new version of knn-search, which Teodor heroically made in 2 days, based on improved GiST, which now understand which algorithm of tree traverse to use (depth-first as for plain GiST, or distance based priority queue for KNNGiST). We also introduced amcanorderbyop flag to pg_am, which indicates if access method supports ordering operations.
We need index on coordinates.
=# create index pt_idx on spots using gist(coordinates);
Now, compare traditional approach and knn (apply patches to CVS HEAD from commitfest).
1. Find 10 closest points to the given point.
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots ORDER BY dist ASC LIMIT 10; coordinates | dist -------------------------------------+------------------ (3.57192993164062,6.51727240153665) | 2.08362656457647 (3.49502563476562,6.49134782128243) | 2.11874164636854 (3.4393,6.4473) | 2.12848814420001 (3.31787109375,6.50089913799597) | 2.25438592075067 (2.6323,6.4779) | 2.79109148900569 (2.66349792480469,6.53159856871478) | 2.79374947392946 (1.84102535247803,6.27874198581057) | 3.4079762161672 (1.2255,6.1228) | 3.93796014327215 (1.22772216796875,6.15693947094637) | 3.94570513108469 (9.6977,4.044503) | 4.79388775494473 (10 rows)
Total number of rows processed by traditional approach is 908846.
8.4.2 464.501 ms 8.5dev+knngist-0.6 0.971 ms
We see more than 400x better performance for knngist.
2. More complex query with fts (total rows 5098 rows).
=# CREATE INDEX spots_idx ON spots USING gist (coordinates,to_tsvector('french',address)); =# SELECT id, address, (coordinates <->'(2.29470491409302,48.858263472125)'::point) AS dist FROM spots WHERE to_tsvector('french',address) @@ to_tsquery('french','mars') ORDER BY coordinates <-> '(2.29470491409302,48.858263472125)'::point LIMIT 10; 8.4.2 166.089 ms 5dev+knngist-0.6 2.758 ms
Performance benefit is also remarkable for complex queries.
3. Complex polygon (PARIS 1, see knngist ), number of rows is 238.
=# SELECT id, address, (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist FROM spots WHERE to_tsvector('french',address) @@ to_tsquery('french','place') AND coordinates <@ :PARIS1::polygon ORDER BY coordinates <-> '(2.29470491409302,48.858263472125)'::point LIMIT 10; 8.4.2 18.125 ms 8.5dev+knngist-0.6 19.704 ms
Since benefit of knn comes from reduction number of rows to sort (traditional approach is a fetch,sort,limit scenario) it's clear, that we'll see improvement when:
Another benefit of knn-search is more simple logic, since one doesn't need to know "radius" of neighbourhood in advance. In the above queries I assume infinite radius to demonstrate the biggest benefit of knn-search.