2009-11-25

K-nn search in PostgreSQL

Update: Download spots data (900,000 spots, 25 Mb compressed).

After our first announcement in -hackers mailing list of knngist we received positive feedback and recommendation to make GiST smarter. This night I tested new version of knn-search, which Teodor heroically made in 2 days, based on improved GiST, which now understand which algorithm of tree traverse to use (depth-first as for plain GiST, or distance based priority queue for KNNGiST). We also introduced amcanorderbyop flag to pg_am, which indicates if access method supports ordering operations.

We need index on coordinates.

=# create index pt_idx on spots using gist(coordinates); 

Now, compare traditional approach and knn (apply patches to CVS HEAD from commitfest).

1. Find 10 closest points to the given point.

SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist 
FROM spots ORDER BY dist ASC LIMIT 10;
             coordinates             |       dist       
-------------------------------------+------------------
 (3.57192993164062,6.51727240153665) | 2.08362656457647
 (3.49502563476562,6.49134782128243) | 2.11874164636854
 (3.4393,6.4473)                     | 2.12848814420001
 (3.31787109375,6.50089913799597)    | 2.25438592075067
 (2.6323,6.4779)                     | 2.79109148900569
 (2.66349792480469,6.53159856871478) | 2.79374947392946
 (1.84102535247803,6.27874198581057) |  3.4079762161672
 (1.2255,6.1228)                     | 3.93796014327215
 (1.22772216796875,6.15693947094637) | 3.94570513108469
 (9.6977,4.044503)                   | 4.79388775494473
(10 rows)

Total number of rows processed by traditional approach is 908846.

8.4.2                 464.501 ms
8.5dev+knngist-0.6    0.971 ms

We see more than 400x better performance for knngist.

2. More complex query with fts (total rows 5098 rows).

=# CREATE INDEX spots_idx ON spots USING gist (coordinates,to_tsvector('french',address));
=# SELECT id, address,
(coordinates <->'(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE to_tsvector('french',address) @@ to_tsquery('french','mars')
ORDER BY coordinates <-> '(2.29470491409302,48.858263472125)'::point
LIMIT 10;

8.4.2               166.089 ms
5dev+knngist-0.6    2.758 ms

Performance benefit is also remarkable for complex queries.

3. Complex polygon (PARIS 1, see knngist ), number of rows is 238.

=# SELECT id, address,  (coordinates <->
'(2.29470491409302,48.858263472125)'::point) AS dist
FROM spots WHERE
to_tsvector('french',address) @@ to_tsquery('french','place')
AND  coordinates <@ :PARIS1::polygon
ORDER BY  coordinates <-> '(2.29470491409302,48.858263472125)'::point
LIMIT 10;

8.4.2                18.125 ms
8.5dev+knngist-0.6   19.704 ms

Since benefit of knn comes from reduction number of rows to sort (traditional approach is a fetch,sort,limit scenario) it's clear, that we'll see improvement when:

  1. the total number of rows returned by a query (before limit) is big
  2. rows are long

Another benefit of knn-search is more simple logic, since one doesn't need to know "radius" of neighbourhood in advance. In the above queries I assume infinite radius to demonstrate the biggest benefit of knn-search.