We're very interested in performance testing of pg_sphere, especially when we experiment with index contstructions.
# # Access statistics collection # #stats_start_collector = true #stats_reset_on_server_start = true stats_command_string = true #false stats_row_level = true# false stats_block_level = true #fase
Detailed information about postgresql statistics is available in documantations shipped with postgresql (monitoring-stats.html).
select * into table tycho_random from tycho order by random(); alter table tycho rename to tycho_orig; alter table tycho_random rename to tycho; create index pos_random_idx on tycho using gist(pos);
create index pos_z_idx on tycho(pos); cluster pos_z_idx on tycho; create index pos_clustered_idx on tycho using gist(pos);
I used autogenerated queries (total 647 covering sphere) like:
SELECT count(*) FROM tycho WHERE pos @ sbox '((0d,-90d),(10d,-80d))';
psql tycho < run.sql > LOG
ordered: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 18058124 | public | tycho | 420323 | 989674 | 30644 | 1455989 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+--------------+---------------+-------------- 18058124 | 19113241 | public | tycho | pos_idx | 30644 | 1455989 random: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 19113242 | public | tycho | 1249539 | 166284 | 2208863 | 3974368 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+----------------+---------------+-------------- 19113242 | 20168359 | public | tycho | pos_random_idx | 2208863 | 3974368 clustered: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 19113242 | public | tycho | 13500 | 1396497 | 7437 | 1455742 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+-------------------+---------------+-------------- 19113242 | 20168365 | public | tycho | pos_clustered_idx | 7437 | 1455742
It's clear why clustered table outperforms ordered and randomized, and why randomized is so low.
Size:
We could calculate how many times the whole table and index were read:
We now see, that this test is not "honest" :) Queries are distributed in 36 layers on longtitude, each layer has 18 separate partitions on latitude, so they are ordered and this explain why clustered is so fast ! Total number (heap_blks_read + heap_blks_hit) of table reads is the same, the difference is in disk reads (slow) and buffer reads (fast). The real question was not why random is so slow, as Janko asked, but why clustered is so fast. Now, I'm sure, clustered would be slowdown if queries will be in random order. Here is [4] - randomized quieries.
Results:
ordered: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 20168366 | public | tycho | 318352 | 1096543 | 35679 | 1456149 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+--------------+---------------+-------------- 20168366 | 21223483 | public | tycho | pos_idx | 35679 | 1456149 random: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 21223484 | public | tycho | 1222230 | 192665 | 1786043 | 3617033 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+----------------+---------------+-------------- 21223484 | 22278601 | public | tycho | pos_random_idx | 1786043 | 3617033 clustered: relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | ----------+------------+---------+----------------+---------------+---------------+--------------+ 21223484 | public | tycho | 24246 | 1366954 | 18909 | 1424932 | relid | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit ----------+------------+------------+---------+-------------------+---------------+-------------- 21223484 | 22278607 | public | tycho | pos_clustered_idx | 18909 | 1424932
Now, we see, indeed, clustered performance becomes about 2 times worse in terms of disk access, comparing to ordered set of queries and this is more realistic estimation. Nevertheless, ordered GiST index is the best option. btw, timings for both tests are not changed very much for clustered, because of little disk operations.