Tycho

We're very interested in performance testing of pg_sphere, especially when we experiment with index contstructions.

  • Dataset - Tycho catalogue, [ http://www.sai.msu.su/~megera/postgres/gist/pg_sphere/tycho/tycho.tar.gz archive] was prepared by Janko. Readme file contains information how to setup tycho database
  • Measurements - timing is always good but are very difficult to compare, so it's necessary to obtain more objective information. It's known, performance of large databases mostly affected by IO operations, so I decided to count the number of disk operations. Forunately, postgreSQL provides such info. It's necessary to change postgresql.conf:
#
#       Access statistics collection
#
#stats_start_collector = true
#stats_reset_on_server_start = true
stats_command_string = true #false
stats_row_level = true# false
stats_block_level = true #fase

Detailed information about postgresql statistics is available in documantations shipped with postgresql (monitoring-stats.html).

  • Tests
    • Ordered - original table, ordered by latitude
    • Randomized - data from original table are copied to the new table in random order (using random())
  select * into table tycho_random from tycho order by random();
  alter table tycho rename to tycho_orig;
  alter table tycho_random rename to tycho;
  create index pos_random_idx on tycho using gist(pos);
    • Clustered - original data are clustered using z-index (btree index on some combination of latitude and longtitude)
  create index pos_z_idx on tycho(pos);
  cluster pos_z_idx on tycho;
  create index pos_clustered_idx on tycho using gist(pos);

I used autogenerated queries (total 647 covering sphere) like:

SELECT count(*) FROM tycho WHERE pos @ sbox '((0d,-90d),(10d,-80d))';
  • Scripts -
    • [1] - generates queries for queries.sql
    • [2] - run it as
psql tycho < run.sql > LOG
    • [3] - contains generated queries and selects from statistics
  • Results:
ordered:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 18058124 | public     | tycho   |         420323 |        989674 |         30644 |      1455989 |
  relid   | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit
----------+------------+------------+---------+--------------+---------------+--------------
 18058124 |   19113241 | public     | tycho   | pos_idx      |         30644 |      1455989

random:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 19113242 | public     | tycho   |        1249539 |        166284 |       2208863 |      3974368 |
  relid   | indexrelid | schemaname | relname |  indexrelname  | idx_blks_read | idx_blks_hit 
----------+------------+------------+---------+----------------+---------------+--------------
 19113242 |   20168359 | public     | tycho   | pos_random_idx |       2208863 |      3974368 

clustered:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 19113242 | public     | tycho   |          13500 |       1396497 |          7437 |      1455742 |
  relid   | indexrelid | schemaname | relname |   indexrelname    | idx_blks_read | idx_blks_hit  
----------+------------+------------+---------+-------------------+---------------+-------------- 
 19113242 |   20168365 | public     | tycho   | pos_clustered_idx |          7437 |      1455742  
  • Discussion
    • heap_blks_read - blocks readed from a table
    • heap_blks_hit - blocks readed from postgresql buffer
    • idx_blks_read, idx_blks_hit - the same as above but for index

It's clear why clustered table outperforms ordered and randomized, and why randomized is so low.

Size:

  • Table tycho - 110788608 b (13524 blk)
  • Index pos_idx - 72531968 b (8854 blk
  • Index pos_random_idx - 66822144 b (8157 blk)
  • Index pos_clustered_idx - 66822144 ((8157 blk)

We could calculate how many times the whole table and index were read:

ordered
  • Table tycho (heap_blks_read:31, heap_blks_hit:73)
  • Index pos_idx (idx_blks_read:3.46, idx_blks_hit:164)
randomized
  • Table tycho (heap_blks_read:92, heap_blks_hit:12)
  • Index pos_random_idx (idx_blks_read:271, idx_blks_hit:487)
clustered
  • Table tycho (heap_blks_read:0.99, heap_blks_hit:103)
  • Index pos_clustered_idx (idx_blks_read:0.91, idx_blks_hit:178)

We now see, that this test is not "honest" :) Queries are distributed in 36 layers on longtitude, each layer has 18 separate partitions on latitude, so they are ordered and this explain why clustered is so fast ! Total number (heap_blks_read + heap_blks_hit) of table reads is the same, the difference is in disk reads (slow) and buffer reads (fast). The real question was not why random is so slow, as Janko asked, but why clustered is so fast. Now, I'm sure, clustered would be slowdown if queries will be in random order. Here is [4] - randomized quieries.

Results:

ordered:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 20168366 | public     | tycho   |         318352 |       1096543 |         35679 |      1456149 |
  relid   | indexrelid | schemaname | relname | indexrelname | idx_blks_read | idx_blks_hit 
----------+------------+------------+---------+--------------+---------------+--------------
 20168366 |   21223483 | public     | tycho   | pos_idx      |         35679 |      1456149

random:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 21223484 | public     | tycho   |        1222230 |        192665 |       1786043 |      3617033 |
  relid   | indexrelid | schemaname | relname |  indexrelname  | idx_blks_read | idx_blks_hit 
----------+------------+------------+---------+----------------+---------------+--------------
 21223484 |   22278601 | public     | tycho   | pos_random_idx |       1786043 |      3617033

clustered:
  relid   | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit |
----------+------------+---------+----------------+---------------+---------------+--------------+
 21223484 | public     | tycho   |          24246 |       1366954 |         18909 |      1424932 |
  relid   | indexrelid | schemaname | relname |   indexrelname    | idx_blks_read | idx_blks_hit 
----------+------------+------------+---------+-------------------+---------------+--------------
 21223484 |   22278607 | public     | tycho   | pos_clustered_idx |         18909 |      1424932

Now, we see, indeed, clustered performance becomes about 2 times worse in terms of disk access, comparing to ordered set of queries and this is more realistic estimation. Nevertheless, ordered GiST index is the best option. btw, timings for both tests are not changed very much for clustered, because of little disk operations.