Zen: spgist dev

SP-GiST for hackers

SP-GiST is an abbreviation of space-partitioned GiST - the search tree, which allows to implement a wide range of different non-balanced disk-based data structures, such as quadtree, kd-tree, tries - popular data structures, originally developed for memory storage. Main memory access structures usually designed as a set of dynamically allocated nodes linked by pointers, which is not suitable for direct storing on disk, since these chains of pointers can be rather long and require too many disk accesses. In opposite, disk based data structures have a high fanout to minimize I/O. The challenge is to map nodes of tree to disk pages in such a way, so search algorithm accesses only a few disk pages, even if it traverse many nodes.

COMMON STRUCTURE DESCRIPTION

Logically, tree is a set of tuples, each of different number of children, each of which can also be an inner or leaf tuple. It's important to note, that on-disk storage is not effective to store a lot of inner tuples and so leaf tuples gathered into linked list, consequently, all leaves in such a list have the same parent. Branches can be of different depth (actually, there is no control or code to support balancing), which means that the tree is non-balanced.

SP-GiST core is responsible to map inner/leaf tuple to disk pages and imposes the following restriction - the size of inner or leaf tuples should match a page size, as well as each list of leaf tuples should be placed on the same page. SP-GiST differs pages for inner and leaf tuples, they are called inner and leaf pages respectively. Root page and root inner tuple are special - root page contains only one inner tuple. If total number of leaf tuples is small enough to fit all tuples to single page, then root page contains only array of leaf tuples.

Each inner tuples contains a 'nodes', which are (key,pointer) pairs, where pointer (ItemPointerData) is a pointer to the child. Search traversal algorithm on each inner tuple chooses a set of nodes to continue tree traverse in depth. If it reaches a leaf page it scans a list of leaf tuples (they are on the same page) to find matched ones. Insertion algorithm does the same, except, it should choose only one node instead of a set and it could modify inner tuple to satisfy a new value to be inserted. If it's needed to append new leaf tuple to a list and there is no free space on page, then SP-GiST creates a new inner tuple and distribute leaf tuples into a set of lists on, perhaps, several pages.

Inner tuple consists of:

optional prefix value - all successors should be consistent with it.

  Example: 
	suffix tree  - prefix value is a common prefix
	quad tree    - centroid
	k-d-tree     - one coordinate

list of nodes, where node is a (key, pointer) pair.

  Example of a key: a single character for suffix tree

Leaf tuple consists of:

a value

  Example:
	suffix tree - the rest of string (postfix)
	quad and k-d tree - the point itself

ItemPointer to the heap

INSERTION ALGORITHM

Insertion algorithm is designed to keep a tree in consistent state at any moment. Below is a simplified insertion algorithm specification (numbers refers to a notes below):

	Start with the first tuple on the root page (1)
  
  loop:
	if (page is leaf) then
		if (enough space)
			insert on page and exit (5)
		else (7)
			call PickSplitFn() (2)
		end if
	else
		switch(chooseFn)
			case MatchNode  - continue with pointer of node 
			case AddNode    - add node and then continue with node's 
							  pointer  (3, 6)
			case SplitTuple - split inner tuple to prefix and postfix 
                                and continue with the postfix tuple (4, 6)
	end if

Notes:

(1) root inner tuple is always a single tuple on page, since after addition node to the root, inner tuple could be too large to be stored on the root page. Current implementation allows moving non-root inner tuple to another page, which isn't possible for root, else we have to separate code path to support root migration. On other side, we could implement moving non-root inner tuple from root page to another page, but since tuple has no link to parent tuple, it can be difficult to support consistent state of tree.

(2) Current implementation allows to do picksplit and insert a new leaf tuple at once, if the new list of leaf tuples fits one page. It's always true for trees like quad tree of k-d-tree, but suffix tree may require subsequent picksplit.

(3) Addition of node should keep size of inner tuple to small to fit on page. After addition inner tuple could become too large to be stored on current page. In this case it will be moved to another (may be new) inner page (see notes about page management). Next, moving tuple to another page could change enumeration of tuples on page and pointers to tuples could be invalid. To prevent that, core leaves a placeholder, named dead tuple which can be reused later. See also Concurrency and Vacuum chapter. Right now only suffix tree could add a node to the tuple, quad tree and k-d-tree make all possible nodes at once in PickSplitFn() call. Current implementation supports ordered array of nodes.

(4) Prefix value could only partially satisfy a new value, in this case it needs to produce new subbranch of tree contains two new inner tuples: prefix and postfix. Consider example of insertion into a suffix tree. We use the following notation, where tuple's id used only for just for illustration (there is no id).

 inner tuple: {tuple id}(prefix string)[ comma separated list of node's keys ]
 leaf tuple: {tuple id}<value>

Insert string 'www.gogo.com' to inner tuple {1}(www.google.com/)[a, i] . We need to split it to two ones:

	{2}(www.go)[o]  - prefix tuple
	            |
                   {3}(gle.com/)[a,i] - postfix tuple

On the next iteration of loop we add node [g] to tuple {2}

				   NIL (no value inserted yet)
				   |
	                          {2}(www.go)[o, g]
			                      |
			                     {3}(gle.com/)[a,i]

And finally, insert a leaf tuple

				  {4}<o.com> 
				   |
	                          {2}(www.go)[o, g]
			                      |
			                     {3}(gle.com/)[a,i]

As we can see, node's array moves to postfix tuple without any changes and SP-GiST core believes, that prefix tuple is not larger than old inner tuple. That allows to store prefix tuple instead of old inner tuple. SP-GiST core will try to store postfix tuple on the same page if possible and will use another page if there is not enough free space (see notes 5 and 6). Currently, quad and k-d tree doesn't use this feature, because they grow their depth only by PickSplitFn() call.

(5) If pointer from node of parent is a NIL pointer (it can be a result of chooseFn with SplitTuple output or vacuum), algorithm chooses a leaf page to store. At first, it try to use the last used leaf page with the largest free space (we track it) to better utilize a disk space. If it's not large enough, then the algorithm allocates a new page.

(6) Management of inner pages is very similar to management of leaf pages, described in (5)

(7) Actually, current implementation can move the whole list of leaf tuples and a new tuple to another page, if the list is a short enough. This increases space utilization, but doesn't changes the basis of the algorithm.

CONCURRENCY

Insertion algorithm always requires an exclusive lock on parent and child pages (parent and child pages can be the same, see notes above). There is a possibility of dead lock in case of cross referenced pages in different branches, for example, inner tuple with id 1 on page N has a child on page M and inner tuple with id 2 from another branch has a child on the same page N. It can happens, due to remembering of the last used page. To prevent dead lock we introduced a "triple parity" of page - if inner tuple is placed on page with BlockNumber N, then it's child tuple should be placed on the same page or on the page with the BlockNumber M and (N+1) mod 3 == M mod 3 to guarantee that tuples on page M have no children on page N ( (M+1) mod 3 != N mod 3 ).

Search traversal algorithm is rather traditional - it share locks pointed page and works with pointed tuple and puts pointers from nodes to the stack, then unlock the page and pop stack to get next pointer. But here there is a race condition - while we playing around our stack, insertion process could move our target inner (or leaf tuple after PickSplitFn() is replaced by inner tuple stored on another page) tuple to another page. To prevent that, insertion algorithm puts a redirection tuple, which contains pointer to the actual place of child.

VACUUM

Vacuum traversal algorithm is similar to search one, except it always keeps exclusive lock on parent page, if child is a leaf page, since after vacuum list of leaf tuples can be changed or becomes empty and pointer in node in inner parent tuple should be updated.

Also, vacuum converts redirection tuples to dead tuples, but only such tuples, which have id of parent transaction is less than GetOldestXmin(). Dead tuples can be reused, because they are just a placeholders to keep tuple's enumeration constant.

LAST USED PAGE MANAGEMENT

List of last used page contains four pages - leaf page and three pages for each "triple parity". This list is stored between calls on meta page, which is never WAL-logged to decrease WAL traffic. Incorrected data on meta page isn't critical, because we could allocate a new page at any moment.

Authors:

Teodor Sigaev <teodor@sigaev.ru>
Oleg Bartunov <oleg@sai.msu.su>

spgist dev