tsearch2UTF8Test

Testing new tsearch2 parser with full UTF8 support

This is a completely rewritten parser for tsearch2 with full UTF8 support. The parser uses a finite-state automaton technique and is expected to be flexible and compatible with the old tsearch2 parser, apart from some errors that have been fixed.
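As a rough illustration of the finite-state idea only (this is not the tsearch2 parser code), the Python sketch below walks the input one character at a time, switching between a "word" state and a "separator" state and emitting a token at each switch. The name toy_parse is hypothetical. The token ids 1 and 12 are taken from the parse() output for 'a_b_c' shown on this page. Treating every non-alphanumeric character as a separator is a simplifying assumption.

```python
# Toy two-state tokenizer. NOT the tsearch2 parser, only an illustration.
WORD, BLANK = 1, 12  # token ids as they appear in the parse() output on this page


def toy_parse(text):
    """Split text into (tokid, token) pairs using a two-state automaton."""
    tokens = []
    state = None  # current state: WORD, BLANK, or None before the first character
    start = 0     # index where the current token began
    for i, ch in enumerate(text):
        # Assumption: letters and digits belong to a word; anything else,
        # including '_', is treated as a separator.
        new_state = WORD if ch.isalnum() else BLANK
        if state is not None and new_state != state:
            # State change: the token collected so far is complete.
            tokens.append((state, text[start:i]))
            start = i
        state = new_state
    if state is not None:
        # Flush the last token at end of input.
        tokens.append((state, text[start:]))
    return tokens


if __name__ == "__main__":
    for tokid, token in toy_parse("a_b_c"):
        print(tokid, repr(token))
```

For the input 'a_b_c' this toy yields the same five (tokid, token) pairs as the 'a_b_c' example on this page. It makes no attempt to model paths, versions, tags or any of the other token types listed here.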

A list of current issues in the parser (available from CVS HEAD):

Multiple consecutive slashes ('////'): broken

test=# select * from parse('~//downloads////qq');
 tokid |   token    
-------+------------
    12 | ~
    12 | /
    19 | /downloads
    12 | /
    12 | /
    12 | /
    19 | /qq
(7 rows)

We consider '_' to be a space symbol

test=# select * from parse('a_b_c');
 tokid | token 
-------+-------
     1 | a
    12 | _
     1 | b
    12 | _
     1 | c

XHTML tag: broken (FIXED)

test=# select * from parse('<br/>');
 tokid | token 
-------+-------
    12 | <
     1 | br
    12 | />

word...: broken (FIXED)

test=# select * from parse('etc...');
 tokid | token 
-------+-------
    19 | etc..
    12 | .

~ in path: broken (FIXED)

test=# select * from parse('~/downloads/Harry_Potter.avi');
 tokid |            token            
-------+-----------------------------
    12 | ~
    19 | /downloads/Harry_Potter.avi

version: broken (FIXED)

test=# select * from parse('-1.2.3');
 tokid | token 
-------+-------
    20 | -1.2
    12 | .
    22 | 3

But see below:

test=# select * from parse('version-1.2.3');
 tokid |     token     
-------+---------------
    15 | version-1.2.3
    11 | version
    12 | -
     8 | 1.2.3

Backslash (\) handling: broken (BRR)

test=# select * from parse('a \ b ');
 tokid | token 
-------+-------
     1 | a
    12 |   
     1 | b
    12 |