2008-10-04

Find documents for specific token type

Someone asked us about how to find documents, which contain emails. One possible solution is to write function

CREATE OR REPLACE FUNCTION document_token_types(text)
RETURNS _text AS
$$

SELECT ARRAY (
    SELECT
        DISTINCT alias
    FROM
        ts_token_type('default') AS tt,
        ts_parse('default', $1) AS tp
    WHERE
        tt.tokid = tp.tokid
    );
$$ LANGUAGE SQL immutable;

arxiv=# select document_token_types(title) from papers limit 10;
                     document_token_types
---------------------------------------------------------------
 {asciihword,asciiword,blank,hword_asciipart}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank,float,host}
 {asciiword,blank}
 {asciihword,asciiword,blank,hword_asciipart,int,numword,uint}
 {asciiword,blank}
 {asciiword,blank}
(10 rows)

Now we can create functional GIN index on document (which can be any combination of text fields) to speedup search.

create index fts_types_idx on papers using gin( document_token_types (comment) );

Find all documents with urls

explain analyze select comment from papers where document_token_types(comment) && '{url}';


 Bitmap Heap Scan on papers  (cost=40.80..8135.97 rows=2098 width=67) (actual time=6.811..27.679 rows=15483 loops=1)
   Recheck Cond: (document_token_types(comment) && '{url}'::text[])
   ->  Bitmap Index Scan on fts_types_idx  (cost=0.00..40.28 rows=2098 width=0) (actual time=3.820..3.820 rows=15483 loops=1)
         Index Cond: (document_token_types(comment) && '{url}'::text[])
 Total runtime: 33.124 ms

The list of available token types, supported by parser is available as

arxiv=# select * from  ts_token_type('default');