2010-01-12

What keywords are in text

The problem: Find what keywords are in text. We have papers and kw (keywords). kw.ts is a plainto_tsquery(kw.name).

          Table "public.papers"
      Column       |   Type   | Modifiers
-------------------+----------+-----------
 id                | integer  |
 oai_id            | text     |
 datestamp         | date     |
 title             | text     |
 authors           | text     |
 description       | text     |
 comment           | text     |
 creation_date     | date     |
 modification_date | date     |
 fts               | tsvector |
Indexes:
    "fts_idx" gin (fts)
    "fts_types_idx" gin (document_token_types(comment))
    "id_idx" btree (id)
Table "public.kw"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 key_id    | integer               |
 name      | character varying(64) |
 status_id | integer               |
 ts        | tsquery               |

arxiv=# select papers.title, array_agg(kw.name) from papers 
join kw on papers.fts @@ kw.ts and papers.id=2 
group by papers.title;

                  title                   |            array_agg
------------------------------------------+---------------------------------
 Sparsity-certifying Graph Decompositions | {color,Williams,trees,colorful}
(1 row)

In principle, it's possible to use this approach to find plagiarism.

Other way is to use ts_stat) function, which decomposes tsvector on words.

CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT word text, OUT ndoc
integer, OUT nentry integer)
RETURNS SETOF record AS
$$
    SELECT ts_stat('SELECT ' || quote_literal( $1::text ) || '::tsvector', quote_literal( $2::text) );
$$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

select ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2);