2010-01-12

What keywords are in text

The problem: Find what keywords are in text. We have papers and kw (keywords). kw.ts is a plainto_tsquery(kw.name).

```          Table "public.papers"
Column       |   Type   | Modifiers
-------------------+----------+-----------
id                | integer  |
oai_id            | text     |
datestamp         | date     |
title             | text     |
authors           | text     |
description       | text     |
comment           | text     |
creation_date     | date     |
modification_date | date     |
fts               | tsvector |
Indexes:
"fts_idx" gin (fts)
"fts_types_idx" gin (document_token_types(comment))
"id_idx" btree (id)
Table "public.kw"
Column   |         Type          | Modifiers
-----------+-----------------------+-----------
key_id    | integer               |
name      | character varying(64) |
status_id | integer               |
ts        | tsquery               |

arxiv=# select papers.title, array_agg(kw.name) from papers
join kw on papers.fts @@ kw.ts and papers.id=2
group by papers.title;

title                   |            array_agg
------------------------------------------+---------------------------------
Sparsity-certifying Graph Decompositions | {color,Williams,trees,colorful}
(1 row)

```

In principle, it's possible to use this approach to find plagiarism.

Other way is to use ts_stat) function, which decomposes tsvector on words.

```CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT word text, OUT ndoc
integer, OUT nentry integer)
RETURNS SETOF record AS
\$\$
SELECT ts_stat('SELECT ' || quote_literal( \$1::text ) || '::tsvector', quote_literal( \$2::text) );
\$\$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

select ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2);

```