2009-02-26

Nepali support for text search!

It looks like we (Oleg Bartunov and Teodor Sigaev) have produced the first working patch for Nepali language (Devanagari script) support in text search! We introduced support for Virama/Halanta and for the Spacing Combining character category. It took a lot of reading of Unicode documents :)

Thanks to Dibyendra Hyoju and Bal Krishna Bal for testing and valuable discussion!

postgres=# set client_encoding to UTF8;
SET
Time: 0.119 ms
postgres=# select * from ts_parse('default','मदन पुरस्कार पुस्तकालय');
 tokid |  token  
-------+---------
     2 | मदन
    12 |  
     2 | पुरस्कार
    12 |  
     2 | पुस्तकालय
(5 rows)

'मदन पुरस्कार पुस्तकालय' - Madan Puraskar Pustakalaya, name of an entity

This looks pretty trivial, but it actually was not easy. Below is the same string, displayed using the uniname program. Notice DEVANAGARI SIGN VIRAMA and the DEVANAGARI VOWEL SIGNs: the locale lists them as punct, so they would break words, which would be wrong!

character  byte       UTF-32   encoded as     glyph   name
        0          0  00092E   E0 A4 AE       म      DEVANAGARI LETTER MA
        1          3  000926   E0 A4 A6       द      DEVANAGARI LETTER DA
        2          6  000928   E0 A4 A8       न      DEVANAGARI LETTER NA
        3          9  000020   20                     SPACE
        4         10  00092A   E0 A4 AA       प      DEVANAGARI LETTER PA
        5         13  000941   E0 A5 81       ु      DEVANAGARI VOWEL SIGN U
        6         16  000930   E0 A4 B0       र      DEVANAGARI LETTER RA
        7         19  000938   E0 A4 B8       स      DEVANAGARI LETTER SA
        8         22  00094D   E0 A5 8D       ्      DEVANAGARI SIGN VIRAMA
        9         25  000915   E0 A4 95       क      DEVANAGARI LETTER KA
       10         28  00093E   E0 A4 BE       ा      DEVANAGARI VOWEL SIGN AA
       11         31  000930   E0 A4 B0       र      DEVANAGARI LETTER RA
       12         34  000020   20                     SPACE
       13         35  00092A   E0 A4 AA       प      DEVANAGARI LETTER PA
       14         38  000941   E0 A5 81       ु      DEVANAGARI VOWEL SIGN U
       15         41  000938   E0 A4 B8       स      DEVANAGARI LETTER SA
       16         44  00094D   E0 A5 8D       ्      DEVANAGARI SIGN VIRAMA
       17         47  000924   E0 A4 A4       त      DEVANAGARI LETTER TA
       18         50  000915   E0 A4 95       क      DEVANAGARI LETTER KA
       19         53  00093E   E0 A4 BE       ा      DEVANAGARI VOWEL SIGN AA
       20         56  000932   E0 A4 B2       ल      DEVANAGARI LETTER LA
       21         59  00092F   E0 A4 AF       य      DEVANAGARI LETTER YA
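
To see why these marks matter, one can look at their Unicode general categories. The following is a minimal Python sketch, not the code from our patch: it uses Python's standard unicodedata module instead of the C locale the parser relies on, and split_words with its keep_marks flag are hypothetical names made up for illustration. It keeps combining marks (categories Mn and Mc) inside a word, and shows what happens when they are treated as separators instead.

```python
import unicodedata

# The sample string from the ts_parse example.
TEXT = "मदन पुरस्कार पुस्तकालय"


def is_word_char(ch, keep_marks=True):
    """Return True if ch should be kept inside a word.

    Letters (general category L*) always count. Combining marks
    (M*, which includes Mn and Mc) count only when keep_marks is True.
    """
    major = unicodedata.category(ch)[0]
    if major == "L":
        return True
    return keep_marks and major == "M"


def split_words(text, keep_marks=True):
    """Split text into runs of word characters (hypothetical helper)."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch, keep_marks):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Show the general category of the marks listed in the table.
for code in (0x0941, 0x094D, 0x093E):
    ch = chr(code)
    print(f"U+{code:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# Marks kept inside words versus marks acting as separators.
print(split_words(TEXT))
print(split_words(TEXT, keep_marks=False))
```

With keep_marks=False the virama and the vowel signs cut each word into fragments, which is the wrong behaviour described above. With the default setting the string splits only at the spaces.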

The next step is to port a Nepali stemmer, so that we can provide a default text search configuration for Nepali.

We also need to improve hunspell support, so that Nepali ispell dictionaries can be used with text search!

This project is volunteer work to support PostgreSQL promotion in Nepal (btw, there are elephants there). I will visit Nepal this April and will establish closer connections with Nepali developers.