It looks like we (Oleg Bartunov, Teodor Sigaev) have made the first working patch for Nepali language (Devanagari script) support in text search! We introduced Virama/Halanta support and the Spacing Combining category. It took a lot of reading of Unicode documents :)
Thanks to Dibyendra Hyoju and Bal Krishna Bal for testing and valuable discussions!
postgres=# set client_encoding to UTF8;
SET
Time: 0.119 ms
postgres=# select * from ts_parse('default','मदन पुरस्कार पुस्तकालय');
 tokid | token
-------+---------
     2 | मदन
    12 |  
     2 | पुरस्कार
    12 |  
     2 | पुस्तकालय
(5 rows)
'मदन पुरस्कार पुस्तकालय' - Madan Puraskar Pustakalaya, name of an entity
This looks pretty trivial, but it was actually not easy. Below is the same string as displayed by the uniname program. Notice DEVANAGARI SIGN VIRAMA and the DEVANAGARI VOWEL SIGNs: the locale lists them as punct, so they would break words, which would be wrong!
character  byte  UTF-32  encoded as  glyph  name
        0     0  00092E  E0 A4 AE    म      DEVANAGARI LETTER MA
        1     3  000926  E0 A4 A6    द      DEVANAGARI LETTER DA
        2     6  000928  E0 A4 A8    न      DEVANAGARI LETTER NA
        3     9  000020  20                 SPACE
        4    10  00092A  E0 A4 AA    प      DEVANAGARI LETTER PA
        5    13  000941  E0 A5 81    ु      DEVANAGARI VOWEL SIGN U
        6    16  000930  E0 A4 B0    र      DEVANAGARI LETTER RA
        7    19  000938  E0 A4 B8    स      DEVANAGARI LETTER SA
        8    22  00094D  E0 A5 8D    ्      DEVANAGARI SIGN VIRAMA
        9    25  000915  E0 A4 95    क      DEVANAGARI LETTER KA
       10    28  00093E  E0 A4 BE    ा      DEVANAGARI VOWEL SIGN AA
       11    31  000930  E0 A4 B0    र      DEVANAGARI LETTER RA
       12    34  000020  20                 SPACE
       13    35  00092A  E0 A4 AA    प      DEVANAGARI LETTER PA
       14    38  000941  E0 A5 81    ु      DEVANAGARI VOWEL SIGN U
       15    41  000938  E0 A4 B8    स      DEVANAGARI LETTER SA
       16    44  00094D  E0 A5 8D    ्      DEVANAGARI SIGN VIRAMA
       17    47  000924  E0 A4 A4    त      DEVANAGARI LETTER TA
       18    50  000915  E0 A4 95    क      DEVANAGARI LETTER KA
       19    53  00093E  E0 A4 BE    ा      DEVANAGARI VOWEL SIGN AA
       20    56  000932  E0 A4 B2    ल      DEVANAGARI LETTER LA
       21    59  00092F  E0 A4 AF    य      DEVANAGARI LETTER YA
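To see why the combining signs matter, here is a small Python sketch. It is not the PostgreSQL parser code and does not reproduce the locale's classification; it only uses the Unicode general categories from Python's standard unicodedata module, where VIRAMA and VOWEL SIGN U are nonspacing marks (Mn) and VOWEL SIGN AA is a spacing combining mark (Mc). The helpers naive_split and split_words are hypothetical names invented for this illustration.

```python
import unicodedata

TEXT = "मदन पुरस्कार पुस्तकालय"


def is_letter(ch):
    # Unicode general categories starting with "L" are letters.
    return unicodedata.category(ch).startswith("L")


def is_word_char(ch):
    # Assumption for this sketch: letters plus nonspacing (Mn) and
    # spacing combining (Mc) marks belong to a word.
    return is_letter(ch) or unicodedata.category(ch) in ("Mn", "Mc")


def split_by(text, keep):
    # Collect maximal runs of characters for which keep(ch) is true.
    words, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def naive_split(text):
    # Letters only: virama and vowel signs act as word breaks.
    return split_by(text, is_letter)


def split_words(text):
    # Letters and combining marks: words stay intact.
    return split_by(text, is_word_char)


if __name__ == "__main__":
    for ch in TEXT:
        print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
    print(naive_split(TEXT))
    print(split_words(TEXT))
```

With letters only, the second and third words fall apart at every vowel sign and virama; once the Mn and Mc categories count as word characters, the string splits into the same three words that ts_parse returns above.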
The next step is to port the Nepali stemmer, so we can provide a default text search configuration for Nepali.
Also, we need to improve hunspell support, so that Nepali ispell dictionaries can be used with text search!
This project is volunteer work to support the promotion of PostgreSQL in Nepal (btw, there are elephants there). I will visit Nepal this April and establish closer connections with Nepali developers.