2006-05-18

Japanese Language Parsing

MeCab looks like an interesting piece of software. I had been wondering how to break Japanese text into keywords, the way a search engine has to. It turns out to be harder than in English, where you can mostly just split on spaces: Japanese is written without spaces between words, so a parser has to work out the word boundaries itself. MeCab has Perl bindings, too.
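To see why splitting on spaces gets you nowhere, here is a small Python sketch. The word list is hand-made for this one sentence and the greedy longest-match function is my own toy, not how MeCab works; it is only meant to show that some kind of word list or model has to supply the boundaries.

```python
# Toy illustration only: a hand-made word list, not a real dictionary.
TOY_WORDS = {"私", "は", "日本語", "日本", "語", "を", "勉強", "し", "て", "い", "ます"}


def longest_match(text, words=TOY_WORDS):
    """Greedy left-to-right segmentation: take the longest known word at each position."""
    max_len = max(len(w) for w in words)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in words:
                break
        else:
            # Nothing matched: fall back to a single unknown character.
            piece = text[i]
        tokens.append(piece)
        i += len(piece)
    return tokens


sentence = "私は日本語を勉強しています"  # "I am studying Japanese"

# No spaces in the sentence, so split() hands the whole thing back as one token.
print(sentence.split())

# The toy segmenter finds word boundaries using the word list.
print(longest_match(sentence))
```

Greedy matching like this breaks down as soon as two different splits are both made of real words, which is where a statistical parser earns its keep.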

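MeCab apparently uses Markov models to parse text. Its design isn't tied to any particular dictionary or corpus (it still needs a dictionary to work from, just not one specific one), and it uses "conditional random fields" to estimate its probability data from a training corpus. Cool!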
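As a rough picture of what "pick the most probable parse" can look like, here is a Python sketch that scores every way of splitting a string into known words and keeps the cheapest one. The words, the costs, and the function name are all invented for illustration; a real parser would learn its costs from data and would also score how well neighbouring words fit together, which this toy ignores.

```python
# Invented costs for illustration only; lower means "more plausible".
TOY_COSTS = {"くるま": 4, "くる": 3, "まで": 2, "で": 2, "まつ": 3}


def min_cost_segmentation(text, costs=TOY_COSTS):
    """Dynamic programming over every segmentation of text into known words."""
    inf = float("inf")
    best = [inf] * (len(text) + 1)   # best[i]: cheapest cost to segment text[:i]
    back = [None] * (len(text) + 1)  # back[i]: start index of the last word on that path
    best[0] = 0
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if best[-1] == inf:
        return None, inf  # no segmentation using only known words
    # Walk the back-pointers from the end to recover the words.
    tokens = []
    pos = len(text)
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1], best[-1]


# Two splits are possible here: くるま|で|まつ and くる|まで|まつ.
print(min_cost_segmentation("くるまでまつ"))

# Making くるま cheaper flips the choice to the other split.
print(min_cost_segmentation("くるまでまつ", {**TOY_COSTS, "くるま": 2}))
```

The point is just that the same string can be cut up in more than one valid way, and the numbers decide which cut wins.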

According to the MeCab page, other parsers include ChaSen, JUMAN, and KAKASI. In my searches, the last of these, KAKASI, was cited quite a bit.
