JavaScript word tokenizer library with support for multiple languages (as many as possible)

I am looking for a word tokenizer library for Node.js that supports as many languages as possible. I'd like to pass in a string and a language code, e.g. tokenize('Hello, world!', 'en'), and have it return ['Hello', 'world']. The number of supported languages is more important than precision.

There are 2 answers

prtksxna

Wink's tokenizer supports two scripts (Latin and Devanagari) and all the languages written in them. It also detects the language automatically, so you don't need to pass a language code and can just write:

var tokenizer = require( 'wink-tokenizer' );
// Create a tokenizer instance.
var t = tokenizer();
// Each call returns an array of token objects (value plus tag), not plain strings.
t.tokenize( 'This sentence is in English' );
t.tokenize( 'Mieux vaut prévenir que guérir:-)' );
t.tokenize( 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।' );

You can check out the docs at https://winkjs.org/wink-tokenizer/.
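
Since the question asks for a plain array such as ['Hello', 'world'], here is a minimal sketch of reducing the tokenizer's output to words only. It assumes each token is an object of the form { value, tag } and that ordinary words carry the tag 'word'; the helper name wordsOnly and the hand-written sampleTokens array are hypothetical and only stand in for a real t.tokenize( 'Hello, world!' ) result, so please check the docs for the exact tags in your version.

```javascript
// Assumption: tokens look like { value: string, tag: string } and
// ordinary words are tagged 'word'. wordsOnly is a hypothetical helper,
// not part of wink-tokenizer itself.
function wordsOnly(tokens) {
  return tokens
    .filter(function (token) { return token.tag === 'word'; })
    .map(function (token) { return token.value; });
}

// Hand-written sample in the assumed shape, standing in for the result of
// t.tokenize( 'Hello, world!' ) so the sketch runs without the package.
var sampleTokens = [
  { value: 'Hello', tag: 'word' },
  { value: ',', tag: 'punctuation' },
  { value: 'world', tag: 'word' },
  { value: '!', tag: 'punctuation' }
];

console.log(wordsOnly(sampleTokens)); // [ 'Hello', 'world' ]
```

With the real library you would call wordsOnly( t.tokenize( text ) ). Filtering on a single tag drops numbers, emoticons and the like; widen the filter if you want to keep those.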

Keon Kim

How about Natural?

It is relatively new and still unstable, but it has many language plugins:

https://github.com/NaturalNode/natural