I am using a fastText model to predict labels for text.
Usually fastText classifies text at the word level, for example:
import fasttext

model = fasttext.train_supervised(input="training_fasttextFormat.csv", lr=0.1, epoch=50, loss='hs', wordNgrams=2, dim=200)
print(model.test('testing_fasttextFormat.csv'))
But the parameter documentation at https://fasttext.cc/docs/en/options.html suggests fastText can work at the character level as well:
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
But I am not sure how to use these parameters to run fastText at the character level. Could anyone give an example?
If you're referring to the minn & maxn parameters: in the classic non-classification (not supervised) FastText modes, those control FastText's main difference from the original word2vec: learning vectors for word-fragments, in addition to full-word vectors.

Such word-fragment vectors can then be used to synthesize word-vectors for words that weren't seen during training – "out of vocabulary" (or "OOV") words. These synthesized vectors often work fairly well, or at least better than nothing, especially for things like typos or words whose roots hint strongly at their meaning.
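For instance, in the non-supervised modes a model trained with non-zero minn/maxn can synthesize a vector even for a word it never saw. A minimal sketch, assuming some unlabeled text file unsup_corpus.txt (a hypothetical filename):

import fasttext

# skipgram mode; minn/maxn default to 3/6 here, so char n-grams are learned
model = fasttext.train_unsupervised('unsup_corpus.txt', model='skipgram')

# even a made-up word gets a vector, synthesized from its char n-grams
# (a plain word2vec model would simply have no entry for it)
vec = model.get_word_vector('fasttextish')
print(vec.shape)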
I suspect the excerpt you've quoted only shows 0 as the defaults for minn and maxn in supervised mode, and you'd see other defaults if executing fasttext skipgram (etc.) without arguments. (Setting these parameters to 0 actually makes FastText for word-modeling essentially plain word2vec.)

That supervised mode seems to default these to 0 may imply the creators of FastText didn't think, or find, the subword vectors to be as useful in the classification case. But you could certainly try setting them to other values, and check whether they improve your classification results over the defaults.
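For example, a sketch reusing your own call, with illustrative (not recommended) values of minn=2 and maxn=5 to turn character n-grams on:

import fasttext

# same supervised call as before, plus char n-grams of length 2 to 5;
# minn=2, maxn=5 are arbitrary values to tune, not known-good settings
model = fasttext.train_supervised(input="training_fasttextFormat.csv",
                                  lr=0.1, epoch=50, loss='hs',
                                  wordNgrams=2, dim=200,
                                  minn=2, maxn=5)
print(model.test('testing_fasttextFormat.csv'))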