IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters: https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits .

However, the trained model can then classify any text of at most 2048 characters: https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase .

This difference confuses me: I have always understood that the same pre-processing should be applied in both the training phase and the production phase, so if I have to cap the training data at 1024 characters, I would do the same in production.

Is my reasoning correct? Should I cap the text in production at 1024 characters (as I believe I should), or at 2048 characters (perhaps because 1024 characters are too few)?
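For context, the "capping" in question is plain truncation. A minimal sketch (the two limits come from the linked docs; the helper name is mine):

```python
# Character limits from the NLC docs cited above.
TRAIN_LIMIT = 1024      # max length of a training-set text value
CLASSIFY_LIMIT = 2048   # max length of a phrase sent for classification

def cap_text(text, limit):
    """Truncate text to at most `limit` characters."""
    return text[:limit]

print(len(cap_text("a" * 5000, TRAIN_LIMIT)))     # 1024
print(len(cap_text("a" * 5000, CLASSIFY_LIMIT)))  # 2048
```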

Thank you in advance!


There is 1 answer

Vidyasagar Machupalli (best answer)

Recently, I had the same question, and one of the answers in an article clarified it:

Currently, the limits are set at 1024 characters for training and 2048 for testing/classification. The 1024 limit may require some curation of the training data prior to training. Most organizations that require larger character limits for their data end up chunking their input text into 1024-character chunks. Additionally, in use cases with data similar to the Airbnb reviews, the primary category can typically be assessed within the first 2048 characters, since there is often a lot of noise in lengthy reviews.
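The chunking approach described above can be sketched as follows. This is a naive fixed-size split on character count, not anything the NLC service itself provides; a real pipeline would likely split on sentence or word boundaries instead:

```python
TRAIN_LIMIT = 1024  # NLC training-text limit, per the docs

def chunk_text(text, limit=TRAIN_LIMIT):
    """Split text into consecutive chunks of at most `limit` characters,
    so each chunk can be used as a separate training example."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

review = "x" * 2500  # stand-in for a long review
chunks = chunk_text(review)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Each chunk would then be paired with the original text's class label to form the training rows.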

Here's the link to the article