Words in wordlist appear with spaces between characters in RapidMiner

I created a wordlist from a single text file and it worked fine. Steps:

Process operators: Retrieve data > Nominal to Text > Process Documents from Data

Sub-process operators: Tokenize > Transform Cases > Filter Tokens (by Length) > Filter Stopwords (English) > Stem (Snowball)

Vector creation is TF-IDF with the prune method set to absolute, prune below 1 and prune above 5 (since it's a file with very few rows).

Sample words from the resulting wordlist: analyst, cloud, clear

But when I did the same with a corpus of text files, the resulting wordlist had spaces between the characters of each word. Steps:

Process Documents from Files (select the corpus directory), with vector creation as TF-IDF and prune method as none

Sub-process: Tokenize > Filter Stopwords (English) > Filter Tokens (by Length) > Stem (Snowball) > Transform Cases (to lowercase)

I set a breakpoint at Tokenize and noticed that the spacing within words already appears at this stage. The source text files themselves do not show this issue. Sample words from the resulting wordlist:

a g i l e

e m p o w e r
b i g d a t a
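
One thing that may be worth checking (an assumption, not something confirmed for this corpus) is whether the corpus files are saved in a different encoding than the single file. UTF-16 text read as a single-byte encoding has a NUL byte after every ASCII character, which can look like a space inside each word. The sketch below is plain Python run outside RapidMiner; the function names and the NUL-count threshold are illustrative choices, and the result is only a rough heuristic.

```python
import codecs
from pathlib import Path

# Byte-order marks that identify UTF-8-with-BOM and UTF-16 files.
BOMS = {
    codecs.BOM_UTF8: "utf-8-sig",
    codecs.BOM_UTF16_LE: "utf-16-le",
    codecs.BOM_UTF16_BE: "utf-16-be",
}


def guess_encoding(raw: bytes) -> str:
    """Return a rough encoding guess for raw file bytes (heuristic only)."""
    for bom, name in BOMS.items():
        if raw.startswith(bom):
            return name
    # No BOM: many NUL bytes in plain English text suggest UTF-16.
    if raw and raw.count(b"\x00") >= len(raw) // 4:
        return "utf-16 (no BOM, guessed)"
    return "single-byte or utf-8 (guessed)"


def spaced_out(word: str) -> str:
    """Show how a UTF-16 word looks when it is misread as Latin-1."""
    misread = word.encode("utf-16-le").decode("latin-1")
    # Render the interleaved NUL characters as spaces for display.
    return misread.replace("\x00", " ").strip()


def scan(directory: str) -> dict:
    """Map each .txt file in a (hypothetical) corpus directory to a guess."""
    folder = Path(directory)
    if not folder.is_dir():
        return {}
    return {p.name: guess_encoding(p.read_bytes()) for p in sorted(folder.glob("*.txt"))}
```

Calling `scan` on the corpus directory lists a guess per file. If the files turn out to be UTF-16, re-saving them as UTF-8, or selecting the matching encoding wherever the reading operator allows it, would be the next thing to try.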

Could someone please help resolve this? Are there any other parameters to set or any other operators to be used in the process to resolve this issue and to create a more meaningful and concentrated wordlist?

Thanks, Geeta

There are 0 answers