I created a wordlist from a single text file and it worked fine. Steps:

Process operators: Retrieve Data > Nominal to Text > Process Documents from Data
Sub-process operators: Tokenize > Transform Cases > Filter Tokens (by Length) > Filter Stopwords (English) > Stem (Snowball)
Vector creation is TF-IDF; prune method is absolute, with prune below 1 and prune above 5 (since it is a file with very few rows).

Sample words from the resulting wordlist: analyst, cloud, clear

But when I did the same with a corpus of text files, the resulting wordlist had character spacing within each word. Steps:

Process Documents from Files (selecting the corpus directory), with vector creation as TF-IDF and prune method as none
Sub-process: Tokenize > Filter Stopwords (English) > Filter Tokens (by Length) > Stem (Snowball) > Transform Cases (to lowercase)

I set a breakpoint at Tokenize and noticed that the character spacing within words appeared at that stage itself. The source text files did not show this issue when I opened them. Sample words from the resulting wordlist:
a g i l e
e m p o w e r
b i g d a t a
Could someone please help resolve this? Are there any other parameters to set, or other operators to use in the process, to fix this issue and create a more meaningful, focused wordlist? Thanks, Geeta