One Hot Encoding for words from a text corpus

Question

Asked by Shadab Shaikh on 06 January 2017 at 10:17

How can I one-hot encode words using TensorFlow, so that each word is represented by a sparse vector of vocabulary size with a 1 at that word's index and 0 everywhere else?

Something like:

oneHotEncoding(words = ['a','b','c','d']) -> [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] ?

There is 1 answer

**GrimTrigger** · Answer 1 · 2017-11-12T18:37:46+00:00

scikit-learn's OneHotEncoder takes an integer array (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Building on your example, you could use a dictionary to map words to integers and go from there:

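import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Map each word to an integer index.
wdict = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
# list() is needed on Python 3, where dict.values() returns a view rather than a list.
dictarr = np.asarray(list(wdict.values())).reshape(-1, 1)
enc = OneHotEncoder()
enc.fit(dictarr)
# Encode index 2, i.e. the word 'c'.
enc.transform([[2]]).toarray()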

which yields

array([[ 0.,  0.,  1.,  0.]])
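
If you want the full matrix from the question rather than a single row, the same word-to-integer idea can be sketched with NumPy alone. This is only a minimal illustration, not part of the answer above: the function name `one_hot_encode` is made up for this example, and it assumes the vocabulary is built from the input list itself, in order of first appearance.

```python
import numpy as np

def one_hot_encode(words):
    """Return (vocab, matrix): one row per input word, one column per vocabulary entry."""
    # Hypothetical helper: give each distinct word an index in order of first appearance.
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    # Look up the integer index of every input word.
    indices = np.array([vocab[word] for word in words], dtype=int)
    # Row i of the identity matrix is the one-hot vector for index i.
    matrix = np.eye(len(vocab), dtype=int)[indices]
    return vocab, matrix

vocab, encoded = one_hot_encode(['a', 'b', 'c', 'd'])
print(encoded.tolist())
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Since the question asks about TensorFlow: the same integer indices could be passed to `tf.one_hot(indices, depth=len(vocab))` instead of indexing into `np.eye`, assuming TensorFlow is installed.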
