Encoding String to numbers so as to use it in scikit-learn

Question

Asked by Huga on 16 June 2015 at 13:42 · 7.8k views

My data consists of 50 columns and most of them are strings. I have a single multi-class variable which I have to predict. I tried using LabelEncoder in scikit-learn to convert the features (not classes) into whole numbers and feed them as input to the RandomForest model I am using. I am using RandomForest for classification.

Now, when new test data arrives (as a stream of new data), how will I know which label each string in each column should get? Fitting LabelEncoder again on the new data gives me labels that are independent of the ones I generated before. Am I doing this wrong? Is there something else I should use for consistent encoding?

There are 2 answers

Chung-Yen Hung On 17 June 2015 at 03:15

You could save the string -> label mapping learned from the training data for each column.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> le.fit(col_1)
LabelEncoder()
>>> # classes_ is sorted, so each string's position is its integer label
>>> {label: index for index, label in enumerate(le.classes_.tolist())}
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}

When the test data arrives, you can use those mappings to encode the corresponding columns. You do not have to fit the encoder again on the test data.
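A minimal sketch of that idea in plain Python, in case it helps to see the mappings kept per column and reused on new data. The helper names (`build_mappings`, `encode`), the `city` column, and the `-1` sentinel for strings that never appeared in training are all assumptions made for illustration, not part of scikit-learn.

```python
# Hypothetical helpers; the column data is made up for illustration.
UNKNOWN = -1  # assumed sentinel for strings not seen during training

def build_mappings(train_columns):
    """Return {column name: {string: integer}} learned from training data."""
    mappings = {}
    for name, values in train_columns.items():
        # Sort the unique values so the mapping is deterministic,
        # which matches the order LabelEncoder uses for classes_.
        mappings[name] = {s: i for i, s in enumerate(sorted(set(values)))}
    return mappings

def encode(columns, mappings):
    """Encode columns with saved mappings; unseen strings get UNKNOWN."""
    return {
        name: [mappings[name].get(s, UNKNOWN) for s in values]
        for name, values in columns.items()
    }

train = {"city": ["paris", "paris", "tokyo", "amsterdam"]}
test = {"city": ["tokyo", "berlin", "amsterdam"]}

mappings = build_mappings(train)      # learned once, from training data only
encoded_test = encode(test, mappings)
print(encoded_test)  # {'city': [2, -1, 0]}
```

Because the mappings are ordinary dictionaries, they can be stored alongside the trained model and reloaded whenever new data arrives.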

**lmjohns3** · Accepted Answer · 2015-06-17T06:55:20+00:00

The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
LabelEncoder()
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')

Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:

>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])

The use of this encoder basically assumes that you know beforehand what all the labels are in all of your data. If you have labels that might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
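To make the unseen-label case concrete, here is a hedged sketch. It shows that a fitted LabelEncoder raises a ValueError when asked to transform a label it never saw, and then one possible way to handle that for feature columns: OrdinalEncoder with `handle_unknown="use_encoded_value"`, which assumes scikit-learn 0.24 or later. The city data and the choice of `-1` for unknowns are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

le = LabelEncoder().fit(["a", "e", "b", "z"])
try:
    le.transform(["a", "q"])  # "q" was not seen during fit
    unseen_error = None
except ValueError as exc:
    unseen_error = exc  # LabelEncoder refuses labels it has not seen

# OrdinalEncoder works on 2-D feature arrays and can map unseen
# categories to a chosen value (assumes scikit-learn >= 0.24).
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = np.array([["paris"], ["tokyo"], ["amsterdam"]], dtype=object)
X_new = np.array([["tokyo"], ["berlin"]], dtype=object)

enc.fit(X_train)                    # learn categories from training data only
encoded_new = enc.transform(X_new)  # "berlin" is unseen and becomes -1
```

Whether a single "unknown" code is acceptable depends on the model and the data; it is one option among several, not a recommendation from the answer above.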
