How do I create a sklearn.datasets.base.Bunch object in scikit-learn from my own data?

In most scikit-learn algorithms, the data must be loaded as a Bunch object. Many examples in the tutorial use load_files() or other functions to populate the Bunch object. Functions like load_files() expect the data to be laid out in a certain format, but my data is stored in a different format, namely a CSV file with strings for each field.

How do I parse this and load data in the Bunch object format?

There are 3 answers

ogrisel On BEST ANSWER

You don't have to create Bunch objects. They are just useful for loading the internal sample datasets of scikit-learn.

You can directly feed a list of Python strings to your vectorizer object.
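
As a minimal sketch of that approach, the snippet below reads text and labels from CSV content and passes the list of strings straight to a vectorizer. The in-memory CSV and its column names `text` and `label` are assumptions made for illustration; the asker's real file layout is not shown in the question.

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical CSV content standing in for the asker's file; the column
# names "text" and "label" are assumptions made for this sketch.
csv_text = """text,label
some text,0
another example text,1
example 3,0
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
texts = [row["text"] for row in rows]          # plain list of Python strings
labels = [int(row["label"]) for row in rows]   # one integer label per row

# The vectorizer accepts the list of strings directly; no Bunch is involved.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix, one row per document

print(X.shape[0], "documents,", X.shape[1], "distinct tokens")
```

For a real file, replacing `io.StringIO(csv_text)` with an opened file handle should be the only change needed, provided the column names match.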

Hugh Perkins On

You can do it like this:

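import numpy as np
# Bunch lives in sklearn.utils in current scikit-learn releases;
# the older sklearn.datasets.base module is no longer importable.
from sklearn.utils import Bunch

# One string per sample
examples = []
examples.append('some text')
examples.append('another example text')
examples.append('example 3')

# One integer class label per sample
target = np.zeros((3,), dtype=np.int64)
target[0] = 0
target[1] = 1
target[2] = 0
dataset = Bunch(data=examples, target=target)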
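
For context, a Bunch is a dict subclass whose keys can also be read as attributes, so a plain dict or two separate variables would serve equally well. The short sketch below rebuilds the same three-example dataset and checks both access styles.

```python
import numpy as np
from sklearn.utils import Bunch

dataset = Bunch(
    data=["some text", "another example text", "example 3"],
    target=np.array([0, 1, 0], dtype=np.int64),
)

# Attribute access and key access return the very same objects.
assert dataset.data is dataset["data"]
assert dataset.target is dataset["target"]

print(sorted(dataset.keys()))     # the field names passed to the constructor
print(isinstance(dataset, dict))  # Bunch behaves like an ordinary dict
```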
Gabriel Martinez Cruz On

This is an example using the Breast Cancer Wisconsin (Diagnostic) Data Set; you can find the CSV file on Kaggle:

  1. Columns 2 through 31 of the CSV file (usecols=range(2,32)) hold the feature values that become X_train and X_test. They are stored under the Bunch key named data

    from numpy import genfromtxt
    data = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1,  usecols=range(2,32))
    
  2. Column 1 of the CSV file (column B in a spreadsheet, usecols=(1) in NumPy) holds the labels that become y_train and y_test. It is stored under the Bunch key named target

    import pandas as pd
    target = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1, usecols=(1), dtype=str)
    

    The next steps convert the string labels into integers, similar to the targets in scikit-learn's built-in datasets. Everything could be done on a single variable; the separate names target, target2, ... are used only to show each step.

  3. First convert the NumPy array into a pandas Series

    target2 = pd.Series(target)
    
  4. The Series is needed in order to use the rank function; with method='dense', each distinct label gets a consecutive rank starting at 1. You could skip step 5

    target3 = target2.rank(method='dense', axis=0)
    
  5. This only maps the target to 0 or 1, like the example in the book: even ranks become 1 and odd ranks become 0

    target4 = (target3 % 2 == 0) * 1 
    
  6. Extract the values as a NumPy array

    target5 = target4.values
    

Here I reused Hugh Perkins's solution:

# Bunch lives in sklearn.utils in current scikit-learn releases
from sklearn.utils import Bunch
dataset = Bunch(data=data, target=target5)
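
The steps above can also be condensed with pandas alone. This is a sketch under stated assumptions, not the answer's original method: the in-memory CSV uses made-up rows and hypothetical column names (`id`, `diagnosis`, `f1` to `f3`) that only mimic the layout described above, and `pd.factorize` is swapped in for the rank-and-modulo steps.

```python
import io

import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# Made-up rows mimicking the layout described above: an id column, a string
# label column, then numeric feature columns. Values are illustrative only.
csv_text = """id,diagnosis,f1,f2,f3
101,M,17.9,10.4,122.8
102,B,13.5,14.4,87.5
103,B,12.4,15.7,82.6
104,M,20.6,17.8,132.9
"""

df = pd.read_csv(io.StringIO(csv_text))

# Every column from index 2 onward is treated as a numeric feature.
data = df.iloc[:, 2:].to_numpy(dtype=np.float64)

# factorize with sort=True assigns integer codes in sorted label order,
# so 'B' maps to 0 and 'M' maps to 1 here.
codes, target_names = pd.factorize(df["diagnosis"], sort=True)
target = codes.astype(np.int64)

dataset = Bunch(
    data=data,
    target=target,
    feature_names=list(df.columns[2:]),
    target_names=np.asarray(target_names),
)

print(dataset.data.shape)
print(dataset.target)
print(dataset.target_names)
```

Keeping `target_names` alongside the integer codes makes it possible to map predictions back to the original string labels later.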