How do I create a sklearn.datasets.base.Bunch object in scikit-learn from my own data?

19.4k views Asked by David At 07 December 2024 at 04:50

In most of the scikit-learn algorithms, the data must be loaded as a Bunch object. In many examples in the tutorial, load_files() or other functions are used to populate the Bunch object. Functions like load_files() expect the data to be in a certain format, but my data is stored in a different format, namely a CSV file with strings for each field.

How do I parse this and load the data into the Bunch object format?

Original Q&A

There are 3 answers

Hugh Perkins On 21 December 2016 at 12:15

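You can do it like this:

```
import numpy as np
# Recent scikit-learn releases expose Bunch from sklearn.utils;
# older releases also had it at sklearn.datasets.base.Bunch.
from sklearn.utils import Bunch

# The raw examples: one string per sample
examples = []
examples.append('some text')
examples.append('another example text')
examples.append('example 3')

# One integer class label per example
target = np.zeros((3,), dtype=np.int64)
target[0] = 0
target[1] = 1
target[2] = 0
dataset = Bunch(data=examples, target=target)
```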
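For the CSV case described in the question, the same idea might look like the sketch below. It assumes a file with a header row and two columns; the column names `text` and `label` and the inline CSV content are made up for illustration, and `Bunch` is imported from `sklearn.utils`, which is where recent scikit-learn releases expose it.

```python
import csv
import io

import numpy as np
from sklearn.utils import Bunch

# Hypothetical CSV content standing in for a real file. With a file on disk,
# replace io.StringIO(csv_text) with open("your_file.csv", newline="").
csv_text = """text,label
some text,0
another example text,1
example 3,0
"""

examples = []
labels = []
for row in csv.DictReader(io.StringIO(csv_text)):
    examples.append(row["text"])      # keep the raw string as the sample
    labels.append(int(row["label"]))  # convert the label field to an integer

dataset = Bunch(data=examples, target=np.array(labels, dtype=np.int64))

# A Bunch is a dict subclass whose keys are also available as attributes.
print(dataset.data[0])
print(dataset["target"])
```

Since a Bunch is just a dictionary with attribute access, any extra keys (for example a list of class names) can be passed as additional keyword arguments.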

Gabriel Martinez Cruz On 14 October 2017 at 05:53

This is an example using the Breast Cancer Wisconsin (Diagnostic) Data Set; you can find the CSV file on Kaggle:

Columns 2 through 31 of the CSV file (usecols=range(2,32)) hold the X_train and X_test data; this is stored under the Bunch object key named data
```
from numpy import genfromtxt
data = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1,  usecols=range(2,32))
```
I am interested in column B of the CSV file (column 1 in the NumPy array, usecols=(1)) because it holds the output for y_train and y_test; it is stored under the Bunch object key named target
```
import pandas as pd
target = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1, usecols=(1), dtype=str)
```
There are some tricks to transform the target into the form it has in sklearn. Of course this can be done with a single variable; target, target1, ... are kept separate only to explain what I did.
First, convert the NumPy array into a pandas Series
```
target2 = pd.Series(target)
```
The Series is needed in order to use the rank function; you could skip step number 5
```
target3 = target2.rank(method='dense', axis=0)
```
This only transforms the target into 0 or 1, like the example in the book
```
target4 = (target3 % 2 == 0) * 1 
```
Get the values back as a NumPy array
```
target5 = target4.values
```

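Here I copied Hugh Perkins's solution:

```
# A bare "import sklearn" does not load the datasets submodule, so import Bunch directly.
# Recent scikit-learn releases expose it from sklearn.utils.
from sklearn.utils import Bunch
dataset = Bunch(data=data, target=target5)
```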
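As an alternative to the rank and modulo steps above, the string labels could also be converted to integers with scikit-learn's `LabelEncoder`. This is a sketch, not part of the original answer; the small `target` array of "M"/"B" strings below is a made-up stand-in for the column read by genfromtxt.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column, standing in for the string array read from the CSV.
target = np.array(["M", "B", "B", "M"])

encoder = LabelEncoder()
# Classes are sorted before encoding, so "B" maps to 0 and "M" maps to 1.
encoded = encoder.fit_transform(target)

print(encoder.classes_)  # the original labels, indexed by their integer code
print(encoded)
```

The fitted encoder keeps the mapping, so `encoder.inverse_transform` can recover the original strings later.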

ogrisel On 10 December 2013 at 10:14 BEST ANSWER

You don't have to create Bunch objects. They are just useful for loading the internal sample datasets of scikit-learn.

You can directly feed a list of Python strings to your vectorizer object.
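A minimal sketch of what that could look like, assuming `CountVectorizer` as the vectorizer (the answer does not name a specific one) and reusing the example strings from the answer above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A plain Python list of strings, e.g. one column read from a CSV file.
texts = ["some text", "another example text", "example 3"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
X = vectorizer.fit_transform(texts)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The resulting matrix `X`, together with a separate array of labels, can be passed straight to an estimator's `fit` method; no Bunch is involved.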
