Scikit-Learn#
What is a `Bunch` in `sklearn`?#
A `Bunch` in `sklearn` is a dictionary-like object used to hold datasets in a structured format. It is returned by `datasets.load_iris()` and the other dataset loaders in `sklearn`. The `Bunch` object behaves like a dictionary but also provides attribute-style access to its keys, making it user-friendly and intuitive.
Key Characteristics of `Bunch`:#
- **Dictionary-Like Behavior:** you can access its elements using dictionary syntax (`bunch['key']`).
- **Attribute-Style Access:** keys in the `Bunch` can also be accessed as attributes (`bunch.key`).
- **Components:** a `Bunch` typically contains:
  - `data`: the main feature matrix as a NumPy array.
  - `target`: the target labels as a NumPy array.
  - `feature_names`: names of the features (columns).
  - `target_names`: names of the classes or target labels.
  - `DESCR`: a description of the dataset.
  - `filename`: path to the dataset file (if applicable).
- **Convenience for Machine Learning:** the structured nature of a `Bunch` makes it easy to manipulate datasets and integrate them into workflows for machine learning and data analysis, as the sketch after this list shows.
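Both access styles are easy to see by building a `Bunch` by hand. The sketch below is illustrative only: the toy arrays and field names are made up, but `Bunch` itself really is importable from `sklearn.utils`:

```python
from sklearn.utils import Bunch
import numpy as np

# A hand-made Bunch with toy values (hypothetical data, for illustration only)
toy = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0]]),
    target=np.array([0, 1]),
    feature_names=['length (cm)', 'width (cm)'],
)

print(toy['data'])        # dictionary-style access
print(toy.data)           # attribute-style access to the same array
print(toy.feature_names)  # metadata travels alongside the data
```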
Using the Iris Dataset as an Example#
Let's explore the `Bunch` object returned by `datasets.load_iris()`:
```python
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Inspect the type of the object
print(type(iris))  # <class 'sklearn.utils._bunch.Bunch'>

# Accessing keys in the Bunch object
print(iris.keys())

# Access feature data and target labels
print(iris.data[:5])    # first 5 rows of the feature matrix (sepal and petal measurements)
print(iris.target[:5])  # first 5 target labels (encoded as integers 0, 1, 2)

# Access metadata
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # names of the three species

# Access the description of the dataset
print(iris.DESCR)
```
```
<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
.. _iris_dataset:

Iris plants dataset
-------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. dropdown:: References

  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
    Mathematical Statistics" (John Wiley, NY, 1950).
  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments". IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
    on Information Theory, May 1972, 431-433.
  - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
  - Many, many more ...
```
Practical Usage with the Iris Dataset#
1. Converting `Bunch` to a Pandas DataFrame#
You can convert the `Bunch` into a more familiar tabular format for easier analysis:
```python
import pandas as pd

# Convert the feature data to a DataFrame with named columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a column for the encoded target class labels
df['species'] = iris.target

# Map numerical target labels to their corresponding class names
df['species_name'] = df['species'].map(dict(enumerate(iris.target_names)))

# Display the first few rows
print(df.head())
```
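As a side note, newer versions of scikit-learn can do this conversion for you: passing `as_frame=True` to the loader (available since scikit-learn 0.23) returns pandas objects inside the `Bunch`, including a combined `frame`. A brief sketch:

```python
# Ask the loader for pandas objects directly (scikit-learn 0.23+)
iris_frame = datasets.load_iris(as_frame=True)

print(type(iris_frame.frame))   # <class 'pandas.core.frame.DataFrame'>
print(iris_frame.frame.head())  # the four feature columns plus a 'target' column
```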
2. Visualizing the Data#
Using the converted DataFrame, we can visualize the features and relationships in the Iris dataset:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairplot to visualize pairwise feature relationships across species
sns.pairplot(df, hue='species_name', diag_kind='kde')
plt.show()
```
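If the full pairplot is more than you need, a single-feature view works too. The sketch below picks petal length arbitrarily; any of the four feature columns would do:

```python
# Distribution of one feature per species (petal length chosen arbitrarily)
sns.boxplot(data=df, x='species_name', y='petal length (cm)')
plt.title('Petal length by species')
plt.show()
```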
3. Building a Simple Classifier#
We can directly use the `data` and `target` arrays from the `Bunch` to build a machine learning model:
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
```
Benefits of Using `Bunch` in `sklearn`:#
- **Organized Access:** the `Bunch` structure keeps all relevant components of a dataset (features, labels, metadata) neatly organized in one object.
- **Ease of Use:** attribute-style access reduces boilerplate code, making it easier to explore and use datasets.
- **Compatibility:** `Bunch` integrates seamlessly with NumPy, Pandas, and other Python libraries, enabling quick preprocessing and analysis.
- **Standardization:** since the built-in `sklearn` dataset loaders all return the `Bunch` format, you can follow a consistent workflow regardless of the dataset, as the sketch below illustrates.
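To illustrate the standardization point, here is a sketch that swaps in `datasets.load_wine()`; apart from the loader call, the classifier workflow from section 3 is unchanged:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same Bunch interface, different dataset
wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```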
Summary:#
The `Bunch` object in `sklearn` provides a structured, dictionary-like interface for handling datasets. Its ability to store both data and metadata in one object makes it a convenient foundation for machine learning workflows. By converting it into other formats (such as a Pandas DataFrame) or using its attributes directly, you can easily integrate a dataset into preprocessing pipelines and model training.