Scikit-Learn#
What is a `Bunch` in `sklearn`?#
A `Bunch` in `sklearn` is a dictionary-like object used to hold datasets in a structured format. It is returned by `datasets.load_iris()` and the other dataset loaders in `sklearn`. The `Bunch` object behaves like a dictionary but also provides attribute-style access to its keys, making it user-friendly and intuitive.
Key Characteristics of `Bunch`:#
- **Dictionary-Like Behavior:** you can access its elements using dictionary syntax (`bunch['key']`).
- **Attribute-Style Access:** keys in the `Bunch` can also be accessed as attributes (`bunch.key`).
- **Components:** a `Bunch` typically contains:
  - `data`: the main feature matrix as a NumPy array.
  - `target`: the target labels as a NumPy array.
  - `feature_names`: names of the features (columns).
  - `target_names`: names of the classes or target labels.
  - `DESCR`: a description of the dataset.
  - `filename`: path to the dataset file (if applicable).
- **Convenience for Machine Learning:** the structured nature of a `Bunch` makes it easy to manipulate datasets and integrate them into workflows for machine learning and data analysis, as the sketch after this list shows.
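Both access styles are easy to see by building a `Bunch` by hand. The sketch below is illustrative only: the toy arrays and field names are made up, but `Bunch` itself really is importable from `sklearn.utils`:

```python
from sklearn.utils import Bunch
import numpy as np

# A hand-made Bunch with toy values (hypothetical data, for illustration only)
toy = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0]]),
    target=np.array([0, 1]),
    feature_names=['length (cm)', 'width (cm)'],
)

print(toy['data'])        # dictionary-style access
print(toy.data)           # attribute-style access to the same array
print(toy.feature_names)  # metadata travels alongside the data
```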
Using the Iris Dataset as an Example#
Let's explore the `Bunch` object returned by `datasets.load_iris()`:
```python
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Inspect the type of the object
print(type(iris))  # <class 'sklearn.utils._bunch.Bunch'>

# Accessing keys in the Bunch object
print(iris.keys())

# Access feature data and target labels
print(iris.data[:5])    # first 5 rows of the feature matrix (sepal and petal measurements)
print(iris.target[:5])  # first 5 target labels (encoded as integers 0, 1, 2)

# Access metadata
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # names of the three species

# Access the description of the dataset
print(iris.DESCR)
```
```
<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
.. _iris_dataset:

Iris plants dataset
-------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. dropdown:: References

  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
    Mathematical Statistics" (John Wiley, NY, 1950).
  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments". IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
    on Information Theory, May 1972, 431-433.
  - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
  - Many, many more ...
```
Practical Usage with the Iris Dataset#
1. Converting `Bunch` to a Pandas DataFrame#
You can convert the `Bunch` into a more familiar tabular format for easier analysis:
```python
import pandas as pd

# Convert the feature data to a DataFrame with named columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a column for the encoded target class labels
df['species'] = iris.target

# Map numerical target labels to their corresponding class names
df['species_name'] = df['species'].map(dict(enumerate(iris.target_names)))

# Display the first few rows
print(df.head())
```
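As a side note, newer versions of scikit-learn can do this conversion for you: passing `as_frame=True` to the loader (available since scikit-learn 0.23) returns pandas objects inside the `Bunch`, including a combined `frame`. A brief sketch:

```python
# Ask the loader for pandas objects directly (scikit-learn 0.23+)
iris_frame = datasets.load_iris(as_frame=True)

print(type(iris_frame.frame))   # <class 'pandas.core.frame.DataFrame'>
print(iris_frame.frame.head())  # the four feature columns plus a 'target' column
```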
2. Visualizing the Data#
Using the converted DataFrame, we can visualize the features and relationships in the Iris dataset:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairplot to visualize pairwise feature relationships across species
sns.pairplot(df, hue='species_name', diag_kind='kde')
plt.show()
```
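If the full pairplot is more than you need, a single-feature view works too. The sketch below picks petal length arbitrarily; any of the four feature columns would do:

```python
# Distribution of one feature per species (petal length chosen arbitrarily)
sns.boxplot(data=df, x='species_name', y='petal length (cm)')
plt.title('Petal length by species')
plt.show()
```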
3. Building a Simple Classifier#
We can directly use the `data` and `target` arrays from the `Bunch` to build a machine learning model:
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
```
Benefits of Using `Bunch` in `sklearn`:#
- **Organized Access:** the `Bunch` structure keeps all relevant components of a dataset (features, labels, metadata) neatly organized in one object.
- **Ease of Use:** attribute-style access reduces boilerplate code, making it easier to explore and use datasets.
- **Compatibility:** `Bunch` integrates seamlessly with NumPy, Pandas, and other Python libraries, enabling quick preprocessing and analysis.
- **Standardization:** since the built-in `sklearn` dataset loaders all return the `Bunch` format, you can follow a consistent workflow regardless of the dataset, as the sketch below illustrates.
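To illustrate the standardization point, here is a sketch that swaps in `datasets.load_wine()`; apart from the loader call, the classifier workflow from section 3 is unchanged:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same Bunch interface, different dataset
wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```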
Summary:#
The `Bunch` object in `sklearn` provides a structured, dictionary-like interface for handling datasets. Its ability to store both data and metadata in one object makes it a convenient foundation for machine learning workflows. By converting it into other formats (such as a Pandas DataFrame) or using its attributes directly, you can easily integrate a dataset into preprocessing pipelines and model training.