Applying the Universal Machine Learning Workflow to the UCI Mushroom Dataset

Kirby

Lepiota Mushrooms – Image Credit: East Tennessee Wildflowers

This post is intended to demonstrate the universal machine learning workflow as stated by Francois Chollet in Deep Learning with Python.


We will be using the Mushroom Dataset from UCI’s Machine Learning Repository to perform our demonstration. This work is meant for a reader who has at least a basic understanding of Python fundamentals and some experience with machine learning. That being said, I will provide copious links to supporting sources for the uninitiated so that anyone can make use of the information presented.

Before we get started, I’d like to give thanks to my fellow Lambdonian Ned H for all his help on this post.

The Universal Machine Learning Workflow

  1. Define the problem and assemble a dataset
  2. Choose a measure of success
  3. Decide on an evaluation protocol
  4. Prepare the data
  5. Develop a model that does better than a baseline
  6. Develop a model that overfits
  7. Regularize the model and tune its hyperparameters

1. Define the problem and assemble a dataset

Stated concisely, our problem is the binary classification of a mushroom as edible or poisonous. We are given a dataset with 23 features, including the class (edible or poisonous) of each mushroom.

From the features listed in the data information file we can create a list of column names for our dataset.

column_names = ['class',
                'cap-shape',
                'cap-surface',
                'cap-color',
                'bruises?',
                'odor',
                'gill-attachment',
                'gill-spacing',
                'gill-size',
                'gill-color',
                'stalk-shape',
                'stalk-root',
                'stalk-surface-above-ring',
                'stalk-surface-below-ring',
                'stalk-color-above-ring',
                'stalk-color-below-ring',
                'veil-type',
                'veil-color',
                'ring-number',
                'ring-type',
                'spore-print-color',
                'population',
                'habitat']

Let's import our dataset and create a Pandas DataFrame from the .data file using pd.read_csv().

import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushrooms = pd.read_csv(url, header=None, names=column_names)

2. Choose a measure of success

Given the nature of our problem, classifying a mushroom as edible or poisonous, we will use precision as our measure of success. Precision is the ability of the classifier not to label a poisonous mushroom as edible. We would much rather people discard edible mushrooms that our model classified as poisonous than eat poisonous mushrooms our classifier labeled as edible.


from sklearn.metrics import precision_score
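As a minimal illustration (toy labels, with 1 = edible and 0 = poisonous), precision penalizes exactly the mistake we care about:

# Toy example: one poisonous mushroom (0) is wrongly labeled edible (1)
y_true = [1, 1, 0, 0]
y_pred = [1, 1, 1, 0]
print(precision_score(y_true, y_pred))
____________________________________________________________________
0.6666666666666666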

3. Decide on an evaluation protocol

We will be using 10-fold cross-validation to evaluate our model. While a simple holdout validation set would probably suffice, I am skeptical of its reliability given that we have only ~8,000 samples.

from sklearn.model_selection import train_test_split, cross_validate

First, let's split our data into a feature matrix (X) and a target vector (y). We will use OneHotEncoder to encode our categorical features.

import category_encoders as ce
#Drop target feature
X = mushrooms.drop(columns='class')
#Encode categorical features
X = ce.OneHotEncoder(use_cat_names=True).fit_transform(X)
 
y = mushrooms['class'].replace({'p':0, 'e':1})
print('Feature matrix size:',X.shape)
print('Target vector size:',len(y))
____________________________________________________________________
Feature matrix size: (8124, 117) 
Target vector size: 8124
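To see what the encoder is doing, here is a minimal sketch on a single hypothetical column: each category value becomes its own 0/1 column, which is why our 22 predictive features expand to 117 columns.

# Hypothetical toy column to illustrate one-hot encoding
toy = pd.DataFrame({'cap-shape': ['x', 'b', 'x']})
print(ce.OneHotEncoder(use_cat_names=True).fit_transform(toy))
____________________________________________________________________
   cap-shape_x  cap-shape_b
0            1            0
1            0            1
2            1            0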

Next we will split our data into a training set and a test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.2, stratify=y)
print('Training feature matrix size:',X_train.shape)
print('Training target vector size:',y_train.shape)
print('Test feature matrix size:',X_test.shape)
print('Test target vector size:',y_test.shape)
____________________________________________________________________
Training feature matrix size: (6499, 117) 
Training target vector size: (6499,) 
Test feature matrix size: (1625, 117) 
Test target vector size: (1625,)
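Because we passed stratify=y, both splits should preserve the roughly 52/48 class balance; a quick check:

# Verify the class balance survived the split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))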

4. Prepare the data

We're almost ready to begin training models, but first we should explore our data, familiarize ourselves with its characteristics, and format it so that it can be fed into our model.

We could use .dtypes, .columns, and .shape to examine our dataset, but Pandas provides an .info() method that lets us view all of this information in one place.

print(mushrooms.info())
____________________________________________________________________
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
class                       8124 non-null object
cap-shape                   8124 non-null object 
cap-surface                 8124 non-null object 
cap-color                   8124 non-null object 
bruises?                    8124 non-null object 
odor                        8124 non-null object 
gill-attachment             8124 non-null object 
gill-spacing                8124 non-null object 
gill-size                   8124 non-null object 
gill-color                  8124 non-null object 
stalk-shape                 8124 non-null object 
stalk-root                  8124 non-null object 
stalk-surface-above-ring    8124 non-null object 
stalk-surface-below-ring    8124 non-null object 
stalk-color-above-ring      8124 non-null object 
stalk-color-below-ring      8124 non-null object 
veil-type                   8124 non-null object 
veil-color                  8124 non-null object 
ring-number                 8124 non-null object 
ring-type                   8124 non-null object 
spore-print-color           8124 non-null object 
population                  8124 non-null object 
habitat                     8124 non-null object 
dtypes: object(23) 
memory usage: 1.4+ MB 
None

Another useful step is to check the number of null values and where they are in the DataFrame.

print(mushrooms.isna().sum())
____________________________________________________________________
class                       0 
cap-shape                   0 
cap-surface                 0 
cap-color                   0 
bruises?                    0 
odor                        0 
gill-attachment             0 
gill-spacing                0 
gill-size                   0 
gill-color                  0 
stalk-shape                 0 
stalk-root                  0 
stalk-surface-above-ring    0 
stalk-surface-below-ring    0 
stalk-color-above-ring      0 
stalk-color-below-ring      0 
veil-type                   0 
veil-color                  0 
ring-number                 0 
ring-type                   0 
spore-print-color           0 
population                  0 
habitat                     0 
dtype: int64

No missing values at all… that seems a bit too good to be true.

Since we were studious and read the dataset information file, we're aware that all missing values are marked with a question mark. Knowing this, we can use df.replace() to convert each ? to NaN.

import numpy as np
mushrooms = mushrooms.replace({'?':np.NaN})
print(mushrooms.isna().sum())
____________________________________________________________________
class                       0 
cap-shape                   0 
cap-surface                 0 
cap-color                   0 
bruises?                    0 
odor                        0 
gill-attachment             0 
gill-spacing                0 
gill-size                   0 
gill-color                  0 
stalk-shape                 0 
stalk-root               2480
stalk-surface-above-ring    0 
stalk-surface-below-ring    0 
stalk-color-above-ring      0 
stalk-color-below-ring      0 
veil-type                   0 
veil-color                  0 
ring-number                 0 
ring-type                   0 
spore-print-color           0 
population                  0 
habitat                     0 
dtype: int64

There we are: stalk-root has 2480 missing values. Let's replace these with m for missing.

mushrooms['stalk-root'] = mushrooms['stalk-root'].replace(np.NaN,'m')
print(mushrooms['stalk-root'].value_counts())
____________________________________________________________________
b    3776 
m    2480 
e    1120 
c     556 
r     192 
Name: stalk-root, dtype: int64

5. Develop a model that does better than a baseline

Baseline Model

Using the most common label from our dataset, we will create a baseline model that we hope to beat.

First let's look at how class is distributed using .value_counts()

mushrooms['class'].value_counts(normalize=True)
____________________________________________________________________
e    0.517971 
p    0.482029 
Name: class, dtype: float64

We will use the mode of the class attribute to create our baseline prediction.

majority_class = y_train.mode()[0]
baseline_predictions = [majority_class] * len(y_train)

Let's see how accurate our baseline model is.

from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
majority_class_accuracy = accuracy_score(y_train, baseline_predictions)
print(majority_class_accuracy)
____________________________________________________________________

0.5179258347438067

~52%, which is what we would expect given the distribution of class in our initial dataset.
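As a sanity check, scikit-learn's DummyClassifier encodes the same majority-class baseline; a minimal sketch:

from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class in the training data
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_train, y_train))  # matches our manual ~52% baseline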

Decision Tree

We will attempt to fit a decision tree to our training data and produce an accuracy score greater than 52%.

from sklearn.tree import DecisionTreeClassifier
import graphviz
from sklearn.tree import export_graphviz
tree = DecisionTreeClassifier(max_depth=1)
# Fit the model
tree.fit(X_train, y_train)
# Visualize the tree
dot_data = export_graphviz(tree,
                           out_file=None,
                           feature_names=X_train.columns,
                           class_names=['Poisonous', 'Edible'],
                           filled=True, impurity=False, proportion=True)
graphviz.Source(dot_data)


Now that we have fitted the decision tree to our data we can analyze our model by looking at the prediction probability distribution for our classifier. In simple terms, prediction probability represents how sure the model is about its classification label.

In addition to prediction probability, we will look at the precision score of our decision tree. Scikit-learn provides a simple way to see many of the relevant scores for classification models with classification_report.

We will also generate a confusion matrix using sklearn's confusion_matrix. A confusion matrix shows the number of true and false positives and negatives.

Since we will be using these tools again we will write a function to run our model analysis for us.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

def model_analysis(model, train_X, train_y):
    # Prediction probabilities: the model's confidence in its chosen label
    model_probabilities = model.predict_proba(train_X)
    model_prediction_probability = [max(row) for row in model_probabilities]

    # Plot the distribution of prediction probabilities
    plt.figure(figsize=(15, 10))
    sns.distplot(model_prediction_probability)
    plt.title('Model Prediction Probabilities')
    # Set x and y ticks
    plt.xticks(color='gray')
    plt.yticks(color='gray')
    # Create axes object with plt.gca (get current axes)
    ax = plt.gca()
    # Set grid lines
    ax.grid(b=True, which='major', axis='y', color='black', alpha=.2)
    # Set facecolor
    ax.set_facecolor('white')
    # Remove box
    for spine in ['top', 'right', 'bottom', 'left']:
        ax.spines[spine].set_visible(False)
    ax.tick_params(color='white')
    plt.show()

    model_predictions = model.predict(train_X)
    # Classification report
    print('\n\n', classification_report(train_y, model_predictions,
                                        target_names=['0-Poisonous', '1-Edible']))
    # Confusion matrix
    con_matrix = pd.DataFrame(confusion_matrix(train_y, model_predictions),
                              columns=['Predicted Poison', 'Predicted Edible'],
                              index=['Actual Poison', 'Actual Edible'])
    plt.figure(figsize=(15, 10))
    sns.heatmap(data=con_matrix, cmap='cool')
    plt.title('Model Confusion Matrix')
    plt.show()

    return con_matrix

Now to apply this function to our decision tree.

model_analysis(tree, X_train, y_train)


We will store our predictions in a tree_predictions variable for use in interpreting our model's accuracy.

tree_predictions = tree.predict(X_train)
accuracy_score(y_train, tree_predictions)
____________________________________________________________________
0.8862901984920757

88% accuracy isn't bad. Since precision is our chosen measure of success, let's also check it on the tree's training predictions before moving on to the next step in our workflow.
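A one-line check, reusing the precision_score import from earlier:

# Precision with respect to the edible class (1)
print(precision_score(y_train, tree_predictions))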

6. Develop a model that overfits

We will use the RandomForestClassifier for our overfitting model.

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, max_depth=5)
cv = cross_validate(estimator=random_forest,
                    X=X_train,
                    y=y_train,
                    scoring='accuracy',
                    n_jobs=-1,
                    cv=10,
                    verbose=10,
                    return_train_score=True)
____________________________________________________________________

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.7s finished
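cross_validate returns a dictionary of per-fold scores; since we asked for return_train_score=True, comparing mean train and validation accuracy is a quick overfitting check. A minimal sketch:

# Mean accuracy across the 10 folds
print('Train accuracy:     ', cv['train_score'].mean())
print('Validation accuracy:', cv['test_score'].mean())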

Now we can fit the forest to our training data and check its accuracy on the same training set.

random_forest.fit(X_train, y_train)
train_predictions = random_forest.predict(X_train)
accuracy_score(y_train, train_predictions)
____________________________________________________________________

0.9924603785197723

99% training accuracy looks overfit to me.

We can use our model_analysis function from earlier to analyze our model.

model_analysis(random_forest, X_train, y_train)


7. Regularize the model and tune its hyperparameters

Now we will tune the hyperparameters of our RandomForestClassifier and attempt to walk the line between underfitting and overfitting. We can use sklearn's RandomizedSearchCV to search the hyperparameters in our param_distributions dictionary.

from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'max_depth': [1, 2, 3, 4, 5],
    'n_estimators': [10, 25, 50, 100, 150, 200]}

search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                            param_distributions=param_distributions,
                            n_iter=30,  # the grid holds 5 * 6 = 30 combinations
                            scoring='precision',
                            n_jobs=-1,
                            cv=10,
                            verbose=10,
                            return_train_score=True)

search.fit(X_train, y_train)

We can use search.best_estimator_ to see which model has the highest precision score.

best_model = search.best_estimator_
best_model
____________________________________________________________________
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
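Equivalently, we can read the winning settings and their mean cross-validated precision straight off the search object:

# Best hyperparameters and their mean cross-validated precision
print(search.best_params_)
print(search.best_score_)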

From the model description we can see that a RandomForestClassifier with a max_depth of 5 and 10 estimators is our optimal model. Now we can run our analysis function.

model_analysis(best_model, X_test, y_test)


Three false positives: not perfect, but pretty good.
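To close the loop on step 2, we can also score the held-out test set with our chosen metric; a short sketch (the exact value will depend on the search's random draw):

# Precision of the tuned model on the held-out test set
final_predictions = best_model.predict(X_test)
print(precision_score(y_test, final_predictions))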

Conclusion

To restate our workflow:

  1. Define the problem and assemble a dataset
  2. Choose a measure of success
  3. Decide on an evaluation protocol
  4. Prepare the data
  5. Develop a model that does better than a baseline
  6. Develop a model that overfits
  7. Regularize the model and tune its hyperparameters

While Chollet describes this as THE universal machine learning workflow, there are infinite variations depending on the specific problem we are trying to solve. In general, though, you will always start by defining your problem and collecting data (whether from a premade dataset or your own data collection).

I hope this post has presented an informative walkthrough of Chollet’s universal machine learning workflow.

Thanks for reading!

Follow me on Twitter, GitHub, and LinkedIn.

P.S. Here is the link to the Colab Notebook I used for this post.
