update class 13 files
justmarkham committed Sep 30, 2015
1 parent a055169 commit 88447ae
Showing 6 changed files with 1,660 additions and 56 deletions.
7 changes: 5 additions & 2 deletions code/13_advanced_model_evaluation_nb.py
@@ -21,6 +21,7 @@
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape


# check for missing values
@@ -48,7 +49,7 @@


# most frequent Age
titanic.Age.value_counts().head(1).index
titanic.Age.mode()


# fill missing values for Age with the median age
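# (the actual fillna call is collapsed in this diff view; a common approach,
# shown here only as an illustrative sketch, would be to fill with the median)
titanic.Age.fillna(titanic.Age.median(), inplace=True)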
@@ -75,7 +76,7 @@
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})


# create a DataFrame of dummy variables
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

@@ -120,6 +121,8 @@


import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14


# plot ROC curve
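# (the plotting code itself is collapsed in this diff view; the commented lines
# below are an illustrative sketch only, assuming y_test and y_pred_prob were
# computed for the fitted model earlier in the collapsed cells)
# from sklearn import metrics
# fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
# plt.plot(fpr, tpr)
# plt.xlabel('False Positive Rate (1 - Specificity)')
# plt.ylabel('True Positive Rate (Sensitivity)')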
120 changes: 120 additions & 0 deletions code/13_bank_exercise_nb.py
@@ -8,13 +8,133 @@

# ## Step 1: Read the data into Pandas

import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bank-additional.csv'
bank = pd.read_csv(url, sep=';')
bank.head()


# ## Step 2: Prepare at least three features
#
# - Include both numeric and categorical features
# - Choose features that you think might be related to the response (based on intuition or exploration)
# - Think about how to handle missing values (encoded as "unknown")

# list all columns (for reference)
bank.columns


# ### y (response)

# convert the response to numeric values and store as a new column
bank['outcome'] = bank.y.map({'no':0, 'yes':1})


# ### age

# probably not a great feature
bank.boxplot(column='age', by='outcome')


# ### job

# looks like a useful feature
bank.groupby('job').outcome.mean()


# create job_dummies (we will add it to the bank DataFrame later)
job_dummies = pd.get_dummies(bank.job, prefix='job')
job_dummies.drop(job_dummies.columns[0], axis=1, inplace=True)


# ### default

# looks like a useful feature
bank.groupby('default').outcome.mean()


# but only one person in the dataset has a status of yes
bank.default.value_counts()


# so, let's treat this as a 2-class feature rather than a 3-class feature
bank['default'] = bank.default.map({'no':0, 'unknown':1, 'yes':1})


# ### contact

# looks like a useful feature
bank.groupby('contact').outcome.mean()


# convert the feature to numeric values
bank['contact'] = bank.contact.map({'cellular':0, 'telephone':1})


# ### month

# looks like a useful feature at first glance
bank.groupby('month').outcome.mean()


# but, it looks like their success rate is actually just correlated with number of calls
# thus, the month feature is unlikely to generalize
bank.groupby('month').outcome.agg(['count', 'mean']).sort('count')


# ### duration

# looks like an excellent feature, but you can't know the duration of a call in advance, so it can't be used in your model
bank.boxplot(column='duration', by='outcome')


# ### previous

# looks like a useful feature
bank.groupby('previous').outcome.mean()


# ### poutcome

# looks like a useful feature
bank.groupby('poutcome').outcome.mean()


# create poutcome_dummies
poutcome_dummies = pd.get_dummies(bank.poutcome, prefix='poutcome')
poutcome_dummies.drop(poutcome_dummies.columns[0], axis=1, inplace=True)


# concatenate bank DataFrame with job_dummies and poutcome_dummies
bank = pd.concat([bank, job_dummies, poutcome_dummies], axis=1)


# ### euribor3m

# looks like an excellent feature
bank.boxplot(column='euribor3m', by='outcome')


# ## Step 3: Model building
#
# - Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
# - Try to increase the AUC by selecting different sets of features

# new list of columns (including dummy columns)
bank.columns


# create X (including 13 dummy columns)
feature_cols = ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:])
X = bank[feature_cols]


# create y
y = bank.outcome


# calculate cross-validated AUC
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression(C=1e9)
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
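
# one possible way to act on the "try different sets of features" suggestion above
# (an illustrative sketch, not part of the original exercise): compute the
# cross-validated AUC for each candidate feature list and compare
candidate_feature_sets = [
    ['default', 'contact', 'previous', 'euribor3m'],
    ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:]),
]
for cols in candidate_feature_sets:
    auc = cross_val_score(logreg, bank[cols], y, cv=10, scoring='roc_auc').mean()
    print(cols, auc)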
9 changes: 9 additions & 0 deletions homework/13_cross_validation.md
@@ -7,11 +7,20 @@ Alternatively, read section 5.1 of [An Introduction to Statistical Learning](htt
Here are some questions to think about:

- What is the purpose of model evaluation?
- The purpose is to estimate the likely performance of a model on out-of-sample data, so that we can choose the model that is most likely to generalize, and so that we can have an idea of how well that model will actually perform.
- What is the drawback of training and testing on the same data?
- Training accuracy is maximized for overly complex models which overfit the training data, and thus it's not a good measure of how well a model will generalize.
- How does train/test split work, and what is its primary drawback?
- It splits the data into two pieces, trains the model on the training set, and tests the model on the testing set. Testing accuracy can change a lot depending upon which observations happen to be in the training and testing sets.
- How does K-fold cross-validation work, and what is the role of "K"?
- First, it splits the data into K equal folds. Then, it trains the model on folds 2 through K, tests the model on fold 1, and calculates the requested evaluation metric. Then, it repeats that process K-1 more times, until every fold has been the testing set exactly once.
- Why do we pass X and y, not X_train and y_train, to the `cross_val_score` function?
- It will take care of splitting the data into the K folds, so we don't need to split it ourselves.
- Why does `cross_val_score` need a "scoring" parameter?
- It needs to know what evaluation metric to calculate, since many different metrics are available.
- What does `cross_val_score` return, and what do we usually do with that object?
- It returns a NumPy array containing the K scores. We usually calculate the mean score, though we might also be interested in the standard deviation. (See the sketch after this list.)
- Under what circumstances does `cross_val_score` return negative scores?
- The scores will be negative if the evaluation metric is a loss function (something you want to minimize) rather than a reward function (something you want to maximize).
- When should you use train/test split, and when should you use cross-validation?
- Train/test split is useful when you want to inspect your testing results (via confusion matrix or ROC curve) and when evaluation speed is a concern. Cross-validation is useful when you are most concerned with the accuracy of your estimation.
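
A minimal sketch tying the `cross_val_score` answers above together (this uses the older `sklearn.cross_validation` import that the course materials use; in newer scikit-learn versions the same function lives in `sklearn.model_selection`):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

# pass the full X and y: cross_val_score does the K-fold splitting itself
iris = load_iris()
X, y = iris.data, iris.target
logreg = LogisticRegression()

# the scoring parameter tells it which evaluation metric to compute
scores = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')

# it returns a NumPy array of the 10 scores; we usually report the mean
print(scores)
print(scores.mean())
```

For loss functions such as mean squared error, scikit-learn negates the scores so that higher is always better, which is why those values come back negative.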
11 changes: 11 additions & 0 deletions homework/13_roc_auc.md
@@ -9,13 +9,24 @@ Then, watch my video on [ROC Curves and Area Under the Curve](https://www.youtub
Here are some questions to think about:

- What is the difference between the predict and predict_proba methods in scikit-learn?
- The former outputs class predictions, and the latter outputs predicted probabilities of class membership.
- If you have a classification model that outputs predicted probabilities, how could you convert those probabilities to class predictions?
- Set a threshold, and classify everything above the threshold as a 1 and everything below the threshold as a 0 (see the sketch after this list).
- Why are predicted probabilities (rather than just class predictions) required to generate an ROC curve?
- Because an ROC curve is measuring the performance of a classifier at all possible thresholds, and thresholds only make sense in the context of predicted probabilities.
- Could you use an ROC curve for a regression problem? Why or why not?
- No, because ROC is a plot of TPR vs FPR, and those concepts have no meaning in a regression problem.
- What's another term for True Positive Rate?
- Sensitivity or recall.
- If I wanted to increase specificity, how would I change the classification threshold?
- Increase it.
- Is it possible to adjust your classification threshold such that both sensitivity and specificity increase simultaneously? Why or why not?
- No, because increasing sensitivity requires lowering the threshold while increasing specificity requires raising it, so you can't move the threshold in both directions at once.
- What are the primary benefits of ROC curves over classification accuracy?
- ROC curves don't require setting a classification threshold, they allow you to visualize the performance of your classifier across all possible thresholds, and they work well for unbalanced classes.
- What should you do if your AUC is 0.2?
- Reverse your predictions so that your AUC is 0.8.
- What would the plot of reds and blues look like for a dataset in which each observation was a credit card transaction, and the response variable was whether or not the transaction was fraudulent? (0 = not fraudulent, 1 = fraudulent)
- The blue (not fraudulent) distribution would be significantly larger than the red (fraudulent) distribution, with a lot of overlap between the blues and reds.
- What's a real-world scenario in which you would prefer high specificity (rather than high sensitivity) for your classifier?
- Speed cameras issuing speeding tickets: you would want to be very confident that a car was actually speeding before issuing a ticket, even if that means missing some speeders.
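
A quick sketch of the `predict` / `predict_proba` / threshold / ROC workflow from the answers above, using a small synthetic dataset purely for illustration (again with the older-style `sklearn.cross_validation` import):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# small synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# predict returns class predictions; predict_proba returns predicted probabilities
y_pred_class = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]    # probability of class 1

# converting probabilities to class predictions with a custom threshold
y_pred_class_30 = np.where(y_pred_prob > 0.3, 1, 0)

# the ROC curve needs the probabilities, since it evaluates every possible threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
print(roc_auc_score(y_test, y_pred_prob))
```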