update class 13 files
justmarkham committed Sep 30, 2015
1 parent a055169 commit 88447ae
Showing 6 changed files with 1,660 additions and 56 deletions.
7 changes: 5 additions & 2 deletions code/13_advanced_model_evaluation_nb.py
@@ -21,6 +21,7 @@
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape


# check for missing values
@@ -48,7 +49,7 @@


# most frequent Age
titanic.Age.value_counts().head(1).index
titanic.Age.mode()


# fill missing values for Age with the median age
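# (the actual fillna call is collapsed in this diff view; a common approach,
# shown here only as an illustrative sketch, would be to fill with the median)
titanic.Age.fillna(titanic.Age.median(), inplace=True)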
@@ -75,7 +76,7 @@
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})


# create a DataFrame of dummy variables
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

@@ -120,6 +121,8 @@


import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14


# plot ROC curve
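# (the plotting code itself is collapsed in this diff view; the commented lines
# below are an illustrative sketch only, assuming y_test and y_pred_prob were
# computed for the fitted model earlier in the collapsed cells)
# from sklearn import metrics
# fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
# plt.plot(fpr, tpr)
# plt.xlabel('False Positive Rate (1 - Specificity)')
# plt.ylabel('True Positive Rate (Sensitivity)')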
120 changes: 120 additions & 0 deletions code/13_bank_exercise_nb.py
@@ -8,13 +8,133 @@

# ## Step 1: Read the data into Pandas

import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bank-additional.csv'
bank = pd.read_csv(url, sep=';')
bank.head()


# ## Step 2: Prepare at least three features
#
# - Include both numeric and categorical features
# - Choose features that you think might be related to the response (based on intuition or exploration)
# - Think about how to handle missing values (encoded as "unknown")

# list all columns (for reference)
bank.columns


# ### y (response)

# convert the response to numeric values and store as a new column
bank['outcome'] = bank.y.map({'no':0, 'yes':1})


# ### age

# probably not a great feature
bank.boxplot(column='age', by='outcome')


# ### job

# looks like a useful feature
bank.groupby('job').outcome.mean()


# create job_dummies (we will add it to the bank DataFrame later)
job_dummies = pd.get_dummies(bank.job, prefix='job')
job_dummies.drop(job_dummies.columns[0], axis=1, inplace=True)


# ### default

# looks like a useful feature
bank.groupby('default').outcome.mean()


# but only one person in the dataset has a status of yes
bank.default.value_counts()


# so, let's treat this as a 2-class feature rather than a 3-class feature
bank['default'] = bank.default.map({'no':0, 'unknown':1, 'yes':1})


# ### contact

# looks like a useful feature
bank.groupby('contact').outcome.mean()


# convert the feature to numeric values
bank['contact'] = bank.contact.map({'cellular':0, 'telephone':1})


# ### month

# looks like a useful feature at first glance
bank.groupby('month').outcome.mean()


# but, it looks like their success rate is actually just correlated with number of calls
# thus, the month feature is unlikely to generalize
bank.groupby('month').outcome.agg(['count', 'mean']).sort('count')


# ### duration

# looks like an excellent feature, but you can't know the duration of a call in advance, so it can't be used in your model
bank.boxplot(column='duration', by='outcome')


# ### previous

# looks like a useful feature
bank.groupby('previous').outcome.mean()


# ### poutcome

# looks like a useful feature
bank.groupby('poutcome').outcome.mean()


# create poutcome_dummies
poutcome_dummies = pd.get_dummies(bank.poutcome, prefix='poutcome')
poutcome_dummies.drop(poutcome_dummies.columns[0], axis=1, inplace=True)


# concatenate bank DataFrame with job_dummies and poutcome_dummies
bank = pd.concat([bank, job_dummies, poutcome_dummies], axis=1)


# ### euribor3m

# looks like an excellent feature
bank.boxplot(column='euribor3m', by='outcome')


# ## Step 3: Model building
#
# - Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
# - Try to increase the AUC by selecting different sets of features

# new list of columns (including dummy columns)
bank.columns


# create X (including 13 dummy columns)
feature_cols = ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:])
X = bank[feature_cols]


# create y
y = bank.outcome


# calculate cross-validated AUC
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression(C=1e9)
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
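
# one possible way to act on the "try different sets of features" suggestion above
# (an illustrative sketch, not part of the original exercise): compute the
# cross-validated AUC for each candidate feature list and compare
candidate_feature_sets = [
    ['default', 'contact', 'previous', 'euribor3m'],
    ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:]),
]
for cols in candidate_feature_sets:
    auc = cross_val_score(logreg, bank[cols], y, cv=10, scoring='roc_auc').mean()
    print(cols, auc)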
9 changes: 9 additions & 0 deletions homework/13_cross_validation.md
@@ -7,11 +7,20 @@ Alternatively, read section 5.1 of [An Introduction to Statistical Learning](htt
Here are some questions to think about:

- What is the purpose of model evaluation?
- The purpose is to estimate the likely performance of a model on out-of-sample data, so that we can choose the model that is most likely to generalize, and so that we can have an idea of how well that model will actually perform.
- What is the drawback of training and testing on the same data?
- Training accuracy is maximized for overly complex models which overfit the training data, and thus it's not a good measure of how well a model will generalize.
- How does train/test split work, and what is its primary drawback?
- It splits the data into two pieces, trains the model on the training set, and tests the model on the testing set. Testing accuracy can change a lot depending upon which observations happen to be in the training and testing sets.
- How does K-fold cross-validation work, and what is the role of "K"?
- First, it splits the data into K equal folds. Then, it trains the model on folds 2 through K, tests the model on fold 1, and calculates the requested evaluation metric. Then, it repeats that process K-1 more times, until every fold has been the testing set exactly once.
- Why do we pass X and y, not X_train and y_train, to the `cross_val_score` function?
- It will take care of splitting the data into the K folds, so we don't need to split it ourselves.
- Why does `cross_val_score` need a "scoring" parameter?
- It needs to know what evaluation metric to calculate, since many different metrics are available.
- What does `cross_val_score` return, and what do we usually do with that object?
- It returns a NumPy array containing the K scores. We usually calculate the mean score, though we might also be interested in the standard deviation. (See the sketch after this list.)
- Under what circumstances does `cross_val_score` return negative scores?
- The scores will be negative if the evaluation metric is a loss function (something you want to minimize) rather than a reward function (something you want to maximize).
- When should you use train/test split, and when should you use cross-validation?
- Train/test split is useful when you want to inspect your testing results (via confusion matrix or ROC curve) and when evaluation speed is a concern. Cross-validation is useful when you are most concerned with the accuracy of your estimation.
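
A minimal sketch tying the `cross_val_score` answers above together (this uses the older `sklearn.cross_validation` import that the course materials use; in newer scikit-learn versions the same function lives in `sklearn.model_selection`):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

# pass the full X and y: cross_val_score does the K-fold splitting itself
iris = load_iris()
X, y = iris.data, iris.target
logreg = LogisticRegression()

# the scoring parameter tells it which evaluation metric to compute
scores = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')

# it returns a NumPy array of the 10 scores; we usually report the mean
print(scores)
print(scores.mean())
```

For loss functions such as mean squared error, scikit-learn negates the scores so that higher is always better, which is why those values come back negative.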
11 changes: 11 additions & 0 deletions homework/13_roc_auc.md
@@ -9,13 +9,24 @@ Then, watch my video on [ROC Curves and Area Under the Curve](https://www.youtub
Here are some questions to think about:

- What is the difference between the predict and predict_proba methods in scikit-learn?
- The former outputs class predictions, and the latter outputs predicted probabilities of class membership.
- If you have a classification model that outputs predicted probabilities, how could you convert those probabilities to class predictions?
- Set a threshold, and classify everything above the threshold as a 1 and everything below the threshold as a 0 (see the sketch after this list).
- Why are predicted probabilities (rather than just class predictions) required to generate an ROC curve?
- Because an ROC curve is measuring the performance of a classifier at all possible thresholds, and thresholds only make sense in the context of predicted probabilities.
- Could you use an ROC curve for a regression problem? Why or why not?
- No, because ROC is a plot of TPR vs FPR, and those concepts have no meaning in a regression problem.
- What's another term for True Positive Rate?
- Sensitivity or recall.
- If I wanted to increase specificity, how would I change the classification threshold?
- Increase it.
- Is it possible to adjust your classification threshold such that both sensitivity and specificity increase simultaneously? Why or why not?
- No, because increasing sensitivity requires lowering the threshold while increasing specificity requires raising it, so you can't move the threshold in both directions at once.
- What are the primary benefits of ROC curves over classification accuracy?
- ROC curves don't require setting a classification threshold, they allow you to visualize the performance of your classifier across all possible thresholds, and they work well for unbalanced classes.
- What should you do if your AUC is 0.2?
- Reverse your predictions so that your AUC is 0.8.
- What would the plot of reds and blues look like for a dataset in which each observation was a credit card transaction, and the response variable was whether or not the transaction was fraudulent? (0 = not fraudulent, 1 = fraudulent)
- The blue (not fraudulent) distribution would be significantly larger than the red (fraudulent) distribution, with a lot of overlap between the blues and reds.
- What's a real-world scenario in which you would prefer high specificity (rather than high sensitivity) for your classifier?
- Speed cameras issuing speeding tickets: you would want to be very confident that a car was actually speeding before issuing a ticket, even if that means missing some speeders.
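
A quick sketch of the `predict` / `predict_proba` / threshold / ROC workflow from the answers above, using a small synthetic dataset purely for illustration (again with the older-style `sklearn.cross_validation` import):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# small synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# predict returns class predictions; predict_proba returns predicted probabilities
y_pred_class = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]    # probability of class 1

# converting probabilities to class predictions with a custom threshold
y_pred_class_30 = np.where(y_pred_prob > 0.3, 1, 0)

# the ROC curve needs the probabilities, since it evaluates every possible threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
print(roc_auc_score(y_test, y_pred_prob))
```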