应用于p2p公司,用于判断借贷人的风险程度,判断借贷人能否及时还欠款
数据来源:lending club 2016年Q3和Q4数据:https://www.lendingclub.com/info/download-data.action
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('input/LoanStats_2016Q3.csv',skiprows=1,low_memory=False)
df.head()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | 30000.0 | 30000.0 | 30000.0 | 60 months | 13.99% | 697.90 | C | C3 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | 20150.0 | 20150.0 | 20150.0 | 60 months | 24.99% | 591.32 | E | E4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | 30000.0 | 30000.0 | 30000.0 | 36 months | 10.99% | 982.02 | B | B4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | 15000.0 | 15000.0 | 15000.0 | 36 months | 13.99% | 512.60 | C | C3 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | 18000.0 | 18000.0 | 18000.0 | 60 months | 14.49% | 423.42 | C | C4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 145 columns
df.columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
...
'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
'disbursement_method', 'debt_settlement_flag',
'debt_settlement_flag_date', 'settlement_status', 'settlement_date',
'settlement_amount', 'settlement_percentage', 'settlement_term'],
dtype='object', length=145)
df.drop(['id','member_id','zip_code','addr_state'],1,inplace=True)
df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)
df.dropna(axis=0,how='all',inplace=True)
- load_amnt和funded_amnt
- emp_title:职位信息
- emp_length: 工作年限
- verification_status: 收入信息是否被确认
- loan_status: 贷款情况,即最后的判断结果
df.loan_amnt[pd.isnull(df.loan_amnt)]
Series([], Name: loan_amnt, dtype: float64)
df.query('loan_amnt != funded_amnt')
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
---|
0 rows × 141 columns
df.loan_amnt.value_counts()
10000.0 7328
12000.0 5358
20000.0 5140
15000.0 5082
5000.0 4429
8000.0 3924
6000.0 3729
35000.0 3306
16000.0 2934
24000.0 2849
30000.0 2365
25000.0 2240
18000.0 2199
7000.0 1893
14000.0 1694
4000.0 1532
3000.0 1496
9000.0 1249
21000.0 1160
28000.0 1152
9600.0 928
2000.0 884
40000.0 878
7200.0 813
13000.0 733
11000.0 732
11200.0 714
8400.0 674
4800.0 670
14400.0 660
...
32325.0 1
24925.0 1
33425.0 1
35750.0 1
36600.0 1
24725.0 1
31100.0 1
33725.0 1
33075.0 1
34275.0 1
30550.0 1
38725.0 1
25025.0 1
38800.0 1
36750.0 1
35300.0 1
36625.0 1
30325.0 1
36675.0 1
38625.0 1
35200.0 1
35050.0 1
30750.0 1
36200.0 1
36950.0 1
39750.0 1
33225.0 1
29075.0 1
39725.0 1
39075.0 1
Name: loan_amnt, Length: 1411, dtype: int64
df.emp_title.value_counts()
Teacher 1931
Manager 1701
Owner 990
Supervisor 785
Driver 756
Registered Nurse 752
RN 731
Sales 664
Project Manager 526
General Manager 483
Office Manager 466
Director 415
owner 384
Engineer 382
President 351
manager 314
Operations Manager 314
Vice President 288
Nurse 284
teacher 284
Attorney 275
Accountant 274
Sales Manager 263
Analyst 246
Administrative Assistant 243
Police Officer 230
driver 222
Account Manager 213
Technician 211
Executive Assistant 205
...
Bartender/Production floor tech 1
Massage Therapist/server 1
Financial/Insurance Coordinator 1
Paint Tech Team Lead 1
Delivery supervisor 1
Car Maintainer 1
compliance tester 1
RN. Clinical Quality Manager 1
Retail Marketing Designer 1
MANAGER, MEMBER SERVICES 1
Clinical Nurse Lead 1
Loan Coordinator 1
EHR Project Manager 1
Database Specialist 1
Consumer Loan Officer 1
Assistant Administrator - Human Resource 1
Sr. Web Developer 1
Asst sales manager 1
Engineering Tech 3 1
WARRANTY ADMIN 1
laborer assembly 1
Logistics Operations Manager 1
Healthcare technology consultant 1
Clinical Trial Specialist II 1
Union Autoworker 1
Adult Trauma Coordinator RN 1
ELIG CLERK 1
Driver/ firefighter 1
Marina Manager 1
CDL Licensing trainer/Driver 1
Name: emp_title, Length: 37420, dtype: int64
df.emp_title.head()
0 General Manager
1 Server
2 server
3 Fiscal Director
4 utility technician 2
Name: emp_title, dtype: object
# 职位名称太多太杂,所以删掉这一列,防止对结果产生干扰
df.drop('emp_title',1,inplace=True)
df.emp_length.value_counts()
10+ years 34219
2 years 9066
3 years 7925
< 1 year 7104
1 year 6991
5 years 6170
4 years 6022
6 years 4406
8 years 4168
9 years 3922
7 years 3205
Name: emp_length, dtype: int64
# 进行正则替换
df.replace('n/a',np.nan,inplace=True)
df['emp_length'].fillna(value=0,inplace=True)
df['emp_length'].replace(to_replace='[^0-9]+',value='',inplace=True,regex=True)
df['emp_length'] = df['emp_length'].astype(int)
df.emp_length.value_counts()
10 34219
1 14095
2 9066
3 7925
5 6170
4 6022
0 5922
6 4406
8 4168
9 3922
7 3205
Name: emp_length, dtype: int64
df.verification_status.value_counts()
Source Verified 42253
Verified 31356
Not Verified 25511
Name: verification_status, dtype: int64
df.loan_status.value_counts()
Current 52061
Fully Paid 32435
Charged Off 11317
Late (31-120 days) 1769
In Grace Period 928
Late (16-30 days) 505
Default 105
Name: loan_status, dtype: int64
pd.unique(df.loan_status.values.ravel())
array(['Fully Paid', 'Current', 'Charged Off', 'Late (31-120 days)',
'In Grace Period', 'Late (16-30 days)', 'Default'], dtype=object)
- current:代表还在还款中
- fully paid代表还款完成
- 其他的都是逾期用户,属于信誉不好的用户
所以通过replace函数,将这一列改为0和1的变量,让问题变成二分类问题
df.loan_status.replace('Fully Paid', int(1),inplace=True)
df.loan_status.replace('Current', int(1),inplace=True)
df.loan_status.replace('Late (16-30 days)', int(0),inplace=True)
df.loan_status.replace('Late (31-120 days)', int(0),inplace=True)
df.loan_status.replace('Charged Off', np.nan,inplace=True)
df.loan_status.replace('In Grace Period', np.nan,inplace=True)
df.loan_status.replace('Default', np.nan,inplace=True)
df.loan_status.value_counts()
1.0 84496
0.0 2274
Name: loan_status, dtype: int64
df.dropna(subset=['loan_status'],inplace=True)
df.loan_status.value_counts()
1.0 84496
0.0 2274
Name: loan_status, dtype: int64
df.loan_status.isnull()
0 False
1 False
2 False
3 False
5 False
6 False
7 False
8 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
31 False
32 False
...
99087 False
99088 False
99089 False
99090 False
99091 False
99092 False
99093 False
99094 False
99095 False
99097 False
99098 False
99099 False
99100 False
99101 False
99102 False
99103 False
99104 False
99105 False
99106 False
99107 False
99108 False
99109 False
99111 False
99112 False
99113 False
99115 False
99116 False
99117 False
99118 False
99119 False
Name: loan_status, Length: 86770, dtype: bool
df.revol_util = pd.Series(df.revol_util).str.replace('%', '').astype(float)
for col in df.select_dtypes(include=['object']).columns:
print("Column {} has {} unique instances".format( col, len(df[col].unique())))
Column term has 2 unique instances
Column grade has 7 unique instances
Column sub_grade has 35 unique instances
Column home_ownership has 4 unique instances
Column verification_status has 3 unique instances
Column issue_d has 3 unique instances
Column pymnt_plan has 2 unique instances
Column desc has 6 unique instances
Column purpose has 13 unique instances
Column title has 13 unique instances
Column earliest_cr_line has 608 unique instances
Column initial_list_status has 2 unique instances
Column last_pymnt_d has 25 unique instances
Column next_pymnt_d has 3 unique instances
Column last_credit_pull_d has 27 unique instances
Column application_type has 2 unique instances
Column verification_status_joint has 2 unique instances
Column hardship_flag has 2 unique instances
Column hardship_type has 2 unique instances
Column hardship_reason has 10 unique instances
Column hardship_status has 4 unique instances
Column hardship_start_date has 16 unique instances
Column hardship_end_date has 17 unique instances
Column payment_plan_start_date has 16 unique instances
Column hardship_loan_status has 5 unique instances
Column disbursement_method has 2 unique instances
Column debt_settlement_flag has 2 unique instances
Column debt_settlement_flag_date has 15 unique instances
Column settlement_status has 2 unique instances
Column settlement_date has 16 unique instances
# 判断缺失数据较多的列,并进行删除操作
df.select_dtypes(include=['object']).describe().T.assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
count | unique | top | freq | missing_pct | |
---|---|---|---|---|---|
term | 86770 | 2 | 36 months | 65451 | 0.000000 |
grade | 86770 | 7 | B | 29812 | 0.000000 |
sub_grade | 86770 | 35 | B5 | 7472 | 0.000000 |
home_ownership | 86770 | 4 | MORTGAGE | 41716 | 0.000000 |
verification_status | 86770 | 3 | Source Verified | 37066 | 0.000000 |
issue_d | 86770 | 3 | Aug-2016 | 31728 | 0.000000 |
pymnt_plan | 86770 | 2 | n | 86724 | 0.000000 |
desc | 5 | 5 | I have recently purchased and built a new home... | 1 | 0.999942 |
purpose | 86770 | 13 | debt_consolidation | 50030 | 0.000000 |
title | 82301 | 12 | Debt consolidation | 47030 | 0.051504 |
earliest_cr_line | 86770 | 608 | Aug-2004 | 704 | 0.000000 |
initial_list_status | 86770 | 2 | w | 62962 | 0.000000 |
last_pymnt_d | 86770 | 25 | Jun-2018 | 35077 | 0.000000 |
next_pymnt_d | 54335 | 2 | Jul-2018 | 38520 | 0.373804 |
last_credit_pull_d | 86766 | 26 | Jun-2018 | 63805 | 0.000046 |
application_type | 86770 | 2 | Individual | 86310 | 0.000000 |
verification_status_joint | 460 | 1 | Not Verified | 460 | 0.994699 |
hardship_flag | 86770 | 2 | N | 86712 | 0.000000 |
hardship_type | 589 | 1 | INTEREST ONLY-3 MONTHS DEFERRAL | 589 | 0.993212 |
hardship_reason | 589 | 9 | NATURAL_DISASTER | 263 | 0.993212 |
hardship_status | 589 | 3 | COMPLETED | 493 | 0.993212 |
hardship_start_date | 589 | 15 | Sep-2017 | 214 | 0.993212 |
hardship_end_date | 589 | 16 | Dec-2017 | 154 | 0.993212 |
payment_plan_start_date | 589 | 15 | Sep-2017 | 152 | 0.993212 |
hardship_loan_status | 589 | 4 | Late (16-30 days) | 234 | 0.993212 |
disbursement_method | 86770 | 2 | Cash | 86745 | 0.000000 |
debt_settlement_flag | 86770 | 2 | N | 86683 | 0.000000 |
debt_settlement_flag_date | 87 | 14 | Jun-2018 | 29 | 0.998997 |
settlement_status | 87 | 1 | ACTIVE | 87 | 0.998997 |
settlement_date | 87 | 15 | Jun-2018 | 29 | 0.998997 |
根据missing_pct和count这两列可以看出很多缺失的列,需要对这些列进行删除操作
df.drop('desc',1,inplace=True)
df.drop('verification_status_joint',1,inplace=True)
df.drop(['hardship_type','hardship_reason','hardship_status','hardship_start_date','hardship_end_date','payment_plan_start_date','hardship_loan_status','debt_settlement_flag_date','settlement_date'],1,inplace=True)
df.drop('earliest_cr_line',1,inplace=True)
df.drop('revol_util',1,inplace=True)
df.drop('purpose',1,inplace=True)
df.drop('title',1,inplace=True)
df.drop('term',1,inplace=True)
df.drop('issue_d',1,inplace=True)
# df.drop('',1,inplace=True)
# 贷后相关的字段
df.drop(['out_prncp','out_prncp_inv','total_pymnt',
'total_pymnt_inv','total_rec_prncp', 'grade', 'sub_grade'] ,1, inplace=True)
df.drop(['total_rec_int','total_rec_late_fee',
'recoveries','collection_recovery_fee',
'collection_recovery_fee' ],1, inplace=True)
df.drop(['last_pymnt_d','last_pymnt_amnt',
'next_pymnt_d','last_credit_pull_d'],1, inplace=True)
df.drop(['policy_code'],1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 86770 entries, 0 to 99119
Columns: 107 entries, loan_amnt to settlement_term
dtypes: float64(97), int32(1), object(9)
memory usage: 71.2+ MB
df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
count | mean | std | min | 25% | 50% | 75% | max | missing_pct | |
---|---|---|---|---|---|---|---|---|---|
loan_amnt | 86770.0 | 13992.648381 | 8860.658420 | 1000.00 | 7000.0000 | 12000.00 | 20000.0000 | 40000.00 | 0.000000 |
funded_amnt | 86770.0 | 13992.648381 | 8860.658420 | 1000.00 | 7000.0000 | 12000.00 | 20000.0000 | 40000.00 | 0.000000 |
funded_amnt_inv | 86770.0 | 13988.150570 | 8857.893095 | 1000.00 | 7000.0000 | 12000.00 | 20000.0000 | 40000.00 | 0.000000 |
int_rate | 86770.0 | 13.339744 | 4.662115 | 5.32 | 10.4900 | 12.79 | 15.5900 | 30.99 | 0.000000 |
installment | 86770.0 | 425.346229 | 269.732279 | 30.12 | 230.8000 | 352.32 | 558.3200 | 1535.71 | 0.000000 |
annual_inc | 86770.0 | 79304.560449 | 74486.852068 | 0.00 | 48000.0000 | 67000.00 | 95000.0000 | 8400000.00 | 0.000000 |
loan_status | 86770.0 | 0.973793 | 0.159752 | 0.00 | 1.0000 | 1.00 | 1.0000 | 1.00 | 0.000000 |
url | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
dti | 86767.0 | 17.732272 | 8.891494 | 0.00 | 11.7200 | 17.35 | 23.5900 | 999.00 | 0.000035 |
delinq_2yrs | 86770.0 | 0.378541 | 0.982088 | 0.00 | 0.0000 | 0.00 | 0.0000 | 21.00 | 0.000000 |
inq_last_6mths | 86770.0 | 0.548047 | 0.844625 | 0.00 | 0.0000 | 0.00 | 1.0000 | 5.00 | 0.000000 |
mths_since_last_delinq | 46645.0 | 33.288112 | 21.837896 | 0.00 | 15.0000 | 30.00 | 48.0000 | 142.00 | 0.462429 |
mths_since_last_record | 16967.0 | 67.008428 | 24.329262 | 0.00 | 51.0000 | 70.00 | 84.0000 | 119.00 | 0.804460 |
open_acc | 86770.0 | 11.677089 | 5.730229 | 1.00 | 8.0000 | 11.00 | 14.0000 | 86.00 | 0.000000 |
pub_rec | 86770.0 | 0.260620 | 0.711654 | 0.00 | 0.0000 | 0.00 | 0.0000 | 61.00 | 0.000000 |
revol_bal | 86770.0 | 15700.789201 | 21787.950268 | 0.00 | 5672.2500 | 10539.50 | 18653.0000 | 876178.00 | 0.000000 |
total_acc | 86770.0 | 24.009623 | 11.914658 | 2.00 | 15.0000 | 22.00 | 31.0000 | 119.00 | 0.000000 |
collections_12_mths_ex_med | 86770.0 | 0.020871 | 0.166700 | 0.00 | 0.0000 | 0.00 | 0.0000 | 10.00 | 0.000000 |
mths_since_last_major_derog | 25416.0 | 43.956996 | 21.772484 | 0.00 | 27.0000 | 43.00 | 62.0000 | 165.00 | 0.707088 |
annual_inc_joint | 460.0 | 119731.355543 | 51847.018868 | 26943.12 | 85000.0000 | 111000.00 | 145000.0000 | 400000.00 | 0.994699 |
dti_joint | 460.0 | 18.422152 | 6.675737 | 2.56 | 13.9375 | 18.26 | 22.7875 | 48.58 | 0.994699 |
acc_now_delinq | 86770.0 | 0.006546 | 0.085499 | 0.00 | 0.0000 | 0.00 | 0.0000 | 4.00 | 0.000000 |
tot_coll_amt | 86770.0 | 280.983439 | 1825.358766 | 0.00 | 0.0000 | 0.00 | 0.0000 | 172575.00 | 0.000000 |
tot_cur_bal | 86770.0 | 141618.435081 | 159319.917606 | 0.00 | 28983.0000 | 79803.50 | 211297.7500 | 3764968.00 | 0.000000 |
open_acc_6m | 86770.0 | 0.949084 | 1.154786 | 0.00 | 0.0000 | 1.00 | 1.0000 | 13.00 | 0.000000 |
open_act_il | 86770.0 | 2.816123 | 3.114175 | 0.00 | 1.0000 | 2.00 | 3.0000 | 43.00 | 0.000000 |
open_il_12m | 86770.0 | 0.695678 | 0.952599 | 0.00 | 0.0000 | 0.00 | 1.0000 | 11.00 | 0.000000 |
open_il_24m | 86770.0 | 1.578092 | 1.621432 | 0.00 | 0.0000 | 1.00 | 2.0000 | 26.00 | 0.000000 |
mths_since_rcnt_il | 84447.0 | 21.811160 | 26.764207 | 0.00 | 7.0000 | 13.00 | 25.0000 | 503.00 | 0.026772 |
total_bal_il | 86770.0 | 34920.427164 | 42123.969735 | 0.00 | 9014.2500 | 22997.00 | 45513.5000 | 1547285.00 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
num_tl_op_past_12m | 86770.0 | 2.191898 | 1.931044 | 0.00 | 1.0000 | 2.00 | 3.0000 | 24.00 | 0.000000 |
pct_tl_nvr_dlq | 86770.0 | 93.250410 | 9.712188 | 0.00 | 90.0000 | 96.95 | 100.0000 | 100.00 | 0.000000 |
percent_bc_gt_75 | 85791.0 | 42.178564 | 36.295768 | 0.00 | 0.0000 | 33.30 | 70.0000 | 100.00 | 0.011283 |
pub_rec_bankruptcies | 86770.0 | 0.146318 | 0.403204 | 0.00 | 0.0000 | 0.00 | 0.0000 | 8.00 | 0.000000 |
tax_liens | 86770.0 | 0.074104 | 0.511770 | 0.00 | 0.0000 | 0.00 | 0.0000 | 61.00 | 0.000000 |
tot_hi_cred_lim | 86770.0 | 175777.763858 | 178392.180843 | 2500.00 | 49905.7500 | 111976.00 | 253691.5000 | 3953111.00 | 0.000000 |
total_bal_ex_mort | 86770.0 | 50888.933214 | 49440.559933 | 0.00 | 20803.0000 | 37731.00 | 64243.2500 | 1548128.00 | 0.000000 |
total_bc_limit | 86770.0 | 21251.124029 | 21050.690545 | 0.00 | 7800.0000 | 15000.00 | 27500.0000 | 520500.00 | 0.000000 |
total_il_high_credit_limit | 86770.0 | 44113.311882 | 44747.921766 | 0.00 | 15709.2500 | 33192.50 | 59000.0000 | 2000000.00 | 0.000000 |
revol_bal_joint | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_earliest_cr_line | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_inq_last_6mths | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_mort_acc | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_open_acc | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_revol_util | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_open_act_il | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_num_rev_accts | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_chargeoff_within_12_mths | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_collections_12_mths_ex_med | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
sec_app_mths_since_last_major_derog | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
deferral_term | 589.0 | 3.000000 | 0.000000 | 3.00 | 3.0000 | 3.00 | 3.0000 | 3.00 | 0.993212 |
hardship_amount | 589.0 | 163.295331 | 131.454229 | 6.35 | 67.0700 | 125.10 | 219.1100 | 769.03 | 0.993212 |
hardship_length | 589.0 | 3.000000 | 0.000000 | 3.00 | 3.0000 | 3.00 | 3.0000 | 3.00 | 0.993212 |
hardship_dpd | 589.0 | 11.607810 | 10.522442 | 0.00 | 0.0000 | 12.00 | 22.0000 | 37.00 | 0.993212 |
orig_projected_additional_accrued_interest | 551.0 | 495.658911 | 393.610753 | 19.05 | 208.3950 | 387.63 | 666.8700 | 2307.09 | 0.993650 |
hardship_payoff_balance_amount | 589.0 | 12088.823175 | 7282.730741 | 750.93 | 6528.8700 | 10569.61 | 16276.7300 | 33494.15 | 0.993212 |
hardship_last_payment_amount | 589.0 | 192.151002 | 207.123799 | 0.02 | 35.2400 | 121.84 | 285.4700 | 1223.33 | 0.993212 |
settlement_amount | 87.0 | 5355.673333 | 4012.832699 | 508.00 | 2082.0000 | 4351.00 | 7535.5000 | 17191.00 | 0.998997 |
settlement_percentage | 87.0 | 53.376092 | 4.694325 | 49.99 | 50.0000 | 50.01 | 55.0000 | 65.00 | 0.998997 |
settlement_term | 87.0 | 13.747126 | 4.394019 | 1.00 | 10.0000 | 15.00 | 18.0000 | 18.00 | 0.998997 |
97 rows × 9 columns
df.drop('annual_inc_joint',1,inplace=True)
df.drop('dti_joint',1,inplace=True)
df.drop('url',1,inplace=True)
df.drop(['revol_bal_joint','sec_app_earliest_cr_line','sec_app_inq_last_6mths','sec_app_mort_acc','sec_app_open_acc','sec_app_revol_util','sec_app_open_act_il','sec_app_num_rev_accts','sec_app_chargeoff_within_12_mths','sec_app_collections_12_mths_ex_med','sec_app_mths_since_last_major_derog'],1,inplace=True)
df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
count | mean | std | min | 25% | 50% | 75% | max | missing_pct | |
---|---|---|---|---|---|---|---|---|---|
loan_amnt | 86770.0 | 13992.648381 | 8860.658420 | 1000.00 | 7000.000 | 12000.00 | 20000.00 | 40000.00 | 0.000000 |
funded_amnt | 86770.0 | 13992.648381 | 8860.658420 | 1000.00 | 7000.000 | 12000.00 | 20000.00 | 40000.00 | 0.000000 |
funded_amnt_inv | 86770.0 | 13988.150570 | 8857.893095 | 1000.00 | 7000.000 | 12000.00 | 20000.00 | 40000.00 | 0.000000 |
int_rate | 86770.0 | 13.339744 | 4.662115 | 5.32 | 10.490 | 12.79 | 15.59 | 30.99 | 0.000000 |
installment | 86770.0 | 425.346229 | 269.732279 | 30.12 | 230.800 | 352.32 | 558.32 | 1535.71 | 0.000000 |
annual_inc | 86770.0 | 79304.560449 | 74486.852068 | 0.00 | 48000.000 | 67000.00 | 95000.00 | 8400000.00 | 0.000000 |
loan_status | 86770.0 | 0.973793 | 0.159752 | 0.00 | 1.000 | 1.00 | 1.00 | 1.00 | 0.000000 |
dti | 86767.0 | 17.732272 | 8.891494 | 0.00 | 11.720 | 17.35 | 23.59 | 999.00 | 0.000035 |
delinq_2yrs | 86770.0 | 0.378541 | 0.982088 | 0.00 | 0.000 | 0.00 | 0.00 | 21.00 | 0.000000 |
inq_last_6mths | 86770.0 | 0.548047 | 0.844625 | 0.00 | 0.000 | 0.00 | 1.00 | 5.00 | 0.000000 |
mths_since_last_delinq | 46645.0 | 33.288112 | 21.837896 | 0.00 | 15.000 | 30.00 | 48.00 | 142.00 | 0.462429 |
mths_since_last_record | 16967.0 | 67.008428 | 24.329262 | 0.00 | 51.000 | 70.00 | 84.00 | 119.00 | 0.804460 |
open_acc | 86770.0 | 11.677089 | 5.730229 | 1.00 | 8.000 | 11.00 | 14.00 | 86.00 | 0.000000 |
pub_rec | 86770.0 | 0.260620 | 0.711654 | 0.00 | 0.000 | 0.00 | 0.00 | 61.00 | 0.000000 |
revol_bal | 86770.0 | 15700.789201 | 21787.950268 | 0.00 | 5672.250 | 10539.50 | 18653.00 | 876178.00 | 0.000000 |
total_acc | 86770.0 | 24.009623 | 11.914658 | 2.00 | 15.000 | 22.00 | 31.00 | 119.00 | 0.000000 |
collections_12_mths_ex_med | 86770.0 | 0.020871 | 0.166700 | 0.00 | 0.000 | 0.00 | 0.00 | 10.00 | 0.000000 |
mths_since_last_major_derog | 25416.0 | 43.956996 | 21.772484 | 0.00 | 27.000 | 43.00 | 62.00 | 165.00 | 0.707088 |
acc_now_delinq | 86770.0 | 0.006546 | 0.085499 | 0.00 | 0.000 | 0.00 | 0.00 | 4.00 | 0.000000 |
tot_coll_amt | 86770.0 | 280.983439 | 1825.358766 | 0.00 | 0.000 | 0.00 | 0.00 | 172575.00 | 0.000000 |
tot_cur_bal | 86770.0 | 141618.435081 | 159319.917606 | 0.00 | 28983.000 | 79803.50 | 211297.75 | 3764968.00 | 0.000000 |
open_acc_6m | 86770.0 | 0.949084 | 1.154786 | 0.00 | 0.000 | 1.00 | 1.00 | 13.00 | 0.000000 |
open_act_il | 86770.0 | 2.816123 | 3.114175 | 0.00 | 1.000 | 2.00 | 3.00 | 43.00 | 0.000000 |
open_il_12m | 86770.0 | 0.695678 | 0.952599 | 0.00 | 0.000 | 0.00 | 1.00 | 11.00 | 0.000000 |
open_il_24m | 86770.0 | 1.578092 | 1.621432 | 0.00 | 0.000 | 1.00 | 2.00 | 26.00 | 0.000000 |
mths_since_rcnt_il | 84447.0 | 21.811160 | 26.764207 | 0.00 | 7.000 | 13.00 | 25.00 | 503.00 | 0.026772 |
total_bal_il | 86770.0 | 34920.427164 | 42123.969735 | 0.00 | 9014.250 | 22997.00 | 45513.50 | 1547285.00 | 0.000000 |
il_util | 74592.0 | 71.086001 | 23.439148 | 0.00 | 58.000 | 74.00 | 87.00 | 1000.00 | 0.140348 |
open_rv_12m | 86770.0 | 1.372882 | 1.546456 | 0.00 | 0.000 | 1.00 | 2.00 | 20.00 | 0.000000 |
open_rv_24m | 86770.0 | 2.884430 | 2.635246 | 0.00 | 1.000 | 2.00 | 4.00 | 60.00 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
num_actv_rev_tl | 86770.0 | 5.580143 | 3.375133 | 0.00 | 3.000 | 5.00 | 7.00 | 59.00 | 0.000000 |
num_bc_sats | 86770.0 | 4.626288 | 2.999892 | 0.00 | 3.000 | 4.00 | 6.00 | 61.00 | 0.000000 |
num_bc_tl | 86770.0 | 7.407053 | 4.529108 | 0.00 | 4.000 | 7.00 | 10.00 | 67.00 | 0.000000 |
num_il_tl | 86770.0 | 8.568918 | 7.510092 | 0.00 | 3.000 | 7.00 | 11.00 | 107.00 | 0.000000 |
num_op_rev_tl | 86770.0 | 8.155215 | 4.699856 | 0.00 | 5.000 | 7.00 | 10.00 | 79.00 | 0.000000 |
num_rev_accts | 86770.0 | 13.695909 | 7.937062 | 2.00 | 8.000 | 12.00 | 18.00 | 104.00 | 0.000000 |
num_rev_tl_bal_gt_0 | 86770.0 | 5.524144 | 3.262540 | 0.00 | 3.000 | 5.00 | 7.00 | 59.00 | 0.000000 |
num_sats | 86770.0 | 11.633537 | 5.709296 | 1.00 | 8.000 | 11.00 | 14.00 | 85.00 | 0.000000 |
num_tl_120dpd_2m | 83916.0 | 0.001096 | 0.035858 | 0.00 | 0.000 | 0.00 | 0.00 | 4.00 | 0.032892 |
num_tl_30dpd | 86770.0 | 0.004218 | 0.067253 | 0.00 | 0.000 | 0.00 | 0.00 | 3.00 | 0.000000 |
num_tl_90g_dpd_24m | 86770.0 | 0.100000 | 0.563002 | 0.00 | 0.000 | 0.00 | 0.00 | 20.00 | 0.000000 |
num_tl_op_past_12m | 86770.0 | 2.191898 | 1.931044 | 0.00 | 1.000 | 2.00 | 3.00 | 24.00 | 0.000000 |
pct_tl_nvr_dlq | 86770.0 | 93.250410 | 9.712188 | 0.00 | 90.000 | 96.95 | 100.00 | 100.00 | 0.000000 |
percent_bc_gt_75 | 85791.0 | 42.178564 | 36.295768 | 0.00 | 0.000 | 33.30 | 70.00 | 100.00 | 0.011283 |
pub_rec_bankruptcies | 86770.0 | 0.146318 | 0.403204 | 0.00 | 0.000 | 0.00 | 0.00 | 8.00 | 0.000000 |
tax_liens | 86770.0 | 0.074104 | 0.511770 | 0.00 | 0.000 | 0.00 | 0.00 | 61.00 | 0.000000 |
tot_hi_cred_lim | 86770.0 | 175777.763858 | 178392.180843 | 2500.00 | 49905.750 | 111976.00 | 253691.50 | 3953111.00 | 0.000000 |
total_bal_ex_mort | 86770.0 | 50888.933214 | 49440.559933 | 0.00 | 20803.000 | 37731.00 | 64243.25 | 1548128.00 | 0.000000 |
total_bc_limit | 86770.0 | 21251.124029 | 21050.690545 | 0.00 | 7800.000 | 15000.00 | 27500.00 | 520500.00 | 0.000000 |
total_il_high_credit_limit | 86770.0 | 44113.311882 | 44747.921766 | 0.00 | 15709.250 | 33192.50 | 59000.00 | 2000000.00 | 0.000000 |
deferral_term | 589.0 | 3.000000 | 0.000000 | 3.00 | 3.000 | 3.00 | 3.00 | 3.00 | 0.993212 |
hardship_amount | 589.0 | 163.295331 | 131.454229 | 6.35 | 67.070 | 125.10 | 219.11 | 769.03 | 0.993212 |
hardship_length | 589.0 | 3.000000 | 0.000000 | 3.00 | 3.000 | 3.00 | 3.00 | 3.00 | 0.993212 |
hardship_dpd | 589.0 | 11.607810 | 10.522442 | 0.00 | 0.000 | 12.00 | 22.00 | 37.00 | 0.993212 |
orig_projected_additional_accrued_interest | 551.0 | 495.658911 | 393.610753 | 19.05 | 208.395 | 387.63 | 666.87 | 2307.09 | 0.993650 |
hardship_payoff_balance_amount | 589.0 | 12088.823175 | 7282.730741 | 750.93 | 6528.870 | 10569.61 | 16276.73 | 33494.15 | 0.993212 |
hardship_last_payment_amount | 589.0 | 192.151002 | 207.123799 | 0.02 | 35.240 | 121.84 | 285.47 | 1223.33 | 0.993212 |
settlement_amount | 87.0 | 5355.673333 | 4012.832699 | 508.00 | 2082.000 | 4351.00 | 7535.50 | 17191.00 | 0.998997 |
settlement_percentage | 87.0 | 53.376092 | 4.694325 | 49.99 | 50.000 | 50.01 | 55.00 | 65.00 | 0.998997 |
settlement_term | 87.0 | 13.747126 | 4.394019 | 1.00 | 10.000 | 15.00 | 18.00 | 18.00 | 0.998997 |
83 rows × 9 columns
df.select_dtypes(include=['float']).describe().T[df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x)))).isnull()]
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
loan_amnt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
int_rate | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
annual_inc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
loan_status | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
dti | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
delinq_2yrs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
inq_last_6mths | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mths_since_last_delinq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mths_since_last_record | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_acc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
pub_rec | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
revol_bal | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_acc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
collections_12_mths_ex_med | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mths_since_last_major_derog | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
acc_now_delinq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tot_coll_amt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tot_cur_bal | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_acc_6m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_act_il | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_il_12m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_il_24m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mths_since_rcnt_il | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_bal_il | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
il_util | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_rv_12m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
open_rv_24m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max_bal_bc | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
all_util | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_rev_hi_lim | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
num_actv_rev_tl | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_bc_sats | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_bc_tl | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_il_tl | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_op_rev_tl | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_rev_accts | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_rev_tl_bal_gt_0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_sats | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_tl_120dpd_2m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_tl_30dpd | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_tl_90g_dpd_24m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
num_tl_op_past_12m | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
pct_tl_nvr_dlq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
percent_bc_gt_75 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
pub_rec_bankruptcies | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tax_liens | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tot_hi_cred_lim | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_bal_ex_mort | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_bc_limit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
total_il_high_credit_limit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
deferral_term | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hardship_amount | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hardship_length | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hardship_dpd | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
orig_projected_additional_accrued_interest | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hardship_payoff_balance_amount | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
hardship_last_payment_amount | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
settlement_amount | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
settlement_percentage | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
settlement_term | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
80 rows × 8 columns
df.select_dtypes(include=['int']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
count | mean | std | min | 25% | 50% | 75% | max | missing_pct | |
---|---|---|---|---|---|---|---|---|---|
emp_length | 86770.0 | 5.793604 | 3.761932 | 0.0 | 2.0 | 6.0 | 10.0 | 10.0 | 0.0 |
cor = df.corr()
cor.loc[:,:] = np.tril(cor,k=-1)
cor
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
</style>
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | emp_length | annual_inc | loan_status | dti | delinq_2yrs | ... | deferral_term | hardship_amount | hardship_length | hardship_dpd | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | settlement_amount | settlement_percentage | settlement_term | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
loan_amnt | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
funded_amnt | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
funded_amnt_inv | 0.999993 | 0.999993 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
int_rate | 0.183818 | 0.183818 | 0.183990 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
installment | 0.953658 | 0.953658 | 0.953570 | 0.207638 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
emp_length | 0.110753 | 0.110753 | 0.110799 | -0.018173 | 0.096081 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
annual_inc | 0.314069 | 0.314069 | 0.314033 | -0.066962 | 0.295867 | 0.094171 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
loan_status | -0.013152 | -0.013152 | -0.013179 | -0.080925 | -0.018661 | 0.013513 | 0.016563 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
dti | 0.038334 | 0.038334 | 0.038344 | 0.167446 | 0.048354 | 0.011855 | -0.145502 | -0.022552 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
delinq_2yrs | -0.008369 | -0.008369 | -0.008387 | 0.029667 | -0.002644 | 0.026023 | 0.037214 | -0.020069 | -0.012498 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
inq_last_6mths | -0.016169 | -0.016169 | -0.016198 | 0.170607 | 0.007442 | 0.001574 | 0.034509 | -0.019708 | -0.001975 | 0.025702 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
mths_since_last_delinq | 0.002074 | 0.002074 | 0.002101 | -0.018567 | -0.004217 | -0.016531 | -0.033823 | 0.013422 | 0.009132 | -0.551436 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
mths_since_last_record | -0.015390 | -0.015390 | -0.015358 | -0.013327 | -0.023256 | 0.027087 | -0.080178 | -0.000778 | 0.049695 | -0.075914 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_acc | 0.189283 | 0.189283 | 0.189256 | -0.006746 | 0.176980 | 0.051260 | 0.140723 | -0.006664 | 0.261796 | 0.056525 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
pub_rec | -0.039875 | -0.039875 | -0.039901 | 0.049172 | -0.029240 | 0.012645 | 0.000980 | -0.011186 | -0.051349 | -0.039383 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
revol_bal | 0.323861 | 0.323861 | 0.323821 | -0.023577 | 0.306485 | 0.088163 | 0.308800 | 0.018380 | 0.131510 | -0.025537 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
total_acc | 0.207745 | 0.207745 | 0.207719 | -0.046995 | 0.183350 | 0.088303 | 0.176989 | 0.003432 | 0.219191 | 0.121100 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
collections_12_mths_ex_med | -0.020581 | -0.020581 | -0.020595 | 0.015218 | -0.019196 | -0.009854 | -0.010283 | -0.011052 | -0.006429 | 0.082467 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
mths_since_last_major_derog | 0.029221 | 0.029221 | 0.029253 | -0.010362 | 0.023953 | 0.011823 | -0.006633 | 0.008867 | 0.022977 | -0.439524 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
acc_now_delinq | -0.000643 | -0.000643 | -0.000652 | 0.011462 | 0.000688 | 0.012119 | 0.011594 | -0.000940 | -0.002557 | 0.122841 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
tot_coll_amt | -0.024839 | -0.024839 | -0.024834 | 0.007813 | -0.020357 | 0.002719 | -0.003110 | 0.003976 | -0.016754 | -0.003642 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
tot_cur_bal | 0.306866 | 0.306866 | 0.306888 | -0.067493 | 0.270260 | 0.117314 | 0.392770 | 0.022304 | 0.022050 | 0.056121 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_acc_6m | -0.017690 | -0.017690 | -0.017694 | 0.155748 | 0.003937 | 0.017539 | 0.046791 | -0.024538 | 0.040895 | 0.000350 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_act_il | 0.021932 | 0.021932 | 0.021922 | 0.031291 | 0.016411 | -0.087070 | 0.072141 | -0.005725 | 0.222380 | 0.071720 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_il_12m | 0.009230 | 0.009230 | 0.009251 | 0.185548 | 0.023938 | 0.042698 | 0.093255 | -0.021131 | 0.166589 | -0.005855 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_il_24m | 0.037502 | 0.037502 | 0.037524 | 0.163407 | 0.044889 | 0.046481 | 0.115768 | -0.021197 | 0.229569 | -0.023145 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
mths_since_rcnt_il | -0.026022 | -0.026022 | -0.026030 | -0.097319 | -0.027290 | -0.033338 | -0.076912 | 0.015779 | -0.244153 | 0.011909 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
total_bal_il | 0.145235 | 0.145235 | 0.145235 | 0.037161 | 0.131976 | -0.023716 | 0.224390 | -0.000659 | 0.239179 | 0.059929 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
il_util | -0.088801 | -0.088801 | -0.088763 | 0.146080 | -0.080543 | -0.088790 | -0.034212 | -0.027497 | -0.009064 | -0.006509 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
open_rv_12m | -0.039301 | -0.039301 | -0.039299 | 0.134080 | -0.010669 | 0.004968 | -0.000835 | -0.028553 | 0.010479 | -0.026754 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
num_actv_rev_tl | 0.175949 | 0.175949 | 0.175932 | 0.056059 | 0.176815 | 0.109311 | 0.088747 | -0.019766 | 0.229290 | -0.003791 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_bc_sats | 0.236664 | 0.236664 | 0.236621 | -0.045445 | 0.224395 | 0.067978 | 0.135832 | -0.001078 | 0.110538 | -0.035852 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_bc_tl | 0.221586 | 0.221586 | 0.221538 | -0.085157 | 0.206882 | 0.092169 | 0.145879 | 0.007720 | 0.083480 | 0.029768 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_il_tl | 0.071276 | 0.071276 | 0.071275 | 0.015775 | 0.056311 | -0.021343 | 0.103582 | -0.006861 | 0.209211 | 0.092056 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_op_rev_tl | 0.180233 | 0.180233 | 0.180201 | -0.016180 | 0.174405 | 0.097503 | 0.085403 | -0.007875 | 0.168069 | 0.008976 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_rev_accts | 0.187665 | 0.187665 | 0.187622 | -0.064875 | 0.173955 | 0.116064 | 0.110810 | 0.005822 | 0.135595 | 0.075864 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_rev_tl_bal_gt_0 | 0.173031 | 0.173031 | 0.173023 | 0.057363 | 0.175158 | 0.109516 | 0.087066 | -0.019129 | 0.232890 | -0.006032 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_sats | 0.189243 | 0.189243 | 0.189216 | -0.006886 | 0.177005 | 0.050780 | 0.140348 | -0.006524 | 0.261779 | 0.054582 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_tl_120dpd_2m | -0.003400 | -0.003400 | -0.003416 | 0.001534 | -0.003172 | 0.003219 | 0.003622 | 0.002894 | -0.011306 | 0.046662 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_tl_30dpd | 0.005174 | 0.005174 | 0.005163 | 0.010414 | 0.006622 | 0.012415 | 0.012151 | -0.001511 | 0.002556 | 0.102156 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_tl_90g_dpd_24m | -0.023556 | -0.023556 | -0.023574 | 0.021237 | -0.019500 | -0.002221 | 0.006915 | -0.012634 | -0.015192 | 0.670425 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
num_tl_op_past_12m | -0.014006 | -0.014006 | -0.013991 | 0.196019 | 0.013791 | 0.027795 | 0.059326 | -0.032077 | 0.080556 | -0.029705 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
pct_tl_nvr_dlq | 0.083337 | 0.083337 | 0.083360 | -0.041184 | 0.068789 | -0.024941 | -0.008379 | 0.022224 | 0.086247 | -0.460886 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
percent_bc_gt_75 | 0.035651 | 0.035651 | 0.035702 | 0.196573 | 0.051238 | 0.032716 | 0.003882 | -0.017767 | 0.145310 | -0.004343 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
pub_rec_bankruptcies | -0.070945 | -0.070945 | -0.070962 | 0.051258 | -0.062869 | 0.003886 | -0.044893 | -0.011858 | -0.030585 | -0.073138 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
tax_liens | 0.015755 | 0.015755 | 0.015733 | 0.015073 | 0.021572 | 0.009393 | 0.039479 | -0.006130 | -0.035036 | 0.004700 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
tot_hi_cred_lim | 0.332977 | 0.332977 | 0.332989 | -0.097597 | 0.293538 | 0.130913 | 0.411099 | 0.025668 | 0.034374 | 0.057963 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
total_bal_ex_mort | 0.270811 | 0.270811 | 0.270801 | 0.021583 | 0.252577 | 0.018146 | 0.336079 | 0.008061 | 0.262438 | 0.038416 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
total_bc_limit | 0.372026 | 0.372026 | 0.371955 | -0.185556 | 0.334458 | 0.066976 | 0.276032 | 0.030958 | 0.049492 | -0.082722 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
total_il_high_credit_limit | 0.203510 | 0.203510 | 0.203500 | 0.013691 | 0.186510 | 0.008036 | 0.283449 | 0.004152 | 0.296579 | 0.066601 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
deferral_term | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
hardship_amount | 0.827591 | 0.827591 | 0.827475 | 0.639130 | 0.781581 | 0.035424 | 0.303210 | -0.010632 | 0.083784 | -0.006715 | ... | NaN | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
hardship_length | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
hardship_dpd | 0.034837 | 0.034837 | 0.034631 | 0.142229 | 0.051037 | -0.048772 | -0.007155 | -0.288293 | 0.017312 | 0.049492 | ... | NaN | 0.062390 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
orig_projected_additional_accrued_interest | 0.829480 | 0.829480 | 0.829360 | 0.633758 | 0.783187 | 0.042427 | 0.312575 | -0.017930 | 0.079196 | 0.001766 | ... | NaN | 1.000000 | NaN | 0.078457 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
hardship_payoff_balance_amount | 0.958502 | 0.958502 | 0.958435 | 0.324071 | 0.862753 | 0.072547 | 0.432046 | 0.041810 | 0.003633 | 0.001576 | ... | NaN | 0.900165 | NaN | 0.011203 | 0.901693 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 |
hardship_last_payment_amount | 0.489119 | 0.489119 | 0.489032 | 0.190434 | 0.493799 | 0.002568 | 0.284256 | 0.004410 | 0.002036 | 0.028093 | ... | NaN | 0.405107 | NaN | 0.070852 | 0.394808 | 0.459846 | 0.0 | 0.000000 | 0.000000 | 0.0 |
settlement_amount | 0.917878 | 0.917878 | 0.917875 | 0.388751 | 0.814741 | 0.079187 | 0.492165 | 0.068815 | 0.082613 | 0.059831 | ... | NaN | 1.000000 | NaN | NaN | 1.000000 | 1.000000 | -1.0 | 0.000000 | 0.000000 | 0.0 |
settlement_percentage | -0.226310 | -0.226310 | -0.226299 | -0.148911 | -0.224885 | -0.095090 | -0.051463 | 0.362477 | -0.097457 | -0.051380 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.018362 | 0.000000 | 0.0 |
settlement_term | 0.575292 | 0.575292 | 0.575291 | 0.286584 | 0.507058 | 0.111925 | 0.226042 | -0.263670 | 0.318564 | 0.026849 | ... | NaN | -1.000000 | NaN | NaN | -1.000000 | -1.000000 | 1.0 | 0.563137 | -0.050615 | 0.0 |
84 rows × 84 columns
cor.stack()
loan_amnt loan_amnt 0.000000
funded_amnt 0.000000
funded_amnt_inv 0.000000
int_rate 0.000000
installment 0.000000
emp_length 0.000000
annual_inc 0.000000
loan_status 0.000000
dti 0.000000
delinq_2yrs 0.000000
inq_last_6mths 0.000000
mths_since_last_delinq 0.000000
mths_since_last_record 0.000000
open_acc 0.000000
pub_rec 0.000000
revol_bal 0.000000
total_acc 0.000000
collections_12_mths_ex_med 0.000000
mths_since_last_major_derog 0.000000
acc_now_delinq 0.000000
tot_coll_amt 0.000000
tot_cur_bal 0.000000
open_acc_6m 0.000000
open_act_il 0.000000
open_il_12m 0.000000
open_il_24m 0.000000
mths_since_rcnt_il 0.000000
total_bal_il 0.000000
il_util 0.000000
open_rv_12m 0.000000
...
settlement_term mths_since_recent_inq 0.030340
mths_since_recent_revol_delinq 0.103703
num_accts_ever_120_pd 0.027002
num_actv_bc_tl 0.181307
num_actv_rev_tl 0.220559
num_bc_sats 0.162077
num_bc_tl 0.178956
num_il_tl 0.114414
num_op_rev_tl 0.166385
num_rev_accts 0.197364
num_rev_tl_bal_gt_0 0.247732
num_sats 0.193849
num_tl_30dpd -0.096457
num_tl_90g_dpd_24m 0.068411
num_tl_op_past_12m 0.110434
pct_tl_nvr_dlq 0.121179
percent_bc_gt_75 0.027080
pub_rec_bankruptcies -0.080801
tax_liens 0.039254
tot_hi_cred_lim 0.213624
total_bal_ex_mort 0.219922
total_bc_limit 0.184810
total_il_high_credit_limit 0.218572
hardship_amount -1.000000
orig_projected_additional_accrued_interest -1.000000
hardship_payoff_balance_amount -1.000000
hardship_last_payment_amount 1.000000
settlement_amount 0.563137
settlement_percentage -0.050615
settlement_term 0.000000
Length: 6876, dtype: float64
cor[(cor>0.55) | (cor<-0.55)].stack()
funded_amnt loan_amnt 1.000000
funded_amnt_inv loan_amnt 0.999993
funded_amnt 0.999993
installment loan_amnt 0.953658
funded_amnt 0.953658
funded_amnt_inv 0.953570
mths_since_last_delinq delinq_2yrs -0.551436
total_acc open_acc 0.723418
mths_since_last_major_derog mths_since_last_delinq 0.691480
open_il_24m open_il_12m 0.758545
total_bal_il open_act_il 0.567060
open_rv_12m open_acc_6m 0.623141
open_rv_24m open_rv_12m 0.776720
all_util il_util 0.597521
total_rev_hi_lim revol_bal 0.815654
inq_last_12m inq_fi 0.559390
acc_open_past_24mths open_acc_6m 0.553789
open_il_24m 0.570438
open_rv_12m 0.659344
open_rv_24m 0.847494
avg_cur_bal tot_cur_bal 0.828143
bc_open_to_buy total_rev_hi_lim 0.624163
bc_util all_util 0.572605
mo_sin_rcnt_tl mo_sin_rcnt_rev_tl_op 0.604777
mths_since_recent_bc mo_sin_rcnt_rev_tl_op 0.615410
mths_since_recent_bc_dlq mths_since_last_delinq 0.750604
mths_since_last_major_derog 0.560775
mths_since_recent_revol_delinq mths_since_last_delinq 0.852881
mths_since_recent_bc_dlq 0.879875
num_actv_bc_tl open_acc 0.553697
...
hardship_amount installment 0.781581
orig_projected_additional_accrued_interest loan_amnt 0.829480
funded_amnt 0.829480
funded_amnt_inv 0.829360
int_rate 0.633758
installment 0.783187
hardship_amount 1.000000
hardship_payoff_balance_amount loan_amnt 0.958502
funded_amnt 0.958502
funded_amnt_inv 0.958435
installment 0.862753
hardship_amount 0.900165
orig_projected_additional_accrued_interest 0.901693
settlement_amount loan_amnt 0.917878
funded_amnt 0.917878
funded_amnt_inv 0.917875
installment 0.814741
mths_since_last_record -0.600571
hardship_amount 1.000000
orig_projected_additional_accrued_interest 1.000000
hardship_payoff_balance_amount 1.000000
hardship_last_payment_amount -1.000000
settlement_term loan_amnt 0.575292
funded_amnt 0.575292
funded_amnt_inv 0.575291
hardship_amount -1.000000
orig_projected_additional_accrued_interest -1.000000
hardship_payoff_balance_amount -1.000000
hardship_last_payment_amount 1.000000
settlement_amount 0.563137
Length: 125, dtype: float64
df.drop(['funded_amnt','funded_amnt_inv', 'installment'], axis=1, inplace=True)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble
from sklearn.preprocessing import OneHotEncoder
Y = df.loan_status
X = df.drop('loan_status',1,inplace=False)
print(Y.shape)
print(X.shape)
(86770,)
(86770, 89)
X = pd.get_dummies(X)
print(X.shape)
(86770, 100)
X.fillna(0.0,inplace=True)
X.fillna(0,inplace=True)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=666)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(60739, 100)
(60739,)
(26031, 100)
(26031,)
print (y_train.value_counts())
print (y_test.value_counts())
1.0 59148
0.0 1591
Name: loan_status, dtype: int64
1.0 25348
0.0 683
Name: loan_status, dtype: int64
param_grid = {'learning_rate': np.linspace(0.01,0.5,5),
'max_depth': [i for i in range(2,6)],
'min_samples_split': [50,100],
'n_estimators': [100,200]
}
grid_search = GridSearchCV(ensemble.GradientBoostingRegressor(),
param_grid, n_jobs=4, refit=True,verbose=1)
%%time
grid_search.fit(x_train,y_train)
bst_clf = grid_search.best_estimator_
Fitting 3 folds for each of 80 candidates, totalling 240 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 5.3min
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 31.5min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 40.6min finished
Wall time: 41min 17s
%%time
bst_clf.fit(x_train, y_train)
Wall time: 43.2 s
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.01, loss='ls', max_depth=4, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=100, min_weight_fraction_leaf=0.0,
n_estimators=200, presort='auto', random_state=None,
subsample=1.0, verbose=0, warm_start=False)
bst_clf.score(x_test,y_test)
0.07240211533402607
def compute_ks(data):
sorted_list = data.sort_values(['predict'], ascending=[True])
total_bad = sorted_list['label'].sum(axis=None, skipna=None, level=None, numeric_only=None) / 3
total_good = sorted_list.shape[0] - total_bad
max_ks = 0.0
good_count = 0.0
bad_count = 0.0
for index, row in sorted_list.iterrows():
if row['label'] == 3:
bad_count += 1.0
else:
good_count += 1.0
val = bad_count/total_bad - good_count/total_good
max_ks = max(max_ks, val)
return max_ks
test_pd = pd.DataFrame()
test_pd['predict'] = bst_clf.predict(x_test)
test_pd['label'] = y_test
print( compute_ks(test_pd[['label','predict']]))
0.0
feature_importance = bst_clf.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices],color='dodgerblue',alpha=.4)
plt.yticks(np.arange(10 + 0.25), np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')