Skip to content

应用于p2p公司,用于判断借贷人的风险程度,判断借贷人能否及时还欠款

License

Notifications You must be signed in to change notification settings

Expert68/risk_control

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

risk_control

应用于p2p公司,用于判断借贷人的风险程度,判断借贷人能否及时还欠款

金融借贷风控

数据来源:lending club 2016年Q3和Q4数据:https://www.lendingclub.com/info/download-data.action

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('input/LoanStats_2016Q3.csv',skiprows=1,low_memory=False)
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 NaN NaN 30000.0 30000.0 30000.0 60 months 13.99% 697.90 C C3 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 NaN NaN 20150.0 20150.0 20150.0 60 months 24.99% 591.32 E E4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 NaN NaN 30000.0 30000.0 30000.0 36 months 10.99% 982.02 B B4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 NaN NaN 15000.0 15000.0 15000.0 36 months 13.99% 512.60 C C3 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
4 NaN NaN 18000.0 18000.0 18000.0 60 months 14.49% 423.42 C C4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN

5 rows × 145 columns

获取列名,对一些不重要的或具有干扰意义的列进行直接删除

df.columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'disbursement_method', 'debt_settlement_flag',
       'debt_settlement_flag_date', 'settlement_status', 'settlement_date',
       'settlement_amount', 'settlement_percentage', 'settlement_term'],
      dtype='object', length=145)
df.drop(['id','member_id','zip_code','addr_state'],1,inplace=True)
df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)

将所有空行删除

df.dropna(axis=0,how='all',inplace=True)

重点特征关注

  1. load_amnt和funded_amnt
  2. emp_title:职位信息
  3. emp_length: 工作年限
  4. verification_status: 收入信息是否被确认
  5. loan_status: 贷款情况,即最后的判断结果

loan_amnt 和 funded_amnt

df.loan_amnt[pd.isnull(df.loan_amnt)]
Series([], Name: loan_amnt, dtype: float64)
df.query('loan_amnt != funded_amnt')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length ... hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term

0 rows × 141 columns

df.loan_amnt.value_counts()
10000.0    7328
12000.0    5358
20000.0    5140
15000.0    5082
5000.0     4429
8000.0     3924
6000.0     3729
35000.0    3306
16000.0    2934
24000.0    2849
30000.0    2365
25000.0    2240
18000.0    2199
7000.0     1893
14000.0    1694
4000.0     1532
3000.0     1496
9000.0     1249
21000.0    1160
28000.0    1152
9600.0      928
2000.0      884
40000.0     878
7200.0      813
13000.0     733
11000.0     732
11200.0     714
8400.0      674
4800.0      670
14400.0     660
           ... 
32325.0       1
24925.0       1
33425.0       1
35750.0       1
36600.0       1
24725.0       1
31100.0       1
33725.0       1
33075.0       1
34275.0       1
30550.0       1
38725.0       1
25025.0       1
38800.0       1
36750.0       1
35300.0       1
36625.0       1
30325.0       1
36675.0       1
38625.0       1
35200.0       1
35050.0       1
30750.0       1
36200.0       1
36950.0       1
39750.0       1
33225.0       1
29075.0       1
39725.0       1
39075.0       1
Name: loan_amnt, Length: 1411, dtype: int64

emp_title: 职位信息

df.emp_title.value_counts()
Teacher                                     1931
Manager                                     1701
Owner                                        990
Supervisor                                   785
Driver                                       756
Registered Nurse                             752
RN                                           731
Sales                                        664
Project Manager                              526
General Manager                              483
Office Manager                               466
Director                                     415
owner                                        384
Engineer                                     382
President                                    351
manager                                      314
Operations Manager                           314
Vice President                               288
Nurse                                        284
teacher                                      284
Attorney                                     275
Accountant                                   274
Sales Manager                                263
Analyst                                      246
Administrative Assistant                     243
Police Officer                               230
driver                                       222
Account Manager                              213
Technician                                   211
Executive Assistant                          205
                                            ... 
Bartender/Production floor tech                1
Massage Therapist/server                       1
Financial/Insurance Coordinator                1
Paint Tech Team Lead                           1
Delivery supervisor                            1
Car Maintainer                                 1
compliance tester                              1
RN. Clinical Quality Manager                   1
Retail Marketing Designer                      1
MANAGER, MEMBER SERVICES                       1
Clinical Nurse Lead                            1
Loan Coordinator                               1
EHR Project Manager                            1
Database Specialist                            1
Consumer Loan Officer                          1
Assistant Administrator - Human Resource       1
Sr. Web Developer                              1
Asst sales manager                             1
Engineering Tech 3                             1
WARRANTY ADMIN                                 1
laborer assembly                               1
Logistics Operations Manager                   1
Healthcare technology consultant               1
Clinical Trial Specialist II                   1
Union Autoworker                               1
Adult Trauma Coordinator RN                    1
ELIG CLERK                                     1
Driver/ firefighter                            1
Marina Manager                                 1
CDL Licensing trainer/Driver                   1
Name: emp_title, Length: 37420, dtype: int64
df.emp_title.head()
0          General Manager
1                   Server
2                   server
3          Fiscal Director
4     utility technician 2
Name: emp_title, dtype: object
# 职位名称太多太杂,所以删掉这一列,防止对结果产生干扰
df.drop('emp_title',1,inplace=True)

emp_length: 工作年限属于重要特征,所以需要保留,一个人的工作年限越长,一般也意味着这个人的贷款信誉越好

df.emp_length.value_counts()
10+ years    34219
2 years       9066
3 years       7925
< 1 year      7104
1 year        6991
5 years       6170
4 years       6022
6 years       4406
8 years       4168
9 years       3922
7 years       3205
Name: emp_length, dtype: int64
# 进行正则替换
df.replace('n/a',np.nan,inplace=True)
df['emp_length'].fillna(value=0,inplace=True)
df['emp_length'].replace(to_replace='[^0-9]+',value='',inplace=True,regex=True)
df['emp_length'] = df['emp_length'].astype(int)
df.emp_length.value_counts()
10    34219
1     14095
2      9066
3      7925
5      6170
4      6022
0      5922
6      4406
8      4168
9      3922
7      3205
Name: emp_length, dtype: int64

verification_status:收入信息确认 收入信息确认是能够确认贷款人是否诚信的重要标志,所以对于该列信息需要进行研究

df.verification_status.value_counts()
Source Verified    42253
Verified           31356
Not Verified       25511
Name: verification_status, dtype: int64

Loan_status:计算目标 Loan_status作为计算目标,需要对这列进行研究,并转化为比较明确的数值型或者类别型标签,从而将问题转化为分类问题或者是回归问题

df.loan_status.value_counts()
Current               52061
Fully Paid            32435
Charged Off           11317
Late (31-120 days)     1769
In Grace Period         928
Late (16-30 days)       505
Default                 105
Name: loan_status, dtype: int64
pd.unique(df.loan_status.values.ravel())
array(['Fully Paid', 'Current', 'Charged Off', 'Late (31-120 days)',
       'In Grace Period', 'Late (16-30 days)', 'Default'], dtype=object)
  • current:代表还在还款中
  • fully paid代表还款完成
  • 其他的都是逾期用户,属于信誉不好的用户

所以通过replace函数,将这一列改为0和1的变量,让问题变成二分类问题

df.loan_status.replace('Fully Paid', int(1),inplace=True)
df.loan_status.replace('Current', int(1),inplace=True)
df.loan_status.replace('Late (16-30 days)', int(0),inplace=True)
df.loan_status.replace('Late (31-120 days)', int(0),inplace=True)
df.loan_status.replace('Charged Off', np.nan,inplace=True)
df.loan_status.replace('In Grace Period', np.nan,inplace=True)
df.loan_status.replace('Default', np.nan,inplace=True)
df.loan_status.value_counts()
1.0    84496
0.0     2274
Name: loan_status, dtype: int64
df.dropna(subset=['loan_status'],inplace=True)
df.loan_status.value_counts()
1.0    84496
0.0     2274
Name: loan_status, dtype: int64
df.loan_status.isnull()
0        False
1        False
2        False
3        False
5        False
6        False
7        False
8        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
31       False
32       False
         ...  
99087    False
99088    False
99089    False
99090    False
99091    False
99092    False
99093    False
99094    False
99095    False
99097    False
99098    False
99099    False
99100    False
99101    False
99102    False
99103    False
99104    False
99105    False
99106    False
99107    False
99108    False
99109    False
99111    False
99112    False
99113    False
99115    False
99116    False
99117    False
99118    False
99119    False
Name: loan_status, Length: 86770, dtype: bool
df.revol_util = pd.Series(df.revol_util).str.replace('%', '').astype(float)

处理缺失值过多的类别型变量

for col in df.select_dtypes(include=['object']).columns:
    print("Column {} has {} unique instances".format( col, len(df[col].unique())))
Column term has 2 unique instances
Column grade has 7 unique instances
Column sub_grade has 35 unique instances
Column home_ownership has 4 unique instances
Column verification_status has 3 unique instances
Column issue_d has 3 unique instances
Column pymnt_plan has 2 unique instances
Column desc has 6 unique instances
Column purpose has 13 unique instances
Column title has 13 unique instances
Column earliest_cr_line has 608 unique instances
Column initial_list_status has 2 unique instances
Column last_pymnt_d has 25 unique instances
Column next_pymnt_d has 3 unique instances
Column last_credit_pull_d has 27 unique instances
Column application_type has 2 unique instances
Column verification_status_joint has 2 unique instances
Column hardship_flag has 2 unique instances
Column hardship_type has 2 unique instances
Column hardship_reason has 10 unique instances
Column hardship_status has 4 unique instances
Column hardship_start_date has 16 unique instances
Column hardship_end_date has 17 unique instances
Column payment_plan_start_date has 16 unique instances
Column hardship_loan_status has 5 unique instances
Column disbursement_method has 2 unique instances
Column debt_settlement_flag has 2 unique instances
Column debt_settlement_flag_date has 15 unique instances
Column settlement_status has 2 unique instances
Column settlement_date has 16 unique instances
# 判断缺失数据较多的列,并进行删除操作
df.select_dtypes(include=['object']).describe().T.assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count unique top freq missing_pct
term 86770 2 36 months 65451 0.000000
grade 86770 7 B 29812 0.000000
sub_grade 86770 35 B5 7472 0.000000
home_ownership 86770 4 MORTGAGE 41716 0.000000
verification_status 86770 3 Source Verified 37066 0.000000
issue_d 86770 3 Aug-2016 31728 0.000000
pymnt_plan 86770 2 n 86724 0.000000
desc 5 5 I have recently purchased and built a new home... 1 0.999942
purpose 86770 13 debt_consolidation 50030 0.000000
title 82301 12 Debt consolidation 47030 0.051504
earliest_cr_line 86770 608 Aug-2004 704 0.000000
initial_list_status 86770 2 w 62962 0.000000
last_pymnt_d 86770 25 Jun-2018 35077 0.000000
next_pymnt_d 54335 2 Jul-2018 38520 0.373804
last_credit_pull_d 86766 26 Jun-2018 63805 0.000046
application_type 86770 2 Individual 86310 0.000000
verification_status_joint 460 1 Not Verified 460 0.994699
hardship_flag 86770 2 N 86712 0.000000
hardship_type 589 1 INTEREST ONLY-3 MONTHS DEFERRAL 589 0.993212
hardship_reason 589 9 NATURAL_DISASTER 263 0.993212
hardship_status 589 3 COMPLETED 493 0.993212
hardship_start_date 589 15 Sep-2017 214 0.993212
hardship_end_date 589 16 Dec-2017 154 0.993212
payment_plan_start_date 589 15 Sep-2017 152 0.993212
hardship_loan_status 589 4 Late (16-30 days) 234 0.993212
disbursement_method 86770 2 Cash 86745 0.000000
debt_settlement_flag 86770 2 N 86683 0.000000
debt_settlement_flag_date 87 14 Jun-2018 29 0.998997
settlement_status 87 1 ACTIVE 87 0.998997
settlement_date 87 15 Jun-2018 29 0.998997

根据missing_pct和count这两列可以看出很多缺失的列,需要对这些列进行删除操作

df.drop('desc',1,inplace=True)
df.drop('verification_status_joint',1,inplace=True)
df.drop(['hardship_type','hardship_reason','hardship_status','hardship_start_date','hardship_end_date','payment_plan_start_date','hardship_loan_status','debt_settlement_flag_date','settlement_date'],1,inplace=True)
df.drop('earliest_cr_line',1,inplace=True)
df.drop('revol_util',1,inplace=True)
df.drop('purpose',1,inplace=True)
df.drop('title',1,inplace=True)
df.drop('term',1,inplace=True)
df.drop('issue_d',1,inplace=True)
# df.drop('',1,inplace=True)
# 贷后相关的字段
df.drop(['out_prncp','out_prncp_inv','total_pymnt',
         'total_pymnt_inv','total_rec_prncp', 'grade', 'sub_grade'] ,1, inplace=True)
df.drop(['total_rec_int','total_rec_late_fee',
         'recoveries','collection_recovery_fee',
         'collection_recovery_fee' ],1, inplace=True)
df.drop(['last_pymnt_d','last_pymnt_amnt',
         'next_pymnt_d','last_credit_pull_d'],1, inplace=True)
df.drop(['policy_code'],1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 86770 entries, 0 to 99119
Columns: 107 entries, loan_amnt to settlement_term
dtypes: float64(97), int32(1), object(9)
memory usage: 71.2+ MB

将数值型的缺失值较多的列进行删除

df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max missing_pct
loan_amnt 86770.0 13992.648381 8860.658420 1000.00 7000.0000 12000.00 20000.0000 40000.00 0.000000
funded_amnt 86770.0 13992.648381 8860.658420 1000.00 7000.0000 12000.00 20000.0000 40000.00 0.000000
funded_amnt_inv 86770.0 13988.150570 8857.893095 1000.00 7000.0000 12000.00 20000.0000 40000.00 0.000000
int_rate 86770.0 13.339744 4.662115 5.32 10.4900 12.79 15.5900 30.99 0.000000
installment 86770.0 425.346229 269.732279 30.12 230.8000 352.32 558.3200 1535.71 0.000000
annual_inc 86770.0 79304.560449 74486.852068 0.00 48000.0000 67000.00 95000.0000 8400000.00 0.000000
loan_status 86770.0 0.973793 0.159752 0.00 1.0000 1.00 1.0000 1.00 0.000000
url 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
dti 86767.0 17.732272 8.891494 0.00 11.7200 17.35 23.5900 999.00 0.000035
delinq_2yrs 86770.0 0.378541 0.982088 0.00 0.0000 0.00 0.0000 21.00 0.000000
inq_last_6mths 86770.0 0.548047 0.844625 0.00 0.0000 0.00 1.0000 5.00 0.000000
mths_since_last_delinq 46645.0 33.288112 21.837896 0.00 15.0000 30.00 48.0000 142.00 0.462429
mths_since_last_record 16967.0 67.008428 24.329262 0.00 51.0000 70.00 84.0000 119.00 0.804460
open_acc 86770.0 11.677089 5.730229 1.00 8.0000 11.00 14.0000 86.00 0.000000
pub_rec 86770.0 0.260620 0.711654 0.00 0.0000 0.00 0.0000 61.00 0.000000
revol_bal 86770.0 15700.789201 21787.950268 0.00 5672.2500 10539.50 18653.0000 876178.00 0.000000
total_acc 86770.0 24.009623 11.914658 2.00 15.0000 22.00 31.0000 119.00 0.000000
collections_12_mths_ex_med 86770.0 0.020871 0.166700 0.00 0.0000 0.00 0.0000 10.00 0.000000
mths_since_last_major_derog 25416.0 43.956996 21.772484 0.00 27.0000 43.00 62.0000 165.00 0.707088
annual_inc_joint 460.0 119731.355543 51847.018868 26943.12 85000.0000 111000.00 145000.0000 400000.00 0.994699
dti_joint 460.0 18.422152 6.675737 2.56 13.9375 18.26 22.7875 48.58 0.994699
acc_now_delinq 86770.0 0.006546 0.085499 0.00 0.0000 0.00 0.0000 4.00 0.000000
tot_coll_amt 86770.0 280.983439 1825.358766 0.00 0.0000 0.00 0.0000 172575.00 0.000000
tot_cur_bal 86770.0 141618.435081 159319.917606 0.00 28983.0000 79803.50 211297.7500 3764968.00 0.000000
open_acc_6m 86770.0 0.949084 1.154786 0.00 0.0000 1.00 1.0000 13.00 0.000000
open_act_il 86770.0 2.816123 3.114175 0.00 1.0000 2.00 3.0000 43.00 0.000000
open_il_12m 86770.0 0.695678 0.952599 0.00 0.0000 0.00 1.0000 11.00 0.000000
open_il_24m 86770.0 1.578092 1.621432 0.00 0.0000 1.00 2.0000 26.00 0.000000
mths_since_rcnt_il 84447.0 21.811160 26.764207 0.00 7.0000 13.00 25.0000 503.00 0.026772
total_bal_il 86770.0 34920.427164 42123.969735 0.00 9014.2500 22997.00 45513.5000 1547285.00 0.000000
... ... ... ... ... ... ... ... ... ...
num_tl_op_past_12m 86770.0 2.191898 1.931044 0.00 1.0000 2.00 3.0000 24.00 0.000000
pct_tl_nvr_dlq 86770.0 93.250410 9.712188 0.00 90.0000 96.95 100.0000 100.00 0.000000
percent_bc_gt_75 85791.0 42.178564 36.295768 0.00 0.0000 33.30 70.0000 100.00 0.011283
pub_rec_bankruptcies 86770.0 0.146318 0.403204 0.00 0.0000 0.00 0.0000 8.00 0.000000
tax_liens 86770.0 0.074104 0.511770 0.00 0.0000 0.00 0.0000 61.00 0.000000
tot_hi_cred_lim 86770.0 175777.763858 178392.180843 2500.00 49905.7500 111976.00 253691.5000 3953111.00 0.000000
total_bal_ex_mort 86770.0 50888.933214 49440.559933 0.00 20803.0000 37731.00 64243.2500 1548128.00 0.000000
total_bc_limit 86770.0 21251.124029 21050.690545 0.00 7800.0000 15000.00 27500.0000 520500.00 0.000000
total_il_high_credit_limit 86770.0 44113.311882 44747.921766 0.00 15709.2500 33192.50 59000.0000 2000000.00 0.000000
revol_bal_joint 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_earliest_cr_line 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_inq_last_6mths 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_mort_acc 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_open_acc 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_revol_util 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_open_act_il 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_num_rev_accts 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_chargeoff_within_12_mths 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_collections_12_mths_ex_med 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
sec_app_mths_since_last_major_derog 0.0 NaN NaN NaN NaN NaN NaN NaN 1.000000
deferral_term 589.0 3.000000 0.000000 3.00 3.0000 3.00 3.0000 3.00 0.993212
hardship_amount 589.0 163.295331 131.454229 6.35 67.0700 125.10 219.1100 769.03 0.993212
hardship_length 589.0 3.000000 0.000000 3.00 3.0000 3.00 3.0000 3.00 0.993212
hardship_dpd 589.0 11.607810 10.522442 0.00 0.0000 12.00 22.0000 37.00 0.993212
orig_projected_additional_accrued_interest 551.0 495.658911 393.610753 19.05 208.3950 387.63 666.8700 2307.09 0.993650
hardship_payoff_balance_amount 589.0 12088.823175 7282.730741 750.93 6528.8700 10569.61 16276.7300 33494.15 0.993212
hardship_last_payment_amount 589.0 192.151002 207.123799 0.02 35.2400 121.84 285.4700 1223.33 0.993212
settlement_amount 87.0 5355.673333 4012.832699 508.00 2082.0000 4351.00 7535.5000 17191.00 0.998997
settlement_percentage 87.0 53.376092 4.694325 49.99 50.0000 50.01 55.0000 65.00 0.998997
settlement_term 87.0 13.747126 4.394019 1.00 10.0000 15.00 18.0000 18.00 0.998997

97 rows × 9 columns

df.drop('annual_inc_joint',1,inplace=True)
df.drop('dti_joint',1,inplace=True)
df.drop('url',1,inplace=True)
df.drop(['revol_bal_joint','sec_app_earliest_cr_line','sec_app_inq_last_6mths','sec_app_mort_acc','sec_app_open_acc','sec_app_revol_util','sec_app_open_act_il','sec_app_num_rev_accts','sec_app_chargeoff_within_12_mths','sec_app_collections_12_mths_ex_med','sec_app_mths_since_last_major_derog'],1,inplace=True)
df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max missing_pct
loan_amnt 86770.0 13992.648381 8860.658420 1000.00 7000.000 12000.00 20000.00 40000.00 0.000000
funded_amnt 86770.0 13992.648381 8860.658420 1000.00 7000.000 12000.00 20000.00 40000.00 0.000000
funded_amnt_inv 86770.0 13988.150570 8857.893095 1000.00 7000.000 12000.00 20000.00 40000.00 0.000000
int_rate 86770.0 13.339744 4.662115 5.32 10.490 12.79 15.59 30.99 0.000000
installment 86770.0 425.346229 269.732279 30.12 230.800 352.32 558.32 1535.71 0.000000
annual_inc 86770.0 79304.560449 74486.852068 0.00 48000.000 67000.00 95000.00 8400000.00 0.000000
loan_status 86770.0 0.973793 0.159752 0.00 1.000 1.00 1.00 1.00 0.000000
dti 86767.0 17.732272 8.891494 0.00 11.720 17.35 23.59 999.00 0.000035
delinq_2yrs 86770.0 0.378541 0.982088 0.00 0.000 0.00 0.00 21.00 0.000000
inq_last_6mths 86770.0 0.548047 0.844625 0.00 0.000 0.00 1.00 5.00 0.000000
mths_since_last_delinq 46645.0 33.288112 21.837896 0.00 15.000 30.00 48.00 142.00 0.462429
mths_since_last_record 16967.0 67.008428 24.329262 0.00 51.000 70.00 84.00 119.00 0.804460
open_acc 86770.0 11.677089 5.730229 1.00 8.000 11.00 14.00 86.00 0.000000
pub_rec 86770.0 0.260620 0.711654 0.00 0.000 0.00 0.00 61.00 0.000000
revol_bal 86770.0 15700.789201 21787.950268 0.00 5672.250 10539.50 18653.00 876178.00 0.000000
total_acc 86770.0 24.009623 11.914658 2.00 15.000 22.00 31.00 119.00 0.000000
collections_12_mths_ex_med 86770.0 0.020871 0.166700 0.00 0.000 0.00 0.00 10.00 0.000000
mths_since_last_major_derog 25416.0 43.956996 21.772484 0.00 27.000 43.00 62.00 165.00 0.707088
acc_now_delinq 86770.0 0.006546 0.085499 0.00 0.000 0.00 0.00 4.00 0.000000
tot_coll_amt 86770.0 280.983439 1825.358766 0.00 0.000 0.00 0.00 172575.00 0.000000
tot_cur_bal 86770.0 141618.435081 159319.917606 0.00 28983.000 79803.50 211297.75 3764968.00 0.000000
open_acc_6m 86770.0 0.949084 1.154786 0.00 0.000 1.00 1.00 13.00 0.000000
open_act_il 86770.0 2.816123 3.114175 0.00 1.000 2.00 3.00 43.00 0.000000
open_il_12m 86770.0 0.695678 0.952599 0.00 0.000 0.00 1.00 11.00 0.000000
open_il_24m 86770.0 1.578092 1.621432 0.00 0.000 1.00 2.00 26.00 0.000000
mths_since_rcnt_il 84447.0 21.811160 26.764207 0.00 7.000 13.00 25.00 503.00 0.026772
total_bal_il 86770.0 34920.427164 42123.969735 0.00 9014.250 22997.00 45513.50 1547285.00 0.000000
il_util 74592.0 71.086001 23.439148 0.00 58.000 74.00 87.00 1000.00 0.140348
open_rv_12m 86770.0 1.372882 1.546456 0.00 0.000 1.00 2.00 20.00 0.000000
open_rv_24m 86770.0 2.884430 2.635246 0.00 1.000 2.00 4.00 60.00 0.000000
... ... ... ... ... ... ... ... ... ...
num_actv_rev_tl 86770.0 5.580143 3.375133 0.00 3.000 5.00 7.00 59.00 0.000000
num_bc_sats 86770.0 4.626288 2.999892 0.00 3.000 4.00 6.00 61.00 0.000000
num_bc_tl 86770.0 7.407053 4.529108 0.00 4.000 7.00 10.00 67.00 0.000000
num_il_tl 86770.0 8.568918 7.510092 0.00 3.000 7.00 11.00 107.00 0.000000
num_op_rev_tl 86770.0 8.155215 4.699856 0.00 5.000 7.00 10.00 79.00 0.000000
num_rev_accts 86770.0 13.695909 7.937062 2.00 8.000 12.00 18.00 104.00 0.000000
num_rev_tl_bal_gt_0 86770.0 5.524144 3.262540 0.00 3.000 5.00 7.00 59.00 0.000000
num_sats 86770.0 11.633537 5.709296 1.00 8.000 11.00 14.00 85.00 0.000000
num_tl_120dpd_2m 83916.0 0.001096 0.035858 0.00 0.000 0.00 0.00 4.00 0.032892
num_tl_30dpd 86770.0 0.004218 0.067253 0.00 0.000 0.00 0.00 3.00 0.000000
num_tl_90g_dpd_24m 86770.0 0.100000 0.563002 0.00 0.000 0.00 0.00 20.00 0.000000
num_tl_op_past_12m 86770.0 2.191898 1.931044 0.00 1.000 2.00 3.00 24.00 0.000000
pct_tl_nvr_dlq 86770.0 93.250410 9.712188 0.00 90.000 96.95 100.00 100.00 0.000000
percent_bc_gt_75 85791.0 42.178564 36.295768 0.00 0.000 33.30 70.00 100.00 0.011283
pub_rec_bankruptcies 86770.0 0.146318 0.403204 0.00 0.000 0.00 0.00 8.00 0.000000
tax_liens 86770.0 0.074104 0.511770 0.00 0.000 0.00 0.00 61.00 0.000000
tot_hi_cred_lim 86770.0 175777.763858 178392.180843 2500.00 49905.750 111976.00 253691.50 3953111.00 0.000000
total_bal_ex_mort 86770.0 50888.933214 49440.559933 0.00 20803.000 37731.00 64243.25 1548128.00 0.000000
total_bc_limit 86770.0 21251.124029 21050.690545 0.00 7800.000 15000.00 27500.00 520500.00 0.000000
total_il_high_credit_limit 86770.0 44113.311882 44747.921766 0.00 15709.250 33192.50 59000.00 2000000.00 0.000000
deferral_term 589.0 3.000000 0.000000 3.00 3.000 3.00 3.00 3.00 0.993212
hardship_amount 589.0 163.295331 131.454229 6.35 67.070 125.10 219.11 769.03 0.993212
hardship_length 589.0 3.000000 0.000000 3.00 3.000 3.00 3.00 3.00 0.993212
hardship_dpd 589.0 11.607810 10.522442 0.00 0.000 12.00 22.00 37.00 0.993212
orig_projected_additional_accrued_interest 551.0 495.658911 393.610753 19.05 208.395 387.63 666.87 2307.09 0.993650
hardship_payoff_balance_amount 589.0 12088.823175 7282.730741 750.93 6528.870 10569.61 16276.73 33494.15 0.993212
hardship_last_payment_amount 589.0 192.151002 207.123799 0.02 35.240 121.84 285.47 1223.33 0.993212
settlement_amount 87.0 5355.673333 4012.832699 508.00 2082.000 4351.00 7535.50 17191.00 0.998997
settlement_percentage 87.0 53.376092 4.694325 49.99 50.000 50.01 55.00 65.00 0.998997
settlement_term 87.0 13.747126 4.394019 1.00 10.000 15.00 18.00 18.00 0.998997

83 rows × 9 columns

df.select_dtypes(include=['float']).describe().T[df.select_dtypes(include=['float']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x)))).isnull()]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max
loan_amnt NaN NaN NaN NaN NaN NaN NaN NaN
int_rate NaN NaN NaN NaN NaN NaN NaN NaN
annual_inc NaN NaN NaN NaN NaN NaN NaN NaN
loan_status NaN NaN NaN NaN NaN NaN NaN NaN
dti NaN NaN NaN NaN NaN NaN NaN NaN
delinq_2yrs NaN NaN NaN NaN NaN NaN NaN NaN
inq_last_6mths NaN NaN NaN NaN NaN NaN NaN NaN
mths_since_last_delinq NaN NaN NaN NaN NaN NaN NaN NaN
mths_since_last_record NaN NaN NaN NaN NaN NaN NaN NaN
open_acc NaN NaN NaN NaN NaN NaN NaN NaN
pub_rec NaN NaN NaN NaN NaN NaN NaN NaN
revol_bal NaN NaN NaN NaN NaN NaN NaN NaN
total_acc NaN NaN NaN NaN NaN NaN NaN NaN
collections_12_mths_ex_med NaN NaN NaN NaN NaN NaN NaN NaN
mths_since_last_major_derog NaN NaN NaN NaN NaN NaN NaN NaN
acc_now_delinq NaN NaN NaN NaN NaN NaN NaN NaN
tot_coll_amt NaN NaN NaN NaN NaN NaN NaN NaN
tot_cur_bal NaN NaN NaN NaN NaN NaN NaN NaN
open_acc_6m NaN NaN NaN NaN NaN NaN NaN NaN
open_act_il NaN NaN NaN NaN NaN NaN NaN NaN
open_il_12m NaN NaN NaN NaN NaN NaN NaN NaN
open_il_24m NaN NaN NaN NaN NaN NaN NaN NaN
mths_since_rcnt_il NaN NaN NaN NaN NaN NaN NaN NaN
total_bal_il NaN NaN NaN NaN NaN NaN NaN NaN
il_util NaN NaN NaN NaN NaN NaN NaN NaN
open_rv_12m NaN NaN NaN NaN NaN NaN NaN NaN
open_rv_24m NaN NaN NaN NaN NaN NaN NaN NaN
max_bal_bc NaN NaN NaN NaN NaN NaN NaN NaN
all_util NaN NaN NaN NaN NaN NaN NaN NaN
total_rev_hi_lim NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
num_actv_rev_tl NaN NaN NaN NaN NaN NaN NaN NaN
num_bc_sats NaN NaN NaN NaN NaN NaN NaN NaN
num_bc_tl NaN NaN NaN NaN NaN NaN NaN NaN
num_il_tl NaN NaN NaN NaN NaN NaN NaN NaN
num_op_rev_tl NaN NaN NaN NaN NaN NaN NaN NaN
num_rev_accts NaN NaN NaN NaN NaN NaN NaN NaN
num_rev_tl_bal_gt_0 NaN NaN NaN NaN NaN NaN NaN NaN
num_sats NaN NaN NaN NaN NaN NaN NaN NaN
num_tl_120dpd_2m NaN NaN NaN NaN NaN NaN NaN NaN
num_tl_30dpd NaN NaN NaN NaN NaN NaN NaN NaN
num_tl_90g_dpd_24m NaN NaN NaN NaN NaN NaN NaN NaN
num_tl_op_past_12m NaN NaN NaN NaN NaN NaN NaN NaN
pct_tl_nvr_dlq NaN NaN NaN NaN NaN NaN NaN NaN
percent_bc_gt_75 NaN NaN NaN NaN NaN NaN NaN NaN
pub_rec_bankruptcies NaN NaN NaN NaN NaN NaN NaN NaN
tax_liens NaN NaN NaN NaN NaN NaN NaN NaN
tot_hi_cred_lim NaN NaN NaN NaN NaN NaN NaN NaN
total_bal_ex_mort NaN NaN NaN NaN NaN NaN NaN NaN
total_bc_limit NaN NaN NaN NaN NaN NaN NaN NaN
total_il_high_credit_limit NaN NaN NaN NaN NaN NaN NaN NaN
deferral_term NaN NaN NaN NaN NaN NaN NaN NaN
hardship_amount NaN NaN NaN NaN NaN NaN NaN NaN
hardship_length NaN NaN NaN NaN NaN NaN NaN NaN
hardship_dpd NaN NaN NaN NaN NaN NaN NaN NaN
orig_projected_additional_accrued_interest NaN NaN NaN NaN NaN NaN NaN NaN
hardship_payoff_balance_amount NaN NaN NaN NaN NaN NaN NaN NaN
hardship_last_payment_amount NaN NaN NaN NaN NaN NaN NaN NaN
settlement_amount NaN NaN NaN NaN NaN NaN NaN NaN
settlement_percentage NaN NaN NaN NaN NaN NaN NaN NaN
settlement_term NaN NaN NaN NaN NaN NaN NaN NaN

80 rows × 8 columns

处理整数变量的列

df.select_dtypes(include=['int']).describe().T.\
assign(missing_pct=df.apply(lambda x : (len(x)-x.count())/float(len(x))))
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max missing_pct
emp_length 86770.0 5.793604 3.761932 0.0 2.0 6.0 10.0 10.0 0.0

处理高度线性相关的列

cor = df.corr()
cor.loc[:,:] = np.tril(cor,k=-1)
cor
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
loan_amnt funded_amnt funded_amnt_inv int_rate installment emp_length annual_inc loan_status dti delinq_2yrs ... deferral_term hardship_amount hardship_length hardship_dpd orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount settlement_amount settlement_percentage settlement_term
loan_amnt 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
funded_amnt 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
funded_amnt_inv 0.999993 0.999993 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
int_rate 0.183818 0.183818 0.183990 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
installment 0.953658 0.953658 0.953570 0.207638 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
emp_length 0.110753 0.110753 0.110799 -0.018173 0.096081 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
annual_inc 0.314069 0.314069 0.314033 -0.066962 0.295867 0.094171 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
loan_status -0.013152 -0.013152 -0.013179 -0.080925 -0.018661 0.013513 0.016563 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
dti 0.038334 0.038334 0.038344 0.167446 0.048354 0.011855 -0.145502 -0.022552 0.000000 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
delinq_2yrs -0.008369 -0.008369 -0.008387 0.029667 -0.002644 0.026023 0.037214 -0.020069 -0.012498 0.000000 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
inq_last_6mths -0.016169 -0.016169 -0.016198 0.170607 0.007442 0.001574 0.034509 -0.019708 -0.001975 0.025702 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
mths_since_last_delinq 0.002074 0.002074 0.002101 -0.018567 -0.004217 -0.016531 -0.033823 0.013422 0.009132 -0.551436 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
mths_since_last_record -0.015390 -0.015390 -0.015358 -0.013327 -0.023256 0.027087 -0.080178 -0.000778 0.049695 -0.075914 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_acc 0.189283 0.189283 0.189256 -0.006746 0.176980 0.051260 0.140723 -0.006664 0.261796 0.056525 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
pub_rec -0.039875 -0.039875 -0.039901 0.049172 -0.029240 0.012645 0.000980 -0.011186 -0.051349 -0.039383 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
revol_bal 0.323861 0.323861 0.323821 -0.023577 0.306485 0.088163 0.308800 0.018380 0.131510 -0.025537 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
total_acc 0.207745 0.207745 0.207719 -0.046995 0.183350 0.088303 0.176989 0.003432 0.219191 0.121100 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
collections_12_mths_ex_med -0.020581 -0.020581 -0.020595 0.015218 -0.019196 -0.009854 -0.010283 -0.011052 -0.006429 0.082467 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
mths_since_last_major_derog 0.029221 0.029221 0.029253 -0.010362 0.023953 0.011823 -0.006633 0.008867 0.022977 -0.439524 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
acc_now_delinq -0.000643 -0.000643 -0.000652 0.011462 0.000688 0.012119 0.011594 -0.000940 -0.002557 0.122841 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
tot_coll_amt -0.024839 -0.024839 -0.024834 0.007813 -0.020357 0.002719 -0.003110 0.003976 -0.016754 -0.003642 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
tot_cur_bal 0.306866 0.306866 0.306888 -0.067493 0.270260 0.117314 0.392770 0.022304 0.022050 0.056121 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_acc_6m -0.017690 -0.017690 -0.017694 0.155748 0.003937 0.017539 0.046791 -0.024538 0.040895 0.000350 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_act_il 0.021932 0.021932 0.021922 0.031291 0.016411 -0.087070 0.072141 -0.005725 0.222380 0.071720 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_il_12m 0.009230 0.009230 0.009251 0.185548 0.023938 0.042698 0.093255 -0.021131 0.166589 -0.005855 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_il_24m 0.037502 0.037502 0.037524 0.163407 0.044889 0.046481 0.115768 -0.021197 0.229569 -0.023145 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
mths_since_rcnt_il -0.026022 -0.026022 -0.026030 -0.097319 -0.027290 -0.033338 -0.076912 0.015779 -0.244153 0.011909 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
total_bal_il 0.145235 0.145235 0.145235 0.037161 0.131976 -0.023716 0.224390 -0.000659 0.239179 0.059929 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
il_util -0.088801 -0.088801 -0.088763 0.146080 -0.080543 -0.088790 -0.034212 -0.027497 -0.009064 -0.006509 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
open_rv_12m -0.039301 -0.039301 -0.039299 0.134080 -0.010669 0.004968 -0.000835 -0.028553 0.010479 -0.026754 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
num_actv_rev_tl 0.175949 0.175949 0.175932 0.056059 0.176815 0.109311 0.088747 -0.019766 0.229290 -0.003791 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_bc_sats 0.236664 0.236664 0.236621 -0.045445 0.224395 0.067978 0.135832 -0.001078 0.110538 -0.035852 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_bc_tl 0.221586 0.221586 0.221538 -0.085157 0.206882 0.092169 0.145879 0.007720 0.083480 0.029768 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_il_tl 0.071276 0.071276 0.071275 0.015775 0.056311 -0.021343 0.103582 -0.006861 0.209211 0.092056 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_op_rev_tl 0.180233 0.180233 0.180201 -0.016180 0.174405 0.097503 0.085403 -0.007875 0.168069 0.008976 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_rev_accts 0.187665 0.187665 0.187622 -0.064875 0.173955 0.116064 0.110810 0.005822 0.135595 0.075864 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_rev_tl_bal_gt_0 0.173031 0.173031 0.173023 0.057363 0.175158 0.109516 0.087066 -0.019129 0.232890 -0.006032 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_sats 0.189243 0.189243 0.189216 -0.006886 0.177005 0.050780 0.140348 -0.006524 0.261779 0.054582 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_tl_120dpd_2m -0.003400 -0.003400 -0.003416 0.001534 -0.003172 0.003219 0.003622 0.002894 -0.011306 0.046662 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_tl_30dpd 0.005174 0.005174 0.005163 0.010414 0.006622 0.012415 0.012151 -0.001511 0.002556 0.102156 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_tl_90g_dpd_24m -0.023556 -0.023556 -0.023574 0.021237 -0.019500 -0.002221 0.006915 -0.012634 -0.015192 0.670425 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
num_tl_op_past_12m -0.014006 -0.014006 -0.013991 0.196019 0.013791 0.027795 0.059326 -0.032077 0.080556 -0.029705 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
pct_tl_nvr_dlq 0.083337 0.083337 0.083360 -0.041184 0.068789 -0.024941 -0.008379 0.022224 0.086247 -0.460886 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
percent_bc_gt_75 0.035651 0.035651 0.035702 0.196573 0.051238 0.032716 0.003882 -0.017767 0.145310 -0.004343 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
pub_rec_bankruptcies -0.070945 -0.070945 -0.070962 0.051258 -0.062869 0.003886 -0.044893 -0.011858 -0.030585 -0.073138 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
tax_liens 0.015755 0.015755 0.015733 0.015073 0.021572 0.009393 0.039479 -0.006130 -0.035036 0.004700 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
tot_hi_cred_lim 0.332977 0.332977 0.332989 -0.097597 0.293538 0.130913 0.411099 0.025668 0.034374 0.057963 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
total_bal_ex_mort 0.270811 0.270811 0.270801 0.021583 0.252577 0.018146 0.336079 0.008061 0.262438 0.038416 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
total_bc_limit 0.372026 0.372026 0.371955 -0.185556 0.334458 0.066976 0.276032 0.030958 0.049492 -0.082722 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
total_il_high_credit_limit 0.203510 0.203510 0.203500 0.013691 0.186510 0.008036 0.283449 0.004152 0.296579 0.066601 ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
deferral_term NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
hardship_amount 0.827591 0.827591 0.827475 0.639130 0.781581 0.035424 0.303210 -0.010632 0.083784 -0.006715 ... NaN 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
hardship_length NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
hardship_dpd 0.034837 0.034837 0.034631 0.142229 0.051037 -0.048772 -0.007155 -0.288293 0.017312 0.049492 ... NaN 0.062390 NaN 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0
orig_projected_additional_accrued_interest 0.829480 0.829480 0.829360 0.633758 0.783187 0.042427 0.312575 -0.017930 0.079196 0.001766 ... NaN 1.000000 NaN 0.078457 0.000000 0.000000 0.0 0.000000 0.000000 0.0
hardship_payoff_balance_amount 0.958502 0.958502 0.958435 0.324071 0.862753 0.072547 0.432046 0.041810 0.003633 0.001576 ... NaN 0.900165 NaN 0.011203 0.901693 0.000000 0.0 0.000000 0.000000 0.0
hardship_last_payment_amount 0.489119 0.489119 0.489032 0.190434 0.493799 0.002568 0.284256 0.004410 0.002036 0.028093 ... NaN 0.405107 NaN 0.070852 0.394808 0.459846 0.0 0.000000 0.000000 0.0
settlement_amount 0.917878 0.917878 0.917875 0.388751 0.814741 0.079187 0.492165 0.068815 0.082613 0.059831 ... NaN 1.000000 NaN NaN 1.000000 1.000000 -1.0 0.000000 0.000000 0.0
settlement_percentage -0.226310 -0.226310 -0.226299 -0.148911 -0.224885 -0.095090 -0.051463 0.362477 -0.097457 -0.051380 ... NaN NaN NaN NaN NaN NaN NaN 0.018362 0.000000 0.0
settlement_term 0.575292 0.575292 0.575291 0.286584 0.507058 0.111925 0.226042 -0.263670 0.318564 0.026849 ... NaN -1.000000 NaN NaN -1.000000 -1.000000 1.0 0.563137 -0.050615 0.0

84 rows × 84 columns

cor.stack()
loan_amnt        loan_amnt                                     0.000000
                 funded_amnt                                   0.000000
                 funded_amnt_inv                               0.000000
                 int_rate                                      0.000000
                 installment                                   0.000000
                 emp_length                                    0.000000
                 annual_inc                                    0.000000
                 loan_status                                   0.000000
                 dti                                           0.000000
                 delinq_2yrs                                   0.000000
                 inq_last_6mths                                0.000000
                 mths_since_last_delinq                        0.000000
                 mths_since_last_record                        0.000000
                 open_acc                                      0.000000
                 pub_rec                                       0.000000
                 revol_bal                                     0.000000
                 total_acc                                     0.000000
                 collections_12_mths_ex_med                    0.000000
                 mths_since_last_major_derog                   0.000000
                 acc_now_delinq                                0.000000
                 tot_coll_amt                                  0.000000
                 tot_cur_bal                                   0.000000
                 open_acc_6m                                   0.000000
                 open_act_il                                   0.000000
                 open_il_12m                                   0.000000
                 open_il_24m                                   0.000000
                 mths_since_rcnt_il                            0.000000
                 total_bal_il                                  0.000000
                 il_util                                       0.000000
                 open_rv_12m                                   0.000000
                                                                 ...   
settlement_term  mths_since_recent_inq                         0.030340
                 mths_since_recent_revol_delinq                0.103703
                 num_accts_ever_120_pd                         0.027002
                 num_actv_bc_tl                                0.181307
                 num_actv_rev_tl                               0.220559
                 num_bc_sats                                   0.162077
                 num_bc_tl                                     0.178956
                 num_il_tl                                     0.114414
                 num_op_rev_tl                                 0.166385
                 num_rev_accts                                 0.197364
                 num_rev_tl_bal_gt_0                           0.247732
                 num_sats                                      0.193849
                 num_tl_30dpd                                 -0.096457
                 num_tl_90g_dpd_24m                            0.068411
                 num_tl_op_past_12m                            0.110434
                 pct_tl_nvr_dlq                                0.121179
                 percent_bc_gt_75                              0.027080
                 pub_rec_bankruptcies                         -0.080801
                 tax_liens                                     0.039254
                 tot_hi_cred_lim                               0.213624
                 total_bal_ex_mort                             0.219922
                 total_bc_limit                                0.184810
                 total_il_high_credit_limit                    0.218572
                 hardship_amount                              -1.000000
                 orig_projected_additional_accrued_interest   -1.000000
                 hardship_payoff_balance_amount               -1.000000
                 hardship_last_payment_amount                  1.000000
                 settlement_amount                             0.563137
                 settlement_percentage                        -0.050615
                 settlement_term                               0.000000
Length: 6876, dtype: float64
cor[(cor>0.55) | (cor<-0.55)].stack()
funded_amnt                                 loan_amnt                                     1.000000
funded_amnt_inv                             loan_amnt                                     0.999993
                                            funded_amnt                                   0.999993
installment                                 loan_amnt                                     0.953658
                                            funded_amnt                                   0.953658
                                            funded_amnt_inv                               0.953570
mths_since_last_delinq                      delinq_2yrs                                  -0.551436
total_acc                                   open_acc                                      0.723418
mths_since_last_major_derog                 mths_since_last_delinq                        0.691480
open_il_24m                                 open_il_12m                                   0.758545
total_bal_il                                open_act_il                                   0.567060
open_rv_12m                                 open_acc_6m                                   0.623141
open_rv_24m                                 open_rv_12m                                   0.776720
all_util                                    il_util                                       0.597521
total_rev_hi_lim                            revol_bal                                     0.815654
inq_last_12m                                inq_fi                                        0.559390
acc_open_past_24mths                        open_acc_6m                                   0.553789
                                            open_il_24m                                   0.570438
                                            open_rv_12m                                   0.659344
                                            open_rv_24m                                   0.847494
avg_cur_bal                                 tot_cur_bal                                   0.828143
bc_open_to_buy                              total_rev_hi_lim                              0.624163
bc_util                                     all_util                                      0.572605
mo_sin_rcnt_tl                              mo_sin_rcnt_rev_tl_op                         0.604777
mths_since_recent_bc                        mo_sin_rcnt_rev_tl_op                         0.615410
mths_since_recent_bc_dlq                    mths_since_last_delinq                        0.750604
                                            mths_since_last_major_derog                   0.560775
mths_since_recent_revol_delinq              mths_since_last_delinq                        0.852881
                                            mths_since_recent_bc_dlq                      0.879875
num_actv_bc_tl                              open_acc                                      0.553697
                                                                                            ...   
hardship_amount                             installment                                   0.781581
orig_projected_additional_accrued_interest  loan_amnt                                     0.829480
                                            funded_amnt                                   0.829480
                                            funded_amnt_inv                               0.829360
                                            int_rate                                      0.633758
                                            installment                                   0.783187
                                            hardship_amount                               1.000000
hardship_payoff_balance_amount              loan_amnt                                     0.958502
                                            funded_amnt                                   0.958502
                                            funded_amnt_inv                               0.958435
                                            installment                                   0.862753
                                            hardship_amount                               0.900165
                                            orig_projected_additional_accrued_interest    0.901693
settlement_amount                           loan_amnt                                     0.917878
                                            funded_amnt                                   0.917878
                                            funded_amnt_inv                               0.917875
                                            installment                                   0.814741
                                            mths_since_last_record                       -0.600571
                                            hardship_amount                               1.000000
                                            orig_projected_additional_accrued_interest    1.000000
                                            hardship_payoff_balance_amount                1.000000
                                            hardship_last_payment_amount                 -1.000000
settlement_term                             loan_amnt                                     0.575292
                                            funded_amnt                                   0.575292
                                            funded_amnt_inv                               0.575291
                                            hardship_amount                              -1.000000
                                            orig_projected_additional_accrued_interest   -1.000000
                                            hardship_payoff_balance_amount               -1.000000
                                            hardship_last_payment_amount                  1.000000
                                            settlement_amount                             0.563137
Length: 125, dtype: float64
df.drop(['funded_amnt','funded_amnt_inv', 'installment'], axis=1, inplace=True)

建立模型

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble
from sklearn.preprocessing import OneHotEncoder
Y = df.loan_status
X = df.drop('loan_status',1,inplace=False)
print(Y.shape)
print(X.shape)
(86770,)
(86770, 89)
X = pd.get_dummies(X)
print(X.shape)
(86770, 100)
X.fillna(0.0,inplace=True)
X.fillna(0,inplace=True)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=666)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(60739, 100)
(60739,)
(26031, 100)
(26031,)
print (y_train.value_counts())
print (y_test.value_counts())
1.0    59148
0.0     1591
Name: loan_status, dtype: int64
1.0    25348
0.0      683
Name: loan_status, dtype: int64
param_grid = {'learning_rate': np.linspace(0.01,0.5,5),
              'max_depth': [i for i in range(2,6)],
              'min_samples_split': [50,100],
              'n_estimators': [100,200]
              }
grid_search = GridSearchCV(ensemble.GradientBoostingRegressor(),
                   param_grid, n_jobs=4, refit=True,verbose=1)
%%time
grid_search.fit(x_train,y_train)
bst_clf = grid_search.best_estimator_
Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  5.3min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 31.5min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 40.6min finished


Wall time: 41min 17s
%%time
bst_clf.fit(x_train, y_train)
Wall time: 43.2 s





GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.01, loss='ls', max_depth=4, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=100, min_weight_fraction_leaf=0.0,
             n_estimators=200, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)
bst_clf.score(x_test,y_test)
0.07240211533402607
def compute_ks(data):

    sorted_list = data.sort_values(['predict'], ascending=[True])

    total_bad = sorted_list['label'].sum(axis=None, skipna=None, level=None, numeric_only=None) / 3
    total_good = sorted_list.shape[0] - total_bad

    max_ks = 0.0
    good_count = 0.0
    bad_count = 0.0
    for index, row in sorted_list.iterrows():
        if row['label'] == 3:
            bad_count += 1.0
        else:
            good_count += 1.0

        val = bad_count/total_bad - good_count/total_good
        max_ks = max(max_ks, val)

    return max_ks
test_pd = pd.DataFrame()
test_pd['predict'] = bst_clf.predict(x_test)
test_pd['label'] = y_test
print( compute_ks(test_pd[['label','predict']]))
0.0
feature_importance = bst_clf.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices],color='dodgerblue',alpha=.4)
plt.yticks(np.arange(10 + 0.25), np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')

png

About

应用于p2p公司,用于判断借贷人的风险程度,判断借贷人能否及时还欠款

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages