New ideas
sunlanchang committed May 23, 2020
1 parent 9eaa190 commit 0ec05a7
Showing 2 changed files with 11 additions and 10 deletions.
7 changes: 2 additions & 5 deletions README.md
@@ -3,7 +3,7 @@

# TODO

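-- [x] Traditional machine learning such as random forest, decision tree, SVM, naive Bayes, Bayesian network, logistic regression, AdaBoost, etc. (accuracy < 1.0)
+- [x] Traditional machine learning such as random forest, decision tree, SVM, naive Bayes, Bayesian network, logistic regression, AdaBoost, etc. (accuracy < 1.0; random forest: 0.89)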
- [ ] Apply a fully connected network directly to the categorical and numeric features
- [x] LightGBM
- [x] +Voting (accuracy: 0.91)
@@ -25,10 +25,6 @@
- [ ] DeepFM, DeepFFM, etc.
- [ ] Ensemble learning: use in the final stage of the competition to raise the score

-## Traditional machine learning
-
-- Random forest: 0.89
-
## Treating it as a sequence problem

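Treat each clicked creative_id or ad_id as a word, and the list of creative_id or ad_id values a user clicked within 90 days as a sentence, then use word2vec to build embeddings for creative_id or ad_id. Finally, apply simple statistical operations to obtain a vector representation of the user. This simple aggregation over the sequence loses information and is quite crude; methods such as attention need to be introduced.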
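
As a rough illustration of the "simple statistical operations" step, the sketch below mean-pools per-id embeddings into a single user vector. The embedding table, the ids, and the helper name `mean_pool` are hypothetical; in practice the table would come from a trained word2vec model. It also shows why this kind of aggregation is lossy: reordering the clicks gives the same vector.

```python
def mean_pool(sequence, embeddings, dim):
    """Average the embeddings of the ids in one user's click sequence.

    Ids that are missing from the embedding table are skipped.
    Returns a zero vector if no id in the sequence has an embedding.
    """
    total = [0.0] * dim
    count = 0
    for cid in sequence:
        vec = embeddings.get(cid)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Hypothetical 2-dimensional embeddings keyed by creative_id (as strings).
embeddings = {"11": [1.0, 0.0], "22": [0.0, 1.0], "33": [1.0, 1.0]}

# One user's click sequence; "99" has no embedding and is ignored.
user_clicks = ["11", "22", "22", "99"]

user_vec = mean_pool(user_clicks, embeddings, dim=2)
reversed_vec = mean_pool(list(reversed(user_clicks)), embeddings, dim=2)

print(user_vec)
print(reversed_vec)  # same as user_vec: click order is discarded by the mean
```

Mean pooling is only one choice here; max pooling or other statistics could be substituted in the same place.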
@@ -38,6 +34,7 @@
## TF-IDF

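A common NLP practice: treat the set of creative_id or ad_id values in a user's click sequence as a document and each creative_id or ad_id as a word in that document, then apply TF-IDF. The resulting feature dimension is of course very high; it can be reduced by tuning parameters, for example max_df and min_df of sklearn's TfidfVectorizer.
+- df (document frequency): how often a given creative_id appears across all users' creative_id sequences, i.e. the number (or fraction) of users whose sequence contains it.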
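
To make the effect of min_df and max_df concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not sklearn's implementation: the helper names and the toy corpus are hypothetical, and tokens are split on whitespace rather than with the regex `token_pattern` used in tf_idf.py. It assumes the same convention as the call in the diff, with an integer min_df read as an absolute document count and a float max_df read as a proportion of documents.

```python
from collections import Counter


def document_frequency(corpus):
    """For each token, count how many documents (users) contain it at least once."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.split()))
    return df


def filter_vocabulary(corpus, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency satisfies min_df <= df <= max_df * n_docs.

    min_df is an absolute count, max_df a proportion (an assumption of this sketch).
    """
    n_docs = len(corpus)
    df = document_frequency(corpus)
    return sorted(tok for tok, count in df.items()
                  if min_df <= count <= max_df * n_docs)


# Hypothetical toy corpus: one string of space-separated creative_id values per user.
corpus = ["11 22 22 33", "11 44", "11 22 55", "11 66"]

df = document_frequency(corpus)
vocab = filter_vocabulary(corpus, min_df=2, max_df=0.75)

print(dict(df))  # "11" appears for every user, "22" for two users, the rest for one
print(vocab)     # only "22" survives both thresholds
```

Raising min_df drops rare ids and lowering max_df drops ids that almost every user clicked, which is how the two parameters shrink the feature dimension.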

## DeepFM, DeepFFM, xDeepFM

14 changes: 9 additions & 5 deletions tf_idf.py
@@ -4,6 +4,7 @@
import pandas as pd
import numpy as np
import lightgbm as lgb
+from mail import mail
# %%
user = pd.read_csv(
    'data/train_preliminary/user.csv').sort_values(['user_id'], ascending=(True,))
@@ -32,12 +33,14 @@
# %%
vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b\w+\b",
-    min_df=1,
+    min_df=100,
    max_df=0.1,
    # max_features=128,
    dtype=np.float32,
)
all_data = vectorizer.fit_transform(corpus)
-print(all_data.shape)
+print('(examples, features)', all_data.shape)
+mail('train tfidf done!')
# %%
train_val = all_data[:train_examples, :]
# %%
@@ -134,10 +137,11 @@ def LGBM_age(epoch, early_stopping_rounds):


# %%
-gbm_gender = LGBM_gender(epoch=5000, early_stopping_rounds=1000)
+gbm_gender = LGBM_gender(epoch=1500, early_stopping_rounds=500)
# %%
-gbm_age = LGBM_age(epoch=5000, early_stopping_rounds=1000)
-
+mail('train gender done!')
+gbm_age = LGBM_age(epoch=2000, early_stopping_rounds=500)
+mail('train age done!')
# %%

