Deep Ensemble Shape Calibration

Deep Ensemble Shape Calibration: Multiple Fields Post-hoc Calibration in Online Advertising

Introduction

We propose DESC (Deep Ensemble Shape Calibration), a new method for calibrating predictions on the CTR prediction task at Shopee.

Requirements and Installation

We recommend the following dependencies.

  • Python 3.8
  • PyTorch 1.8.0
  • See requirements.txt for the full list of dependencies

Download public data

  1. The CRITEO data set can be downloaded from this link.
  2. The AliCCP data set can be downloaded from this link.
  3. The industrial data set will be published soon.

Non-calibrated model

We use DeepFM to train the non-calibrated model on all fields in the training data set. The DeepFM code can be downloaded from here. After training, we use the DeepFM model to predict non-calibrated scores (pCTR) for all samples in the training, validation, and test data.
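The scoring step above can be sketched as follows. This is a minimal illustration, not the repository's code: `append_pctr` and its `predict` callable are hypothetical names, and the only thing taken from this README is that the training script reads the raw score from a column named `pctr` and the label from a column named `click`.

```python
import csv
import io
from typing import Callable, Dict, TextIO


def append_pctr(src: TextIO, dst: TextIO,
                predict: Callable[[Dict[str, str]], float],
                pctr_col: str = "pctr") -> int:
    """Copy CSV rows from src to dst, adding the model's raw score as pctr_col.

    `predict` stands in for the trained DeepFM model (an assumption);
    it receives one row as a dict and returns a probability.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [pctr_col]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    n_rows = 0
    for row in reader:
        row[pctr_col] = f"{predict(row):.6f}"
        writer.writerow(row)
        n_rows += 1
    return n_rows


if __name__ == "__main__":
    # Toy input with one feature field and the click label.
    src = io.StringIO("C1,click\na,1\nb,0\n")
    dst = io.StringIO()
    append_pctr(src, dst, predict=lambda row: 0.25)
    print(dst.getvalue())
```

The same helper would be run once per split (training, validation, test) so that every file carries the `pctr` column before preprocessing.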

Preprocess data

#!/bin/bash
python3 preprocess/split_pctr.py  # input: train.csv and test.csv; output: pctr_split.json (pCTR information for 100 bins)
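To give a feel for what a 100-bin pCTR split could look like, here is a rough sketch. It assumes equal-frequency binning, which this README does not confirm, and the function names and JSON layout are hypothetical. The actual contents of pctr_split.json are defined by preprocess/split_pctr.py.

```python
import json
from bisect import bisect_right
from typing import List


def equal_frequency_edges(scores: List[float], n_bins: int = 100) -> List[float]:
    """Return n_bins + 1 edges so each bin holds roughly the same number of scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    last = len(ordered) - 1
    # Edge k is the score at quantile k / n_bins (nearest rank, no interpolation).
    return [ordered[(k * last) // n_bins] for k in range(n_bins + 1)]


def bin_index(score: float, edges: List[float]) -> int:
    """Map a score to a bin id in [0, len(edges) - 2].

    Only the inner edges are searched, so scores outside the observed
    range fall into the first or last bin.
    """
    return bisect_right(edges, score, 1, len(edges) - 1) - 1


def to_split_json(edges: List[float]) -> str:
    """Serialize the edges (hypothetical layout, not the repository's format)."""
    return json.dumps({"n_bins": len(edges) - 1, "edges": edges})


if __name__ == "__main__":
    pctr = [i / 1000 for i in range(1000)]
    edges = equal_frequency_edges(pctr, n_bins=10)
    print(bin_index(0.5, edges), to_split_json(edges)[:40])
```

Equal-frequency bins keep a comparable number of samples per bin even when pCTR values are heavily skewed toward zero, which is typical for CTR data. Equal-width bins would be an equally valid reading of "100 bin".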

Train DESC model

#!/bin/bash

set -x

cd DESC

train_path=$1  # train.csv
test_path=$2   # test.csv
pctr_info_path=$3  # pctr_split.json (obtained in preprocess step)
output_model_folder=$4  # output model folder
outpath=$5  # output path (res.csv); adds a new column with the calibrated score
need_keys='101 121 122 124 125 126 127 128 129 205 206 207 210 216 508 509 702 853 301 109_14 110_14 127_14 150_14' # for AliCCP data
#need_keys='C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26'  # for CRITEO data

CUDA_VISIBLE_DEVICES=0 python -u train_desc.py \
    --sample-path ${train_path} \
    --test-sample-path ${test_path} \
    --split-path ${pctr_info_path} \
    --need-keys ${need_keys} \
    --label-name 'click' \
    --pctr-label-name 'pctr' \
    --batch-size 16384 \
    --epoches 1 \
    --workers 0 \
    --learning-rate 1e-3 \
    --model-folder ${output_model_folder} \
    --emb-size 128 \
    --eval-freq 1.1 \
    --dropout 0.2 \
    --lambda-v 1.0 \
    --seed 44 \
    --outpath ${outpath} \
    --fc-hidden-size-str '128,64,32,1'

Evaluate results

#!/bin/bash

set -x

inpath=$1  # input 'res.csv'
eval_field_name='C1'
# eval_field_name can be any one field in "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26" for CRITEO data
# or any one field in "101 121 122 124 125 126 127 128 129 205 206 207 210 216 508 509 702 853 301 109_14 110_14 127_14 150_14" for AliCCP data
python3 eval/eval_res.py ${inpath} ${eval_field_name}
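
As an illustration of what evaluating calibration on a single field can involve, the sketch below groups samples by the value of one field and compares predicted against observed clicks. The metrics actually computed by eval/eval_res.py are not listed in this README, so the function name, the input layout, and the two metrics here (per-value PCOC and a sample-weighted relative error) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def field_calibration(rows: Iterable[Tuple[str, int, float]],
                      eps: float = 1e-9) -> Tuple[Dict[str, float], float]:
    """Summarize calibration per value of one field.

    rows: (field_value, click, calibrated_score) triples.
    Returns (pcoc_by_value, weighted_rel_error), where PCOC is the sum of
    predicted scores divided by the sum of observed clicks, and the error is
    |clicks - predictions| / clicks averaged over values, weighted by count.
    """
    counts: Dict[str, int] = defaultdict(int)
    clicks: Dict[str, float] = defaultdict(float)
    preds: Dict[str, float] = defaultdict(float)
    for value, click, score in rows:
        counts[value] += 1
        clicks[value] += click
        preds[value] += score
    total = sum(counts.values())
    if total == 0:
        raise ValueError("rows must be non-empty")
    # eps guards against field values that have no clicks at all.
    pcoc = {v: preds[v] / (clicks[v] + eps) for v in counts}
    rel_err = sum(
        counts[v] * abs(clicks[v] - preds[v]) / (clicks[v] + eps) for v in counts
    ) / total
    return pcoc, rel_err


if __name__ == "__main__":
    demo = [("a", 1, 0.75), ("a", 0, 0.75),
            ("b", 1, 0.2), ("b", 1, 0.8), ("b", 0, 0.5), ("b", 0, 0.5)]
    pcoc, err = field_calibration(demo)
    print(pcoc, err)
```

A PCOC near 1 for every field value, together with a small weighted error, suggests the calibrated scores match observed click rates across that field rather than only on average over the whole data set.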