Bias & Hate Classification using KoELECTRA and the Korean Hate Speech Dataset
Split | # of data |
---|---|
train | 7,896 |
validate | 471 |
test | 974 |
- Labels: Bias (gender, others, none), Hate (hate, offensive, none)
- torch==1.5.0
- transformers==2.11.0
- soynlp==0.0.493
- Joint architecture that predicts `bias` and `hate` simultaneously from the `[CLS]` token
  - loss = bias_coef * bias_loss + hate_coef * hate_loss (`bias_loss_coef` and `hate_loss_coef` are configurable)
  - See `ElectraForBiasClassification` in model.py
- The comment and title are concatenated as `[CLS] comment [SEP] title [SEP]` and fed to the model as input
- Preprocessing applies only simple steps: removing words enclosed in brackets such as `[]`, unifying quotation marks, removing unnecessary quotation marks, normalization, and so on
  - See the `preprocess` function in data_loader.py
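The preprocessing steps and input layout described above can be sketched roughly as follows. This is not the `preprocess` function from data_loader.py: `simple_preprocess` and `build_pair_text` are hypothetical names, the quote mapping is a guess, "removing unnecessary quotation marks" is interpreted here as collapsing runs of repeated quotes, and applying the same cleanup to the title is also an assumption.

```python
import re
import unicodedata

# Words wrapped in square brackets, e.g. a "[tag]" prefix (assumed pattern).
BRACKETED = re.compile(r"\[[^\[\]]*\]")
# Map curly quotes and backticks to plain ASCII quotes (assumed mapping).
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'", "`": "'",
})


def simple_preprocess(text: str) -> str:
    """Hypothetical, simplified text cleanup."""
    text = unicodedata.normalize("NFKC", text)    # Unicode normalization
    text = BRACKETED.sub(" ", text)               # drop bracketed words
    text = text.translate(QUOTE_MAP)              # unify quotation marks
    text = re.sub(r"([\"'])\1+", r"\1", text)     # collapse repeated quotes
    return re.sub(r"\s+", " ", text).strip()      # tidy whitespace


def build_pair_text(comment: str, title: str) -> str:
    """Visualize the [CLS] comment [SEP] title [SEP] layout as a string."""
    return f"[CLS] {simple_preprocess(comment)} [SEP] {simple_preprocess(title)} [SEP]"


print(build_pair_text("a [x] b", "t"))
```

In practice the tokenizer inserts the special tokens itself when given a text pair; the string built here only illustrates the ordering of comment and title.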
Parameter | Value |
---|---|
Batch Size | 16 |
Learning Rate | 5e-5 |
Epochs | 10 |
Warmup Proportion | 0.1 |
Max Seq Length | 100 |
Bias Loss Coefficient | 0.5 |
Hate Loss Coefficient | 1.0 |
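With the coefficients from the table above (0.5 for bias, 1.0 for hate), the joint loss could be sketched as below. This is a minimal illustration and not the repository's `ElectraForBiasClassification`: the `JointHeads` class, the hidden size of 8, and the random stand-in for the encoder's `[CLS]` output are all assumptions made so the example runs without downloading a model.

```python
import torch
import torch.nn as nn


class JointHeads(nn.Module):
    """Two linear classifiers sharing one [CLS] vector (illustrative only)."""

    def __init__(self, hidden_size=768, num_bias=3, num_hate=3,
                 bias_loss_coef=0.5, hate_loss_coef=1.0):
        super().__init__()
        self.bias_classifier = nn.Linear(hidden_size, num_bias)
        self.hate_classifier = nn.Linear(hidden_size, num_hate)
        self.bias_loss_coef = bias_loss_coef
        self.hate_loss_coef = hate_loss_coef
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, cls_vector, bias_labels, hate_labels):
        # Both heads read the same pooled representation.
        bias_logits = self.bias_classifier(cls_vector)
        hate_logits = self.hate_classifier(cls_vector)
        bias_loss = self.loss_fct(bias_logits, bias_labels)
        hate_loss = self.loss_fct(hate_logits, hate_labels)
        # Weighted sum of the two task losses.
        loss = self.bias_loss_coef * bias_loss + self.hate_loss_coef * hate_loss
        return loss, bias_logits, hate_logits


torch.manual_seed(0)
heads = JointHeads(hidden_size=8)
cls_vector = torch.randn(4, 8)             # stand-in for encoder output at [CLS]
bias_labels = torch.tensor([0, 1, 2, 2])   # e.g. indices into (none, gender, others)
hate_labels = torch.tensor([2, 0, 1, 2])   # e.g. indices into (none, offensive, hate)
loss, bias_logits, hate_logits = heads(cls_vector, bias_labels, hate_labels)
loss.backward()                            # gradients flow into both heads
```

Giving the bias loss a smaller coefficient makes the hate task contribute more to the shared update.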
The weighted F1 is computed for each category (Bias, Hate), and the arithmetic mean of the two scores is taken
- mean_weighted_f1 = (bias_weighted_f1 + hate_weighted_f1) / 2
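As a rough illustration of this metric, the sketch below computes a support-weighted F1 per category in plain Python and then averages the two. The `weighted_f1` helper and the toy labels are made up for the example. scikit-learn's `f1_score(..., average="weighted")` should give the same per-category number, though whether the repository uses it is an assumption.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Toy labels, not taken from the dataset.
bias_true = ["none", "gender", "others", "none"]
bias_pred = ["none", "gender", "none", "none"]
hate_true = ["offensive", "hate", "none", "none"]
hate_pred = ["offensive", "hate", "none", "none"]

bias_weighted_f1 = weighted_f1(bias_true, bias_pred)
hate_weighted_f1 = weighted_f1(hate_true, hate_pred)
mean_weighted_f1 = (bias_weighted_f1 + hate_weighted_f1) / 2
print(bias_weighted_f1, hate_weighted_f1, mean_weighted_f1)
```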
The model with the highest `mean_weighted_f1` on the dev dataset is saved as the final model
$ python3 main.py --model_type koelectra-base-v2 \
--model_name_or_path monologg/koelectra-base-v2-discriminator \
--model_dir ${MODEL_DIR} \
--prediction_file prediction.csv \
--do_train
Predictions for the test file are saved in CSV format
$ python3 main.py --model_type koelectra-base-v2 \
--model_name_or_path ${MODEL_DIR} \
--pred_dir preds \
--prediction_file prediction.csv \
--do_pred
bias,hate
none,offensive
gender,hate
none,none
others,none
...
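A prediction file in this two-column format can be inspected with the standard `csv` module. The sketch below parses only the four sample rows shown above from an in-memory string. To read a real file, the `io.StringIO` object would be replaced by an opened file handle, whose location depends on the `--pred_dir` and `--prediction_file` arguments.

```python
import csv
import io
from collections import Counter

# The four sample rows from the example output, held in memory.
sample = """bias,hate
none,offensive
gender,hate
none,none
others,none
"""

rows = list(csv.DictReader(io.StringIO(sample)))
bias_counts = Counter(row["bias"] for row in rows)
hate_counts = Counter(row["hate"] for row in rows)

print(dict(bias_counts))
print(dict(hate_counts))
```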
(This is a lightly built baseline, so there is room to improve the scores.)
(Weighted F1) | Bias F1 | Hate F1 | Mean F1 |
---|---|---|---|
Dev Dataset | 82.28 | 67.25 | 74.77 |