Doubts on the evaluation. #8

Open
songtaoshi opened this issue Nov 20, 2018 · 3 comments

Comments

@songtaoshi

Hello Zhou, thanks a lot for your contribution on this fine-tuning work. I have a question about the evaluation metrics: it seems that precision and recall are evaluated separately for each tag (B-person, I-person, B-MISC, I-MISC, ...). If so, the reported results may not be accurate. Thanks a lot!
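
For illustration, a minimal sketch of the concern (it uses the third-party seqeval package and toy labels, neither of which comes from this repository): per-tag, token-level scores can look much better than the entity-level scores the official CoNLL evaluation reports.

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy gold and predicted label sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-MISC"],
          ["B-LOC", "O", "O", "O"]]
y_pred = [["B-PER", "O", "O", "B-MISC"],   # the I-PER token is missed, so the whole PER span is wrong
          ["B-LOC", "O", "O", "O"]]

# Entity-level (CoNLL-style) scoring: the partially matched PER span counts as an error.
print("entity-level P/R/F1:",
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))          # 0.667 / 0.667 / 0.667

# Token-level scoring treats the same prediction as 7 of 8 correct tokens.
flat_true = [t for sent in y_true for t in sent]
flat_pred = [t for sent in y_pred for t in sent]
print("token-level accuracy:", sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true))  # 0.875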

@kyzhouhzau
Owner

kyzhouhzau commented Nov 21, 2018

@songtaoshi
Yes, you are right. I have updated the script to write the test predictions to result files, so that the official script can be used for evaluation. When I have time, I will update the reported results.
Thanks a lot!
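
For reference, a minimal sketch of what such a result file is assumed to look like (the helper name and the per-sentence data structure below are hypothetical, not taken from BERT_NER.py): conlleval.pl expects one token per line with the gold label and the predicted label as the last two columns, and a blank line between sentences.

def write_conlleval_file(path, sentences):
    # `sentences` is an iterable of (tokens, gold_labels, pred_labels) triples
    # of equal length per sentence; this layout is what the official script scores.
    with open(path, "w", encoding="utf-8") as f:
        for tokens, gold, pred in sentences:
            for tok, g, p in zip(tokens, gold, pred):
                f.write(f"{tok} {g} {p}\n")
            f.write("\n")  # sentence boundary

# The file can then be scored with the official script, e.g.:
#   perl conlleval.pl < label_test.txt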

@FallakAsad

@kyzhouhzau After training, I ran the script with do_train=False, do_eval=True and do_predict=True. My dev.txt and test.txt contain the same data the model was trained on (i.e. train.txt, dev.txt and test.txt are identical). However, the evaluation results show:
***** Eval results *****
BERT_NER.py:687] ***********************************************
BERT_NER.py:688] P = 0.9166096085894354*
BERT_NER.py:689] R = 0.9166096085894354*
BERT_NER.py:690] F = 0.9166096085889771*

But if I run conlleval.pl on the label_test.txt file generated by the script, I see the following results:
processed 139671 tokens with 9649 phrases; found: 9650 phrases; correct: 9648.
accuracy: 100.00%; precision: 99.98%; recall: 99.99%; FB1: 99.98
label_1: precision: 100.00%; recall: 100.00%; FB1: 100.00 1728
label_2: precision: 100.00%; recall: 100.00%; FB1: 100.00 370
label_3: precision: 100.00%; recall: 100.00%; FB1: 100.00 2258
label_4: precision: 100.00%; recall: 100.00%; FB1: 100.00 706
label_5: precision: 100.00%; recall: 100.00%; FB1: 100.00 729
label_6: precision: 99.73%; recall: 99.86%; FB1: 99.80 736
label_7: precision: 100.00%; recall: 100.00%; FB1: 100.00 911
label_8: precision: 100.00%; recall: 100.00%; FB1: 100.00 412
label_9: precision: 100.00%; recall: 100.00%; FB1: 100.00 1375
label_10: precision: 100.00%; recall: 100.00%; FB1: 100.00 425

Why are the precision, recall and F score from the evaluation step so different from the scores conlleval.pl reports on the predictions, even though both were computed on the same dataset?
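
One possible explanation (an assumption, not something confirmed in this thread): if the in-graph metric is micro-averaged over every token position, including padding and special labels such as [CLS]/[SEP] or the sub-word "X" label, it collapses to plain token accuracy, while conlleval.pl only scores the entity spans in label_test.txt. A rough sketch of that effect:

def micro_prf(gold_labels, pred_labels):
    # Micro-averaged over all classes, precision == recall == F1 == token accuracy,
    # which would explain why P, R and F above are (almost) identical.
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    acc = correct / len(gold_labels)
    return acc, acc, acc

# Toy example: the entity tokens are all correct, but some special/padding
# positions are mislabeled, so the micro score drops even though an
# entity-level evaluation like conlleval.pl would report 100%.
gold = ["B-PER", "I-PER", "O", "[SEP]", "[PAD]", "[PAD]"]
pred = ["B-PER", "I-PER", "O", "O",     "O",     "[PAD]"]
print(micro_prf(gold, pred))   # (0.667, 0.667, 0.667)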

@lyyang01

(quoting @FallakAsad's comment above)

Hi, did you solve this problem? I ran into the same issue: I use the same dataset for evaluation and prediction, yet the two results are very different, and I don't know why.
