Doubts on the evaluation. #8

Open
songtaoshi opened this issue Nov 20, 2018 · 3 comments

Comments

@songtaoshi

Hello Zhou, thanks a lot for your contribution on this fine-tuning work. I have a question about the evaluation metrics: it seems that precision and recall are evaluated separately for each tag (B-person, I-person, B-MISC, I-MISC, ...). If so, the reported results may not be accurate. Thanks a lot!
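
For illustration, a minimal sketch of the concern (it uses the third-party seqeval package and toy labels, neither of which comes from this repository): per-tag, token-level scores can look much better than the entity-level scores the official CoNLL evaluation reports.

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy gold and predicted label sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-MISC"],
          ["B-LOC", "O", "O", "O"]]
y_pred = [["B-PER", "O", "O", "B-MISC"],   # the I-PER token is missed, so the whole PER span is wrong
          ["B-LOC", "O", "O", "O"]]

# Entity-level (CoNLL-style) scoring: the partially matched PER span counts as an error.
print("entity-level P/R/F1:",
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))          # 0.667 / 0.667 / 0.667

# Token-level scoring treats the same prediction as 7 of 8 correct tokens.
flat_true = [t for sent in y_true for t in sent]
flat_pred = [t for sent in y_pred for t in sent]
print("token-level accuracy:", sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true))  # 0.875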

@kyzhouhzau
Owner

kyzhouhzau commented Nov 21, 2018

@songtaoshi
Yes, you are right. I have updated the script to write the test predictions to result files, so that the official script can be used for evaluation. When I have time, I will update the reported results.
Thanks a lot!
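
For reference, a minimal sketch of what such a result file is assumed to look like (the helper name and the per-sentence data structure below are hypothetical, not taken from BERT_NER.py): conlleval.pl expects one token per line with the gold label and the predicted label as the last two columns, and a blank line between sentences.

def write_conlleval_file(path, sentences):
    # `sentences` is an iterable of (tokens, gold_labels, pred_labels) triples
    # of equal length per sentence; this layout is what the official script scores.
    with open(path, "w", encoding="utf-8") as f:
        for tokens, gold, pred in sentences:
            for tok, g, p in zip(tokens, gold, pred):
                f.write(f"{tok} {g} {p}\n")
            f.write("\n")  # sentence boundary

# The file can then be scored with the official script, e.g.:
#   perl conlleval.pl < label_test.txt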

@FallakAsad

@kyzhouhzau After training, I ran the script with do_train=False, do_eval=True and do_predict=True. My dev.txt and test.txt contain the same data the model was trained on (i.e. train.txt, dev.txt and test.txt are identical). However, the evaluation results show:
***** Eval results *****
BERT_NER.py:687] ***********************************************
BERT_NER.py:688] P = 0.9166096085894354*
BERT_NER.py:689] R = 0.9166096085894354*
BERT_NER.py:690] F = 0.9166096085889771*

But if I run conlleval.pl on the label_test.txt file generated by the script, I see the following results:
processed 139671 tokens with 9649 phrases; found: 9650 phrases; correct: 9648.
accuracy: 100.00%; precision: 99.98%; recall: 99.99%; FB1: 99.98
label_1: precision: 100.00%; recall: 100.00%; FB1: 100.00 1728
label_2: precision: 100.00%; recall: 100.00%; FB1: 100.00 370
label_3: precision: 100.00%; recall: 100.00%; FB1: 100.00 2258
label_4: precision: 100.00%; recall: 100.00%; FB1: 100.00 706
label_5: precision: 100.00%; recall: 100.00%; FB1: 100.00 729
label_6: precision: 99.73%; recall: 99.86%; FB1: 99.80 736
label_7: precision: 100.00%; recall: 100.00%; FB1: 100.00 911
label_8: precision: 100.00%; recall: 100.00%; FB1: 100.00 412
label_9: precision: 100.00%; recall: 100.00%; FB1: 100.00 1375
label_10: precision: 100.00%; recall: 100.00%; FB1: 100.00 425

Why are the precision, recall and F score from the evaluation step so different from the scores conlleval.pl reports on the predictions, even though both were computed on the same dataset?
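
One possible explanation (an assumption, not something confirmed in this thread): if the in-graph metric is micro-averaged over every token position, including padding and special labels such as [CLS]/[SEP] or the sub-word "X" label, it collapses to plain token accuracy, while conlleval.pl only scores the entity spans in label_test.txt. A rough sketch of that effect:

def micro_prf(gold_labels, pred_labels):
    # Micro-averaged over all classes, precision == recall == F1 == token accuracy,
    # which would explain why P, R and F above are (almost) identical.
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    acc = correct / len(gold_labels)
    return acc, acc, acc

# Toy example: the entity tokens are all correct, but some special/padding
# positions are mislabeled, so the micro score drops even though an
# entity-level evaluation like conlleval.pl would report 100%.
gold = ["B-PER", "I-PER", "O", "[SEP]", "[PAD]", "[PAD]"]
pred = ["B-PER", "I-PER", "O", "O",     "O",     "[PAD]"]
print(micro_prf(gold, pred))   # (0.667, 0.667, 0.667)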

@lyyang01

(quoting @FallakAsad's comment above)

Hi, did you solve this problem? I ran into the same issue: I use the same dataset for evaluation and prediction, yet the two results are very different, and I don't know why.
