file not found #2

Open
amdongyang opened this issue Dec 30, 2021 · 4 comments

Comments

@amdongyang

I can't find the file /utility/WikiExtractor.py that is used in initialize.sh. The file seems to be important for synthetic pre-training.

@davda54
Collaborator

davda54 commented Dec 30, 2021

Hi, you can get the file here, for example: https://github.com/nawnoes/data-preprocess/blob/master/WikiExtractor.py

Note that you actually don't have to download, extract and process the wiki dumps -- we have also released the processed dumps used to train our system here: https://github.com/ufal/multilexnorm2021/releases/tag/v1.0.0

@amdongyang
Author

Thanks a lot for your help. I have another question.

After synthetic pre-training, I need to load the saved checkpoint and fine-tune it on the hand-annotated training data.

Is this procedure right? At the moment I fine-tune the ByT5 model directly on the hand-annotated training data, and I only get an ERR of 70.15 on English.

@davda54
Collaborator

davda54 commented Jan 3, 2022

That sounds alright. I'm not sure what validation dataset you use, but reducing the error by 70% seems good to me :)

@amdongyang
Author

As for the validation dataset, I simply use the test file at /data/multilexnorm/test/eval/test/intrinsic_evaluation/en/test.norm.masked, and I am trying to reach the performance reported in the paper (73.8 on English).
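
For readers comparing the two numbers above (70.15 vs. 73.8), here is a minimal sketch of how ERR (error reduction rate) is commonly defined for lexical normalization: word-level accuracy normalized against a "leave every token as-is" baseline. The function name and the toy tokens are hypothetical, and the official MultiLexNorm evaluation script may differ in details such as casing or tokenization, so treat this as an illustration rather than the scoring code used in the paper.

```python
def error_reduction_rate(raw, gold, pred):
    """ERR = (system accuracy - baseline accuracy) / (1 - baseline accuracy).

    The baseline is assumed to be "leave as is": every raw token is
    copied to the output unchanged.
    """
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("raw, gold and pred must have the same length")
    total = len(gold)
    if total == 0:
        raise ValueError("empty input")

    # Tokens that are already correct without any normalization.
    baseline_correct = sum(r == g for r, g in zip(raw, gold))
    # Tokens the system output gets right.
    system_correct = sum(p == g for p, g in zip(pred, gold))

    if baseline_correct == total:
        raise ValueError("no token needs normalization; ERR is undefined")

    # Dividing both accuracies by `total` cancels out, so counts suffice.
    return (system_correct - baseline_correct) / (total - baseline_correct)


# Toy example (made-up tokens, not taken from the dataset).
raw = ["u", "r", "gr8", "today"]
gold = ["you", "are", "great", "today"]
pred = ["you", "are", "gr8", "today"]

err = error_reduction_rate(raw, gold, pred)
print(f"ERR: {100 * err:.2f}")
```

Under this definition a system that only copies its input scores 0, a perfect system scores 100, and a system that introduces more errors than it fixes scores below 0.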
