data leakage: liveothello and wthor #8

Open
SimJeg opened this issue Nov 13, 2024 · 2 comments

SimJeg commented Nov 13, 2024

Hi,

I recently downloaded liveothello (11k games) and wthor (132k games) and noticed that all wthor transcripts start with the move f5. Taking symmetries into account (there are 4 symmetries in Othello), the overlap between the two datasets is 8k games (72% of liveothello is in wthor). Without symmetries, the overlap is 3k games (27%).
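To make the symmetry-aware comparison concrete, here is a minimal sketch of how such an overlap could be counted. It assumes transcripts are plain strings of two-character moves such as `f5d6c3`; the function names (`parse_moves`, `canonical`, `overlap`) are hypothetical and not taken from the repo. The four symmetries are the board transformations that leave the starting position unchanged: identity, 180° rotation, and the two diagonal reflections.

```python
def parse_moves(transcript):
    """Split a transcript such as 'f5d6c3' into 0-indexed (col, row) pairs.

    Assumption: moves are concatenated two-character squares (a-h, 1-8).
    """
    t = transcript.lower()
    return [(ord(t[i]) - ord("a"), int(t[i + 1]) - 1) for i in range(0, len(t), 2)]


def format_moves(moves):
    """Inverse of parse_moves: turn (col, row) pairs back into a string."""
    return "".join(chr(c + ord("a")) + str(r + 1) for c, r in moves)


# The four board symmetries that preserve the Othello starting position.
SYMMETRIES = (
    lambda c, r: (c, r),          # identity
    lambda c, r: (7 - c, 7 - r),  # 180 degree rotation
    lambda c, r: (r, c),          # reflection swapping columns and rows
    lambda c, r: (7 - r, 7 - c),  # reflection across the other diagonal
)


def canonical(transcript):
    """Return one representative per symmetry class (the smallest string)."""
    moves = parse_moves(transcript)
    return min(format_moves([s(c, r) for c, r in moves]) for s in SYMMETRIES)


def overlap(games_a, games_b, use_symmetries=True):
    """Count games in games_a that also appear in games_b."""
    key = canonical if use_symmetries else str.lower
    keys_b = {key(g) for g in games_b}
    return sum(1 for g in games_a if key(g) in keys_b)
```

For example, `e6f4c3` is the image of `f5d6c3` under one of the diagonal reflections, so the two count as the same game only when `use_symmetries=True`.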

The paper mentions:

> They [wthor and liveothello games] are combined and split randomly by 8 : 2 into training and validation sets

Hence I think there is a small data leakage between the training and validation sets (x4 larger if you take symmetries into account).
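One possible way to avoid this kind of leakage is to deduplicate the combined games before the random 8 : 2 split. The sketch below is only an illustration under assumptions: `split_without_leakage` is a hypothetical helper, not the paper's procedure, and games are assumed to be transcript strings. By default it treats two games as identical when their lowercase transcripts match; a symmetry-aware canonicalization function could be passed as `key` instead.

```python
import random


def split_without_leakage(games, key=str.lower, val_fraction=0.2, seed=0):
    """Deduplicate games by `key`, then split randomly into train and validation.

    Assumption: `games` is an iterable of transcript strings, and `key` maps
    a transcript to the identity under which duplicates should be merged.
    """
    unique = {}
    for g in games:
        # Keep only the first game seen for each key.
        unique.setdefault(key(g), g)
    deduped = list(unique.values())
    random.Random(seed).shuffle(deduped)
    n_val = int(len(deduped) * val_fraction)
    return deduped[n_val:], deduped[:n_val]
```

Because each key survives only once, no game can end up on both sides of the split.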

@likenneth
Owner

Hello!

I did a quick check myself and indeed there are duplicates. Thank you for bringing this to my attention!

However, I only found 1664 duplicates when combining the wthor and liveothello games. Please check out my notebook. Maybe it's due to different data sources? I downloaded the data from the link in the readme of this repo. How about you?

@SimJeg
Author

SimJeg commented Nov 17, 2024

Hello, indeed I downloaded the data directly from the wthor and liveothello websites, and a notable difference is that I used data up to 2024. This might explain the 1.4k games missing from the overlap without symmetries. If you take symmetries into account, you should find the x4 factor.

Without symmetries, you get 23% of liveothello being in wthor and I get 27%, which is close.
