Supersedes #287
I added a SubprocVecEnv so multiple games are played at once, which captures training data about 5x faster. I trained for 4.96 days straight (100,000,000 timesteps) with this configuration; the resulting model.zip is 1525MB, unfortunately too big to upload to git. After those 5 days of training, PPOPlayer has an 8% win rate against AB-pruning and an 11% win rate against ValueFunctionPlayer. Attached is the wandb graph output: episode_reward_mean has not plateaued, but training is simply too slow on my RTX 4070 for the agent to realistically surpass the AB-pruning player. The model may have too many layers, which slows training, but I've experimented with a range of hyperparameters and model sizes and this is the best configuration I've found.
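For anyone wanting to reproduce the parallel-rollout setup, this is roughly the shape of it with Stable-Baselines3's SubprocVecEnv. The env id, `N_ENVS`, and the hyperparameters below are placeholder assumptions for illustration, not necessarily what this branch uses:

```python
import gymnasium as gym  # older setups may need `import gym` instead

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

N_ENVS = 8  # placeholder; tune to available CPU cores
ENV_ID = "catanatron_gym:catanatron-v1"  # assumed id; use whatever this branch registers


def make_env():
    # Each callable builds its own env instance inside its own worker subprocess.
    def _init():
        return gym.make(ENV_ID)

    return _init


if __name__ == "__main__":  # guard required: SubprocVecEnv spawns worker processes
    # Rollouts are collected from N_ENVS games in parallel instead of one at a time.
    vec_env = SubprocVecEnv([make_env() for _ in range(N_ENVS)])

    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=100_000_000)
    model.save("model.zip")
```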
The features_extractor CNN doesn't seem to help much on shorter training runs, even with much smaller model sizes. I'm starting to think Stable-Baselines isn't the best way to go: AlphaZero combines MCTS with an actor/critic network like this one, and maybe we should pursue recreating that approach for Catan.
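For reference, the custom features extractor follows the standard Stable-Baselines3 pattern sketched below. `BoardCNN`, the layer sizes, and the assumption of a channel-first (C, H, W) board observation are illustrative, not the exact extractor in this branch:

```python
import torch
import torch.nn as nn

from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class BoardCNN(BaseFeaturesExtractor):
    """Illustrative extractor; assumes a channel-first (C, H, W) board tensor."""

    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size by running one dummy observation through the CNN.
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations):
        return self.linear(self.cnn(observations))


policy_kwargs = dict(
    features_extractor_class=BoardCNN,
    features_extractor_kwargs=dict(features_dim=256),
)
# `vec_env` as built in the previous sketch.
model = PPO("CnnPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
```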
Note that if you want to pull the branch and play around with it, you'll have to delete the model.zip before each run to reset the architecture.
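The reason is that the training script resumes from model.zip when it exists, roughly along the lines of the load-or-create pattern below (path and kwargs are illustrative, not the exact code in the branch); `PPO.load` restores the saved weights and architecture, so any new `policy_kwargs` only take effect on a fresh start:

```python
import os

from stable_baselines3 import PPO

MODEL_PATH = "model.zip"  # illustrative path

if os.path.exists(MODEL_PATH):
    # Resuming pins the saved architecture; delete model.zip to change the network.
    model = PPO.load(MODEL_PATH, env=vec_env)
else:
    model = PPO("CnnPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
```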