This code compares two learning agents in a classic gridworld game: one uses the off-policy Q-learning approach, and the other uses the on-policy State-Action-Reward-State-Action (SARSA) approach.
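To make the off-policy versus on-policy distinction concrete, here is a minimal sketch of the two textbook update rules together with epsilon-greedy action selection. It is not taken from this repository's code: the function names, the list-of-lists Q-table layout, and the `alpha`, `gamma`, and `done` parameters are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random among the actions that share the best value.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference between the two updates is the bootstrap term: Q-learning uses `max(Q[s_next])` regardless of what the agent does next, while SARSA uses `Q[s_next][a_next]`, so exploratory moves feed back into SARSA's value estimates.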
Image of the grid used in the game:
The code should produce graphs like the ones below, which show the average rewards for the agents over 500 epochs at varying levels of exploration (epsilon value).
The image below compares the two agents for an epsilon value of 0.1:
The image below compares the two agents for an epsilon value of 0.25:
The image below compares the two agents for an epsilon value of 0.75:
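The README does not say how the rewards are averaged before plotting. As one possible reading, the sketch below averages per-epoch rewards across several independent runs and optionally smooths the result with a trailing moving average. Both helper names and the `window` parameter are hypothetical and not part of this repository.

```python
def average_over_runs(runs):
    """Average per-epoch rewards across independent runs of equal length."""
    n_runs = len(runs)
    return [sum(epoch_rewards) / n_runs for epoch_rewards in zip(*runs)]


def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            # Drop the value that just left the window.
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed
```

Each agent's reward history, one list per run with one entry per epoch, could be passed through `average_over_runs` and the resulting curves plotted on the same axes for a given epsilon.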