Here is a summary of a comparative study of four different reward functions used to find the minima of functions of two variables x and y. The aim is to find the minima of 2D quadratic functions using policy gradient with a Gaussian policy (actions are sampled from a Gaussian distribution).
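For context, below is a minimal sketch of what such a Gaussian policy could look like. The use of PyTorch, the hidden size, and the activation are assumptions for illustration; the repository's actual policy network may differ.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Illustrative Gaussian policy: maps the 8-dimensional state to the mean
    of a 2-D Gaussian over actions; the log standard deviation is a learned
    parameter. Hidden size and activation are assumptions, not the repo's values."""

    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mean_net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Typical use in a policy-gradient update:
#   dist = policy(state)                       # Gaussian over the 2-D action
#   action = dist.sample()
#   log_prob = dist.log_prob(action).sum(-1)   # weights the return in the loss
```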
For every episode, the coefficients of the function are randomly chosen as follows:
- Both a and b are randomly chosen from the interval [0, 10].
- The rest of the coefficients are randomly chosen from the interval [-10, 10].
- The x and y coordinates are randomly chosen from [-4, 4].
- The state is an 8-dimensional vector consisting of the x and y coordinates and the coefficients of the function (a minimal sketch of this setup is given below).
Maximum length of an episode is 200 steps.
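The sketch below illustrates the episode setup described above. The state ordering [x, y, a, b, remaining coefficients] and the helper name are assumptions for illustration, not the repository's exact code.

```python
import numpy as np

def sample_episode_start(ab_low=0.0, ab_high=10.0,
                         coeff_low=-10.0, coeff_high=10.0,
                         coord_low=-4.0, coord_high=4.0):
    """Sample the coefficients and starting point for one episode."""
    a, b = np.random.uniform(ab_low, ab_high, size=2)         # a, b in [0, 10]
    rest = np.random.uniform(coeff_low, coeff_high, size=4)   # remaining coefficients in [-10, 10]
    x, y = np.random.uniform(coord_low, coord_high, size=2)   # starting point in [-4, 4]
    state = np.concatenate([[x, y], [a, b], rest])            # 8-dimensional state
    return state
```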
In total, four reward functions were implemented:
- Reward function 1: the inverse of the distance of the current state from the minimum; this reward depends only on the state.
- Reward function 2: the exponential of the sum, over the x and y directions, of the product of the difference of the state from the minimum and the action along that direction (a sketch of reward functions 1 and 2 follows this list).
- Reward function 3: the reward function is given by
- Reward function 4: the reward function is given by
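For concreteness, here is a minimal sketch of reward functions 1 and 2 as described above. Reward functions 3 and 4 are not sketched because their formulas are not reproduced in this summary; the small epsilon in reward 1 and the function names are assumptions for illustration.

```python
import numpy as np

def reward_inverse_distance(state_xy, minimum_xy, eps=1e-8):
    """Reward function 1 (as described): inverse of the distance of the
    current (x, y) from the minimum; depends only on the state.
    eps avoids division by zero and is an assumption."""
    dist = np.linalg.norm(np.asarray(state_xy) - np.asarray(minimum_xy))
    return 1.0 / (dist + eps)

def reward_exp_alignment(state_xy, minimum_xy, action_xy):
    """Reward function 2 (as described): exponential of the sum, over x and y,
    of (state minus minimum) times the action in that direction.  Large
    exponents can overflow, which matches the overflow reported in training."""
    diff = np.asarray(state_xy) - np.asarray(minimum_xy)
    return float(np.exp(np.dot(diff, np.asarray(action_xy))))
```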
To run the environment and agent, run main.py, which can be found here. The environment is a custom OpenAI Gym environment for 2D quadratic functions, which can be found here.
The agent was trained for 3000 episodes with reward functions 1, 3, and 4, and tested for 1000 episodes. Reward function 2 could not be trained for more than 1000 episodes due to overflow in the exponent; therefore, testing was not conducted on reward function 2. A successful episode is defined as one in which the absolute difference between each coordinate of the final state and the corresponding coordinate of the minimum is less than 0.1. The inverse of the distance of the last state from the minimum was plotted against the episode number; peaks in the graph indicate successful episodes.
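A minimal sketch of this success check; the per-coordinate threshold of 0.1 is taken from the text, while the helper name is hypothetical.

```python
def is_successful(final_xy, minimum_xy, tol=0.1):
    """Success criterion as described: both |x - x*| and |y - y*| are below 0.1."""
    return (abs(final_xy[0] - minimum_xy[0]) < tol
            and abs(final_xy[1] - minimum_xy[1]) < tol)
```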
| Reward function | Successful training episodes | Successful testing episodes | Notebook |
| --- | --- | --- | --- |
| 1 | 18/3000 | 6/1000 | Notebook |
| 3 | 28/3000 | 6/1000 | Notebook |
| 4 | 9/3000 | 4/1000 | Notebook |
Of the four reward functions, the most successful in training is reward function 3, with 28 successful training episodes, but it does not generalize as well at test time: only 6 testing episodes were successful, the same as reward function 1. Reward function 4 did not yield the expected results. The graphs for all reward functions are in their respective Colab notebooks.
As stated in the environment description, the coefficients and coordinates were limited to small ranges, so additional experiments were carried out on all three reward functions (1, 3, and 4). The experiments involved changing the episode length, allowing larger ranges for the coefficients and coordinates, and changing the hyperparameters of the policy gradient algorithm, i.e., the learning rate, the gamma value in the reward-to-go, and using deeper neural networks. From these experiments it can be inferred that none of the three reward functions generalizes well on the larger ranges. It was also observed that the results in the Analysis of Reward Functions section above are not consistent.
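For reference, below is a standard reward-to-go computation with a discount factor gamma, one of the hyperparameters varied in these experiments. This is the textbook formulation, not necessarily the repository's exact code; the default gamma is illustrative only.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: R_t = sum_{k >= t} gamma^(k - t) * r_k.
    gamma=0.99 is an illustrative default, not the value used in the study."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```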