Starter Code Repo: https://github.com/kpertsch/clvr_impl_starter
Note: the horizontal_position and vertical_position rewards are swapped in the given code.
I constructed the model in models.py and trained it on the provided dataset with all 6 given rewards:
python train_encoder_all_task.py
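For reference, here is a minimal sketch of the kind of model this step trains, assuming 64x64 grayscale frames, a convolutional encoder, and one small MLP reward head per task trained with MSE. The class names, layer sizes, and latent dimension are my assumptions; the actual models.py may differ.

```python
# Hypothetical sketch (not the actual models.py): conv encoder + per-task reward heads.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Convolutional encoder mapping a 64x64 grayscale frame to a latent vector."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 8 -> 4
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class RewardPredictor(nn.Module):
    """Encoder plus one reward head per task; each head predicts a scalar reward."""
    def __init__(self, latent_dim=64, n_tasks=6):
        super().__init__()
        self.encoder = Encoder(latent_dim)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
            for _ in range(n_tasks)
        )

    def forward(self, x):
        z = self.encoder(x)
        return torch.cat([head(z) for head in self.heads], dim=-1)  # (B, n_tasks)
```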
I also trained the model using only one reward at a time (horizontal_position and vertical_position, respectively):
python train_encoder_single_task.py -r horizontal_position
python train_encoder_single_task.py -r vertical_position
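A minimal sketch of what single-task training amounts to, assuming the model outputs one prediction per reward head: only the selected head contributes to the loss. The function and argument names below are hypothetical; see train_encoder_single_task.py for the actual logic.

```python
import torch.nn.functional as F

def single_task_loss(model, images, rewards, reward_names, task="horizontal_position"):
    """MSE on only the selected reward head; the other heads are ignored."""
    idx = reward_names.index(task)   # position of the chosen reward in the dataset's task list
    pred = model(images)             # (B, n_tasks) reward predictions
    return F.mse_loss(pred[:, idx], rewards[:, idx])
```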
Using these pretrained encoders, I trained a decoder for each one to see what the learned representations capture:
python train_decoder.py -r horizontal_position
python train_decoder.py -r vertical_position
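A rough sketch of the decoder setup, assuming the pretrained encoder is frozen and a transposed-convolution decoder is trained with an MSE reconstruction loss; the exact architecture in the repo may differ.

```python
# Hypothetical sketch (not the actual train_decoder.py): frozen encoder, deconv decoder, MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Transposed-convolution decoder mapping the latent back to a 64x64 frame."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)
        return self.deconv(h)

def train_step(encoder, decoder, optimizer, images):
    """One decoder update: the encoder stays frozen, only the decoder is optimized."""
    with torch.no_grad():
        z = encoder(images)
    recon = decoder(z)
    loss = F.mse_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```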
The first row is the ground truth of the current state, the second row is the image decoded by the encoder-decoder trained only with the vertical reward, and the third row is the image decoded by the encoder-decoder trained only with the horizontal reward. Because of the bug in the given reward function, the vertical and horizontal rewards are flipped, so the vertical model actually learned the horizontal reward and the horizontal model learned the vertical one.
The circle shape is the agent, and the encoder-decoders are trained on rewards based on the target's position.
As you can see, the decoded images retain information about their rewards. For example, in the third row, the white part carries information about the agent's horizontal coordinate, while information about the other coordinate has faded.
This shows that, with this model structure, the learned representations capture reward-relevant information well.
I implemented SAC (Soft Actor-Critic) to compare the performance of the image_scratch, cnn, and pre-trained encoder (reward_predictor) baselines, as well as the oracle.
I first trained the oracle version to check that my implementation is correct.
The training code is train_agent.py:
python train_agent.py -m oracle -t SpritesState-v0 -d ./Results/agents
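For context, this is the standard SAC update the agent training is built around. The network interfaces (e.g. actor.sample returning an action and its log-probability), the optimizers, and the hyperparameters below are assumptions; the actual implementation lives in sac.py.

```python
# Sketch of one SAC update step (twin critics, entropy temperature, Polyak targets).
import torch
import torch.nn.functional as F

def sac_update(actor, critic1, critic2, target1, target2, log_alpha,
               batch, actor_opt, critic_opt, alpha_opt,
               gamma=0.99, tau=0.005, target_entropy=-2.0):
    obs, act, rew, next_obs, done = batch          # tensors from the replay buffer
    alpha = log_alpha.exp()                        # log_alpha: learnable scalar tensor

    # Critic update: one-step soft Bellman backup with twin target critics.
    with torch.no_grad():
        next_act, next_logp = actor.sample(next_obs)
        target_q = torch.min(target1(next_obs, next_act), target2(next_obs, next_act))
        target_q = rew + gamma * (1 - done) * (target_q - alpha * next_logp)
    critic_loss = F.mse_loss(critic1(obs, act), target_q) + F.mse_loss(critic2(obs, act), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize the entropy-regularized Q-value.
    new_act, logp = actor.sample(obs)
    q_new = torch.min(critic1(obs, new_act), critic2(obs, new_act))
    actor_loss = (alpha.detach() * logp - q_new).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Temperature update toward the target entropy.
    alpha_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()

    # Polyak averaging of the target critics.
    for tgt, src in ((target1, critic1), (target2, critic2)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```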
It seems to work well, although some fluctuation occurs; the hyperparameters are probably not perfectly tuned.
The reason the agent does not follow the target closely and tends to stay near the center is that the environment's time horizon is short and the target keeps moving around randomly.
For the agent, staying near the center is therefore an efficient way to collect consistently high reward.
The result of the trained agent is shown below.
The testing code is test_agent.py:
python test_agent.py -m oracle -t SpritesState-v0 -d ./Results/agents -e 50000
I tried 4 baselines:
- oracle
- cnn
- image_scratch
- reward_predictor(ours)
The oracle can access the ground-truth positions, so it serves as an upper bound.
The CNN encoder for the image_scratch version is defined in model.py.
The SAC agents for all four versions (oracle, cnn, image_scratch, reward_predictor) are defined in sac.py.
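As a rough illustration of how the image-based versions could feed observations into SAC: an encoder maps images to a latent state vector that the actor and critics consume, and the pretrained encoder can optionally be kept frozen. Whether the encoder is actually frozen in my runs, and the wrapper class below, are assumptions; the real wiring is in sac.py.

```python
# Hypothetical observation wrapper (the actual integration in sac.py may differ).
import torch.nn as nn

class LatentObsWrapper(nn.Module):
    """Maps image observations to the state vector consumed by the SAC actor/critics."""
    def __init__(self, encoder, freeze=True):
        super().__init__()
        self.encoder = encoder
        if freeze:  # e.g. reward_predictor: keep the pretrained encoder fixed
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, image_obs):
        return self.encoder(image_obs)

# reward_predictor (ours): encoder pretrained on reward prediction (frozen here, by assumption)
# image_scratch:           the same CNN trained from scratch together with the RL objective
```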
I trained all four versions (oracle, cnn, image_scratch, reward_predictor) in three environments (0, 1, and 2 distractors):
python train_agent.py -m reward_predictor -t Sprites-v0 -d ./Results/agents
python train_agent.py -m reward_predictor -t Sprites-v1 -d ./Results/agents
python train_agent.py -m reward_predictor -t Sprites-v2 -d ./Results/agents
python train_agent.py -m cnn -t Sprites-v0 -d ./Results/agents
python train_agent.py -m cnn -t Sprites-v1 -d ./Results/agents
python train_agent.py -m cnn -t Sprites-v2 -d ./Results/agents
python train_agent.py -m image_scratch -t Sprites-v0 -d ./Results/agents
python train_agent.py -m image_scratch -t Sprites-v1 -d ./Results/agents
python train_agent.py -m image_scratch -t Sprites-v2 -d ./Results/agents
python train_agent.py -m oracle -t SpritesState-v0 -d ./Results/agents
python train_agent.py -m oracle -t SpritesState-v1 -d ./Results/agents
python train_agent.py -m oracle -t SpritesState-v2 -d ./Results/agents
You can test the agents by running the following commands:
python test_agent.py -m oracle -t SpritesState-v0 -d ./Results/agents -e 50000
python test_agent.py -m oracle -t SpritesState-v1 -d ./Results/agents -e 20000
python test_agent.py -m oracle -t SpritesState-v2 -d ./Results/agents -e 40000
python test_agent.py -m cnn -t Sprites-v0 -d ./Results/agents -e 30000
python test_agent.py -m cnn -t Sprites-v1 -d ./Results/agents -e 30000
python test_agent.py -m cnn -t Sprites-v2 -d ./Results/agents -e 35000
python test_agent.py -m image_scratch -t Sprites-v0 -d ./Results/agents -e 35000
python test_agent.py -m image_scratch -t Sprites-v1 -d ./Results/agents -e 35000
python test_agent.py -m image_scratch -t Sprites-v2 -d ./Results/agents -e 35000
python test_agent.py -m reward_predictor -t Sprites-v0 -d ./Results/agents -e 35000
python test_agent.py -m reward_predictor -t Sprites-v1 -d ./Results/agents -e 30000
python test_agent.py -m reward_predictor -t Sprites-v2 -d ./Results/agents -e 25000
Generating Plots
python plotter.py
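A minimal sketch of the kind of plotting plotter.py does, assuming episode returns were logged per environment step to CSV files; the log format and column names below are assumptions.

```python
# Hypothetical plotting sketch: smooth and overlay the returns of the four baselines.
import pandas as pd
import matplotlib.pyplot as plt

def smooth(x, window=20):
    """Rolling-mean smoothing to tame the fluctuation in episode returns."""
    return pd.Series(x).rolling(window, min_periods=1).mean()

def plot_env(log_files, labels, title, out_path):
    for path, label in zip(log_files, labels):
        df = pd.read_csv(path)                      # assumed columns: "step", "return"
        plt.plot(df["step"], smooth(df["return"]), label=label)
    plt.xlabel("Environment steps")
    plt.ylabel("Episode return")
    plt.title(title)
    plt.legend()
    plt.savefig(out_path)
    plt.close()
```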
Generating GIF Demos
python test_agent_gif.py
The results are shown below.
As you can see, in v0 (no distractors) oracle and reward_predictor train well, cnn learns only a little, and image_scratch does not seem to learn at all.
In v1 (one distractor), oracle and reward_predictor train well, but cnn and image_scratch show no progress.
In v2 (two distractors), reward_predictor performs slightly worse than oracle.
So we can see that the pre-trained encoder helps the RL algorithm learn efficiently and reach high performance (almost the same as the oracle), while cnn and image_scratch did not seem to learn.
Reward-induced representation learning helps the RL agent train efficiently even when there are distractors, because the encoder's representation, induced by the meta tasks, contains enough information about the ground-truth state.