From 55d383f25d18d9f0c458ca0374a001c394101066 Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 11:47:44 +0900
Subject: [PATCH 1/6] Updated environment

---
 .../2022/posts/vamsi_anumula/2-Environment.md | 41 ++++++++-----------
 1 file changed, 17 insertions(+), 24 deletions(-)

diff --git a/gsoc/2022/posts/vamsi_anumula/2-Environment.md b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
index 6506bb5e..fcf8899d 100644
--- a/gsoc/2022/posts/vamsi_anumula/2-Environment.md
+++ b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
@@ -1,26 +1,28 @@
# GSoC'22 RoboComp project: Reinforcement Learning for pick and place operations

-30th June 2022
+Created: 30th June 2022
+Updated: 8th September 2022

-## Environment
-The aim of the project is to make a Open AI Gym wrapper for the exisiting robotic arm model in CoppeliaSim. The gym wrapper creation eases the process of training our agent. The currently available library implementations of state-of-the-art Deep RL algos require the custom environment to follow this gym.Env structure. A standard wrapper has been built until now. Curently, the task is for the robot arm to reach the object, grasp it and lift to a desired height.
+# Objective
+The aim of the project is to build an OpenAI Gym wrapper for the existing robotic arm model in CoppeliaSim. The Gym wrapper eases the process of training our agent, since the available library implementations of state-of-the-art deep RL algorithms require a custom environment to follow the gym.Env structure. A standard wrapper has been built so far. The environment supports both continuous and discrete action spaces.

# Environment Description

## State Space

-A 26 dimensional continuous state space is considered, comprising of:
+A 29-dimensional continuous state space is considered, comprising:

| Info | Dimensions |
| ------------------------- | ---|
-| Block pose: 3 coords+ 4 quarternions | 7 |
+| Block pose: 3 coords + 4 quaternions | 7 |
+| Block velocity | 3 |
+| Block angular velocity | 3 |
| Gripper tip position coords | 3 |
| Relative position of block w.r.t tip | 3 |
-| Left grip force sensor | 3 |
-| Right grip force sensor | 3 |
-| Left finger force sensor | 3 |
-| Right grip force sensor | 3 |
-| Gripper info | 1 |
+| Grip force sensors (left & right) | 2 |
+| Finger force sensors (left & right) | 2 |
+| Rel. position b/w left & right fingers | 3 |
+| Gripper velocity | 3 |

## Action space

@@ -34,27 +36,18 @@ A 26 dimensional continuous state space is considered, comprising of:
| Move wrist | {-1,0,1} | [-1,1] |
| Open/Close the gripper | {-1,0,1} | [-1,1], but will be rounded off to {-1,0,1} |

-## Reward
+## Collision Detection
-The goal is for the arm to grasp the object and pick it to a certain height. The objct will only be able to achieve the desired height only if the arm was able to successfully grip it. So, the reward function will be a gradually decreasing negative score proprtional to absolute deviation from the current object height to the desired height, in range of [-1,0] and once the desired height is acheived, a positive scoreof +10 is awarded.
-Huge penalty of -100 is awarded in case of an invalid state/collision.
+Collision detection is an important aspect of the environment, as it prevents the arm from crashing into the block, the table and so on. The force data from the left and right finger sensors is used: the force magnitude is computed and, if it exceeds a certain threshold, a collision is detected.
+The threshold is fine-tuned from observations of various training episodes involving collisions.

-## Algorithms
-
-For continuous action space, Soft Actor Critic(SAC) is considered and Deep Q Network is considered for the discrete setting.
-
-## Training process objectives
-- Train the agent using existing imlementations of SAC, DQN using Stablebaselines3 library.
-- Continuously modify environment to fix bugs encountered during training process.
-- Experiment and investigate agent performace with different reward fucntions
-- Carry out hyperparameter tuning to achieve the desired goal
+## Grasp Detection
+Similar to collision detection, if the force magnitudes obtained from the gripper sensors exceed a certain fine-tuned threshold, a grasp is detected. During training this is a very useful signal to have in the reward function, where a reward is granted for a successful grasp.

# Further steps

## Goal Environment for goal-conditioning with HER
Since the task of pick and place is quite complex, we want to leverage the idea of goal-conditioning. With goal-conditioning, each episode is treated as a success by relabelling the achieved terminal state as a virtual goal state. Hindsight Experience Replay (HER) is used to achieve goal-conditioning for our agent.
In order to use HER, our environment needs to be modified into a gym.GoalEnv structure, where the observation space consists of the state, the achieved goal and the desired goal, and the reward for each time step is computed based on this structure. This goal env will be created and tested.
-Any off-policy algorithm like SAC, DQN then can be used along with HER to achieve a more robust and sample efficient training of the agent.
-
+...

__Vamsi Anumula__
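To make the force-threshold checks from the Collision Detection and Grasp Detection sections concrete, here is a minimal sketch of how they could be implemented. The function names, the vector form of the sensor read-outs and the threshold values are illustrative assumptions, not the actual RoboComp wrapper code.

```python
import numpy as np

# Illustrative values only; in practice these are fine-tuned by inspecting
# force read-outs from training episodes that involve contact.
COLLISION_THRESHOLD = 50.0
GRASP_THRESHOLD = 5.0


def detect_collision(left_finger_force, right_finger_force,
                     threshold=COLLISION_THRESHOLD):
    """Flag a collision when either finger force magnitude exceeds the threshold."""
    return (np.linalg.norm(left_finger_force) > threshold
            or np.linalg.norm(right_finger_force) > threshold)


def detect_grasp(left_grip_force, right_grip_force, threshold=GRASP_THRESHOLD):
    """Flag a grasp when both grip sensors press against the block firmly enough."""
    return (np.linalg.norm(left_grip_force) > threshold
            and np.linalg.norm(right_grip_force) > threshold)
```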
From 05e0c039df60857945b2bedd2a72dfb998b25c3a Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 11:48:31 +0900
Subject: [PATCH 2/6] small fix

---
 gsoc/2022/posts/vamsi_anumula/2-Environment.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/gsoc/2022/posts/vamsi_anumula/2-Environment.md b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
index fcf8899d..4ea7ed2c 100644
--- a/gsoc/2022/posts/vamsi_anumula/2-Environment.md
+++ b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
@@ -48,6 +48,5 @@ Similar to collision detection, if the force magnitudes obtained from the grippe
## Goal Environment for goal-conditioning with HER
Since the task of pick and place is quite complex, we want to leverage the idea of goal-conditioning. With goal-conditioning, each episode is treated as a success by relabelling the achieved terminal state as a virtual goal state. Hindsight Experience Replay (HER) is used to achieve goal-conditioning for our agent.
In order to use HER, our environment needs to be modified into a gym.GoalEnv structure, where the observation space consists of the state, the achieved goal and the desired goal, and the reward for each time step is computed based on this structure. This goal env will be created and tested.
-...

__Vamsi Anumula__
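The gym.GoalEnv structure mentioned above boils down to a dictionary observation with "observation", "achieved_goal" and "desired_goal" keys, plus a compute_reward() method that HER can call again on relabelled virtual goals. Below is a minimal sketch of such a wrapper; the class name ArmGoalEnv, the choice of block position as the goal and the 5 cm success threshold are assumptions, and the CoppeliaSim plumbing is stubbed out.

```python
import numpy as np
import gym
from gym import spaces


class ArmGoalEnv(gym.Env):
    """Sketch of a goal-conditioned wrapper following the gym.GoalEnv contract."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)
        self.observation_space = spaces.Dict({
            "observation":   spaces.Box(-np.inf, np.inf, shape=(29,), dtype=np.float32),
            "achieved_goal": spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
            "desired_goal":  spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
        })

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward: 0 when the block sits within 5 cm of the goal, -1 otherwise.
        # HER calls this again with relabelled (virtual) desired goals.
        d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
        return -(d > 0.05).astype(np.float32)

    def _get_obs(self):
        # Stub: the real wrapper reads block/gripper poses and forces from CoppeliaSim.
        return {
            "observation":   np.zeros(29, dtype=np.float32),
            "achieved_goal": np.zeros(3, dtype=np.float32),                # current block position
            "desired_goal":  np.array([0.0, 0.0, 0.2], dtype=np.float32),  # e.g. lift 20 cm
        }

    def reset(self):
        return self._get_obs()

    def step(self, action):
        obs = self._get_obs()
        reward = float(self.compute_reward(obs["achieved_goal"], obs["desired_goal"], {}))
        done = reward == 0.0
        return obs, reward, done, {}
```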
From 3b74fe4ff479a9589e8b8a14cd83e09bf6233623 Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 11:50:38 +0900
Subject: [PATCH 3/6] Create Training,md

---
 gsoc/2022/posts/vamsi_anumula/Training,md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 gsoc/2022/posts/vamsi_anumula/Training,md

diff --git a/gsoc/2022/posts/vamsi_anumula/Training,md b/gsoc/2022/posts/vamsi_anumula/Training,md
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/gsoc/2022/posts/vamsi_anumula/Training,md
@@ -0,0 +1 @@
+

From a8319d2a69a0063463f0bb0b5c9440ae60b7247b Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 13:33:05 +0900
Subject: [PATCH 4/6] Added rewards and training info

---
 gsoc/2022/posts/vamsi_anumula/3-Training.md | 40 +++++++++++++++++++++
 gsoc/2022/posts/vamsi_anumula/Training,md   |  1 -
 2 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 gsoc/2022/posts/vamsi_anumula/3-Training.md
 delete mode 100644 gsoc/2022/posts/vamsi_anumula/Training,md

diff --git a/gsoc/2022/posts/vamsi_anumula/3-Training.md b/gsoc/2022/posts/vamsi_anumula/3-Training.md
new file mode 100644
index 00000000..3b9971bf
--- /dev/null
+++ b/gsoc/2022/posts/vamsi_anumula/3-Training.md
@@ -0,0 +1,40 @@
+
+# GSoC'22 RoboComp project: Reinforcement Learning for pick and place operations
+
+8th September 2022
+
+## Training Objective
+The goal of the current phase is for the robot arm to reach the block, grasp it and lift it to a desired height above the ground.
+
+## Reward
+
+The agent will be rewarded as follows:
+
+| State | Reward | Terminal? |
+| ------------------------- | ---| ---- |
+| Arm is far away from the block | -100 | Yes |
+| Collision detected | -100 | Yes |
+| Grasp detected and dh>0 | 1000\*dh_norm | No |
+| Goal height reached | 10,000 | Yes |
+
+### Notation
+dh := change in object height from the ground \
+dh_norm := normalized dh
+
+*\*The reward structure is subject to change*
+
+## Algorithms
+
+Soft Actor-Critic (SAC) is chosen for training in the continuous action space setting.
+
+## Trained agent demo
+
+*TODO*
+
+## Reward curve
+
+*TODO: Will be added once hyperparameter tuning is done.*
+
+## Further Steps
+ - The next step is to train the arm to place the block at a desired position after grasping, by modifying the rewards.
+ - Modify the existing env to support goal-conditioning and train the arm using SAC with a Hindsight Experience Replay (HER) replay buffer, to achieve more robust and sample-efficient training of the agent.

diff --git a/gsoc/2022/posts/vamsi_anumula/Training,md b/gsoc/2022/posts/vamsi_anumula/Training,md
deleted file mode 100644
index 8b137891..00000000
--- a/gsoc/2022/posts/vamsi_anumula/Training,md
+++ /dev/null
@@ -1 +0,0 @@
-
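Here is a sketch of how the reward table in 3-Training.md above might translate into code. The constants follow the table (which, as noted there, is subject to change); the far-away distance threshold and the zero reward for states not listed in the table are assumptions.

```python
def pick_reward(dist_to_block, collided, grasped, dh, dh_norm, goal_height,
                far_threshold=0.5):
    """Return (reward, terminal) following the reward table in 3-Training.md."""
    if dist_to_block > far_threshold:   # arm wandered far away from the block
        return -100.0, True
    if collided:                        # collision detected
        return -100.0, True
    if dh >= goal_height:               # goal height reached
        return 10_000.0, True
    if grasped and dh > 0:              # lifting progress while the block is grasped
        return 1000.0 * dh_norm, False
    return 0.0, False                   # assumption: states not in the table get 0
```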
From eb1612a3028843a1513f24d9da6b478ff8c8fafc Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 13:34:08 +0900
Subject: [PATCH 5/6] Heading adjusted

---
 gsoc/2022/posts/vamsi_anumula/2-Environment.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gsoc/2022/posts/vamsi_anumula/2-Environment.md b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
index 4ea7ed2c..5d547109 100644
--- a/gsoc/2022/posts/vamsi_anumula/2-Environment.md
+++ b/gsoc/2022/posts/vamsi_anumula/2-Environment.md
@@ -3,12 +3,12 @@
Created: 30th June 2022
Updated: 8th September 2022

-# Objective
+## Objective
The aim of the project is to build an OpenAI Gym wrapper for the existing robotic arm model in CoppeliaSim. The Gym wrapper eases the process of training our agent, since the available library implementations of state-of-the-art deep RL algorithms require a custom environment to follow the gym.Env structure. A standard wrapper has been built so far. The environment supports both continuous and discrete action spaces.

-# Environment Description
+## Environment Description

-## State Space
+### State Space

A 29-dimensional continuous state space is considered, comprising:

| Info | Dimensions |
| ------------------------- | ---|
@@ -24,7 +24,7 @@ A 29-dimensional continuous state space is considered, comprising:
| Rel. position b/w left & right fingers | 3 |
| Gripper velocity | 3 |

-## Action space
+### Action space

5-dimensional action space in either discrete or continuous setting.

| Move wrist | {-1,0,1} | [-1,1] |
| Open/Close the gripper | {-1,0,1} | [-1,1], but will be rounded off to {-1,0,1} |

-## Collision Detection
+### Collision Detection

Collision detection is an important aspect of the environment, as it prevents the arm from crashing into the block, the table and so on. The force data from the left and right finger sensors is used: the force magnitude is computed and, if it exceeds a certain threshold, a collision is detected.
The threshold is fine-tuned from observations of various training episodes involving collisions.

-## Grasp Detection
+### Grasp Detection
Similar to collision detection, if the force magnitudes obtained from the gripper sensors exceed a certain fine-tuned threshold, a grasp is detected. During training this is a very useful signal to have in the reward function, where a reward is granted for a successful grasp.

-# Further steps
+## Further steps

-## Goal Environment for goal-conditioning with HER
+### Goal Environment for goal-conditioning with HER
Since the task of pick and place is quite complex, we want to leverage the idea of goal-conditioning. With goal-conditioning, each episode is treated as a success by relabelling the achieved terminal state as a virtual goal state. Hindsight Experience Replay (HER) is used to achieve goal-conditioning for our agent.
In order to use HER, our environment needs to be modified into a gym.GoalEnv structure, where the observation space consists of the state, the achieved goal and the desired goal, and the reward for each time step is computed based on this structure. This goal env will be created and tested.
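Tying the training post to the HER plan above, here is a sketch of how SAC could be wired to an HER replay buffer with Stable-Baselines3. The ArmGoalEnv class stands in for the goal-conditioned CoppeliaSim wrapper sketched earlier and is an assumption, as are the hyperparameter values.

```python
from stable_baselines3 import SAC, HerReplayBuffer

# Hypothetical goal-conditioned CoppeliaSim wrapper (see the ArmGoalEnv sketch above).
env = ArmGoalEnv()

model = SAC(
    "MultiInputPolicy",                    # handles the dict observation space
    env,
    replay_buffer_class=HerReplayBuffer,   # relabels stored episodes with virtual goals
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                  # 4 virtual goals per real transition
        goal_selection_strategy="future",  # virtual goals from later steps of the episode
        # Depending on the SB3 version, max_episode_length may also be needed here.
    ),
    verbose=1,
)
model.learn(total_timesteps=200_000)
model.save("sac_her_pick_and_place")
```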
From e17656db598c72e8fcfc914e4b386e41cb1cae84 Mon Sep 17 00:00:00 2001
From: Poorna Hima Vamsi A
Date: Thu, 8 Sep 2022 13:49:23 +0900
Subject: [PATCH 6/6] Added post on training

---
 gsoc/2022/posts/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gsoc/2022/posts/index.md b/gsoc/2022/posts/index.md
index 8ad071d5..1fd4c931 100644
--- a/gsoc/2022/posts/index.md
+++ b/gsoc/2022/posts/index.md
@@ -52,4 +52,5 @@ Mentors: Mario Haut, Pilar Bachiller

1. [Introduction](/web/gsoc/2022/posts/vamsi_anumula/1-introduction)
2. [Environment](/web/gsoc/2022/posts/vamsi_anumula/2-Environment)
+3. [Training](/web/gsoc/2022/posts/vamsi_anumula/3-Training)