
Merge pull request #285 from vamsianumula/master
Added post for training
pilarbachiller authored Sep 11, 2022
2 parents a35661a + e17656d commit a3f01c2
Showing 3 changed files with 63 additions and 30 deletions.
1 change: 1 addition & 0 deletions gsoc/2022/posts/index.md
@@ -57,4 +57,5 @@ Mentors: Mario Haut, Pilar Bachiller

1. [Introduction](/web/gsoc/2022/posts/vamsi_anumula/1-introduction)
2. [Environment](/web/gsoc/2022/posts/vamsi_anumula/2-Environment)
3. [Training](/web/gsoc/2022/posts/vamsi_anumula/3-Training)

52 changes: 22 additions & 30 deletions gsoc/2022/posts/vamsi_anumula/2-Environment.md
@@ -1,28 +1,30 @@
# GSoC'22 RoboComp project: Reinforcement Learning for pick and place operations

Created: 30th June 2022
Updated: 8th September 2022

## Objective
The aim of the project is to build an OpenAI Gym wrapper for the existing robotic arm model in CoppeliaSim. The gym wrapper eases the process of training our agent, since the currently available library implementations of state-of-the-art deep RL algorithms require the custom environment to follow the gym.Env structure. A standard wrapper has been built so far. The environment supports both continuous and discrete action spaces.
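To make that structure concrete, here is a minimal sketch (not the project's actual wrapper) of what such a gym.Env subclass looks like; the class and helper names (`ArmPickEnv`, `_get_observation`, `_compute_reward`) are hypothetical and all simulator calls are stubbed out:

```python
import gym
import numpy as np
from gym import spaces

class ArmPickEnv(gym.Env):
    """Sketch of a gym.Env wrapper around the CoppeliaSim arm (simulator calls stubbed)."""

    def __init__(self, continuous=True):
        super().__init__()
        # 29-dimensional continuous state (see the table below)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(29,), dtype=np.float32)
        # 5-dimensional action space, continuous or discrete
        if continuous:
            self.action_space = spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)
        else:
            self.action_space = spaces.MultiDiscrete([3] * 5)  # 3 choices per dimension, mapped to {-1, 0, 1}

    def reset(self):
        # reset the CoppeliaSim scene here, then return the initial observation
        return self._get_observation()

    def step(self, action):
        # apply the action to the simulated arm and advance the simulation one step
        obs = self._get_observation()
        reward, done = self._compute_reward(obs)
        return obs, reward, done, {}

    def _get_observation(self):
        # assemble the state vector from simulator readings (stubbed here)
        return np.zeros(29, dtype=np.float32)

    def _compute_reward(self, obs):
        # placeholder: the real reward uses block height, grasp and collision checks
        return 0.0, False
```

With this structure in place, any library that expects a standard gym.Env (Stable-Baselines3, for example) can drive the simulated arm through `reset()` and `step()` alone.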

## Environment Description

### State Space

A 29-dimensional continuous state space is considered, comprising:

| Info | Dimensions |
| ------------------------- | ---|
| Block pose: 3 coords + 4 quaternions | 7 |
| Block velocity | 3 |
| Block angular velocity | 3 |
| Gripper tip position coords | 3 |
| Relative position of block w.r.t tip | 3 |
| Gripper info | 1 |
| Grip force sensors (left & right) | 2 |
| Finger force sensors (left & right) | 2 |
| Rel. position b/w left & right fingers | 3 |
| Gripper velocity | 3 |

### Action space

A 5-dimensional action space is used, in either a discrete or a continuous setting.

@@ -34,27 +36,17 @@ A 26 dimensional continuous state space is considered, comprising of:
| Action | Discrete | Continuous |
| ------------------------- | --- | --- |
| Move wrist | {-1, 0, 1} | [-1, 1] |
| Open/Close the gripper | {-1, 0, 1} | [-1, 1], but will be rounded off to {-1, 0, 1} |

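In the continuous setting the last action component drives the gripper; a tiny sketch (hypothetical helper name) of how such a command can be snapped back to {-1, 0, 1}:

```python
import numpy as np

def discretize_gripper(action):
    # round the gripper command (last component) to the nearest of -1, 0, 1,
    # leaving the other four components untouched
    a = np.array(action, dtype=np.float32)
    a[-1] = np.clip(np.round(a[-1]), -1.0, 1.0)
    return a
```

For example, `discretize_gripper([0.3, -0.8, 0.1, 0.9, 0.6])` maps the gripper command 0.6 to 1 while keeping the first four components as they are.
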
### Collision Detection

Collision detection is an important aspect of the environment, as it prevents the arm from crashing into the block, the table and so on. The force data from the left and right finger sensors is used: the magnitude of each sensor reading is computed and, if it exceeds a certain threshold, a collision is detected. The threshold is fine-tuned from observations of various training episodes involving collisions.

### Grasp Detection
Similar to collision detection, if the force magnitudes obtained from the gripper sensors exceed a certain fine-tuned threshold, a grasp is detected. In the training phase this is a very useful signal to have in the reward function, where a certain reward is given for a successful grasp.
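A minimal sketch of both force-threshold checks, assuming the sensor readings are available as 3-D force vectors; the threshold values, and the requirement that both fingers be in contact for a grasp, are illustrative assumptions rather than the tuned values used in training:

```python
import numpy as np

COLLISION_FORCE_THRESHOLD = 30.0  # illustrative; the real value is tuned from training episodes
GRASP_FORCE_THRESHOLD = 5.0       # illustrative; the real value is tuned from training episodes

def collision_detected(left_finger_force, right_finger_force):
    # a collision is flagged when either finger sensor reports a force
    # whose magnitude exceeds the collision threshold
    return (np.linalg.norm(left_finger_force) > COLLISION_FORCE_THRESHOLD or
            np.linalg.norm(right_finger_force) > COLLISION_FORCE_THRESHOLD)

def grasp_detected(left_grip_force, right_grip_force):
    # a grasp is flagged when both gripper sensors report a force magnitude
    # above the grasp threshold
    return (np.linalg.norm(left_grip_force) > GRASP_FORCE_THRESHOLD and
            np.linalg.norm(right_grip_force) > GRASP_FORCE_THRESHOLD)
```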

## Further steps

### Goal Environment for goal-conditioning with HER

Since the task of pick and place is quite complex, we want to leverage the idea of goal-conditioning. With goal-conditioning, each episode can be treated as a success by taking the achieved terminal state as a virtual goal state. Hindsight Experience Replay (HER) is used to achieve this goal-conditioning for our agent. In order to use HER, our environment needs to be modified into a gym.GoalEnv structure, where the observation space consists of the state, the achieved goal and the desired goal, and the reward for each time step is computed from this structure. This goal env will be created and tested.
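A minimal sketch of that gym.GoalEnv structure, assuming the goal is simply the block height (a single value); the class name, the goal tolerance and the sparse reward are hypothetical placeholders, and `reset()`/`step()` would wrap the existing environment so that they return dictionary observations:

```python
import gym
import numpy as np
from gym import spaces

class PickAndLiftGoalEnv(gym.GoalEnv):
    """Sketch of a goal-conditioned (HER-compatible) version of the environment."""

    def __init__(self):
        super().__init__()
        # dictionary observation: full state plus achieved and desired goals
        self.observation_space = spaces.Dict({
            "observation":   spaces.Box(-np.inf, np.inf, shape=(29,), dtype=np.float32),
            "achieved_goal": spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32),
            "desired_goal":  spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32),
        })
        self.action_space = spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)

    def compute_reward(self, achieved_goal, desired_goal, info):
        # sparse reward: 0 when the achieved block height is within a small
        # tolerance of the desired height, -1 otherwise (assumed goal definition)
        return -(np.abs(achieved_goal - desired_goal) > 0.01).astype(np.float32)
```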

__Vamsi Anumula__
40 changes: 40 additions & 0 deletions gsoc/2022/posts/vamsi_anumula/3-Training.md
@@ -0,0 +1,40 @@

# GSoC'22 RoboComp project: Reinforcement Learning for pick and place operations

8th September 2022

## Training Objective
The goal of the current phase is for the robot arm to reach the block, grasp it and lift it to a desired height above ground.

## Reward

The agent will get rewarded as follows:

| State | Reward | Terminal? |
| ------------------------- | ---| ---- |
| Arm is far away from the block | -100 | Yes |
| Collision detected | -100 | Yes |
| Grasp detected and dh>0 | 1000\*dh_norm | No |
| Goal height reached | 10,000 | Yes |

### Notation
dh := change in object height from the ground \
dh_norm := normalized dh

*\*The reward structure is subject to change*
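A sketch of the reward logic in the table, assuming dh_norm is dh divided by the goal height and that every other state yields zero reward; the helper flags and the goal height value are hypothetical placeholders:

```python
GOAL_HEIGHT = 0.3  # assumed desired lift height above the ground, in metres

def compute_reward(arm_far_from_block, collision, grasped, dh, goal_reached):
    """Return (reward, done) for one time step, following the table above."""
    if arm_far_from_block or collision:
        return -100.0, True               # terminal penalty
    if goal_reached:
        return 10_000.0, True             # terminal success bonus
    if grasped and dh > 0:
        dh_norm = min(dh / GOAL_HEIGHT, 1.0)
        return 1000.0 * dh_norm, False    # shaped reward for lifting the block
    return 0.0, False                     # all other states: no reward (assumption)
```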

## Algorithms

Soft Actor Critic (SAC) is chosen for training in the continuous action space setting.
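A minimal training sketch using the Stable-Baselines3 SAC implementation; `ArmPickEnv` stands in for the CoppeliaSim gym wrapper described in the previous post, and the hyperparameters are illustrative rather than the tuned ones:

```python
from stable_baselines3 import SAC

env = ArmPickEnv(continuous=True)  # hypothetical name for the CoppeliaSim gym wrapper
model = SAC("MlpPolicy", env, learning_rate=3e-4, buffer_size=200_000, verbose=1)
model.learn(total_timesteps=500_000)
model.save("sac_pick_and_lift")
```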

## Trained agent demo

*TODO*

## Reward curve

*TODO: Will be added once hyperparameter tuning is done.*

## Further Steps
- The next step would be to train the arm to place the block at a desired position after grasping, by modifying the rewards.
- Modify the existing env to support goal conditioning and train the arm using SAC along with a Hindsight Experience Replay (HER) replay buffer, to achieve more robust and sample-efficient training of the agent (a short sketch follows below).
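A short sketch of that combination, assuming a recent Stable-Baselines3 version (which exposes `HerReplayBuffer`) and a goal-conditioned environment like the `PickAndLiftGoalEnv` sketched in the environment post:

```python
from stable_baselines3 import HerReplayBuffer, SAC

env = PickAndLiftGoalEnv()  # hypothetical gym.GoalEnv wrapper (see the environment post)
model = SAC(
    "MultiInputPolicy", env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
)
model.learn(total_timesteps=500_000)
```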
