
# Spies like us
Directory layout:
	README.txt	-- This file.
	maddpg/spies-like-us/experiments/	-- The saved results of experiments.
		1.0	-- The first experiment.
		1.1	-- The second experiment.
		...
		1.x/
			checkpoint
			.meta
			.index
			.data-00000-of-00001

Training new experiments:
	$ python3 train.py --scenario=spies-like-us            ;# Train a new policy.
	$ python3 train.py --scenario=spies-like-us --restore  ;# Do more training on an existing policy.

Observing the results of the trained experiments:
	$ python3 train.py --scenario=spies-like-us --restore --display


spies-like-us-1.0:
	Simple Pleb reward:
		* Fixed amount of points for picking up food.
		* Higher fixed amount of points for dropping off food.
		* Points in inverse proportion to distance from current objective (either food source or nest).
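
	A rough sketch of this reward (the constants, helper names, and the carrying flag are illustrative assumptions, not the actual scenario code):

		import numpy as np

		PICKUP_REWARD  = 5.0    # illustrative: fixed points for picking up food
		DROPOFF_REWARD = 10.0   # illustrative: higher fixed points for dropping it at the nest

		def pleb_reward(pleb_pos, carrying, food_pos, nest_pos, picked_up, dropped_off):
		    """Per-step reward for one Pleb, per the 1.0 description above."""
		    reward = 0.0
		    if picked_up:
		        reward += PICKUP_REWARD
		    if dropped_off:
		        reward += DROPOFF_REWARD
		    # Shaping: inverse proportion to distance from the current objective
		    # (the nest when carrying food, otherwise the food source).
		    objective = nest_pos if carrying else food_pos
		    dist = np.linalg.norm(np.asarray(pleb_pos) - np.asarray(objective))
		    reward += 1.0 / (1.0 + dist)   # 1/(1+d) stays finite at dist == 0
		    return reward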

spies-like-us-1.1:
	Noticed lots of hesitation as Plebs got near the food source and the nest.  Tried to compensate for it.

spies-like-us-1.2:
	Added a "noose" so that Plebs must get closer to the objective in order to receive any more reward.  This prevents procrastination/cheating.
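
	A minimal sketch of the noose, assuming it is implemented as a per-episode "closest distance so far" gate on the shaping reward (the class and method names are assumptions):

		import numpy as np

		class Noose:
		    """Pay the proximity shaping only when the Pleb beats its closest
		    distance so far, so loitering at a fixed distance earns nothing."""

		    def __init__(self):
		        self.best_dist = np.inf

		    def reset(self):
		        # Call at the start of an episode or whenever the objective changes.
		        self.best_dist = np.inf

		    def shaping(self, pleb_pos, objective_pos):
		        dist = np.linalg.norm(np.asarray(pleb_pos) - np.asarray(objective_pos))
		        if dist >= self.best_dist:
		            return 0.0               # no progress, no reward
		        self.best_dist = dist        # tighten the noose
		        return 1.0 / (1.0 + dist)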

spies-like-us-1.3:
	Adding Spies to the mix:
		* If they contact a Pleb, they steal its food and gain a reward.
		* Rewarded in inverse proportion to distance to all Plebs (up to a bound).
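
	A rough sketch of the Spy reward described above (the constants, contact radius, and helper names are illustrative assumptions):

		import numpy as np

		STEAL_REWARD = 8.0    # illustrative fixed reward for a successful steal
		CONTACT_DIST = 0.1    # illustrative contact radius
		MAX_SHAPING  = 2.0    # bound on the proximity shaping term

		def spy_reward(spy_pos, pleb_positions, plebs_carrying):
		    steal, shaping = 0.0, 0.0
		    for pos, carrying in zip(pleb_positions, plebs_carrying):
		        dist = np.linalg.norm(np.asarray(spy_pos) - np.asarray(pos))
		        if carrying and dist < CONTACT_DIST:
		            steal += STEAL_REWARD      # contact with a laden Pleb steals its food
		        shaping += 1.0 / (1.0 + dist)  # closer to every Pleb is better
		    return steal + min(shaping, MAX_SHAPING)   # shaping bounded as noted above
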
	Outcome:
		* The spy ambushes the Plebs near the food source.
		* The Plebs do not develop any compensating behavior.
	Interpretation:
		The Plebs are content to score rewards by only picking up food,
		because the Spy's reward for taking the food does not undo all of the
		Plebs' reward for picking it up; the result is a mutually beneficial,
		symbiotic relationship between the Spy and the Plebs.

spies-like-us-1.4:
	* Penalize all Plebs when a Spy steals from a Pleb.
	* Reward all Plebs when any Pleb picks up food or drops it off.
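
	A sketch of how that shared term might be added to each Pleb's individual reward (the event counters and constants are illustrative assumptions):

		PICKUP_REWARD  = 5.0    # illustrative values, as in the 1.0 sketch
		DROPOFF_REWARD = 10.0
		THEFT_PENALTY  = 8.0

		def shared_pleb_term(pickups, dropoffs, thefts):
		    """Team-wide term added to every Pleb's reward this step; the
		    arguments count events across the whole team."""
		    return (pickups * PICKUP_REWARD
		            + dropoffs * DROPOFF_REWARD
		            - thefts * THEFT_PENALTY)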

* Watchpoint: Plebs may rely upon an agent's position (slot) in the observation vector to assign reputation, i.e. they learn that certain slots always hold Spies, and they do not learn to adapt when a Spy shows up in a different slot.
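
One way to probe this (a sketch only, not part of the current scenario) is to shuffle the order of the other agents in the observation, so a policy cannot key on a fixed Spy slot:

	import numpy as np

	def build_observation(self_pos, other_agent_positions, rng):
	    """Hypothetical slot-shuffled observation: other agents appear in a
	    random order (ideally drawn once per episode), so "slot k is always
	    the Spy" is no longer learnable."""
	    order = rng.permutation(len(other_agent_positions))
	    slots = [np.asarray(other_agent_positions[i]) - np.asarray(self_pos) for i in order]
	    return np.concatenate([np.asarray(self_pos)] + slots)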

Observations:
	* A pathological reward function was found.
	* The pathological reward function seems to imply that Plebs learn to count steps: if they are too far away from their objective, they know they can gain more points by procrastinating near the objective than by picking up/dropping off and accelerating toward the next objective.
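
	A back-of-the-envelope illustration of that exploit (all numbers are assumptions), reading it as: with a per-step shaping of 1/(1+d), hovering near the current objective for many steps can out-earn the pickup bonus plus the shaping collected while marching toward a distant next objective.

		PICKUP_REWARD = 5.0   # illustrative pickup bonus

		def loiter_total(dist, steps):
		    """Shaping earned by hovering at a fixed distance from the objective."""
		    return steps * 1.0 / (1.0 + dist)

		def pickup_then_travel_total(travel_dists):
		    """Pickup bonus plus shaping accrued while heading toward a far nest."""
		    return PICKUP_REWARD + sum(1.0 / (1.0 + d) for d in travel_dists)

		print(loiter_total(0.5, 20))                                           # ~13.3
		print(pickup_then_travel_total([5.0 - 0.2 * i for i in range(20)]))    # ~10.3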
