
API: Boosting Multi-Agent Reinforcement Learning via Agent-Permutation-Invariant Networks

Open-source code for API: Boosting Multi-Agent Reinforcement Learning via Agent-Permutation-Invariant Networks.

[TOC]

1. Motivation

1.1 Permutation Invariance and Equivariance

Permutation Invariant Function. A function $f(X)$, where $X = [x_1, \ldots, x_m]^{\mathsf{T}}$ of size $m \times k$ is a set consisting of $m$ components $x_i$ (each of which is of dimension $k$), is said to be permutation invariant if permuting the input components does not change the output of the function. Mathematically, $f(MX) = f(X)$, where $M$ is a permutation matrix of size $m \times m$, i.e., a binary matrix that has exactly a single unit value in every row and column and zeros everywhere else.

Permutation Equivariant Function. Similarly, a function $f(X)$ is permutation equivariant if permuting the input components permutes the output components with the same permutation $M$. Mathematically, $f(MX) = Mf(X)$.
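
To make the two definitions concrete, the following minimal sketch (written for illustration only, not part of this repository) checks both properties numerically with NumPy: a sum over the components is permutation invariant, while an element-wise map is permutation equivariant.

```python
# Illustrative check of the two definitions (assumed toy functions, not repo code).
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 3                               # m components, each of dimension k
X = rng.normal(size=(m, k))               # the input set X
M = np.eye(m)[rng.permutation(m)]         # a random m x m permutation matrix

f_inv = lambda X: X.sum(axis=0)           # permutation invariant: f(MX) = f(X)
f_equ = lambda X: np.tanh(X)              # permutation equivariant: f(MX) = M f(X)

assert np.allclose(f_inv(M @ X), f_inv(X))
assert np.allclose(f_equ(M @ X), M @ f_equ(X))
```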

1.2 Why Permutation Invariance Matters

In MARL, the environment typically consists of $m$ components, including learning agents and non-player characters. Therefore, the states and observations are factorizable as sets of components $X = [x_1, \ldots, x_m]^{\mathsf{T}}$, where each component $x_i$ carries an atomic semantic meaning (e.g., agent $i$'s features) and is of dimension $k$. Because shuffling the order of the components does not change the information contained in the set, one would expect many functions, e.g., the policy function $\pi$, to possess permutation invariance or permutation equivariance. These properties can be exploited to design more efficient MARL algorithms, especially when the components are homogeneous, i.e., semantically identical (belonging to the same type and having identical feature spaces, action spaces and reward functions).

Taking the individual value function $Q_i(a_i \mid o_i)$ as an example, the input is the observation $o_i = [x_1, \ldots, x_m]^{\mathsf{T}}$, and the outputs are the Q-values of all actions $a_i$. Since the components are homogeneous, they share the same feature space, i.e., $x_j \in \mathcal{X}$. Thus, the size of a fixed-order representation of $o_i$ is $|\mathcal{X}|^m$. In contrast, using a permutation invariant representation, i.e., removing the influence of the input order, reduces the size of the observation space by a factor of $\frac{1}{m!}$ (e.g., for $m = 8$ homogeneous components the space shrinks by a factor of $8! = 40320$). As the number of homogeneous components increases, removing these redundancies results in a much smaller search space, upon which a policy can be learned more easily.

Our objective is to design more flexible Agent Permutation Invariant (API) and Agent Permutation Equivariant (APE) models to greatly reduce the sample complexity of MARL algorithms. Again taking $Q_i(a_i \mid o_i)$ as the example, if there is a direct one-to-one correspondence between an action's Q-value in the output and a component $x_j$ in the input $o_i$, then $Q_i$ should be permutation equivariant with respect to these actions; otherwise, $Q_i$ should be permutation invariant.

Figure: illustration of the API and APE functions.

Note that this is very common in many multi-agent settings. For example, as illustrated in the Figure above, in the challenging StarCraft II micromanagement benchmark (SMAC), the input set can be divided into two groups: an ally group $X^{\text{ally}}$ and an enemy group $X^{\text{enemy}}$. The output Q-values of the actions can be divided into two groups as well: Q-values of the move actions $\mathcal{A}^{\text{move}}$ (i.e., moving in the four cardinal directions) and of the attack actions $\mathcal{A}^{\text{attack}}$ (one attack action per enemy). Since there is a one-to-one correspondence between the elements in $\mathcal{A}^{\text{attack}}$ and $X^{\text{enemy}}$, the Q-values of $\mathcal{A}^{\text{attack}}$ should be equivariant to permutations of $X^{\text{enemy}}$, while the Q-values of $\mathcal{A}^{\text{move}}$ should be invariant to permutations of the whole input set. Overall, a desired model of $Q_i(a_i \mid o_i)$ should be both permutation invariant and permutation equivariant.

2. Model Architecture of API-HyPerNetwork (API-HPN)

Figure: agent permutation invariant network with hypernetworks ((a) a normal FC layer; (b) the API hypernetwork layer).

API-HPN incorporates hypernetworks to generate different weights $W_i$ for different input components $x_i$, which improves representational capacity while ensuring that the same $x_i$ is always assigned the same weight $W_i$. The architecture of API-HPN is shown in Figure (b) above. We again take $Q_i(a_i \mid o_i)$ as an example. The model is mainly composed of two modules:

Agent Permutation Invariant Input Layer. Hypernetworks are a family of neural architectures in which one network, known as the hypernetwork, generates the weights of another network. In our setting, the hypernetwork is used to generate a different weight $W_i$ for each component $x_i$ of the input set $X$. As shown in Figure (b) above, $X$ (which can be viewed as a batch of $m$ components $x_i$, each of dimension $k$, represented by different shades of blue) is first fed into a shared hypernetwork (marked in yellow), whose input size is $k$ and whose output size is $k \times n$. The corresponding outputs are then reshaped to $k \times n$ matrices and serve as the submodule weights $W_i$ of the normal FC layer (see Figure (a)). Note that different $x_i$ generate different $W_i$, and the same $x_i$ always corresponds to the same $W_i$. Each $x_i$ is then multiplied by its $W_i$, and all multiplication results and the bias $b$ are summed together to produce the output. Since each element $x_i$ is processed separately by its corresponding $W_i$ and the results are merged by a permutation-invariant 'sum' function, permutation invariance is preserved.
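
As a reading aid, here is a minimal PyTorch sketch of such a permutation-invariant input layer. It is an illustration under assumed names and shapes (e.g., `APIInputLayer`, `hyper_w`, a 64-unit hidden layer), not the exact implementation in this repository.

```python
# Minimal sketch of a permutation-invariant input layer driven by a shared
# hypernetwork. Module and variable names are illustrative assumptions.
import torch
import torch.nn as nn

class APIInputLayer(nn.Module):
    def __init__(self, k, n_hidden):
        super().__init__()
        # Shared hypernetwork: maps each component x_i (dim k) to its own
        # weight matrix W_i of shape (k, n_hidden).
        self.hyper_w = nn.Sequential(
            nn.Linear(k, 64), nn.ReLU(),
            nn.Linear(64, k * n_hidden),
        )
        self.bias = nn.Parameter(torch.zeros(n_hidden))
        self.k, self.n_hidden = k, n_hidden

    def forward(self, X):
        # X: (batch, m, k) -- a set of m components per sample.
        W = self.hyper_w(X).view(*X.shape[:-1], self.k, self.n_hidden)  # (batch, m, k, n)
        # Each x_i is multiplied by its own W_i:
        # (batch, m, 1, k) @ (batch, m, k, n) -> (batch, m, 1, n)
        h = torch.matmul(X.unsqueeze(-2), W).squeeze(-2)                # (batch, m, n)
        # Permutation-invariant aggregation across the m components.
        return h.sum(dim=1) + self.bias                                 # (batch, n)
```

Permuting the $m$ components of `X` leaves the output unchanged, because each $x_i$ is always paired with the weight $W_i$ generated from itself and the per-component results are merged by a sum.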

Agent Permutation Equivariance Output Layer. Similarly, to keep the whole network permutation equivariant, the submodule weights and biases of the agent-related actions in the output layer, e.g., the attack actions $\mathcal{A}^{\text{attack}}$ in SMAC, are also generated by a hypernetwork. As mentioned above, the inputs and outputs of the hypernetwork always correspond one-to-one, so a change of input order results in the same change of output order, thus achieving permutation equivariance.
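
A matching sketch of such a permutation-equivariant output branch, again under assumed names and shapes (`APEOutputLayer`, one attack-action Q-value per enemy), is given below; it is not the repository's exact code.

```python
# Minimal sketch of a permutation-equivariant output layer: a shared
# hypernetwork maps each enemy's features to the output weights of that
# enemy's attack-action Q-value. Names and shapes are illustrative.
import torch
import torch.nn as nn

class APEOutputLayer(nn.Module):
    def __init__(self, k_enemy, n_hidden):
        super().__init__()
        # Maps each enemy feature vector (dim k_enemy) to a weight vector
        # (n_hidden) plus a bias for that enemy's attack-action Q-value.
        self.hyper_w = nn.Sequential(
            nn.Linear(k_enemy, 64), nn.ReLU(),
            nn.Linear(64, n_hidden + 1),
        )

    def forward(self, h, E):
        # h: (batch, n_hidden) -- the agent's hidden state.
        # E: (batch, m_enemy, k_enemy) -- features of the m_enemy enemies.
        wb = self.hyper_w(E)                         # (batch, m_enemy, n_hidden + 1)
        W, b = wb[..., :-1], wb[..., -1]             # per-enemy weights and bias
        # Q-value of attacking enemy j: <h, W_j> + b_j  -> (batch, m_enemy)
        q_attack = torch.einsum('bn,bmn->bm', h, W) + b
        return q_attack
```

Because the $j$-th attack Q-value depends only on the hidden state and the weights generated from the $j$-th enemy's features, permuting the enemies permutes the attack Q-values in exactly the same way.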

We emphasize that API-HPN is a general design and can be easily integrated into existing MARL algorithms (e.g., VDN, QMIX, MADDPG, MAPPO) to boost the learning speed as well as the converged performance. All parameters of API-HPN are simply trained end-to-end with backpropagation according to the corresponding RL loss function.

3. Experiments

3.1 Experimental Setups

We mainly evaluate our methods in the challenging StarCraft II micromanagement benchmark (SMAC).


StarCraft II version: SC2.4.10. Difficulty: 7.

3.2 Evaluation Metric

3.3 Code Implementations and Structure

3.4 Results

3.4.1 Comparison with previous SOTA

Figure: the full comparison of API-HPN with SOTA methods on SMAC.

3.4.2 Comparison with baselines that consider permutation invariance and permutation equivariance

Figure: comparison with related baselines.

3.4.3 Ablation Studies

Figure: ablation studies.

| Scenarios | Difficulty | API-QMIX |
| --- | --- | --- |
| 8m_vs_9m | Hard | 100% |
| 5m_vs_6m | Hard | 100% |
| 3s_vs_5z | Hard | 100% |
| bane_vs_bane | Hard | 100% |
| 2c_vs_64zg | Hard | 100% |
| corridor | Super Hard | 100% |
| MMM2 | Super Hard | 100% |
| 3s5z_vs_3s6z | Super Hard | 100% |
| 27m_vs_30m | Super Hard | 100% |
| 6h_vs_8z | Super Hard | 98% |

4. How to use the code?

4.1 Detailed command lines to reproduce all experimental results

Run an experiment

# For SMAC, take the 5m_vs_6m scenario for example.
CUDA_VISIBLE_DEVICES="0" python src/main.py --config=api_vdn --env-config=sc2 with env_args.map_name=5m_vs_6m obs_agent_id=True obs_last_action=False runner=parallel batch_size_run=8 buffer_size=5000 t_max=10050000 epsilon_anneal_time=100000 batch_size=128 td_lambda=0.6

CUDA_VISIBLE_DEVICES="1" python src/main.py --config=api_qmix --env-config=sc2 with env_args.map_name=5m_vs_6m obs_agent_id=True obs_last_action=False runner=parallel batch_size_run=8 buffer_size=5000 t_max=10050000 epsilon_anneal_time=100000 batch_size=128 td_lambda=0.6

The config files act as defaults for an algorithm or environment.

They are all located in src/config: --config refers to the config files in src/config/algs, and --env-config refers to the config files in src/config/envs.

Citation

@article{,
      title={API: Boosting Multi-Agent Reinforcement Learning via Agent-Permutation-Invariant Networks}, 
      author={},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}