This is a set of experiments for mapping Transformer-based models onto spatial architectures. Timeloop is used both for its map-space exploration capabilities and for its analytical model, supported by Accelergy. Target Transformer-based models include BERT (base and large) and GPT-2. Target architectures span from Gemmini-like systolic arrays and Eyeriss-like matrices of processing elements to analog and digital in-memory computing.
DISCLAIMER: the structure of this repository is very likely to change in the future. Anything you find here now is very much WORK-IN-PROGRESS and is NOT guaranteed to work, nor to produce correct results even if it does, so judge any results you might get critically.
The steps required to use these examples are as follows:
- Install the Accelergy-Timeloop infrastructure; instructions are provided under "Native Install" in the linked repository's README. The only tested version of the tools is the one at commit a9b57b6.
- Clone this repository:
```
git clone https://github.com/EMJzero/Timeloop-Experiments.git
```
- Install Accelergy's estimation tables:
```
accelergyTables -r Timeloop-Experiments/transformer_gemmini/free_data
accelergyTables -r Timeloop-Experiments/transformer_pim/PIM_estimation_tables
```
Remember that if you need to modify any metric fetched from those tables, the paths above are where the modifications must go; ignore the copies of the same folders under `transformer_pim_backup`.
For details on the structure and function of the various `.yaml` files and on the meaning of all the parameters used within them, please refer to these sources:
The documentation, as of the time of writing, is anything but exhaustive, so be prepared to read some code to understand all the details. In particular, Timeloop's `src/model` folder contains all the components used in the `arch.yaml` files, and searching through their respective `ParseSpecs` methods (ctrl+F them) often helps where the documentation doesn't.
The two main active examples are `transformer_pim` and `transformer_gemmini`; all the others may rely on the outdated Timeloop-v3 or not work at all.
Before running any example, check out its `mapper.yaml` file, usually found under the `mapper` folder. In there, you should set `num_threads` to the number of logical processors of your machine. Furthermore, consider setting `victory_condition` to a few thousand attempts: this is the number of consecutive non-improving mappings a thread must encounter before it stops, so the higher it is, the more likely Timeloop is to find the optimal mapping.
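For reference, below is a minimal sketch of the relevant portion of such a `mapper.yaml`; the exact keys and surrounding structure depend on the example and on the timeloopfe version, so treat the values as placeholders rather than recommendations.
```
# sketch only: adapt to the actual mapper.yaml shipped with each example
mapper:
  num_threads: 16          # set to the number of logical processors of your machine
  victory_condition: 3200  # consecutive non-improving mappings before a thread stops
```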
Unfortunately, root privileges are required to run those scripts, due to the inner workings of Accelergy. This also implies that any required Python package must be installed with `sudo pip install` (no matter how much Pip hates it 😅).
Also, while it is possible to invoke Timeloop directly, without going through its Python frontend (`timeloopfe`), the latter is the preferred method as of Timeloop-v4.
To run MSE (map-space exploration), you just need to run the example's main Python file from within its folder:
```
cd Timeloop-Experiments/transformer_pim
sudo python3 run.py <layer> [optional-fusion-level]
```
This will run Timeloop's mapper for the provided layer of the BERT Transformer architecture; a directory like `outputs_<layer>_<optional-fusion-level>` will be created to hold all results. The most relevant outputs are `timeloop-mapper.accelergy.log`, for debugging in case the mapper did not run, `timeloop-mapper.map.txt`, with the best mapping found (also printed at the end of the Python script), and `timeloop-mapper.stats.txt`, with the full statistics of that mapping.
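As a purely illustrative example, assuming `KTQ` is one of the available layer names (it appears in the HWDSE example further below) and no fusion level is requested:
```
cd Timeloop-Experiments/transformer_pim
sudo python3 run.py KTQ
# the output directory follows the outputs_<layer>_<optional-fusion-level> pattern
cat outputs_KTQ*/timeloop-mapper.map.txt
cat outputs_KTQ*/timeloop-mapper.stats.txt
```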
Please note that unless `victory_condition` is specified and low enough, forcefully terminating Timeloop (ctrl+C) is the intended way to stop the exploration.
Lastly, for a full list of the functionalities of `run.py`, please run:
```
python3 run.py
python3 run.py -h
```
To run HWDSE, first edit the `hwdse_run.py` script to select the HW configurations to test; look for an array of dictionaries called `hw_configs`:
```
cd Timeloop-Experiments/transformer_pim
vim +$(grep -n -m1 "hw_configs" hwdse_run.py | cut -d: -f1) hwdse_run.py
```
The provided example configurations include all the architectural parameters that the script can automatically alter.
As of the latest update, you can also provide the configurations via a JSON-formatted file; see the `-c` option and the `hw_configs.json` files for details. Note that configurations provided in this way override those hard-coded in the script.
```
# an example usage:
sudo python3 hwdse_run.py KTQ VScores softmax -c hw_configs.json -vc 3200 -t
```
Then, you just need to run the Python script:
```
# to run some selected layers
sudo python3 hwdse_run.py <layer1> ... <layerN> [optional-fusion-level]
# to run all layers
sudo python3 hwdse_run.py all [optional-fusion-level]
```
This will run Timeloop's mapper for each HW configuration and each provided layer of the BERT Transformer architecture; an `outputs_hwdse` directory will be created to hold all results. In there, there will be a folder for each tried configuration, numbered from 0 onward by default, each containing a sub-folder for each mapped layer (whose contents are the same as in MSE) and a `hw_config.json` file with a summary of the configuration and its final metrics.
Relevant tuning knobs are the `-vc <num>` option, which lets you specify the `victory_condition` for Timeloop (a "good enough" value is 3200, while the default is 160 to speed up testing), and the `-mx` option, which, rather than testing only the provided HW configurations, also considers all their possible cross-overs (use with care, as the number of configurations can blow up very quickly 😬).
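As an illustration, the following (unverified) invocation combines the options above, assuming they can be freely mixed as in the earlier `hw_configs.json` example:
```
# hypothetical: explore all cross-overs of the configurations in hw_configs.json,
# over all layers, with victory_condition raised to 3200
sudo python3 hwdse_run.py all -c hw_configs.json -mx -vc 3200
```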
Finally, `-so` can be used to view the results of a previous HWDSE run, while `-t` presents the results in a more comfortable table.
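For example, to re-inspect the results of a previous run in tabular form (this assumes, without verifying, that the same layer arguments must be passed again):
```
# hypothetical: show previously computed HWDSE results as a table
sudo python3 hwdse_run.py all -so -t
```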
Lastly, for a full list of the functionalities of `hwdse_run.py`, please run:
```
python3 hwdse_run.py
python3 hwdse_run.py -h
```