This guide is meant for those interested in contributing to the ABLATOR library.
This document is a work in progress. It explains how to set up the development environment and run tests. The repository follows a test-driven development approach, where every PR must include one or more corresponding tests.
NOTE Several tests are written for multi-node environments; the minimum requirement for running the full test suite is a machine with 2 GPUs running Ubuntu. The tests can also run with 1 GPU or no GPUs and on Windows and Mac, but will not be comprehensive.
The main library is intended for prototyping in a local environment (e.g. Windows / Mac / Ubuntu), while distributed execution on a multi-node cluster is supported only on Ubuntu. When developing features related to a multi-node cluster, a machine with at least 2 GPUs is required.
The development version of ABLATOR can be installed via pip:
pip install -e ."[dev]"
The -e option installs the library in editable mode, so that it automatically reflects the folder contents, while the [dev] option installs additional developer dependencies for ABLATOR. It is always a good idea to re-run the above when pulling from a branch to install any missing dependencies.
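For reference, here is a minimal sketch of a fresh development set-up; the repository URL below is illustrative, so substitute the fork or clone you are contributing from:
# Clone the repository and install it in editable mode with developer dependencies.
git clone https://github.com/fostiropoulos/ablator.git
cd ablator
pip install -e ."[dev]"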
You should first make sure that the Xcode command-line tools are installed, using xcode-select --install.
Depending on whether you have an Intel or Apple Silicon processor, you might need to follow additional steps to install ray.
Docker is used for running tests and must be installed. For detailed instructions on how to install Docker please refer to the official documentation.
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
You will need to install WSL and follow the instructions for Ubuntu (above and anywhere else in the documentation) even when using Windows. Windows has poor support for several functionalities, which makes it non-ideal for the development of distributed applications and thus ablation experiments. The main issues arise when running the tests written for multi-node environments on Windows. This is not an expected use case for the library, but if you are developing on Windows you will need a way to run your tests.
Even when using WSL, the biggest issues encountered are with the integration of CUDA (GPU), Docker, and process management.
For example, setproctitle does not work; pynvml has poor support for Windows with several bugs (e.g. nvmlSystemGetProcessName); Docker has network permission issues between the host and the container; and ray crashes unexpectedly.
IMPORTANT
Do not install Docker using Docker Desktop for Windows. If you already have, please uninstall it and follow the instructions above. Docker Desktop is error-prone in the way it communicates with WSL over the local network, and the ray connection breaks randomly and inexplicably. You should follow the official instructions on how to install Docker.
You can verify that Docker is installed by using the following command:
docker --version
You should see the version number of Docker if the installation was successful.
To verify that Docker is running, run the following command in the terminal:
docker run hello-world
You will need to set up Docker to run in rootless mode. That is, the system user that will be executing the tests should be able to execute docker run hello-world without running into errors. For instructions specific to your system please refer to the official documentation.
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
You will need to reboot, or log out and log in again, for the changes to take effect.
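After logging back in, you can check that the group change took effect, for example:
# The user should now be in the docker group and able to reach the daemon without sudo.
groups | grep docker
docker run --rm hello-world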
The tests require the latest ABLATOR image to be built. The image must have the same Python version as the running environment.
The easiest way to build the docker image is to run the script with your development virtual environment active (it is used to identify the Python version you are using):
bash scripts/make_docker.sh
You will need to rebuild the docker image in the main ablator directory every time before running the tests (as the code in the docker image is updated from the current repository).
NOTE While building the image, you may encounter this error: Docker-credential-desktop.exe executable file not found in $PATH. A quick fix is to change credsStore to credStore in the config file at ~/.docker/config.json. (ref)
You might encounter errors using the script above, or you might be working on something that requires you to experiment with different Python versions. You can inspect make_docker.sh or simply experiment with:
docker build --build-arg="PY_VERSION=3.xx.xx" --tag ablator .
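If you prefer to match the image to your active environment by hand, here is a minimal sketch; it relies on the PY_VERSION build argument shown above:
# Derive the Python version of the active environment and pass it to the docker build.
PY_VERSION=$(python -c "import platform; print(platform.python_version())")
docker build --build-arg="PY_VERSION=${PY_VERSION}" --tag ablator .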
The Dockerfile is used to build the image.
To run the same image in interactive mode for debugging:
docker run -it ablator /bin/bash
NOTE Docker containers are ephemeral; any changes you make to a container will disappear once the container terminates, regardless of the mode you use to execute it. You can run a container in detached mode by adding the -d option, which will keep the container active in the background.
docker run -it -d ablator /bin/bash
To connect to a running container, first find the container_id of the container you want to connect to:
docker ps
then
docker exec -it <container_id> bash
If a GPU is detected on the system, the Docker tests in ABLATOR will try to start an NVIDIA Docker image. To install the NVIDIA Container Toolkit on your system please refer to the official documentation.
Optional (Not Recommended): To disable CUDA for the tests you can set export CUDA_VISIBLE_DEVICES=''
To install the NVIDIA Container Toolkit:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update \
&& sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo reboot
You must use WSL (see above) and you should follow the instructions for Ubuntu.
Natively, torch does not support CUDA on Mac. On Apple Silicon Macs there is MPS accelerator support. However, ABLATOR does not currently officially support MPS, so you will only be able to run experiments and tests on the CPU.
docker run --rm --runtime=nvidia --gpus all ablator nvidia-smi
During the execution of tests, a mock ray cluster is set up. Due to interruptions or unexpected errors, zombie docker containers can remain running. The zombie containers can interfere with tests running on the system, and it is BEST to terminate all running containers to avoid unwanted interactions. The run_tests.sh script automatically takes care of that. If you have other running containers, please modify the script so that it does not kill them.
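If you need to clean up manually, here is a minimal sketch; note that it stops every running container, so adapt it if you have unrelated containers:
# List running containers, then stop them all; -r skips the command when the list is empty.
docker ps
docker ps -q | xargs -r docker stop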
SSH should be enabled on the current system. For security reasons, it is recommended that you configure SSH to be inaccessible from outside your local network (a good guide). The easiest option is to disable the ssh server when you are not running tests, i.e. for Ubuntu sudo systemctl disable ssh, and stop it with sudo systemctl stop ssh. Additional security options can include preventing access to SSH from outside your local network. The risk is present when your user account has a weak password, your ssh keys are leaked, or you are connected to an insecure WiFi network.
To disable password login for ssh, you can modify /etc/ssh/sshd_config and set:
PasswordAuthentication no
PubkeyAuthentication yes
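After editing sshd_config, restart the service (on Ubuntu) for the changes to take effect:
sudo systemctl restart ssh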
To install a local ssh server you can use:
sudo apt install openssh-server
sudo systemctl start ssh
sudo systemctl status ssh
You should follow the official guide on how to set up remote access to your Mac.
Make sure your authorized_keys file has the correct permissions, i.e. chmod 600 ~/.ssh/authorized_keys
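The enclosing .ssh directory also needs restrictive permissions, otherwise sshd may reject the key:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys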
Make sure that ray is not currently running in your local environment, e.g. by running ray status.
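For example, to check for and shut down a stray local ray instance:
# ray status reports an error if no cluster is running; ray stop terminates local ray processes.
ray status
ray stop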
To test your changes, run the following in the main directory:
pytest .
You can also use multiple workers for the tests, e.g.
pytest -n 10 .
where -n is the number of parallel workers used to run the tests.
As the tests are slow (especially the ones that test multi-processing), when developing it is better to run only the tests that affect your changes and reserve running all tests for the end of your work, e.g. pytest -n 10 tests/your_tests.py
To avoid polluting the commit history, each commit should be tested before pushing. Each commit should pass the tests, pylint, mypy and flake8, and have a specific purpose, i.e. you should not be making test commits; you can experiment in a different branch and then use a separate branch for committing your working changes. This helps other people trace the commit history to specific issues in the future.
NOTE As there is currently no widely available GPU support in GitHub Actions, you must test your code both on a machine that has GPUs and inside a Docker container without GPUs. It might seem unnecessary, but there have been many cases where tests fail only when CUDA is present, or only when it is absent, even if your changes seem unrelated to that part of the app's workflow.
In the main directory (after activating the correct environment):
$ make test
It is always a good idea to run the tests both with and without GPU support:
$ make test-cpu
Before committing, make sure that the static code checks pass.
$ make static-checks
On Linux you can also run the Docker tests, which set up a clean environment for the tests to execute in and exercise the mp functionality (especially important if using WSL):
$ make docker-test
Please inspect the Makefile for additional targets.
There are two test categories: fast tests and mp (slow) tests. To run the mp tests you can specify the --mp flag, and to run the fast tests you can specify the --fast flag.
$ pytest --fast
or
$ pytest --mp
NOTE The MP tests are naturally flaky; distributed systems can have many things go wrong. Many of the tests are end-to-end and try to cover as much ground as possible. If for whatever reason a test is failing, you can execute it in isolation. If the test passes in isolation but not when run together with other tests, it could mean that a test or code change you made makes the tests flaky, which is not a good sign. In a realistic scenario, we want to reduce the flakiness of our distributed library.
The tests are slow, and the above workflow is too slow to execute every time you make a change to the code; it is better to reserve it for the very end of your development process. To execute a specific test you can run:
$ pytest tests/<test_folder>/<test_file.py>::<test_function_name>
To debug a test you can execute the debugger in the same file; there is logic implemented in each test file that allows you to run a specific test, automatically passing the test fixtures for you. For example:
if __name__ == "__main__":
    import copy  # needed for the deepcopy calls below

    from tests.conftest import run_tests_local

    _locals = locals()
    fn_names = [fn for fn in _locals if fn.startswith("test_")]
    # OR a specific test
    fn_names = ["test_name"]
    test_fns = [_locals[fn] for fn in fn_names]
    # Special locally defined fixtures or function parameters.
    kwargs = {
        "config": copy.deepcopy(_config),
        "train_config": copy.deepcopy(_train_config),
    }
    run_tests_local(test_fns, kwargs)
By default, the tests run with code coverage enabled, which will generate a coverage.xml in the main repo directory. The coverage report is by default uploaded to codecov every time you push an update. To visualize it locally you can use the official guide.
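For a quick local HTML report, here is a minimal sketch, assuming pytest-cov is installed and the package directory is named ablator:
# Generate an HTML coverage report; open htmlcov/index.html in a browser afterwards.
pytest --cov=ablator --cov-report=html .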