chore(benchmarks): tidy up benchmark (awslabs#292)
Delete leftover script files. Add minor change to Jupyter Notebook
(`"ec2_metadata"` key in results table). Simplify pyproject.toml
dependencies list. Change some example parameters in the Hydra config
files, add clarifying comments. Rework the READMEs. Tune the
utils/prepare_nvme.sh to work for both Amazon Linux and Ubuntu EC2
instances. Update global .gitignore. Delete
utils/prepare_ec2_instance.sh, and add its content to the README. For
dataset scenario, add training time measurement around epochs. Minor
Python code improvements.
matthieu-d4r authored Jan 10, 2025
1 parent 79582bd commit 58de12a
Showing 25 changed files with 315 additions and 556 deletions.
6 changes: 4 additions & 2 deletions .gitignore
@@ -60,8 +60,10 @@ venv.bak/
.dmypy.json
dmypy.json

# PyTorch benchmarks: Hydra, NVMe directory, and CSV results
s3torchbenchmarking/**/multirun/
s3torchbenchmarking/**/nvme/
s3torchbenchmarking/**/*.csv

# Rust .gitignore (https://github.com/github/gitignore/blob/main/Rust.gitignore) -- cherry-picked ######################

255 changes: 128 additions & 127 deletions s3torchbenchmarking/README.md
@@ -1,179 +1,180 @@
# s3torchbenchmarking

This Python package houses a set of benchmarks for experimentally evaluating the performance of
the **Amazon S3 Connector for PyTorch** library.

With the use of the [Hydra](https://hydra.cc/) framework, we are able to define modular configuration pieces mapped to
various stages of the training pipeline. This approach allows one to mix and match configurations and measure the
performance impact on the end-to-end training process.

**Four scenarios** are available:

1. **Dataset benchmarks**
   - Compare our connector against other Dataset classes
   - All scenarios save data to S3
   - Measure performance in data fetching and indexing
2. **PyTorch's Distributed Checkpointing (DCP) benchmarks**
   - Assess our connector's performance versus PyTorch's default distributed checkpointing mechanism
   - For detailed information, refer to the [dedicated DCP `README`](src/s3torchbenchmarking/dcp/README.md)
3. **PyTorch Lightning Checkpointing benchmarks**
   - Evaluate our connector within the PyTorch Lightning framework
   - Compare against PyTorch Lightning's default checkpointing implementation
4. **PyTorch Checkpointing benchmarks**
   - TODO!

## Getting started

The benchmarking code is located in the `src/s3torchbenchmarking` module. The scenarios are designed to be run on an EC2
instance with one (or many) GPU(s).

### EC2 instance setup (recommended)

From your EC2 AWS Console, launch an instance with one (or many) GPU(s) (e.g., G5 instance type); we recommend using
an [AWS Deep Learning AMI (DLAMI)][dlami], such
as [AWS Deep Learning AMI GPU PyTorch 2.5 (Amazon Linux 2023)][dlami-pytorch].

> [!NOTE]
> Some benchmarks can be long-running. To avoid the shortcomings around expired AWS tokens, we recommend attaching a
> role to your EC2 instance with:
>
> - Full access to S3
> - (Optional) Full access to DynamoDB — for writing run results
>
> See the [Running the benchmarks](#running-the-benchmarks) section for more details.
For optimal results, it is recommended to run the benchmarks on a dedicated EC2 instance _without_ other
resource-intensive processes.

### Creating a new Conda environment (env)

> [!WARNING]
> While some DLAMIs provide a pre-configured Conda env (`source activate pytorch`), we have observed compatibility
> issues with the latest PyTorch versions (2.5.X) at the time of writing. We recommend creating a new one from scratch
> as detailed below.
Once your instance is running, `ssh` into it, and create a new Conda env:

```shell
conda create -n pytorch-benchmarks python=3.12
conda init
```

Then, activate it (_you will need to log out and back in first, as signaled by `conda init`_):

```shell
source activate pytorch-benchmarks
```

Finally, from within this directory, install the `s3torchbenchmarking` module:

```shell
# `-e` so local modifications get picked up, if any
pip install -e .
```

> [!NOTE]
> For some scenarios, you may be required to install the [Mountpoint for Amazon S3][mountpoint-s3] file client: please
> refer to their README for instructions.
### (Pre-requisite) Configure AWS Credentials

The benchmarks and other commands provided below rely on the standard [AWS credential discovery mechanism][credentials].
Supplement the command as necessary to ensure the AWS credentials are made available to the process, e.g., by setting
the `AWS_PROFILE` environment variable.

### Creating a dataset (optional; for "dataset" benchmarks only)

You can use your own dataset for the benchmarks, or you can generate one on-the-fly using the `s3torch-datagen` command.

For example:

```shell
s3torch-datagen -n 100k --shard-size 128MiB --s3-bucket my-bucket --region us-east-1
```
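As a rough sanity check, you can estimate how many shards such a run will produce. The sketch below is illustrative only: the average sample size is an assumption (not a value from the benchmark suite), so measure your own dataset for real numbers.

```python
# Back-of-the-envelope estimate of the shard count produced by a datagen run.
# ASSUMPTION: ~50 KiB average on-disk size per generated image; this is a
# guess for illustration, not a measured value.
AVG_SAMPLE_BYTES = 50 * 1024          # assumed average image size
NUM_SAMPLES = 100_000                 # -n 100k
SHARD_BYTES = 128 * 1024 * 1024       # --shard-size 128MiB

total_bytes = NUM_SAMPLES * AVG_SAMPLE_BYTES
num_shards = -(-total_bytes // SHARD_BYTES)  # ceiling division

print(f"~{total_bytes / 2**30:.1f} GiB across ~{num_shards} shards")
```

Under these assumptions, 100k samples come to roughly 4.8 GiB, or about 39 shards of 128 MiB each.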

## Running the benchmarks

You can run the different benchmarks by editing their corresponding config files, then running one of the shell
scripts below (specifically, you must provide a value for every key marked with `???`):

```shell
# Dataset benchmarks
vim ./conf/dataset.yaml # 1. edit config
./utils/run_dataset_benchmarks.sh # 2. run scenario

# PyTorch Checkpointing benchmarks
vim ./conf/pytorch_checkpointing.yaml # 1. edit config
./utils/run_checkpoints_benchmarks.sh # 2. run scenario

# PyTorch Lightning Checkpointing benchmarks
vim ./conf/lightning_checkpointing.yaml # 1. edit config
./utils/run_lighning_benchmarks.sh # 2. run scenario

# PyTorch’s Distributed Checkpointing (DCP) benchmarks
vim ./conf/dcp.yaml # 1. edit config
./utils/run_dcp_benchmarks.sh # 2. run scenario
```

> [!NOTE]
> Ensure the bucket is in the same region as the EC2 instance, to eliminate network latency effects in your
> measurements.
Each of those scripts relies on Hydra config files, located under the [`conf`](conf) directory. You may edit those as
you see fit to configure the runs: in particular, the parameter lists under the `hydra.sweeper.params` path will create
as many jobs as the Cartesian product of their values.
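To make the Cartesian-product point concrete, here is a minimal sketch of how sweep lists turn into individual jobs. The parameter names below are made up for illustration and are not actual keys from the `conf` files:

```python
from itertools import product

# Hypothetical sweep values, mimicking the shape of a `hydra.sweeper.params`
# section (these keys are illustrative, not real config keys).
sweep = {
    "batch_size": [16, 32, 64],
    "num_workers": [4, 8],
}

# Hydra launches one job per element of the Cartesian product of the lists.
jobs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]

print(f"{len(jobs)} jobs")  # 3 values x 2 values = 6 jobs
for i, job in enumerate(jobs):
    print(i, job)
```

So sweeping three batch sizes against two worker counts yields six jobs per run.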

Also, since the scripts pass any inline parameters you give them to Hydra, you can override their behavior this way:

```shell
./utils/run_dataset_benchmarks.sh +disambiguator=some_key
```

## Getting the results

### Scenario organization

Benchmark results are organized as follows, inside a default `./multirun` directory (e.g.):

```
./multirun
└── dataset
└── 2024-12-20_13-42-27
├── 0
│ ├── benchmark.log
│ └── job_results.json
├── 1
│ ├── benchmark.log
│ └── job_results.json
├── multirun.yaml
└── run_results.json
```

Scenarios are organized at the top level, each in its own directory named after the scenario (e.g., `dataset`). Within
each scenario directory, you'll find individual run directories, automatically named by Hydra using the creation
timestamp (e.g., `2024-12-20_13-42-27`).

Each run directory contains job subdirectories (e.g., `0`, `1`, etc.), corresponding to a specific subset of parameters.

### Experiment reporting

Experiments report various metrics, such as throughput and processing time; the exact metrics vary per scenario.
Results are stored in two locations:

1. In the job subdirectories:
- `benchmark.log`: Individual job logs (collected by Hydra)
- `job_results.json`: Individual job results
2. In the run directory:
- `multirun.yaml`: Global Hydra configuration for the run
- `run_results.json`: Comprehensive run results, including additional metadata
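For quick inspection outside the bundled Jupyter notebook, the per-job files can be aggregated with a short script. The sketch below fabricates a run layout matching the tree above; the JSON field (`throughput`) is an assumed example, not the actual result schema:

```python
import json
import tempfile
from pathlib import Path

def collect_job_results(run_dir: Path) -> list[dict]:
    """Gather every job_results.json found in the job subdirectories of a run."""
    rows = []
    for path in sorted(run_dir.glob("*/job_results.json")):
        with path.open() as f:
            rows.append({"job": path.parent.name, **json.load(f)})
    return rows

# Fabricate a run directory shaped like the tree shown above.
run_dir = Path(tempfile.mkdtemp()) / "dataset" / "2024-12-20_13-42-27"
for job_id, throughput in [("0", 123.4), ("1", 150.9)]:
    job_dir = run_dir / job_id
    job_dir.mkdir(parents=True)
    (job_dir / "job_results.json").write_text(json.dumps({"throughput": throughput}))

rows = collect_job_results(run_dir)
for row in rows:
    print(row)
```

Point `run_dir` at a real `./multirun/<scenario>/<timestamp>` directory to aggregate actual results the same way.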

If a DynamoDB table is defined in the [`conf/aws/dynamodb.yaml`](conf/aws/dynamodb.yaml) configuration file, results
will also be written to the specified table.
[dlami]: https://docs.aws.amazon.com/dlami/

[dlami-pytorch]: https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-5-amazon-linux-2023/

[mountpoint-s3]: https://github.com/awslabs/mountpoint-s3/tree/main

[credentials]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

25 changes: 17 additions & 8 deletions s3torchbenchmarking/benchmark_results_aggregator.ipynb
@@ -12,7 +12,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6522fc8a931ffbc3",
"metadata": {
"ExecuteTime": {
@@ -39,7 +39,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a371fc9062af6126",
"metadata": {
"ExecuteTime": {
@@ -76,7 +76,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e14b9efad6ae3ad6",
"metadata": {
"ExecuteTime": {
@@ -127,6 +127,7 @@
" ),\n",
" **metrics_averaged,\n",
" \"config\": job_result[\"config\"],\n",
" \"ec2_metadata\": run_result[\"ec2_metadata\"],\n",
" }\n",
" rows.append(row)\n",
"\n",
@@ -143,7 +144,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"id": "be008fb6acf09055",
"metadata": {
"ExecuteTime": {
@@ -170,14 +171,18 @@
"source": [
"import pandas as pd\n",
"\n",
"_table = pd.DataFrame()\n",
"\n",
"if _run_results:\n",
"    _data = transform(_run_results)\n",
"    _table = pd.json_normalize(_data).set_index(\"version\")\n",
"\n",
"_table"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4eed2752e6add17",
"metadata": {
"ExecuteTime": {
@@ -191,7 +196,11 @@
"import random\n",
"\n",
"_suffix = \"\".join(random.choices(string.ascii_letters, k=5))\n",
"_filename = f\"benchmark_results_{_suffix}.csv\"\n",
"\n",
"if not _table.empty:\n",
"    _table.to_csv(_filename)\n",
"    print(f\"CSV written to {_filename}\")"
]
}
],