Merge branch 'sok_examples' into 'main'
SOK documentation. See merge request dl/hugectr/hugectr!1536
Showing 107 changed files with 4,987 additions and 8,646 deletions.
# Benchmark DLRM DCNV2 using TF + SOK + HKV

We need several steps to run the benchmark.

## Environment

1. Pull a Merlin TensorFlow docker image:
```bash
docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
```

2. Launch the Merlin TensorFlow container with the following command:
```bash
docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
```

3. Install SOK with the HKV backend from source:
```bash
git clone ssh://[email protected]:12051/dl/hugectr.git
cd hugectr
git submodule init && git submodule update
cd sparse_operation_kit
mkdir build && cd build
cmake -DSM={your SM version} ..
make -j && make install
rm -rf /usr/local/lib/python3.10/dist-packages/merlin_sok-1.x-py3.10-linux-x86_64.egg
cp -r ../sparse_operation_kit /usr/local/lib/python3.10/dist-packages/
```
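
A quick way to confirm the install took effect is a plain import check (a minimal sketch, nothing more; it only verifies which copy of the package Python picks up):
```python
# Minimal post-install check: make sure the freshly copied package is importable.
import sparse_operation_kit as sok

# Should point into /usr/local/lib/python3.10/dist-packages/sparse_operation_kit/.
print(sok.__file__)
```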
## How to Prepare the Dataset
Please generate the training data according to [the DLRM DCNV2 documentation](https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr#prepare-the-input-dataset).

## Benchmark

1. Go to the work directory:
```bash
cd documents/tutorials/DLRM_Benchmark
```

2. Prepare the Criteo Terabyte dataset (see the note after this list for what `--slot_size_array` encodes):

```bash
# train_data.bin and test_data.bin are the binary datasets generated by HugeCTR.
# splited_dataset/ is the target directory for the split dataset.
python3 ./preprocess/split_bin.py /path/to/train_data.bin splited_dataset/train --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
python3 ./preprocess/split_bin.py /path/to/test_data.bin splited_dataset/test --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
```

3. Run the benchmark:

Typically one GPU is allocated per process, so if a server has 4 GPUs, you will run 4 processes. In `horovodrun`, the number of processes is specified with the `-np` flag.

```bash
# batch size = 65536
horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=100 --lr=24
# batch size = 32768
horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=32768 --epochs=100 --lr=24
```

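For reference, the 26 entries of `--slot_size_array` are the per-slot (categorical feature) cardinalities. A quick back-of-the-envelope check (plain Python, nothing SOK-specific; the 128-dim fp32 layout assumed below matches this benchmark) shows how large a fully materialized embedding table would be:

```python
# Per-slot cardinalities passed to split_bin.py above.
slot_size_array = [
    39884406, 39043, 17289, 7420, 20263, 3, 7120, 1543, 63, 38532951,
    2953546, 403346, 10, 2208, 11938, 155, 4, 976, 14, 39979771,
    25641295, 39664984, 585935, 12972, 108, 36,
]

total_rows = sum(slot_size_array)
embedding_vec_size = 128  # dimension used in this benchmark
fp32_bytes = 4

# Roughly 1.88e8 embedding rows and about 96 GB of raw fp32 values in total.
print(f"{total_rows:,} rows")
print(f"{total_rows * embedding_vec_size * fp32_bytes / 1e9:.1f} GB of fp32 embeddings")
```
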
## Details about Customized Tests

### 1. Initialize the HKV

There are three key options when creating an HKV instance:

- `init_capacity`: The maximum number of KV pairs that HKV can hold when it is first created.
- `max_capacity`: The maximum number of KV pairs that HKV can hold once it has grown to its final size (through the end of training). During training, if the load factor exceeds a threshold, HKV's capacity is doubled, but it never exceeds `max_capacity`.
- `max_hbm_for_vectors`: The maximum amount of HBM that HKV may use to store values (vectors, embeddings). HKV does not allocate all of it up front; it requests the memory as needed, so make sure the system can actually provide this much HBM, or the program will crash.

### 2. Optimizer

The optimizer can be changed with the `--optimizer_name` flag (for example, append `--optimizer_name=adagrad` to the `horovodrun` command above); the supported values are `sgd`, `adamax`, `adagrad`, `adadelta`, and `ftrl`.
### 3. DynamicVariable Configuration

#### 3.1 Default behavior

When HKV is chosen as the backend of SOK, `DynamicVariable` should be initialized in this way:
```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
)
```
By default, `init_capacity` and `max_capacity` of HKV are both set to 64 * 1024 * 1024, and `max_hbm_for_vectors` is 16 GB.
#### 3.2 Customize

We can also customize the HKV configuration:
```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
    init_capacity=1024 * 1024,
    max_capacity=1024 * 1024,
    max_hbm_for_vectors=30,  # unit: GB
)
```
Be careful when setting `max_hbm_for_vectors`; three factors affect this value:

- Total HBM size.
- Type of optimizer.
- Batch size.

These factors limit how much HBM is actually left for HKV; if `max_hbm_for_vectors` is set without accounting for them, the program risks running out of memory. Note that HKV will not consume more resources than it needs: for example, it only consumes `max_capacity * dimension * elementSize` bytes to store embeddings when that amount is less than the `max_hbm_for_vectors` budget.

The table below gives reference `max_hbm_for_vectors` settings (in GB) for different batch sizes and optimizers:

| batch size \ optimizer | SGD | Adamax | Adagrad | Adadelta | FTRL |
| --- | --- | --- | --- | --- | --- |
| 32768 | 60G | 20G | 35G | 20G | 20G |
| 65536 | 60G | 20G | 20G | 20G | 20G |
| 131072 | 60G | 20G | 20G | 20G | 10G |
| 262144 | 60G | 20G | 20G | 20G | 10G |
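
To put the `max_capacity * dimension * elementSize` formula into numbers, here is a small back-of-the-envelope sketch (it assumes fp32 values, i.e. `elementSize = 4` bytes, and ignores HKV bucket/metadata overhead and optimizer state):

```python
def hkv_value_hbm_gb(max_capacity: int, dimension: int, element_size: int = 4) -> float:
    """Estimate the HBM needed to keep all HKV values (embeddings) on the GPU."""
    return max_capacity * dimension * element_size / (1024 ** 3)

# Defaults from section 3.1: 64M capacity, 128-dim fp32 vectors -> 32 GB of values,
# which is larger than the default 16 GB max_hbm_for_vectors budget.
print(hkv_value_hbm_gb(64 * 1024 * 1024, 128))  # 32.0
# The customized example in section 3.2: 1M capacity -> 0.5 GB of values.
print(hkv_value_hbm_gb(1024 * 1024, 128))       # 0.5
```
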
### Performance on 8 x H100
| batch size | exit criteria | frequency of evaluation | xla | amp | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 65536 | 1 epoch | at end | yes | yes | no | yes | 8.79 | 0.10 | 4.16M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 6.72 | 0.09 | 3.45M |