cd ~/JetStream/benchmarks
pip install -r requirements.in
cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
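To sanity-check the download: the file is a single JSON array of conversations. A quick peek (the field names below follow the public ShareGPT format and are worth verifying against your copy):

import json, os

# Load the downloaded ShareGPT dataset and inspect its shape.
path = os.path.expanduser("~/data/ShareGPT_V3_unfiltered_cleaned_split.json")
with open(path) as f:
    data = json.load(f)

print(f"{len(data)} conversations")
# In the public ShareGPT format, each record has an 'id' and a
# 'conversations' list of {'from': 'human'|'gpt', 'value': ...} turns.
print(data[0]["id"], len(data[0]["conversations"]))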
Run the benchmark against the ShareGPT dataset:
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024
For Llama 3 models, pass the Llama 3 tokenizer path and --model llama-3:
python benchmark_serving.py \
--tokenizer <llama3 tokenizer path> \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--model llama-3
Use the --save-request-outputs flag to save the predictions to a file:
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--save-request-outputs
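To spot-check what was saved, load the outputs file (here assuming you also passed --request-outputs-file-path outputs.json; the exact fields in each record depend on the benchmark version, so list the keys rather than assuming them):

import json

# Inspect the saved request outputs (path set via --request-outputs-file-path).
with open("outputs.json") as f:
    outputs = json.load(f)

print(f"{len(outputs)} saved records")
print(sorted(outputs[0].keys()))  # field names vary across benchmark versions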
To automatically evaluate the outputs with the ROUGE metric, add the --run-eval true flag. Note: if --save-result is also used, the evaluation scores are saved with the results.
python benchmark_serving.py \
--tokenizer /home/{username}/maxtext/assets/tokenizer \
--num-prompts 10 \
--dataset sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-output-length 1024 \
--save-request-outputs \
--run-eval true
Example running the benchmark with the openorca dataset (used by MLPerf) and sampled warmup:
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer ~/maxtext/assets/tokenizer.llama2 \
--warmup-mode sampled \
--save-result \
--save-request-outputs \
--request-outputs-file-path outputs.json \
--num-prompts 1000 \
--max-output-length 1024 \
--dataset openorca
The benchmark performs better if it first warms up the JetStream server. Two warmup modes are currently supported: sampled and full. full mode warms up the server with all of the input requests; sampled mode warms up the server with a sample of the input requests drawn from different buckets of input lengths (see the sketch after the next example).
Example to run the benchmark with full warmup mode:
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer ~/maxtext/assets/tokenizer.llama2 \
--warmup-mode full \
--save-result \
--save-request-outputs \
--request-outputs-file-path outputs.json \
--num-prompts 1000 \
--max-output-length 1024 \
--dataset openorca
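For intuition, sampled warmup boils down to grouping the input requests by prompt length and sending the server a few requests from each bucket. A rough sketch of that idea (illustrative only, not the benchmark's actual implementation; the bucket boundaries here are made up):

import random

def sample_for_warmup(prompt_lengths, per_bucket=2, buckets=(64, 256, 1024, 4096)):
    # Group request indices by the smallest bucket their prompt length fits in.
    grouped = {b: [] for b in buckets}
    for i, n in enumerate(prompt_lengths):
        for b in buckets:
            if n <= b:
                grouped[b].append(i)
                break
    # Pick a few requests per bucket so short, medium, and long prompts
    # all hit the server before measurement starts.
    picked = []
    for idxs in grouped.values():
        picked.extend(random.sample(idxs, min(per_bucket, len(idxs))))
    return picked

print(sample_for_warmup([12, 80, 300, 2000, 50, 700, 3500, 9]))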
If you used --save-request-outputs, you can evaluate the saved outputs separately:
python eval_accuracy.py outputs.json
With the openorca dataset and the llama2-chat models used by MLPerf, the reference accuracy numbers are:
llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}
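These are ROUGE scores on a 0-100 scale. To see how a single prediction scores against its reference, the rouge_score package can be used directly (whether eval_accuracy.py matches its exact preprocessing, e.g. stemming, is an assumption to verify against the script):

# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The cat sat on the mat.",            # reference output
    prediction="A cat was sitting on the mat.",  # model output
)
for name, s in scores.items():
    print(name, round(s.fmeasure * 100, 4))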