Text Summarization using LLAMA2-70b

MLPerf Reference Implementation in Python

Tip

  • MLCommons reference implementations are only meant to provide a rules-compliant reference implementation for submitters and are, in most cases, not the best performing. If you want to benchmark a particular system, it is advisable to use the vendor's MLPerf implementation for that system (e.g. Nvidia, Intel).

LLAMA2-70B-99

Datacenter category

In the datacenter category, llama2-70b-99 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

Pytorch framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=50
The above command should build the docker container, get you to an interactive shell inside it, and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to do the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

  • --docker_cm_repo=<Custom CM repo URL>: to use a custom fork of the cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

  • --docker_os=ubuntu: ubuntu and rhel are supported.
  • --docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
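
For example, a container build that skips the Docker build cache and pins the base image to Ubuntu 22.04 would add the options above to the build command (a sketch; all other flags stay as in the command shown earlier):

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=50 \
   --docker_cache=no \
   --docker_os=ubuntu \
   --docker_os_version=22.04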
Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
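
As a worked example with hypothetical numbers: if the Offline test run above reported roughly 10 samples per second, a reasonable starting point is about 80% of that, i.e. --server_target_qps=8 (a sketch; all other flags are the same as in the Server command above):

cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=8 \
   --execution_mode=valid \
   --device=cpu \
   --quiet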
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists
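
For example, a closed division Offline run that is forced to rerun even when a valid result already exists (a sketch combining the two options above with the Offline command):

cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --division=closed \
   --rerun \
   --quiet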

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.
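
The recorded estimate is picked up by the subsequent valid runs; if you want to override it, the target can also be passed explicitly (a sketch with a hypothetical value of 2 QPS, assuming the --offline_target_qps option of the run command):

cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --offline_target_qps=2 \
   --quiet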

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 8x80GB

  • Disk Space: 700GB

Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=50
The above command should build the docker container, get you to an interactive shell inside it, and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to do the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

  • --docker_cm_repo=<Custom CM repo URL>: to use a custom fork of the cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

ROCm device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

LLAMA2-70B-99.9

Datacenter category

In the datacenter category, llama2-70b-99.9 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

Pytorch framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 8x80GB

  • Disk Space: 700GB

Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

ROCm device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
   --test_query_count=50
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • If you want to download the official MLPerf model and dataset for llama2-70b-99.9, you can follow this README.

Nvidia MLPerf Implementation

LLAMA2-70B-99

Datacenter category

In the datacenter category, llama2-70b-99 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

TensorRT framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 80GB

  • Disk Space: 700GB

Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

If run with --all_models=yes, all the benchmark models of the NVIDIA implementation can be run within the same container.

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=50 \
   --tp_size=2 \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
The above command should build the docker container, get you to an interactive shell inside it, and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to do the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

  • --docker_cm_repo=<Custom CM repo URL>: to use a custom fork of the cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

LLAMA2-70B-99.9

Datacenter category

In the datacenter category, llama2-70b-99.9 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

TensorRT framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 80GB

  • Disk Space: 700GB

Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50 \
   --tp_size=2 \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Neural Magic MLPerf Implementation

LLAMA2-70B-99

Datacenter category

In the datacenter category, llama2-70b-99 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

pytorch framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Run the Inference Server
cm run script --tags=run,vllm-server \
 --model=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --quiet

Tip

  • The host and port of the server can be configured through --host and --port (see the example below). Otherwise, the server will run on the default host localhost and port 8000.
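
For example, to serve on a different host and port and point the benchmark commands at it (a sketch with hypothetical values):

# Start the vLLM inference server on port 8080, listening on all interfaces
cm run script --tags=run,vllm-server \
 --model=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --host=0.0.0.0 \
 --port=8080 \
 --quiet

# Then pass the matching endpoint to the benchmark commands below, e.g.
#   --api_server=http://0.0.0.0:8080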
# Docker Container Build and Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=50 \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
The above command should build the docker container, get you to an interactive shell inside it, and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to do the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

  • --docker_cm_repo=<Custom CM repo URL>: to use a custom fork of the cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

# Run the Inference Server
cm run script --tags=run,vllm-server \
 --model=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
 --quiet

Tip

  • The host and port of the server can be configured through --host and --port. Otherwise, the server will run on the default host localhost and port 8000.
# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50 \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

LLAMA2-70B-99.9

Datacenter category

In the datacenter category, llama2-70b-99.9 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

pytorch framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 700GB
Docker Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50 \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

Native Environment

Please refer to the installation page to install CM for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=50 \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm
Server
cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
All Scenarios
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=llama2-70b-99.9 \
   --implementation=neuralmagic \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --api_server=http://localhost:8000 \
   --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
   --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm

Tip

  • <SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists