
Text Summarization with Llama2-70b for Student Cluster Competition 2025

Introduction

This guide is designed for the Student Cluster Competition 2025 to walk participants through running and optimizing the MLPerf Inference Benchmark using Llama2-70B across various software and hardware configurations. The goal is to maximize system throughput (measured in tokens per second) without compromising accuracy. Since the model performs poorly on CPUs, it is essential to run it on GPUs.

For a valid MLPerf Inference submission in this competition, you must run both a performance test and an accuracy test; no compliance runs are required. We use the Offline scenario, where throughput is the key metric (higher is better). For Llama2-70B with the OpenOrca dataset (24,576 samples), the performance run must process an integer multiple of the full dataset (24,576 × N samples), while the accuracy run must process exactly the full dataset (24,576 samples). These requirements are taken care of by the MLPerf inference implementations. Setup for NVIDIA GPUs typically takes 2–3 hours and can be done offline. The final output is a tarball (mlperf_submission.tar.gz) containing MLPerf-compatible results, which can be submitted to the organizers via a CLI command.

Scoring

In the SCC, your first objective will be to get a valid MLPerf benchmark run. Traditionally, running the reference MLPerf inference implementation (in Python) is easier than running the Nvidia MLPerf inference implementation. Since SCC25 uses the Llama2-70B model, the reference implementation needs around 600GB of VRAM and has been tested only on 8x H100 Nvidia GPUs. If you have less VRAM, a vendor implementation such as Nvidia's or AMD's is the best option.

MLCommons provides automation to run the MLPerf inference benchmarks, which you can make use of. The automation currently supports the reference implementation as well as the Nvidia implementation, and it is useful for getting a quick valid result since it produces the required final output. You can also follow the manual steps in the reference, Nvidia, or AMD implementation READMEs.

Once the initial run is successful, you'll have the opportunity to optimize the benchmark further by maximizing system utilization, applying quantization techniques, adjusting ML frameworks, experimenting with batch sizes, and more, all of which can earn you additional points.

Since vendor implementations of the MLPerf inference benchmark vary, teams will compete within their respective hardware categories (e.g., Nvidia GPUs, AMD GPUs). Points will be awarded based on the throughput achieved on your system.

Additionally, significant bonus points will be awarded if your team enhances an existing implementation, enables multi-node execution, or adds/extends scripts in the mlperf-automations repository to support new devices, frameworks, implementations, etc. All improvements must be made publicly available under the Apache 2.0 license and submitted as pull requests by November 10, 2025; only merge-ready code will be considered for evaluation. As a guideline, below are some examples that can fetch you bonus points.

  • Add multi-node execution support for the Nvidia, AMD, or reference implementations
  • Add automation support for the AMD implementation
  • Add fp8/fp4 quantization support for the reference implementation
  • Automate the network reference implementation (which uses OpenAI-compatible endpoints)
  • The MLPerf automation supports docker runs of the Nvidia implementation; supporting apptainer would be a valuable contribution

PS: For any query regarding contributions, feel free to raise an issue in the Inference or MLPerf automations repositories.

Info

Both MLPerf and MLC automation are evolving projects. If you encounter issues related to SCC, please submit them here with the scc-25 label, including the command used, error logs, and any additional useful information to help debug the issue.

Artifacts to submit to the SCC committee

You will need to submit the following files:

  • mlperf_submission.run - the MLC commands used to run the MLPerf inference benchmark, saved to this file.
  • mlperf_submission.md - a description of your platform and some highlights of the MLPerf benchmark execution.
  • <Team Name> - the name under which results are pushed to the GitHub repository.
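For reference, mlperf_submission.run can simply record the exact MLC commands you executed. A minimal sketch based on the reference-implementation commands later in this guide (adapt the model, implementation, and device to what you actually ran):

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 --implementation=reference --framework=pytorch \
   --category=datacenter --scenario=Offline --execution_mode=valid --device=cuda --quiet
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 --implementation=reference --framework=pytorch \
   --category=datacenter --scenario=Offline --execution_mode=valid --device=cuda --quiet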

SCC interview

You are encouraged to highlight and explain the MLPerf inference throughput obtained on your system and to describe any improvements and extensions to this benchmark (such as adding a new hardware backend or supporting multi-node execution) that are useful for the community and MLCommons.

Run Commands

MLPerf Reference Implementation in Python

LLAMA2-70B-99

Datacenter category

In the datacenter category, llama2-70b-99 has the Offline scenario, and this scenario is mandatory for a closed division submission.

PyTorch framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts
Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size (an example is shown after the Offline run commands below).

  • An earlier version tag (e.g., _r5.0-dev) can be given instead of _r5.1-dev if you want to run the benchmark against an older MLPerf version.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or an accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet --rerun
The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

  • --docker_os=ubuntu: ubuntu and rhel are supported.
  • --docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
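For example, when tuning throughput you might rerun the valid performance run with an explicit batch size. A sketch combining the options above (--batch_size takes effect only if the implementation supports it, and 8 is a placeholder value; tune it to your hardware):

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --batch_size=8 \
   --quiet --rerun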
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
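# Subsequent mlcr commands should pick up this option and use the 'mlperf' virtual environment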
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • An earlier version tag (e.g., _r5.0-dev) can be given instead of _r5.1-dev if you want to run the benchmark against an older MLPerf version.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or an accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: To be updated

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • An earlier version tag (e.g., _r5.0-dev) can be given instead of _r5.1-dev if you want to run the benchmark against an older MLPerf version.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or an accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet --rerun
The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

  • It is advisable to use the commands in the Docker tab for CUDA. Run the native commands below only if you are already on a CUDA setup with cuDNN and TensorRT installed; a quick GPU sanity check is sketched below.
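A minimal sanity check, assuming PyTorch is already installed in your current environment, is to confirm that it can see your GPUs:

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"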
# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • An earlier version tag (e.g., _r5.0-dev) can be given instead of _r5.1-dev if you want to run the benchmark against an older MLPerf version.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or an accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
ROCm device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • An earlier version tag (e.g., _r5.0-dev) can be given instead of _r5.1-dev if you want to run the benchmark against an older MLPerf version.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or an accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

LLAMA2-70B-99.9

Datacenter category

In the datacenter category, llama2-70b-99.9 has the Offline scenario, and this scenario is mandatory for a closed division submission.

PyTorch framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts
Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: To be updated

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

  • It is advisable to use the commands in the Docker tab for CUDA. Run the native commands below only if you are already on a CUDA setup with cuDNN and TensorRT installed.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
ROCm device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 900GB for manual execution of reference implementation and 1.5TB for automated run through MLC-Scripts
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet  
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet  

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
  • If you want to download the official MLPerf model and dataset for llama2-70b-99.9, you can follow this README.

Nvidia MLPerf Implementation

LLAMA2-70B-99

Datacenter category

In the datacenter category, llama2-70b-99 has the Offline scenario, and this scenario is mandatory for a closed division submission.

TensorRT framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 2x80GB

  • Disk Space: 900GB for manual execution of vendor implementation and 1.5TB for automated run through MLC-Scripts

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --tp_size=2 --rerun
The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

  • --gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, a default configuration based on the GPU memory will be used.
  • Add --adr.llama2-model.tags=_pre-quantized to use the Nvidia quantized models available in the MLC Storage (see the example below). These models were quantized with three different configurations of tensor parallelism and pipeline parallelism: TP1–PP2, TP2–PP1, and TP1–PP1. The appropriate model will be selected automatically based on the values provided for --tp_size and --pp_size in the run command. By default, a TP size of 2 and a PP size of 1 are used.
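For example, a docker launch using a pre-quantized model with a tensor-parallel size of 2 and pipeline-parallel size of 1 could look like the following sketch (combine it with the other options above as needed for your system):

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --adr.llama2-model.tags=_pre-quantized \
   --tp_size=2 --pp_size=1 --rerun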
Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet  \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
  • --gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, a default configuration based on the GPU memory will be used (see the example below).
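For instance, on a system with A100 GPUs, the valid performance run could name the GPU explicitly, as in the sketch below (replace the placeholders with your tensor-parallel size and the path to your preprocessed dataset):

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --gpu_name=a100 \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>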

LLAMA2-70B-99.9

Datacenter category

In the datacenter category, llama2-70b-99.9 has the Offline scenario, and this scenario is mandatory for a closed division submission.

TensorRT framework

CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 2x80GB

  • Disk Space: 900GB for manual execution of vendor implementation and 1.5TB for automated run through MLC-Scripts

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for llama2-70b-99.

Performance Estimation for Offline Scenario

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet \
   --tp_size=2 --rerun
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
performance-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>
accuracy-only
mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=llama2-70b-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet \
   --tp_size=<TP_SIZE> \
   --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
  • --gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, a default configuration based on the GPU memory will be used.

Submission Commands

Generate actual submission tree

mlcr generate,inference,submission,_wg-inference \
   --clean \
   --run-checker \
   --tar=yes \
   --env.MLC_TAR_OUTFILE=submission.tar.gz \
   --division=open \
   --category=datacenter \
   --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \
   --quiet \
   --submitter=<Team Name>
  • Use --hw_name="My system name" to give a meaningful system name.
  • At the end, a .tar.gz file will be generated inside the current working directory.
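Before submitting, you can do a quick sanity check of the generated tarball's contents, for example:

tar -tzf submission.tar.gz | head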

Submit Results

Note: Further instructions on the final submission will be published as the deadline approaches.