Text-to-Image with Stable Diffusion for Student Cluster Competition 2024
Introduction
This guide is designed for the Student Cluster Competition 2024 to walk participants through running and optimizing the MLPerf Inference Benchmark using Stable Diffusion XL 1.0 across various software and hardware configurations. The goal is to maximize system throughput (measured in samples per second) without compromising accuracy. Since the model performs poorly on CPUs, it is essential to run it on GPUs.
For a valid MLPerf inference submission, two types of runs are required: a performance run and an accuracy run. In this competition, we focus on the Offline scenario, where throughput is the key metric (higher is better). The official MLPerf inference benchmark for Stable Diffusion XL requires processing a minimum of 5,000 samples in both performance and accuracy modes using the COCO 2014 dataset. For SCC, however, this requirement has been reduced and two variants are available. The `scc-base` variant reduces the dataset to 50 samples, making it possible to complete both performance and accuracy runs in approximately 5-10 minutes. The `scc-main` variant uses a dataset of 500 samples, and running it earns extra points compared to running only the base variant. Setting up for Nvidia GPUs may take 2-3 hours but can be done offline. Your final output will be a tarball (`mlperf_submission.tar.gz`) containing MLPerf-compatible results, which you will submit to the SCC organizers for scoring.
Scoring
In the SCC, your first objective will be to run the `scc-base` variant of either the reference (unoptimized) Python implementation or a vendor-provided version (such as Nvidia's) of the MLPerf inference benchmark to secure a baseline score.
Once the initial run is successful, you'll have the opportunity to optimize the benchmark further by maximizing system utilization, applying quantization techniques, adjusting ML frameworks, experimenting with batch sizes, and more, all of which can earn you additional points.
Since vendor implementations of the MLPerf inference benchmark vary and are often limited to single-node benchmarking, teams will compete within their respective hardware categories (e.g., Nvidia GPUs, AMD GPUs). Points will be awarded based on the throughput achieved on your system.
Additionally, significant bonus points will be awarded if your team enhances an existing implementation, adds support for new hardware (such as an unsupported GPU), enables multi-node execution, or adds/extends scripts in the cm4mlops repository to support new devices, frameworks, implementations, etc. All improvements must be made publicly available under the Apache 2.0 license and submitted alongside your results to the SCC committee to earn these bonus points, contributing back to the MLPerf community.
Info
Both MLPerf and CM automation are evolving projects. If you encounter issues or have questions, please submit them here
Artifacts to submit to the SCC committee
You will need to submit the following files:
- `mlperf_submission.run` - the CM commands used to run the MLPerf inference benchmark, saved to this file.
- `mlperf_submission.md` - a description of your platform and some highlights of the MLPerf benchmark execution.
- `<Team Name>` - the name under which results are pushed to the GitHub repository.
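As a rough illustration (the exact contents depend on the commands you actually run), `mlperf_submission.run` can simply be a record of the CM commands you executed, for example:

# Example mlperf_submission.run (illustrative; replace with your actual commands)
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl --implementation=reference --framework=pytorch \
   --category=datacenter --scenario=Offline --execution_mode=test \
   --device=cuda --quiet --precision=float16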
SCC interview
You are encouraged to highlight and explain the MLPerf inference throughput obtained on your system and to describe any improvements and extensions to this benchmark (such as adding a new hardware backend or supporting multi-node execution) that are useful for the community and MLCommons.
Run Commands
MLPerf Reference Implementation in Python
Tip
- MLCommons reference implementations are only meant to provide a rules-compliant reference for submitters and are in most cases not the best performing. If you want to benchmark a particular system, it is advisable to use the vendor MLPerf implementation for that system (e.g., Nvidia, Intel).
SDXL
Datacenter category
In the datacenter category, SDXL has the Offline scenario, and it is mandatory for a closed division submission.
Pytorch framework
ROCm device
Minimum system requirements for running the benchmark:
- Disk Space: 50GB
Native Environment
Please refer to the installation page to install CM for running the automated benchmark commands.
# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario
Tip
- Batch size can be adjusted using `--batch_size=#`, where `#` is the desired batch size. This option works only if the implementation in use supports the given batch size (see the example after the Offline command below).
- Add `--adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the official MLPerf Inference implementation in a custom fork.
- Add `--adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the model config accuracy script in the submission checker within a custom fork.
- Add `--adr.inference-src.version=custom` if you are using the modified MLPerf Inference code or accuracy script on the submission checker within a custom fork.
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet \
--precision=float16
Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet --precision=float16
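As noted in the tip above, you can experiment with the batch size if the reference implementation supports it. A minimal sketch (the value 4 is only an illustrative choice; pick one that fits your GPU memory):

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet --precision=float16 \
   --batch_size=4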
More options for the run command:
- Use `--division=closed` to do a closed division submission, which includes compliance runs.
- Use `--rerun` to do a rerun even when a valid run exists.
CUDA device
Minimum system requirements for running the benchmark:
- Device Memory: 24GB (fp32), 16GB (fp16)
- Disk Space: 50GB
Docker Environment
Please refer to the installation page to install CM for running the automated benchmark commands.
# Docker Container Build and Performance Estimation for Offline Scenario
Tip
- Batch size can be adjusted using `--batch_size=#`, where `#` is the desired batch size. This option works only if the implementation in use supports the given batch size.
- Add `--adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the official MLPerf Inference implementation in a custom fork.
- Add `--adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the model config accuracy script in the submission checker within a custom fork.
- Add `--adr.inference-src.version=custom` if you are using the modified MLPerf Inference code or accuracy script on the submission checker within a custom fork.
Tip
The `--env.CM_MLPERF_MODEL_SDXL_DOWNLOAD_TO_HOST=yes` option can be used to download the model on the host so that it can be reused across different container launches (see the example after the command below).
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--docker --quiet \
--precision=float16
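To cache the model on the host as described in the tip above, the same launch can be made with the extra environment flag; this is a sketch, with the rest of the command unchanged:

cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --precision=float16 \
   --env.CM_MLPERF_MODEL_SDXL_DOWNLOAD_TO_HOST=yes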
More options for the docker launch:
- `--docker_cm_repo=<Custom CM GitHub repo URL in username@repo format>`: to use a custom fork of the cm4mlops repository inside the docker image
- `--docker_cm_repo_branch=<Custom CM GitHub repo Branch>`: to check out a custom branch of the cloned cm4mlops repository inside the docker image
- `--docker_cache=no`: to not use the docker cache during the image build
Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--quiet --precision=float16
More options for the run command:
- Use `--division=closed` to do a closed division submission, which includes compliance runs.
- Use `--rerun` to do a rerun even when a valid run exists.
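As a hedged example, a closed division rerun of the Offline command above would add both options from the list; attempt this only after your base runs complete successfully:

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet --precision=float16 \
   --division=closed \
   --rerun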
Native Environment
Please refer to the installation page to install CM for running the automated benchmark commands.
Tip
- It is advisable to use the commands in the Docker tab for CUDA. Run the below native command only if you are already on a CUDA setup with cuDNN and TensorRT installed.
# Setup a virtual environment for Python
cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario
Tip
- Batch size can be adjusted using `--batch_size=#`, where `#` is the desired batch size. This option works only if the implementation in use supports the given batch size.
- Add `--adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the official MLPerf Inference implementation in a custom fork (see the sketch at the end of this section).
- Add `--adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the model config accuracy script in the submission checker within a custom fork.
- Add `--adr.inference-src.version=custom` if you are using the modified MLPerf Inference code or accuracy script on the submission checker within a custom fork.
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--quiet \
--precision=float16
Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--quiet --precision=float16
More options for the run command:
- Use `--division=closed` to do a closed division submission, which includes compliance runs.
- Use `--rerun` to do a rerun even when a valid run exists.
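Tying together the custom-fork tips above, a run of the reference implementation against your own fork of the MLPerf Inference code could look like the sketch below; the repository URL and branch are hypothetical placeholders for your fork:

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet --precision=float16 \
   --adr.mlperf-implementation.tags=_branch.master,_repo.https://github.com/myteam/inference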
Nvidia MLPerf Implementation
SDXL
Datacenter category
In the datacenter category, SDXL has the Offline scenario, and it is mandatory for a closed division submission.
TensorRT framework
CUDA device
Minimum system requirements for running the benchmark:
- Device Memory: 16GB
- Disk Space: 50GB
Docker Environment
Please refer to the installation page to install CM for running the automated benchmark commands.
# Docker Container Build and Performance Estimation for Offline Scenario
Tip
- Batch size can be adjusted using `--batch_size=#`, where `#` is the desired batch size. This option works only if the implementation in use supports the given batch size.
Tip
The `--env.CM_MLPERF_MODEL_SDXL_DOWNLOAD_TO_HOST=yes` option can be used to download the model on the host so that it can be reused across different container launches.
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=nvidia \
--framework=tensorrt \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--docker --quiet
More options for the docker launch:
- `--docker_cm_repo=<Custom CM GitHub repo URL in username@repo format>`: to use a custom fork of the cm4mlops repository inside the docker image
- `--docker_cm_repo_branch=<Custom CM GitHub repo Branch>`: to check out a custom branch of the cloned cm4mlops repository inside the docker image
- `--docker_cache=no`: to not use the docker cache during the image build
- `--gpu_name=<Name of the GPU>`: the GPUs with supported configs in CM are `orin`, `rtx_4090`, `rtx_a6000`, `rtx_6000_ada`, `l4`, `t4` and `a100`. For other GPUs, a default configuration based on the GPU memory will be used.
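For instance, on an RTX 4090 system you could pin the supported configuration explicitly; a sketch (drop `--gpu_name` to let CM pick a default based on GPU memory):

cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --gpu_name=rtx_4090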
Offline
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=nvidia \
--framework=tensorrt \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--quiet
More options for the run command:
- Use `--division=closed` to do a closed division submission, which includes compliance runs.
- Use `--rerun` to do a rerun even when a valid run exists.
- `--gpu_name=<Name of the GPU>`: the GPUs with supported configs in CM are `orin`, `rtx_4090`, `rtx_a6000`, `rtx_6000_ada`, `l4`, `t4` and `a100`. For other GPUs, a default configuration based on the GPU memory will be used.
Info
Once the above run is successful, you can change `_scc24-base` to `_scc24-main` in the run command to run the main variant.
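For example, the Nvidia Offline command above becomes the following for the main variant (a sketch, assuming the base run completed successfully):

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-main \
   --model=sdxl \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet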
Submission Commands
Generate actual submission tree
cm run script --tags=generate,inference,submission \
--clean \
--run-checker \
--tar=yes \
--env.CM_TAR_OUTFILE=submission.tar.gz \
--division=open \
--category=datacenter \
--env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \
--run_style=test \
--adr.submission-checker.tags=_short-run \
--quiet \
--submitter=<Team Name>
- Use `--hw_name="My system name"` to give a meaningful system name.
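For example, a variant of the submission command above with a descriptive system name (the name shown is only a placeholder):

cm run script --tags=generate,inference,submission \
   --clean \
   --run-checker \
   --tar=yes \
   --env.CM_TAR_OUTFILE=submission.tar.gz \
   --division=open \
   --category=datacenter \
   --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \
   --run_style=test \
   --adr.submission-checker.tags=_short-run \
   --quiet \
   --hw_name="4xA100-dual-epyc" \
   --submitter=<Team Name>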
Push Results to GitHub
Fork the `mlperf-inference-results-scc24` branch of the repository at https://github.com/mlcommons/cm4mlperf-inference.
Run the following command after replacing `--repo_url` with your GitHub fork URL.
cm run script --tags=push,github,mlperf,inference,submission \
--repo_url=https://github.com/<myfork>/cm4mlperf-inference \
--repo_branch=mlperf-inference-results-scc24 \
--commit_message="Results on system <HW Name>" \
--quiet
Once uploaded, open a pull request to the origin repository. A GitHub Action will run there, and once it finishes you can see your submitted results at https://docs.mlcommons.org/cm4mlperf-inference.