Question Answering using BERT-Large

MLCommons-Python | Nvidia

MLPerf Reference Implementation in Python

Tip

MLCommons reference implementations are meant only to provide a rules-compliant reference for submitters and, in most cases, are not the best performing. To benchmark a system, it is advisable to use the vendor MLPerf implementation for that system, such as Nvidia or Intel.

BERT-99

datacenter | edge

Datacenter category

In the datacenter category, bert-99 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

PyTorch | DeepSparse

PyTorch framework

CPU | CUDA | ROCm

CPU device

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Docker | Native

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=100 --rerun

The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to do the accuracy and performance runs for each scenario.
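As a hedged sketch of how the optional tuning flags from the tip fit onto this command, the helper below only prints the find-performance command instead of running it. The function name `find_perf_cmd` and its two positional arguments are illustrative assumptions and not part of MLCFlow; the flags themselves are the ones listed on this page.

```shell
# Hypothetical helper (not part of MLCFlow): print the Offline
# find-performance command with optional tuning flags appended.
find_perf_cmd() {
    threads="${1:-}"      # optional value for --threads
    batch_size="${2:-}"   # optional value for --batch_size
    cmd="mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev"
    cmd="$cmd --model=bert-99 --implementation=reference --framework=pytorch"
    cmd="$cmd --category=datacenter --scenario=Offline --execution_mode=test"
    cmd="$cmd --device=cpu --docker --quiet --test_query_count=100 --rerun"
    if [ -n "$threads" ]; then cmd="$cmd --threads=$threads"; fi
    if [ -n "$batch_size" ]; then cmd="$cmd --batch_size=$batch_size"; fi
    printf '%s\n' "$cmd"
}

# Dry run: print the command so it can be reviewed before running it.
find_perf_cmd 8 4
```

Whether a given thread count or batch size is honored still depends on the implementation in use, as noted in the tip.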

More options for the docker launch:

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--docker_os=ubuntu: ubuntu and rhel are supported.
--docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
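A small preflight check can catch an unsupported OS and version pair before an image build starts. The sketch below encodes only the combinations listed above; the function name `docker_os_supported` is a hypothetical helper, not an MLCFlow command.

```shell
# Hypothetical helper: succeed only for the OS/version pairs listed above.
docker_os_supported() {
    case "$1:$2" in
        ubuntu:20.04|ubuntu:22.04) return 0 ;;
        rhel:8|rhel:9)             return 0 ;;
        *)                         return 1 ;;
    esac
}

if docker_os_supported ubuntu 22.04; then
    # Flags that could be appended to the docker launch command.
    echo "--docker_os=ubuntu --docker_os_version=22.04"
fi
```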

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
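As a rough starting point for the manual search, the rule of thumb above can be turned into arithmetic. The sketch below is not part of MLCFlow: `estimate_server_qps` is a hypothetical name, the Offline QPS must come from your own Offline run, and the 0.8 default is only the "usually around 80%" figure quoted above.

```shell
# Hypothetical helper: scale a measured Offline QPS by a fraction
# (default 0.8) to get a first guess for <SERVER_TARGET_QPS>.
estimate_server_qps() {
    offline_qps="$1"
    fraction="${2:-0.8}"
    awk -v q="$offline_qps" -v f="$fraction" 'BEGIN { printf "%.2f\n", q * f }'
}

# Example with a made-up Offline result of 100 QPS.
estimate_server_qps 100        # first guess at 80%
estimate_server_qps 100 0.5    # more conservative guess at 50%
```

If the Server run comes back invalid because the latency constraint is not met, lower the fraction and rerun.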

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
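To see what a set of closed-division runs could look like, the loop below prints one command per scenario and run mode instead of executing anything. It is a sketch under assumptions: the `print_run_matrix` name and the example QPS value are hypothetical, and the flags are only those documented in this section.

```shell
# Hypothetical helper: print (not run) the valid-mode commands for both
# datacenter scenarios and both run modes, with --division=closed added.
print_run_matrix() {
    server_qps="$1"   # your own <SERVER_TARGET_QPS> value
    for scenario in Offline Server; do
        for mode in _performance-only _accuracy-only; do
            extra=""
            # Only the Server scenario takes a target QPS.
            if [ "$scenario" = "Server" ]; then
                extra=" --server_target_qps=$server_qps"
            fi
            printf '%s\n' "mlcr run-mlperf,inference,_full,_r5.1-dev,$mode --model=bert-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=$scenario$extra --execution_mode=valid --device=cpu --division=closed --quiet"
        done
    done
}

print_run_matrix 80   # 80 is a placeholder value
```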

Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=100 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

CUDA device

Minimum system requirements for running the benchmark:

Device Memory: To be updated
Disk Space: 50GB

Docker | Native

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500 --rerun

The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to do the accuracy and performance runs for each scenario.

More options for the docker launch:

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

It is advisable to use the commands in the Docker tab for CUDA. Run the native commands below only if your system already has CUDA, cuDNN, and TensorRT installed.
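One way to sanity-check these prerequisites is to look for the cuDNN and TensorRT shared libraries in the dynamic linker cache. This is only a sketch under assumptions: it relies on `ldconfig -p` being available (glibc-based Linux) and on the usual library names `libcudnn` and `libnvinfer`, and the helper name `has_lib` is hypothetical. Libraries installed outside the linker cache will not be detected.

```shell
# Hypothetical helper: read a library listing on stdin and succeed
# if any line contains the given library name.
has_lib() {
    grep -q "$1"
}

for lib in libcudnn libnvinfer; do
    if ldconfig -p 2>/dev/null | has_lib "$lib"; then
        echo "$lib: found"
    else
        echo "$lib: not found in linker cache"
    fi
done
```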

# Set up a virtual environment for Python

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=500 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

ROCm device

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Native

Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
   --test_query_count=100 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

DeepSparse framework

CPU

CPU device

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Docker | Native

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted with --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted with --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
The version tag selects the MLPerf Inference round: the commands here use _r5.1-dev (version 5.1); _r6.0-dev can be given instead to target version 6.0.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none --rerun

The above command should drop you into an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to do the accuracy and performance runs for each scenario.

More options for the docker launch:

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--docker_os=ubuntu: ubuntu and rhel are supported.
--docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL

Offline | Server | All Scenarios

Offline

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Server

performance-only | accuracy-only

performance-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Available generic model stubs for BERT with DeepSparse:

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none
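The stubs above appear to follow one pattern: the short name is `<model>-<variant>` and the stub is `zoo:nlp/question_answering/<model>/pytorch/huggingface/squad/<variant>`. Assuming that pattern (it is inferred from this list, not taken from DeepSparse documentation), a hypothetical helper `zoo_stub` can build the value for --nm_model_zoo_stub:

```shell
# Hypothetical helper: build a stub from a model and a variant,
# following the pattern seen in the list above.
zoo_stub() {
    printf 'zoo:nlp/question_answering/%s/pytorch/huggingface/squad/%s\n' "$1" "$2"
}

# Rebuilds the stub used in the commands on this page.
stub="$(zoo_stub mobilebert-none base_quant-none)"
echo "--nm_model_zoo_stub=$stub"
```

Only combinations that actually appear in the list should be passed to the benchmark; the helper does not check that a stub exists.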

Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none \
   --rerun

The above command performs a test run of the Offline scenario and records the estimated offline_target_qps.

Offline | Server | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Server¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems it can drop below 50%. If the value is set higher than the system can sustain, the latency constraint will not be met and the run will be considered invalid.
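
As a rough starting point, the two percentages mentioned in the tip can be turned into candidate values. This is only a sketch: the Offline QPS of 1250 is a made-up number and the variable names are illustrative, so substitute the QPS reported by your own Offline run.

```shell
# Hypothetical Offline result; replace with the QPS from your own Offline run.
offline_qps=1250

# Candidates at 80% and 50% of the Offline QPS (integer arithmetic).
first_try_qps=$(( offline_qps * 80 / 100 ))
fallback_qps=$(( offline_qps * 50 / 100 ))

echo "first attempt: --server_target_qps=${first_try_qps}"
echo "fallback:      --server_target_qps=${fallback_qps}"
```

If the run at the first value is reported as invalid, lower the target toward the fallback value and rerun.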

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems it can drop below 50%. If the value is set higher than the system can sustain, the latency constraint will not be met and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems it can drop below 50%. If the value is set higher than the system can sustain, the latency constraint will not be met and the run will be considered invalid.

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Please click here to view available generic model stubs for bert deepsparse

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none
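
Any stub from the list above can presumably be passed through --nm_model_zoo_stub in place of the default used in the commands. The sketch below keeps the stub in a variable (NM_STUB and CMD are illustrative names, not part of MLCFlow) and prints the assembled Offline performance-only command instead of running it. Whether a given stub meets the bert-99 accuracy target is not checked here.

```shell
# Stub copied from the list above; any other entry can be substituted.
NM_STUB="zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni"

# Assemble the Offline performance-only run with the chosen stub.
# The command is printed, not executed, so it can be reviewed first.
CMD="mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
--model=bert-99 \
--implementation=reference \
--framework=deepsparse \
--category=datacenter \
--scenario=Offline \
--execution_mode=valid \
--device=cpu \
--quiet \
--nm_model_zoo_stub=${NM_STUB}"

echo "$CMD"
```

Once the printed command looks right, it can be pasted into the shell or run with `eval "$CMD"`.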

Edge category¶

In the edge category, bert-99 has the Offline and SingleStream scenarios, and both are mandatory for a closed division submission.

Pytorch | Deepsparse

Pytorch framework¶

CPU | CUDA | ROCm

CPU device¶

Please click here to see the minimum system requirements for running the benchmark

Disk Space: 50GB

Docker | Native

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=100 --rerun

The above command should drop you into an interactive shell inside the Docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to perform the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--docker_os=ubuntu: ubuntu and rhel are supported.
--docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
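
The launch options above are appended to the same command that builds the container. A minimal sketch, assuming an Ubuntu 22.04 image built without the Docker cache is wanted; DOCKER_OPTS and CMD are illustrative shell variables, and the command is printed rather than executed.

```shell
# Optional launch flags taken from the list above; the chosen values are examples.
DOCKER_OPTS="--docker_os=ubuntu --docker_os_version=22.04 --docker_cache=no"

# Same estimation run as above, with the extra launch flags at the end.
CMD="mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
--model=bert-99 \
--implementation=reference \
--framework=pytorch \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cpu \
--docker --quiet \
--test_query_count=100 --rerun \
${DOCKER_OPTS}"

echo "$CMD"
```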

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge  \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
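
For example, a closed division submission adds --division=closed to the All Scenarios run. The sketch below also shows --rerun as an optional toggle; RUN_OPTS, FORCE_RERUN and CMD are illustrative shell variables, and the command is only printed so it can be checked before use.

```shell
# Extra RUN options from the list above.
RUN_OPTS="--division=closed"

# Set to "yes" to force a rerun even when a valid run already exists.
FORCE_RERUN=no
if [ "$FORCE_RERUN" = yes ]; then
    RUN_OPTS="${RUN_OPTS} --rerun"
fi

CMD="mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
--model=bert-99 \
--implementation=reference \
--framework=pytorch \
--category=edge \
--execution_mode=valid \
--device=cpu \
--quiet \
${RUN_OPTS}"

echo "$CMD"
```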

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python¶

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=100 --rerun

The above command performs a test run of the Offline scenario and records the estimated offline_target_qps.
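
The --threads and --batch_size options described in the tip can be added to the same estimation run. In this sketch the thread count defaults to the number of online CPUs and the batch size is 8. Both values are assumptions to tune for your system, and the options only take effect if the implementation supports them.

```shell
# Assumption: one thread per online CPU is a reasonable first guess.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
# Hypothetical batch size; the implementation must support it.
BATCH_SIZE=8

# Estimation run with the two tuning options appended (printed, not executed).
CMD="mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
--model=bert-99 \
--implementation=reference \
--framework=pytorch \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cpu \
--quiet \
--test_query_count=100 --rerun \
--threads=${THREADS} \
--batch_size=${BATCH_SIZE}"

echo "$CMD"
```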

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge  \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

CUDA device¶

Please click here to see the minimum system requirements for running the benchmark

Device Memory: To be updated
Disk Space: 50GB

Docker | Native

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500 --rerun

The above command should drop you into an interactive shell inside the Docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to perform the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge  \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

It is advisable to use the commands in the Docker tab for CUDA. Run the native commands below only if you are already on a CUDA setup with cuDNN and TensorRT installed.
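
A quick preflight check can help decide between the two tabs. The sketch below only tests whether nvidia-smi is on the PATH, which is used here as a proxy for a working NVIDIA driver setup. It does not verify cuDNN or TensorRT, and CUDA_TOOLING is an illustrative variable name.

```shell
# Assumption: nvidia-smi on PATH indicates a usable NVIDIA driver installation.
# cuDNN and TensorRT are NOT checked here and must be verified separately.
if command -v nvidia-smi >/dev/null 2>&1; then
    CUDA_TOOLING=found
else
    CUDA_TOOLING=missing
fi

echo "nvidia-smi: ${CUDA_TOOLING}"
if [ "$CUDA_TOOLING" = missing ]; then
    echo "Consider the Docker commands instead of the native ones."
fi
```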

# Set up a virtual environment for Python¶

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=500 --rerun

The above command performs a test run of the Offline scenario and records the estimated offline_target_qps.

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge  \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

ROCm device¶

Please click here to see the minimum system requirements for running the benchmark

Disk Space: 50GB

Native

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python¶

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
   --test_query_count=100 --rerun

The above command performs a test run of the Offline scenario and records the estimated offline_target_qps.

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=rocm \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=rocm \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge  \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
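
The four per-scenario ROCm runs above differ only in the scenario name and the mode tag, so they can be generated with a loop when running them one after another. This is a sketch: PLAN is an illustrative variable, and the commands are printed for review rather than executed.

```shell
# Build the list of runs: two scenarios x two modes = four commands.
PLAN=$(
  for scenario in Offline SingleStream; do
    for mode in performance-only accuracy-only; do
      echo "mlcr run-mlperf,inference,_full,_r5.1-dev,_${mode} --model=bert-99 --implementation=reference --framework=pytorch --category=edge --scenario=${scenario} --execution_mode=valid --device=rocm --quiet"
    done
  done
)

# One command per line, ready to copy or pipe into a shell.
echo "$PLAN"
```

Piping the output into `sh` would run the four commands in order; the _all-scenarios tag shown earlier remains the shorter option when every scenario is needed.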

Deepsparse framework¶

CPU

CPU device¶

Please click here to see the minimum system requirements for running the benchmark

Disk Space: 50GB

Docker | Native

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r6.0-dev can be given instead of _r5.1-dev if you want to run the benchmark with the MLPerf version being 6.0 rather than 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none \
   --rerun

The above command should drop you into an interactive shell inside the Docker container and do a quick test run for the Offline scenario. Once inside the container, run the commands below to perform the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--docker_os=ubuntu: ubuntu and rhel are supported.
--docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge  \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Please click here to view available generic model stubs for bert deepsparse

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none
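
To avoid retyping a long zoo path, the short names in the list above can be mapped to their stubs in a small shell function. The following is a minimal sketch rather than part of MLCFlow: the function name stub_for is hypothetical, and only three of the stubs listed above are included for brevity.

```bash
# Hypothetical helper: map a short model name from the list above to its zoo stub.
stub_for() {
  case "$1" in
    mobilebert-none-base_quant-none)
      echo "zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none" ;;
    obert-large-pruned95_quant-none-vnni)
      echo "zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni" ;;
    bert-large-base-none)
      echo "zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none" ;;
    *)
      # Names not covered by this sketch are reported on stderr.
      echo "unknown stub name: $1" >&2
      return 1 ;;
  esac
}

# Build the flag for one of the listed stubs and print it for review.
STUB="$(stub_for bert-large-base-none)"
echo "--nm_model_zoo_stub=${STUB}"
```

The printed flag can then replace the --nm_model_zoo_stub value in any of the deepsparse commands.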

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Set up a virtual environment for Python¶

mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

# Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.
_r5.1-dev could also be given instead of _r6.0-dev if you want to run the benchmark with the MLPerf version being 5.1.
Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.
Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.
Add --adr.inference-src.version=custom if you are using modified MLPerf Inference code or a modified accuracy script in the submission checker within a custom fork.
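
As a minimal sketch of how the optional flags from the tip above might be combined, the snippet below gathers them in one shell variable before they are appended to an mlcr command. The variable names THREADS, BATCH_SIZE and EXTRA_FLAGS and the values 8 and 32 are illustrative assumptions, not defaults, and the snippet only prints the flags without starting a run.

```bash
# Illustrative values only (assumptions); choose ones your implementation supports.
THREADS=8
BATCH_SIZE=32

# Collect the optional flags named in the tip above in one variable.
EXTRA_FLAGS="--compliance=yes --threads=${THREADS} --batch_size=${BATCH_SIZE}"

# Print the flags for review; this snippet does not start a benchmark run.
echo "${EXTRA_FLAGS}"
```

Appending $EXTRA_FLAGS unquoted to an mlcr command, so that the shell splits it into separate arguments, would pass all three options at once.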

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

SingleStream¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
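
The Offline and SingleStream commands above differ only in the mode tag and the --scenario value, so they can be generated in a loop. The sketch below prints the four commands instead of running them; the helper name print_runs and the STUB variable are hypothetical, and every tag and flag is copied from the commands above.

```bash
# Stub used by the commands in this section.
STUB="zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none"

# Hypothetical helper: print (not run) one command per scenario and mode.
print_runs() {
  for scenario in Offline SingleStream; do
    for mode in _performance-only _accuracy-only; do
      echo "mlcr run-mlperf,inference,_full,_r5.1-dev,${mode}" \
        "--model=bert-99 --implementation=reference --framework=deepsparse" \
        "--category=edge --scenario=${scenario} --execution_mode=valid" \
        "--device=cpu --quiet --nm_model_zoo_stub=${STUB}"
    done
  done
}

print_runs
```

Printing first makes it easy to review each command before pasting it into a shell.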

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

Available generic model stubs for bert deepsparse:

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none

BERT-99.9

datacenter

Datacenter category¶

In the datacenter category, bert-99.9 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

Pytorch framework¶

CPU device¶

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
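
As a rough illustration of the rule of thumb in the tip above, the sketch below derives two candidate values for <SERVER_TARGET_QPS> from an Offline QPS. The helper name scale_qps and the example value of 120 QPS are assumptions made for illustration; the value that actually meets the latency constraint still has to be found by running the benchmark.

```bash
# Hypothetical helper: scale a measured Offline QPS by a percentage.
#   $1 = Offline QPS, $2 = percentage (for example 80)
scale_qps() {
  awk -v qps="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", qps * pct / 100 }'
}

# Assumed example value; substitute the QPS reported by your own Offline run.
OFFLINE_QPS=120

scale_qps "$OFFLINE_QPS" 80   # starting guess, prints 96.00
scale_qps "$OFFLINE_QPS" 50   # more conservative value, prints 60.00
```

Starting near the 80% figure and lowering it when a run is reported as invalid keeps the number of trial runs small.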

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

CUDA device¶

Minimum system requirements for running the benchmark:

Device Memory: To be updated
Disk Space: 50GB

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet \
   --test_query_count=500 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

It is advisable to use the commands in the Docker tab for CUDA. Run the below native command only if you are already on a CUDA setup with cuDNN and TensorRT installed.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet \
   --test_query_count=500 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

ROCm device¶

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Native

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet \
   --test_query_count=100 --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=rocm \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

Deepsparse framework¶

CPU

CPU device¶

Minimum system requirements for running the benchmark:

Disk Space: 50GB

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

More options for the RUN command:

Use --division=closed to make a closed division submission, which includes compliance runs.
Use --rerun to force a rerun even when a valid run exists.
Use --compliance to do the compliance runs (only applicable to the closed division) once the valid runs are successful.

Available generic model stubs for bert deepsparse:

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none

Native Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none --rerun

The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Server¶

performance-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.1-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=reference \
   --framework=deepsparse \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Please click here to view available generic model stubs for bert deepsparse

obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none
mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni
obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none
obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none
bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none
bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni
obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni
obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none
bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none
obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none
mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none
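To try a different stub from the list above, only the value of --nm_model_zoo_stub should need to change. The sketch below prints (rather than runs) an Offline performance-only command for a chosen stub. `print_offline_cmd` is a hypothetical helper introduced here for illustration, and the remaining flags are copied from the commands in this section.

```shell
# Hypothetical dry-run helper: print the Offline performance-only command
# for a given stub instead of executing it.
print_offline_cmd() {
  # $1 = full "zoo:" stub path taken from the list above
  cat <<EOF
mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \\
   --model=bert-99.9 \\
   --implementation=reference \\
   --framework=deepsparse \\
   --category=datacenter \\
   --scenario=Offline \\
   --execution_mode=valid \\
   --device=cpu \\
   --quiet \\
   --nm_model_zoo_stub=$1
EOF
}

print_offline_cmd "zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni"
```

Printing the command first makes it easy to check the stub path before starting a long benchmark run.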

If you want to download the official MLPerf model and dataset for bert-99.9 you can follow this README.

Nvidia MLPerf Implementation¶

BERT-99

datacenter | edge

Datacenter category¶

In the datacenter category, bert-99 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

TensorRT

TensorRT framework¶

CUDA

CUDA device¶

Please click here to see the minimum system requirements for running the benchmark

Device Memory: To be updated

Docker

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Docker Container Build and Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

Tip

The default batch size is assigned based on GPU memory or the specified GPU. See the additional options for the docker launch or RUN command to learn how to specify the GPU name.
When run with --all_models=yes, all the benchmark models of the NVIDIA implementation can be executed within the same container.
If you encounter an error related to ulimit or max locked memory during the run_harness step, please refer to this issue for details and resolution steps.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=500 --rerun

The above command should open an interactive shell inside the Docker container and perform a quick test run for the Offline scenario. Once inside the container, run the commands below to perform the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, the default configuration based on GPU memory is used.
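The --gpu_name rule above can be expressed as a small check: pass the flag for GPUs that have a supported config, and omit it otherwise so that the default memory-based configuration applies. `gpu_name_flag` is a hypothetical helper, not an MLC command, and the list of names is copied from the option description above.

```shell
# Hypothetical helper: emit --gpu_name=<name> only for GPUs listed above.
gpu_name_flag() {
  case "$1" in
    orin|rtx_4090|rtx_a6000|rtx_6000_ada|l4|t4|a100)
      printf '%s\n' "--gpu_name=$1" ;;
    *)
      # Unlisted GPU: print nothing so the default configuration is used.
      : ;;
  esac
}

gpu_name_flag l4      # prints --gpu_name=l4
gpu_name_flag h100    # prints nothing (h100 is just an example of an unlisted name)
```

The printed flag can then be appended to the docker launch or RUN command.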

Offline | Server | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.
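Because the value has to be found by trial, it can help to list a few candidates before starting. The sketch below steps from 80% down to 50% of a measured Offline QPS, matching the range mentioned in the tip. `server_qps_candidates` is a hypothetical helper and 2500 is a placeholder, not a measured result.

```shell
# Hypothetical helper: list candidate Server target QPS values from 80%
# down to 50% of the Offline QPS, in 10% steps.
server_qps_candidates() {
  # $1 = measured Offline QPS
  for pct in 80 70 60 50; do
    awk -v qps="$1" -v pct="$pct" \
      'BEGIN { printf "%d%% -> %.1f\n", pct, qps * pct / 100 }'
  done
}

# Placeholder Offline QPS; substitute the value from your own run.
server_qps_candidates 2500
```

One possible approach is to start with the highest candidate and move down the list until a Server run meets the latency constraint.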

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
--gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, the default configuration based on GPU memory is used.
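The options above are appended to the RUN command like any other flag. As an illustration, the sketch below prints (rather than runs) the All Scenarios command from this section with extra options added. `print_run_cmd` is a hypothetical dry-run helper, and <SERVER_TARGET_QPS> is left as a placeholder.

```shell
# Hypothetical dry-run helper: print the datacenter All Scenarios command
# with any extra RUN options appended, without executing it.
print_run_cmd() {
  base='mlcr run-mlperf,inference,_full,_r5.0-dev,_all-scenarios'
  base="$base --model=bert-99 --implementation=nvidia --framework=tensorrt"
  base="$base --category=datacenter --server_target_qps=<SERVER_TARGET_QPS>"
  base="$base --execution_mode=valid --device=cuda --quiet"
  printf '%s' "$base"
  # Append each extra option passed to the helper.
  for opt in "$@"; do
    printf ' %s' "$opt"
  done
  printf '\n'
}

# Example: a closed division submission that reruns existing valid runs.
print_run_cmd --division=closed --rerun
```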

Edge category¶

In the edge category, bert-99 has the Offline and SingleStream scenarios, and both are mandatory for a closed division submission.

TensorRT

TensorRT framework¶

CUDA

CUDA device¶

Please click here to see the minimum system requirements for running the benchmark

Device Memory: To be updated

Docker

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Docker Container Build and Performance Estimation for Offline Scenario¶

Tip

Compliance runs can be enabled by adding --compliance=yes.
The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.
The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

Tip

The default batch size is assigned based on GPU memory or the specified GPU. See the additional options for the docker launch or RUN command to learn how to specify the GPU name.
When run with --all_models=yes, all the benchmark models of the NVIDIA implementation can be executed within the same container.
If you encounter an error related to ulimit or max locked memory during the run_harness step, please refer to this issue for details and resolution steps.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=500 --rerun

The above command should open an interactive shell inside the Docker container and perform a quick test run for the Offline scenario. Once inside the container, run the commands below to perform the accuracy and performance runs for each scenario.

Please click here to see more options for the docker launch

--docker_privileged: to launch the container in privileged mode
--docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of mlperf-automations repository inside the docker image
--docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to checkout a custom branch of the cloned mlperf-automations repository inside the docker image
--docker_cache=no: to not use docker cache during the image build
--gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, the default configuration based on GPU memory is used.

Offline | SingleStream | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

SingleStream¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=SingleStream \
   --execution_mode=valid \
   --device=cuda \
   --quiet

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_all-scenarios \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
--gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, the default configuration based on GPU memory is used.

BERT-99.9

datacenter

Datacenter category¶

In the datacenter category, bert-99.9 has the Offline and Server scenarios, and both are mandatory for a closed division submission.

TensorRT

TensorRT framework¶

CUDA

CUDA device¶

Please click here to see the minimum system requirements for running the benchmark

Device Memory: To be updated

Docker

Docker Environment¶

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

You can reuse the same environment as described for bert-99.

Performance Estimation for Offline Scenario¶

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet \
   --test_query_count=500 --rerun

The above command should perform a test run of the Offline scenario and record the estimated offline_target_qps.

Offline | Server | All Scenarios

Offline¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Server¶

performance-only | accuracy-only

performance-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_performance-only \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

accuracy-only¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_accuracy-only \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Server \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

All Scenarios¶

mlcr run-mlperf,inference,_full,_r5.0-dev,_all-scenarios \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --server_target_qps=<SERVER_TARGET_QPS> \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Tip

<SERVER_TARGET_QPS> must be determined manually. It is usually around 80% of the Offline QPS, but on some systems, it can drop below 50%. If a higher value is specified, the latency constraint will not be met, and the run will be considered invalid.

Please click here to see more options for the RUN command

Use --division=closed to do a closed division submission which includes compliance runs
Use --rerun to do a rerun even when a valid run exists
Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
--gpu_name=<Name of the GPU>: The GPUs with supported configs in MLC are orin, rtx_4090, rtx_a6000, rtx_6000_ada, l4, t4, and a100. For other GPUs, the default configuration based on GPU memory is used.
