
Question Answering using Bert Large for IndySCC 2024

Introduction

This guide walks IndySCC 2024 participants through running and optimizing the MLPerf Inference benchmark with Bert Large across various software and hardware configurations. The goal is to maximize system throughput (measured in samples per second) without compromising accuracy.

For a valid MLPerf Inference submission, two types of runs are required: a performance run and an accuracy run. In this competition, we focus on the Offline scenario, where throughput is the key metric (higher values are better). The official MLPerf Inference benchmark for Bert Large requires processing a minimum of 10833 samples in both performance and accuracy modes using the SQuAD v1.1 dataset.

Scoring

In the IndySCC 2024, your objective is to run a reference (unoptimized) Python implementation of the MLPerf Inference benchmark and complete a successful submission that passes the submission checker. Only one of the available frameworks needs to be submitted.

Info

Both MLPerf and MLC automation are evolving projects. If you encounter issues or have questions, please submit them here

Artifacts to submit to the SCC committee

All required files are automatically pushed to the GitHub repository once you complete the given commands. No additional files need to be submitted.

MLPerf Reference Implementation in Python

Tip

  • MLCommons reference implementations are only meant to provide a rules-compliant reference for submitters and in most cases are not the best performing. If you want to benchmark a particular system, it is advisable to use the vendor MLPerf implementation for that system (Nvidia, Intel, etc.).

BERT-99

Edge category

In the edge category, bert-99 has the Offline scenario, which is mandatory for a closed division submission.

Pytorch framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 50GB
Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=100
The above command should drop you into an interactive shell inside the docker container and perform a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.
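
If the reference implementation supports them, the thread count and batch size flags from the tip above can be appended to any of the run commands below. For example (the values 16 and 8 are purely illustrative, not a recommendation):

mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --threads=16 \
   --batch_size=8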

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to check out a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

  • --docker_os=ubuntu: ubuntu and rhel are supported.
  • --docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
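
As an illustration only, the docker launch above can be combined with some of these options, for example:

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=100 \
   --docker_cache=no \
   --docker_os=ubuntu \
   --docker_os_version=22.04
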
Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet 

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
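
For instance, a closed-division valid run with the compliance tests enabled (only relevant if you target the closed division) might look like:

mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --division=closed \
   --compliance=yes
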
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --quiet \
   --test_query_count=100
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.
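
If you prefer to set the target QPS for the subsequent valid run yourself instead of relying on the recorded estimate, the automation also accepts an explicit value via --offline_target_qps; the value 10 below is only an example, and you should confirm the option is available in your MLCFlow version:

mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --offline_target_qps=10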

Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet 

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
CUDA device

Please click here to see the minimum system requirements for running the benchmark

  • Device Memory: 8GB

  • Disk Space: 50GB
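
Before launching the container, you can confirm that the GPU is visible and meets the 8GB device-memory requirement; nvidia-smi ships with the NVIDIA driver:

nvidia-smi --query-gpu=name,memory.total --format=csv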

Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500
The above command should drop you into an interactive shell inside the docker container and perform a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to check out a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet 

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

Tip

  • It is advisable to use the commands in the Docker tab for CUDA. Run the native commands below only if you already have a CUDA setup with cuDNN and TensorRT installed.
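
A quick, informal way to check these prerequisites is to look for the CUDA compiler and the cuDNN/TensorRT shared libraries; exact library names vary by version and distribution, so treat this only as a sanity check:

nvcc --version
ldconfig -p | grep -i libcudnn
ldconfig -p | grep -i libnvinfer
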
# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=500
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet 

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful
ROCm device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 50GB
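
Before running natively, you can verify that the AMD GPU is visible to the ROCm stack; rocm-smi and rocminfo are installed as part of ROCm, and this is only a quick sanity check:

rocm-smi
rocminfo | grep -i gfx
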
Native Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Setup a virtual environment for Python
mlcr install,python-venv --name=mlperf
export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
# Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
   --test_query_count=100
The above command should do a test run of the Offline scenario and record the estimated offline_target_qps.

Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=rocm \
   --quiet 

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

Deepsparse framework

CPU device

Please click here to see the minimum system requirements for running the benchmark

  • Disk Space: 50GB
Docker Environment

Please refer to the installation page to install MLCFlow for running the automated benchmark commands.

# Docker Container Build and Performance Estimation for Offline Scenario

Tip

  • Compliance runs can be enabled by adding --compliance=yes.

  • The number of threads can be adjusted using --threads=#, where # is the desired number of threads. This option works only if the implementation in use supports threading.

  • The batch size can be adjusted using --batch_size=#, where # is the desired batch size. This option works only if the implementation in use supports the given batch size.

  • _r4.1-dev can be given instead of _r5.0-dev if you want to run the benchmark against MLPerf version 4.1.

  • Add --adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the official MLPerf Inference implementation in a custom fork.

  • Add --adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK> if you are modifying the model config accuracy script in the submission checker within a custom fork.

  • Add --adr.inference-src.version=custom if you are using the modified MLPerf Inference code or accuracy script of the submission checker within a custom fork.

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu  \
   --docker --quiet \
   --test_query_count=100 \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
The above command should drop you into an interactive shell inside the docker container and perform a quick test run for the Offline scenario. Once inside the docker container, run the commands below to perform the accuracy and performance runs for the Offline scenario.

Please click here to see more options for the docker launch

  • --docker_mlc_repo=<Custom MLC GitHub repo URL in username@repo format>: to use a custom fork of cm4mlops repository inside the docker image

  • --docker_mlc_repo_branch=<Custom MLC GitHub repo Branch>: to check out a custom branch of the cloned cm4mlops repository inside the docker image

  • --docker_cache=no: to not use docker cache during the image build

  • --docker_os=ubuntu: ubuntu and rhel are supported.
  • --docker_os_version=20.04: [20.04, 22.04] are supported for Ubuntu and [8, 9] for RHEL
Offline
mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

Please click here to see more options for the RUN command

  • Use --division=closed to do a closed division submission which includes compliance runs

  • Use --rerun to do a rerun even when a valid run exists

  • Use --compliance to do the compliance runs (only applicable for closed division) once the valid runs are successful

You can use any Bert question-answering model from the Neural Magic SparseZoo (trained on the SQuAD dataset) as --nm_model_zoo_stub; see the example after the stub list below.

Please click here to view available generic model stubs for bert deepsparse

  • obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni

  • mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni

  • mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

  • bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none

  • mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni

  • obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none

  • obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none

  • bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none

  • bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni

  • obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni

  • obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none

  • bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none

  • obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none

  • mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none
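
For example, to try one of the larger pruned and quantized models from the list above instead of the default MobileBERT stub, the same valid-run command can simply be pointed at a different stub (the stub chosen here is just an illustration):

mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=bert-99 \
   --implementation=reference \
   --framework=deepsparse \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cpu \
   --quiet \
   --nm_model_zoo_stub=zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni
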
            === "Native"
                ###### Native Environment

                Please refer to the [installation page](site:inference/install/) to install MLCFlow for running the automated benchmark commands.

                ####### Setup a virtual environment for Python


                ```bash
                mlcr install,python-venv --name=mlperf
                export MLC_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"
                ```
                ####### Performance Estimation for Offline Scenario

                !!! tip

                    - Compliance runs can be enabled by adding `--compliance=yes`.

                    - Number of threads could be adjusted using `--threads=#`, where `#` is the desired number of threads. This option works only if the implementation in use supports threading.

                    - Batch size could be adjusted using `--batch_size=#`, where `#` is the desired batch size. This option works only if the implementation in use is supporting the given batch size.

                    - `_r4.1-dev` could also be given instead of `_r5.0-dev` if you want to run the benchmark with the MLPerf version being 4.1.

                    - Add `--adr.mlperf-implementation.tags=_branch.master,_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the official MLPerf Inference implementation in a custom fork.

                    - Add `--adr.inference-src.tags=_repo.<CUSTOM_INFERENCE_REPO_LINK>` if you are modifying the model config accuracy script in the submission checker within a custom fork.

                    - Add `--adr.inference-src.version=custom` if you are using the modified MLPerf Inference code or accuracy script on submission checker within a custom fork.



                ```bash
                mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
                   --model=bert-99 \
                   --implementation=reference \
                   --framework=deepsparse \
                   --category=edge \
                   --scenario=Offline \
                   --execution_mode=test \
                   --device=cpu  \
                   --quiet \
                   --test_query_count=100\
                   --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
                ```
                The above command should do a test run of Offline scenario and record the estimated offline_target_qps.

                === "Offline"
                    ###### Offline



                    ```bash
                    mlcr run-mlperf,inference,_full,_r5.0-dev \
                       --model=bert-99 \
                       --implementation=reference \
                       --framework=deepsparse \
                       --category=edge \
                       --scenario=Offline \
                       --execution_mode=valid \
                       --device=cpu \
                       --quiet \
                       --nm_model_zoo_stub=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none
                    ```
                <details>
                <summary> Please click here to see more options for the RUN command</summary>

                * Use `--division=closed` to do a closed division submission which includes compliance runs

                * Use `--rerun` to do a rerun even when a valid run exists
                * Use `--compliance` to do the compliance runs (only applicable for closed division) once the valid runs are successful
                </details>

You can use any Bert question-answering model from the Neural Magic SparseZoo (trained on the SQuAD dataset) as --nm_model_zoo_stub

Please click here to view available generic model stubs for bert deepsparse

  • obert-large-pruned95_quant-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni

  • mobilebert-none-14layer_pruned50_quant-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni

  • mobilebert-none-base_quant-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none

  • bert-base-pruned95_obs_quant-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none

  • mobilebert-none-14layer_pruned50-none-vnni: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni

  • obert-base-pruned90-none: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none

  • obert-large-pruned97_quant-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none

  • bert-base-pruned90-none: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none

  • bert-large-pruned80_quant-none-vnni: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni

  • obert-large-pruned95-none-vnni: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni

  • obert-large-pruned97-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none

  • bert-large-base-none: zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none

  • obert-large-base-none: zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none

  • mobilebert-none-base-none: zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none

Submission Commands

Generate actual submission tree

mlcr generate,inference,submission \
   --clean \
   --run-checker \
   --tar=yes \
   --env.CM_TAR_OUTFILE=submission.tar.gz \
   --division=open \
   --category=edge \
   --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \
   --run_style=test \
   --quiet \
   --submitter=<Team Name>
  • Use --hw_name="My system name" to give a meaningful system name.
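
To sanity-check the generated archive before pushing, you can list its contents; this assumes the tarball named in --env.CM_TAR_OUTFILE is written to your current working directory, which may differ depending on your MLC setup:

tar -tzf submission.tar.gz | head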

Push Results to GitHub

Fork the repository at https://github.com/mlcommons/cm4mlperf-inference; results are pushed to its mlperf-inference-results-scc24 branch.

Run the following command after replacing --repo_url with your GitHub fork URL.

mlcr push,github,mlperf,inference,submission \
   --repo_url=https://github.com/<myfork>/cm4mlperf-inference \
   --repo_branch=mlperf-inference-results-scc24 \
   --commit_message="Results on system <HW Name>" \
   --quiet

Once uploaded, open a pull request against the upstream repository. A GitHub Action will run there, and once it finishes you can see your submitted results at https://docs.mlcommons.org/cm4mlperf-inference.