FAQ
This page contains answers to frequently asked questions about GaNDLF.
Where do I start?
The usage guide provides a good starting point for understanding how to use GaNDLF. If you have any questions, please feel free to post a support request, and we will do our best to address it ASAP.
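Why do I get the error importlib.metadata.PackageNotFoundError: GANDLF?
This means that GaNDLF was not installed correctly, or that it is not installed in the Python environment that is currently active. Please ensure that you have followed the installation guide properly and that you have activated the virtual environment into which GaNDLF was installed.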
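A quick way to confirm whether the active Python environment can see GaNDLF is to query its distribution metadata directly, which is the same kind of lookup that raises this error. The minimal sketch below uses only the standard library's importlib.metadata; the distribution name GANDLF is taken from the error message, and the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "GANDLF") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # The lookup only searches the *current* interpreter's environment,
        # so run this with the same Python you use for GaNDLF.
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("GaNDLF is not visible from this Python environment.")
    else:
        print(f"GaNDLF {found} is installed.")
```

If this reports that GaNDLF is not visible even though the intended virtual environment is active, the installation most likely did not complete.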
Why is GaNDLF not working?
Verify that the installation was done correctly by running gandlf verify-install after activating the correct virtual environment. If you are still having issues, please feel free to post a support request, and we will do our best to address it ASAP.
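If gandlf verify-install cannot be found at all, the wrong environment is usually active. This minimal standard-library sketch prints which interpreter is running, whether it is a venv/virtualenv, and where (if anywhere) the gandlf command resolves on PATH, and it calls gandlf verify-install only if the command was found. The helper environment_report is hypothetical; note that activated conda environments report is_venv as False, because the check only compares sys.prefix with sys.base_prefix.

```python
import shutil
import subprocess
import sys


def environment_report(cli_name: str = "gandlf") -> dict:
    """Summarize which interpreter is active and where a CLI resolves on PATH."""
    return {
        "python": sys.executable,  # interpreter running this script
        "prefix": sys.prefix,  # root directory of the active environment
        # True for venv/virtualenv; conda environments report False here.
        "is_venv": sys.prefix != sys.base_prefix,
        "cli_path": shutil.which(cli_name),  # None if not found on PATH
    }


if __name__ == "__main__":
    report = environment_report()
    for key, value in report.items():
        print(f"{key}: {value}")
    if report["cli_path"] is None:
        print("'gandlf' is not on PATH; activate the environment GaNDLF was installed into.")
    else:
        # Hand the actual checks over to GaNDLF's own verifier.
        subprocess.run([report["cli_path"], "verify-install"], check=False)
```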
Which parts of a GaNDLF configuration are customizable?
Virtually all of it! For more details, please see the usage guide and our extensive samples. All available options are documented in the config_all_options.yaml file.
Can I run GaNDLF on a high performance computing (HPC) cluster?
Yes, GaNDLF has successfully been run on an SGE cluster and on another cluster managed using Kubernetes. If you need help with your setup, please post a question with details such as the type of scheduler, and we will do our best to address it.
How can I track the per-epoch training performance?
Look for the logs_*.csv files in the output directory. They are arranged in accordance with the cross-validation configuration, with separate files for each data cohort (i.e., training, validation, and testing). Each file contains the values for all requested performance metrics, which are defined per problem type.
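To inspect these files programmatically, a minimal standard-library sketch can collect every logs_*.csv below the output directory. It assumes only the file-name pattern mentioned above; the directory layout and column names depend on your cross-validation settings and requested metrics, so the hypothetical helper load_epoch_logs simply returns each file's rows as dictionaries without interpreting them. The script name read_logs.py in the usage message is also hypothetical.

```python
import csv
import sys
from pathlib import Path
from typing import Dict, List


def load_epoch_logs(output_dir: str) -> Dict[str, List[dict]]:
    """Read every logs_*.csv below output_dir, keyed by its path relative to output_dir."""
    root = Path(output_dir)
    logs: Dict[str, List[dict]] = {}
    # rglob walks all subdirectories, e.g. one per cross-validation fold.
    for csv_path in sorted(root.rglob("logs_*.csv")):
        with csv_path.open(newline="") as handle:
            # Column names are whatever GaNDLF wrote; nothing is assumed here.
            logs[str(csv_path.relative_to(root))] = list(csv.DictReader(handle))
    return logs


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python read_logs.py <gandlf_output_dir>")
    else:
        for name, rows in load_epoch_logs(sys.argv[1]).items():
            columns = list(rows[0]) if rows else []
            print(f"{name}: {len(rows)} rows, columns: {columns}")
```

Returning plain dictionaries keeps the sketch dependency-free; the same rows can be handed to a dataframe library or a plotting tool if one is available.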
Why are my compute jobs failing with excess RAM usage?
If you have data_preprocessing enabled, GaNDLF loads all of the resized images into memory as tensors. Depending on your dataset (resolution, size, number of modalities), this can lead to high RAM usage. To avoid this, enable the memory saver mode by enabling the memory_save_mode flag in the configuration; the resized images will then be written to disk instead.
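Since the in-memory footprint grows with resolution, number of modalities, and number of subjects, a rough estimate can tell you in advance whether memory_save_mode is worth enabling. The sketch below is a hypothetical helper (not part of GaNDLF) that assumes every resized image is held as a dense float32 tensor (4 bytes per voxel) and ignores all other overhead, so it gives a lower bound rather than an exact figure.

```python
from math import prod
from typing import Sequence


def estimate_preprocessed_bytes(
    shape: Sequence[int],
    n_modalities: int,
    n_subjects: int,
    bytes_per_voxel: int = 4,  # assumes float32 tensors
) -> int:
    """Back-of-envelope RAM needed to hold every resized image in memory at once.

    Ignores Python/PyTorch overhead, label maps, and augmentation buffers,
    so real usage will be higher.
    """
    return prod(shape) * n_modalities * n_subjects * bytes_per_voxel


if __name__ == "__main__":
    # Hypothetical dataset: 100 subjects, 4 modalities, 240 x 240 x 155 voxels each.
    total = estimate_preprocessed_bytes((240, 240, 155), n_modalities=4, n_subjects=100)
    print(f"{total:,} bytes = {total / 2**30:.1f} GiB")
```

For the hypothetical dataset above, this comes to 14,284,800,000 bytes (about 13.3 GiB) before any overhead, which is already beyond the memory limit of many compute jobs.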
How can I resume training from a previous checkpoint?
GaNDLF allows you to resume training from a previous checkpoint in two ways:
- With the --resume CLI parameter in gandlf run, only the model weights and state dictionary are preserved, while the parameters and data are taken from the new options passed in the CLI. This is helpful when you have updated the training data or changed some compatible options in the parameters.
- If both --resume and --reset are skipped in gandlf run, the model weights, state dictionary, and all previously saved information (parameters, training/validation/testing data) are used to resume training.
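The two resume modes above can be summarized as a small lookup. This hypothetical helper only restates the behavior described in this answer; it is not part of GaNDLF, does not invoke gandlf run, and deliberately refuses the --reset case because this FAQ does not describe it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResumePlan:
    """Where each piece of training state comes from when restarting gandlf run."""

    weights_and_state: str
    parameters: str
    data: str


def resume_plan(resume: bool, reset: bool = False) -> ResumePlan:
    if reset:
        # Not one of the two resume modes described in this FAQ.
        raise ValueError("--reset is outside the scope of this summary")
    if resume:
        # --resume: keep weights/state dictionary; parameters and data come from the new CLI options.
        return ResumePlan("checkpoint", "new CLI options", "new CLI options")
    # Neither --resume nor --reset: reuse everything that was saved previously.
    return ResumePlan("checkpoint", "checkpoint", "checkpoint")


if __name__ == "__main__":
    for flag in (True, False):
        print(f"--resume={flag}:", resume_plan(resume=flag))
```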
How can I update GaNDLF?
- If you installed from pip, simply run pip install --upgrade gandlf to get the latest version of GaNDLF; if you are interested in the nightly builds, run pip install --upgrade --pre gandlf instead.
- If you installed from source, run git pull from the base GaNDLF directory to get the latest master of GaNDLF. Follow this up with pip install -e . (including the final period, which refers to the current directory) after activating the appropriate virtual environment to ensure that the updates are picked up.
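When upgrading from a script (for example, on a machine with several environments), invoking pip as python -m pip through the running interpreter ensures that the upgrade lands in that interpreter's environment rather than in whichever pip happens to be first on PATH. The hypothetical helper below only builds the pip commands quoted above; the script's own --pre and --run switches are also hypothetical, and by default it only prints the command.

```python
import subprocess
import sys
from typing import List


def pip_upgrade_command(pre: bool = False) -> List[str]:
    """Build the pip command that upgrades GaNDLF for the interpreter running this script."""
    command = [sys.executable, "-m", "pip", "install", "--upgrade"]
    if pre:
        command.append("--pre")  # opt in to pre-release (nightly) builds
    command.append("gandlf")
    return command


if __name__ == "__main__":
    args = sys.argv[1:]
    command = pip_upgrade_command(pre="--pre" in args)
    print("Command:", " ".join(command))
    if "--run" in args:
        subprocess.run(command, check=True)  # actually performs the upgrade
    else:
        print("Dry run only; pass --run to execute it.")
```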
How can I perform federated learning of my GaNDLF model?
Please see https://mlcommons.github.io/GaNDLF/usage/#federating-your-model-using-openfl.
How can I perform federated evaluation of my GaNDLF model?
Please see https://mlcommons.github.io/GaNDLF/usage/#federating-your-model-evaluation-using-medperf.
I was using GaNDLF version 0.0.19 or earlier, and I am facing issues after updating to 0.0.20 or later. What should I do?
Please read the migration guide to understand the changes that have been made to GaNDLF. If you have any questions, please feel free to post a support request.
I am getting an error related to a version mismatch (greater or smaller) between the configuration and the GaNDLF version. What should I do?
This is a safety feature to ensure tight integration between the configuration used to define a model and the code version used to perform the training. Ensure that all requirements are satisfied, and then check the version key in the configuration to make sure that it appropriately matches the output of gandlf run --version.
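To check the compatibility by hand, the comparison can be sketched with the standard library. This assumes that the configuration's version key holds minimum and maximum entries (check your own configuration to confirm); the helper names are hypothetical, and pre-release suffixes such as -dev are simply dropped, so treat the result as an approximation.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version string, e.g. '0.1.0-dev' -> (0, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    parts = [int(piece) for piece in match.group(0).split(".")]
    parts.extend([0] * (3 - len(parts)))  # pad '1.2' to (1, 2, 0)
    return tuple(parts)


def config_supports(installed: str, minimum: str, maximum: str) -> bool:
    """True if minimum <= installed <= maximum, comparing numeric components only."""
    return parse_version(minimum) <= parse_version(installed) <= parse_version(maximum)


if __name__ == "__main__":
    # Hypothetical values: use your configuration's version entries and the
    # version reported by your GaNDLF installation.
    print(config_supports("0.0.20", minimum="0.0.19", maximum="0.0.20"))
```

For comparisons that follow PEP 440 and handle pre-releases correctly, the Version class from the packaging library is a more robust choice.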
What if I have another question?
Please post a support request.