Multi-GPU Training
Multi-node training
For multi-GPU training, use the --distributed flag.
This uses PyTorch's DistributedDataParallel module to train the model on multiple GPUs.
For large datasets, combine it with on-the-fly data loading from preprocessed HDF5 files (see the data preprocessing section).
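The flag follows the standard PyTorch DDP recipe. The snippet below is a minimal, generic sketch of that pattern, not MACE's actual training loop: each process initialises a process group, pins one GPU, wraps its model in DistributedDataParallel, and shards the data with a DistributedSampler.

# Generic DDP sketch (illustrative only, not MACE code); launch with torchrun
# or srun so that RANK/LOCAL_RANK/WORLD_SIZE are set for every process.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)   # stand-in for a real model
    ddp_model = DDP(model, device_ids=[local_rank])   # gradients synced across GPUs

    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)             # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=20, sampler=sampler)

    optimizer = torch.optim.Adam(ddp_model.parameters())
    for epoch in range(2):
        sampler.set_epoch(epoch)                      # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(ddp_model(x), y)
            optimizer.zero_grad()
            loss.backward()                           # all-reduce of gradients happens here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()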
Here is an example Slurm batch script to train a model on 20 GPUs spread over 2 nodes:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=train
#SBATCH --output=train.out
#SBATCH --nodes=2
#SBATCH --ntasks=20
#SBATCH --ntasks-per-node=10
#SBATCH --gpus-per-node=10
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
#SBATCH --time=1:00:00
srun python <mace_repo_dir>/mace/cli/run_train.py \
--name='model' \
--model='MACE' \
--num_interactions=2 \
--num_channels=128 \
--max_L=2 \
--correlation=3 \
--E0s='average' \
--r_max=5.0 \
--train_file='./h5_data/train.h5' \
--valid_file='./h5_data/valid.h5' \
--statistics_file='./h5_data/statistics.json' \
--num_workers=8 \
--batch_size=20 \
--valid_batch_size=80 \
--max_num_epochs=100 \
--loss='weighted' \
--error_table='PerAtomRMSE' \
--default_dtype='float32' \
--device='cuda' \
--distributed \
--seed=2222
This script will train the model on 20 GPUs (--ntasks=20) across 2 nodes (--nodes=2), with 10 GPUs per node (--ntasks-per-node=10).
For Slurm users, the necessary environment variables should be set automatically by mace/tools/slurm_distributed.py.
On other systems, you may need to set them manually by modifying that file.
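As an illustration, a helper for another scheduler needs to export roughly the following variables before run_train.py starts. This is a sketch only: the Slurm variable names on the right-hand side are examples, and the master address and port are placeholders that must match your cluster.

# Sketch of the environment a non-Slurm launcher must provide (names on the
# left are what torch.distributed expects; right-hand side shows Slurm examples).
import os

os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]    # total number of processes
os.environ["RANK"] = os.environ["SLURM_PROCID"]          # global rank of this process
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]   # rank within the node
os.environ["MASTER_ADDR"] = "10.0.0.1"                   # placeholder: IP/hostname of the rank-0 node
os.environ["MASTER_PORT"] = "29500"                      # placeholder: any free port, same on all ranks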
Single-node multi-GPU training
For training on a single node with multiple GPUs, you can use the following command:
torchrun --standalone --nnodes=1 --nproc_per_node=4 <mace_repo_dir>/mace/cli/run_train.py \
--name='model' \
--model='MACE' \
--num_interactions=2 \
--num_channels=128 \
--max_L=2 \
--correlation=3 \
--E0s='average' \
--r_max=5.0 \
--train_file='./h5_data/train.h5' \
--valid_file='./h5_data/valid.h5' \
--statistics_file='./h5_data/statistics.json' \
--num_workers=8 \
--batch_size=20 \
--valid_batch_size=80 \
--max_num_epochs=100 \
--loss='weighted' \
--error_table='PerAtomRMSE' \
--default_dtype='float32' \
--device='cuda' \
--distributed \
--seed=2222
This command will train the model on 4 GPUs (--nproc_per_node=4) on a single node (--nnodes=1).
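Before starting a long run, it can help to verify the distributed setup with a small standalone script, launched with the same torchrun arguments. The file name check_dist.py below is hypothetical and not part of MACE.

# check_dist.py -- hypothetical sanity check, not part of MACE
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
      f"on GPU {torch.cuda.get_device_name(local_rank)}")
dist.destroy_process_group()

Run it with, for example:

torchrun --standalone --nnodes=1 --nproc_per_node=4 check_dist.py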