Multi-GPU & High-Throughput Training

Multi-GPU (torchrun)

ALIGNN supports PyTorch DistributedDataParallel launched via torchrun:

torchrun --nproc_per_node=4 train_alignn.py \
  --root_dir DataDir \
  --config config.json \
  --output_dir temp
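A hard-coded --nproc_per_node=4 only works on a node that actually has four visible GPUs. A minimal sketch of deriving the value at launch time, assuming nvidia-smi is on the PATH of GPU nodes (the fallback to 1 is an assumption for GPU-less machines):

```shell
# Count the GPUs visible on this node; nvidia-smi -L prints one line per GPU.
NGPUS=$(nvidia-smi -L 2>/dev/null | grep -c GPU)
# Fall back to a single process when no GPU is detected (assumption).
if [ "$NGPUS" -lt 1 ]; then NGPUS=1; fi
echo "launching with --nproc_per_node=$NGPUS"
# torchrun --nproc_per_node="$NGPUS" train_alignn.py \
#   --root_dir DataDir --config config.json --output_dir temp
```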

Experimental

Multi-GPU training is not yet thoroughly tested. Please report issues on GitHub.

SLURM example

#SBATCH -n 4
#SBATCH -N 1
#SBATCH --gres=gpu:4

torchrun --nproc_per_node=4 train_alignn.py \
  --root_dir DataDir --config config.json --output_dir temp

Make sure --nproc_per_node matches the number of GPUs requested from the scheduler.
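One way to keep the two in sync inside the batch script is to read the process count from the scheduler's environment rather than hard-coding it; this sketch assumes SLURM exports SLURM_GPUS_ON_NODE for --gres=gpu:N allocations:

```shell
# Derive the process count from the SLURM allocation; fall back to the
# 4 GPUs requested above when the variable is unset (e.g. outside SLURM).
NPROC=${SLURM_GPUS_ON_NODE:-4}
echo "torchrun --nproc_per_node=$NPROC train_alignn.py --root_dir DataDir --config config.json --output_dir temp"
```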

High-throughput training

For running the same training pipeline across many public datasets, see alignn/scripts/train_*.py. These scripts:

  • Download datasets via jarvis-tools databases (JARVIS-DFT, Materials Project, QM9_JCTC, …)
  • Generate the id_prop.csv and per-target configs automatically
  • Submit one training job per target property

Adapt the scheduler-specific lines (sbatch, qsub, …) at the top of each script for your cluster.
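The per-target layout these scripts generate is the same one train_alignn.py consumes: a root directory holding the structure files plus an id_prop.csv mapping each file to its target value. An illustrative sketch (the IDs and values below are hypothetical):

```shell
# Build a minimal root directory in the format train_alignn.py expects:
# one structure file per row, with "filename,target" lines in id_prop.csv.
mkdir -p DataDir
cat > DataDir/id_prop.csv <<'EOF'
JVASP-1002.vasp,1.23
JVASP-1174.vasp,0.45
EOF
head -n 1 DataDir/id_prop.csv
```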

When to use which

| Situation | Use |
| --- | --- |
| Single dataset, single target, one node | plain train_alignn.py |
| Single dataset, multiple GPUs on one node | torchrun --nproc_per_node=N |
| Many datasets / targets, cluster queue | alignn/scripts/train_*.py |