Multi-GPU & High-Throughput Training
Multi-GPU (torchrun)
ALIGNN supports multi-GPU training with PyTorch's DistributedDataParallel (DDP), launched via torchrun:
torchrun --nproc_per_node=4 train_alignn.py \
--root_dir DataDir \
--config config.json \
--output_dir temp
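torchrun spawns one worker process per GPU and exports RANK, LOCAL_RANK, and WORLD_SIZE to each; the training script reads these to set up the process group. A minimal sketch of that pattern (illustrative only, not ALIGNN's actual code):

```python
import os

def ddp_env():
    """Read the variables torchrun exports to each worker process.

    The defaults make the same script also run single-process,
    without torchrun (e.g. plain `python train_alignn.py`).
    """
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, local_rank, world_size

rank, local_rank, world_size = ddp_env()
if world_size > 1:
    # With PyTorch available, a DDP script would now typically do:
    #   torch.distributed.init_process_group(backend="nccl")
    #   torch.cuda.set_device(local_rank)
    #   model = torch.nn.parallel.DistributedDataParallel(
    #       model, device_ids=[local_rank])
    pass
```

Each worker sees the same command line; only the environment variables differ, which is how per-GPU device placement is decided.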
Experimental: multi-GPU training is not yet thoroughly tested. Please report issues on GitHub.
SLURM example
#SBATCH -n 4
#SBATCH -N 1
#SBATCH --gres=gpu:4
torchrun --nproc_per_node=4 train_alignn.py \
--root_dir DataDir --config config.json --output_dir temp
Make sure --nproc_per_node matches the number of GPUs requested from the scheduler (here, 4 on a single node).
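To keep the process count from drifting out of sync with the allocation, it can be derived from SLURM's SLURM_GPUS_ON_NODE variable (set when GPUs are granted via --gres). A sketch, with the actual launch line left as a comment:

```shell
# SLURM exports SLURM_GPUS_ON_NODE inside an allocation with --gres=gpu:N;
# fall back to 1 when running outside SLURM.
nproc=${SLURM_GPUS_ON_NODE:-1}
echo "launching with --nproc_per_node=$nproc"
# torchrun --nproc_per_node="$nproc" train_alignn.py \
#     --root_dir DataDir --config config.json --output_dir temp
```

This way, changing `#SBATCH --gres=gpu:4` is enough; the torchrun line needs no edit.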
High-throughput training
For running the same training pipeline across many public datasets, see
alignn/scripts/train_*.py.
These scripts:
- Download datasets via jarvis-tools databases (JARVIS-DFT, Materials Project, QM9_JCTC, …)
- Generate the id_prop.csv and per-target configs automatically
- Submit one training job per target property
Adapt the scheduler-specific lines (sbatch, qsub, …) at the top of each script for
your cluster.
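The id_prop.csv generation step amounts to writing one id/value row per entry that has the target property. A sketch, assuming jarvis-tools' figshare `data()` loader (shown as a comment) and mocked here with two inline records so it runs offline; the key names follow the JARVIS-DFT schema:

```python
import csv

# In the real scripts the records come from jarvis-tools, e.g.:
#   from jarvis.db.figshare import data
#   records = data("dft_3d")
# Mocked here with two entries so the sketch runs offline.
records = [
    {"jid": "JVASP-1002", "formation_energy_peratom": -0.42},
    {"jid": "JVASP-1008", "formation_energy_peratom": -1.1},
]

target = "formation_energy_peratom"  # one training job per target property

with open("id_prop.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for rec in records:
        # Skip entries with no value for this target ("na" in JARVIS dumps)
        if rec.get(target) in (None, "na"):
            continue
        writer.writerow([rec["jid"], rec[target]])
```

The training pipeline additionally expects the corresponding structure files (named by id) next to id_prop.csv; the scripts write those out in the same pass.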
When to use which
| Situation | Use |
|---|---|
| Single dataset, single target, one node | plain train_alignn.py |
| Single dataset, multiple GPUs on one node | torchrun --nproc_per_node=N |
| Many datasets / targets, cluster queue | alignn/scripts/train_*.py |