บทความนี้เขียนขึ้นโดยมีสมมุติฐานว่าท่านมีประสบการณ์การใช้งาน HPC cluster มาก่อน เช่น TARA

และควรมีชุดประสบการณ์ดังต่อไปนี้

เข้าใจความแตกต่าง และวัตถุประสงค์การใช้งานเบื้องต้นของ frontend-node และ compute-node อ่านเพิ่มเติม

ส่วนสำคัญคือ ท่านทราบว่าไม่สามารถ download ใด ๆ ได้ใน compute node ต้อง pre-download files ที่ Frontend-node เท่านั้น

สามารถรัน batch job บน HPC cluster โดยใช้คำสั่ง sbatch ตามด้วย Slurm script ได้ อ่านเพิ่มเติม
ทำการติดตั้งและ activate conda environment ใน home หรือ project directory ของท่านบน HPC cluster ได้ อ่านเพิ่มเติม
มีความเข้าใจในโปรแกรมที่ท่านกำลังใช้งานอยู่เป็นอย่างดี

บทความนี้แบ่งเนื้อหาเป็นสามส่วนหลัก คือ

ส่วนแนะนำภาพกว้างต่าง ๆ (ดังที่ท่านได้อ่านแล้วบางส่วน)
ส่วน Setup หลักเพื่อใช้ DDP PyTorch บน HPC
ส่วน ตัวอย่างการใช้งาน DDP เพื่อประยุกต์กับงานของท่านบน TARA หรือ LANTA

ซึ่งท่านสามารถดูภาพรวมเนื้อหาได้จาก Table of Contents ด้านล่างนี้ เพื่อประโยชน์ในการไปดูส่วนที่สนใจได้ทันที ตัวอย่างทั้งหมดได้ผ่านการทดลองบน TARA แล้วทั้งสิ้น

เกริ่นนำ

โดยทั่วไปแล้วมีเพียงสองเหตุผลที่ทำให้เราต้องการใช้ multiple GPUs ในการเทรน neural networks:

execution time บน single GPU นั้นนานเกินกว่าทีมวิจัยจะรับได้
โมเดลที่เลือกใช้นั้นมีขนาดใหญ่เกินไปที่จะฟิตใน single GPU

ทั้งนี้ การเลือก GPUs จำนวนมากขึ้นย่อมทำให้การรอคิวของท่านใน HPC cluster นานมากขึ้นตามไปด้วย โดยเฉพาะเมื่อคลัสเตอร์มีความหนาแน่นของ Job ที่เข้าใช้งานพร้อมกันจำนวนมาก

ตรวจสอบความหนาแน่นของคิวได้จากคำสั่ง squeue หรือตรวจสอบเครื่องใน partition ต่าง ๆ ว่าว่างหรือไม่ด้วยคำสั่ง sinfo

Distributed Data-Parallel (DDP) ใน PyTorch

SPMD หรือที่ย่อมาจาก Single-Program, Multiple Data คือไอเดียที่ว่า “โมเดล” จะถูกทำสำเนาเก็บไว้ที่ GPUs ทุกตัวที่จะใช้ และข้อมูลที่จะใช้กับโมเดลดังกล่าวก็จะถูกแบ่งออกเป็นจำนวนเท่า ๆ กันสำหรับแต่ละ GPUs เมื่อ gradients ถูกคำนวนจากทุก GPUs แล้วจะนำมารวมกันเพื่อหาค่าเฉลี่ย และค่า weights ก็จะถูกอัพเดททั้งชุดผ่าน gradient all-reduced โดยที่กระบวนการนี้จะถูกทำซ้ำด้วย mini-batches ชุดใหม่ที่ถูกส่งเข้า GPUs แต่ละตัวอีกครั้งหนึ่ง

เราควรใช้ DistributedDataParallel ซึ่งจากการใช้งานทราบกันโดยทั่วไปว่า efficient กว่า DataParallel แต่ถ้าใครอยากลองดูวิธีใช้ DataParallel เรามีให้ลองอ่านที่ multi-GPUs Training with PyTorch -- DataParallel

Setup หลักเพื่อใช้ DDP PyTorch บน HPC

Setup ส่วนของ PyTorch

dist.init_process_group()

def setup(rank, world_size):
    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

Code block นี้คือจุดสังเกตของการใช้งาน DDP โดยเป็นการสร้าง process group โดยมีฟังก์ชั่น dist.init_process_group() เป็นจุด check point ที่จะ “บล็อก” การทำงานเอาไว้เพื่อรอให้ processes ทั้งหมดที่กระจายกันไปได้ทำงานให้เสร็จกันก่อนที่นี่ (ดูการใช้งานภาพรวมได้ใน Code block ของ SimpleDDP.py ด้านล่าง)

ตัวอย่างของ single-gpu training

model = Net().to(device)
optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

ตัวอย่างของ multi-GPUs training with DDP

มีโมเดลที่ส่งให้ devices ตามจำนวน local_rank โดยที่โมเดลถูกส่งเข้า DDP อีกทอดหนึ่ง

model = Net().to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = optim.Adadelta(ddp_model.parameters(), lr=args.lr)

ซึ่ง local_rank ก็คือ gpu index ในเครื่องที่เราจะเห็นเป็นเลข 0-x เวลาเรา nvidia-smi หรือเรียก cuda ดู เช่น ใน DGX เราจะเห็น 0, 1, …, 7 เป็นต้น

นอกจากนี้เรายังต้องทำให้แน่ใจว่า แต่ละ batch จะถูกส่งไปที่ gpu ทุกตัวใน node

data.distributed.DistributedSampler & data.DataLoader

train_sampler = torch.utils.data.distributed.DistributedSampler(\
                            dataset1, num_replicas=world_size, rank=rank)
train_loader  = torch.utils.data.DataLoader(\
                            dataset1, batch_size=args.batch_size,  \
                            sampler=train_sampler, \ 
                            num_workers=int(os.environ["SLURM_CPUS_PER_TASK"]), \
                            pin_memory=True)

Setup ส่วนของ HPC (Slurm configuration)

สำหรับ Slurm script ที่ใช้รันจริงดูได้จาก code block ที่แสดงให้ดูเป็นตัวอย่างที่สามารถทดลองทำซ้ำได้ เช่น SimpleScriptDDP.sh, Script-N-1-worldsize-8.sh เป็นต้น ใน code block ด้านล่างนี้คือ setup หลักที่อยากแสดงให้เห็นก่อน นั่นคือ การเลือก partition ที่จะใช้งาน (-p) การเลือกจำนวนโหนดที่ต้องการใช้งาน (-N) การเลือกจำนวน tasks หรือ process ที่ให้รันในหนึ่งโหนด (--ntasks-per-node) ซึ่งโดยทั่วไปเพื่อให้เกิดประสิทธิภาพการใช้งานสูงสุดเราจะเลือกใช้เต็มจำนวน gpus ที่มี และการเลือกจำนวน cpu ต่อ tasks หรือ process (--cpus-per-task) ซึ่งโดยปกติเราก็จะเลือกใช้เต็มจำนวน cpus ที่มี

#SBATCH -p dgx-preempt           # trainning partition ที่เราเลือกใช้คือ dgx-preempt
#SBATCH -N 1                     # จำนวน node ของ dgx-preempt ที่เราต้องการ
#SBATCH --ntasks-per-node=8      # จำนวน ntasks per node (=จำนวน gpus ที่มี)
#SBATCH --cpus-per-task=5        # จำนวน cpu per task (5x8=40 cpus)
...
export WORLD_SIZE=8   # ควรได้มาจาก $(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE
...
srun python DDP.py

โดยการคำนวณค่า --cpus-per-task ดังกล่าว เราต้องทราบจำนวน CPU cores ที่มีในโหนด เพื่อไม่ให้เกิดการระบุค่าเกินจำนวน core ที่มีอยู่จริง ซึ่งจะส่งผลให้ Job ของเราติด ST (Status) PD หรือ Pending โดยมี NODELIST (REASON) แบบ PartitionConfig ซึ่งแสดงให้เห็นได้จากคำสั่ง squeue

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   0000001       dgx  myjob   xxxxxxx PD       0:00      1 (PartitionConfig)
   0000002   compute herjob   yyyyyyy PD       0:00      1 (QOSGrpBillingMinutes)
   0000003 dgx-preem hisjob   xxxxxxx PD       0:00      1 (Resources)
   0000004       gpu onejob   yyyyyyy PD       0:00      1 (Priority)
   0000005       gpu onejob   yyyyyyy  R    6:40:53      1 tara-g-001

นอกจากนี้เรายังทำการ export ค่า WORLD_SIZE ซึ่งเป็นค่าที่เราคำนวณได้เองจากการที่เราระบุจำนวน Node ที่ต้องการ (-N) และเราสามารถทราบจำนวน GPUs ที่มีต่อโหนดที่เราเรียกใช้จาก partition ดังกล่าวได้ เช่น จาก cluster information page (https://thaisc.io/en/resources/ ) เป็นต้น โดยค่า WORLD_SIZE จะถูกนำไปใช้ในการแจกงานภายใน DDP ต่อไป

ใน TARA เราสามารถเรียกใช้ GPU ได้จากสาม partitions กล่าวคือ gpu, dgx และ dgx-preempt

Examples

Simple DDP

โดยในตัวอย่างนี้ เราจะใช้งาน SimpleDDP.py คู่กับ SimpleScriptDDP.sh ในการส่งคำสั่งเข้า Slurm ผ่าน sbatch

หากเราโฟกัสที่ SimpleDDP.py เราจะพบว่า เป็นการใช้งาน Linear Regression Model ง่าย ๆ แต่เน้นไปที่การทำความเข้าใจกับ parameter เพิ่มเติมอย่าง rank, world_size, gpus_per_node, local_rank เป็นต้น

SimpleDDP.py

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from socket import gethostname

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(42, 3)

    def forward(self, x):
        x = self.fc(x)
        x = F.relu(x)
        output = F.log_softmax(x, dim=1)
        return output

rank          = int(os.environ["SLURM_PROCID"])
world_size    = int(os.environ["WORLD_SIZE"])
gpus_per_node = int(os.environ["SLURM_GPUS_ON_NODE"])
assert gpus_per_node == torch.cuda.device_count()
print(f"Hello from rank {rank} of {world_size} on {gethostname()} where there are" \
      f" {gpus_per_node} allocated GPUs per node.", flush=True)

dist.init_process_group("nccl", rank=rank, world_size=world_size)
if rank == 0: print(f"Group initialized? {dist.is_initialized()}", flush=True)

local_rank = rank - gpus_per_node * (rank // gpus_per_node)
torch.cuda.set_device(local_rank)

model = Net().to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

ddp_model.eval()
with torch.no_grad():
  data = torch.rand(1, 42)
  data = data.to(local_rank)
  output = ddp_model(data)
  print(f"host: {gethostname()}, rank: {rank}, output: {output}")

dist.destroy_process_group()

rank, world_size, gpus_per_node, local_rank

rank          = int(os.environ["SLURM_PROCID"])
world_size    = int(os.environ["WORLD_SIZE"])
gpus_per_node = int(os.environ["SLURM_GPUS_ON_NODE"])
...
local_rank = rank - gpus_per_node * (rank // gpus_per_node)
torch.cuda.set_device(local_rank)

SLURM_PROCID เป็น Slurm environment variable ที่มีค่าระหว่าง [0 - N-1], โดยที่ N เป็นค่า the number of tasks ที่รันภายใต้ srun ยกตัวอย่างเช่น การสั่งให้ Slurm config ที่จำนวน nodes = 2 และให้ใช้ ntasks-per-node=4

#SBATCH -N 2
#SBATCH --ntasks-per-node=4
...
srun python myscript.py

หมายความว่า Python interpreter จะสร้าง 2x4 = 8 processes ซึ่งจะมีค่า SLURM_PROCID ดังนี้ 0, 1, 2, 3, 4, 5, 6, 7

ซึ่งค่า rank ได้มาจาก SLURM_PROCID ดังได้กล่าวไปข้างต้น

ส่วน world_size แท้จริงแล้วก็คือจำนวน GPUs ที่มีทั้งหมดในการรันครั้งนี้ซึ่งได้มาจาก จำนวน GPU node(s) x จำนวน GPUs หรือ --ntasks-per-node ที่มีใน Node นั้น ๆ เช่น หากเรียกใช้งาน DGX node 1 node ซึ่งประกอบไปด้วย GPUs 8 ตัว ค่า world_size ที่ได้จะเท่ากับ 1x8=8 หรือหากเรียกใช้งาน GPU node 2 node ซึ่งประกอบไปด้วย GPUs 2 ตัวต่อโหนด ค่า world_size ที่ได้จะเท่ากับ 2x2=4 เป็นต้น

ทั้งนี้ค่า gpus_per_node และ local_rank จะถูกใช้ในการระบุ device ให้กับ torch ดังแสดงข้างต้น

อย่างไรก็ดี ค่าเหล่านี้จำต้องได้รับการเซ็ตจาก Slurm environment เอง หากมีการเซ็ตไว้โดยแอดมินของ HPC cluster หรือเราสามารถ export เองได้ใน launch script ดังแสดงในตัวอย่างสคริปต์ SimpleScriptDDP.sh

Output SimpleDDP

WORLD_SIZE=8
MASTER_PORT=15334
MASTER_ADDR=tara-dgx1-002
Hello from rank 6 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 5 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 4 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 7 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 3 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 0 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 1 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 2 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Group initialized? True
host: tara-dgx1-002.tara.nstda.or.th, rank: 6, output: tensor([[-1.2452, -1.0438, -1.0216]], device='cuda:6')
host: tara-dgx1-002.tara.nstda.or.th, rank: 4, output: tensor([[-1.2201, -0.9631, -1.1298]], device='cuda:4')
host: tara-dgx1-002.tara.nstda.or.th, rank: 5, output: tensor([[-1.2976, -0.8704, -1.1775]], device='cuda:5')
host: tara-dgx1-002.tara.nstda.or.th, rank: 2, output: tensor([[-1.2821, -1.0588, -0.9790]], device='cuda:2')
host: tara-dgx1-002.tara.nstda.or.th, rank: 1, output: tensor([[-1.1647, -0.9979, -1.1416]], device='cuda:1')
host: tara-dgx1-002.tara.nstda.or.th, rank: 0, output: tensor([[-1.1372, -1.0257, -1.1372]], device='cuda:0')
host: tara-dgx1-002.tara.nstda.or.th, rank: 3, output: tensor([[-1.2476, -1.0863, -0.9799]], device='cuda:3')
host: tara-dgx1-002.tara.nstda.or.th, rank: 7, output: tensor([[-1.1277, -1.0428, -1.1277]], device='cuda:7')

SimpleScriptDDP.sh

#!/bin/bash
#SBATCH -A thaisc                # account of your project
#SBATCH -J SG-torch              # create a short name for your job
#SBATCH -p dgx-preempt           # your choice of partition
#SBATCH -N 1                     # node count
#SBATCH --ntasks-per-node=8      # total number of tasks per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=ALL          # send email when job begins, ends
#SBATCH --mail-user=me@myorg.org

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "MASTER_PORT="$MASTER_PORT

export WORLD_SIZE=8   # ควรได้มาจาก $(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

module purge
module load cuDNN/8.0.5.39-CUDA-11.1.1
source ~/.bashrc
myconda
conda activate condapy37

srun python simpleDDP.py

ใน SimpleScriptDDP.sh ดังแสดงข้างต้นประกอบไปด้วยส่วนหลักอยู่ 4 ส่วนด้วยกันกล่าวคือ

ส่วนของ SBATCH configuration ที่มีสัญลักษณ์ #SBATCH แสดงต้นประโยค ในที่นี้ขออธิบายรายละเอียดที่เกี่ยวข้องกับการทำ distributed training นั่นคือ
```
#SBATCH -p dgx-preempt           # trainning partition ที่เราเลือกใช้คือ dgx-preempt
#SBATCH -N 1                     # จำนวน node ของ dgx-preempt ที่เราต้องการ
#SBATCH --ntasks-per-node=8      # จำนวน ntasks per node (=จำนวน gpus ที่มี)
```
1. -p ยกตัวอย่างคลัสเตอร์ TARA จะมีเพียง 3 partitions ที่มี gpu กล่าวคือ partition dgx, dgx-preempt, และ partition gpu ทำให้ตัวเลือก -p เป็นไปได้ทั้งหมด 3 ค่าดังกล่าว
2. ส่วนจำนวน -N ที่เป็นไปได้นั้น สำหรับ partition dgx เป็นไปได้ตั้งแต่ 1 ถึง 3 เท่ากับจำนวนโหนดที่มีมากสุดในแต่ละ partition (-N 2 สำหรับ gpu)
3. --ntask-per-node สำหรับ dgx มีค่ามากสุดเท่ากับ 8 ส่วน gpu มีค่ามากสุดที่ 2 ตามจำนวน gpus ที่มี
ส่วนการ export ค่าต่าง ๆ ที่จำเป็นต่อการรันแบบ distributed pytorch
โดย MASTER_PORT จำเป็นต้องถูกเซ็ต เช่นเดียวกับ WORLD_SIZE ดังได้อธิบายไปก่อนหน้า ส่วน MASTER_ADDR สามารถเรียกได้จาก scontrol show hostnames ดังแสดงใน script ด้านล่างนี้
```
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "MASTER_PORT="$MASTER_PORT

export WORLD_SIZE=8   # ควรได้มาจาก $(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
```
ส่วนของการจัดการกับ training environment ซึ่งในตัวอย่างนี้ เลือกใช้การจัดการ environment ของ python ด้วย conda เมื่อรันบน HPC เราจึงต้องทำการ activate ทั้ง module ที่เกี่ยวข้องและเรียกใช้ conda ที่เราใช้งานได้พร้อมกัน ซึ่งในกรณีนี้เราต้องเรียกใช้ version cuDNN ที่สามารถใช้งานร่วมกับ environment ของ conda ที่เรา activate ขึ้นมาได้ ซึ่งในตัวอย่างนี้มีการ source ~/.bashrc เนื่องจากมีการเขียนฟังก์ชั่น myconda ในการ activate conda main ไว้นั่นเอง
```
module purge
module load cuDNN/8.0.5.39-CUDA-11.1.1
source ~/.bashrc
myconda
conda activate condapy37
```
ส่วนของการสั่ง srun เพื่อรันงาน python simpleDDP.py แบบ distributed parallel srun python simpleDDP.py

Table of Content (in page)

Now you are at the end of "Simple DDP". To navigate elsewhere, click expand to see the table of content.

MNIST DDP

โดยในตัวอย่างนี้เป็นการแปลง python code แต่เดิมของ MNIST ที่เป็น single gpu ให้เป็น DDP

DDP.py

import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from socket import gethostname

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

def setup(rank, world_size):
    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)
    
    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {'num_workers': int(os.environ["SLURM_CPUS_PER_TASK"]),
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('data', train=True, download=False,
                       transform=transform)
    dataset2 = datasets.MNIST('data', train=False,
                       transform=transform)
 
    world_size    = int(os.environ["WORLD_SIZE"])
    rank          = int(os.environ["SLURM_PROCID"])
    gpus_per_node = int(os.environ["SLURM_GPUS_ON_NODE"])
    assert gpus_per_node == torch.cuda.device_count()
    print(f"Hello from rank {rank} of {world_size} on {gethostname()} where there are" \
          f" {gpus_per_node} allocated GPUs per node.", flush=True)

    setup(rank, world_size)
    if rank == 0: print(f"Group initialized? {dist.is_initialized()}", flush=True)

    local_rank = rank - gpus_per_node * (rank // gpus_per_node)
    torch.cuda.set_device(local_rank)
    print(f"host: {gethostname()}, rank: {rank}, local_rank: {local_rank}")

    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset1, num_replicas=world_size, rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset1, batch_size=args.batch_size, sampler=train_sampler, \
                                               num_workers=int(os.environ["SLURM_CPUS_PER_TASK"]), pin_memory=True)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = optim.Adadelta(ddp_model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, ddp_model, local_rank, train_loader, optimizer, epoch)
        if rank == 0: test(ddp_model, local_rank, test_loader)
        scheduler.step()

    if args.save_model and rank == 0:
        torch.save(model.state_dict(), "mnist_cnn.pt")

    dist.destroy_process_group()


if __name__ == '__main__':
    main()

--cpus-per-task หรือ SLURM_CPUS_PER_TASK

ซึ่ง Python script DDP.py ดังกล่าวข้างต้นถูกใช้งานร่วมกับ script.sh ดังแสดงด้านล่างนี้ โดยมี 1 บันทัดที่เพิ่มเติมขึ้นมาสำหรับ Data Loader นั่นคือ

#SBATCH --cpus-per-task=5 # number of cpu per task (5x8=40 cpus)

ซึ่งการกำหนด --cpus-per-task นั้นต้องคำนึงถึงจำนวน cpu ที่มีทั้งหมดของ node หารด้วยจำนวน ntasks-per-node ที่เซ็ตไว้ซึ่งทั้ง partition dgx และ gpu มีจำนวน CPUs เท่ากันที่ 40 CPUs ดังนั้นสำหรับ dgx ค่า cpus-per-task จึงกำหนดได้เป็น 5 และสำหรับ partition gpu ค่า cpus-per-task จึงกำหนดได้เป็น 20

script-N-1-worldsize-8.sh

#!/bin/bash
#SBATCH -A thaisc                # account of your project
#SBATCH -J SG-torch              # create a short name for your job
#SBATCH -p dgx-preempt           # your choice of partition
#SBATCH -N 1                     # node count
#SBATCH --ntasks-per-node=8      # total number of tasks per node
#SBATCH --cpus-per-task=5        # number of cpu per task (5x8=40 cpus)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=ALL          # send email when job begins, ends
#SBATCH --mail-user=me@myorg.org

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "MASTER_PORT="$MASTER_PORT

export WORLD_SIZE=8    # ควรได้มาจาก $(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

module purge
module load cuDNN/8.0.5.39-CUDA-11.1.1
source ~/.bashrc
myconda
conda activate condapy37

srun python DDP.py

Output DDP กับ 1 Node DGX-1 (8 GPUs)

ผลลัพธ์ของ script-N-1-worldsize-8.sh และ DDP.py แสดงได้ดังต่อไปนี้

WORLD_SIZE=8
MASTER_PORT=15347
MASTER_ADDR=tara-dgx1-002
Hello from rank 1 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 3 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 0 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 2 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 5 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 4 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 6 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 7 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Group initialized? True
host: tara-dgx1-002.tara.nstda.or.th, rank: 1, local_rank: 1
host: tara-dgx1-002.tara.nstda.or.th, rank: 3, local_rank: 3
host: tara-dgx1-002.tara.nstda.or.th, rank: 4, local_rank: 4
host: tara-dgx1-002.tara.nstda.or.th, rank: 6, local_rank: 6
host: tara-dgx1-002.tara.nstda.or.th, rank: 0, local_rank: 0
host: tara-dgx1-002.tara.nstda.or.th, rank: 2, local_rank: 2
host: tara-dgx1-002.tara.nstda.or.th, rank: 5, local_rank: 5
host: tara-dgx1-002.tara.nstda.or.th, rank: 7, local_rank: 7
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.297117
Train Epoch: 1 [640/60000 (8%)] Loss: 1.329343
Train Epoch: 1 [1280/60000 (17%)]       Loss: 0.518520
Train Epoch: 1 [1920/60000 (25%)]       Loss: 0.331641
Train Epoch: 1 [2560/60000 (34%)]       Loss: 0.256029
Train Epoch: 1 [3200/60000 (42%)]       Loss: 0.126544
Train Epoch: 1 [3840/60000 (51%)]       Loss: 0.129393
Train Epoch: 1 [4480/60000 (59%)]       Loss: 0.135831
Train Epoch: 1 [5120/60000 (68%)]       Loss: 0.094554
Train Epoch: 1 [5760/60000 (76%)]       Loss: 0.131771
Train Epoch: 1 [6400/60000 (85%)]       Loss: 0.078105
Train Epoch: 1 [7040/60000 (93%)]       Loss: 0.078772
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.315368
Train Epoch: 1 [640/60000 (8%)] Loss: 1.471632
Train Epoch: 1 [1280/60000 (17%)]       Loss: 0.394169
Train Epoch: 1 [1920/60000 (25%)]       Loss: 0.376319
...
Train Epoch: 14 [5120/60000 (68%)]      Loss: 0.003920
Train Epoch: 14 [5760/60000 (76%)]      Loss: 0.105166
Train Epoch: 14 [6400/60000 (85%)]      Loss: 0.020963
Train Epoch: 14 [7040/60000 (93%)]      Loss: 0.071237

Test set: Average loss: 0.0298, Accuracy: 9897/10000 (99%)

Output DDP กับ 2 Nodes DGX-1 (16 GPUs)

หากเราต้องการเทรนนิ่งด้วย 2 DGX nodes หรือ 16 GPUs V100 เราสามารถปรับแต่งค่าที่เกี่ยวข้องต่าง ๆ ใน Slurm configuration ของ script-N-1-worldsize-8.sh ให้กลายเป็น script-N-2-worldsize-16.sh ได้ง่าย ๆ ดังนี้

#SBATCH -N 2                     # node count
...
export WORLD_SIZE=16    # ควรได้มาจาก $(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

ผลลัพธ์ของ script-N-2-worldsize-16.sh และ DDP.py แสดงได้ดังต่อไปนี้

WORLD_SIZE=16
MASTER_PORT=17117
MASTER_ADDR=tara-dgx1-002
Hello from rank 12 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 13 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 15 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 8 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 9 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 10 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 11 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 14 of 16 on tara-dgx1-003.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 4 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 5 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 6 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 7 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 0 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 1 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 2 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 3 of 16 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Group initialized? True
host: tara-dgx1-002.tara.nstda.or.th, rank: 4, local_rank: 4
host: tara-dgx1-002.tara.nstda.or.th, rank: 6, local_rank: 6
host: tara-dgx1-002.tara.nstda.or.th, rank: 0, local_rank: 0
host: tara-dgx1-002.tara.nstda.or.th, rank: 2, local_rank: 2
host: tara-dgx1-002.tara.nstda.or.th, rank: 5, local_rank: 5
host: tara-dgx1-002.tara.nstda.or.th, rank: 7, local_rank: 7
host: tara-dgx1-002.tara.nstda.or.th, rank: 1, local_rank: 1
host: tara-dgx1-002.tara.nstda.or.th, rank: 3, local_rank: 3
host: tara-dgx1-003.tara.nstda.or.th, rank: 12, local_rank: 4
host: tara-dgx1-003.tara.nstda.or.th, rank: 13, local_rank: 5
host: tara-dgx1-003.tara.nstda.or.th, rank: 15, local_rank: 7
host: tara-dgx1-003.tara.nstda.or.th, rank: 8, local_rank: 0
host: tara-dgx1-003.tara.nstda.or.th, rank: 9, local_rank: 1
host: tara-dgx1-003.tara.nstda.or.th, rank: 10, local_rank: 2
host: tara-dgx1-003.tara.nstda.or.th, rank: 11, local_rank: 3
host: tara-dgx1-003.tara.nstda.or.th, rank: 14, local_rank: 6
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.288294
Train Epoch: 1 [640/60000 (17%)]        Loss: 1.147667
Train Epoch: 1 [1280/60000 (34%)]       Loss: 0.518015
Train Epoch: 1 [1920/60000 (51%)]       Loss: 0.241061
Train Epoch: 1 [2560/60000 (68%)]       Loss: 0.159519
Train Epoch: 1 [3200/60000 (85%)]       Loss: 0.387764
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.306850
Train Epoch: 1 [640/60000 (17%)]        Loss: 1.047796
...

การติดตั้งอื่น ๆ ที่จำเป็น

อย่าลืม Pre-download ข้อมูลที่ frontend-node ก่อนรัน

$ conda activate condapy37
(condapy37)$ python
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from torchvision import datasets, transforms
>>> transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
>>> dataset1 = datasets.MNIST('data', train=True, download=True, transform=transform)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
9913344it [00:01, 5744180.56it/s]
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz
29696it [00:00, 1196864.05it/s]
Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz
1649664it [00:00, 2396549.23it/s]
Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz
5120it [00:00, 23165950.90it/s]
Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw

>>> exit()

ซึ่งตอนนี้ข้อมูล MNIST จาก datasets ที่ torch มีให้ก็ถูก download ลงมาที่ /data/MNIST/raw บนเครื่องที่เรารันเรียบร้อย

Acknowledgment

Thank you for the very nice resources from Princeton University, as this article is written based on your work.

https://github.com/PrincetonUniversity/multi_gpu_training/tree/main/02_pytorch_ddp

Multi-GPUs Training with PyTorch -- Distributed Data-Parallel