This article walks through an example of installing PyTorch-related Python packages and test-running a job on a gpu node of TARA HPC, using a simple set of package modules taken from 3D deep learning work with PyTorch.

Example program: basic multi-GPUs PyTorch

...

The example program used in this article, basic multi-GPUs PyTorch, can be copied from the following path on the TARA system:
/tarafs/data/project/common/AI/examples/basic-multigpu-pytorch.py

Import PyTorch modules and define parameters

Code Block

language	py

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

Create a dummy dataset from random values

Code Block

language	py

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)
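
With data_size = 100 and batch_size = 30, one pass over the loader yields four batches, and the last one is smaller. The arithmetic can be sketched in plain Python as follows. The helper name batch_sizes is ours, and the sketch assumes the loader keeps the final partial batch, which is the DataLoader default.

```python
def batch_sizes(data_size, batch_size):
    """Sizes of the batches produced by one pass when the last partial batch is kept."""
    full, rest = divmod(data_size, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(100, 30)
print(sizes)  # [30, 30, 30, 10]
```

These are the "Outside" sizes that appear in the outputs later in this article.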

Create a simple model

The model only takes an input, applies a linear operation, and returns the output. It is kept this small to show how DataParallel works: DataParallel is the part that enables multi-GPU use by splitting the data across the GPUs.

...

Code Block

language	py

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

Create a device object

This is the device to which we send tensors and the model so that the work runs there.

...

Code Block

language	py

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
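
The outputs later in this article start with a few diagnostic lines (torch version, cuda available?, cuda version, cuda device count, cuda device id), but the code that prints them is not shown. A minimal sketch that gathers similar information is given below. The function name cuda_report and the exact output format are our assumptions, not the original script.

```python
import torch


def cuda_report():
    """Collect basic facts about the PyTorch build and the visible GPUs."""
    count = torch.cuda.device_count()
    return {
        "torch version": str(torch.__version__),
        "cuda available?": torch.cuda.is_available(),
        "cuda version": torch.version.cuda,  # None on CPU-only builds
        "cuda device count": count,
        "cuda device ids": list(range(count)),
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

On a node without GPUs the device count is 0 and the list of device ids is empty.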

Create a model instance and run it with nn.DataParallel

This is the main point of interest in this article. If the machine we run on has several GPUs, we can wrap our model with nn.DataParallel, and then send the wrapped model to the device we defined earlier.

...

At this step, if more than one GPU is available, as on a gpu node of the TARA system, the program prints the following:

Code Block
Let's use 2 GPUs!
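
The code for this step is elided above. A minimal sketch that is consistent with the description and with the printed message could look like the following. It repeats a simplified Model (without the debug print) so that it runs on its own, and it is our reconstruction rather than the exact contents of the example file.

```python
import torch
import torch.nn as nn

input_size, output_size = 5, 2  # same values as earlier in the article


class Model(nn.Module):
    """Single linear layer, as in the article, without the debug print."""

    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.fc(x)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # DataParallel splits each batch along dim 0 and runs one replica per GPU.
    model = nn.DataParallel(model)
model.to(device)

# One forward pass with a random batch of 30 rows.
out = model(torch.randn(30, input_size).to(device))
print(out.size())
```

On a machine with zero or one GPU the model is used unwrapped, so the same script also runs on a cpu node.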

Run the model

We also print the sizes of the input and output tensors.

Code Block

language	py

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())

Output when running on a cpu node

If the machine has no GPU, for example when running interactively with sinteract on a compute node (tara-c-001) of the TARA system, the output looks as shown below. The data is not split, because there is no GPU and the job runs on the CPU.

Code Block

(venv-3DDL) [uaccount@tara-c-001 segmed]$ python basic-multigpu-pytorch.py
torch version :  1.9.0+cu102
cuda available? :  False
cuda version:  10.2
cuda device count:  0
cuda device id: 
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
(venv-3DDL) [uaccount@tara-c-001 segmed]$

Output when running on a gpu node

If the machine has GPUs, for example when running interactively with sinteract on a gpu node (tara-g-001) of the TARA system, the output looks like this:

Code Block

(venv-3DDL) [uaccount@tara-g-001 segmed]$ python basic-multigpu-pytorch.py
torch version :  1.9.0+cu102
cuda available? :  True
cuda version:  10.2
cuda device count:  2
cuda device id: , 0, 1
Let's use 2 GPUs!
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
(venv-3DDL) [uaccount@tara-g-001 segmed]$

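As the output shows, the data is split and the model runs on both GPUs: each batch of 30 rows is divided into two chunks of 15, and the final batch of 10 into two chunks of 5.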

Install PyTorch packages

Suppose your Python program uses libraries such as PyTorch3D, torchvision, and torchsummary, with a header like the one below.

Code Block

language	py

import numpy as np
import torch
import torch.nn as nn
import torchvision
from torchvision import models
from torchsummary import summary

Suppose you also have a requirements-3DDL.txt file produced by pip freeze on your development machine, as follows:

Code Block

$ cat requirements-3DDL.txt 
cycler==0.10.0
fvcore==0.1.5.post20210812
imageio==2.9.0
iopath==0.1.9
kiwisolver==1.3.1
matplotlib==3.4.3
networkx==2.6.2
numpy==1.21.2
opencv-python==4.5.3.56
Pillow==8.3.1
plotly==5.2.1
portalocker==2.3.0
pyparsing==2.4.7
python-dateutil==2.8.2
PyTorch3d==0.5.0
PyWavelets==1.1.1
PyYAML==5.4.1
scikit-image==0.18.2
scipy==1.7.1
six==1.16.0
tabulate==0.8.9
tenacity==8.0.1
termcolor==1.1.0
tifffile==2021.8.8
torch==1.9.0
torchaudio==0.9.0
torchsummary==1.5.1
torchvision==0.10.0
tqdm==4.62.1
typing-extensions==3.10.0.0
yacs==0.1.8

You can install these libraries in your project home directory on TARA HPC inside your own virtual environment (source activate), or install them in a Singularity container for use on TARA HPC.
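
Whichever route you choose, it can help to check afterwards that the installed versions match the pinned ones. The sketch below uses only the Python standard library. The helper names parse_pins and find_mismatches are ours, and the sketch only understands simple name==version lines like those in the file above. The local build suffix (such as +cu102 in 1.9.0+cu102) is ignored when comparing.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (wanted, installed_or_None)} for every pin that is not met."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare without the local build suffix, e.g. '1.9.0+cu102' -> '1.9.0'.
        plain = installed.split("+")[0] if installed else None
        if plain != wanted:
            problems[name] = (wanted, installed)
    return problems


# In practice the lines would come from open("requirements-3DDL.txt").
pins = parse_pins(["torch==1.9.0", "torchvision==0.10.0", "# a comment", ""])
for name, (wanted, installed) in find_mismatches(pins).items():
    print(f"{name}: wanted {wanted}, found {installed}")
```

Passing the version lookup in as get_version keeps the comparison logic easy to try out without touching the real environment.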

Install the required package modules in a virtualenv and test them

Remember to load the software modules you need before you start working on TARA.

Create a virtual environment on TARA

In the example below we run module load Python, create a virtualenv in the folder venv-3DDL, and activate it with source venv-3DDL/bin/activate. We then install the required packages, for example with pip install -r requirements-3DDL.txt, or by installing a package from an explicitly specified location.

Code Block

language	none

[apiyatum@tara-frontend-1 segmed]$ module load Python 
[apiyatum@tara-frontend-1 segmed]$ virtualenv venv-3DDL
...
[apiyatum@tara-frontend-1 segmed]$ source venv-3DDL/bin/activate
(venv-3DDL) [apiyatum@tara-frontend-1 segmed]$ pip list
Package    Version
---------- -------
pip        21.1.3
setuptools 57.4.0
wheel      0.36.2
(venv-3DDL) [apiyatum@tara-frontend-1 segmed]$ pip install -r requirements-3DDL.txt
...
ERROR: Could not find a version that satisfies the requirement PyTorch3d==0.5.0 (from versions: 0.0.1)
ERROR: No matching distribution found for PyTorch3d==0.5.0

Note

ERROR: Could not find a version that satisfies the requirement PyTorch3d==0.5.0 (from versions: 0.0.1)

ERROR: No matching distribution found for PyTorch3d==0.5.0

Based on the error above and a search on the internet (read more), we can install a specific build of PyTorch3D that matches the versions of Python, CUDA, and PyTorch in use on the TARA system, as follows:

Code Block
(venv-3DDL) [apiyatum@tara-frontend-1 segmed]$ pip install pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu102_pyt190/download.html
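
The directory name py39_cu102_pyt190 in the URL appears to encode the Python version (3.9), the CUDA version (10.2), and the PyTorch version (1.9.0). The sketch below builds such a tag from those three values. The naming pattern is inferred from this single URL, the function name is ours, and a wheel is not guaranteed to exist for every combination.

```python
def pytorch3d_wheel_tag(python_version, cuda_version, torch_version):
    """Build a tag such as 'py39_cu102_pyt190' (pattern assumed from the URL above)."""
    major, minor = python_version[:2]
    torch_plain = torch_version.split("+")[0]  # drop a local suffix like '+cu102'
    return "py{}{}_cu{}_pyt{}".format(
        major, minor, cuda_version.replace(".", ""), torch_plain.replace(".", "")
    )


tag = pytorch3d_wheel_tag((3, 9), "10.2", "1.9.0+cu102")
url = f"https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/{tag}/download.html"
print(tag)  # py39_cu102_pyt190
print(url)
```

On a live system the three inputs could be taken from sys.version_info, torch.version.cuda, and torch.__version__.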

Test with sinteract

sinteract -p gpu

Once the installation has finished, we test the program we want to run with $ python basic-multigpu-pytorch.py. Because we need a GPU in this case, we request resources interactively with sinteract for a quick initial test.

Code Block

[apiyatum@tara-frontend-1 segmed]$ sinteract -p gpu

No account specified. Please select account to charge from:

  [1] pre0005
  ...
  [6] thaisc

  [q] Quit

Please type a selection: 6
Running interactive job using thaisc account
srun: job 1314560 queued and waiting for resources
srun: job 1314560 has been allocated resources
[apiyatum@tara-g-001 segmed]$

Notice that the resource request has succeeded and that the prompt has moved from the frontend-1 node to tara-g-001, that is, gpu node number 001.
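
A script can make the same check before it starts heavy work. The sketch below classifies a hostname by its prefix. The prefixes are inferred only from the hostnames that appear in this article (tara-frontend-1, tara-c-001, tara-g-001), the function name is ours, and other node types are reported as unknown.

```python
import socket


def node_kind(hostname):
    """Classify a TARA hostname by prefix (prefixes assumed from this article)."""
    short = hostname.split(".")[0]
    if short.startswith("tara-frontend-"):
        return "frontend"
    if short.startswith("tara-g-"):
        return "gpu"
    if short.startswith("tara-c-"):
        return "compute"
    return "unknown"


print(node_kind(socket.gethostname()))
```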

output

Next, activate the virtual environment and test the program with the packages installed in the virtualenv.

Code Block

[apiyatum@tara-g-001 segmed]$ source venv-3DDL/bin/activate
(venv-3DDL) [apiyatum@tara-g-001 segmed]$ python basic-multigpu-pytorch.py 
torch version :  1.9.0+cu102
cuda available? :  True
cuda version:  10.2
cuda device count:  2
cuda device id: , 0, 1
Let's use 2 GPUs!
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

Install the required package modules in a Singularity container and test them

Reference: Building a Singularity container in 5 easy steps

Create a Singularity container

The example below was carried out on a local Linux machine with Singularity installed, following the reference document. In addition, we run the example Python program (basic-multigpu-pytorch.py) inside the container to confirm that the correct package modules are installed and ready to use.

Code Block

language	none

[apiyatum@localhost ~]$ singularity pull docker://nvcr.io/nvidia/pytorch:21.07-py3
[apiyatum@localhost ~]$ sudo singularity build --sandbox mysandbox/ pytorch_21.07-py3.sif
INFO:    Starting build...
INFO:    Creating sandbox directory...
INFO:    Build complete: mysandbox/

[apiyatum@localhost ~]$ singularity shell --writable mysandbox/
Singularity mysandbox:/> 
Singularity mysandbox:/> pip install torchsummary
Singularity mysandbox:/> pip install pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu102_pyt190/download.html
Singularity mysandbox:/> python basic-multigpu-pytorch.py 
torch version :  1.8.1+cu102
cuda available? :  False
cuda version:  10.2
cuda device count:  0
cuda device id: 
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

Singularity mysandbox:/> exit

[apiyatum@localhost ~]$ singularity build pytorch_21.07-py3-3DDL.sif mysandbox/ 
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: pytorch_21.07-py3-3DDL.sif
[apiyatum@localhost ~]$

The output of the test inside the container shows no CUDA device, because the Linux machine we are on has no GPUs. The result still shows that the required package modules can be used.

Test with sbatch

Prepare the submission script

This submission script uses the partition “dgx-preempt”. We already know that a dgx node has 8 GPUs.

Code Block

#!/bin/bash
#SBATCH -p dgx-preempt
#SBATCH -N 1 --ntasks-per-node=40
#SBATCH -t 00:10:00
#SBATCH -J 3DDL
#SBATCH -A thaisc

module purge
module load Singularity

singularity exec --nv pytorch_21.07-py3-3DDL.sif python basic-multigpu-pytorch.py

Note
Do not forget the “--nv” flag. Without it, the container cannot use the GPUs.

output

Code Block

$ cat slurm-1314656.out
torch version :  1.8.1+cu102
cuda available? :  True
cuda version:  10.2
cuda device count:  8
cuda device id: , 0, 1, 2, 3, 4, 5, 6, 7
Let's use 8 GPUs!
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
	In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
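
The per-GPU sizes in this output (seven chunks of 4 and one of 2 for a batch of 30, and only five chunks of 2 for the final batch of 10) are what one gets when the batch is cut into equal chunks of ceil(batch / n_gpus) rows, the way torch.chunk does. The sketch below reproduces the sizes in plain Python. The function name is ours, and the chunking rule is an assumption that matches the outputs shown in this article.

```python
import math


def chunk_sizes(batch, n_gpus):
    """Row counts per replica when a batch is cut into chunks of ceil(batch / n_gpus)."""
    if n_gpus < 1:
        return [batch]  # no GPU: the whole batch is processed in one piece
    step = math.ceil(batch / n_gpus)
    sizes = []
    remaining = batch
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(30, 8))  # [4, 4, 4, 4, 4, 4, 4, 2]
print(chunk_sizes(10, 8))  # [2, 2, 2, 2, 2]
print(chunk_sizes(30, 2))  # [15, 15]
```

With a batch of 10 on 8 GPUs only five replicas receive data, so a batch size that is a multiple of the GPU count keeps all devices busy.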

...

Test the program on HPC with sinteract

The model runs shown above were done by accessing resources through sinteract, which lets us use the compute partitions interactively, as its name suggests. The steps are as follows:

Code Block
[uaccount@tara-frontend-1]$ module load Python

Because we will use the python command, we load the module first. We then select the environment we set up for our program with virtualenv, which places us inside the activated environment.

Here that is venv-3DDL.

Code Block
[uaccount@tara-frontend-1]$ source venv-3DDL/bin/activate
(venv-3DDL) [uaccount@tara-frontend-1]$

Once the program and the package environment it needs are in place, we call sinteract for the compute partition we want, for example gpu, with the command sinteract -p gpu, and then choose the account to charge Service Units (SU) to. This example selects project number 6. Slurm then allocates the resources and moves us from tara-frontend-1 to tara-g-001, where we can test our program on the compute node as in the examples above.

Code Block

(venv-3DDL) [uaccount@tara-frontend-1]$ sinteract -p gpu

 No account specified. Please select account to charge from:    
[1] pre0005   
...   
[6] thaisc    
[q] Quit  

Please type a selection: 6 
Running interactive job using thaisc account 
srun: job 1314560 queued and waiting for resources 
srun: job 1314560 has been allocated resources 
(venv-3DDL) [uaccount@tara-g-001]$ python basic-multigpu-pytorch.py
…

Note
That said, we currently recommend using Distributed Data Parallel (DDP) in PyTorch, which generally runs considerably faster than DataParallel. Read more here.

Version	Old Version 4	New Version Current
Changes made by	Apivadee Piyatumrong	Apivadee Piyatumrong
Saved on	Aug 27, 2021	Sept 20, 2022
