ในบทความนี้จะกล่าวถึงตัวอย่างการติดตั้ง Python package ที่เกี่ยวข้องกับ PyTorch และการทดสอบรันงานบน gpu node ของ TARA HPC โดยใช้ตัวอย่าง Package modules อย่างง่ายจากการทำ 3D-Deep Learning ด้วย PyTorch

Table of Contents

ตัวอย่างโปรแกรม basic multi-GPUs PyTorch

ผู้ใช้งานสามารถ copy ไฟล์โปรแกรม ตัวอย่างโปรแกรม basic multi-GPUs PyTorch ที่ใช้ในบทความนี้ได้ในระบบ TARA ที่

/tarafs/data/project/common/AI/examples/basic-multigpu-pytorch.py

import PyTorch modules และกำหนด parameters

Code Block

language	py

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

สร้าง dummy dataset ขึ้นมาด้วยการ random

Code Block

language	py

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

สร้าง simple model

ที่แค่รับ input แล้วทำ linear operation แล้วส่งออก output เพื่อแสดงการทำงานของ DataParallel ซึ่งก็คือส่วนที่ทำให้เกิดการใช้งาน multi-GPUs ได้ด้วยการแบ่งข้อมูลออกไปที่แต่ละ GPUs

...

Code Block

language	py

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

สร้าง object device

ซึ่งจะเป็นจุดที่เราส่ง tensor/model ให้กับ device ที่เรามีเพื่อการทำงาน

...

Code Block

language	py

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

สร้าง model instance และการรันงานแบบ nn.DataParallel

โดยจุดที่เราสนใจเป็นพิเศษในบทความนี้คือ หากในเครื่องที่เราได้รันงานมี GPUs หลายตัว เราจะสามารถ wrap model ของเราด้วย nn.DataParallel ได้ จากนั้นจึงนำ model ที่ได้ส่งให้กับ device ที่เรากำหนดไว้ก่อนหน้านี้

...

ในขั้นตอนนี้ถ้าหากเรามี GPU มากกว่าหนึ่งตัว เช่น gpu node ในระบบ TARA ก็จะได้ผลลัพธ์ออกมาเป็นLet’s use 2

Code Block
Let's use 2 GPUs!

run the model

สั่งให้พิมพ์ขนาดของ input/output tensors ออกมาให้ดูด้วย

Code Block

language	py

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())

ผลลัพธ์เมื่อรันด้วย cpu node

ดังนั้นหาก machine ที่เราใช้ไม่มี GPU เช่น
รันใน compute node (tara-c-001) บนระบบ TARA แบบ sinteract จะได้ output ดังแสดงด้านล่าง ซึ่งจะพบว่าไม่มีการแบ่งข้อมูลออกไปเนื่องจากไม่มี GPU และเป็นการรันงานบน cpu

Code Block

(venv-3DDL)[uaccount@tara-c-001]$ python basic-multigpu-pytorch.py 
torch version :  1.9.0+cu102
cuda available? :  False
cuda version:  10.2
cuda device count:  0
cuda device id: 
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
(venv-3DDL)[uaccount@tara-c-001]$

ผลลัพธ์เมื่อรันด้วย gpu node

และหากเครื่องที่ได้ใช้มี GPUs เช่น รันใน gpu node บนระบบ TARA (tara-g-001) แบบ sinteract จะได้ output แบบนี้

...

ซึ่งจะเห็นได้ว่ามีการแบ่งข้อมูลไปรันโมเดลที่ GPUs ทั้งสองตัว

ทดสอบโปรแกรมบน HPC ด้วย sinteract

จากตัวอย่างข้างต้นที่ได้ทำการรันโมเดลให้ดู เกิดจากการเข้าถึงทรัพยากรแบบ sinteract เพื่อสามารถใช้งาน compute partition ต่างๆ ได้แบบ interaction สมกับชื่อของ sinteract โดยมีวิธีการดังนี้

...

เมื่อโปรแกรมและ package environment ที่ต้องการในการรันโปรแกรมครบถ้วนแล้วจึงเรียก sinteract ไปยัง compute partition ที่เราต้องการ เช่น gpu ด้วยคำสั่ง sinteract -p gpu แล้วทำการเลือกแหล่งตัดยอด Service Unit (SU) ซึ่งในตัวอย่างนี้เลือกโครงการที่ 6 จากนั้น slurm ได้ allocate resource ให้ ทำให้เราได้ย้ายจากเครื่อง tara-frontend-1 ไปยัง tara-g-001 แล้วจึงเริ่มทำการทดสอบโปรแกรมของเราบน compute node ที่ต้องการได้ในตัวอย่างข้างต้น

Code Block

(venv-3DDL) [uaccount@tara-frontend-1]$ sinteract -p gpu

 No account specified. Please select account to charge from:    
[1] pre0005   
...   
[6] thaisc    
[q] Quit  

Please type a selection: 6 
Running interactive job using thaisc account 
srun: job 1314560 queued and waiting for resources 
srun: job 1314560 has been allocated resources 
(venv-3DDL) [uaccount@tara-g-001]$ python basic-multigpu-pytorch.py
…

Note
อย่างไรก็ดี ปัจจุบันเราแนะนำให้คุณใช้ Distributed Data Parallel (DDP) ใน PyTorch ซึ่งจะรันได้รวดเร็วกว่า DataParallel มาก อ่านได้ที่นี่

Version	Old Version 9	New Version Current
Changes made by	Apivadee Piyatumrong	Apivadee Piyatumrong
Saved on	Sept 20, 2021	Sept 20, 2022

Versions Compared

Key

ตัวอย่างโปรแกรม basic multi-GPUs PyTorch

import PyTorch modules และกำหนด parameters

สร้าง dummy dataset ขึ้นมาด้วยการ random

สร้าง simple model

สร้าง object device

สร้าง model instance และการรันงานแบบ nn.DataParallel

run the model

ผลลัพธ์เมื่อรันด้วย cpu node

ผลลัพธ์เมื่อรันด้วย gpu node

ทดสอบโปรแกรมบน HPC ด้วย sinteract

Related articles

Page Comparison

Versions Compared

Key

ตัวอย่างโปรแกรม basic multi-GPUs PyTorch

import PyTorch modules และกำหนด parameters

สร้าง dummy dataset ขึ้นมาด้วยการ random

สร้าง simple model

สร้าง object device

สร้าง model instance และการรันงานแบบ nn.DataParallel

run the model

ผลลัพธ์เมื่อรันด้วย cpu node

ผลลัพธ์เมื่อรันด้วย gpu node

ทดสอบโปรแกรมบน HPC ด้วย sinteract

Related articles