...

Code Block (bash)
WORLD_SIZE=8
MASTER_PORT=15347
MASTER_ADDR=tara-dgx1-002
Hello from rank 1 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 3 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 0 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 2 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 5 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 4 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 6 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 7 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Group initialized? True
host: tara-dgx1-002.tara.nstda.or.th, rank: 1, local_rank: 1
host: tara-dgx1-002.tara.nstda.or.th, rank: 3, local_rank: 3
host: tara-dgx1-002.tara.nstda.or.th, rank: 4, local_rank: 4
host: tara-dgx1-002.tara.nstda.or.th, rank: 6, local_rank: 6
host: tara-dgx1-002.tara.nstda.or.th, rank: 0, local_rank: 0
host: tara-dgx1-002.tara.nstda.or.th, rank: 2, local_rank: 2
host: tara-dgx1-002.tara.nstda.or.th, rank: 5, local_rank: 5
host: tara-dgx1-002.tara.nstda.or.th, rank: 7, local_rank: 7
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.297117
Train Epoch: 1 [640/60000 (8%)] Loss: 1.329343
Train Epoch: 1 [1280/60000 (17%)]       Loss: 0.518520
Train Epoch: 1 [1920/60000 (25%)]       Loss: 0.331641
Train Epoch: 1 [2560/60000 (34%)]       Loss: 0.256029
Train Epoch: 1 [3200/60000 (42%)]       Loss: 0.126544
Train Epoch: 1 [3840/60000 (51%)]       Loss: 0.129393
Train Epoch: 1 [4480/60000 (59%)]       Loss: 0.135831
Train Epoch: 1 [5120/60000 (68%)]       Loss: 0.094554
Train Epoch: 1 [5760/60000 (76%)]       Loss: 0.131771
Train Epoch: 1 [6400/60000 (85%)]       Loss: 0.078105
Train Epoch: 1 [7040/60000 (93%)]       Loss: 0.078772
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.315368
Train Epoch: 1 [640/60000 (8%)] Loss: 1.471632
Train Epoch: 1 [1280/60000 (17%)]       Loss: 0.394169
Train Epoch: 1 [1920/60000 (25%)]       Loss: 0.376319
...
Train Epoch: 14 [5120/60000 (68%)]      Loss: 0.003920
Train Epoch: 14 [5760/60000 (76%)]      Loss: 0.105166
Train Epoch: 14 [6400/60000 (85%)]      Loss: 0.020963
Train Epoch: 14 [7040/60000 (93%)]      Loss: 0.071237

Test set: Average loss: 0.0298, Accuracy: 9897/10000 (99%)

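The log above shows each of the 8 tasks reporting its global rank (0-7), a shared world size, and a local_rank that selects one of the node's 8 GPUs. DDP.py itself is not shown on this page, so the following is only a minimal sketch of how such a script typically derives these values under SLURM; the environment-variable fallbacks and the 8-GPU constant are illustrative assumptions, not values taken from the real script.

```python
import os

# Hypothetical sketch (DDP.py is not shown here). Under SLURM, each task
# is started with SLURM_PROCID (global rank), SLURM_NTASKS (world size)
# and SLURM_LOCALID (rank within the node). The defaults below are
# illustrative fallbacks for running outside the scheduler.
rank = int(os.environ.get("SLURM_PROCID", "0"))
world_size = int(os.environ.get("SLURM_NTASKS", "1"))
local_rank = int(os.environ.get("SLURM_LOCALID", str(rank % 8)))  # DGX-1: 8 GPUs per node

print(f"Hello from rank {rank} of {world_size}, local_rank {local_rank}")

# With PyTorch installed and MASTER_ADDR/MASTER_PORT exported (as in the
# log above), the process group would then be initialized roughly like:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
#   print("Group initialized?", dist.is_initialized())
```

The `local_rank` is what the training script passes to `torch.cuda.set_device` so that each process drives exactly one GPU.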
Output of DDP with 2 DGX-1 Nodes (16 GPUs)

The output of script-N-2-worldsize-16.sh and DDP.py is shown below:
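Before looking at the two-node output, it helps to see how 16 global ranks spread across two DGX-1 nodes. The snippet below is an illustrative calculation (not code from the actual scripts): with WORLD_SIZE=16 and 8 GPUs per node, each global rank maps to a node index and a local_rank that picks one GPU on that node.

```python
# Illustrative mapping for a 2-node, 16-GPU run (assumption: ranks are
# assigned in contiguous blocks of 8 per node, the usual SLURM layout).
GPUS_PER_NODE = 8  # DGX-1
WORLD_SIZE = 16

placement = {}
for rank in range(WORLD_SIZE):
    node = rank // GPUS_PER_NODE        # 0 for ranks 0-7, 1 for ranks 8-15
    local_rank = rank % GPUS_PER_NODE   # GPU index within that node
    placement[rank] = (node, local_rank)

# For example, global rank 9 runs on the second node and uses GPU 1:
print(placement[9])
```

This is why, in the multi-node output, the same local_rank values 0-7 appear once per hostname while the global ranks run from 0 to 15.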

...