...
```
WORLD_SIZE=8 MASTER_PORT=15347 MASTER_ADDR=tara-dgx1-002
Hello from rank 1 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 3 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 0 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 2 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 5 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 4 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 6 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Hello from rank 7 of 8 on tara-dgx1-002.tara.nstda.or.th where there are 8 allocated GPUs per node.
Group initialized? True
host: tara-dgx1-002.tara.nstda.or.th, rank: 1, local_rank: 1
host: tara-dgx1-002.tara.nstda.or.th, rank: 3, local_rank: 3
host: tara-dgx1-002.tara.nstda.or.th, rank: 4, local_rank: 4
host: tara-dgx1-002.tara.nstda.or.th, rank: 6, local_rank: 6
host: tara-dgx1-002.tara.nstda.or.th, rank: 0, local_rank: 0
host: tara-dgx1-002.tara.nstda.or.th, rank: 2, local_rank: 2
host: tara-dgx1-002.tara.nstda.or.th, rank: 5, local_rank: 5
host: tara-dgx1-002.tara.nstda.or.th, rank: 7, local_rank: 7
Train Epoch: 1 [0/60000 (0%)] Loss: 2.297117
Train Epoch: 1 [640/60000 (8%)] Loss: 1.329343
Train Epoch: 1 [1280/60000 (17%)] Loss: 0.518520
Train Epoch: 1 [1920/60000 (25%)] Loss: 0.331641
Train Epoch: 1 [2560/60000 (34%)] Loss: 0.256029
Train Epoch: 1 [3200/60000 (42%)] Loss: 0.126544
Train Epoch: 1 [3840/60000 (51%)] Loss: 0.129393
Train Epoch: 1 [4480/60000 (59%)] Loss: 0.135831
Train Epoch: 1 [5120/60000 (68%)] Loss: 0.094554
Train Epoch: 1 [5760/60000 (76%)] Loss: 0.131771
Train Epoch: 1 [6400/60000 (85%)] Loss: 0.078105
Train Epoch: 1 [7040/60000 (93%)] Loss: 0.078772
Train Epoch: 1 [0/60000 (0%)] Loss: 2.315368
Train Epoch: 1 [640/60000 (8%)] Loss: 1.471632
Train Epoch: 1 [1280/60000 (17%)] Loss: 0.394169
Train Epoch: 1 [1920/60000 (25%)] Loss: 0.376319
...
Train Epoch: 14 [5120/60000 (68%)] Loss: 0.003920
Train Epoch: 14 [5760/60000 (76%)] Loss: 0.105166
Train Epoch: 14 [6400/60000 (85%)] Loss: 0.020963
Train Epoch: 14 [7040/60000 (93%)] Loss: 0.071237

Test set: Average loss: 0.0298, Accuracy: 9897/10000 (99%)
```
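The `Hello from rank ...`, `Group initialized? True`, and `host: ..., rank: ..., local_rank: ...` lines above come from the process-group setup in DDP.py. The script itself is not reproduced at this point in the page, so the following is only a minimal sketch of that initialization, assuming the rank information comes from Slurm's `SLURM_PROCID`/`SLURM_LOCALID` variables and that `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are exported by the launch script, as shown in the first line of the output:

```python
# A minimal sketch (assumed, not the actual DDP.py) of the initialization
# that produces the "Hello from rank ..." and "Group initialized?" lines.
# Rank information is taken from Slurm; WORLD_SIZE, MASTER_ADDR, and
# MASTER_PORT are expected to be exported by the launch script.
import os
import socket

import torch
import torch.distributed as dist

def init_distributed():
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this process
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
    world_size = int(os.environ["WORLD_SIZE"])

    print(f"Hello from rank {rank} of {world_size} on {socket.gethostname()} "
          f"where there are {torch.cuda.device_count()} allocated GPUs per node.")

    # NCCL is the standard backend for multi-GPU training on NVIDIA hardware;
    # MASTER_ADDR/MASTER_PORT are read from the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    if rank == 0:
        print(f"Group initialized? {dist.is_initialized()}", flush=True)

    # Bind this process to its own GPU before wrapping the model in DDP.
    torch.cuda.set_device(local_rank)
    print(f"host: {socket.gethostname()}, rank: {rank}, local_rank: {local_rank}")
    return rank, local_rank

if __name__ == "__main__":
    init_distributed()
```

Note that the rank messages are interleaved in the output above because all eight processes print concurrently; the ordering differs from run to run.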
Output of DDP with 2 DGX-1 Nodes (16 GPUs)
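Scaling from one node to two mainly changes the Slurm resource request and the exported `WORLD_SIZE`. The real launch script is not reproduced here; the sketch below is only an illustration of what a two-node job of this shape might look like (all `#SBATCH` values are assumptions, not the actual contents of script-N-2-worldsize-16.sh, and any site-specific partition or module setup is omitted):

```bash
#!/bin/bash
# Illustrative sketch only -- NOT the actual script-N-2-worldsize-16.sh.
# The #SBATCH values below are assumptions matching the 2 x 8 GPU layout.
#SBATCH --job-name=ddp-2node
#SBATCH --nodes=2                 # two DGX-1 nodes
#SBATCH --ntasks-per-node=8       # one process per GPU
#SBATCH --gres=gpu:8

export WORLD_SIZE=16              # 2 nodes x 8 GPUs per node
export MASTER_PORT=15347          # same port as in the single-node output
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun python DDP.py
```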
The output of running script-N-2-worldsize-16.sh with DDP.py is shown below:
...