user0018@10-111-146-12:~/cuda-samples-9.0/samples/0_Simple/simpleP2P$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
> GPU0 = "Tesla V100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla V100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU2 = "Tesla V100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
> GPU3 = "Tesla V100-SXM2-16GB" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla V100-SXM2-16GB (GPU0) supports UVA: Yes
> Tesla V100-SXM2-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 44.84GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
user0018@10-111-146-12:~/cuda-samples-9.0/samples/0_Simple/simpleP2P$ cd ~/cuda-samples-9.0/samples/1_Utilities
user0018@10-111-146-12:~/cuda-samples-9.0/samples/1_Utilities$ ls
bandwidthTest  deviceQueryDrv           topologyQuery
deviceQuery    p2pBandwidthLatencyTest
user0018@10-111-146-12:~/cuda-samples-9.0/samples/1_Utilities$ cd p2pBandwidthLatencyTest
user0018@10-111-146-12:~/cuda-samples-9.0/samples/1_Utilities/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 62, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 8a, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.
P2P Connectivity Matrix
   D\D      0      1      2      3
     0      1      1      1      1
     1      1      1      1      1
     2      1      1      1      1
     3      1      1      1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3
     0 748.32   9.92  11.02  11.03
     1   9.96 742.63  11.02  11.07
     2  11.05  11.03 744.05   9.94
     3  11.04  11.01   9.92 745.47
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3
     0 744.05  47.91  47.95  47.89
     1  47.90 745.47  47.96  47.96
     2  47.91  47.97 746.89  48.00
     3  47.97  47.90  47.97 745.47
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3
     0 764.43  10.38  18.92  17.99
     1  10.42 765.93  18.08  17.56
     2  18.79  18.23 763.69  10.41
     3  17.86  17.49  10.43 762.94
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3
     0 762.20  95.58  95.67  95.58
     1  95.37 765.93  95.67  95.65
     2  95.37  95.39 765.93  95.44
     3  95.48  95.46  95.51 769.70
P2P=Disabled Latency Matrix (us)
   D\D      0      1      2      3
     0   4.21  20.29  19.84  19.75
     1  20.36   4.27  20.36  20.39
     2  20.30  20.31   3.92  18.73
     3  19.48  19.49  18.69   3.26
P2P=Enabled Latency Matrix (us)
   D\D      0      1      2      3
     0   4.14   7.38   7.32   7.02
     1   8.32   4.37   7.45   7.52
     2   6.80   6.84   4.04   6.79
     3   6.81   6.75   6.88   3.39
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
user0018@10-111-146-12:~/cuda-samples-9.0/samples/1_Utilities/p2pBandwidthLatencyTest$
Ubuntu 16.04.5, Tesla V100-SXM2 16GB x4, CUDA 9.0 Samples: simpleP2P, p2pBandwidthLatencyTest, etc.
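
To put the two unidirectional bandwidth matrices side by side, the sketch below averages the off-diagonal (GPU-to-GPU) entries with P2P disabled and enabled and reports their ratio. The numbers are copied from the log above; the variable names and the helper `off_diagonal_mean` are hypothetical and are not part of the CUDA samples. As the sample's own note says, these figures are not meant as performance measurements, so the ratio is only a rough indication.

```python
from statistics import mean

# Unidirectional bandwidth in GB/s, copied from the p2pBandwidthLatencyTest log.
# Rows and columns are device indices 0..3, as in the D\D matrices.
bw_p2p_disabled = [
    [748.32, 9.92, 11.02, 11.03],
    [9.96, 742.63, 11.02, 11.07],
    [11.05, 11.03, 744.05, 9.94],
    [11.04, 11.01, 9.92, 745.47],
]
bw_p2p_enabled = [
    [744.05, 47.91, 47.95, 47.89],
    [47.90, 745.47, 47.96, 47.96],
    [47.91, 47.97, 746.89, 48.00],
    [47.97, 47.90, 47.97, 745.47],
]


def off_diagonal_mean(matrix):
    """Average the entries whose row and column device indices differ."""
    values = [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j  # skip the diagonal, which is a copy within one GPU
    ]
    return mean(values)


avg_disabled = off_diagonal_mean(bw_p2p_disabled)
avg_enabled = off_diagonal_mean(bw_p2p_enabled)
ratio = avg_enabled / avg_disabled

print(f"P2P disabled: {avg_disabled:.2f} GB/s average between GPUs")
print(f"P2P enabled:  {avg_enabled:.2f} GB/s average between GPUs")
print(f"Enabled / disabled ratio: {ratio:.2f}")
```

The same helper can be pointed at the bidirectional or latency matrices by pasting those values into new lists; the diagonal is skipped because it measures a copy within a single device rather than a transfer between GPUs.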