{"id":7866,"date":"2021-11-27T04:17:57","date_gmt":"2021-11-26T19:17:57","guid":{"rendered":"https:\/\/wp.study3.biz\/?p=7866"},"modified":"2025-12-13T18:29:20","modified_gmt":"2025-12-13T09:29:20","slug":"amd-epyc-7763-64-core-processor-x2-256gb-almalinux-release-8-4-titan-rtx-x2-cuda-11-3-samples%e3%81%a7simplep2p-p2pbandwidthlatencytest-bandwidthtest-devicequery%e3%82%92%e8%a1%a8%e7%a4%ba%e3%81%95","status":"publish","type":"post","link":"https:\/\/wp.study3.biz\/?p=7866","title":{"rendered":"\u7b2c3\u4e16\u4ee3 AMD EPYC 7763 64-Core Processor x2 256GB AlmaLinux release 8.4  TITAN RTX x2 CUDA 11.3 Samples\u3067simpleP2P p2pBandwidthLatencyTest bandwidthTest deviceQuery\u3092\u8868\u793a\u3055\u305b\u3066\u307f\u305f"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/wp.study3.biz\/wp-content\/uploads\/2021\/11\/81afb846d7f62eddbdfb187ec4ab74a2.png\" alt=\"\" width=\"3840\" height=\"2160\" class=\"alignnone size-full wp-image-9685\" \/><br \/>\n[chibi@alma8 simpleP2P]$ .\/simpleP2P<br \/>\n[.\/simpleP2P] &#8211; Starting&#8230;<br \/>\nChecking for multiple GPUs&#8230;<br \/>\nCUDA-capable device count: 2<\/p>\n<p>Checking GPU(s) for support of peer to peer memory access&#8230;<br \/>\n> Peer access from NVIDIA TITAN RTX (GPU0) -> NVIDIA TITAN RTX (GPU1) : Yes<br \/>\n> Peer access from NVIDIA TITAN RTX (GPU1) -> NVIDIA TITAN RTX (GPU0) : Yes<br \/>\nEnabling peer access between GPU0 and GPU1&#8230;<br \/>\nAllocating buffers (64MB on GPU0, GPU1 and CPU Host)&#8230;<br \/>\nCreating event handles&#8230;<br \/>\ncudaMemcpyPeer \/ cudaMemcpy between GPU0 and GPU1: <strong>43.53GB\/s<\/strong><br \/>\nPreparing host buffer and memcpy to GPU0&#8230;<br \/>\nRun kernel on GPU1, taking source data from GPU0 and writing to GPU1&#8230;<br \/>\nRun kernel on GPU0, taking source data from GPU1 and writing to GPU0&#8230;<br \/>\nCopy data back to host from GPU0 and verify results&#8230;<br \/>\nDisabling peer access&#8230;<br \/>\nShutting 
down&#8230;<br \/>\nTest passed<br \/>\n[chibi@alma8 simpleP2P]$<br \/>\n[chibi@alma8 p2pBandwidthLatencyTest]$ .\/p2pBandwidthLatencyTest<br \/>\n[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]<br \/>\nDevice: 0, NVIDIA TITAN RTX, pciBusID: 1, pciDeviceID: 0, pciDomainID:0<br \/>\nDevice: 1, NVIDIA TITAN RTX, pciBusID: 41, pciDeviceID: 0, pciDomainID:0<br \/>\nDevice=0 CAN Access Peer Device=1<br \/>\nDevice=1 CAN Access Peer Device=0<\/p>\n<p>***NOTE: In case a device doesn&#8217;t have P2P access to other one, it falls back to normal memcopy procedure.<br \/>\nSo you can see lesser Bandwidth (GB\/s) and unstable Latency (us) in those cases.<\/p>\n<p>P2P Connectivity Matrix<br \/>\n     D\\D     0     1<br \/>\n     0       1     1<br \/>\n     1       1     1<br \/>\nUnidirectional P2P=Disabled Bandwidth Matrix (GB\/s)<br \/>\n   D\\D     0      1<br \/>\n     0 564.63   6.01<br \/>\n     1   6.03 562.86<br \/>\nUnidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB\/s)<br \/>\n   D\\D     0      1<br \/>\n     0 541.78  <strong>47.10<\/strong><br \/>\n     1  <strong>47.10<\/strong> 562.56<br \/>\nBidirectional P2P=Disabled Bandwidth Matrix (GB\/s)<br \/>\n   D\\D     0      1<br \/>\n     0 549.79   8.91<br \/>\n     1   8.87 552.93<br \/>\nBidirectional P2P=Enabled Bandwidth Matrix (GB\/s)<br \/>\n   D\\D     0      1<br \/>\n     0 552.95  <strong>94.13<\/strong><br \/>\n     1  <strong>94.11<\/strong> 551.51<br \/>\nP2P=Disabled Latency Matrix (us)<br \/>\n   GPU     0      1<br \/>\n     0   1.38  24.39<br \/>\n     1  14.22   1.37<\/p>\n<p>   CPU     0      1<br \/>\n     0   3.26  10.13<br \/>\n     1   9.99   3.15<br \/>\nP2P=Enabled Latency (P2P Writes) Matrix (us)<br \/>\n   GPU     0      1<br \/>\n     0   1.30   1.79<br \/>\n     1   1.79   1.36<\/p>\n<p>   CPU     0      1<br \/>\n     0   3.20   2.62<br \/>\n     1   2.67   3.28<\/p>\n<p>NOTE: The CUDA Samples are not meant for performance measurements. 
Results may vary when GPU Boost is enabled.<br \/>\n[chibi@alma8 p2pBandwidthLatencyTest]$<br \/>\n[chibi@alma8 bandwidthTest]$ .\/bandwidthTest<br \/>\n[CUDA Bandwidth Test] &#8211; Starting&#8230;<br \/>\nRunning on&#8230;<\/p>\n<p> Device 0: NVIDIA TITAN RTX<br \/>\n Quick Mode<\/p>\n<p> Host to Device Bandwidth, 1 Device(s)<br \/>\n PINNED Memory Transfers<br \/>\n   Transfer Size (Bytes)        Bandwidth(GB\/s)<br \/>\n   32000000                     13.1<\/p>\n<p> Device to Host Bandwidth, 1 Device(s)<br \/>\n PINNED Memory Transfers<br \/>\n   Transfer Size (Bytes)        Bandwidth(GB\/s)<br \/>\n   32000000                     13.2<\/p>\n<p> Device to Device Bandwidth, 1 Device(s)<br \/>\n PINNED Memory Transfers<br \/>\n   Transfer Size (Bytes)        Bandwidth(GB\/s)<br \/>\n   32000000                     <strong>540.2<\/strong><\/p>\n<p>Result = PASS<\/p>\n<p>NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.<br \/>\n[chibi@alma8 bandwidthTest]$<br \/>\n[chibi@alma8 deviceQuery]$ .\/deviceQuery<br \/>\n.\/deviceQuery Starting&#8230;<\/p>\n<p> CUDA Device Query (Runtime API) version (CUDART static linking)<\/p>\n<p>Detected 2 CUDA Capable device(s)<\/p>\n<p>Device 0: &#8220;NVIDIA TITAN RTX&#8221;<br \/>\n  CUDA Driver Version \/ Runtime Version          11.3 \/ 11.3<br \/>\n  CUDA Capability Major\/Minor version number:    7.5<br \/>\n  Total amount of global memory:                 24220 MBytes (25396838400 bytes)<br \/>\n  (072) Multiprocessors, (064) CUDA Cores\/MP:    4608 CUDA Cores<br \/>\n  GPU Max Clock rate:                            1770 MHz (1.77 GHz)<br \/>\n  Memory Clock rate:                             7001 Mhz<br \/>\n  Memory Bus Width:                              384-bit<br \/>\n  L2 Cache Size:                                 6291456 bytes<br \/>\n  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)<br \/>\n  Maximum 
Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers<br \/>\n  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers<br \/>\n  Total amount of constant memory:               65536 bytes<br \/>\n  Total amount of shared memory per block:       49152 bytes<br \/>\n  Total shared memory per multiprocessor:        65536 bytes<br \/>\n  Total number of registers available per block: 65536<br \/>\n  Warp size:                                     32<br \/>\n  Maximum number of threads per multiprocessor:  1024<br \/>\n  Maximum number of threads per block:           1024<br \/>\n  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)<br \/>\n  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)<br \/>\n  Maximum memory pitch:                          2147483647 bytes<br \/>\n  Texture alignment:                             512 bytes<br \/>\n  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)<br \/>\n  Run time limit on kernels:                     Yes<br \/>\n  Integrated GPU sharing Host Memory:            No<br \/>\n  Support host page-locked memory mapping:       Yes<br \/>\n  Alignment requirement for Surfaces:            Yes<br \/>\n  Device has ECC support:                        Disabled<br \/>\n  Device supports Unified Addressing (UVA):      Yes<br \/>\n  Device supports Managed Memory:                Yes<br \/>\n  Device supports Compute Preemption:            Yes<br \/>\n  Supports Cooperative Kernel Launch:            Yes<br \/>\n  Supports MultiDevice Co-op Kernel Launch:      Yes<br \/>\n  Device PCI Domain ID \/ Bus ID \/ location ID:   0 \/ 1 \/ 0<br \/>\n  Compute Mode:<br \/>\n     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) ><\/p>\n<p>Device 1: &#8220;NVIDIA TITAN RTX&#8221;<br \/>\n  CUDA Driver Version \/ Runtime Version          11.3 \/ 11.3<br \/>\n  CUDA Capability Major\/Minor version number:    7.5<br \/>\n  Total 
amount of global memory:                 24218 MBytes (25394348032 bytes)<br \/>\n  (072) Multiprocessors, (064) CUDA Cores\/MP:    4608 CUDA Cores<br \/>\n  GPU Max Clock rate:                            1770 MHz (1.77 GHz)<br \/>\n  Memory Clock rate:                             7001 Mhz<br \/>\n  Memory Bus Width:                              384-bit<br \/>\n  L2 Cache Size:                                 6291456 bytes<br \/>\n  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)<br \/>\n  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers<br \/>\n  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers<br \/>\n  Total amount of constant memory:               65536 bytes<br \/>\n  Total amount of shared memory per block:       49152 bytes<br \/>\n  Total shared memory per multiprocessor:        65536 bytes<br \/>\n  Total number of registers available per block: 65536<br \/>\n  Warp size:                                     32<br \/>\n  Maximum number of threads per multiprocessor:  1024<br \/>\n  Maximum number of threads per block:           1024<br \/>\n  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)<br \/>\n  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)<br \/>\n  Maximum memory pitch:                          2147483647 bytes<br \/>\n  Texture alignment:                             512 bytes<br \/>\n  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)<br \/>\n  Run time limit on kernels:                     No<br \/>\n  Integrated GPU sharing Host Memory:            No<br \/>\n  Support host page-locked memory mapping:       Yes<br \/>\n  Alignment requirement for Surfaces:            Yes<br \/>\n  Device has ECC support:                        Disabled<br \/>\n  Device supports Unified Addressing (UVA):      Yes<br \/>\n  Device supports Managed Memory:                Yes<br \/>\n  Device 
supports Compute Preemption:            Yes<br \/>\n  Supports Cooperative Kernel Launch:            Yes<br \/>\n  Supports MultiDevice Co-op Kernel Launch:      Yes<br \/>\n  Device PCI Domain ID \/ Bus ID \/ location ID:   0 \/ 65 \/ 0<br \/>\n  Compute Mode:<br \/>\n     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) ><br \/>\n> Peer access from NVIDIA TITAN RTX (GPU0) -> NVIDIA TITAN RTX (GPU1) : Yes<br \/>\n> Peer access from NVIDIA TITAN RTX (GPU1) -> NVIDIA TITAN RTX (GPU0) : Yes<\/p>\n<p>deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.3, NumDevs = 2<br \/>\nResult = PASS<br \/>\n[chibi@alma8 deviceQuery]$<br \/>\n<a href=\"https:\/\/wp.study3.biz\/wp-content\/uploads\/2021\/07\/AMD-EPYC-7763-64-Core-Processor-x2-256GB-AlmaLinux-release-8.4-TITAN-RTX-x2-CUDA-11.3-Samples-simpleP2P-p2pBandwidthLatencyTest-bandwidthTest-deviceQuery-nvidia-smi-nvlink-c.txt\">AMD EPYC 7763 64-Core Processor x2 256GB AlmaLinux release 8.4 TITAN RTX x2 CUDA 11.3 Samples simpleP2P p2pBandwidthLatencyTest bandwidthTest deviceQuery nvidia-smi nvlink -c<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[chibi@alma8 simpleP2P]$ .\/simpleP2P [.\/simpleP2P] &#8211; Starting&#8230; Checking for multiple GPUs&#8230; C &hellip; <a href=\"https:\/\/wp.study3.biz\/?p=7866\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[30,5,18],"tags":[],"class_list":["post-7866","post","type-post","status-publish","format-standard","hentry","category-2-sockets","category-centos8","category-nvidia"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/posts\/7866","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7866"}],"version-history":[{"count":4,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/posts\/7866\/revisions"}],"predecessor-version":[{"id":29177,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=\/wp\/v2\/posts\/7866\/revisions\/29177"}],"wp:attachment":[{"href":"https:\/\/wp.study3.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7866"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7866"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.study3.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}