Related reading
CUDA C: https://blog.csdn.net/weixin_45791458/category_12530616.html?spm=1001.2014.3001.5482
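It is well worth knowing what your device is capable of. To that end, the CUDA runtime API provides several functions for querying device information. The following function returns the properties of a GPU device.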
__host__ cudaError_t cudaGetDeviceProperties (cudaDeviceProp* prop, int device)
Parameters
prop - Pointer to the device properties structure to fill in
device - Device number
The prop parameter is a pointer to the device properties structure cudaDeviceProp, which is defined as follows.
struct cudaDeviceProp {
    char name[256];
    cudaUUID_t uuid;
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    int clockRate;
    size_t totalConstMem;
    int major;
    int minor;
    size_t textureAlignment;
    size_t texturePitchAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int maxTexture1D;
    int maxTexture1DMipmap;
    int maxTexture1DLinear;
    int maxTexture2D[2];
    int maxTexture2DMipmap[2];
    int maxTexture2DLinear[3];
    int maxTexture2DGather[2];
    int maxTexture3D[3];
    int maxTexture3DAlt[3];
    int maxTextureCubemap;
    int maxTexture1DLayered[2];
    int maxTexture2DLayered[3];
    int maxTextureCubemapLayered[2];
    int maxSurface3D[3];
    int maxSurface1DLayered[2];
    int maxSurface2DLayered[3];
    int maxSurfaceCubemap;
    int maxSurfaceCubemapLayered[2];
    size_t surfaceAlignment;
    int concurrentKernels;
    int ECCEnabled;
    int pciBusID;
    int pciDeviceID;
    int pciDomainID;
    int tccDriver;
    int asyncEngineCount;
    int unifiedAddressing;
    int memoryClockRate;
    int memoryBusWidth;
    int l2CacheSize;
    int persistingL2CacheMaxSize;
    int maxThreadsPerMultiProcessor;
    int streamPrioritiesSupported;
    int globalL1CacheSupported;
    int localL1CacheSupported;
    size_t sharedMemPerMultiprocessor;
    int regsPerMultiprocessor;
    int managedMemory;
    int isMultiGpuBoard;
    int multiGpuBoardGroupID;
    int singleToDoublePrecisionPerfRatio;
    int pageableMemoryAccess;
    int concurrentManagedAccess;
    int computePreemptionSupported;
    int canUseHostPointerForRegisteredMem;
    int cooperativeLaunch;
    int cooperativeMultiDeviceLaunch;
    int pageableMemoryAccessUsesHostPageTables;
    int directManagedMemAccessFromHost;
    int accessPolicyMaxWindowSize;
};
The fields have the following meanings.
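name: ASCII string identifying the device.
uuid: 16-byte unique identifier.
totalGlobalMem: Total amount of global memory available on the device, in bytes.
sharedMemPerBlock: Maximum amount of shared memory available to a thread block, in bytes.
regsPerBlock: Maximum number of 32-bit registers available to a thread block.
warpSize: Warp size, in threads.
memPitch: Maximum pitch, in bytes, allowed for memory regions allocated by cudaMallocPitch().
maxThreadsPerBlock: Maximum number of threads per block.
maxThreadsDim: Maximum size of each dimension of a block (array of 3 elements).
maxGridSize: Maximum size of each dimension of a grid (array of 3 elements).
clockRate: Clock frequency, in kilohertz.
totalConstMem: Total amount of constant memory available on the device, in bytes.
major, minor: Major and minor revision numbers that define the device's compute capability.
textureAlignment: Texture alignment requirement; texture base addresses aligned to textureAlignment bytes do not need an offset applied.
texturePitchAlignment: Pitch alignment requirement for 2D texture references bound to pitched memory.
deviceOverlap: 1 if the device can copy memory between host and device while concurrently executing a kernel, 0 otherwise. Deprecated; use asyncEngineCount instead.
multiProcessorCount: Number of multiprocessors on the device.
kernelExecTimeoutEnabled: 1 if there is a run-time limit for kernels executed on the device, 0 otherwise.
integrated: 1 if the device is an integrated (motherboard) GPU, 0 if it is a discrete (card) component.
canMapHostMemory: 1 if the device can map host memory into the CUDA address space for use with cudaHostAlloc()/cudaHostGetDevicePointer(), 0 otherwise.
computeMode: Compute mode the device is currently in. Available modes include cudaComputeModeDefault, cudaComputeModeProhibited, and cudaComputeModeExclusiveProcess.
maxTexture1D: Maximum 1D texture size.
maxTexture1DMipmap: Maximum 1D mipmapped texture size.
maxTexture1DLinear: Maximum size of 1D textures bound to linear memory.
maxTexture2D: Maximum 2D texture dimensions (array of 2 elements).
maxTexture2DMipmap: Maximum 2D mipmapped texture dimensions (array of 2 elements).
maxTexture2DLinear: Maximum dimensions of 2D textures bound to pitch-linear memory (array of 3 elements).
maxTexture2DGather: Maximum 2D texture dimensions if texture gather operations have to be performed (array of 2 elements).
maxTexture3D: Maximum 3D texture dimensions (array of 3 elements).
maxTexture3DAlt: Maximum alternate 3D texture dimensions (array of 3 elements).
maxTextureCubemap: Maximum cubemap texture width or height.
maxTexture1DLayered: Maximum 1D layered texture dimensions (array of 2 elements).
maxTexture2DLayered: Maximum 2D layered texture dimensions (array of 3 elements).
maxTextureCubemapLayered: Maximum cubemap layered texture dimensions (array of 2 elements).
maxSurface1D: Maximum 1D surface size.
maxSurface2D: Maximum 2D surface dimensions (array of 2 elements).
maxSurface3D: Maximum 3D surface dimensions (array of 3 elements).
maxSurface1DLayered: Maximum 1D layered surface dimensions (array of 2 elements).
maxSurface2DLayered: Maximum 2D layered surface dimensions (array of 3 elements).
maxSurfaceCubemap: Maximum cubemap surface width or height.
maxSurfaceCubemapLayered: Maximum cubemap layered surface dimensions (array of 2 elements).
surfaceAlignment: Alignment requirement for surfaces.
concurrentKernels: 1 if the device supports executing multiple kernels within the same context simultaneously, 0 otherwise. There is no guarantee that multiple kernels will actually be resident on the device at the same time, so this feature should not be relied upon for correctness.
ECCEnabled: 1 if ECC support is enabled on the device, 0 otherwise.
pciBusID: PCI bus identifier of the device.
pciDeviceID: PCI device (sometimes called slot) identifier of the device.
pciDomainID: PCI domain identifier of the device.
tccDriver: 1 if the device is using a TCC driver, 0 otherwise.
asyncEngineCount: 1 when the device can copy memory between host and device while concurrently executing a kernel; 2 when the device can copy memory between host and device in both directions concurrently while also executing a kernel; 0 if neither is supported.
unifiedAddressing: 1 if the device shares a unified address space with the host, 0 otherwise.
memoryClockRate: Peak memory clock frequency, in kilohertz.
memoryBusWidth: Memory bus width, in bits.
l2CacheSize: L2 cache size, in bytes.
persistingL2CacheMaxSize: Maximum L2 cache capacity that can be set aside for persisting lines, in bytes.
maxThreadsPerMultiProcessor: Maximum number of resident threads per multiprocessor.
streamPrioritiesSupported: 1 if the device supports stream priorities, 0 otherwise.
globalL1CacheSupported: 1 if the device supports caching global memory in the L1 cache, 0 otherwise.
localL1CacheSupported: 1 if the device supports caching local memory in the L1 cache, 0 otherwise.
sharedMemPerMultiprocessor: Maximum amount of shared memory available to a multiprocessor, in bytes; this amount is shared by all thread blocks simultaneously resident on the multiprocessor.
regsPerMultiprocessor: Maximum number of 32-bit registers available to a multiprocessor; this number is shared by all thread blocks simultaneously resident on the multiprocessor.
managedMemory: 1 if the device supports allocating managed memory on this system, 0 otherwise.
isMultiGpuBoard: 1 if the device is on a multi-GPU board (for example, a Gemini card), 0 otherwise.
multiGpuBoardGroupID: Unique identifier for a group of devices on the same multi-GPU board. Devices on the same multi-GPU board share the same identifier.
hostNativeAtomicSupported: 1 if the link between the device and the host supports native atomic operations, 0 otherwise.
singleToDoublePrecisionPerfRatio: Ratio of single-precision performance (in floating-point operations per second) to double-precision performance.
pageableMemoryAccess: 1 if the device supports coherently accessing pageable memory without calling cudaHostRegister on it, 0 otherwise.
concurrentManagedAccess: 1 if the device can coherently access managed memory concurrently with the CPU, 0 otherwise.
computePreemptionSupported: 1 if the device supports compute preemption, 0 otherwise.
canUseHostPointerForRegisteredMem: 1 if the device can access host-registered memory at the same virtual address as the CPU, 0 otherwise.
cooperativeLaunch: 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernel, 0 otherwise.
cooperativeMultiDeviceLaunch: 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernelMultiDevice, 0 otherwise.
sharedMemPerBlockOptin: Per-device maximum shared memory per block, in bytes, usable by special opt-in.
pageableMemoryAccessUsesHostPageTables: 1 if the device accesses pageable memory via the host's page tables, 0 otherwise.
directManagedMemAccessFromHost: 1 if the host can directly access managed memory on the device without migration, 0 otherwise.
maxBlocksPerMultiProcessor: Maximum number of thread blocks that can reside on a multiprocessor.
accessPolicyMaxWindowSize: Maximum value of cudaAccessPolicyWindow::num_bytes.
reservedSharedMemPerBlock: Shared memory reserved by the CUDA driver per block, in bytes.
hostRegisterSupported: 1 if the device supports registering host memory via cudaHostRegister, 0 otherwise.
sparseCudaArraySupported: 1 if the device supports sparse CUDA arrays and sparse CUDA mipmapped arrays, 0 otherwise.
hostRegisterReadOnlySupported: 1 if the device supports using the cudaHostRegister flag cudaHostRegisterReadOnly to register memory that must be mapped as read-only, 0 otherwise.
timelineSemaphoreInteropSupported: 1 if the device supports external timeline semaphore interop, 0 otherwise.
memoryPoolsSupported: 1 if the device supports cudaMallocAsync and the cudaMemPool family of APIs, 0 otherwise.
gpuDirectRDMASupported: 1 if the device supports the GPUDirect RDMA APIs, 0 otherwise.
gpuDirectRDMAFlushWritesOptions: Bitmask to be interpreted according to the cudaFlushGPUDirectRDMAWritesOptions enum.
gpuDirectRDMAWritesOrdering: See the cudaGPUDirectRDMAWritesOrdering enum for the possible values.
memoryPoolSupportedHandleTypes: Bitmask of the handle types supported for memory-pool-based IPC.
deferredMappingCudaArraySupported: 1 if the device supports deferred-mapping CUDA arrays and CUDA mipmapped arrays, 0 otherwise.
ipcEventSupported: 1 if the device supports IPC events, 0 otherwise.
unifiedFunctionPointers: 1 if the device supports unified function pointers, 0 otherwise.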
Note that cudaGetDeviceProperties takes a device number as a parameter. How do you find out how many devices you have? Use the cudaGetDeviceCount function, shown below.
__host__ __device__ cudaError_t cudaGetDeviceCount (int* count)
Parameters
count - Pointer that receives the number of devices with compute capability greater than or equal to 2.0
Sometimes you need to know the CUDA driver API version and the CUDA runtime API version. The following two functions return them, respectively.
__host__ cudaError_t cudaDriverGetVersion (int* driverVersion)
__host__ __device__ cudaError_t cudaRuntimeGetVersion (int* runtimeVersion)
Parameters
driverVersion - Pointer that receives the driver version number
runtimeVersion - Pointer that receives the runtime version number
Both version numbers are encoded as 1000 * major version + 10 * minor version. For example, version 9.1 is encoded as 9010, and version 10.3 as 10030.
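To make the encoding concrete, here is a minimal sketch in plain host-side C (it should also compile unchanged as CUDA C host code). The names CudaVersion, decode_cuda_version, and encode_cuda_version are hypothetical helpers introduced only for illustration; they are not part of the CUDA API.

```c
/* Hypothetical helper type, not part of the CUDA API. */
typedef struct {
    int major;
    int minor;
} CudaVersion;

/* Split an encoded version (1000 * major + 10 * minor) into its parts.
 * The input would typically come from cudaDriverGetVersion() or
 * cudaRuntimeGetVersion(). */
CudaVersion decode_cuda_version(int encoded)
{
    CudaVersion v;
    v.major = encoded / 1000;        /* thousands and above hold the major version */
    v.minor = (encoded % 1000) / 10; /* the tens digit holds the minor version */
    return v;
}

/* Inverse operation: build the encoded integer from major and minor. */
int encode_cuda_version(int major, int minor)
{
    return 1000 * major + 10 * minor;
}
```

For instance, passing the value written by cudaRuntimeGetVersion to decode_cuda_version gives the major and minor parts that can then be printed as "%d.%d".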
The program below queries the device properties that are most commonly of interest.
#include <cuda_runtime.h>
#include <math.h>   /* pow */
#include <stdio.h>
#include <stdlib.h> /* exit, EXIT_SUCCESS */

/* Abort with file, line, and error string if a CUDA runtime call fails. */
#define CHECK(call) \
{ \
    const cudaError_t error = call; \
    if (error != cudaSuccess) \
    { \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__); \
        fprintf(stderr, "code: %d, reason: %s\n", error, \
                cudaGetErrorString(error)); \
        exit(1); \
    } \
}

int main(int argc, char **argv)
{
    printf("%s Starting...\n", argv[0]);

    /* Find out how many CUDA devices are present. */
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0)
    {
        printf("There are no available device(s) that support CUDA\n");
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
        printf("\n");
    }

    int dev = 0, driverVersion = 0, runtimeVersion = 0;

    /* Query and print the properties of every device. */
    for (dev = 0; dev < deviceCount; dev++)
    {
        CHECK(cudaSetDevice(dev));
        cudaDeviceProp deviceProp;
        CHECK(cudaGetDeviceProperties(&deviceProp, dev));
        printf("Device %d: \"%s\"\n", dev, deviceProp.name);

        /* Versions are encoded as 1000 * major + 10 * minor. */
        cudaDriverGetVersion(&driverVersion);
        cudaRuntimeGetVersion(&runtimeVersion);
        printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n",
               driverVersion / 1000, (driverVersion % 100) / 10,
               runtimeVersion / 1000, (runtimeVersion % 100) / 10);
        printf(" CUDA Capability Major/Minor version number: %d.%d\n",
               deviceProp.major, deviceProp.minor);
        printf(" Total amount of global memory: %.2f GBytes (%llu bytes)\n",
               (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
               (unsigned long long)deviceProp.totalGlobalMem);
        /* clockRate and memoryClockRate are reported in kHz. */
        printf(" GPU Clock rate: %.0f MHz (%0.2f GHz)\n",
               deviceProp.clockRate * 1e-3f, deviceProp.clockRate * 1e-6f);
        printf(" Memory Clock rate: %.0f Mhz\n",
               deviceProp.memoryClockRate * 1e-3f);
        printf(" Memory Bus Width: %d-bit\n", deviceProp.memoryBusWidth);

        if (deviceProp.l2CacheSize)
        {
            printf(" L2 Cache Size: %d bytes\n", deviceProp.l2CacheSize);
        }

        printf(" Max Texture Dimension Size (x,y,z) 1D=(%d), "
               "2D=(%d,%d), 3D=(%d,%d,%d)\n",
               deviceProp.maxTexture1D,
               deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
               deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1],
               deviceProp.maxTexture3D[2]);
        printf(" Max Layered Texture Size (dim) x layers 1D=(%d) x %d, "
               "2D=(%d,%d) x %d\n",
               deviceProp.maxTexture1DLayered[0],
               deviceProp.maxTexture1DLayered[1],
               deviceProp.maxTexture2DLayered[0],
               deviceProp.maxTexture2DLayered[1],
               deviceProp.maxTexture2DLayered[2]);
        /* size_t fields are printed with %zu. */
        printf(" Total amount of constant memory: %zu bytes\n",
               deviceProp.totalConstMem);
        printf(" Total amount of shared memory per block: %zu bytes\n",
               deviceProp.sharedMemPerBlock);
        printf(" Total number of registers available per block: %d\n",
               deviceProp.regsPerBlock);
        printf(" Warp size: %d\n", deviceProp.warpSize);
        printf(" Maximum number of threads per multiprocessor: %d\n",
               deviceProp.maxThreadsPerMultiProcessor);
        printf(" Number of multiprocessor: %d\n",
               deviceProp.multiProcessorCount);
        printf(" Maximum number of threads per block: %d\n",
               deviceProp.maxThreadsPerBlock);
        printf(" Maximum sizes of each dimension of a block: %d x %d x %d\n",
               deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1],
               deviceProp.maxThreadsDim[2]);
        printf(" Maximum sizes of each dimension of a grid: %d x %d x %d\n",
               deviceProp.maxGridSize[0], deviceProp.maxGridSize[1],
               deviceProp.maxGridSize[2]);
        printf(" Maximum memory pitch: %zu bytes\n", deviceProp.memPitch);
        printf("\n");
    }

    exit(EXIT_SUCCESS);
}
Below is the output from a test run on a server.
./checkDeviceInfor Starting...
Detected 3 CUDA Capable device(s)

Device 0: "Tesla P100-PCIE-16GB"
 CUDA Driver Version / Runtime Version 11.4 / 10.1
 CUDA Capability Major/Minor version number: 6.0
 Total amount of global memory: 15.90 GBytes (17071734784 bytes)
 GPU Clock rate: 1329 MHz (1.33 GHz)
 Memory Clock rate: 715 Mhz
 Memory Bus Width: 4096-bit
 L2 Cache Size: 4194304 bytes
 Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
 Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
 Total amount of constant memory: 65536 bytes
 Total amount of shared memory per block: 49152 bytes
 Total number of registers available per block: 65536
 Warp size: 32
 Maximum number of threads per multiprocessor: 2048
 Number of multiprocessor: 56
 Maximum number of threads per block: 1024
 Maximum sizes of each dimension of a block: 1024 x 1024 x 64
 Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
 Maximum memory pitch: 2147483647 bytes

Device 1: "Tesla P100-PCIE-16GB"
 CUDA Driver Version / Runtime Version 11.4 / 10.1
 CUDA Capability Major/Minor version number: 6.0
 Total amount of global memory: 15.90 GBytes (17071734784 bytes)
 GPU Clock rate: 1329 MHz (1.33 GHz)
 Memory Clock rate: 715 Mhz
 Memory Bus Width: 4096-bit
 L2 Cache Size: 4194304 bytes
 Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
 Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
 Total amount of constant memory: 65536 bytes
 Total amount of shared memory per block: 49152 bytes
 Total number of registers available per block: 65536
 Warp size: 32
 Maximum number of threads per multiprocessor: 2048
 Number of multiprocessor: 56
 Maximum number of threads per block: 1024
 Maximum sizes of each dimension of a block: 1024 x 1024 x 64
 Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
 Maximum memory pitch: 2147483647 bytes

Device 2: "Tesla P100-PCIE-16GB"
 CUDA Driver Version / Runtime Version 11.4 / 10.1
 CUDA Capability Major/Minor version number: 6.0
 Total amount of global memory: 15.90 GBytes (17071734784 bytes)
 GPU Clock rate: 1329 MHz (1.33 GHz)
 Memory Clock rate: 715 Mhz
 Memory Bus Width: 4096-bit
 L2 Cache Size: 4194304 bytes
 Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
 Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
 Total amount of constant memory: 65536 bytes
 Total amount of shared memory per block: 49152 bytes
 Total number of registers available per block: 65536
 Warp size: 32
 Maximum number of threads per multiprocessor: 2048
 Number of multiprocessor: 56
 Maximum number of threads per block: 1024
 Maximum sizes of each dimension of a block: 1024 x 1024 x 64
 Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
 Maximum memory pitch: 2147483647 bytes