CUDA Study Notes 2

xiaoxiao 2026-05-06

1. Using two-dimensional arrays

#include <iostream>
#include <cstdlib>
using namespace std;

static const int ROW = 10;
static const int COL = 5;

int main()
{
    // array holds one pointer per row; data holds all elements contiguously
    int** array = (int**)malloc(ROW * sizeof(int*));
    int* data = (int*)malloc(ROW * COL * sizeof(int));

    // initialize the data
    for (int i = 0; i < ROW * COL; i++)
    {
        data[i] = i;
    }

    // initialize the array: row i starts at offset i*COL in data
    for (int i = 0; i < ROW; i++)
    {
        array[i] = data + i * COL;
    }

    // output the array
    for (int i = 0; i < ROW; i++)
        for (int j = 0; j < COL; j++)
        {
            cout << array[i][j] << endl;
        }

    free(array);
    free(data);
    return 0;
}

2. Code for querying GPU information

Note: if #include <helper_cuda.h> cannot be found, add $cuda-samples/NVIDIA_CUDA-7.5_Samples/common/inc to the Includes setting of the NVCC Compiler.

Notes:

(1) #include <cstdlib>

#include <cstdlib> provides the types size_t, wchar_t (a built-in type in C++; declared by this header only in C), div_t, ldiv_t, and lldiv_t; the constants NULL, EXIT_FAILURE, EXIT_SUCCESS, RAND_MAX, and MB_CUR_MAX; and the functions atof, atoi, atol, strtod, strtof, strtold, strtol, strtoll, strtoul, strtoull, rand, srand, calloc, free, malloc, realloc, abort, atexit, exit, getenv, system, bsearch, qsort, abs, div, labs, ldiv, llabs, lldiv, mblen, mbtowc, wctomb, mbstowcs, and wcstombs.

(2) CUDART_VERSION

The CUDA Runtime API version. The macro is available after #include <cuda_runtime.h>; in CUDA 7.5 it is defined as #define CUDART_VERSION 7050.
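The value 7050 encodes the version as 1000 × major + 10 × minor, so it stands for CUDA 7.5. A minimal sketch of the decoding in plain C++ follows; the CudaVersion struct and the decode_cuda_version function are names made up for this illustration and are not part of the CUDA API.

```cpp
// Hypothetical helper type, not part of the CUDA API.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// Splits a CUDA version number such as 7050 into its major and minor parts,
// assuming the encoding 1000 * major + 10 * minor.
inline CudaVersion decode_cuda_version(int version) {
    CudaVersion v;
    v.major_version = version / 1000;
    v.minor_version = (version % 100) / 10;
    return v;
}
```

The same decoding applies to the integers returned by cudaDriverGetVersion and cudaRuntimeGetVersion.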

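(3) stdin, stdout, stderr

A process reads its input from the standard input file, writes its normal output to the standard output file, and sends error messages to the standard error file. The file descriptor of stdin is 0, that of stdout is 1, and that of stderr is 2.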
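To illustrate the separation between normal output and error output, here is a small sketch in plain C++. The function parse_positive is a hypothetical helper invented for this example; it takes the two output streams as parameters so that the split is explicit.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: parses a positive integer from `text`.
// The normal result goes to `out`; diagnostics go to `err`.
bool parse_positive(const std::string& text, std::ostream& out, std::ostream& err) {
    std::istringstream in(text);
    int value = 0;
    if (!(in >> value) || value <= 0) {
        err << "error: expected a positive integer, got \"" << text << "\"\n";
        return false;
    }
    out << value << "\n";
    return true;
}
```

In a real program the call would be parse_positive(line, std::cout, std::cerr). Because the two streams use different file descriptors, a shell redirection such as 2> errors.log captures only the diagnostics.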

(4) __host__ cudaError_t cudaSetDevice ( int device )

Set device to be used for GPU executions.

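(5) The cudaDeviceProp structure

cudaDeviceProp is the structure filled in by cudaGetDeviceProperties, which retrieves the properties of a CUDA-capable device, such as the device name, the supported compute capability, the memory size, the maximum number of threads, and the clock rate. An abridged listing of its fields: [6]

struct cudaDeviceProp {
    char   name[256];                 // ASCII string identifying the device (e.g. "GeForce 940M")
    size_t totalGlobalMem;            // total global memory, in bytes
    size_t sharedMemPerBlock;         // shared memory available per block, in bytes
    int    regsPerBlock;              // number of 32-bit registers available per block
    int    warpSize;                  // warp size, in threads
    size_t memPitch;                  // maximum pitch, in bytes, allowed for memory copies
    int    maxThreadsPerBlock;        // maximum number of threads per block
    int    maxThreadsDim[3];          // maximum number of threads along each dimension of a block
    int    maxGridSize[3];            // maximum number of blocks along each dimension of a grid
    size_t totalConstMem;             // constant memory available, in bytes
    int    major;                     // major revision number of the compute capability
    int    minor;                     // minor revision number of the compute capability
    int    clockRate;                 // clock frequency, in kilohertz
    size_t textureAlignment;          // alignment requirement for textures
    int    deviceOverlap;             // boolean: can the device copy memory and execute a kernel at the same time
    int    multiProcessorCount;       // number of multiprocessors (SMs) on the device
    int    kernelExecTimeoutEnabled;  // boolean: is there a run time limit on kernels executed on this device
    int    integrated;                // boolean: is the device an integrated GPU (part of the chipset rather than a discrete card)
    int    canMapHostMemory;          // boolean: can the device map host memory into the CUDA device address space
    int    computeMode;               // compute mode of the device: default, exclusive, or prohibited
    int    maxTexture1D;              // maximum size of 1D textures
    int    maxTexture2D[2];           // maximum dimensions of 2D textures
    int    maxTexture3D[3];           // maximum dimensions of 3D textures
    int    maxTexture2DArray[3];      // maximum dimensions of 2D texture arrays
    int    concurrentKernels;         // boolean: can the device execute multiple kernels concurrently within the same context
};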

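(6) __host__ cudaError_t cudaDriverGetVersion ( int* driverVersion )

Returns the CUDA driver version.

(7) __host__ __device__ cudaError_t cudaRuntimeGetVersion ( int* runtimeVersion )

Returns the CUDA Runtime version.

Note: when __host__ and __device__ are used together, the compiler generates two versions of the same function, one for the host and one for the device. This supports a common use case: the same source code only needs to be compiled again to obtain a version that runs on the device.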
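One common way to apply this in a header shared between .cu and .cpp files is to hide the qualifiers behind a macro, so the same function also compiles with an ordinary C++ compiler. The sketch below assumes a hypothetical macro HOST_DEVICE and a hypothetical helper flatten_index; __CUDACC__ is the macro that nvcc defines while compiling CUDA source.

```cpp
// HOST_DEVICE is a made-up name for this sketch.
// Under nvcc it expands to the CUDA qualifiers; elsewhere it expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: maps (row, col) to an offset in a row-major array.
// With nvcc, this one definition yields both a host and a device version.
HOST_DEVICE inline int flatten_index(int row, int col, int num_cols) {
    return row * num_cols + col;
}
```

For a matrix with 5 columns, flatten_index(2, 3, 5) gives offset 13, which is the same layout as data + i*COL in section 1.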

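(8) #if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)

WIN32, _WIN32, WIN64, and _WIN64 are macros predefined on Windows (by the compiler or the build environment rather than by the operating system itself). The purpose of this line is to let C/C++ code detect the operating system type through macro definitions. [5]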
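A minimal sketch of how such a check can select platform-specific behavior at compile time. The names kIsWindows and platform_name are invented for this example.

```cpp
#include <string>

// Exactly one branch survives preprocessing, depending on the target platform.
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
static const bool kIsWindows = true;
#else
static const bool kIsWindows = false;
#endif

// Hypothetical helper that reports which branch was compiled in.
inline std::string platform_name() {
    return kIsWindows ? "Windows" : "non-Windows";
}
```

The same pattern can wrap any pair of platform-specific calls, with the Windows variant in the first branch and the portable variant in the second.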

3. #include <device_launch_parameters.h>

Notes: the header <device_launch_parameters.h> declares the five built-in variables available inside kernel functions: threadIdx, blockDim, blockIdx, gridDim, and warpSize.
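In a one-dimensional launch these variables are usually combined as blockIdx.x * blockDim.x + threadIdx.x to give each thread a unique global index. The sketch below emulates that formula on the host in plain C++. The function global_indices and its parameter names are hypothetical stand-ins for the built-in variables, not CUDA code.

```cpp
#include <vector>

// Host-side emulation of the 1D global index formula.
// On the GPU, each (block_idx, thread_idx) pair would be a separate thread.
std::vector<int> global_indices(int grid_dim, int block_dim) {
    std::vector<int> indices;
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            indices.push_back(block_idx * block_dim + thread_idx);
        }
    }
    return indices;
}
```

With 2 blocks of 3 threads, the indices run from 0 to 5, so every element of a 6-element array is covered exactly once.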

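4. Event management

Notes: the commonly used functions are:

(1) cudaEventCreate(): creates an event.

(2) cudaEventDestroy(): destroys an event.

(3) cudaEventRecord(): records an event.

(4) cudaEventSynchronize(): waits for an event to complete.

(5) cudaEventElapsedTime(): computes the elapsed time between two events, in milliseconds.

The event management API provided by CUDA can be used for timing, as follows: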

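cudaEvent_t start, stop;
float time;  // elapsed time in milliseconds

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
{
    // code to be timed
}
cudaEventRecord(stop, 0);

// wait until the stop event has been recorded before reading the time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);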

5. deviceQuery

Notes: the output is shown below:

root@ubuntu:~/cuda-samples/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 940M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 1024 MBytes (1073610752 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             1001 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce 940M
Result = PASS

Note: the device has 3 SMs, each containing 128 SPs, for a total of 384 SPs.

6. Blocked matrix multiplication [11]

Notes:
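As a rough illustration of the blocking idea only, here is a host-side sketch in plain C++. It is not the shared-memory CUDA kernel described in [11]. The function name blocked_matmul, the restriction to square row-major matrices, and the tile parameter are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Computes C = A * B for n x n row-major matrices, one tile at a time.
// Each (i0, j0) pair selects a tile of C; the k0 loop accumulates the
// products of the matching tiles of A and B into it.
std::vector<float> blocked_matmul(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t n, std::size_t tile) {
    std::vector<float> c(n * n, 0.0f);
    if (tile == 0) tile = 1;  // guard against a zero step in the loops below
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, n);
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t j1 = std::min(j0 + tile, n);
            for (std::size_t k0 = 0; k0 < n; k0 += tile) {
                const std::size_t k1 = std::min(k0 + tile, n);
                // Multiply one tile of A by one tile of B.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

In a CUDA version of the same decomposition, each tile of C would typically be assigned to one thread block, and the tiles of A and B would be staged in shared memory before being multiplied.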

References:

[1] What are some excellent open-source CUDA projects?: https://www.zhihu.com/question/29036289/answer/42971562

[2] One-dimensional matrix addition in CUDA: http://tech.it168.com/a2009/1112/807/000000807771.shtml

[3] Two-dimensional matrix addition in CUDA: http://www.cnblogs.com/jugg1024/p/4349243.html

[4] NVIDIA CUDA Runtime API: http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#axzz4G8M3LWlq

[5] How C/C++ uses macro definitions to determine the operating system: http://www.myexception.cn/operating-system/1981774.html

[6] CUDA programming: writing a matrix multiplication is actually not that hard: http://www.cnblogs.com/yusenwu/p/5300956.html

[7] CUDA example: matrix multiplication: http://wenku.baidu.com/link?url=XCOgGQqpPUns-cifgGm1tbfqmY-5wWTwkXHh1_i_5ZZX6vFmbFu22r67fWMpcs-GxsH9thzOjVeNCpKIjGjdx2SYhq7bW4qfIquRTM0AAW_

[8] HUST parallel computing lab assignment: http://wenku.baidu.com/link?url=1tWvUvW0t7BnFChxetS_Mr5_pCF_LZHQGLWxN-ArVVPccOM_VmoTx9IUD76l_rVMP-iPKWI97vn7wa5ZChz59rr4wlur3rL6k3MGB15qF4W

[9] CUDA programming: http://www.cnblogs.com/stewart/archive/2013/01/05/2846860.html

[10] NVIDIA Docker: GPU Server Application Deployment Made Easy: https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/

[11] CUDA matrix multiplication using shared memory: http://blog.csdn.net/augusdi/article/details/12614247

When reposting, please cite the original address: https://ju.6miu.com/read-1309408.html
