Professional CUDA C Programming
Code download: http://www.wrox.com/WileyCDA/

Parallel Reduction
A serial reduction:
int sum = 0;
for (int i = 0; i < N; i++)
    sum += array[i];
Key property of a reduction: the operation is commutative and associative, so the result is independent of evaluation order. Problems such as max and min can therefore be solved the same way.
The basic parallel approach:
➤ 1. Partition the input vector into smaller data chunks.
➤ 2. One thread computes the partial sum of one chunk.
➤ 3. Accumulate the partial sums of all chunks into the final result.
The most common scheme uses chunks of just two elements: each thread sums one pair, the number of remaining elements halves on every iteration, and the process repeats until only a single element is left. (Each iteration operates in place on the data.) The pairs can be formed two ways:
➤ neighbored pairs (adjacent elements)
➤ interleaved pairs (elements a fixed stride apart)
Serial implementation in C:
// Recursive Implementation of Interleaved Pair Approach
int cpuRecursiveReduce(int *data, int const size)
{
    // stop condition
    if (size == 1) return data[0];

    // renew the stride
    int const stride = size / 2;

    // in-place reduction
    for (int i = 0; i < stride; i++)
    {
        data[i] += data[i + stride];
    }

    // call recursively
    return cpuRecursiveReduce(data, stride);
}
1. Neighbored Pair Implementation with Divergence
Each thread adds two neighboring elements:
__global__ void reduceNeighbored (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if ((tid % (2 * stride)) == 0)
        {
            idata[tid] += idata[tid + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Walking through one block (block 0 handles data 0 through data 7):
The starting address of the data processed by a block:
int *idata = g_idata + blockIdx.x * blockDim.x;
for (int stride = 1; stride < blockDim.x; stride *= 2)
1. stride = 1: the four threads with tid = 0, 2, 4, 6 each add a pair of neighboring elements.
__syncthreads() guarantees that threads 0, 2, 4, 6 of the block have all finished their additions in this loop iteration before the next one begins.
2. stride = 2: only the threads with tid = 0 and 4 participate.
3. stride = 4: only tid = 0 participates; it completes the sum of the eight values 3 1 7 0 4 1 6 3 and stores the result: g_odata[0] = 25.
Suppose there are 64 elements and 8 blocks, each block summing 8 elements. g_idata holds the original input; g_odata has 8 entries, where g_odata[i] is the partial sum computed by block i. Summing the g_odata array serially on the CPU then yields the total over all the data.
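The host code around the kernel is not shown in this excerpt; a minimal driver for the 64-element, 8-block example might look like the following sketch (error checking omitted; the h_/d_ variable names are illustrative):

```cuda
int n = 64, nblocks = 8, nthreads = 8;            // 8 blocks of 8 elements each
size_t nbytes = n * sizeof(int);

int *h_idata = (int *)malloc(nbytes);
int *h_odata = (int *)malloc(nblocks * sizeof(int));
for (int i = 0; i < n; i++) h_idata[i] = 1;       // fill with test data

int *d_idata, *d_odata;
cudaMalloc(&d_idata, nbytes);
cudaMalloc(&d_odata, nblocks * sizeof(int));
cudaMemcpy(d_idata, h_idata, nbytes, cudaMemcpyHostToDevice);

reduceNeighbored<<<nblocks, nthreads>>>(d_idata, d_odata, n);
cudaMemcpy(h_odata, d_odata, nblocks * sizeof(int), cudaMemcpyDeviceToHost);

// final serial pass on the CPU: add up the per-block partial sums
int sum = 0;
for (int i = 0; i < nblocks; i++) sum += h_odata[i];
```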
nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce
./nestedReduce
Note: compiling with a plain nvcc invocation (without these flags) fails for code that uses dynamic parallelism; see:
http://stackoverflow.com/questions/19287461/compiling-code-containing-dynamic-parallelism-fails
2. Improving Divergence in Parallel Reduction
Looking again at the kernel in which each thread adds two neighboring elements, note the condition
if ((tid % (2 * stride)) == 0)
Only threads whose tid satisfies it do any work. For 8 elements, for example:
First iteration: tid = 0, 2, 4, 6 — only half the threads compute.
Second iteration: tid = 0, 4 — only 25% of the threads compute.
Third iteration: tid = 0 — only a single thread computes.
Even though only some threads compute, all of them must still be scheduled, because scheduling is done per warp, a warp being 32 consecutive threads. In the second iteration, for instance, among threads 0 through 4 only threads 0 and 4 need to compute, yet threads 1, 2, and 3 in the same warp are scheduled anyway.
By rearranging which data each thread processes we can reduce this divergence while performing the same pairwise additions:
for (int stride = 1; stride < blockDim.x; stride *= 2)
{
    // convert tid into local array index
    int index = 2 * stride * tid;   // the key difference
    if (index < blockDim.x)
    {
        idata[index] += idata[index + stride];
    }

    // synchronize within threadblock
    __syncthreads();
}
If a block has 512 threads (16 warps): in the first iteration 8 warps perform the reduction and the remaining 8 warps do nothing; in the second iteration 4 warps reduce and 12 do nothing; in the third, 2 warps work and 14 do nothing; in the fourth, 1 warp works, and so on.
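For reference, the remapped loop above sits inside the book's reduceNeighboredLess kernel; the surrounding code is the same as in reduceNeighbored:

```cuda
__global__ void reduceNeighboredLess (int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    if (idx >= n) return;

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        // convert tid into local array index so active threads stay contiguous
        int index = 2 * stride * tid;
        if (index < blockDim.x)
        {
            idata[index] += idata[index + stride];
        }
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
```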
3. Reducing with Interleaved Pairs
Each thread adds two elements a fixed stride apart; the global memory access pattern differs from the neighbored version:
__global__ void reduceInterleaved (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            idata[tid] += idata[tid + stride];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Unrolling Loops
Loop unrolling is an optimization technique that lowers the frequency of branches and reduces loop-maintenance overhead.
// version 1
for (int i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

// version 2: unrolled by a factor of 2
for (int i = 0; i < 100; i += 2) {
    a[i]     = b[i]     + c[i];
    a[i + 1] = b[i + 1] + c[i + 1];
}
The reason for the speedup is invisible at the high-level-language level; it has to be analyzed at the low-level instruction level. The second version halves the number of loop iterations, and because the two statements in its body are independent, their reads and writes can execute concurrently on the CPU, so it outperforms the first.
Unrolling matters even more in CUDA programming. The goal is still to improve performance by reducing instruction overhead and creating more independent instructions. This produces more concurrent operations and therefore higher instruction and memory bandwidth, which in turn provides more eligible warps to help hide instruction/memory latency.
4. Reducing with Unrolling
In the reduceInterleaved kernel one thread block processes one data block. How can we unroll the loop so that a single block processes two data blocks? Each thread block then computes over data from two data blocks. This is cyclic partitioning (introduced in Chapter 1): each thread works across multiple data blocks, handling a single element in each.
Step 1: a thread block adds the corresponding elements of two neighboring data blocks.
block0:          1 0 1 1 3 4 0 2
block1:          1 2 5 7 2 1 6 3
block0 (after):  2 2 6 8 5 5 6 5
Step 2: the thread block reduces its combined data block exactly as in reduceInterleaved.
__global__ void reduceUnrolling2 (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 2;

    // unrolling 2
    if (idx + blockDim.x < n) g_idata[idx] += g_idata[idx + blockDim.x];
    __syncthreads();

    // in-place reduction in global memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            idata[tid] += idata[tid + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
int size = 1 << 24;   // total number of elements to reduce
int blocksize = 512;  // initial block size
dim3 block(blocksize, 1);
dim3 grid((size + block.x - 1) / block.x, 1);

// each block now processes two data blocks, so the grid is halved
reduceUnrolling2<<<grid.x / 2, block>>>(d_idata, d_odata, size);
Experiment 1: Improving Divergence in Parallel Reduction
nvprof --metrics inst_per_warp ./reduceInteger
Device "GeForce GT 740M (0)"
reduceNeighbored Instructions per warp 291.625000 291.625000 291.625000
reduceNeighboredLess Instructions per warp 115.812500 115.812500 115.812500
nvprof --metrics gld_throughput ./reduceInteger
reduceNeighbored Global Load Throughput 16.840GB/s 16.840GB/s 16.839GB/s
reduceNeighboredLess Global Load Throughput 17.905GB/s 17.905GB/s 17.905GB/s
Experiment 2: Reducing with Interleaved Pairs
./reduceInteger starting reduction at device 0: GeForce GT 740M with array size 16777216 grid 32768 block 512
cpu reduce elapsed 0.046758 sec cpu_sum: 2139353471
gpu Neighbored elapsed 0.031896 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Neighbored2 elapsed 0.029930 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Interleaved elapsed 0.018442 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
Experiment 3: different degrees of loop unrolling
reduceUnrolling4: each threadblock handles 4 data blocks
reduceUnrolling8: each threadblock handles 8 data blocks
As the results show, more independent load/store operations within a single thread yield better performance, because memory latency can be hidden more effectively. This can be verified with nvprof's dram_read_throughput metric:
nvprof --metrics dram_read_throughput ./reduceInteger
// GT 740M
reduceUnrolling8 Device Memory Read Throughput 10.086GB/s
reduceUnrolling4 Device Memory Read Throughput 7.5325GB/s
reduceUnrolling2 Device Memory Read Throughput 5.6396GB/s
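The unrolling-4 kernel referenced above follows the same pattern as reduceUnrolling2; a sketch (following the book's pattern, not necessarily its exact listing):

```cuda
__global__ void reduceUnrolling4 (int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    // each block now owns four consecutive data blocks
    int *idata = g_idata + blockIdx.x * blockDim.x * 4;

    // unrolling 4: each thread first sums four elements one blockDim apart
    if (idx + 3 * blockDim.x < n)
    {
        int a1 = g_idata[idx];
        int a2 = g_idata[idx + blockDim.x];
        int a3 = g_idata[idx + 2 * blockDim.x];
        int a4 = g_idata[idx + 3 * blockDim.x];
        g_idata[idx] = a1 + a2 + a3 + a4;
    }
    __syncthreads();

    // then reduce in place as in reduceInterleaved
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride) idata[tid] += idata[tid + stride];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

// launched with a quarter of the blocks:
// reduceUnrolling4<<<grid.x / 4, block>>>(d_idata, d_odata, size);
```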
5. Reducing with Unrolled Warps
__syncthreads performs intra-block synchronization. In the reduction kernels it is used on every loop iteration to guarantee that all threads of a block have finished their writes to global memory before the next stage of the computation begins. But what about the point where the kernel needs only 32 or fewer threads (i.e. a single warp)? Since execution follows the SIMT model, threads within a warp are implicitly synchronized. The last six iterations can therefore be unrolled as follows:
// unrolling warp
if (tid < 32)
{
    volatile int *vmem = idata;
    vmem[tid] += vmem[tid + 32];
    vmem[tid] += vmem[tid + 16];
    vmem[tid] += vmem[tid + 8];
    vmem[tid] += vmem[tid + 4];
    vmem[tid] += vmem[tid + 2];
    vmem[tid] += vmem[tid + 1];
}
Warp unrolling avoids the __syncthreads operations within the final warp. Note the volatile qualifier: it tells the compiler that every assignment must store vmem[tid] back to global memory. Without it, the compiler or a cache might optimize away our reads and writes of global/shared memory. With the qualifier, the compiler assumes the value may be modified by another thread, so every read and write goes directly to memory rather than to a cache or register.
At first I did not understand what unrolling the warp meant. Within a single warp instructions follow SIMT: at any moment all 32 threads execute the same instruction, so once the number of active threads drops to 32 or fewer, no explicit synchronization is required.
32 threads summing 64 values:
Iteration 1: [0]+[0+32], [1]+[1+32], [2]+[2+32] ... [31]+[31+32]. Because all 32 threads are in one warp, these additions are implicitly synchronized; the results are stored in [0–31].
Iteration 2: [0]+[0+16], [1]+[1+16], [2]+[2+16] ... [31]+[31+16]. The results are again written to [0–31], but only [0–15] are meaningful.
Iteration 3: [0]+[0+8], [1]+[1+8], [2]+[2+8] ... [31]+[31+8]; the meaningful results are in [0–7].
Iteration 4: [0]+[0+4], [1]+[1+4], [2]+[2+4] ... [31]+[31+4]; the meaningful results are in [0–3].
Iteration 5: [0]+[0+2], [1]+[1+2] ... [31]+[31+2]; the meaningful results are in [0–1].
Iteration 6: [0]+[0+1] ... [31]+[31+1]; the meaningful result is in [0].
Each iteration happens inside a single warp (SIMT), so no intra-block synchronization is needed.
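Combining the unrolling-8 load with this warp-unrolled tail gives the reduceUnrollWarps8 kernel launched below; a sketch (the per-thread loop over the eight loads is shorthand for the book's explicit reads):

```cuda
__global__ void reduceUnrollWarps8 (int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x * 8;

    // unrolling 8: each thread first sums eight elements one blockDim apart
    if (idx + 7 * blockDim.x < n)
    {
        int sum = 0;
        for (int i = 0; i < 8; i++) sum += g_idata[idx + i * blockDim.x];
        g_idata[idx] = sum;
    }
    __syncthreads();

    // stop the loop at stride 32; the final warp is unrolled below
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (tid < stride) idata[tid] += idata[tid + stride];
        __syncthreads();
    }

    // unrolling warp: implicit SIMT synchronization, volatile forces memory access
    if (tid < 32)
    {
        volatile int *vmem = idata;
        vmem[tid] += vmem[tid + 32];
        vmem[tid] += vmem[tid + 16];
        vmem[tid] += vmem[tid + 8];
        vmem[tid] += vmem[tid + 4];
        vmem[tid] += vmem[tid + 2];
        vmem[tid] += vmem[tid + 1];
    }

    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
```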
reduceUnrollWarps8<<<grid.x / 8, block>>>(d_idata, d_odata, size);

// results on a GT 740M
gpu Unrolling8 elapsed 0.097200 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu UnrollWarp8 elapsed 0.096202 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>

nvprof --metrics stall_sync ./reduceInteger  // time stalled on synchronization
reduceUnrollWarps8 Issue Stall Reasons (Synchronization) 9.97%
reduceUnrolling8 Issue Stall Reasons (Synchronization) 14.54%
6. Reducing with Complete Unrolling
If the number of iterations is known at compile time, the loop can be unrolled completely. On both Fermi and Kepler the maximum number of threads per block is 1024, and the loop trip counts in these kernels depend only on blockDim, so complete unrolling is feasible:
__global__ void reduceCompleteUnrollWarps8 (int *g_idata, int *g_odata,
                                            unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 8;

    // unrolling 8
    if (idx + 7 * blockDim.x < n)
    {
        int a1 = g_idata[idx];
        int a2 = g_idata[idx + blockDim.x];
        int a3 = g_idata[idx + 2 * blockDim.x];
        int a4 = g_idata[idx + 3 * blockDim.x];
        int b1 = g_idata[idx + 4 * blockDim.x];
        int b2 = g_idata[idx + 5 * blockDim.x];
        int b3 = g_idata[idx + 6 * blockDim.x];
        int b4 = g_idata[idx + 7 * blockDim.x];
        g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
    }
    __syncthreads();

    // in-place reduction and complete unroll
    if (blockDim.x >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
    __syncthreads();
    if (blockDim.x >= 512 && tid < 256) idata[tid] += idata[tid + 256];
    __syncthreads();
    if (blockDim.x >= 256 && tid < 128) idata[tid] += idata[tid + 128];
    __syncthreads();
    if (blockDim.x >= 128 && tid < 64) idata[tid] += idata[tid + 64];
    __syncthreads();

    // unrolling warp
    if (tid < 32)
    {
        volatile int *vsmem = idata;
        vsmem[tid] += vsmem[tid + 32];
        vsmem[tid] += vsmem[tid + 16];
        vsmem[tid] += vsmem[tid + 8];
        vsmem[tid] += vsmem[tid + 4];
        vsmem[tid] += vsmem[tid + 2];
        vsmem[tid] += vsmem[tid + 1];
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Invocation from main:
reduceCompleteUnrollWarps8<<<grid.x / 8, block>>>(d_idata, d_odata, size);
7. Reducing with Template Functions
CUDA supports C++ templates, so the block size can be supplied as a template parameter:
template <unsigned int iBlockSize>
__global__ void reduceCompleteUnroll(int *g_idata, int *g_odata,
                                     unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 8;

    // unrolling 8
    if (idx + 7 * blockDim.x < n)
    {
        int a1 = g_idata[idx];
        int a2 = g_idata[idx + blockDim.x];
        int a3 = g_idata[idx + 2 * blockDim.x];
        int a4 = g_idata[idx + 3 * blockDim.x];
        int b1 = g_idata[idx + 4 * blockDim.x];
        int b2 = g_idata[idx + 5 * blockDim.x];
        int b3 = g_idata[idx + 6 * blockDim.x];
        int b4 = g_idata[idx + 7 * blockDim.x];
        g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
    }
    __syncthreads();

    // in-place reduction and complete unroll
    if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
    __syncthreads();
    if (iBlockSize >= 512 && tid < 256) idata[tid] += idata[tid + 256];
    __syncthreads();
    if (iBlockSize >= 256 && tid < 128) idata[tid] += idata[tid + 128];
    __syncthreads();
    if (iBlockSize >= 128 && tid < 64) idata[tid] += idata[tid + 64];
    __syncthreads();

    // unrolling warp
    if (tid < 32)
    {
        volatile int *vsmem = idata;
        vsmem[tid] += vsmem[tid + 32];
        vsmem[tid] += vsmem[tid + 16];
        vsmem[tid] += vsmem[tid + 8];
        vsmem[tid] += vsmem[tid + 4];
        vsmem[tid] += vsmem[tid + 2];
        vsmem[tid] += vsmem[tid + 1];
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Note: the key change is the template parameter template <unsigned int iBlockSize>, which fixes the block size at compile time. Checks such as
if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512];
are evaluated against the compile-time value of iBlockSize, so branches that can never be taken are removed: with iBlockSize = 256, for instance, that condition is always false and the compiler eliminates the statement entirely. The kernel is then dispatched by switching on the runtime block size:
switch (blocksize) {
    case 1024:
        reduceCompleteUnroll<1024><<<grid.x / 8, block>>>(d_idata, d_odata, size);
        break;
    case 512:
        reduceCompleteUnroll<512><<<grid.x / 8, block>>>(d_idata, d_odata, size);
        break;
    case 256:
        reduceCompleteUnroll<256><<<grid.x / 8, block>>>(d_idata, d_odata, size);
        break;
    case 128:
        reduceCompleteUnroll<128><<<grid.x / 8, block>>>(d_idata, d_odata, size);
        break;
    case 64:
        reduceCompleteUnroll<64><<<grid.x / 8, block>>>(d_idata, d_odata, size);
        break;
}
Summary: Reduction Kernel Performance
ccit@ccit:~/hym/CodeSamples/chapter03$ ./reduceInteger
./reduceInteger starting reduction at device 0: Tesla K80 with array size 16777216 grid 32768 block 512
cpu reduce elapsed 0.085437 sec cpu_sum: 2139353471
gpu Neighbored elapsed 0.008413 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Neighbored2 elapsed 0.007775 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Interleaved elapsed 0.005087 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Unrolling2 elapsed 0.002878 sec gpu_sum: 2139353471 <<<grid 16384 block 512>>>
gpu Unrolling4 elapsed 0.001665 sec gpu_sum: 2139353471 <<<grid 8192 block 512>>>
gpu Unrolling8 elapsed 0.000970 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu UnrollWarp8 elapsed 0.000873 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu Cmptnroll8 elapsed 0.000865 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu Cmptnroll elapsed 0.000831 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
==11132== Profiling application: ./reduceInteger
==11132== Profiling result:
Time(%) Time Calls Avg Min Max Name
73.30% 80.877ms 9 8.9864ms 8.8970ms 9.0921ms [CUDA memcpy HtoD]
7.67% 8.4588ms 1 8.4588ms 8.4588ms 8.4588ms reduceNeighbored(int*, int*, unsigned int)
7.13% 7.8633ms 1 7.8633ms 7.8633ms 7.8633ms reduceNeighboredLess(int*, int*, unsigned int)
4.66% 5.1374ms 1 5.1374ms 5.1374ms 5.1374ms reduceInterleaved(int*, int*, unsigned int)
2.64% 2.9117ms 1 2.9117ms 2.9117ms 2.9117ms reduceUnrolling2(int*, int*, unsigned int)
1.43% 1.5799ms 1 1.5799ms 1.5799ms 1.5799ms reduceUnrolling4(int*, int*, unsigned int)
0.82% 901.81us 1 901.81us 901.81us 901.81us reduceUnrolling8(int*, int*, unsigned int)
0.77% 847.57us 1 847.57us 847.57us 847.57us reduceUnrollWarps8(int*, int*, unsigned int)
0.76% 836.63us 1 836.63us 836.63us 836.63us reduceCompleteUnrollWarps8(int*, int*, unsigned int)
0.74% 821.97us 1 821.97us 821.97us 821.97us void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
0.10% 107.23us 9 11.914us 5.8560us 21.216us [CUDA memcpy DtoH]
==11132== API calls:
Time(%) Time Calls Avg Min Max Name
55.86% 341.89ms 2 170.94ms 533.87us 341.36ms cudaMalloc
24.73% 151.33ms 1 151.33ms 151.33ms 151.33ms cudaDeviceReset
13.32% 81.521ms 18 4.5289ms 37.614us 9.1564ms cudaMemcpy
5.01% 30.662ms 18 1.7035ms 123.31us 8.4688ms cudaDeviceSynchronize
0.49% 3.0276ms 364 8.3170us 133ns 289.29us cuDeviceGetAttribute
0.25% 1.5256ms 4 381.39us 364.76us 393.17us cuDeviceTotalMem
0.12% 707.06us 1 707.06us 707.06us 707.06us cudaGetDeviceProperties
0.10% 588.83us 2 294.41us 211.99us 376.83us cudaFree
0.07% 452.08us 9 50.231us 39.824us 65.419us cudaLaunch
0.04% 247.65us 4 61.912us 54.456us 65.017us cuDeviceGetName
0.00% 12.626us 27 467ns 152ns 1.3440us cudaSetupArgument
0.00% 9.8140us 9 1.0900us 648ns 1.6530us cudaConfigureCall
0.00% 9.7070us 1 9.7070us 9.7070us 9.7070us cudaSetDevice
0.00% 4.7200us 12 393ns 160ns 750ns cuDeviceGet
0.00% 4.1790us 3 1.3930us 288ns 3.2960us cuDeviceGetCount
ccit@ccit:~/hym/CodeSamples/chapter03$ nvprof ./reduceInteger
==11226== NVPROF is profiling process 11226, command: ./reduceInteger
./reduceInteger starting reduction at device 0: Tesla K80 with array size 16777216 grid 32768 block 512
cpu reduce elapsed 0.060118 sec cpu_sum: 2139353471
gpu Neighbored elapsed 0.008543 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Neighbored2 elapsed 0.007923 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Interleaved elapsed 0.005194 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Unrolling2 elapsed 0.002968 sec gpu_sum: 2139353471 <<<grid 16384 block 512>>>
gpu Unrolling4 elapsed 0.001631 sec gpu_sum: 2139353471 <<<grid 8192 block 512>>>
gpu Unrolling8 elapsed 0.000953 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu UnrollWarp8 elapsed 0.000905 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu Cmptnroll8 elapsed 0.000911 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
gpu Cmptnroll elapsed 0.000884 sec gpu_sum: 2139353471 <<<grid 4096 block 512>>>
==11226== Profiling application: ./reduceInteger
==11226== Profiling result:
Time(%) Time Calls Avg Min Max Name
72.93% 79.366ms 9 8.8184ms 8.7691ms 8.9602ms [CUDA memcpy HtoD]
7.77% 8.4588ms 1 8.4588ms 8.4588ms 8.4588ms reduceNeighbored(int*, int*, unsigned int)
7.22% 7.8620ms 1 7.8620ms 7.8620ms 7.8620ms reduceNeighboredLess(int*, int*, unsigned int)
4.72% 5.1380ms 1 5.1380ms 5.1380ms 5.1380ms reduceInterleaved(int*, int*, unsigned int)
2.68% 2.9131ms 1 2.9131ms 2.9131ms 2.9131ms reduceUnrolling2(int*, int*, unsigned int)
1.45% 1.5792ms 1 1.5792ms 1.5792ms 1.5792ms reduceUnrolling4(int*, int*, unsigned int)
0.83% 901.11us 1 901.11us 901.11us 901.11us reduceUnrolling8(int*, int*, unsigned int)
0.78% 849.88us 1 849.88us 849.88us 849.88us reduceUnrollWarps8(int*, int*, unsigned int)
0.77% 838.42us 1 838.42us 838.42us 838.42us reduceCompleteUnrollWarps8(int*, int*, unsigned int)
0.75% 821.56us 1 821.56us 821.56us 821.56us void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
0.09% 102.27us 9 11.363us 5.2480us 20.416us [CUDA memcpy DtoH]
==11226== API calls:
Time(%) Time Calls Avg Min Max Name
49.61% 248.53ms 2 124.26ms 574.69us 247.95ms cudaMalloc
27.09% 135.73ms 1 135.73ms 135.73ms 135.73ms cudaDeviceReset
15.93% 79.815ms 18 4.4342ms 33.741us 9.0381ms cudaMemcpy
6.14% 30.739ms 18 1.7077ms 122.83us 8.4672ms cudaDeviceSynchronize
0.56% 2.7905ms 364 7.6660us 128ns 274.24us cuDeviceGetAttribute
0.29% 1.4569ms 4 364.24us 326.50us 378.80us cuDeviceTotalMem
0.14% 683.42us 1 683.42us 683.42us 683.42us cudaGetDeviceProperties
0.10% 499.91us 2 249.96us 185.44us 314.47us cudaFree
0.09% 446.98us 9 49.663us 41.375us 67.690us cudaLaunch
0.04% 214.37us 4 53.591us 52.862us 54.100us cuDeviceGetName
0.00% 12.251us 27 453ns 146ns 1.2490us cudaSetupArgument
0.00% 11.131us 1 11.131us 11.131us 11.131us cudaSetDevice
0.00% 8.9040us 9 989ns 785ns 1.8110us cudaConfigureCall
0.00% 4.0720us 3 1.3570us 257ns 3.4250us cuDeviceGetCount
0.00% 2.9850us 12 248ns 125ns 518ns cuDeviceGet
These results show that copying the data to the device takes about 80 ms, while the entire reduction on the CPU takes only about 60 ms; the transfers dominate the GPU kernels' runtime.
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: reduceInterleaved(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 96.15% 96.15% 96.15%
1 gst_efficiency Global Memory Store Efficiency 95.52% 95.52% 95.52%
Kernel: reduceCompleteUnrollWarps8(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 107.54% 107.54% 107.54%
1 gst_efficiency Global Memory Store Efficiency 99.40% 99.40% 99.40%
Kernel: reduceNeighbored(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 25.02% 25.02% 25.02%
1 gst_efficiency Global Memory Store Efficiency 25.00% 25.00% 25.00%
Kernel: reduceUnrolling8(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 99.21% 99.21% 99.21%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
Kernel: reduceUnrolling4(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 98.68% 98.68% 98.68%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
Kernel: reduceUnrolling2(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 98.04% 98.04% 98.04%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
Kernel: void reduceCompleteUnroll<unsigned int=512>(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 107.54% 107.54% 107.54%
1 gst_efficiency Global Memory Store Efficiency 99.40% 99.40% 99.40%
Kernel: reduceUnrollWarps8(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 107.54% 107.54% 107.54%
1 gst_efficiency Global Memory Store Efficiency 99.40% 99.40% 99.40%
Kernel: reduceNeighboredLess(int*, int*, unsigned int)
1 gld_efficiency Global Memory Load Efficiency 25.02% 25.02% 25.02%
1 gst_efficiency Global Memory Store Efficiency 25.00% 25.00% 25.00%