Problem 1: Low Memcpy/Compute Overlap

The percentage of time when memcpy is being performed in parallel with compute is low.

Reference: Nsight manual, Chapter 9, Memory Optimizations.

9.1 Data Transfer Between Host and Device

High Priority:

1. Minimize data transfer between the host and the device, even if it means running some kernels on the device that show no speedup compared with running them on the host.

2. Create intermediate data structures in device memory, operate on them on the device, and destroy them there, without ever mapping or copying them back to the host.

3. Use pinned memory (i.e., page-locked, non-pageable memory), but don't overuse it, since excessive pinning can degrade overall system performance. (A sketch combining this with item 4 follows this list.)

Functions: cudaHostAlloc(), cudaHostRegister() (for regions of system memory that have already been allocated)

4. Use cudaMemcpyAsync() instead of cudaMemcpy(). (Asynchronous transfers require pinned host memory.)

example:

cudaMemcpyAsync(a_d,a_h,size,cudaMemcpyHostToDevice,0);
kernel<<<grid,block>>>(a_d);
cpuFunction();

cpuFunction() overlaps the kernel execution (that is, the CPU-side code and the device-side code run at the same time).
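To make items 3 and 4 concrete, here is a minimal runnable sketch of pinned memory plus an asynchronous copy overlapped with CPU work. The kernel vectorScale, the array size N, and the CPU busywork are my own placeholders, not from the manual:

// Sketch: pinned host memory + cudaMemcpyAsync + overlapped CPU work.
// vectorScale and N are illustrative placeholders.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorScale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;               // trivial device work
}

int main(void) {
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);
    float *a_h, *a_d;

    cudaHostAlloc((void**)&a_h, size, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void**)&a_d, size);
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    // The async copy returns control to the host immediately...
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    vectorScale<<<(N + 255) / 256, 256>>>(a_d, N);

    // ...so this CPU work overlaps the transfer and the kernel.
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++) acc += 0.5;

    cudaDeviceSynchronize();                // wait before reading results
    printf("overlapped CPU result: %f\n", acc);

    cudaFree(a_d);
    cudaFreeHost(a_h);
    return 0;
}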

However, CUDA supports a further optimization: using host-device data transfer to hide kernel computation time (that is, running compute and memcpy simultaneously). Time for streams to take the stage! Enough talk, here's the code:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d); // note: the kernel operates on otherData_d, not a_d

This snippet performs the memcpy in stream1 while executing the kernel in stream2, killing two birds with one stone. (PS: plain cudaMemcpy() forces the copy to complete before the kernel runs.) A fuller, self-contained version is sketched below.
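Here is that fuller version; scaleKernel, the sizes, and the contents of otherData_d are assumptions of mine. Note that the host buffer must be pinned for the copy to be truly asynchronous, and both streams are synchronized before the results are used:

#include <cuda_runtime.h>

// Placeholder kernel standing in for "kernel" above.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 3.0f;
}

int main(void) {
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);
    float *a_h, *a_d, *otherData_d;

    cudaHostAlloc((void**)&a_h, size, cudaHostAllocDefault);  // pinned
    cudaMalloc((void**)&a_d, size);
    cudaMalloc((void**)&otherData_d, size);

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // The copy (stream1) and the kernel on unrelated data (stream2)
    // go to different streams, so they can run concurrently.
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
    scaleKernel<<<(N + 255) / 256, 256, 0, stream2>>>(otherData_d, N);

    cudaStreamSynchronize(stream1);   // a_d now holds the copied data
    cudaStreamSynchronize(stream2);   // kernel results are ready

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(a_d);
    cudaFree(otherData_d);
    cudaFreeHost(a_h);
    return 0;
}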

Usage scenario: this technique can be used when the data dependency allows the data to be broken into chunks and transferred in multiple stages, launching multiple kernels to operate on each chunk as it arrives.

Have a look at the example below first; then I'll explain that statement.

// Sequential copy and execute
cudaMemcpy(a_d, a_h, N*sizeof(float), cudaMemcpyHostToDevice);
kernel<<<N/nThreads, nThreads>>>(a_d);

// Staged concurrent copy and execute
size = N*sizeof(float)/nStreams;
for (int i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}

Of the two snippets above, the first uses no streams, so the memcpy and the kernel execute strictly one after the other; the second uses streams to process asynchronously. Performance doesn't skyrocket (I mean compared with optimizations that claim n-fold speedups at every turn), but the idea is beautiful (if "beautiful" is the right word)! A runnable version of the staged pattern is sketched below.
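For concreteness, here is a self-contained sketch of the staged pattern under my own assumptions: the kernel name stageKernel, N divisible by nStreams, and pinned host memory so the copies are genuinely asynchronous:

// Staged concurrent copy and execute: each chunk's copy overlaps the
// previous chunk's kernel. stageKernel, N, nThreads, and nStreams are
// illustrative choices, not from the original.
#include <cuda_runtime.h>

__global__ void stageKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f + 1.0f;
}

int main(void) {
    const int N = 1 << 22, nThreads = 256, nStreams = 4;
    const int chunk = N / nStreams;              // assumes N % nStreams == 0
    const size_t size = chunk * sizeof(float);
    float *a_h, *a_d;
    cudaStream_t stream[nStreams];

    cudaHostAlloc((void**)&a_h, N * sizeof(float), cudaHostAllocDefault); // pinned
    cudaMalloc((void**)&a_d, N * sizeof(float));
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&stream[i]);

    for (int i = 0; i < nStreams; i++) {
        int offset = i * chunk;
        // Copy chunk i, then launch its kernel in the same stream;
        // stream i's copy overlaps stream i-1's kernel.
        cudaMemcpyAsync(a_d + offset, a_h + offset, size,
                        cudaMemcpyHostToDevice, stream[i]);
        stageKernel<<<chunk / nThreads, nThreads, 0, stream[i]>>>(a_d + offset, chunk);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; i++) cudaStreamDestroy(stream[i]);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return 0;
}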

The Nsight manual gives a formula for estimating the total time. Assume tE = kernel execution time and tT = transfer time:

if tE > tT:  total time ≈ tE + tT/nStreams
if tE < tT:  total time ≈ tT + tE/nStreams

Looking at these formulas: compared with the sequential total tE + tT, the staged version saves min(tE, tT) * (1 - 1/nStreams), so the benefit of streams is most pronounced when tE and tT are comparable in size.
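A quick worked example (numbers are mine, not from the manual): with tE = 8 ms, tT = 8 ms, and nStreams = 4, the sequential total is tE + tT = 16 ms, while the staged total is roughly 8 + 8/4 = 10 ms, a 37.5% saving. If instead tT = 80 ms and tE = 8 ms, the staged total is 80 + 8/4 = 82 ms versus 88 ms sequentially, a saving of under 7%.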

And now, a gorgeous dividing line ———————————————————————————

Above we saw how to use streams so that memcpy and kernel execution hide each other's latency.

Next, let's use zero copy (requires CUDA 2.2 or later) to achieve the same goal.

As before, code first:

float *a_h, *a_map;
...
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory) exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
kernel<<<gridSize, blockSize>>>(a_map);

Even without streams, this overlaps CPU-GPU memory transfer with kernel execution: the kernel dereferences the mapped host pointer directly, so data moves across the bus on demand as the kernel touches it.
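For completeness, a matching kernel might look like the sketch below (my placeholder, assuming gridSize*blockSize equals the number of floats in nBytes). Every access to a_map crosses the PCIe bus, so zero copy pays off when data is read or written only once:

// Hypothetical kernel for the zero-copy example: a_map is a device
// pointer into pinned host memory, so each access travels over PCIe.
__global__ void kernel(float *a_map) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a_map[i] += 1.0f;   // touches host memory directly, no explicit memcpy
}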