Coalesced Memory Access | FirstMoonlight

This post is mainly based on the following video: Requests, Wavefronts, Sectors Metrics: Understanding and Optimizing Memory

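1. Memory Access Flow

The figure below shows how global memory loads and stores go through the caches. A global memory access first goes to the L1 cache; on a miss, the L2 cache is searched next; if that also misses, the data is read from DRAM.

L1 is the first-level cache. Each SM has its own L1, while L2 is shared by all SMs; both caches are on-chip. When a kernel runs, it reads its data from global memory (DRAM).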

2. Aligned and Coalesced Access

9.2.1. Coalesced Access to Global Memory

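On devices of compute capability 5.2, L1 caching of global memory accesses can optionally be enabled. If it is enabled, the number of memory transactions required equals the number of 128-byte aligned segments touched, i.e. the granularity is 128 bytes.

On devices of compute capability 6.0 and higher, the minimum unit of a coalesced access is 32 bytes, i.e. one sector. L1 caching is enabled by default, but the data access unit is always 32 bytes, whether or not global loads are cached in L1.

The rest of this post assumes a device of compute capability 6.0 or higher, i.e. an access granularity of 32 bytes.

For the behavior of compute capability 5.2 devices, see CUDA编程学习笔记-03(内存访问).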

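2.1 Aligned Access

A memory access is aligned when the first address of the memory transaction is a multiple of 32. Every other case is an unaligned access, and unaligned accesses waste bandwidth.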
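To make the bandwidth cost of misalignment concrete, here is a minimal Python sketch that counts how many 32-byte sectors a contiguous byte range overlaps. The helper names `is_aligned` and `sectors_for_range` are hypothetical and are not part of any CUDA API; the model only assumes the 32-byte granularity described above.

```python
SECTOR_BYTES = 32  # access granularity assumed throughout this post


def is_aligned(start: int, sector: int = SECTOR_BYTES) -> bool:
    """Return True if the first byte address is a multiple of the sector size."""
    return start % sector == 0


def sectors_for_range(start: int, nbytes: int, sector: int = SECTOR_BYTES) -> int:
    """Count the sectors overlapped by the byte range [start, start + nbytes)."""
    first = start // sector
    last = (start + nbytes - 1) // sector
    return last - first + 1


if __name__ == "__main__":
    for start in (96, 100):
        n = sectors_for_range(start, 128)
        # useful bytes divided by fetched bytes
        print(start, is_aligned(start), n, 128 / (n * SECTOR_BYTES))
```

In this model, a 128-byte range starting at address 96 overlaps 4 sectors, while the same range shifted to address 100 overlaps 5, so one fifth of the fetched bytes go unused.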

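2.2 Coalesced Access

An access is coalesced when all the threads of a warp access memory that lies within a single contiguous block.

2.3 Aligned and Coalesced Access

When the data accessed by all threads of a warp is contiguous in memory and the start address of that data is a multiple of 32, the access is both aligned and coalesced. This is the most efficient case.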

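Aligned and coalesced memory access

Unaligned and uncoalesced memory access

In the figures above, the aligned and coalesced case shows a warp of 32 threads, each accessing 4 bytes: thread 0 accesses addresses 96 to 99, thread 1 accesses addresses 100 to 103, ..., and thread 31 accesses addresses 220 to 223. The data accessed by the whole warp forms one contiguous 128-byte range, 96 to 223, so the warp needs only a single transaction (four 32-byte sectors) rather than one transaction per thread. In the unaligned and uncoalesced case, the accessed addresses do not form one contiguous range, so the access takes multiple transactions.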
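The difference between the two figures can be sketched by counting the distinct 32-byte sectors that a warp touches. This is a simplified Python model rather than a measurement: `sectors_touched` is a hypothetical helper, and the `scattered` address pattern is made up purely for illustration.

```python
SECTOR_BYTES = 32
WARP_SIZE = 32


def sectors_touched(addresses, access_bytes=4, sector=SECTOR_BYTES):
    """Return the set of sector indices touched by one warp-wide access."""
    touched = set()
    for addr in addresses:
        first = addr // sector
        last = (addr + access_bytes - 1) // sector
        touched.update(range(first, last + 1))
    return touched


# Thread t reads 4 bytes at 96 + 4*t, as in the aligned and coalesced figure.
coalesced = [96 + 4 * t for t in range(WARP_SIZE)]
# Hypothetical scattered pattern: each thread lands in a different cache line.
scattered = [4 + 128 * t for t in range(WARP_SIZE)]

if __name__ == "__main__":
    print(len(sectors_touched(coalesced)))
    print(len(sectors_touched(scattered)))
```

Under this model the coalesced warp touches sectors 3 to 6 (4 sectors for 128 useful bytes), while the scattered warp touches 32 sectors for the same 128 useful bytes.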

3. Cache

See: Behavior of L1/L2 caches; The granularity of L1 and L2 caches

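As far as I can tell, the memory model of GPUs from Pascal onward differs somewhat from what is described in the blog post CUDA编程学习笔记-03(内存访问).

The L1 cache line is still 128 bytes.
Both L1 and L2 cache lines are divided into sectors: each cache line consists of 4 sectors of 32 bytes each.
Memory is accessed in units of sectors, i.e. every access fetches at least 32 bytes.
An L1 cache miss no longer triggers a 128-byte access to the L2 cache; the minimum granularity is one sector.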
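The line and sector layout listed above can be written down as simple address arithmetic. The following Python sketch is illustrative only; `decompose` is a hypothetical helper, and real hardware additionally maps lines onto sets and slices, which is ignored here.

```python
LINE_BYTES = 128
SECTOR_BYTES = 32
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES  # 4 sectors per cache line


def decompose(addr: int):
    """Split a byte address into (cache line, sector within line, byte within sector)."""
    line = addr // LINE_BYTES
    sector_in_line = (addr % LINE_BYTES) // SECTOR_BYTES
    byte_in_sector = addr % SECTOR_BYTES
    return line, sector_in_line, byte_in_sector


if __name__ == "__main__":
    print(decompose(96))
    print(decompose(200))
```

For example, address 200 falls in cache line 1, sector 2 of that line, at byte 8 within the sector, so a miss on it would request only that one sector under the model described above.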

In Fermi/Kepler days, a miss on the L1 triggered a 128byte request to the L2. Somewhere between Maxwell and Pascal this changed to a 32-byte granularity. You’ll fetch 128 bytes if you have a request that needs 128 bytes. For example if you have a warp-wide load of a float or int per thread, adjacent. The advantage of getting only 1 sector on a cache miss needs to be considered in the case of a warp request that only needs 32 bytes or less. In that case, it is preferable to request 32 bytes rather than 128.

GM10x, GM20x, and GP10x have very similar TEX/L1 designs. Starting with GM20x TEX/L1 caching of global memory loads can be enabled on non-constant data. See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x. For all Maxwell - Pascal the L1/TEX cache line size is 128B consisting of 4 32B sectors. On a load miss only the 32B sectors in the cache line that were accessed are loaded from L2. The TEX/L1 cache will make the number of 32B requests required to satisfy all threads. Additional sectors in the cache line that are not accessed will not be pre-fetched. The CUDA profilers have enough TEX and L2 counters to write a quick test to show this behavior.

3.1 `L1 Cache`

3.1.1 Properties of the `L1 Cache`

The point to note is that each cache line is 128 B, while the granularity of the L1 cache is one sector, i.e. 32 B.

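3.1.2 Load and Store Flow of the `L1 Cache`

A warp issues an instruction, i.e. a load or a store.
The load/store instruction is sent to the MIO module for processing.
MIO sends requests to L1TEX. One request carries the access information of all 32 threads of a warp.
The LSU IN module (Load Store Input Unit) receives these requests.
Global/local requests are forwarded to the Tag Stage.
The Tag Stage computes the tags and determines whether the access hits in the cache.
wavefronts: while being processed, requests are converted into wavefronts. A wavefront can be thought of as a work package whose items are processed together in the Tag Stage pipeline, i.e. within one cycle. If a request cannot be handled by a single wavefront, several wavefronts are processed serially to complete it, so the number of wavefronts a request needs indicates how efficiently the L1 cache handles it.
On a miss, the Miss Stage sends requests to the L2 cache for the data, in units of sectors.
- Once the data returns from the L2 cache, the Miss Stage pushes it to the Data Stage, which returns it to the SM.
On a hit, the Miss Stage is skipped and the data is returned directly to the SM.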
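The hit and miss handling in the steps above can be illustrated with a toy sectored cache in Python. This is only a sketch under strong assumptions: the class `SectoredCache` is hypothetical, capacity is unlimited, nothing is ever evicted, and wavefronts and the individual pipeline stages are not modeled. It shows just one idea from the flow: a miss fetches only the sectors that were actually requested.

```python
from collections import defaultdict

LINE_BYTES = 128
SECTOR_BYTES = 32


class SectoredCache:
    """Toy cache: one tag per 128-byte line, one valid bit per 32-byte sector."""

    def __init__(self):
        # cache line index -> set of sector slots (0..3) that hold valid data
        self.valid = defaultdict(set)

    def load(self, addresses, access_bytes=4):
        """Serve one warp-wide load and return how many sectors missed.

        Assumes access_bytes <= 32, so checking the first and last byte of
        each access is enough to find every sector it touches.
        """
        needed = set()
        for addr in addresses:
            for byte in (addr, addr + access_bytes - 1):
                needed.add((byte // LINE_BYTES, (byte % LINE_BYTES) // SECTOR_BYTES))
        missed = 0
        for line, slot in needed:
            if slot not in self.valid[line]:
                self.valid[line].add(slot)  # fill only the missing sector
                missed += 1
        return missed


if __name__ == "__main__":
    l1 = SectoredCache()
    warp = [4 * t for t in range(32)]
    print(l1.load([0]))   # a single thread touches a single sector
    print(l1.load(warp))  # the other three sectors of the line are fetched
    print(l1.load(warp))  # everything is already valid
```

In this model the three loads miss 1, 3, and 0 sectors respectively: the first load brings in only the one sector it needs, not the whole 128-byte line.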

3.1.3 The `L1 Cache` in `Nsight Compute`

As the figure below shows, the data for each stage of the L1 cache flow is visible in the chart of the Memory Workload Analysis section of Nsight Compute.

3.1.4 `L1 Cache` Loads Examples

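3.1.4.1 4-byte access and 4-byte stride

Each thread requests 4 bytes with a stride of 4 bytes: thread 0 requests [0, 3], thread 1 requests [4, 7], ..., thread 31 requests [124, 127]. There are 1024 warps in total. Nsight Compute shows that each request needs 4 sectors, i.e. 128 bytes, so these accesses are coalesced.

3.1.4.2 4-byte access and 8-byte stride

Each thread requests 4 bytes with a stride of 8 bytes: thread 0 requests [0, 3], thread 1 requests [8, 11], ..., thread 31 requests [248, 251].

There are 1024 warps in total. Nsight Compute shows that each request needs 8 sectors, i.e. 256 bytes. Half of the data fetched by every access is unused, so memory efficiency drops by 50% compared with the 4-byte stride.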
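The sector counts in these two examples follow from plain arithmetic, which the Python sketch below reproduces. `load_efficiency` is a hypothetical helper; it assumes the warp starts at address 0 and that each access is at most 32 bytes, and it models only the sector count, not what a profiler would report on a particular GPU.

```python
SECTOR_BYTES = 32
WARP_SIZE = 32


def load_efficiency(stride, access_bytes=4):
    """Return (sectors per request, useful bytes / fetched bytes) for one warp."""
    addrs = [t * stride for t in range(WARP_SIZE)]
    # first and last byte of each access are enough while access_bytes <= 32
    sectors = {b // SECTOR_BYTES for a in addrs for b in (a, a + access_bytes - 1)}
    requested = WARP_SIZE * access_bytes
    fetched = len(sectors) * SECTOR_BYTES
    return len(sectors), requested / fetched


if __name__ == "__main__":
    for stride in (4, 8):
        print(stride, load_efficiency(stride))
```

With a 4-byte stride the model gives 4 sectors per request and an efficiency of 1.0; with an 8-byte stride it gives 8 sectors and 0.5, matching the numbers above.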

3.2 `L2 Cache`

3.2.1 Properties of the `L2 Cache`

Note: each cache line is 128 B, and the cache granularity is one sector, i.e. 32 B, the same as for the L1 cache.

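3.2.2 Load Flow of the `L2 Cache`

As the L1 cache load flow shows, the Miss Stage of L1TEX sends requests to the LTS, i.e. an L2 Cache Slice.
The Tag Stage receives the requests and pushes the hit or miss information to the Data Stage.
- On a miss, the Data Stage fetches the data from memory and returns it to the Miss Stage of the L1 cache.
- On a hit, the Data Stage returns the data directly to the Miss Stage of the L1 cache.
- The L2 cache is split into two partitions, so a coherence mechanism (the L2 fabric) is needed to keep the two partitions in sync.
- The Atomic module handles atomic operations; if there are any, the Data Stage passes them to the Atomic module for processing.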

3.2.3 The `L2 Cache` in `Nsight Compute`

As the figure below shows, the data for each stage of the L2 cache flow is visible in the chart of the Memory Workload Analysis section of Nsight Compute.

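3.2.4 `L2 GLOBAL LOADS`

3.2.4.1 `4-byte access and 4-byte stride`

Each thread requests 4 bytes with a stride of 4 bytes. A single cache line covers the whole request, so the request count is 1. In total 128 B are needed, i.e. 4 sectors. All requests miss, so the data has to be requested from global memory.

3.2.4.2 `4-byte access and 32-byte stride`

Each thread requests 4 bytes with a stride of 32 bytes. Each thread needs only 4 B, but the minimum granularity of the L2 cache is 32 B, so 32 sectors are needed in total. The accesses span 8 cache lines, so the request count is 8. All requests miss, so 32 sectors have to be requested from global memory.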
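The request and sector counts for these L2 examples can be checked with a short Python sketch that counts the distinct 128-byte cache lines (treated here as L2 requests, following the reasoning above) and the distinct 32-byte sectors touched by one warp. `l2_footprint` is a hypothetical helper that assumes the warp starts at address 0 and that each access is at most 32 bytes.

```python
LINE_BYTES = 128
SECTOR_BYTES = 32
WARP_SIZE = 32


def l2_footprint(stride, access_bytes=4):
    """Return (cache lines touched, sectors touched) for one warp-wide load."""
    lines, sectors = set(), set()
    for t in range(WARP_SIZE):
        start = t * stride
        for byte in (start, start + access_bytes - 1):
            lines.add(byte // LINE_BYTES)
            sectors.add(byte // SECTOR_BYTES)
    return len(lines), len(sectors)


if __name__ == "__main__":
    for stride in (4, 32):
        print(stride, l2_footprint(stride))
```

For a 4-byte stride the model gives 1 cache line and 4 sectors; for a 32-byte stride it gives 8 cache lines and 32 sectors, in line with the counts above.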

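3.2.4.3 `4-byte access and 64-byte stride`

(I have not managed to reproduce this case yet. On my RTX A4000, a 64-byte stride still gives 32 sector misses.)

Each thread requests 4 bytes with a stride of 64 bytes. Each thread needs only 4 B, but the minimum granularity of the L2 cache is 32 B, so 32 sectors are needed in total.

L2 requests data from device memory as the left or right half of a cache line, so the granularity there is 2 sectors.

The accesses therefore span 16 cache lines, so the request count is 16, and 64 sectors are needed in total. All requests miss, so 64 sectors have to be requested from global memory.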
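The 64-sector figure follows directly from the half-cache-line assumption, as the Python sketch below shows. It models only the claim described above, namely that L2 fetches from device memory in 64-byte halves of a cache line; as noted, this was not reproduced on an RTX A4000, so the sketch says nothing about how a specific GPU behaves. `dram_sectors` and its `fetch_bytes` parameter are hypothetical names.

```python
SECTOR_BYTES = 32
WARP_SIZE = 32


def dram_sectors(stride, access_bytes=4, fetch_bytes=64):
    """Return (sectors touched in L2, sectors fetched from device memory).

    Assumes every touched sector misses in L2 and that each miss pulls in the
    whole fetch_bytes-sized chunk (half of a 128-byte line) containing it.
    """
    touched, chunks = set(), set()
    for t in range(WARP_SIZE):
        start = t * stride
        for byte in (start, start + access_bytes - 1):
            touched.add(byte // SECTOR_BYTES)
            chunks.add(byte // fetch_bytes)
    return len(touched), len(chunks) * (fetch_bytes // SECTOR_BYTES)


if __name__ == "__main__":
    for stride in (32, 64):
        print(stride, dram_sectors(stride))
```

With a 32-byte stride both sectors of every half line are used, so 32 sectors are touched and 32 are fetched. With a 64-byte stride only one sector per half line is used, so 32 sectors are touched but 64 are fetched under this assumption.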

Tags: cuda learning
