1-1、CPU Cache 和缓存行(转)

image.png

引言

先看看下面这个循环遍历哪个更快

int[][] array = new int[64 * 1024][1024];
        // 横向遍历
        long marked = System.currentTimeMillis();
        for (int i = 0; i < 64 * 1024; i++) {
            for (int j = 0; j < 1024; j++) {
                array[i][j]++;
            }
        }
        System.out.println("横向遍历 times:" + (System.currentTimeMillis() - marked) + "ms");

        marked = System.currentTimeMillis();
        // 纵向遍历
        for (int i = 0; i < 1024; i++) {
            for (int j = 0; j < 64 * 1024; j++) {
                array[j][i]++;
            }
        }
        System.out.println("纵向遍历 times:" + (System.currentTimeMillis() - marked) + "ms");

在 CPU处理器参数为 2.8 GHz Intel Core i7 的Mac上运行的结果是

横向遍历 times:64ms
纵向遍历 times:1689ms

Gallery of Processor Cache Effects 用7个源码示例生动的介绍 cache 原理，深入浅出！但是可能因操作系统的差异、编译器是否优化，以及近些年 cache 性能的提升，有些样例在 Mac 的效果与原文相差较大。

CPU Cache

CPU 访问内存时，首先查询 cache 是否已缓存该数据。如果有，则返回数据，无需访问内存；如果不存在，则需把数据从内存中载入 cache，最后返回给理器。在处理器看来，缓存是一个透明部件，旨在提高处理器访问内存的速率，所以从逻辑的角度而言，编程时无需关注它，但是从性能的角度而言，理解其原理和机制有助于写出性能更好的程序。Cache 之所以有效，是因为程序对内存的访问存在一种概率上的局部特征：

Spatial Locality： 对于刚被访问的数据，其相邻的数据在将来被访问的概率高。
Temporal Locality： 对于刚被访问的数据，其本身在将来被访问的概率高。

以Mac OS 为例，可以使用命令 ysctl -a | grep hw 查询 cache 信息，单位是字节 Byte。

ysctl -a | grep hw

hw.cachelinesize: 64
hw.l1icachesize: 32768     一级指令缓存
hw.l1dcachesize: 32768    一级数据缓存
hw.l2cachesize: 262144
hw.l3cachesize: 6291456

CacheLine size : 64 Byte
L1 Data Cache：32 KB
L1 Instruction Cache：32 KB
L2 Cache：256KB
L3 Cache：6MB

下图是计算机存储的基本结构。L1，L2，L3分别表示一级缓存、二级缓存、三级缓存。越靠近CPU的缓存，速度越快，容量也越小。L1缓存小但很快，并且紧靠着在使用它的CPU内核。分为指令缓存和数据缓存；L2大一些，也慢一些，并仍然只能被一个单独的CPU核所使用；L3更大、更慢，并且被单个插槽上所有的CPU核共享；最后是主存，由全部插槽上的所有CPU核共享。

image.png

当CPU执行运算的时候，它先去L1查找所需要的数据、再去L2、然后是L3，如果最后这些缓存中都没有，所需的数据就要去主内存拿。走的越远，运算耗费的时间就越长。所以要尽量确保数据在L1缓存中。

Martin和Mike的 QCon presentation 演讲中给出了一些缓存未命中的消耗数据，也就是从CPU访问不同层级数据的时间概念:

从CPU到	大约需要的CPU时钟周期	大约需要的时间
主存	约60-80ns
QPI 总线传输(between sockets, not drawn)	约20ns
L3 cache	约40-45 cycles	约15ns
L2 cache	约10 cycles	约3ns
L1 cache	约3-4 cycles	约1ns
寄存器	1	cycle

从这里可以看出CPU读取主存中的数据会比从L1中读取慢了将近2个数量级。

我们在每隔 64 Byte (cache line size) 访问一次，访问固定次数。随着array的增大，看看能不能测试出 L1，L2 和 L3 cache size的大小：

public static void main(String[] args) {
        for (int ARRAY_SIZE = 512; ARRAY_SIZE < 128 * 1024 * 1024; ARRAY_SIZE = ARRAY_SIZE << 1) {
            int steps = 640 * 1024 * 1024; // Arbitrary number of steps
            int length_mod = ARRAY_SIZE - 1;
            char[] arr = new char[ARRAY_SIZE];

            long marked = System.currentTimeMillis();
            for (int i = 0; i < steps; i += 64) {
                arr[i & length_mod]++; // (i & length_mod) is equal to (i % length_mod)
            }
            long used = (System.currentTimeMillis() - marked);
            System.out.println(formatSize(ARRAY_SIZE) + "\t" + used);
        }
    }

    /**
     * 把size单位转化为KB, MB, GB
     */
    public static String formatSize(long size) {
        String hrSize = null;

        double b = size;
        double k = size / 1024.0;
        double m = ((size / 1024.0) / 1024.0);
        double g = (((size / 1024.0) / 1024.0) / 1024.0);
        double t = ((((size / 1024.0) / 1024.0) / 1024.0) / 1024.0);

        DecimalFormat dec = new DecimalFormat("0");

        if (t > 1) {
            hrSize = dec.format(t).concat(" TB");
        } else if (g > 1) {
            hrSize = dec.format(g).concat(" GB");
        } else if (m > 1) {
            hrSize = dec.format(m).concat(" MB");
        } else if (k > 1) {
            hrSize = dec.format(k).concat(" KB");
        } else {
            hrSize = dec.format(b).concat(" Bytes");
        }
        return hrSize;
    }

运行结果如下：

image.png

可以看到32KB，256KB，4MB之后耗时均有明显上升。

缓存行Cache Line

Cache是由很多个 Cache line 组成的。Cache line 是 chche 和 RAM交换的最小单位，通常是 64 Byte。当CPU 把内存的数据载入 cache时，会把临近的 64 Byte 的数据一同放入同一个 Cache line ，因为空间局部性；临近的数据在将来被访问的可能性最大。

以大小为 32 KB ，cache line 的大小为 64 Byte 的L1级缓存为例，对于不同存放规则，其硬件图简单表示一种设计：

image.png

伪共享False Sharing

当多线程修改互相独立的变量时，如果这些变量共享同一个缓存行，就会无意中影响彼此的性能，这就是伪共享。缓存行上的写竞争是运行在SMP系统中并行线程实现可伸缩性最重要的因素。有人将伪共享描述成无声的性能杀手。

image.png

下面通过一段代码，看看伪共享对性能的影响。

/**
 * 伪共享
 */
public class FalseSharingNo implements Runnable {

    public final static long ITERATIONS = 500L * 1000L * 100L;

    private int arrayIndex = 0;

    private static ValuePadding[] longs;

    public FalseSharingNo(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        for (int i = 1; i < 10; i++) {
            System.gc();
            final long start = System.currentTimeMillis();
            runTest(i);
            System.out.println(i + " Threads, duration = " + (System.currentTimeMillis() - start));
        }

    }

    private static void runTest(int NUM_THREADS) throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        longs = new ValuePadding[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new ValuePadding();
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseSharingNo(i));
        }

        for (Thread t : threads) {
            t.start();
        }

        for (Thread t : threads) {
            t.join();
        }
    }

    @Override
    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = 0L;
        }
    }

    public final static class ValuePadding {
        protected long p1, p2, p3, p4, p5, p6, p7;
        protected volatile long value = 0L;
        protected long p9, p10, p11, p12, p13, p14;
        protected long p15;
    }

    public final static class ValueNoPadding {
        // protected long p1, p2, p3, p4, p5, p6, p7;
        protected volatile long value = 0L;
        // protected long p9, p10, p11, p12, p13, p14, p15;
    }
}

在分别使用 ValuePadding 和 ValueNoPadding 两种对象，让多线程分别访问数组中相邻的对象，试图构建一个伪共享的场景。在有Padding填充的情况下，看看运行结果：

1 Threads, duration = 390
2 Threads, duration = 402
3 Threads, duration = 418
4 Threads, duration = 442
5 Threads, duration = 467
6 Threads, duration = 537
7 Threads, duration = 591
8 Threads, duration = 648
9 Threads, duration = 730

把代码中 ValuePadding 都替换为 ValueNoPadding 后的结果：

1 Threads, duration = 390
2 Threads, duration = 2016
3 Threads, duration = 2325
4 Threads, duration = 2644
5 Threads, duration = 3621
6 Threads, duration = 3805
7 Threads, duration = 3532
8 Threads, duration = 3612
9 Threads, duration = 2944

Cache Line 伪共享解决方案

处理伪共享的两种方式：

1、字节填充：增大元素的间隔，使得不同线程存取的元素在不同的 cache line 上，典型的空间换时间。
2、在每个线程中创建对应元素的本地拷贝，结束后再写回全局数组

我们这里只看第一种字节填充。保证不同线程的变量存在于不同的 Cache line 即可，这样就不会出现伪贡献问题。在代码层面如何实现图中的字节填充呢？

image.png

Java6 中实现字节填充

public class PaddingObject{
    public volatile long value = 0L;    // 实际数据
    public long p1, p2, p3, p4, p5, p6; // 填充
}

PaddingObject 类中需要保存一个long类型的value值，如果多线程操作同一个CacheLine中的PaddingObject对象，便无法发挥出 CPU Cache的优势(想象一下你定义了一个 PaddingObject[] 数组，数组元素在内存中连续，却由于伪共享导致无法使用 CPU Cache 带来的沮丧)。

不知道你注意没有，实际数据 value + 用于填充的 p1-p6总共只占据了 7 * 8 = 56 个字节，而 Cache Line 的大小应该是64KB，这是有意为之，因为在 Java中，对象头还占据了 8个字节，所以一个 PaddingObject 对象可以恰好占据一个 Cache Line。

Java7 中实现字节填充

在 Java7 之后，一个 JVM 的优化给字节填充造成了一些影响，上面的代码片段 public long p1, p2, p3, p4, p5, p6; 会被认为是无效代码被优化掉，有回归到了伪共享的窘境之中。

为了避免 JVM 的自动优化，需要使用继承的方式来填充。

abstract class AbstractPaddingObject{
    protected long p1, p2, p3, p4, p5, p6;// 填充
}

public class PaddingObject extends AbstractPaddingObject{
    public volatile long value = 0L;    // 实际数据
}

Java8 中实现字节填充

//JDK 8中提供的注解
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
public @interface Contended {

    /**
     * The (optional) contention group tag.
     * This tag is only meaningful for field level annotations.
     *
     * @return contention group tag.
     */
    String value() default "";
}

在JDK 8里提供了一个新注解@Contented，可以用来减少 false sharing 的情况。JVM 在计算对象布局的时候就会自动把标注的字段拿出来并且插入合适的大小Padding。

因为这个功能暂时还是实验性功能，暂时还没到默认普及给用户代码用的程度。要在用户代码(非Bootstrap class loader 或 extension class loader 所加载的类) 中使用 @Contended 注解的话，需要使用 -XX:-RestrictContended 参数。

比如在JDK8中的 ConcurrentHashMap 源码中，使用 @sun.misc.Contended 对静态内部类 CounterCell 进行了修饰。

/* ---------------- Counter support -------------- */

/**
 * A padded cell for distributing counts.  Adapted from LongAdder
 * and Striped64.  See their internal docs for explanation.
 */
@sun.misc.Contended 
static final class CounterCell {
        volatile long value;
        CounterCell(long x) { value = x; }
}

Thread

Thread 线程类的源码中，使用 @sun.misc.Contended 对成员变量进行修饰。

// The following three initially uninitialized fields are exclusively
// managed by class java.util.concurrent.ThreadLocalRandom. These
// fields are used to build the high-performance PRNGs in the
// concurrent code, and we can not risk accidental false sharing.
// Hence, the fields are isolated with @Contended.

/** The current seed for a ThreadLocalRandom */
@sun.misc.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@sun.misc.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@sun.misc.Contended("tlr")
int threadLocalRandomSecondarySeed;