Netty Memory Management Source Code Analysis: jemalloc

Background

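Life is like a besieged city: those outside want in, and those inside want out. Java programmers rarely need to care how a memory allocation algorithm is implemented or how allocated memory is released. After Object obj = new Object(); for example, you do not need to know when the object will be freed, because the GC takes care of it. A C programmer who obtains a block with malloc must release it with free once it is no longer needed. Java programmers are curious about how C programmers manipulate memory directly, while C programmers often envy Java programmers for not having to deal with these details. To allocate and reuse memory efficiently, Netty implements its own allocation and reclamation mechanism based on jemalloc.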

Before the principles

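jemalloc is an excellent memory management algorithm. It is not explored in depth here; readers can look it up with ~~Google~~ Baidu. This article explains how Netty manages pooled direct memory. Netty's management of Java heap buffers is implemented in essentially the same way as its management of direct memory.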

Principles

The figure below shows how Netty partitions memory in its jemalloc-based implementation.

netty_jemalloc.png

Key objects

PooledByteBufAllocator
Memory is requested through the buffer methods that this class provides. Its fields are listed below.

    // Number of heap arenas; computed the same way as the direct arena count below
    private static final int DEFAULT_NUM_HEAP_ARENA;
    // Number of direct arenas; defaults to min(cpu_processors * 2, maxDirectMemory/16M/2/3).
    // Unless -Xmx or -XX:MaxDirectMemorySize is set very small,
    // the number of arenas equals twice the number of processors.
    private static final int DEFAULT_NUM_DIRECT_ARENA;
    // Page size, 8K by default (comparable to a page in operating-system memory management)
    private static final int DEFAULT_PAGE_SIZE;
    // Defaults to 11, because a chunk is 16M = 2^11 * 2^13 (8192) by default
    private static final int DEFAULT_MAX_ORDER; // 8192 << 11 = 16 MiB per chunk
    // Number of tiny buffers to cache, 512 by default
    private static final int DEFAULT_TINY_CACHE_SIZE;
    // Number of small buffers to cache, 256 by default
    private static final int DEFAULT_SMALL_CACHE_SIZE;
    // Number of normal buffers to cache, 64 by default
    private static final int DEFAULT_NORMAL_CACHE_SIZE;
    // Largest buffer that can be cached, 32K by default; a buffer larger than 32K is never put into the cache
    private static final int DEFAULT_MAX_CACHED_BUFFER_CAPACITY;
    // Number of allocations after which the cache is trimmed once, 8192 by default
    private static final int DEFAULT_CACHE_TRIM_INTERVAL;
    private static final long DEFAULT_CACHE_TRIM_INTERVAL_MILLIS;
    // Whether every thread gets a cache, true by default
    private static final boolean DEFAULT_USE_CACHE_FOR_ALL_THREADS;
    private static final int DEFAULT_DIRECT_MEMORY_CACHE_ALIGNMENT;
    // Use 1023 by default as we use an ArrayDeque as backing storage which will then allocate an internal array
    // of 1024 elements. Otherwise we would allocate 2048 and only use 1024 which is wasteful.
    static final int DEFAULT_MAX_CACHED_BYTEBUFFERS_PER_CHUNK;
    //------------------------------ fields in the second half ------------------------------
    // Array of heap arenas
    private final PoolArena<byte[]>[] heapArenas;
    // Array of direct memory arenas
    private final PoolArena<ByteBuffer>[] directArenas;
    // Corresponds to DEFAULT_TINY_CACHE_SIZE above
    private final int tinyCacheSize;
    private final int smallCacheSize;
    private final int normalCacheSize;
    private final List<PoolArenaMetric> heapArenaMetrics;
    private final List<PoolArenaMetric> directArenaMetrics;
    // A FastThreadLocal that holds each thread's own memory cache
    private final PoolThreadLocalCache threadCache;
    // Memory size of each PoolChunk, 16M by default; comparable to a segment in operating-system memory management
    private final int chunkSize;

When PooledByteBufAllocator is initialized, it creates the heap and direct memory arenas. What an arena is will be covered below; here is the source that creates the arena array.

 private static <T> PoolArena<T>[] newArenaArray(int size) {
        return new PoolArena[size];
    }

Each array element is then assigned an arena of the matching type. Below is the source for the direct memory case.

   for (int i = 0; i < directArenas.length; i ++) {
                PoolArena.DirectArena arena = new PoolArena.DirectArena(
                        this, pageSize, maxOrder, pageShifts, chunkSize, directMemoryCacheAlignment);
                directArenas[i] = arena;
                metrics.add(arena);
            }

Arena
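Arena is a jemalloc concept: a memory management unit in which threads allocate and free memory. A system normally has several arenas; each thread is bound to one arena, and one arena can be shared by several threads. The relationship between arenas and threads is shown below.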

arena_thread_relation.png

Now let's look at the fields of PoolArena. There are quite a few, so be patient.

    // maxOrder defaults to 11
    private final int maxOrder;
    // Page size, 8K by default
    final int pageSize;
    // Defaults to 13, since 8192 is 2 to the power of 13
    final int pageShifts;
    // 16M by default
    final int chunkSize;
    // Equals ~(pageSize-1); used to check whether a request is at least one page.
    // If reqCapacity & subpageOverflowMask is 0, the request is smaller than
    // one page; if it is nonzero, the request is greater than or equal to one page.
    final int subpageOverflowMask;
    // Equals pageShifts - 9, which is 4 by default
    final int numSmallSubpagePools;
    final int directMemoryCacheAlignment;
    final int directMemoryCacheAlignmentMask;
    // Array of PoolSubpage list heads for tiny sizes; length 32, used from index=1
    private final PoolSubpage<T>[] tinySubpagePools;
    // Array of PoolSubpage list heads for small sizes; length 4 by default
    private final PoolSubpage<T>[] smallSubpagePools;
    // Each PoolChunkList is a node in a linked list and
    // holds the chunks whose memory usage falls in the same range;
    // for example, q075 holds chunks whose usage has reached 75% or more
    private final PoolChunkList<T> q050;
    private final PoolChunkList<T> q025;
    private final PoolChunkList<T> q000;
    private final PoolChunkList<T> qInit;
    private final PoolChunkList<T> q075;
    private final PoolChunkList<T> q100;

  private final List<PoolChunkListMetric> chunkListMetrics;
    
    // The fields below are all bookkeeping counters
    // Metrics for allocations and deallocations
    private long allocationsNormal;
    // We need to use the LongCounter here as this is not guarded via synchronized block.
    private final LongCounter allocationsTiny = PlatformDependent.newLongCounter();
    private final LongCounter allocationsSmall = PlatformDependent.newLongCounter();
    private final LongCounter allocationsHuge = PlatformDependent.newLongCounter();
    private final LongCounter activeBytesHuge = PlatformDependent.newLongCounter();

    private long deallocationsTiny;
    private long deallocationsSmall;
    private long deallocationsNormal;

    // We need to use the LongCounter here as this is not guarded via synchronized block.
    private final LongCounter deallocationsHuge = PlatformDependent.newLongCounter();

    // Number of thread caches backed by this arena.
    final AtomicInteger numThreadCaches = new AtomicInteger();

Let's first look at how the tinySubpagePools and smallSubpagePools arrays are initialized.

 private PoolSubpage<T>[] newSubpagePoolArray(int size) {
        return new PoolSubpage[size];
    }

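Each element of the SubpagePools arrays is then initialized. PoolSubpage is a node of a doubly linked list, and the prev and next of a new head both point to the head itself.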

private PoolSubpage<T> newSubpagePoolHead(int pageSize) {
        PoolSubpage<T> head = new PoolSubpage<T>(pageSize);
        head.prev = head;
        head.next = head;
        return head;
    }

So after initialization the SubpagePools look like this:

SubpagePools.png

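Before explaining PoolSubpage, it is worth describing Netty's allocation strategy. Netty requests memory from the JVM heap or from off-heap memory in units of chunks, each 16M by default. Inside Netty each chunk is divided into pages of 8K by default, so a default chunk contains 2048 pages.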
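
The default geometry quoted in this article (8K pages, maxOrder 11, 16M chunks, 2048 pages per chunk) can be checked with a few lines of arithmetic. The following is a minimal standalone sketch that does not use Netty; ChunkGeometry and its method names are hypothetical, chosen only for this illustration.

```java
// Standalone sketch; ChunkGeometry is a hypothetical helper, not a Netty class.
final class ChunkGeometry {
    // chunkSize = pageSize << maxOrder (8192 << 11 with the defaults).
    static int chunkSize(int pageSize, int maxOrder) {
        return pageSize << maxOrder;
    }

    // Number of pages in one chunk: one page per leaf of a tree with maxOrder + 1 levels.
    static int pagesPerChunk(int maxOrder) {
        return 1 << maxOrder;
    }

    // pageShifts is log2(pageSize); pageSize is assumed to be a power of two.
    static int pageShifts(int pageSize) {
        return Integer.numberOfTrailingZeros(pageSize);
    }

    public static void main(String[] args) {
        int pageSize = 8192;
        int maxOrder = 11;
        System.out.println("chunkSize     = " + chunkSize(pageSize, maxOrder));
        System.out.println("pagesPerChunk = " + pagesPerChunk(maxOrder));
        System.out.println("pageShifts    = " + pageShifts(pageSize));
    }
}
```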

chunk_page.png

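An application's request to Netty falls into one of two cases:
1) If the request is no larger than a chunk (16M by default), Netty allocates in units of pages. A 10K request, for example, is served with 2 pages from one chunk; a request smaller than a page is served with a single page.
2) If the request is larger than a chunk, Netty requests memory of that size directly from the JVM or the operating system. The figure below shows the direct memory case.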

netty_memory_allocation_strategy.png

PoolSubpage
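From the above we know that a request smaller than a chunk is served in units of pages, and a page is 8K by default. If we request 32B and Netty hands back a whole page, isn't that wasteful? Yes, it is. This is what PoolSubpage is for: it splits an allocated page into equal units sized to the actual request, which uses memory sensibly and reduces fragmentation.
There are two kinds of PoolSubpage:
1) tiny, for requests smaller than 512
2) small, for requests from 512 to 4096
Figures first: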

tinySubpagePools.png

smallSubpagePools.png

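For tiny and small requests Netty manages memory in a set of fixed size classes.

Tiny sizes are multiples of 16 up to 496. The tinySubpagePools array has length 32 and index=0 is unused: index 1 points to the pages that are split into 16B units, and index 31 points to the pages that are split into 496B units.
Small sizes start at 512 and double up to 4096, so smallSubpagePools has length 4: the first element points to the pages split into 512B units, and the last element points to the pages split into 4096B units.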
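
The size-class rules above reduce to a few bit operations. This is a standalone sketch assuming the default 8K page; SizeClasses is a hypothetical class name, and the helpers are written to match the description in this article rather than copied from Netty.

```java
// Standalone sketch; SizeClasses is a hypothetical name and PAGE_SIZE is the default quoted above.
final class SizeClasses {
    static final int PAGE_SIZE = 8192;
    // ~(pageSize - 1): a zero result under this mask means "smaller than one page".
    static final int SUBPAGE_OVERFLOW_MASK = ~(PAGE_SIZE - 1);

    static boolean isTinyOrSmall(int normCapacity) {
        return (normCapacity & SUBPAGE_OVERFLOW_MASK) == 0;
    }

    // Tiny means smaller than 512, i.e. no bit at or above 2^9 is set.
    static boolean isTiny(int normCapacity) {
        return (normCapacity & 0xFFFFFE00) == 0;
    }

    // Tiny sizes are multiples of 16, so the index is size / 16 (16 -> 1, 496 -> 31).
    static int tinyIdx(int normCapacity) {
        return normCapacity >>> 4;
    }

    // Small sizes double from 512: 512 -> 0, 1024 -> 1, 2048 -> 2, 4096 -> 3.
    static int smallIdx(int normCapacity) {
        int tableIdx = 0;
        int i = normCapacity >>> 10;
        while (i != 0) {
            i >>>= 1;
            tableIdx++;
        }
        return tableIdx;
    }
}
```

Index 0 of the tiny table is never produced for a nonzero size, which is why that slot goes unused.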
Here are the fields of PoolSubpage:

    // The chunk this PoolSubpage belongs to
    final PoolChunk<T> chunk;
    // Index of this PoolSubpage in the memoryMap array
    private final int memoryMapIdx;
    // Start offset within the chunk of the page backing this PoolSubpage
    private final int runOffset;
    private final int pageSize;
    // The PoolSubpage is split into pageSize/elemSize units of elemSize bytes each;
    // the bitmap records whether each unit is in use.
    private final long[] bitmap;
  
    PoolSubpage<T> prev;
    PoolSubpage<T> next;

    boolean doNotDestroy;
    int elemSize;
    private int maxNumElems;
    private int bitmapLength;
    // Position of the next available unit
    private int nextAvail;
    // Number of units still available
    private int numAvail;

PoolChunk
Netty requests memory from the operating system or the JVM in units of chunks, 16M each by default, and uses PoolChunk to represent and manage that memory.

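    // The arena this chunk belongs to
    final PoolArena<T> arena;
    // The memory actually allocated, chunkSize in size: for off-heap memory it is a
    // ByteBuffer with a default capacity of 16M; for heap memory it is a byte[] of 16M by default
    final T memory;
    final boolean unpooled;
    final int offset;
    // Records the allocation state of the memory
    private final byte[] memoryMap;
    // Records the depth code of each node in the binary tree
    private final byte[] depthMap;
    // subpages is an array whose default length is 2048.
    // A PoolSubpage represents one page, 8K by default. As explained above,
    // a PoolSubpage can be subdivided according to the requested size, but
    // the memory it represents is still one page.
    // The default 16M of a PoolChunk is divided into 2048 pages, and the
    // subpages array represents those 2048 pages.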
    private final PoolSubpage<T>[] subpages;
    /** Used to determine if the requested capacity is equal to or greater than pageSize. */
    private final int subpageOverflowMask;
    private final int pageSize;
    private final int pageShifts;
    private final int maxOrder;
    private final int chunkSize;
    private final int log2ChunkSize;
    private final int maxSubpageAllocs;
    /** Used to mark memory as unusable */
    private final byte unusable;

    // Use as cache for ByteBuffer created from the memory. These are just duplicates and so are only a container
    // around the memory itself. These are often needed for operations within the Pooled*ByteBuf and so
    // may produce extra GC, which can be greatly reduced by caching the duplicates.
    //
    // This may be null if the PoolChunk is unpooled as pooling the ByteBuffer instances does not make any sense here.
    private final Deque<ByteBuffer> cachedNioBuffers;
    // Remaining free bytes
    int freeBytes;
    // The PoolChunkList this chunk currently belongs to
    PoolChunkList<T> parent;
    // Previous node in the PoolChunk list
    PoolChunk<T> prev;
    // Next node in the PoolChunk list
    PoolChunk<T> next;

PoolChunkList
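Each arena in Netty manages many chunks. They all start at 16M by default, but as the system runs their memory is handed out bit by bit, so the arena ends up with chunks that have very different amounts of free space. To manage them efficiently, Netty classifies chunks by remaining free space, and chunks in the same range form a linked list. PoolChunkList is the class that defines such a category, and the categories themselves are also linked into a list.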

PoolChunkList.png

Below are the fields defined in PoolChunkList:

    // The arena this PoolChunkList belongs to
    private final PoolArena<T> arena;
    // Next node in the list of PoolChunkLists
    private final PoolChunkList<T> nextList;
    // minUsage and maxUsage are explained below
    private final int minUsage;
    private final int maxUsage;
    // Largest allocation that a chunk managed by this PoolChunkList can serve
    private final int maxCapacity;
    // Head of the list of chunks in this category
    private PoolChunk<T> head;
    private final int freeMinThreshold;
    private final int freeMaxThreshold;

    // This is only update once when create the linked like list of PoolChunkList in PoolArena constructor.
    // Previous node in the list of PoolChunkLists
    private PoolChunkList<T> prevList;

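So how does Netty classify chunks? It uses two parameters, minUsage and maxUsage. Netty divides chunks into six categories by how much of their memory is in use; the six categories are defined in PoolArena as follows: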

chunk_category.png

As the figure shows, the arena gives each list different minUsage and maxUsage values.

freeMinThreshold = (maxUsage == 100) ? 0 : (int) (chunkSize * (100.0 - maxUsage + 0.99999999) / 100L);
freeMaxThreshold = (minUsage == 100) ? 0 : (int) (chunkSize * (100.0 - minUsage + 0.99999999) / 100L);

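freeMinThreshold and freeMaxThreshold are computed from maxUsage and minUsage respectively. Take q025 as an example: its PoolChunkList has minUsage = 25 and maxUsage = 75, which gives a freeMinThreshold of about 4.16M and a freeMaxThreshold of about 12.16M. After an allocation, if the chunk's remaining free memory has dropped to freeMinThreshold or below, the chunk no longer belongs to this list and moves forward along the chain of PoolChunkLists (through nextList) to the list that fits it. After memory is released back to a chunk, if its free memory is greater than freeMaxThreshold, the chunk moves back toward the head of the chain to find the PoolChunkList that should manage it. The figure below shows the memory range of the chunks managed by each PoolChunkList.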
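
The two threshold formulas can be evaluated directly to confirm the 4.16M and 12.16M figures for q025. This sketch wraps the two expressions quoted above in a hypothetical ChunkListThresholds class; only the expressions themselves come from the article.

```java
// Standalone sketch; ChunkListThresholds is a hypothetical wrapper around the two formulas above.
final class ChunkListThresholds {
    // A chunk leaves the list toward nextList once its free bytes fall to this value or below.
    static int freeMinThreshold(int chunkSize, int maxUsage) {
        return (maxUsage == 100) ? 0 : (int) (chunkSize * (100.0 - maxUsage + 0.99999999) / 100L);
    }

    // A chunk leaves the list toward prevList once its free bytes exceed this value.
    static int freeMaxThreshold(int chunkSize, int minUsage) {
        return (minUsage == 100) ? 0 : (int) (chunkSize * (100.0 - minUsage + 0.99999999) / 100L);
    }

    public static void main(String[] args) {
        int chunkSize = 16 * 1024 * 1024;
        double mib = 1024.0 * 1024.0;
        // q025: minUsage = 25, maxUsage = 75
        System.out.printf("q025 freeMinThreshold = %.2f MiB%n", freeMinThreshold(chunkSize, 75) / mib);
        System.out.printf("q025 freeMaxThreshold = %.2f MiB%n", freeMaxThreshold(chunkSize, 25) / mib);
    }
}
```

The added 0.99999999 pushes each threshold almost one full percentage point higher, so the boundary is slightly above the exact 25% and 75% marks.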

Chunk_memory_range_for_different_PoolChunkList.png

PoolThreadLocalCache
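This is a FastThreadLocal whose job is to bind a PoolThreadCache to each thread.
PoolThreadCache
As the name suggests, this is a thread-level memory cache. The Arena, Chunk, Page and Subpage discussed above manage memory at the application level, whereas PoolThreadCache manages how memory that this thread obtained from an arena is reused within the same thread. When a thread requests memory it first tries the pool managed by its PoolThreadCache, and goes to the arena only if that fails.
Here are its key fields: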

    // Heap arena bound to this thread
    final PoolArena<byte[]> heapArena;
    // Direct arena bound to this thread
    final PoolArena<ByteBuffer> directArena;

    // Hold the caches for the different size classes, which are tiny, small and normal.

    // As described above, requests of at most 4096 bytes are handled in two
    // ways. Sizes in (0,496] are called tiny and are normalized to one of
    // 31 sizes that increase in steps of 16: [16,32,48,...,496]. The arena
    // defines the tinySubpagePools array to represent these block sizes.
    // tinySubPageDirectCaches is also an array of length 32, and each of its
    // elements corresponds to the element of tinySubpagePools at the same index.
    // For example, index 1 is a pool of ByteBuffers of size 16: if this thread requests
    // direct memory of a size in (0,16], it first looks in tinySubPageDirectCaches[1].
    private final MemoryRegionCache<ByteBuffer>[] tinySubPageDirectCaches;
    // smallSubPageDirectCaches is an array of length 4 representing pools of
    // ByteBuffers of size [512,1024,2048,4096], matching smallSubpagePools one-to-one
    private final MemoryRegionCache<ByteBuffer>[] smallSubPageDirectCaches;
    // Same as tinySubPageDirectCaches, for heap memory
    private final MemoryRegionCache<byte[]>[] tinySubPageHeapCaches;
    // Same as smallSubPageDirectCaches, for heap memory
    private final MemoryRegionCache<byte[]>[] smallSubPageHeapCaches;

    // normalDirectCaches has a default length of 3 and caches ByteBuffers of size
    // [8K,16K,32K]. Requests up to 4K were covered above; what about requests
    // larger than 4K and smaller than 16M? As analyzed above, the arena allocates
    // them in units of 8K. After use, if the allocated size is 8K, 16K or 32K,
    // the corresponding ByteBuffer is kept in normalDirectCaches, and later
    // requests from this thread for 8K, 16K or 32K look in normalDirectCaches first.
    private final MemoryRegionCache<ByteBuffer>[] normalDirectCaches;
    // Same as normalDirectCaches, for heap memory
    private final MemoryRegionCache<byte[]>[] normalHeapCaches;

MemoryRegionCache
A thread reuses ByteBuffers of different size classes through MemoryRegionCache.
Here are its fields:

        // Sets the size of the queue below: 512 for tiny, 256 for small and
        // 64 for normal. Once the queue is full, no further ByteBuffers are
        // cached until a consumer takes some out of the queue.
        private final int size;
        // Caches the ByteBuffers or byte[]s released by the thread
        private final Queue<Entry<T>> queue;
        // Size class of this MemoryRegionCache (tiny, small, normal)
        private final SizeClass sizeClass;

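So far we have looked at the core classes Netty uses to manage ByteBuf memory for the whole application in the style of jemalloc, and to cache allocated memory per thread.

Now let's follow one line of code to see how Netty allocates a block of off-heap memory: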

 ByteBuf byteBuf5 = pooledByteBufAllocator.buffer(18) ;

The allocation involves a fairly long call chain, so read patiently:

源头.png

newDirectBuffer()

We skip the earlier PooledByteBufAllocator methods and go straight to newDirectBuffer.

protected ByteBuf newDirectBuffer(int initialCapacity, int maxCapacity) {
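       // Get the PoolThreadCache bound to this thread
        PoolThreadCache cache = threadCache.get();
       // Get the directArena bound to the PoolThreadCache. The binding is made when the
       // PoolThreadCache is initialized. As noted above, every thread is bound to one
       // arena. To keep the number of threads per arena roughly equal, each arena
       // records how many times it has been bound, and a newly binding thread
       // is given the arena in the array with the smallest count.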
        PoolArena<ByteBuffer> directArena = cache.directArena;

        final ByteBuf buf;
        if (directArena != null) {
            // Enter the arena's allocate. initialCapacity is the requested size;
            // maxCapacity defaults to Integer.MAX_VALUE
            buf = directArena.allocate(cache, initialCapacity, maxCapacity);
        } else {
            buf = PlatformDependent.hasUnsafe() ?
                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity) :
                    new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
        }

        return toLeakAwareBuffer(buf);
    }
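
The arena-binding rule described in the comments of newDirectBuffer (bind each new thread to the arena with the fewest bindings) can be sketched as follows. This is a simplified standalone model, not Netty's code: Arena here is a hypothetical stand-in for PoolArena that keeps only the numThreadCaches counter, and bind is an illustrative name.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone sketch of least-used arena selection; all names are illustrative.
final class LeastUsedArenaSketch {
    // Hypothetical stand-in for PoolArena; only the binding counter is modeled.
    static final class Arena {
        final AtomicInteger numThreadCaches = new AtomicInteger();
    }

    // Return the arena with the fewest bound thread caches, or null if there are none.
    static Arena leastUsedArena(Arena[] arenas) {
        if (arenas == null || arenas.length == 0) {
            return null;
        }
        Arena min = arenas[0];
        for (int i = 1; i < arenas.length; i++) {
            if (arenas[i].numThreadCaches.get() < min.numThreadCaches.get()) {
                min = arenas[i];
            }
        }
        return min;
    }

    // Bind a new thread cache to the least-used arena and record the binding.
    static Arena bind(Arena[] arenas) {
        Arena arena = leastUsedArena(arenas);
        if (arena != null) {
            arena.numThreadCaches.getAndIncrement();
        }
        return arena;
    }
}
```

With three arenas and six binding threads, each arena ends up with two bindings. The selection and the increment are not atomic together in this sketch, so concurrent binders could briefly pick the same arena; the balance is approximate, which matches the "roughly equal" wording above.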

Arena, first-level allocate()

  PooledByteBuf<T> allocate(PoolThreadCache cache, int reqCapacity, int maxCapacity) {
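        // Get a PooledByteBuf, which is what is eventually returned to the caller. PooledByteBuf
        // is a recyclable object, and newByteBuf obtains it from an object pool.
        // For how Netty recycles objects, see the author's other article:
        // https://www.jianshu.com/p/8f629e93dd8c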
        PooledByteBuf<T> buf = newByteBuf(maxCapacity);
        allocate(cache, buf, reqCapacity);
        return buf;
    }

Arena, second-level allocate()

This is a long piece of code, so be patient.

private void allocate(PoolThreadCache cache, PooledByteBuf<T> buf, final int reqCapacity) {
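         // normalizeCapacity standardizes the requested size. There are two cases:
         // 1. The request is in (0,496]: reqCapacity is rounded up to the smallest
         //    multiple of 16 that is greater than or equal to it; for example, 18
         //    becomes 32 and 6 becomes 16.
         // 2. The request is larger than 496: it is rounded up to the smallest power of two
         //    that is greater than or equal to it; for example, reqCapacity=1025 becomes 2048.
        final int normCapacity = normalizeCapacity(reqCapacity);
        // Check whether the normalized size is smaller than one page (8K)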
        if (isTinyOrSmall(normCapacity)) { // capacity < pageSize
            int tableIdx;
            PoolSubpage<T>[] table;
            // Check whether the normalized size is tiny (smaller than 512)
            boolean tiny = isTiny(normCapacity);
            if (tiny) { // < 512
                // First try to serve the tiny request from the thread's PoolThreadCache
                if (cache.allocateTiny(this, buf, reqCapacity, normCapacity)) {
                    // was able to allocate out of the cache so move on
                    return;
                }
                // If that fails, point table at tinySubpagePools and compute the
                // index of normCapacity in tinySubpagePools
                tableIdx = tinyIdx(normCapacity);
                table = tinySubpagePools;
            } else {
                // First try to serve the small request from the thread's PoolThreadCache
                if (cache.allocateSmall(this, buf, reqCapacity, normCapacity)) {
                    // was able to allocate out of the cache so move on
                    return;
                }
                // If that fails, point table at smallSubpagePools and compute the
                // index of normCapacity in smallSubpagePools
                tableIdx = smallIdx(normCapacity);
                table = smallSubpagePools;
            }

 
    
            // head is the head of the PoolSubpage list for the normalized size; see
            // tinySubpagePools.png or smallSubpagePools.png above
            final PoolSubpage<T> head = table[tableIdx];

            /**
             * Synchronize on the head. This is needed as {@link PoolChunk#allocateSubpage(int)} and
             * {@link PoolChunk#free(long)} may modify the doubly linked list as well.
             */
          
            synchronized (head) {
                final PoolSubpage<T> s = head.next;
               // s == head means the list has no subpage with free units yet, so the block below is skipped
                if (s != head) {
                    assert s.doNotDestroy && s.elemSize == normCapacity;
                    long handle = s.allocate();
                    assert handle >= 0;
                    s.chunk.initBufWithSubpage(buf, null, handle, reqCapacity, cache);
                    incTinySmallAllocation(tiny);
                    return;
                }
            }
            synchronized (this) {
                // Neither the cache nor the subpage pool could serve the request: allocate from the arena's chunks
                allocateNormal(buf, reqCapacity, normCapacity, cache);
            }

            incTinySmallAllocation(tiny);
            return;
        }
         // The request is no larger than chunkSize (16M by default)
        if (normCapacity <= chunkSize) {
            // Try the thread cache first
            if (cache.allocateNormal(this, buf, reqCapacity, normCapacity)) {
                // was able to allocate out of the cache so move on
                return;
            }
            synchronized (this) {
               // The cache could not serve the request; allocate from the arena
                allocateNormal(buf, reqCapacity, normCapacity, cache);
                ++allocationsNormal;
            }
        } else {
            // Huge allocations are never served via the cache so just call allocateHuge
           // The request is larger than 16M; call allocateHuge
            allocateHuge(buf, reqCapacity);
        }
    }
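
The normalization step at the top of this method follows the two rules given in its comments. Here is a minimal standalone sketch of those rules. It is a simplification under stated assumptions: NormalizeSketch is a hypothetical name, and the sketch ignores requests at or above chunkSize as well as direct-memory cache alignment, neither of which is covered by the comments above.

```java
// Standalone sketch of the two normalization rules; not Netty's implementation.
final class NormalizeSketch {
    static int normalizeCapacity(int reqCapacity) {
        if (reqCapacity < 0) {
            throw new IllegalArgumentException("reqCapacity: " + reqCapacity);
        }
        if (reqCapacity >= 512) {
            // Round up to a power of two: smear the highest set bit of (n - 1)
            // into every lower position, then add one.
            int n = reqCapacity - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            n++;
            if (n < 0) {
                n >>>= 1; // guard against overflow for requests above 2^30
            }
            return n;
        }
        // Below 512: round up to a multiple of 16.
        if ((reqCapacity & 15) == 0) {
            return reqCapacity;
        }
        return (reqCapacity & ~15) + 16;
    }
}
```

A request of 18 bytes, the size used in the buffer(18) example earlier, normalizes to 32, so it is a tiny request served from the size class at index 2.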

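As we can see, a request to the arena is first tried against the thread's cache, PoolThreadCache, and only if that fails is it served from the memory managed by the arena. Let's first look at how the needed memory (a ByteBuffer) is obtained from PoolThreadCache, using a tiny request as the example.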

cacheForTiny
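PoolThreadCache holds tinySubPageDirectCaches, a MemoryRegionCache<ByteBuffer> array of length 32, and its heap counterpart tinySubPageHeapCaches; the meaning of each element was described above. cacheForTiny finds the index of the normalized size normCapacity in the matching array and returns the MemoryRegionCache at that index.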

private MemoryRegionCache<?> cacheForTiny(PoolArena<?> area, int normCapacity) {
       // Index of normCapacity in the tiny cache arrays
        int idx = PoolArena.tinyIdx(normCapacity);
        if (area.isDirect()) {
            return cache(tinySubPageDirectCaches, idx);
        }
        return cache(tinySubPageHeapCaches, idx);
    }

// Get the MemoryRegionCache at the given index
 private static <T> MemoryRegionCache<T> cache(MemoryRegionCache<T>[] cache, int idx) {
        if (cache == null || idx > cache.length - 1) {
            return null;
        }
        return cache[idx];
    }

MemoryRegionCache.allocate()
Gets the requested memory from the thread's cache.

 public final boolean allocate(PooledByteBuf<T> buf, int reqCapacity, PoolThreadCache threadCache) {
           // queue caches Entry objects, each wrapping a ByteBuffer of this size class
            Entry<T> entry = queue.poll();
            if (entry == null) {
              // Nothing of the requested size is cached in the queue: return false
                return false;
            }
           // A cached entry was found: use it to initialize Netty's own ByteBuf (initialization is covered below)
            initBuf(entry.chunk, entry.nioBuffer, entry.handle, buf, reqCapacity, threadCache);
            // Recycle the entry object
            entry.recycle();

            // allocations is not thread-safe which is fine as this is only called from the same thread all time.
            ++ allocations;
            return true;
        }
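
The behaviour of a per-size-class cache (a bounded queue that is filled on release and drained on allocation) can be modeled in a few lines. This is a deliberately simplified sketch: RegionCacheSketch is a hypothetical class, the cached block is a plain generic value instead of an Entry, and an ArrayDeque is used, so it is only safe on a single thread and ignores buffers released from other threads.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Standalone single-threaded model of a bounded per-size-class cache; names are illustrative.
final class RegionCacheSketch<T> {
    private final int size;
    private final Queue<T> queue;

    RegionCacheSketch(int size) {
        this.size = size;
        this.queue = new ArrayDeque<>(size);
    }

    // Called on release: keep the block for reuse unless the cache is already full.
    boolean add(T block) {
        if (queue.size() >= size) {
            return false; // caller must hand the block back to the arena instead
        }
        return queue.offer(block);
    }

    // Called on allocation: return a cached block, or null on a cache miss.
    T allocate() {
        return queue.poll();
    }
}
```

A cache miss (null here, false in the method above) is what sends the request on to the arena.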

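If the thread's cache cannot supply the memory, the request goes to the arena. Staying with the tiny case, there are again two possibilities:
1) Look up the PoolSubpage list for normCapacity in tinySubpagePools. If the list contains a node other than head, call that node's allocate method.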

 long allocate() {
         // If the element size of this PoolSubpage is 0, return toHandle(0)
        if (elemSize == 0) {
            return toHandle(0);
        }
       // If this PoolSubpage has no free units, or it has already been removed from
       // the PoolSubpage list and destroyed, return -1
        if (numAvail == 0 || !doNotDestroy) {
            return -1;
        }
       // Get the index of the next available unit in this PoolSubpage;
       // this step is analyzed in detail below
        final int bitmapIdx = getNextAvail();
        int q = bitmapIdx >>> 6;
        int r = bitmapIdx & 63;
        assert (bitmap[q] >>> r & 1) == 0;
        bitmap[q] |= 1L << r;
        // If this allocation used the last free unit, remove this PoolSubpage from the PoolSubpage list
        if (-- numAvail == 0) {
            removeFromPool();
        }
       // Return the handle, from which the allocated unit can be located
        return toHandle(bitmapIdx);
    }

2) Otherwise, allocate directly from the arena with allocateNormal()

 // A chunk is 16M by default and allocateNormal is only called for sizes no larger
 // than a chunk, so first try the PoolChunks managed by the arena's PoolChunkLists
 private void allocateNormal(PooledByteBuf<T> buf, int reqCapacity, int normCapacity, PoolThreadCache threadCache) {
        if (q050.allocate(buf, reqCapacity, normCapacity, threadCache) ||
            q025.allocate(buf, reqCapacity, normCapacity, threadCache) ||
            q000.allocate(buf, reqCapacity, normCapacity, threadCache) ||
            qInit.allocate(buf, reqCapacity, normCapacity, threadCache) ||
            q075.allocate(buf, reqCapacity, normCapacity, threadCache)) {
            return;
        }

        // Add a new chunk.
       // If none of the PoolChunkLists above could serve the request, create a new
       // chunk and allocate from it
        PoolChunk<T> c = newChunk(pageSize, maxOrder, pageShifts, chunkSize);
        // Allocate from the new chunk
        boolean success = c.allocate(buf, reqCapacity, normCapacity, threadCache);
        assert success;
       // Add the new chunk to the PoolChunkList chain
        qInit.add(c);
    }

newChunk()
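When the application requests memory and Netty has not yet obtained any chunk from the operating system or the JVM, or none of the existing chunks has enough free capacity for the request, Netty requests a new chunk (16M by default) from the operating system or the JVM.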

 protected PoolChunk<ByteBuffer> newChunk(int pageSize, int maxOrder,
                int pageShifts, int chunkSize) {
            if (directMemoryCacheAlignment == 0) {
                // allocateDirect is where Netty requests chunkSize bytes of off-heap memory from the operating system
                return new PoolChunk<ByteBuffer>(this,
                        allocateDirect(chunkSize), pageSize, maxOrder,
                        pageShifts, chunkSize, 0);
            }
            final ByteBuffer memory = allocateDirect(chunkSize
                    + directMemoryCacheAlignment);
            return new PoolChunk<ByteBuffer>(this, memory, pageSize,
                    maxOrder, pageShifts, chunkSize,
                    offsetCacheLine(memory));
        }

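Creating the PoolChunk object above also initializes the memoryMap and depthMap arrays.
This is a good place to explain how the memory managed by a PoolChunk is handed out. Netty allocates memory to the application in units of pages (8K by default), and every allocation consists of $2^n$ pages. In other words, Netty never allocates 3 or 5 pages at once: a request for 3x8192 bytes gets 4 pages, and a request for 5x8192 bytes gets 8 pages. A chunk is 16M by default, and inside PoolChunk those 16M are logically divided into 2048 pages. To allocate pages efficiently, Netty manages the 16M with a complete binary tree. The tree has 12 levels, numbered from 0; level 11 holds the 2048 leaves, each representing one page. A non-leaf node represents a capacity equal to the sum of its children's capacities.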

full_binary_tree.png

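memoryMap and depthMap both represent this binary tree. By default each array has length 4096 while the tree has only 4095 nodes, because Netty stores the tree starting at index=1. What is the difference between the two? The values in memoryMap change as pages are allocated and released, whereas the elements of depthMap never change after initialization, so the initial value of a node can always be looked up in depthMap. Here is the code that initializes memoryMap and depthMap: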

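        // The tree is stored in the array starting at index=1
        int memoryMapIndex = 1;
       // maxOrder defaults to 11. d is the code shared by the nodes of one level, in [0,11];
       // each code stands for a memory size. All leaves have code 11, meaning each
       // leaf represents 8K; a node with code 2 represents 4M
        for (int d = 0; d <= maxOrder; ++ d) { // move down the tree one level at a time
            // 1 << d is the number of nodes at level d
            int depth = 1 << d;
            // Give every node at this level the same value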
            for (int p = 0; p < depth; ++ p) {
                // in each level traverse left to right and set value to the depth of subtree
                memoryMap[memoryMapIndex] = (byte) d;
                depthMap[memoryMapIndex] = (byte) d;
                memoryMapIndex ++;
            }
        }

After initialization, the memory represented by a node with code d is given by
$8K \times 2^{maxOrder-d}$

The allocation itself is analyzed in detail below.

PoolChunk.allocate()
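From the analysis of PoolChunk initialization above, we know that PoolChunk manages its memory with a complete binary tree. allocate is where PoolChunk actually allocates pages, and it distinguishes two cases:
1) The normalized request is smaller than one page; the allocation method is allocateSubpage
2) The normalized request is at least one page; the allocation method is allocateRun
The first case is more involved, because the page must also be split into equal units of normCapacity bytes and the use of each unit must be tracked. Let's tackle the hard one.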

   // For allocateSubpage normCapacity < pageSize, so exactly one page is allocated each time
 private long allocateSubpage(int normCapacity) {
        // Obtain the head of the PoolSubPage pool that is owned by the PoolArena and synchronize on it.
        // This is need as we may add it back and so alter the linked-list structure.
       // Using normCapacity, get the head of the matching page list in tinySubpagePools
       // or smallSubpagePools
        PoolSubpage<T> head = arena.findSubpagePoolHead(normCapacity);
        int d = maxOrder; // subpages are only be allocated from pages i.e., leaves
        synchronized (head) {
            // Allocate from the PoolChunk's complete binary tree; id is the index of the allocated node in memoryMap
            int id = allocateNode(d);
           // id < 0 means the allocation failed
            if (id < 0) {
                return id;
            }
            
            final PoolSubpage<T>[] subpages = this.subpages;
            final int pageSize = this.pageSize;
            // On success, update the free bytes of this PoolChunk
            freeBytes -= pageSize;
           // Index of the allocated PoolSubpage in subpages
            int subpageIdx = subpageIdx(id);
            PoolSubpage<T> subpage = subpages[subpageIdx];
            if (subpage == null) {
                // Initialize the PoolSubpage object for this allocation
                subpage = new PoolSubpage<T>(head, this, id, runOffset(id), pageSize, normCapacity);
                // Store the PoolSubpage in subpages
                subpages[subpageIdx] = subpage;
            } else {
                subpage.init(head, normCapacity);
            }
           // Allocate a unit of normCapacity bytes from the PoolSubpage
            return subpage.allocate();
        }
    }

allocateNode(int d)
Allocates the requested memory from the complete binary tree represented by memoryMap.

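// d is the depth code in the complete binary tree for the requested size: with the defaults, d is 11 for an 8K request and 10 for a 16K request
 private int allocateNode(int d) {
      // id is an index into memoryMap; id=1 means the search for a node with code d starts at the root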
        int id = 1;
        int initial = - (1 << d); // has last d bits = 0 and rest all = 1
        byte val = value(id);
        if (val > d) { // unusable
            return -1;
        }
         // While this node's code is smaller than d (or the node is still above level d), descend to its left child
        while (val < d || (id & initial) == 0) { // id & initial == 1 << d for all ids at depth d, for < d it is 0
            id <<= 1;
            val = value(id);
            // If the child's code is greater than d, switch to its sibling
            if (val > d) {
                id ^= 1;
                val = value(id);
            }
        }
        byte value = value(id);
        assert value == d && (id & initial) == 1 << d : String.format("val = %d, id & initial = %d, d = %d",
                value, id & initial, d);
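       // A matching node was found: set its code to unusable (12 by default) to mark it as allocated
        setValue(id, unusable); // mark as unusable
      // The code of a non-leaf node depends on its children, so after the allocation
      // the codes of all its ancestors must be updated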
        updateParentsAlloc(id);
        return id;
    }

updateParentsAlloc(id)
After a node is allocated, update the codes of its ancestors.

  private void updateParentsAlloc(int id) {
        while (id > 1) {
            // Find the parent
            int parentId = id >>> 1;
            // Code of this node
            byte val1 = value(id);
            // Code of this node's sibling
            byte val2 = value(id ^ 1);
            // val = min(val1, val2)
            byte val = val1 < val2 ? val1 : val2;
            // Store val as the parent's code in memoryMap
            setValue(parentId, val);
            id = parentId;
        }
    }
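
To see allocateNode and updateParentsAlloc work together, it helps to run them on a tree small enough to trace by hand. The sketch below reuses the logic of the two methods above on a standalone tree with maxOrder = 3 (8 leaves, stored at ids 8 to 15). BuddyTreeSketch is a hypothetical class; it keeps only memoryMap and omits depthMap, freeing and everything else in PoolChunk.

```java
// Standalone model of the memoryMap tree; BuddyTreeSketch is an illustrative name.
final class BuddyTreeSketch {
    private final byte unusable;
    private final byte[] memoryMap;

    BuddyTreeSketch(int maxOrder) {
        this.unusable = (byte) (maxOrder + 1);
        this.memoryMap = new byte[1 << (maxOrder + 1)];
        int idx = 1; // the tree is stored from index 1
        for (int d = 0; d <= maxOrder; d++) {
            int nodesAtLevel = 1 << d;
            for (int p = 0; p < nodesAtLevel; p++) {
                memoryMap[idx++] = (byte) d; // every node starts with its own depth
            }
        }
    }

    byte value(int id) {
        return memoryMap[id];
    }

    // Returns the id of a free node at depth d, or -1 if none is available.
    int allocateNode(int d) {
        int id = 1;
        int initial = -(1 << d); // last d bits are 0, the rest are 1
        byte val = memoryMap[id];
        if (val > d) {
            return -1; // no free node of this size anywhere in the tree
        }
        while (val < d || (id & initial) == 0) {
            id <<= 1; // go to the left child
            val = memoryMap[id];
            if (val > d) {
                id ^= 1; // left subtree cannot serve the request, use the sibling
                val = memoryMap[id];
            }
        }
        memoryMap[id] = unusable;
        updateParentsAlloc(id);
        return id;
    }

    private void updateParentsAlloc(int id) {
        while (id > 1) {
            int parentId = id >>> 1;
            byte val1 = memoryMap[id];
            byte val2 = memoryMap[id ^ 1];
            memoryMap[parentId] = val1 < val2 ? val1 : val2;
            id = parentId;
        }
    }
}
```

Allocating two leaves in a row returns ids 8 and 9. Afterwards their parent (id 4) is unusable and the root holds 1, so a request for the whole tree (d = 0) fails, while a request at depth 1 is still served by node 3, the untouched right half.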

PoolSubpage initialization
Now that we have seen how a chunk allocates a page, let's look at how the allocated page is initialized.

 PoolSubpage(PoolSubpage<T> head, PoolChunk<T> chunk, int memoryMapIdx, int runOffset, int pageSize, int elemSize) {
       // The meaning of these fields is described above
        this.chunk = chunk;
        this.memoryMapIdx = memoryMapIdx;
        this.runOffset = runOffset;
        this.pageSize = pageSize;
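        // The bitmap length is 8 by default. Why 8? A long has 64 bits, so
        // 8 longs hold 8*64 = 512 bits. A page is 8K by default and the smallest
        // normCapacity is 16, so a PoolSubpage is split into at most 8192/16 = 512
        // units, and the 512 bits of 8 longs are exactly enough to track their use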
        bitmap = new long[pageSize >>> 10]; // pageSize / 16 / 64
        init(head, elemSize);
    }


void init(PoolSubpage<T> head, int elemSize) {
        doNotDestroy = true;
      // elemSize is the size of each unit
        this.elemSize = elemSize;
        if (elemSize != 0) {
            // Number of units this page is split into
            maxNumElems = numAvail = pageSize / elemSize;
            nextAvail = 0;
           // maxNumElems >>> 6 gives the number of longs needed to track the use of every unit
            bitmapLength = maxNumElems >>> 6;
            if ((maxNumElems & 63) != 0) {
                // If maxNumElems is not a multiple of 64, bitmapLength must be incremented.
                // For example, with elemSize = 1024, maxNumElems is 8 and
                // bitmapLength = 8 >>> 6 = 0: 8 bits are needed but bitmapLength is 0,
                // so one more long is added
                bitmapLength ++;
            }
            // Zero the first bitmapLength longs of bitmap: no unit has been allocated yet
            for (int i = 0; i < bitmapLength; i ++) {
                bitmap[i] = 0;
            }
        }
       // Add this PoolSubpage to the matching PoolSubpage list of the arena; the newest page goes to the front of the list
        addToPool(head);
    }
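
The sizing arithmetic in init can be tried out for a few element sizes. This is a small standalone sketch of just that arithmetic; SubpageLayoutSketch is a hypothetical class name and the page size is passed in explicitly.

```java
// Standalone sketch of the unit-count and bitmap-length arithmetic used in init.
final class SubpageLayoutSketch {
    // Number of elemSize-byte units that fit in one page (integer division).
    static int maxNumElems(int pageSize, int elemSize) {
        return pageSize / elemSize;
    }

    // Number of longs needed to hold one status bit per unit.
    static int bitmapLength(int maxNumElems) {
        int length = maxNumElems >>> 6;
        if ((maxNumElems & 63) != 0) {
            length++; // round up when the unit count is not a multiple of 64
        }
        return length;
    }

    public static void main(String[] args) {
        int pageSize = 8192;
        for (int elemSize : new int[] {16, 48, 1024}) {
            int n = maxNumElems(pageSize, elemSize);
            System.out.println(elemSize + "B units: " + n + " per page, " + bitmapLength(n) + " long(s)");
        }
    }
}
```

For a size such as 48 that does not divide 8192 evenly, the page holds 170 units and the remaining 32 bytes stay unused.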

PoolSubpage.allocate()
Returns a unit of the requested normCapacity size.

long allocate() {
        if (elemSize == 0) {
            return toHandle(0);
        }

        if (numAvail == 0 || !doNotDestroy) {
            return -1;
        }
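        // Get the position of the next available unit; the lookup is explained in detail below.
        // bitmapIdx carries two pieces of information:
        // 1) the index in bitmap of the long that holds the unit's status bit
        // 2) the position of that bit within the 64 bits of the long
        final int bitmapIdx = getNextAvail();
        // q is the first piece of information
        int q = bitmapIdx >>> 6;
        // r is the second piece of information
        int r = bitmapIdx & 63;
        assert (bitmap[q] >>> r & 1) == 0;
        // Set the bit to 1 to mark the unit as allocated
        bitmap[q] |= 1L << r;
        // If every unit is now allocated, remove this PoolSubpage from its PoolSubpage list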
        if (-- numAvail == 0) {
            removeFromPool();
        }
        // Return the handle
        return toHandle(bitmapIdx);
    }

toHandle

handle.png

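A handle is three numbers combined with '|'. The first part is 0x4000000000000000L, which is 2 to the power of 62, the largest power of two a long can represent. The second part is the position of the allocated unit within its page: if a page contains 256 units and unit 19 is allocated, this position is 19. The third part is the position within the chunk of the page that holds the unit. From the handle one can recover both the position of the page in the chunk and the position of the unit in the page.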
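
Packing and unpacking such a handle can be sketched as follows. The article does not show toHandle itself, so the layout here is an assumption made for illustration: the marker bit 0x4000000000000000L, the unit position shifted into the upper 32 bits, the page's memoryMap index in the lower 32 bits, and a 0x3FFFFFFF mask to strip the marker again. HandleSketch and elementIdx are hypothetical names.

```java
// Standalone sketch of a subpage handle; the bit layout is an assumption stated in the text above.
final class HandleSketch {
    // Combine the marker bit, the unit position in the page, and the page's index in memoryMap.
    static long toHandle(int bitmapIdx, int memoryMapIdx) {
        return 0x4000000000000000L | (long) bitmapIdx << 32 | memoryMapIdx;
    }

    // Lower 32 bits: which page of the chunk.
    static int memoryMapIdx(long handle) {
        return (int) handle;
    }

    // Upper 32 bits, marker included: nonzero for every subpage allocation.
    static int bitmapIdx(long handle) {
        return (int) (handle >>> 32);
    }

    // Upper 32 bits with the marker removed: which unit of the page.
    static int elementIdx(long handle) {
        return bitmapIdx(handle) & 0x3FFFFFFF;
    }
}
```

The marker bit is what keeps unit 0 of a subpage distinguishable from a whole-page allocation: even when the unit position is 0, the upper half of the handle is nonzero, which is the condition initBuf tests with bitmapIdx == 0.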
getNextAvail
Finds the position of the next available unit in the PoolSubpage. The core methods involved are findNextAvail and findNextAvail0; let's look at the source.

private int findNextAvail() {
        
        final long[] bitmap = this.bitmap;
        final int bitmapLength = this.bitmapLength;
        for (int i = 0; i < bitmapLength; i ++) {
            long bits = bitmap[i];
            // If ~bits is 0, every unit tracked by this long is allocated, so
            // move on to the next long, and so on.
            // If it is nonzero, this long still has a unit that can be allocated
            if (~bits != 0) {
                return findNextAvail0(i, bits);
            }
        }
       // Every unit has been allocated
        return -1;
    }

private int findNextAvail0(int i, long bits) {
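        //i is the index in bitmap of the long that holds bits
        final int maxNumElems = this.maxNumElems;
       //baseVal encodes i in the high bits: it is the unit index of the first bit of bitmap[i]
        final int baseVal = i << 6;
        //Walk the 64 bits of bits; j is the index of the current bit
        for (int j = 0; j < 64; j ++) {
             //(bits & 1) == 0 means bit j is 0, i.e. the memory unit for j can be allocated
            if ((bits & 1) == 0) {
               //val combines baseVal and j. baseVal equals i << 6, so its low 6 bits are 0
               //(it is 0 when i is 0), and j is less than 64, so the two never overlap:
               //later val >>> 6 yields i and val & 63 yields j
                int val = baseVal | j;
                if (val < maxNumElems) {
                    return val;
                } else {
                    break;
                }
            }
           //The memory unit for the current j is already allocated;
            //unsigned-shift bits right to look at the next bit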
            bits >>>= 1;
        }
        return -1;
    }
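The bitmap bookkeeping above can be tried out in isolation with a small sketch. It is not Netty's code: the class BitmapSketch and its free method are hypothetical, and the inner bit-by-bit loop of findNextAvail0 is swapped for Long.numberOfTrailingZeros(~bits), which finds the same lowest zero bit in one call.

```java
final class BitmapSketch {
    private final long[] bitmap;   // one status bit per memory unit, 1 = allocated
    private final int maxNumElems; // number of units actually backed by the page

    BitmapSketch(int maxNumElems) {
        this.maxNumElems = maxNumElems;
        this.bitmap = new long[(maxNumElems + 63) >>> 6];
    }

    // Returns the index of the lowest free unit and marks it allocated, or -1 if none is left.
    int allocate() {
        for (int i = 0; i < bitmap.length; i++) {
            long bits = bitmap[i];
            if (~bits != 0) {
                int j = Long.numberOfTrailingZeros(~bits); // lowest zero bit of bits
                int val = (i << 6) | j;                    // same encoding as baseVal | j
                if (val >= maxNumElems) {
                    return -1; // the free bit lies beyond the last real unit
                }
                bitmap[i] |= 1L << j;
                return val;
            }
        }
        return -1;
    }

    // Clears the status bit again: q = idx >>> 6 picks the long, r = idx & 63 picks the bit.
    void free(int bitmapIdx) {
        bitmap[bitmapIdx >>> 6] &= ~(1L << (bitmapIdx & 63));
    }

    public static void main(String[] args) {
        BitmapSketch page = new BitmapSketch(70);
        for (int k = 0; k < 70; k++) {
            page.allocate();
        }
        System.out.println(page.allocate()); // prints -1, all 70 units are taken
        page.free(65);
        System.out.println(page.allocate()); // prints 65
    }
}
```

With 70 units the bitmap needs two longs, and the second long has 58 bits that correspond to no unit at all, which is why the val < maxNumElems check is needed.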

PoolChunk.initBuf
Once we have obtained the memory block we need (represented by handle), the ByteBuf has to be initialized

void initBuf(PooledByteBuf<T> buf, ByteBuffer nioBuffer, long handle, int reqCapacity,
                 PoolThreadCache threadCache) {
        int memoryMapIdx = memoryMapIdx(handle);
        int bitmapIdx = bitmapIdx(handle);
        //bitmapIdx == 0 means the requested memory is at least pageSize (8K by default)
        if (bitmapIdx == 0) {
            byte val = value(memoryMapIdx);
            assert val == unusable : String.valueOf(val);
            //runOffset computes the offset of the allocated memory within the memory block managed by the chunk;
            //it is explained in detail below
            buf.init(this, nioBuffer, handle, runOffset(memoryMapIdx) + offset,
                    reqCapacity, runLength(memoryMapIdx), threadCache);
        } else {
            //The requested memory is smaller than pageSize
            initBufWithSubpage(buf, nioBuffer, handle, bitmapIdx, reqCapacity, threadCache);
        }
    }

private void initBufWithSubpage(PooledByteBuf<T> buf, ByteBuffer nioBuffer,
                                    long handle, int bitmapIdx, int reqCapacity, PoolThreadCache threadCache) {
        assert bitmapIdx != 0;

        int memoryMapIdx = memoryMapIdx(handle);

        PoolSubpage<T> subpage = subpages[subpageIdx(memoryMapIdx)];
        assert subpage.doNotDestroy;
        assert reqCapacity <= subpage.elemSize;

        buf.init(
            this, nioBuffer, handle,
            runOffset(memoryMapIdx) + (bitmapIdx & 0x3FFFFFFF) * subpage.elemSize + offset,
                reqCapacity, subpage.elemSize, threadCache);
    }

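As the code above shows, whether the requested size is at least pageSize or smaller than pageSize, the call always ends in PooledByteBuf.init. The only difference is how the offset of the allocated memory within the memory block managed by the PoolChunk is computed. Let's look at how the offset is computed in each case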

runOffset
When the requested size is at least pageSize, we only need the starting position in the chunk of the pages that were allocated


private int runOffset(int id) {
        // represents the 0-based offset in #bytes from start of the byte-array chunk
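       //depth(id) returns the depth of node id in the tree; the implementation is
       //a plain lookup in depthMap. 1 << depth(id) is the index of the first node
       //on that level of the complete binary tree, so XOR-ing id with that index
       //gives the distance of id from it. For example, node 2049 on level 11 has
       //a distance of 1. shift is the node's position within its level
        int shift = id ^ 1 << depth(id);
        //runLength computes how much memory the node with index id represents;
        //for example, node 2049 represents 8K and node 512 represents 32K.
        //shift * runLength(id) is then the offset of the allocated memory in the chunk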
        return shift * runLength(id);
    }

  private byte depth(int id) {
        return depthMap[id];
    }

  private int runLength(int id) {
        // represents the size in #bytes supported by node 'id' in the tree
        //log2ChunkSize is 24 by default, because 16M = 2 to the 24th power
        return 1 << log2ChunkSize - depth(id);
    }
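To check the numbers in the comments, here is a small self-contained sketch of the same arithmetic. It assumes the default 16M chunk (log2ChunkSize = 24) and replaces the depthMap lookup with floor(log2(id)), which gives the same depth for a complete binary tree whose root has index 1. The class name RunOffsetSketch is made up for the example.

```java
final class RunOffsetSketch {
    // Assumed default: chunkSize = 16M = 2^24.
    static final int LOG2_CHUNK_SIZE = 24;

    // Depth of node id in a complete binary tree rooted at index 1 (stand-in for depthMap[id]).
    static int depth(int id) {
        return 31 - Integer.numberOfLeadingZeros(id);
    }

    // Bytes represented by node id: the chunk size halves on every level.
    static int runLength(int id) {
        return 1 << (LOG2_CHUNK_SIZE - depth(id));
    }

    // Byte offset of node id from the start of the chunk.
    static int runOffset(int id) {
        int shift = id ^ (1 << depth(id)); // position of the node within its level
        return shift * runLength(id);
    }

    public static void main(String[] args) {
        System.out.println(runLength(2049)); // prints 8192  (one 8K page)
        System.out.println(runOffset(2049)); // prints 8192  (second page of the chunk)
        System.out.println(runLength(512));  // prints 32768 (32K, four pages)
        System.out.println(runOffset(513));  // prints 32768 (second 32K run)
    }
}
```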

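When the requested size is smaller than pageSize, the offset is computed as:

runOffset(memoryMapIdx) + (bitmapIdx & 0x3FFFFFFF) * subpage.elemSize + offset

It has three parts:
1) runOffset(memoryMapIdx) is the offset in the chunk of the page that was allocated, the same as in the first case
2) (bitmapIdx & 0x3FFFFFFF) * subpage.elemSize: (bitmapIdx & 0x3FFFFFFF) gives N, the index of the allocated memory unit within the page, and N * subpage.elemSize is the unit's offset within the page. The mask is 0x3FFFFFFF rather than 0x7FFFFFFF because bitmapIdx is the high 32 bits of handle, obtained by an unsigned right shift, and those bits still carry the 0x40000000 marker that toHandle OR-ed in. The mask has to clear that marker bit as well as the sign bit
3) The third part is the chunk's own offset, 0 by default
The sum of the three parts is the offset of the allocated memory unit within the chunk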
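A worked example makes the formula concrete. The sketch below is hypothetical: the class SubpageOffsetSketch, its parameter names, and the sample values (second page of the chunk, unit 19, elemSize 32) are chosen only for illustration, and the 0x40000000 marker in the raw bitmapIdx is assumed from Netty 4.1's toHandle.

```java
final class SubpageOffsetSketch {
    // Clears the sign bit and the assumed 0x40000000 marker, leaving the unit index.
    static final int UNIT_INDEX_MASK = 0x3FFFFFFF;

    // pageOffset   : result of runOffset(memoryMapIdx) for the page
    // rawBitmapIdx : high 32 bits of the handle, marker included
    // elemSize     : size in bytes of one memory unit in the page
    // chunkOffset  : the chunk's own offset, 0 by default
    static int subpageOffset(int pageOffset, int rawBitmapIdx, int elemSize, int chunkOffset) {
        int unitIndex = rawBitmapIdx & UNIT_INDEX_MASK;
        return pageOffset + unitIndex * elemSize + chunkOffset;
    }

    public static void main(String[] args) {
        int rawBitmapIdx = 0x40000000 | 19; // unit 19, with the marker still set
        // Page at byte 8192 of the chunk, 32-byte units: 8192 + 19 * 32 + 0
        System.out.println(subpageOffset(8192, rawBitmapIdx, 32, 0)); // prints 8800
    }
}
```

Masking with 0x7FFFFFFF instead would leave the marker in place and push the computed offset far outside the page.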

buf.init
PooledByteBuf initialization. After the analysis above, the code below should be easy to follow, so I won't walk through it again.

private void init0(PoolChunk<T> chunk, ByteBuffer nioBuffer,
                       long handle, int offset, int length, int maxLength, PoolThreadCache cache) {
        assert handle >= 0;
        assert chunk != null;
        
        this.chunk = chunk;
        memory = chunk.memory;
        tmpNioBuf = nioBuffer;
        allocator = chunk.arena.parent;
        this.cache = cache;
        this.handle = handle;
        this.offset = offset;
        this.length = length;
        this.maxLength = maxLength;
    }

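When the requested size exceeds chunkSize, the source is considerably simpler, so we won't analyze it here.
This completes the walkthrough of how an application requests memory from netty.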

