Memcached Memory Management
1. Introduction to Memcached
Memcached is a distributed in-memory object caching system used to reduce database load and improve overall system performance. The memory management described in this article is based on Memcached 1.4.24. Older versions manage memory somewhat differently, and those differences are not covered here.
2. The Memcached Model
Before diving into the source code of Memcached's memory management, let us first introduce a few important concepts.
- Slab: a slab is a block of memory allocated by Memcached, 1 MB by default. The slab is the smallest unit of memory that Memcached allocates.
- Chunk: each slab is further divided into chunks. A chunk is the smallest unit in which Memcached stores data, and one chunk can hold only one object. All chunks within a slab have the same size.
- Item: an item is the actual piece of data stored in Memcached. It is a fairly complex structure that contains, besides the key-value pair and the expiration time, several other fields, which are described later. Memcached stores each item in one chunk of some slab.
- SlabClass: from the concepts above we know that Memcached allocates slabs and divides each slab into equally sized chunks. How, then, are items of different sizes stored? Slabs are grouped into slab classes: each slab class corresponds to one kind of slab, and every slab in the same slab class is divided into chunks of the same size.
Here is a rough analogy. Memcached's memory allocation is a bit like the squared exercise books we used in school: a slab is one page of a book, a chunk is one square on that page, an item is the character we want to write, and a slab class is a whole book. Every square in a given book has the same size, so to write characters of different sizes we simply pick the book whose squares fit best.
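To make the size relationship concrete, here is a minimal, hypothetical sketch (not Memcached source) of how many chunks fit into one slab under the default 1 MB slab size:

/* Hypothetical illustration, not Memcached source: how many chunks
 * of a given size fit into one 1 MB slab. */
#include <stdio.h>

int main(void) {
    unsigned int slab_size = 1024 * 1024;           /* default slab size: 1 MB */
    unsigned int chunk_size = 96;                   /* chunk size of some slab class */
    unsigned int perslab = slab_size / chunk_size;  /* chunks per slab */
    printf("chunk size %u -> %u chunks per slab\n", chunk_size, perslab);
    return 0;
}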
3. Memcached Data Structures
The most fundamental data structure in Memcached is slabclass_t, which describes a single slab class. Its definition is as follows:
typedef struct {
    unsigned int size;      /* size of a chunk in this class; fixed once set */
    unsigned int perslab;   /* number of chunks (objects) each slab can hold */
    void *slots;            /* linked list of all free chunks in this slab class */
    unsigned int sl_curr;   /* number of free items, i.e. the length of slots */
    unsigned int slabs;     /* number of slabs already allocated for this class */
    void **slab_list;       /* array of pointers to the slabs */
    unsigned int list_size; /* capacity of the slab_list array */
    unsigned int killing;   /* index+1 of dying slab, or zero if none */
    size_t requested;       /* total bytes actually requested in this class */
} slabclass_t;
Three fields deserve special attention: slots, slabs, and slab_list. slots is the list of all free chunks in the slab class and is the entry point of Memcached's memory allocation. Free chunks come from two sources: chunks of newly allocated slabs, and chunks reclaimed from expired items. slabs records how many slabs have actually been allocated for this class. slab_list is an array of pointers to all slabs of this class; note that its capacity (list_size) may be larger than the number of slabs that have actually been allocated.
Next, let us look at the structure of item, the object in which Memcached stores data:
typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;   /* next item in a list: either the next free item in slots, or the next item in an LRU queue */
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next; /* next item in the hash chain of the same bucket */
    rel_time_t time;         /* least recent access time */
    rel_time_t exptime;      /* expiration time */
    int nbytes;              /* size of the data */
    unsigned short refcount;
    uint8_t nsuffix;         /* length of flags-and-length string */
    uint8_t it_flags;        /* ITEM_* above */
    uint8_t slabs_clsid;     /* id of the slab class this item belongs to */
    uint8_t nkey;            /* length of the key */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS (ITEM_CAS is 2) we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;
The item is Memcached's basic storage unit; the interesting part is the data[] flexible array member at the end. Its contents consist of four consecutive parts: an 8-byte CAS value (only when CAS is enabled), the key terminated by '\0', the suffix string " flags length\r\n", and finally the value itself in binary form.
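The offsets of these parts inside data[] are computed by helper macros in memcached.h. Paraphrased (a sketch based on the layout above, not copied verbatim from the source), they look roughly like this:

/* Paraphrased sketch of the item layout macros: each offset is the sum of
 * the sizes of the parts that precede it inside data[]. */
#define ITEM_key(it)    (((char *)&((it)->data)) \
        + (((it)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))
#define ITEM_suffix(it) ((char *)&((it)->data) + (it)->nkey + 1 \
        + (((it)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))
#define ITEM_data(it)   ((char *)&((it)->data) + (it)->nkey + 1 + (it)->nsuffix \
        + (((it)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

ITEM_key and ITEM_suffix are the macros used by do_item_alloc later in this article when it copies the key and the suffix into a freshly allocated item.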
Having covered slab classes and items, let us look at Memcached's LRU implementation. A shortage of memory inevitably forces data to be evicted, and Memcached, being a memory store, replaces data with an LRU policy. Each slab class maintains its own LRU queue, with the heads and tails arrays pointing to the front and back of each queue; the item at the tail is the least recently used one and is evicted first.
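Conceptually, linking an item into its class's LRU queue is just pushing it onto the head of a doubly linked list. A simplified sketch of what item_link_q (used further below) boils down to, with locking and statistics omitted:

/* Simplified sketch of LRU linking: push the item onto the head of its
 * class's queue (the real item_link_q also updates stats and asserts flags). */
static void lru_push_head(item *it) {
    item **head = &heads[it->slabs_clsid];
    item **tail = &tails[it->slabs_clsid];
    it->prev = 0;
    it->next = *head;              /* new item points at the old head */
    if (it->next) it->next->prev = it;
    *head = it;                    /* new item becomes the most recently used */
    if (*tail == 0) *tail = it;    /* the very first item is also the tail */
}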
This gives a reasonably simple picture of how a slab class manages its memory, but the real LRU implementation is considerably more fine-grained. The LRU in Memcached operates per slab class rather than globally across the whole cache, which follows directly from the way memory is allocated and managed; memory can, however, be rebalanced between slab classes through slab reassignment.
Although this way of allocating and managing memory avoids fragmentation, it does waste memory. For example, if an item is 50K and the best-fitting chunk for it is 80K, 30K of memory are wasted.
4. Source Code Analysis
The sections above gave us a fairly good understanding of Memcached's memory model; now let us see, at the code level, how Memcached actually allocates memory.
Initialization of the slabclass array:
/* the slabclass array is initialized at system startup */
static slabclass_t slabclass[MAX_NUMBER_OF_SLAB_CLASSES];
void slabs_init(const size_t limit, const double factor, const bool prealloc) {
int i = POWER_SMALLEST - 1; // index just before the first slab class
unsigned int size = sizeof(item) + settings.chunk_size; // smallest chunk size: item header (metadata) plus the minimum payload size
mem_limit = limit; // maximum amount of memory that may be allocated
if (prealloc) { // only taken when memory is preallocated up front
/* Allocate everything in a big chunk with malloc */
mem_base = malloc(mem_limit);
if (mem_base != NULL) {
mem_current = mem_base;
mem_avail = mem_limit;
} else {
fprintf(stderr, "Warning: Failed to allocate requested memory in"
" one large chunk.\nWill allocate in smaller chunks\n");
}
}
memset(slabclass, 0, sizeof(slabclass));
while (++i < MAX_NUMBER_OF_SLAB_CLASSES-1 && size <= settings.item_size_max / factor) {
/* make sure chunk sizes are aligned; CHUNK_ALIGN_BYTES is 8 bytes by default */
if (size % CHUNK_ALIGN_BYTES)
size += CHUNK_ALIGN_BYTES - (size % CHUNK_ALIGN_BYTES);
/* record the chunk size of this slab class and how many chunks fit in one slab */
slabclass[i].size = size;
slabclass[i].perslab = settings.item_size_max / slabclass[i].size;
/* the chunk size of the next class grows by 'factor'; tuning factor controls how storage is partitioned */
size *= factor;
if (settings.verbose > 1) {
fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
i, slabclass[i].size, slabclass[i].perslab);
}
}
/* power_largest is the last slab class; its chunk size is set to item_size_max so it can hold the largest allowed object; one slab (1 MB by default) then holds exactly one item */
power_largest = i;
slabclass[power_largest].size = settings.item_size_max;
slabclass[power_largest].perslab = 1;
if (settings.verbose > 1) {
fprintf(stderr, "slab class %3d: chunk size %9u perslab %7u\n",
i, slabclass[i].size, slabclass[i].perslab);
}
/* for the test suite: faking of how much we've already malloc'd */
{
char *t_initial_malloc = getenv("T_MEMD_INITIAL_MALLOC");
if (t_initial_malloc) {
mem_malloced = (size_t)atol(t_initial_malloc);
}
}
/* if preallocation was requested, preallocate memory for the slab classes now */
if (prealloc) {
/* by default one slab is preallocated for every slab class */
slabs_preallocate(power_largest);
}
}
From the slabclass initialization code we can see that the chunk size each slab class can store is fixed, and that the sizes of successive classes grow according to the factor parameter (1.25 by default). The value of factor affects memory utilization and needs to be tuned for the application.
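To get a feel for the resulting size ladder, the following standalone sketch reproduces the sizing loop of slabs_init under the default settings (chunk_size 48, item_size_max 1 MB, factor 1.25). It is an illustration, not Memcached code, and the 48-byte item header is an assumption for a typical 64-bit build:

/* Standalone illustration of the chunk-size ladder computed by slabs_init. */
#include <stdio.h>

int main(void) {
    const unsigned int item_header = 48;          /* stand-in for sizeof(item) */
    const unsigned int chunk_size = 48;           /* settings.chunk_size */
    const unsigned int item_size_max = 1024 * 1024;
    const unsigned int align = 8;                 /* CHUNK_ALIGN_BYTES */
    const double factor = 1.25;                   /* settings.factor */
    unsigned int size = item_header + chunk_size; /* smallest chunk: 96 bytes */
    int cls = 1;
    while (size <= item_size_max / factor) {
        if (size % align)                         /* align like slabs_init does */
            size += align - (size % align);
        printf("class %2d: chunk %7u bytes, %5u per slab\n",
               cls++, size, item_size_max / size);
        size *= factor;                           /* grow to the next class */
    }
    return 0;
}

The larger the factor, the fewer classes there are and the bigger the gap between neighbouring chunk sizes, i.e. the more memory can potentially be wasted per item.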
The slabclass initialization above uses quite a few global settings; they are defined as follows:
static void settings_init(void) {
settings.use_cas = true;
settings.access = 0700;
settings.port = 11211;
settings.udpport = 11211;
/* By default this string should be NULL for getaddrinfo() */
settings.inter = NULL;
settings.maxbytes = 64 * 1024 * 1024; /* default is 64MB */
settings.maxconns = 1024; /* to limit connections-related memory to about 5MB */
settings.verbose = 0;
settings.oldest_live = 0;
settings.oldest_cas = 0; /* supplements accuracy of oldest_live */
settings.evict_to_free = 1; /* push old items out of cache when memory runs out */
settings.socketpath = NULL; /* by default, not using a unix socket */
settings.factor = 1.25;
settings.chunk_size = 48; /* minimum space reserved for a key and value; used to size the smallest chunk */
settings.num_threads = 4; /* N workers */
settings.num_threads_per_udp = 0;
settings.prefix_delimiter = ':';
settings.detail_enabled = 0;
settings.reqs_per_event = 20;
settings.backlog = 1024;
settings.binding_protocol = negotiating_prot;
settings.item_size_max = 1024 * 1024; /* 1 MB limit; this caps the size of a cached object. */
settings.maxconns_fast = false;
settings.lru_crawler = false;
settings.lru_crawler_sleep = 100;
settings.lru_crawler_tocrawl = 0;
settings.lru_maintainer_thread = false;
settings.hot_lru_pct = 32;
settings.warm_lru_pct = 32;
settings.expirezero_does_not_evict = false;
settings.hashpower_init = 0;
settings.slab_reassign = false;
settings.slab_automove = 0;
settings.shutdown_command = false;
settings.tail_repair_time = TAIL_REPAIR_TIME_DEFAULT;
settings.flush_enabled = true;
settings.crawls_persleep = 1000;
}
Having seen how the slab classes are initialized, let us look at how an individual slab is allocated.
static int do_slabs_newslab(const unsigned int id) {
slabclass_t *p = &slabclass[id];
/* if slab reassignment is enabled, always allocate a full-size slab (item_size_max) so pages are interchangeable between classes */
int len = settings.slab_reassign ? settings.item_size_max
: p->size * p->perslab;
char *ptr;
if ((mem_limit && mem_malloced + len > mem_limit && p->slabs > 0)) {
mem_limit_reached = true;
MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
return 0;
}
/* grow_slab_list makes sure slab_list has room for another slab pointer: if slabs < list_size there is still a free entry, otherwise the array is doubled (it starts at 16 entries) */
if ((grow_slab_list(id) == 0) ||
((ptr = memory_allocate((size_t)len)) == 0)) {
MEMCACHED_SLABS_SLABCLASS_ALLOCATE_FAILED(id);
return 0;
}
memset(ptr, 0, (size_t)len);
/* split the newly allocated slab pointed to by ptr into free chunks */
split_slab_page_into_freelist(ptr, id);
p->slab_list[p->slabs++] = ptr;
mem_malloced += len;
MEMCACHED_SLABS_SLABCLASS_ALLOCATE(id);
return 1;
}
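For reference, grow_slab_list only has to make sure slab_list can hold one more slab pointer; paraphrased (error handling aside, this is essentially all it does), it looks roughly like this:

/* Paraphrased sketch of grow_slab_list: make room in slab_list for one more
 * slab pointer, doubling the array (it starts at 16 entries) when it is full. */
static int grow_slab_list(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    if (p->slabs == p->list_size) {
        size_t new_size = (p->list_size != 0) ? p->list_size * 2 : 16;
        void *new_list = realloc(p->slab_list, new_size * sizeof(void *));
        if (new_list == 0)
            return 0;                 /* out of memory: report failure */
        p->list_size = new_size;
        p->slab_list = new_list;
    }
    return 1;
}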
When a slab class requests a new slab, it first makes sure there is room for another slab pointer (growing slab_list when slabs == list_size) and then allocates the slab's memory. Once the slab has been allocated, Memcached splits it into free chunks. Let us look at how a freshly allocated slab is turned into free chunks.
static void split_slab_page_into_freelist(char *ptr, const unsigned int id) {
slabclass_t *p = &slabclass[id];
int x;
for (x = 0; x < p->perslab; x++) {
do_slabs_free(ptr, 0, id);
ptr += p->size;
}
}
/* initialize one chunk as a free item and put it on the free list */
static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
slabclass_t *p;
item *it;
assert(id >= POWER_SMALLEST && id <= power_largest);
if (id < POWER_SMALLEST || id > power_largest)
return;
MEMCACHED_SLABS_FREE(size, id, ptr);
p = &slabclass[id];
/* push the newly created free item onto the head of the slots free list */
it = (item *)ptr;
it->it_flags |= ITEM_SLABBED;
it->slabs_clsid = 0;
it->prev = 0;
it->next = p->slots;
if (it->next) it->next->prev = it;
p->slots = it;
p->sl_curr++;
p->requested -= size;
return;
}
So every new slab Memcached obtains is immediately carved up into items that are tracked in a linked list. At this point all of Memcached's memory bookkeeping is initialized: the slabclass array has been built, each slab class has a fixed chunk size, a fixed number of chunks per slab, and a slab_list that starts with room for 16 slab pointers, and every chunk of a freshly allocated slab has been initialized as a free item. All free items of a class are kept on a single linked list, the slots field, which spans all slabs of that class: every slab's items can be reached from it, and within one slab they are linked in order.
With slab allocation covered, we now turn to how an item obtains its memory. The following function can be regarded as the core entry point of Memcached's memory allocation:
#define HOT_LRU 0
#define WARM_LRU 64
#define COLD_LRU 128
#define NOEXP_LRU 192
item *do_item_alloc(char *key, const size_t nkey, const int flags,
const rel_time_t exptime, const int nbytes,
const uint32_t cur_hv) {
int i;
uint8_t nsuffix;
item *it = NULL;
char suffix[40];
unsigned int total_chunks;
size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix); // compute the total size of the item (header + key + suffix + data)
if (settings.use_cas) {
ntotal += sizeof(uint64_t);
}
//pick the slab class that fits the total size; slab class ids start at 1
unsigned int id = slabs_clsid(ntotal);
if (id == 0)
return 0;
/* If no memory is available, attempt a direct LRU juggle/eviction */
/* This is a race in order to simplify lru_pull_tail; in cases where
* locked items are on the tail, you want them to fall out and cause
* occasional OOM's, rather than internally work around them.
* This also gives one fewer code path for slab alloc/free
*/
for (i = 0; i < 5; i++) {
/* first reclaim expired items from the LRU tail; lru_maintainer_thread is false by default */
if (!settings.lru_maintainer_thread) {
lru_pull_tail(id, COLD_LRU, 0, false, cur_hv);
}
/* try to allocate a chunk for the item */
it = slabs_alloc(ntotal, id, &total_chunks);
if (settings.expirezero_does_not_evict)
total_chunks -= noexp_lru_size(id);
/* if allocation failed, try to free memory through the LRU queues and retry */
if (it == NULL) {
if (settings.lru_maintainer_thread) {
lru_pull_tail(id, HOT_LRU, total_chunks, false, cur_hv);
lru_pull_tail(id, WARM_LRU, total_chunks, false, cur_hv);
lru_pull_tail(id, COLD_LRU, total_chunks, true, cur_hv);
} else {
lru_pull_tail(id, COLD_LRU, 0, true, cur_hv);
}
} else {
break;
}
}
if (i > 0) {
pthread_mutex_lock(&lru_locks[id]);
itemstats[id].direct_reclaims += i;
pthread_mutex_unlock(&lru_locks[id]);
}
/* still not enough memory to store the item */
if (it == NULL) {
pthread_mutex_lock(&lru_locks[id]);
itemstats[id].outofmemory++;
pthread_mutex_unlock(&lru_locks[id]);
return NULL;
}
assert(it->slabs_clsid == 0);
//assert(it != heads[id]);
/* Refcount is seeded to 1 by slabs_alloc() */
it->next = it->prev = it->h_next = 0;
/* Items are initially loaded into the HOT_LRU. This is '0' but I want at
* least a note here. Compiler (hopefully?) optimizes this out.
*/
if (settings.lru_maintainer_thread) {
if (exptime == 0 && settings.expirezero_does_not_evict) {
id |= NOEXP_LRU;
} else {
id |= HOT_LRU;
}
} else {
/* There is only COLD in compat-mode */
id |= COLD_LRU;
}
it->slabs_clsid = id;
DEBUG_REFCNT(it, '*');
it->it_flags = settings.use_cas ? ITEM_CAS : 0;
it->nkey = nkey;
it->nbytes = nbytes;
memcpy(ITEM_key(it), key, nkey);
it->exptime = exptime;
memcpy(ITEM_suffix(it), suffix, (size_t)nsuffix);
it->nsuffix = nsuffix;
return it;
}
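Inside the retry loop, slabs_alloc is what actually hands out a chunk; its core, do_slabs_alloc, is easy to summarize: pop a free chunk off the slots list, allocating a new slab for the class first if the list is empty. A paraphrased sketch (bounds checks and statistics omitted):

/* Paraphrased sketch of do_slabs_alloc: take a chunk from the slots free
 * list, allocating a new slab for this class first if the list is empty. */
static void *do_slabs_alloc(const size_t size, unsigned int id,
                            unsigned int *total_chunks) {
    slabclass_t *p = &slabclass[id];
    item *it;

    *total_chunks = p->slabs * p->perslab;       /* current capacity of the class */
    if (p->sl_curr == 0 && do_slabs_newslab(id) == 0)
        return NULL;                             /* no free chunk and no new slab */

    it = (item *)p->slots;                       /* pop the head of the free list */
    p->slots = it->next;
    if (it->next) it->next->prev = 0;
    p->sl_curr--;
    it->it_flags &= ~ITEM_SLABBED;               /* chunk is no longer on the free list */
    it->refcount = 1;                            /* "Refcount is seeded to 1 by slabs_alloc()" */
    p->requested += size;                        /* account for the bytes requested */
    return it;
}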
As do_item_alloc shows, allocating an item first walks the LRU to release expired items, then calls slabs_alloc; if there still is not enough memory, it goes back to the LRU, this time allowed to evict objects that are least recently used even if they have not expired. Let us now look at the implementation of lru_pull_tail:
/* Returns the number of items removed, expired, or evicted.
* Callable from worker threads or the LRU maintainer thread */
static int lru_pull_tail(const int orig_id, const int cur_lru,
const unsigned int total_chunks, const bool do_evict, const uint32_t cur_hv) {
item *it = NULL;
int id = orig_id;
int removed = 0;
if (id == 0)
return 0;
int tries = 5;
item *search;
item *next_it;
void *hold_lock = NULL;
unsigned int move_to_lru = 0;
uint64_t limit;
id |= cur_lru;
pthread_mutex_lock(&lru_locks[id]);
search = tails[id];
/* We walk up *only* for locked items, and if bottom is expired. */
for (; tries > 0 && search != NULL; tries--, search=next_it) {
/* we might relink search mid-loop, so search->prev isn't reliable */
next_it = search->prev;
if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
/* We are a crawler, ignore it. */
tries++;
continue;
}
uint32_t hv = hash(ITEM_key(search), search->nkey);
/* Attempt to hash item lock the "search" item. If locked, no
 * other callers can incr the refcount. Also skip ourselves. */
if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
continue;
/* Now see if the item is refcount locked */
if (refcount_incr(&search->refcount) != 2) {
/* Note pathological case with ref'ed items in tail.
* Can still unlink the item, but it won't be reusable yet */
itemstats[id].lrutail_reflocked++;
/* In case of refcount leaks, enable for quick workaround. */
/* WARNING: This can cause terrible corruption */
if (settings.tail_repair_time &&
search->time + settings.tail_repair_time < current_time) {
itemstats[id].tailrepairs++;
search->refcount = 1;
/* with refcount reset to 1 the item can be unlinked and reclaimed */
do_item_unlink_nolock(search, hv);
item_trylock_unlock(hold_lock);
continue;
}
}
/* Expired or flushed */
if ((search->exptime != 0 && search->exptime < current_time)
|| is_flushed(search)) {
itemstats[id].reclaimed++;
if ((search->it_flags & ITEM_FETCHED) == 0) {
itemstats[id].expired_unfetched++;
}
/* refcnt 2 -> 1 */
do_item_unlink_nolock(search, hv);
/* refcnt 1 -> 0 -> item_free */
do_item_remove(search);
item_trylock_unlock(hold_lock);
removed++;
/* If all we're finding are expired, can keep going */
continue;
}
/* If we're HOT_LRU or WARM_LRU and over size limit, send to COLD_LRU.
* If we're COLD_LRU, send to WARM_LRU unless we need to evict
*/
switch (cur_lru) {
case HOT_LRU:
limit = total_chunks * settings.hot_lru_pct / 100;
case WARM_LRU:
limit = total_chunks * settings.warm_lru_pct / 100;
if (sizes[id] > limit) {
itemstats[id].moves_to_cold++;
move_to_lru = COLD_LRU;
do_item_unlink_q(search);
it = search;
removed++;
break;
} else if ((search->it_flags & ITEM_ACTIVE) != 0) {
/* Only allow ACTIVE relinking if we're not too large. */
itemstats[id].moves_within_lru++;
search->it_flags &= ~ITEM_ACTIVE;
do_item_update_nolock(search);
do_item_remove(search);
item_trylock_unlock(hold_lock);
} else {
/* Don't want to move to COLD, not active, bail out */
it = search;
}
break;
case COLD_LRU:
it = search; /* No matter what, we're stopping */
if (do_evict) {
if (settings.evict_to_free == 0) {
/* Don't think we need a counter for this. It'll OOM. */
break;
}
itemstats[id].evicted++;
itemstats[id].evicted_time = current_time - search->time;
if (search->exptime != 0)
itemstats[id].evicted_nonzero++;
if ((search->it_flags & ITEM_FETCHED) == 0) {
itemstats[id].evicted_unfetched++;
}
do_item_unlink_nolock(search, hv);
removed++;
} else if ((search->it_flags & ITEM_ACTIVE) != 0
&& settings.lru_maintainer_thread) {
itemstats[id].moves_to_warm++;
search->it_flags &= ~ITEM_ACTIVE;
move_to_lru = WARM_LRU;
do_item_unlink_q(search);
removed++;
}
break;
}
if (it != NULL)
break;
}
pthread_mutex_unlock(&lru_locks[id]);
if (it != NULL) {
if (move_to_lru) {
it->slabs_clsid = ITEM_clsid(it);
it->slabs_clsid |= move_to_lru;
item_link_q(it);
}
do_item_remove(it);
item_trylock_unlock(hold_lock);
}
return removed;
}