I. Problem Symptoms
- During load testing, multiple modules reported the error "unable to create thread: Resource temporarily unavailable".
- Both the service processes and supervisor hit this error; after several failed restart attempts the services exited, and then the containers exited.
II. Root Cause Summary
In short, the problem is a mismatch between the scope at which the process resource limit configuration is applied and the scope at which the kernel enforces per-user limits: the ulimit configuration is read inside each container and takes effect per process, while the kernel, when checking some resources (here, the number of threads), does not distinguish between processes at all and uses the machine-wide total across all processes of a single user.
The per-user thread count is checked by the kernel. Although each container has an isolated runtime environment, to the kernel they are just so many processes, so for processes running under the same user id, even in different containers, what the kernel sees is the accumulated thread count.
On the other hand, the limit configuration that actually takes effect is applied in user space, i.e. each container reads its own limit configuration files.
In CentOS's default limit configuration, the soft limit on the number of processes for non-root users is 4096:
root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 4096
root soft nproc unlimited
When a system call is made to create another thread, the kernel checks against these two values. Hence the symptom: a user id that does not have many threads inside a given container still cannot create new ones, because although the process's own limit is 4096, that user id, as far as the kernel is concerned, already has more than 4096 threads in total on the machine.
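To make the failure mode concrete, here is a minimal sketch (not the service code; the file name and the limit value of 64 are arbitrary examples): the program lowers its own RLIMIT_NPROC and keeps creating threads, and once the calling uid's machine-wide task count reaches that limit, pthread_create() returns EAGAIN, which is exactly the "Resource temporarily unavailable" seen during the load test.
/* nproc_demo.c - minimal sketch: lower RLIMIT_NPROC, then create threads
 * until pthread_create() fails with EAGAIN ("Resource temporarily
 * unavailable"). Run as a non-root user; the limit of 64 is arbitrary.
 * Build: gcc nproc_demo.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
    (void)arg;
    pause();                 /* keep the thread alive so it stays counted */
    return NULL;
}

int main(void)
{
    /* The kernel compares this limit against the uid's task count on the
     * whole machine, not just against the threads of this process. */
    struct rlimit rl = { .rlim_cur = 64, .rlim_max = 64 };
    if (setrlimit(RLIMIT_NPROC, &rl) != 0)
        perror("setrlimit");

    for (int i = 0; ; i++) {
        pthread_t tid;
        int err = pthread_create(&tid, NULL, idle_thread, NULL);
        if (err != 0) {      /* typically EAGAIN once the limit is hit */
            fprintf(stderr, "thread %d: pthread_create: %s\n", i, strerror(err));
            return 1;
        }
    }
}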
The detailed mechanism is explained below.
III. Details
This part covers three things:
- When the ulimit configuration takes effect.
- How the kernel checks whether a limit has been exceeded.
- How ulimit settings are inherited when docker starts a container, i.e. how to fix the problem.
How the ulimit configuration takes effect
1. How processes were originally started inside the container:
When the container starts it runs entrypoint.sh, which creates a user with the specified id, fixes directory ownership, then switches to that user via su and runs supervisor, which in turn launches the service processes:
➜ data-proxy git:(master) ✗ cat entrypoint.sh
#!/bin/sh
username="yibot"
#create user if not exists
egrep "^${YIBOT_UID}" /etc/passwd >& /dev/null
if [ $? -ne 0 ]
then
    useradd -u "${YIBOT_UID}" "${username}"
fi
mkdir -p /data/yibot/"${MODULE}"/log/ && \
mkdir -p /data/supervisor/ && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
su yibot -c "supervisord -n"
2. A brief introduction to PAM
PAM (Pluggable Authentication Modules) is a set of libraries created to decouple applications that need authentication from the authentication mechanisms themselves; the modules are not part of the kernel, and the kernel itself performs no authentication. Applications such as su and login use this library.
An introduction to PAM: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam
3. PAM and how the ulimit configuration is read
Looking at the PAM source, in the limits module (https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c) every PAM session call goes through parse_config_file -> setup_limits:
retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);
retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);
parse_config_file reads the limit configuration from the given config files and stores it in the pam_limit_s structure pointed to by pl; the structure is defined as follows:
/* internal data */
struct pam_limit_s {
    int login_limit;          /* the max logins limit */
    int login_limit_def;      /* which entry set the login limit */
    int flag_numsyslogins;    /* whether to limit logins only for a
                                 specific user or to count all logins */
    int priority;             /* the priority to run user process with */
    struct user_limits_struct limits[RLIM_NLIMITS];
    const char *conf_file;
    int utmp_after_pam_call;
    char login_group[LINE_LENGTH];
};
Each limit's value is stored in the limits array; the user_limits_struct structure holds both the soft limit and the hard limit:
struct user_limits_struct {
    int supported;
    int src_soft;
    int src_hard;
    struct rlimit limit;
};
The limit member is filled in by init_limits, which uses the getrlimit system call to fetch the current process's limit values.
After the config files have been parsed, setup_limits uses the setrlimit system call to update the rlim values in the current process's PCB:
for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
    int res;

    if (!pl->limits[i].supported) {
        /* skip it if its not known to the system */
        continue;
    }

    if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
        pl->limits[i].src_hard == LIMITS_DEF_NONE) {
        /* skip it if its not initialized */
        continue;
    }

    if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
        pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
    res = setrlimit(i, &pl->limits[i].limit);
    if (res != 0)
        pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
                   rlimit2str(i));
    status |= res;
}
That is how the PAM library reads and applies the limit configuration. The exact behavior of the getrlimit and setrlimit system calls is covered below.
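To make that round trip concrete, here is a minimal user-space sketch (not the pam_limits code itself) of what setup_limits does for one entry: fetch the current RLIMIT_NPROC with getrlimit, overwrite the soft value with a number standing in for a parsed "soft nproc 4096" line, clamp it to the hard limit, and write it back with setrlimit.
/* Minimal sketch of the getrlimit/setrlimit round trip performed by
 * pam_limits; 4096 stands in for a value parsed from
 * /etc/security/limits.d/. This is not the actual PAM module code. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {   /* init_limits() analogue */
        perror("getrlimit");
        return 1;
    }

    rl.rlim_cur = 4096;                        /* "* soft nproc 4096" */
    if (rl.rlim_cur > rl.rlim_max)             /* same clamp as setup_limits() */
        rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_NPROC, &rl) != 0) {   /* setup_limits() analogue */
        perror("setrlimit");
        return 1;
    }
    printf("RLIMIT_NPROC is now soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}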
4. su and PAM:
su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
In the current su implementation, PAM support is compiled in conditionally:
#ifdef USE_PAM
    ret = pam_start ("su", name, &conv, &pamh);
    if (PAM_SUCCESS != ret) {
        SYSLOG ((LOG_ERR, "pam_start: error %d", ret));
        fprintf (stderr,
                 _("%s: pam_start: error %d\n"),
                 Prog, ret);
        exit (1);
    }
On a recent CentOS, running ldd on su confirms that this option is enabled:
root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)
The su man page also states:
This version of su uses PAM for authentication, account and session management. Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.
In pam_start ("su", name, &conv, &pamh), PAM looks under /etc/pam.d/ for a file named su and loads it as its configuration; that file lists the modules PAM will use for authentication, which is what makes the framework "pluggable".
Eventually, when PAM opens the session with pam_open_session, it calls pam_sm_open_session in pam_limits, which parses the limits configuration files and applies them.
After su has switched users it starts a shell by default, and the shell inherits the updated limits; the inheritance mechanism itself is described further below.
/*
 * Use the shell and create an argv
 * with the rest of the command line included.
 */
argv[-1] = cp;
execve_shell (shellstr, &argv[-1], environ);
Every process started after that inherits these limits as well.
A PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html
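For completeness, the session flow that su drives can be sketched in a few lines (a hedged sketch only, with error handling trimmed; the user name "yibot" is taken from entrypoint.sh). Using the service name "su" makes PAM load /etc/pam.d/su, and pam_limits applies the limits configuration inside pam_open_session():
/* Minimal sketch of the PAM session flow used by su.
 * Build: gcc pam_session_demo.c -lpam -lpam_misc (run with privileges) */
#include <security/pam_appl.h>
#include <security/pam_misc.h>
#include <stdio.h>

int main(void)
{
    static struct pam_conv conv = { misc_conv, NULL };  /* stock conversation fn */
    pam_handle_t *pamh = NULL;
    const char *user = "yibot";          /* user to switch to, as in entrypoint.sh */

    int ret = pam_start("su", user, &conv, &pamh);      /* loads /etc/pam.d/su */
    if (ret == PAM_SUCCESS)
        ret = pam_open_session(pamh, 0); /* pam_limits applies the limits here */
    if (ret == PAM_SUCCESS) {
        /* ... fork/exec the user's shell here; it inherits the new rlimits ... */
        ret = pam_close_session(pamh, 0);
    }
    if (ret != PAM_SUCCESS)
        fprintf(stderr, "PAM: %s\n", pam_strerror(pamh, ret));
    return pam_end(pamh, ret) == PAM_SUCCESS ? 0 : 1;
}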
The above explains why, with the original entrypoint.sh, su invokes PAM, which reads the limit configuration inside the container (/etc/security/limits.d/). When the target user is not root, the process's nproc limit is therefore set to 4096.
5. Behavior of the setrlimit system call
Kernel source: https://github.com/torvalds/linux
The setrlimit system call is defined as follows:
SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
    struct rlimit new_rlim;

    if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
        return -EFAULT;
    return do_prlimit(current, resource, &new_rlim, NULL);
}
current is a pointer to the current process's PCB, i.e. its task_struct. do_prlimit then calls security_task_setrlimit (the LSM hook check) and, if that allows it, updates the limit values stored in the current PCB:
int security_task_setrlimit(struct task_struct *p, unsigned int resource,
                            struct rlimit *new_rlim)
{
    return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}
The macro below iterates over the list of security (LSM) hooks registered for FUNC in security_hook_heads and calls each one in turn, stopping at the first hook that returns a non-zero value:
#define call_int_hook(FUNC, IRC, ...) ({                            \
    int RC = IRC;                                                   \
    do {                                                            \
        struct security_hook_list *P;                               \
                                                                    \
        hlist_for_each_entry(P, &security_hook_heads.FUNC, list) {  \
            RC = P->hook.FUNC(__VA_ARGS__);                         \
            if (RC != 0)                                            \
                break;                                              \
        }                                                           \
    } while (0);                                                    \
    RC;                                                             \
})
For reference, here is part of the PCB's task_struct definition; the full definition is at https://github.com/torvalds/linux/blob/master/include/linux/sched.h
struct task_struct {
    ...
    /* Real parent process: */
    struct task_struct __rcu *real_parent;
    /* Recipient of SIGCHLD, wait4() reports: */
    struct task_struct __rcu *parent;
    /*
     * Children/sibling form the list of natural children:
     */
    struct list_head children;
    struct list_head sibling;
    struct task_struct *group_leader;
    ...
    /* Effective (overridable) subjective task credentials (COW): */
    const struct cred __rcu *cred;
    ...
    /* Signal handlers: */
    struct signal_struct *signal;
    ...
}
Note: with list_head and the list_entry macro the kernel implements a generic doubly linked list.
rlim is defined in struct signal_struct:
struct signal_struct {
    ...
    /*
     * We don't bother to synchronize most readers of this at all,
     * because there is no reader checking a limit that actually needs
     * to get both rlim_cur and rlim_max atomically, and either one
     * alone is a single word that can safely be read normally.
     * getrlimit/setrlimit use task_lock(current->group_leader) to
     * protect this instead of the siglock, because they really
     * have no need to disable irqs.
     */
    struct rlimit rlim[RLIM_NLIMITS];
    ...
}
The rlim array holds this process's resource limit values, and setrlimit ultimately modifies the values in this array. Each process therefore carries its own copy of these limits.
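Because every process carries its own rlim array, any process's limits can be inspected (or changed) individually from user space; a quick way to see this is the prlimit() wrapper over the same kernel path (a small sketch; the target PID comes from the command line, and 0 means the caller itself):
/* Minimal sketch: read RLIMIT_NPROC of an arbitrary process via prlimit(),
 * illustrating that the rlim[] values live in each process's own PCB.
 * Usage: ./a.out <pid> */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;
    struct rlimit rl;

    /* new_limit == NULL: only read the current values, do not modify them */
    if (prlimit(pid, RLIMIT_NPROC, NULL, &rl) != 0) {
        perror("prlimit");
        return 1;
    }
    printf("pid %d: RLIMIT_NPROC soft=%llu hard=%llu\n", (int)pid,
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}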
How the kernel validates the nproc limit (the limit on the number of processes)
1. Total number of processes per user
The task_struct definition shown above contains a struct cred, which is defined as follows:
struct cred {
    ...
    kuid_t uid;     /* real UID of the task */
    kgid_t gid;     /* real GID of the task */
    kuid_t suid;    /* saved UID of the task */
    kgid_t sgid;    /* saved GID of the task */
    kuid_t euid;    /* effective UID of the task */
    kgid_t egid;    /* effective GID of the task */
    kuid_t fsuid;   /* UID for VFS ops */
    kgid_t fsgid;   /* GID for VFS ops */
    ...
    struct user_struct *user;   /* real user ID subscription */
    ...
}
where struct user_struct is defined as:
struct user_struct {
    refcount_t __count;     /* reference count */
    atomic_t processes;     /* How many processes does this user have? */
    atomic_t sigpending;    /* How many pending signals does this user have? */
    ...
}
As the next subsection shows, the struct user_struct *user in the PCB is globally unique per uid, so processes is the total number of processes that user is currently running on the whole system (in Linux, processes and threads are almost the same thing; the kernel schedules both as tasks):
http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf
In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.
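There is no direct user-space view of user_struct->processes, but the number it tracks can be approximated by walking /proc and summing the Threads: field of every process owned by the uid (a rough sketch; the uid 1000 is an arbitrary example):
/* Rough sketch: approximate user_struct->processes for one uid by summing
 * the "Threads:" count of every /proc/<pid> whose real uid matches.
 * The uid below is just an example value. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const unsigned target_uid = 1000;   /* example uid */
    long total = 0;
    DIR *proc = opendir("/proc");
    struct dirent *de;

    if (!proc) { perror("opendir"); return 1; }
    while ((de = readdir(proc)) != NULL) {
        char path[64], line[256];
        unsigned uid = 0;
        long threads = 0;
        FILE *f;

        if (de->d_name[0] < '0' || de->d_name[0] > '9')
            continue;                    /* not a pid directory */
        snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;                    /* process may have exited already */
        while (fgets(line, sizeof(line), f)) {
            sscanf(line, "Uid: %u", &uid);          /* real uid */
            sscanf(line, "Threads: %ld", &threads); /* tasks in this process */
        }
        fclose(f);
        if (uid == target_uid)
            total += threads;
    }
    closedir(proc);
    printf("uid %u currently owns %ld kernel tasks "
           "(this is what gets compared with RLIMIT_NPROC)\n", target_uid, total);
    return 0;
}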
2. struct user_struct *user is globally unique
In the su implementation, change_uid is called, which ultimately switches the uid via the setuid system call:
SYSCALL_DEFINE1(setuid, uid_t, uid)
{
    return __sys_setuid(uid);
}
__sys_setuid calls set_user to perform the actual switch; the parameter new is a copy of the cred structure from the current PCB:
long __sys_setuid(uid_t uid)
{
    ...
    if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
        new->suid = new->uid = kuid;
        if (!uid_eq(kuid, old->uid)) {
            retval = set_user(new);
            if (retval < 0)
                goto error;
        }
    } else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
        goto error;
    }
}
The full implementation of set_user:
/*
 * change the user struct in a credentials set to match the new UID
 */
static int set_user(struct cred *new)
{
    struct user_struct *new_user;

    new_user = alloc_uid(new->uid);
    if (!new_user)
        return -EAGAIN;

    /*
     * We don't fail in case of NPROC limit excess here because too many
     * poorly written programs don't check set*uid() return code, assuming
     * it never fails if called by root. We may still enforce NPROC limit
     * for programs doing set*uid()+execve() by harmlessly deferring the
     * failure to the execve() stage.
     */
    if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
            new_user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;

    free_uid(new->user);
    new->user = new_user;
    return 0;
}
Now look at alloc_uid:
struct user_struct *alloc_uid(kuid_t uid)
{
    struct hlist_head *hashent = uidhashentry(uid);
    struct user_struct *up, *new;

    spin_lock_irq(&uidhash_lock);
    up = uid_hash_find(uid, hashent);
    spin_unlock_irq(&uidhash_lock);
    ...
}
In kernel/user.c, uidhashentry is defined as follows:
#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid))))
static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];
Together with the implementation of uid_hash_find:
static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
    struct user_struct *user;

    hlist_for_each_entry(user, hashent, uidhash_node) {
        if (uid_eq(user->uid, uid)) {
            refcount_inc(&user->__count);
            return user;
        }
    }
    return NULL;
}
This shows that for a given uid the user_struct really is globally unique: the kernel hashes the uid, looks the structure up in the corresponding hash chain, and hands the pointer back to the PCB.
3. Validity check when a new process is created
As shown above, set_user already performs the following check:
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
        new_user != INIT_USER)
    current->flags |= PF_NPROC_EXCEEDED;
else
    current->flags &= ~PF_NPROC_EXCEEDED;
rlimit(RLIMIT_NPROC) reads the nproc limit from the current process's PCB and compares it against the total number of tasks the new user already owns.
In addition, the exec implementation __do_execve_file contains a similar check (https://github.com/torvalds/linux/blob/master/fs/exec.c):
if ((current->flags & PF_NPROC_EXCEEDED) &&
        atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}
Other code paths that create processes perform similar checks; a sketch showing how this surfaces in practice follows below.
Also, fork ultimately increments the user's process count via copy_creds, which executes atomic_inc(&p->cred->user->processes);.
exec likewise goes through commit_creds, which executes the same atomic_inc(&p->cred->user->processes); when the committed credentials belong to a different user, keeping the new user's process count up to date.
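The deferral described in the set_user comment can be observed with a short sketch (assumptions: it is run as root, and TARGET_UID already owns more tasks than the limit set below). set*uid() itself still succeeds, only marking the task with PF_NPROC_EXCEEDED, and the failure surfaces at execve() as EAGAIN:
/* Sketch of the deferred NPROC check: lower RLIMIT_NPROC as root, switch to
 * a uid that already exceeds it, then execve(). setuid() succeeds anyway;
 * the execve() that follows fails with EAGAIN. TARGET_UID is an example. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

#define TARGET_UID 1000     /* a uid that already runs many tasks */

int main(void)
{
    struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1 };

    if (setrlimit(RLIMIT_NPROC, &rl) != 0)
        perror("setrlimit");

    if (setuid(TARGET_UID) != 0) {   /* succeeds even though the limit is exceeded */
        perror("setuid");
        return 1;
    }

    execl("/bin/true", "true", (char *)NULL);
    perror("execve");                /* expected: Resource temporarily unavailable */
    return 1;
}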
IV. How Docker containers inherit ulimits
1. How a child process inherits the parent's ulimits
When a process is forked, the fork implementation in kernel/fork.c copies the contents of the PCB; in copy_signal:
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    ...
    task_lock(current->group_leader);
    memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
    task_unlock(current->group_leader);
    ...
}
The rlim array in the PCB is copied in full, so unless the child changes it with setrlimit, its limits are identical to the parent's.
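A small sketch makes this inheritance visible (the limit values are arbitrary examples): the parent lowers RLIMIT_NPROC, forks, and the child reads back the very same values without ever calling setrlimit itself.
/* Minimal sketch: a child's rlimits are a copy of the parent's
 * (via copy_signal() above) unless the child calls setrlimit itself. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    struct rlimit rl = { .rlim_cur = 2048, .rlim_max = 4096 };  /* example values */

    if (setrlimit(RLIMIT_NPROC, &rl) != 0)
        perror("setrlimit");

    if (fork() == 0) {               /* child: inherits the values set above */
        struct rlimit got;
        getrlimit(RLIMIT_NPROC, &got);
        printf("child sees soft=%llu hard=%llu\n",
               (unsigned long long)got.rlim_cur, (unsigned long long)got.rlim_max);
        _exit(0);
    }
    wait(NULL);
    return 0;
}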
2. How Docker starts a container
According to the official documentation, the rlim of a container's PID 1 is inherited from the docker daemon:
https://docs.docker.com/engine/reference/commandline/run/
Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. as option is disabled now. In other words, the following script is not supported:...
Since the docker daemon normally runs as root, PID 1 has the same rlim as root even when the container is told to run as a non-root user.
As long as nothing inside the container reads the container's own ulimit configuration via PAM (for example by running su to switch users, or by logging in remotely), every child process will inherit root's rlim unchanged.
In short, the fix is simply not to run su inside the container before the service processes are launched. The desired user can be specified when the container is started (for example with the Dockerfile USER directive or docker run --user), and the ulimits are then uniformly inherited from the docker daemon.