Memory holds a unique descriptor for every process, linked to the other processes through a number of structures; the scheduler's core task is then to run the code of all these processes efficiently and fairly. So what does a good scheduler need to provide?
- Efficiency: the overhead introduced by the scheduler itself should be as small as possible.
- Fairness: every job should eventually get to run, and the load should be balanced across CPUs.
- Flexibility: job priorities can be changed dynamically, jobs can be migrated easily, and different jobs can be given different entitlements.
- Isolation: jobs should not interfere with one another.
- Control: spawning a huge number of jobs must not bring the system down.
How does the Linux core scheduler address these requirements?
Core scheduler overview (the discussion below uses the CFS scheduler as the concrete example)
The Linux core scheduler consists of the following parts: the main scheduler (the scheduler proper), the periodic scheduler (driven by the timer tick), the runqueue (one ready queue per CPU), the scheduler classes (the different scheduling policies, such as real-time, completely fair, and deadline scheduling), and the scheduling entities (the job entities that actually get scheduled).
There are several scheduler classes, including the real-time class (policies SCHED_FIFO/SCHED_RR) and the deadline class (SCHED_DEADLINE), but here we use SCHED_NORMAL, i.e. the Completely Fair Scheduler (CFS), as the running example.
To schedule groups cleanly, the CFS scheduler introduces an abstraction called the scheduling entity, struct sched_entity. A scheduling entity (se) can represent either a task_group or a task_struct. If it represents a task_group, a corresponding cfs_rq is created on each CPU to manage all child entities of that group and to handle the group's scheduling details, including time-slice computation and bandwidth control of its jobs. Every CPU has an rq structure, which hosts the different scheduler classes and manages the top-level cfs_rq as well as the queues of the other classes such as rt_rq (I have not looked at those closely, so I won't speculate about them). When the periodic scheduler fires on a CPU, the appropriate scheduler class is picked to perform the corresponding scheduling update.
The relevant data structures are as follows:
struct rq {
...
struct cfs_rq cfs;
struct task_struct *curr;
...
}
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load; // sum of the weights of all child scheduling entities on this cfs_rq
unsigned long runnable_weight;
unsigned int nr_running; // number of scheduling entities queued on this cfs_rq
unsigned int h_nr_running;
u64 exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
#endif
struct rb_root_cached tasks_timeline; // red-black tree ordered by se->vruntime
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;
struct sched_entity *next;
struct sched_entity *last;
struct sched_entity *skip;
#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
#endif
...
/*
* h_load = weight * f(tg)
*
* Where f(tg) is the recursive weight fraction assigned to
* this group.
*/
unsigned long h_load;
u64 last_h_load_update;
struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
/*
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
*
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
* This list is used during load balance.
*/
int on_list;
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
int expires_seq;
u64 runtime_expires;
s64 runtime_remaining;
u64 throttled_clock;
u64 throttled_clock_task;
u64 throttled_clock_task_time;
int throttled;
int throttle_count;
struct list_head throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
struct sched_entity {
/* For load-balancing: */
struct load_weight load; // weight of this scheduling entity
unsigned long runnable_weight;
struct rb_node run_node; // 红黑树节点
struct list_head group_node;
unsigned int on_rq; // whether this entity is currently queued on a runqueue
u64 exec_start;
u64 sum_exec_runtime; // total physical (wall-clock) CPU time consumed so far
u64 vruntime; // virtual runtime, used by the scheduling policy
u64 prev_sum_exec_runtime; // snapshot of sum_exec_runtime taken when the entity was last picked to run
u64 nr_migrations;
struct sched_statistics statistics;
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent; // parent scheduling entity (the se of the enclosing task_group)
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq; // the cfs_rq this entity is (to be) queued on
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q; // non-NULL only for group entities: the cfs_rq owned by this group
#endif
#ifdef CONFIG_SMP
/*
* Per entity load average tracking.
*
* Put into separate cache line so it does not
* collide with read-mostly values above.
*/
struct sched_avg avg;
#endif
};
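As a quick way to tell the two kinds of entities apart, the kernel uses the my_q field just mentioned; with CONFIG_FAIR_GROUP_SCHED enabled the check is essentially the following macro (without group scheduling every entity is a task, so it is simply defined as 1):
/* a task entity owns no cfs_rq of its own; a group entity does */
#define entity_is_task(se)	(!se->my_q)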
A scheduler class mainly provides the following operations:
struct sched_class {
const struct sched_class *next;
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); // enqueue a job
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); // dequeue a job
void (*yield_task) (struct rq *rq); // current job voluntarily gives up the CPU
bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); // yield the CPU to a specific job
void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); // check whether the current job should be preempted
/*
* It is the responsibility of the pick_next_task() method that will
* return the next task to call put_prev_task() on the @prev task or
* something equivalent.
*
* May return RETRY_TASK when it finds a higher prio class has runnable
* tasks.
*/
struct task_struct * (*pick_next_task)(struct rq *rq,
struct task_struct *prev,
struct rq_flags *rf);
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
const struct cpumask *newmask);
void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
#endif
void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
void (*task_dead)(struct task_struct *p);
/*
* The switched_from() call is allowed to drop rq->lock, therefore we
* cannot assume the switched_from/switched_to pair is serliazed by
* rq->lock. They are however serialized by p->pi_lock.
*/
void (*switched_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio);
unsigned int (*get_rr_interval)(struct rq *rq,
struct task_struct *task);
void (*update_curr)(struct rq *rq);
#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
};
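As a concrete example, the CFS class wires these hooks to its own implementations in kernel/sched/fair.c. The following is only a trimmed sketch; the exact set and order of callbacks differs between kernel versions:
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.check_preempt_curr	= check_preempt_wakeup,
	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,
	.task_tick		= task_tick_fair,
	.task_fork		= task_fork_fair,
	.update_curr		= update_curr_fair,
	/* ... SMP and group-scheduling callbacks omitted ... */
};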
CPU structure diagram:
In the figure, both task_struct and task_group are represented by a sched_entity. Within one CPU a task_group corresponds to exactly one cfs_rq, and a task_group may have a cfs_rq on every CPU, so the scheduling entities of one group can span multiple CPUs. Each task_group also owns a cfs_bandwidth, which holds the quota information used for bandwidth control. Besides cfs, every rq also contains rt (rt_rq) and dl (dl_rq); those are not discussed here and are therefore not drawn.
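The group hierarchy is walked upwards through the se->parent pointers; with CONFIG_FAIR_GROUP_SCHED the kernel defines the traversal macro essentially as follows (it will show up again later in sched_slice()):
#ifdef CONFIG_FAIR_GROUP_SCHED
/* walk from a task's se up through its enclosing group entities */
#define for_each_sched_entity(se) \
		for (; se; se = se->parent)
#else
#define for_each_sched_entity(se) \
		for (; se; se = NULL)
#endif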
The basic scheduling flow is:
- The periodic timer tick (an hrtimer in high-resolution mode) raises an interrupt that calls the periodic scheduler, scheduler_tick(). Based on the scheduler class of the task currently running on this CPU's rq, it updates the scheduling state: the physical runtime and, for CFS, the virtual runtime of the running job, CPU accounting and load statistics, and the bandwidth-control bookkeeping. If the group that the current scheduling entity belongs to has used up its quota, the whole group is throttled and taken off the runqueue; otherwise we continue with step 2.
- The CFS scheduler updates the entity's physical and virtual runtime. Having done so, it compares the entity's actual runtime against its allotted time slice and, if the slice has been used up, requests preemption by setting TIF_NEED_RESCHED.
- The actual switch between processes is done by the main scheduler, schedule(), which performs the context_switch. schedule() is reached from many places: the timer/tick path, a process voluntarily yielding the CPU, return from a system call, etc.
- TIF_NEED_RESCHED does not cause a switch by itself; it is checked at well-defined points (return to user space, re-enabling preemption, explicit scheduling points), and when it is set schedule() is called. schedule() then updates the rq clock, asks the scheduler class for the next entity to run, performs the corresponding cfs_rq queue operations (re-queueing the job that was running and dequeuing the one about to run), clears the resched flag, switches the context, and points curr at the newly running job.
- On an SMP system, scheduler_tick() also triggers load balancing across CPUs.
The basic call graph is as follows:
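In rough pseudo-C, the tick-driven part of that call chain looks like the sketch below (function names as in kernel/sched/core.c and fair.c; locking, statistics and perf/PSI hooks omitted, and details vary by kernel version):
/* simplified sketch of the periodic tick path */
void scheduler_tick(void)
{
	struct rq *rq = cpu_rq(smp_processor_id());
	struct task_struct *curr = rq->curr;

	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);  /* task_tick_fair() for CFS */
#ifdef CONFIG_SMP
	trigger_load_balance(rq);                   /* raises the softirq that balances CPUs */
#endif
}
/* For CFS, task_tick_fair() walks the se hierarchy with for_each_sched_entity()
 * and calls entity_tick() at every level, which calls update_curr() and, when
 * more than one entity is queued, check_preempt_tick(). */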
The CFS completely fair scheduler
This is the scheduler that modern production Linux systems mostly run, and it is the one on which cgroup-based CPU management works best. Compared with the older O(1) scheduler it handles priorities and preemption better: under the traditional time-slice round-robin O(1) scheduler, a job that went to sleep on I/O re-entered the runqueue at the tail when it came back, which is very unfair to I/O-heavy jobs.
Compared with a traditional time-slice round-robin algorithm, CFS mainly changes the following:
- Time slices are not fixed; they depend on the job's priority (weight).
- A virtual runtime clock and a red-black tree are used to pick the next job to run.
- cfs_bandwidth is added to throttle jobs.
The scheduling details are as follows:
Time-slice computation
Priority and weight
Picking the time slice is a particularly important part of CFS, closely tied to its performance and fairness. Before looking at how CFS computes time slices, consider job priorities. Every task_struct has a priority, and for user-space processes it can be changed dynamically through the nice system call. The user-space nice value lies in [-20, 19]; a lower value means a higher priority. Internally the kernel uses the range [0, 139]: nice values [-20, 19] map to 100..139, while 0..99 are the real-time priorities. task_struct carries three priorities: the dynamic priority (prio), the normal priority (normal_prio) and the static priority (static_prio). The static priority is set when the program starts; the dynamic priority is computed by effective_prio(), and for CFS tasks it simply equals the static priority. How the dynamic priority is computed for the other classes is not discussed here.
CFS, however, schedules scheduling entities, and a scheduling entity only has a weight, not a priority, so a mapping from priority to weight is needed:
struct sched_entity {
/* For load-balancing: */
struct load_weight load;
};
struct load_weight {
unsigned long weight;
u32 inv_weight; // 2^32 / weight
};
Besides the weight itself the kernel also keeps inv_weight, a precomputed inverse that lets it replace division by the weight with a multiplication.
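A tiny user-space toy (not kernel code; the variable names are made up for illustration) shows the trick that inv_weight enables: dividing by a weight is replaced by multiplying with its precomputed 2^32 inverse and shifting right by 32, which is what __calc_delta() does later on.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t weight = 820;                                /* nice 1, from sched_prio_to_weight[] */
	uint32_t inv    = (uint32_t)((1ULL << 32) / weight);  /* what inv_weight caches */
	uint64_t delta  = 4000000;                            /* 4 ms in nanoseconds */

	uint64_t by_div = delta / weight;                     /* the "real" division */
	uint64_t by_mul = (delta * inv) >> 32;                /* multiply-and-shift replacement */

	/* the two results agree up to rounding */
	printf("%llu vs %llu\n",
	       (unsigned long long)by_div, (unsigned long long)by_mul);
	return 0;
}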
As a rule of thumb for nice values and weights, lowering a process's nice value by one gives it about 10% more CPU time (the weight feeds directly into the time-slice computation), so the basic mapping can simply be given by the following array:
static void set_load_weight(struct task_struct *p, bool update_load)
{
int prio = p->static_prio - MAX_RT_PRIO;
struct load_weight *load = &p->se.load;
/*
* SCHED_IDLE tasks get minimal weight:
*/
if (task_has_idle_policy(p)) {
load->weight = scale_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
p->se.runnable_weight = load->weight;
return;
}
/*
* SCHED_OTHER tasks have to update their load when changing their
* weight
*/
if (update_load && p->sched_class == &fair_sched_class) {
reweight_task(p, prio);
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
p->se.runnable_weight = load->weight;
}
}
const int sched_prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
/*
* Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
*
* In cases where the weight does not change often, we can use the
* precalculated inverse to speed up arithmetics by turning divisions
* into multiplications:
*/
const u32 sched_prio_to_wmult[40] = {
/* -20 */ 48388, 59856, 76040, 92818, 118348,
/* -15 */ 147320, 184698, 229616, 287308, 360437,
/* -10 */ 449829, 563644, 704093, 875809, 1099582,
/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
From sched_prio_to_weight we can work out CPU shares. If a CPU runs two processes A and B, both at nice 0, each gets 50% of the CPU. If A's nice value goes from 0 to 1, A's share becomes 820 / (1024 + 820) ≈ 0.45 while B's becomes ≈ 0.55, i.e. their shares now differ by about 10%.
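As a sanity check, a small user-space toy (not kernel code) that reproduces this arithmetic from the sched_prio_to_weight[] values:
#include <stdio.h>

int main(void)
{
	/* weights for nice 1 and nice 0, taken from sched_prio_to_weight[] */
	int w_a = 820, w_b = 1024;

	printf("A: %.1f%%  B: %.1f%%\n",
	       100.0 * w_a / (w_a + w_b),   /* ~44.5% */
	       100.0 * w_b / (w_a + w_b));  /* ~55.5% */
	return 0;
}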
So how is the time slice computed from the weights, and how does that work for cgroups? Let's look at sched_slice(), the key function that computes how much time a scheduling entity gets within one scheduling period; CFS also provides sched_vslice(), which converts that slice into virtual time and is used when placing entities (so that higher-priority jobs are easier to schedule).
Before the time-slice computation itself we need the notion of latency tracking. To keep CFS fair, the kernel guarantees that every runnable process under the top-level task_group (the root cfs_rq, i.e. all CFS jobs on the CPU) runs at least once within a certain period, preventing starvation. That period is the scheduling latency sysctl_sched_latency, tunable via /proc/sys/kernel/sched_latency_ns (6 ms by default on recent kernels, scaled up with the number of CPUs; older kernels used 20 ms). To avoid hurting throughput by switching too often, a second parameter, sched_nr_latency, limits how many runnable processes one latency period has to cover; its default is 8. If there are more runnable processes than that, the period is stretched linearly, so it is also controlled indirectly by sysctl_sched_min_granularity, tunable via /proc/sys/kernel/sched_min_granularity_ns (0.75 ms by default on recent kernels; older kernels used 4 ms): when the number of runnable jobs on the runqueue exceeds sched_nr_latency, the period becomes period = sysctl_sched_min_granularity * cfs_rq->nr_running. Each scheduling entity's time slice is then its share of the cfs_rq's total weight multiplied by the period (for cgroups the group's weight comes from cpu.shares). For example, with a 6 ms period and two runnable nice-0 tasks, each gets a 3 ms slice.
So when a task_group has a large number of runnable entities under it, the CPU time the group actually receives can end up far larger than its nominal share.
The code is as follows:
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
* s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq); // compute the latency period
for_each_sched_entity(se) { // walk up the hierarchy, scaling the slice at each level by this entity's share of its group's cfs_rq
struct load_weight *load;
struct load_weight lw;
cfs_rq = cfs_rq_of(se);
load = &cfs_rq->load;
if (unlikely(!se->on_rq)) {
lw = cfs_rq->load;
update_load_add(&lw, se->load.weight);
load = &lw;
}
slice = __calc_delta(slice, se->load.weight, load);
}
return slice;
}
/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small.
*
* p = (nr <= nl) ? l : l*nr/nl
*/
static u64 __sched_period(unsigned long nr_running)
{
if (unlikely(nr_running > sched_nr_latency))
return nr_running * sysctl_sched_min_granularity;
else
return sysctl_sched_latency;
}
/*
* delta_exec * weight / lw.weight
* OR
* (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
*
* Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
* we're guaranteed shift stays positive because inv_weight is guaranteed to
* fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
*
* Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
* weight/lw.weight <= 1, and therefore our shift will also be positive.
*/
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight); // weight >> 10
int shift = WMULT_SHIFT; // 32
__update_inv_weight(lw); // lw->inv_weight = (1<<32) / (lw->weight >> 10)
if (unlikely(fact >> 32)) {
while (fact >> 32) {
fact >>= 1;
shift--;
}
}
/* hint to use a 32x32->64 mul */
fact = (u64)(u32)fact * lw->inv_weight; // fact = weight >> 10 * (1<<32) / (lw->weight >> 10)
while (fact >> 32) {
fact >>= 1;
shift--;
}
return mul_u64_u32_shr(delta_exec, fact, shift); // (delta_exec * ((weight >> 10) * ((1<<32) / (lw->weight >> 10)))) >> 32
}
__calc_delta() looks a bit convoluted, but the end result is simply delta_exec * weight / lw.weight, computed as (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT, where lw->inv_weight = 2^WMULT_SHIFT / lw->weight. The detour replaces a 64-bit division with a multiplication by the precomputed inverse, and the shift adjustments keep the intermediate product from overflowing. From the code above, then, the time slice is determined by two things: the latency period and the entity's weight share at every level of the group hierarchy.
The time slice is mainly used to decide whether the current task needs to be preempted:
/*
* Preempt the current task with a newly woken task if needed:
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
struct sched_entity *se;
s64 delta;
ideal_runtime = sched_slice(cfs_rq, curr); // the physical time slice
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; // physical time the task has run in this slice
if (delta_exec > ideal_runtime) { // ran longer than its slice: preempt it
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
*/
clear_buddies(cfs_rq, curr);
return;
}
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
if (delta_exec < sysctl_sched_min_granularity) // ran for less than the minimum granularity: don't preempt yet
return;
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
if (delta < 0)
return;
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
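resched_curr() used above is what actually raises the preemption request. A simplified sketch of what it does for the local runqueue (for a remote rq the real code instead sends a rescheduling IPI; details differ between kernel versions):
/* simplified: ask for a reschedule of the task running on this rq */
static void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;

	if (test_tsk_need_resched(curr))	/* already marked */
		return;

	set_tsk_need_resched(curr);		/* sets TIF_NEED_RESCHED */
	set_preempt_need_resched();		/* folded into the preempt count for fast checks */
}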
CFS virtual runtime
As mentioned earlier, CFS keeps jobs fair by maintaining a red-black tree and always picking the leftmost leaf as the next scheduling entity to run; the tree is keyed by the entity's virtual runtime, vruntime. So how does CFS update this virtual time so that priorities stay fair, so that a job returning from I/O keeps its relative standing, and so that jobs of different priorities are treated fairly relative to one another?
update_curr() was mentioned above: it is the function that updates these statistics on every tick, and its most important job is updating the running entity's virtual runtime. It is called not only from the periodic tick but also on every enqueue and dequeue.
The code is as follows:
/*
* Update the current task's runtime statistics.
*/
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr; // the currently running entity on this cfs_rq
u64 now = rq_clock_task(rq_of(cfs_rq)); // current task clock of the rq
u64 delta_exec;
if (unlikely(!curr))
return;
delta_exec = now - curr->exec_start; // physical time elapsed since the last update
if (unlikely((s64)delta_exec <= 0))
return;
curr->exec_start = now;
schedstat_set(curr->statistics.exec_max,
max(delta_exec, curr->statistics.exec_max)); // scheduling statistics (visible via schedstats/ps), not discussed here
curr->sum_exec_runtime += delta_exec; // accumulate physical runtime, used by the preemption check
schedstat_add(cfs_rq->exec_clock, delta_exec); // statistics
curr->vruntime += calc_delta_fair(delta_exec, curr); // weight-scaled virtual time: delta * NICE_0_LOAD / se->load.weight
update_min_vruntime(cfs_rq); // update this queue's minimum virtual runtime
if (entity_is_task(curr)) { // per-task accounting
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec); // CFS_BANDWIDTH accounting
}
/*
* delta /= w
*/
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
return delta;
}
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
u64 vruntime = cfs_rq->min_vruntime;
if (curr) {
if (curr->on_rq)
vruntime = curr->vruntime;
else
curr = NULL;
}
if (leftmost) { /* non-empty tree */
struct sched_entity *se;
se = rb_entry(leftmost, struct sched_entity, run_node);
if (!curr)
vruntime = se->vruntime;
else
vruntime = min_vruntime(vruntime, se->vruntime);
}
/* ensure we never gain time by being placed backwards. */
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}
The two key steps in the code above are calc_delta_fair() and update_min_vruntime().
calc_delta_fair() scales the elapsed time by the entity's weight: the larger the weight, the more slowly vruntime grows. Since the red-black tree is ordered by vruntime, higher-priority (heavier) jobs returning from I/O tend to end up nearer the front of the tree. There is another problem, though: if a job returning from I/O were put straight back on the queue, its vruntime could be far smaller than that of cfs_rq->curr, and it would then sit at the front of the tree for a long stretch. For this reason cfs_rq maintains a monotonically increasing min_vruntime, and a newly enqueued job has its vruntime placed relative to it so that it cannot undercut the running job by an arbitrary amount; update_min_vruntime() is what keeps min_vruntime monotonically increasing.
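The re-placement relative to min_vruntime happens in place_entity(), called from enqueue_entity() below when the entity is being woken. In kernels of this vintage it looks roughly like the following (the scheduler features referenced may differ between versions):
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/* a newly forked task starts one vslice behind (START_DEBIT) */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/* a waking sleeper gets a bounded bonus of at most half a latency period */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* never move an entity backwards in virtual time */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}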
CFS scheduler queue operations
The scheduler-class functions were introduced earlier; enqueueing and dequeueing jobs is part of the core work of a scheduler class. enqueue_entity() and dequeue_entity() handle a job becoming runnable or leaving the runnable state. The things these paths care about most are the entity's vruntime and the overall load weight of the runqueue (needed to compute the time slices of the entities that are on_rq).
Enqueue
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
bool curr = cfs_rq->curr == se;
/*
* If we're the current task, we must renormalise before calling
* update_curr().
*/
if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;
update_curr(cfs_rq);
/*
* Otherwise, renormalise after, such that we're placed at the current
* moment in time, instead of some random moment in the past. Being
* placed in the past could significantly boost this task to the
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
/*
* When enqueuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Add its load to cfs_rq->runnable_avg
* - For group_entity, update its weight to reflect the new share of
* its group cfs_rq
* - Add its new weight to cfs_rq->load.weight
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH); // update the entity's and the queue's load averages
update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se); // add the entity's weight to cfs_rq->load
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0); // waking up: re-place its vruntime relative to min_vruntime
check_schedstat_required();
update_stats_enqueue(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
__enqueue_entity(cfs_rq, se); // insert into the red-black tree
se->on_rq = 1;
if (cfs_rq->nr_running == 1) {
list_add_leaf_cfs_rq(cfs_rq);
check_enqueue_throttle(cfs_rq);
}
}
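The tree insertion itself, __enqueue_entity(), is an ordinary rbtree insert keyed by vruntime; roughly, for kernels that use rb_root_cached, it looks like this:
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	bool leftmost = true;

	/* walk down the tree, ordering by vruntime */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;	/* no longer the smallest vruntime */
		}
	}

	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}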
Dequeue
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
/*
* When dequeuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Subtract its load from the cfs_rq->runnable_avg.
* - Subtract its previous weight from cfs_rq->load.weight.
* - For group entity, update its weight to reflect the new share
* of its group cfs_rq.
*/
update_load_avg(cfs_rq, se, UPDATE_TG); // update load averages before the entity leaves
dequeue_runnable_load_avg(cfs_rq, se);
update_stats_dequeue(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
account_entity_dequeue(cfs_rq, se);
/*
* Normalize after update_curr(); which will also have moved
* min_vruntime if @se is the one holding it back. But before doing
* update_min_vruntime() again, which will discount @se's position and
* can move min_vruntime forward still more.
*/
if (!(flags & DEQUEUE_SLEEP))
se->vruntime -= cfs_rq->min_vruntime;
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
update_cfs_group(se);
/*
* Now advance min_vruntime if @se was the entity holding it back,
* except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
* put back on, and if we advance min_vruntime, we'll be placed back
* further than we started -- ie. we'll be penalized.
*/
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
}
CFS BANDWIDTH CONTROL
Anyone who has played with Docker a bit knows that you can cap the CPU usage of the processes in a cgroup by setting cpu.cfs_period_us and cpu.cfs_quota_us. So how does the CPU do this throttling?
Before looking at how CPU bandwidth control works, here are the relevant data structures.
/* Task group related information */
struct task_group {
struct cgroup_subsys_state css;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
/* runqueue "owned" by this group on each CPU */
struct cfs_rq **cfs_rq;
unsigned long shares;
#ifdef CONFIG_SMP
/*
* load_avg can be heavily contended at clock tick time, so put
* it in its own cacheline separated from the fields above which
* will also be accessed at each tick.
*/
atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif
...
struct cfs_bandwidth cfs_bandwidth; // one per task_group, shared by that group's cfs_rq on every CPU
};
struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
raw_spinlock_t lock;
ktime_t period; // refill period
u64 quota; // physical CPU time available per period
u64 runtime; // physical CPU time remaining in the current period
s64 hierarchical_quota;
u64 runtime_expires;
int expires_seq;
short idle;
short period_active; // whether the period timer is currently running
struct hrtimer period_timer; // period refill timer
struct hrtimer slack_timer;
struct list_head throttled_cfs_rq; // list of throttled cfs_rqs
/* Statistics: */
int nr_periods; // number of periods
int nr_throttled; // number of times throttled
u64 throttled_time; // total time spent throttled
bool distribute_running;
#endif
};
The basic idea of CPU bandwidth control is as follows:
Each per-CPU cfs_rq of a task_group holds the first-level scheduling entities of that group on that CPU: they either run on this CPU themselves or their child entities do. All of these per-CPU cfs_rqs share the global task_group, and therefore share the cfs_bandwidth structure inside it. Each cfs_rq maintains a runtime_remaining variable, the CPU time the group may still use on this CPU; it decreases as time is consumed. When runtime_remaining drops to zero or below, the cfs_rq requests another sched_cfs_bandwidth_slice worth of time from the global cfs_bandwidth, whose runtime decreases by the same amount. Once cfs_bandwidth->runtime reaches zero the group's quota is exhausted: all jobs in the group are throttled until the next period refills the quota and they can be scheduled again.
At the end of update_curr() above we saw a call to account_cfs_rq_runtime(); that function is where this accounting happens.
The slice size is controlled by /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
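That sysctl defaults to 5 ms; the helper that converts it to nanoseconds is roughly the following (a sketch based on kernel/sched/fair.c of this era):
/* default slice handed from the global pool to a per-CPU cfs_rq, in microseconds */
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;

static inline u64 sched_cfs_bandwidth_slice(void)
{
	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}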
The relevant code is as follows:
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
return;
__account_cfs_rq_runtime(cfs_rq, delta_exec);
}
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
expire_cfs_rq_runtime(cfs_rq); // if a new global period has started, the locally cached runtime is invalidated (set to 0)
if (likely(cfs_rq->runtime_remaining > 0))
return;
/*
* if we're unable to extend our runtime we resched so that the active
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) // try to grab more quota from the global pool
resched_curr(rq_of(cfs_rq));
}
/*
* Note: This depends on the synchronization provided by sched_clock and the
* fact that rq->clock snapshots this value.
*/
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
/* if the deadline is ahead of our clock, nothing to do */
if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
return;
if (cfs_rq->runtime_remaining < 0)
return;
/*
* If the local deadline has passed we have to consider the
* possibility that our sched_clock is 'fast' and the global deadline
* has not truly expired.
* Fortunately we can check determine whether this the case by checking
* whether the global deadline(cfs_b->expires_seq) has advanced.
*/
if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
/* global deadline is ahead, expiration has passed */
cfs_rq->runtime_remaining = 0;
}
}
/* returns 0 on failure to allocate runtime */
static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
u64 amount = 0, min_amount, expires;
int expires_seq;
/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota == RUNTIME_INF)
amount = min_amount;
else {
start_cfs_bandwidth(cfs_b);
if (cfs_b->runtime > 0) {
amount = min(cfs_b->runtime, min_amount);
cfs_b->runtime -= amount;
cfs_b->idle = 0;
}
}
expires_seq = cfs_b->expires_seq;
expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);
cfs_rq->runtime_remaining += amount;
/*
* we may have advanced our local expiration to account for allowed
* spread between our sched_clock and the one on which runtime was
* issued.
*/
if (cfs_rq->expires_seq != expires_seq) {
cfs_rq->expires_seq = expires_seq;
cfs_rq->runtime_expires = expires;
}
return cfs_rq->runtime_remaining > 0;
}
// cfs_bandwidth: timer code that periodically refills the quota
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
raw_spin_lock_init(&cfs_b->lock);
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer; // function run when the timer fires
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
cfs_b->distribute_running = 0;
}
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
u64 overrun;
lockdep_assert_held(&cfs_b->lock);
if (cfs_b->period_active)
return;
cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
struct cfs_bandwidth *cfs_b =
container_of(timer, struct cfs_bandwidth, period_timer);
int overrun;
int idle = 0;
raw_spin_lock(&cfs_b->lock);
for (;;) {
overrun = hrtimer_forward_now(timer, cfs_b->period);
if (!overrun) // no expired period left to account for: stop
break;
idle = do_sched_cfs_period_timer(cfs_b, overrun); // refresh quota every period
}
if (idle)
cfs_b->period_active = 0;
raw_spin_unlock(&cfs_b->lock);
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
u64 runtime, runtime_expires;
int throttled;
/* no need to continue the timer with no bandwidth constraint */
if (cfs_b->quota == RUNTIME_INF)
goto out_deactivate;
throttled = !list_empty(&cfs_b->throttled_cfs_rq);
cfs_b->nr_periods += overrun;
/*
* idle depends on !throttled (for the case of a large deficit), and if
* we're going inactive then everything else can be deferred
*/
if (cfs_b->idle && !throttled)
goto out_deactivate;
__refill_cfs_bandwidth_runtime(cfs_b);
if (!throttled) {
/* mark as potentially idle for the upcoming period */
cfs_b->idle = 1;
return 0;
}
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;
runtime_expires = cfs_b->runtime_expires;
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
* trying to acquire new bandwidth from the global pool. This can result
* in us over-using our runtime if it is all used during this loop, but
* only by limited amounts in that extreme case.
*/
while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
runtime = cfs_b->runtime;
cfs_b->distribute_running = 1;
raw_spin_unlock(&cfs_b->lock);
/* we can't nest cfs_b->lock while distributing bandwidth */
runtime = distribute_cfs_runtime(cfs_b, runtime,
runtime_expires);
raw_spin_lock(&cfs_b->lock);
cfs_b->distribute_running = 0;
throttled = !list_empty(&cfs_b->throttled_cfs_rq);
lsub_positive(&cfs_b->runtime, runtime);
}
/*
* While we are ensured activity in the period following an
* unthrottle, this also covers the case in which the new bandwidth is
* insufficient to cover the existing bandwidth deficit. (Forcing the
* timer to remain active while there are any throttled entities.)
*/
cfs_b->idle = 0;
return 0;
out_deactivate:
return 1;
}
/*
* Replenish runtime according to assigned quota and update expiration time.
* We use sched_clock_cpu directly instead of rq->clock to avoid adding
* additional synchronization around rq->lock.
*
* requires cfs_b->lock
*/
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
u64 now;
if (cfs_b->quota == RUNTIME_INF)
return;
now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
cfs_b->expires_seq++;
}
References: https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf