Process Scheduling

Goals

This chapter discusses how the Linux kernel schedules processes: what the process scheduler (also simply called the scheduler) does and how it is implemented.

The process scheduler decides which process runs, when it runs, and for how long. It can be viewed as the subsystem that divides the finite resource of processor time among processes. A well-designed scheduler makes fuller use of system resources.

The code in this article is based on linux-2.6.34.

Multitasking and Preemption

An operating system can run many processes concurrently. When there are more processes than processors, at any given moment some processes are not actually executing on a CPU. Some of them are runnable but have not been given a CPU; others are blocked or sleeping in the kernel, waiting for an I/O event to occur.

In a multitasking system, to prevent some processes from monopolizing resources and starving the others, and to let more urgent processes run immediately, the scheduler must be able to forcibly suspend a running process so that other waiting processes can run. This is called preemption.

Scheduling Policy

I/O-Bound and CPU-Bound Processes

Processes can be classified as I/O-bound or CPU-bound:

  • I/O-bound: spends most of its time issuing I/O requests or waiting for them
  • CPU-bound: spends most of its time executing code, with little I/O demand

I/O-bound processes usually need little CPU time but expect fast I/O response, whereas CPU-bound processes want to hold the CPU for longer stretches.

Process Priority

Linux uses two separate priority ranges (see the sketch after this list):

  • nice value: ranges from -20 to +19, with a default of 0; the larger the nice value, the lower the priority.
  • real-time priority: ranges from 0 to 99; the larger the value, the higher the priority.
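
Both ranges are visible from user space through standard APIs. As a minimal user-space sketch (illustrative only; lowering the nice value below 0 normally requires CAP_SYS_NICE):

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    //a process inherits its nice value from its parent (0 by default)
    printf("initial nice: %d\n", getpriority(PRIO_PROCESS, 0));

    //raise the nice value by 5: the process now has a LOWER priority
    errno = 0;
    if (nice(5) == -1 && errno != 0)
        perror("nice");
    printf("after nice(5): %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}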

Timeslice

The timeslice is the amount of time a process is allowed to run before it is preempted.

A timeslice that is too long hurts interactive responsiveness (I/O-bound processes); one that is too short increases the CPU overhead of process switching.

Scheduling Algorithms

Scheduler Classes

The Linux kernel organizes its scheduling algorithms in a modular way, so that several algorithms can coexist, by abstracting their common operations into a shared data structure called a scheduler class. Each scheduler class schedules the processes that belong to it.

The scheduler class structure struct sched_class is defined in <linux/sched.h>.

struct sched_class {
    const struct sched_class *next; //all scheduler classes are linked in a list; the one at the head has the highest priority

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup,
                  bool head);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

    void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
    void (*post_schedule) (struct rq *this_rq);
    void (*task_waking) (struct rq *this_rq, struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);

    void (*switched_from) (struct rq *this_rq, struct task_struct *task,
                   int running);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task,
                 int running);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio, int running);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*moved_group) (struct task_struct *p, int on_rq);
#endif
};

Every process descriptor contains a pointer to a scheduler class (const struct sched_class *sched_class), indicating which scheduler class the process belongs to.

struct sched_class declares the common operations that a concrete scheduler class instance must implement (a dispatch sketch follows the list):

  • enqueue_task: adds a process to the runqueue; called when a process changes from sleeping to runnable.
  • dequeue_task: removes a process from the runqueue; called when a process changes from runnable to a non-runnable state.
  • yield_task: called when a process voluntarily gives up the CPU via the sched_yield system call.
  • check_preempt_curr: checks whether a newly woken process should preempt the currently running one; for example, when a new process is forked, wake_up_new_task() wakes the new child and this operation is invoked.
  • pick_next_task: selects the next process to run.
  • put_prev_task: called just before the currently running process is replaced by another; it lets the class do bookkeeping for the outgoing process (CFS, for instance, puts it back into the red-black tree).
  • set_curr_task: called when the scheduling policy of a process changes.
  • task_tick: called by the periodic scheduler each time it is activated.
  • task_fork: called after a new process is forked, to notify the scheduler.
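
The core scheduler never calls a concrete scheduler directly; every operation is dispatched through the class pointer. A simplified sketch of this indirection (modeled on, but not copied verbatim from, the helpers in kernel/sched.c):

//simplified sketch: the core scheduler dispatches through p->sched_class;
//the real helpers in kernel/sched.c also update the rq clock and statistics
static void enqueue_task(struct rq *rq, struct task_struct *p,
              int wakeup, bool head)
{
    p->sched_class->enqueue_task(rq, p, wakeup, head);
    p->se.on_rq = 1;
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
    p->sched_class->dequeue_task(rq, p, sleep);
    p->se.on_rq = 0;
}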

Scheduling Entities

The objects that a scheduler class schedules are abstracted as scheduling entities. The scheduling entity structure keeps the statistics relevant to scheduling.
It is defined in <linux/sched.h>.

/*
 * CFS stats for a schedulable entity (task, task-group etc)
 *
 * Current field usage histogram:
 *
 *     4 se->block_start
 *     4 se->run_node
 *     4 se->sleep_start
 *     6 se->load.weight
 */
struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq; //1 - on the runqueue, 0 - not on the runqueue

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         last_wakeup;
    u64         avg_overlap;

    u64         nr_migrations;

    u64         start_runtime;
    u64         avg_wakeup;

#ifdef CONFIG_SCHEDSTATS
    u64         wait_start;
    u64         wait_max;
    u64         wait_count;
    u64         wait_sum;
    u64         iowait_count;
    u64         iowait_sum;

    u64         sleep_start;
    u64         sleep_max;
    s64         sum_sleep_runtime;

    u64         block_start;
    u64         block_max;
    u64         exec_max;
    u64         slice_max;

    u64         nr_migrations_cold;
    u64         nr_failed_migrations_affine;
    u64         nr_failed_migrations_running;
    u64         nr_failed_migrations_hot;
    u64         nr_forced_migrations;

    u64         nr_wakeups;
    u64         nr_wakeups_sync;
    u64         nr_wakeups_migrate;
    u64         nr_wakeups_local;
    u64         nr_wakeups_remote;
    u64         nr_wakeups_affine;
    u64         nr_wakeups_affine_attempts;
    u64         nr_wakeups_passive;
    u64         nr_wakeups_idle;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};

The process descriptor embeds a struct sched_entity member named se, which is what makes a process a schedulable entity.
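
The relevant members, abridged from struct task_struct in <linux/sched.h> (unrelated fields omitted):

struct task_struct {
    //......
    int prio, static_prio, normal_prio; //dynamic, static and normal priority
    unsigned int rt_priority; //real-time priority (0-99)
    const struct sched_class *sched_class; //scheduler class of this task
    struct sched_entity se; //CFS scheduling entity
    struct sched_rt_entity rt; //real-time scheduling entity
    //......
    unsigned int policy; //SCHED_NORMAL, SCHED_FIFO, ...
    //......
};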

Runqueues

The runqueue holds all active processes. Its data structure struct rq is defined in kernel/sched.c.

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned long nr_running; //total number of runnable processes on this queue
    #define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX]; //used to track the load history
#ifdef CONFIG_NO_HZ
    unsigned char in_nohz_recently;
#endif
    /* capture load from *all* tasks on this cpu: */
    struct load_weight load; //load information
    unsigned long nr_load_updates;
    u64 nr_switches;

    struct cfs_rq cfs; //runqueue of the CFS scheduler class
    struct rt_rq rt; //runqueue of the real-time scheduler class

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* list of leaf cfs_rq on this cpu: */
    struct list_head leaf_cfs_rq_list;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    struct list_head leaf_rt_rq_list;
#endif

    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    struct task_struct *curr, *idle; //curr - the currently running process, idle - the idle process
    unsigned long next_balance;
    struct mm_struct *prev_mm;

    u64 clock; //runqueue clock, updated regularly by the periodic scheduler

    atomic_t nr_iowait;

#ifdef CONFIG_SMP
    struct root_domain *rd;
    struct sched_domain *sd;

    unsigned char idle_at_tick;
    /* For active balancing */
    int post_schedule;
    int active_balance;
    int push_cpu;
    /* cpu of this runqueue: */
    int cpu;
    int online;

    unsigned long avg_load_per_task;

    struct task_struct *migration_thread;
    struct list_head migration_queue;

    u64 rt_avg;
    u64 age_stamp;
    u64 idle_stamp;
    u64 avg_idle;
#endif

    /* calc_load related fields */
    unsigned long calc_load_update;
    long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
    int hrtick_csd_pending;
    struct call_single_data hrtick_csd;
#endif
    struct hrtimer hrtick_timer;
#endif

#ifdef CONFIG_SCHEDSTATS
    /* latency stats */
    struct sched_info rq_sched_info;
    unsigned long long rq_cpu_time;
    /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

    /* sys_sched_yield() stats */
    unsigned int yld_count;

    /* schedule() stats */
    unsigned int sched_switch;
    unsigned int sched_count;
    unsigned int sched_goidle;

    /* try_to_wake_up() stats */
    unsigned int ttwu_count;
    unsigned int ttwu_local;

    /* BKL stats */
    unsigned int bkl_count;
#endif
};

Each CPU has its own runqueue, and each active process appears on exactly one runqueue. The runqueue does not manage processes directly; that is done by the class-specific runqueues embedded in it, namely struct cfs_rq cfs and struct rt_rq rt.

All of the kernel's runqueues are held in the per-CPU variable runqueues, with one instance for each CPU in the system. It is defined in kernel/sched.c, together with some accessor macros (a usage sketch follows them).

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

//accessor macros for runqueues
#define cpu_rq(cpu)     (&per_cpu(runqueues, (cpu)))
#define this_rq()       (&__get_cpu_var(runqueues))
#define task_rq(p)      cpu_rq(task_cpu(p))
#define cpu_curr(cpu)       (cpu_rq(cpu)->curr)
#define raw_rq()        (&__raw_get_cpu_var(runqueues))
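
As a usage sketch, kernel code can walk every runqueue with these accessors, which is essentially what helpers such as nr_running() in kernel/sched.c do (locking omitted for brevity):

//illustrative sketch: sum the runnable tasks of all online CPUs
static unsigned long total_nr_running(void)
{
    unsigned long sum = 0;
    int i;

    for_each_online_cpu(i)
        sum += cpu_rq(i)->nr_running;

    return sum;
}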

The Completely Fair Scheduling Class (CFS)

The CFS scheduler class is defined in kernel/sched_fair.c.

/*
 * All the scheduling class methods:
 */
static const struct sched_class fair_sched_class = {
    .next           = &idle_sched_class,
    .enqueue_task       = enqueue_task_fair,
    .dequeue_task       = dequeue_task_fair,
    .yield_task     = yield_task_fair,

    .check_preempt_curr = check_preempt_wakeup,

    .pick_next_task     = pick_next_task_fair,
    .put_prev_task      = put_prev_task_fair,

#ifdef CONFIG_SMP
    .select_task_rq     = select_task_rq_fair,

    .rq_online      = rq_online_fair,
    .rq_offline     = rq_offline_fair,

    .task_waking        = task_waking_fair,
#endif

    .set_curr_task          = set_curr_task_fair,
    .task_tick      = task_tick_fair,
    .task_fork      = task_fork_fair,

    .prio_changed       = prio_changed_fair,
    .switched_to        = switched_to_fair,

    .get_rr_interval    = get_rr_interval_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
    .moved_group        = moved_group_fair,
#endif
};

The CFS Runqueue

The CFS runqueue structure struct cfs_rq is embedded in struct rq for use by the CFS scheduling algorithm. It is defined in kernel/sched.c.

/* CFS-related fields in a runqueue */
struct cfs_rq {
    struct load_weight load; //load information
    unsigned long nr_running; //number of runnable processes on this queue

    u64 exec_clock;
    u64 min_vruntime; //smallest virtual runtime of all processes; increases monotonically

    struct rb_root tasks_timeline; //red-black tree ordered by virtual runtime
    struct rb_node *rb_leftmost; //cached leftmost node of the tree (the entity most in need of scheduling)

    struct list_head tasks;
    struct list_head *balance_iterator;

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last;//curr - the scheduling entity currently executing

    unsigned int nr_spread_over;

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */

    /*
     * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
     * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
     * (like users, containers etc.)
     *
     * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
     * list is used during load balance.
     */
    struct list_head leaf_cfs_rq_list;
    struct task_group *tg;  /* group that "owns" this runqueue */

#ifdef CONFIG_SMP
    /*
     * the part of load.weight contributed by tasks
     */
    unsigned long task_weight;

    /*
     *   h_load = weight * f(tg)
     *
     * Where f(tg) is the recursive weight fraction assigned to
     * this group.
     */
    unsigned long h_load;

    /*
     * this cpu's part of tg->shares
     */
    unsigned long shares;

    /*
     * load.weight at the time we set shares
     */
    unsigned long rq_weight;
#endif
#endif
};

Virtual Runtime

CFS scheduling relies on virtual runtime (vruntime), which is used to measure the CPU time a process gets. All vruntime-related accounting is performed in update_curr(), which is called from many different places in the kernel, including the periodic scheduler.

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock;
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    //compute how long the current process has run since exec_start was last updated
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);//account the current process's runtime and vruntime
    curr->exec_start = now;

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
 */
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec;//accumulate the total runtime
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);//compute the weighted virtual-time increment

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);//update the cfs_rq's min_vruntime
}

The main job of update_curr() is to update the current process's accumulated physical runtime and virtual runtime, as well as the cfs_rq's min_vruntime.

The weighted virtual-time increment delta_exec_weighted is computed as:

delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)

The kernel defines a conversion table between priority and weight: the higher the priority (the lower the nice value), the larger the weight, so the increment accumulated on each update is smaller and the process's vruntime grows more slowly.

Processes whose vruntime grows quickly drift toward the right of the red-black tree, while those whose vruntime grows slowly drift toward the left. This ensures that higher-priority processes are scheduled preferentially.
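
A rough user-space illustration of this effect, using three weights taken from the kernel's prio_to_weight table (the kernel itself does this computation with fixed-point arithmetic in calc_delta_fair()):

#include <stdio.h>

#define NICE_0_LOAD 1024 //weight of a nice-0 process

//weights for nice -5, 0 and +5, from the kernel's prio_to_weight table
static const struct { int nice; unsigned long weight; } w[] = {
    { -5, 3121 }, { 0, 1024 }, { +5, 335 },
};

int main(void)
{
    unsigned long long delta_exec = 10000000ULL; //10ms of real runtime, in ns
    int i;

    for (i = 0; i < 3; i++) {
        //delta_exec_weighted = delta_exec * (NICE_0_LOAD / weight)
        unsigned long long dw = delta_exec * NICE_0_LOAD / w[i].weight;
        printf("nice %+d: vruntime += %llu ns\n", w[i].nice, dw);
    }
    //nice -5 gains ~3.3ms of vruntime, nice 0 exactly 10ms, nice +5 ~30.6ms:
    //the higher the priority, the more slowly vruntime grows
    return 0;
}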

Enqueue

When a process becomes runnable (is woken up), or when a new process is created with fork(), CFS enqueues it into the red-black tree and caches the leftmost leaf node.

enqueue_task_fair() is the entry point of the CFS enqueue operation; it calls enqueue_entity() to do the actual enqueueing.

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, bool head)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;
    int flags = 0;

    if (wakeup)
        flags |= ENQUEUE_WAKEUP;
    if (p->state == TASK_WAKING)
        flags |= ENQUEUE_MIGRATE;

    for_each_sched_entity(se) {
        if (se->on_rq) //already on the runqueue, no need to enqueue again
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, flags);
        flags = ENQUEUE_WAKEUP;
    }

    hrtick_update(rq);
}

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);//update the current process's runtime and vruntime
    //update enqueue statistics and state,
    //e.g. add to cfs_rq->load, cfs_rq->nr_running++, se->on_rq = 1
    account_entity_enqueue(cfs_rq, se);

    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)//if the entity being enqueued is not the one currently executing
        __enqueue_entity(cfs_rq, se);//insert the entity into the rbtree
}

enqueue_entity() updates the vruntime of the entity being enqueued and some other statistics, and then calls __enqueue_entity() to insert the entity into the rbtree.

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);//key = se->vruntime - cfs_rq->min_vruntime
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    //no existing node has a smaller key than the newly added one
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;//cache the new node, since its key is the smallest

    rb_link_node(&se->run_node, parent, link);//link the entity's node into the red-black tree
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);//recolor and rebalance the tree after the insertion
}

As the code above shows, the rbtree is organized so that a left node's key is smaller than a right node's key, and the key is essentially the entity's vruntime (offset by min_vruntime). The smaller the vruntime, the further left the entity sits in the rbtree, and rb_leftmost caches the leaf node with the smallest vruntime.
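
Note that entity_key() forms the key as a signed difference against min_vruntime. A standalone sketch of why: the signed difference keeps the ordering correct even if the unsigned 64-bit vruntime counters eventually wrap around:

#include <stdio.h>

typedef unsigned long long u64;
typedef long long s64;

//same idea as entity_key(): order vruntimes by a signed difference,
//so the comparison survives u64 wrap-around
static int runs_before(u64 a, u64 b)
{
    return (s64)(a - b) < 0;
}

int main(void)
{
    u64 min = ~0ULL - 100; //min_vruntime close to the u64 wrap point
    u64 v1 = min + 50;     //not yet wrapped
    u64 v2 = min + 200;    //already wrapped past zero

    printf("naive:  v1 < v2 ? %d\n", v1 < v2);                  //prints 0 (wrong)
    printf("signed: v1 before v2 ? %d\n", runs_before(v1, v2)); //prints 1 (right)
    return 0;
}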

Dequeue

When a process stops being runnable (blocks) or terminates, CFS removes it from the red-black tree.

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        dequeue_entity(cfs_rq, se, sleep);//remove the entity from the runqueue
        /* Don't dequeue parent if it has other entities besides us */
        if (cfs_rq->load.weight)
            break;
        sleep = 1;
    }

    hrtick_update(rq);
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    update_stats_dequeue(cfs_rq, se);
    if (sleep) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);

            if (tsk->state & TASK_INTERRUPTIBLE)
                se->sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }

    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    //update dequeue statistics and state,
    //e.g. subtract from cfs_rq->load, cfs_rq->nr_running--, se->on_rq = 0
    account_entity_dequeue(cfs_rq, se);
    update_min_vruntime(cfs_rq);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!sleep)
        se->vruntime -= cfs_rq->min_vruntime;
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    //if the node being removed happens to be the cached leftmost node, a new leftmost must be found
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node; //update rb_leftmost
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);//remove the node from the rbtree
}

Picking the Next Process

The next process CFS picks to run is the one with the smallest vruntime of all, i.e. the leftmost leaf node that CFS keeps cached.

The currently executing process is not kept in the rbtree, so it has to be removed from the tree and is tracked through the cfs_rq->curr member instead.

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    //if there is no runnable process on the queue, return immediately
    if (!cfs_rq->nr_running)
        return NULL;

    do {
        se = pick_next_entity(cfs_rq);//pick the next entity to run
        set_next_entity(cfs_rq, se);//remove the entity from the rbtree and record its execution start time
        cfs_rq = group_cfs_rq(se);
    } while (cfs_rq);

    p = task_of(se);
    hrtick_start_fair(rq, p);

    return p;
}

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_next_entity(cfs_rq);
    struct sched_entity *left = se;

    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    clear_buddies(cfs_rq, se);

    return se;
}

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;//the cached leftmost node of the red-black tree

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);//the scheduling entity containing this node
}

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* 'current' is not kept within the tree. */
    if (se->on_rq) {
        /*
         * Any task has to be enqueued before it get to execute on
         * a CPU. So account for the time it spent waiting on the
         * runqueue.
         */
        update_stats_wait_end(cfs_rq, se);
        __dequeue_entity(cfs_rq, se);//remove the scheduling entity from the runqueue
    }

    update_stats_curr_start(cfs_rq, se);//record when the process starts executing
    cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
    /*
     * Track our maximum slice length, if the CPU's load is at
     * least twice that of our own weight (i.e. dont track it
     * when there are only lesser-weight tasks around):
     */
    if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
        se->slice_max = max(se->slice_max,
            se->sum_exec_runtime - se->prev_sum_exec_runtime);
    }
#endif
    se->prev_sum_exec_runtime = se->sum_exec_runtime;//save the total runtime; it is later used to compute how long the process has run on the CPU, to decide whether it has exceeded its ideal slice and must be rescheduled
}

Periodic Scheduling

The periodic scheduler regularly invokes the task_tick method of the specific scheduler class to update the current process's runtime accounting and to check whether the process has run longer than its ideal slice. If it has, a rescheduling request is raised (the TIF_NEED_RESCHED flag is set), and the core scheduler initiates a switch at the next suitable opportunity.
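
The tick-side entry point is scheduler_tick(), which the timer interrupt invokes once per tick; abridged from kernel/sched.c, with the SMP load-balancing trigger omitted:

void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;

    raw_spin_lock(&rq->lock);
    update_rq_clock(rq); //refresh rq->clock
    curr->sched_class->task_tick(rq, curr, 0); //e.g. task_tick_fair()
    raw_spin_unlock(&rq->lock);

    //......(SMP load-balancing trigger omitted)
}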

CFS's task_tick method is task_tick_fair():

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);//check how long the current process has run and decide whether it should be preempted
}

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;

    ideal_runtime = sched_slice(cfs_rq, curr);//compute the ideal runtime of the current process
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;//how long the process has actually run
    //if the process has run longer than its ideal slice, request a reschedule
    if (delta_exec > ideal_runtime) {
        resched_task(rq_of(cfs_rq)->curr);//request a reschedule
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (!sched_feat(WAKEUP_PREEMPT))
        return;

    if (delta_exec < sysctl_sched_min_granularity)
        return;

    if (cfs_rq->nr_running > 1) {
        struct sched_entity *se = __pick_next_entity(cfs_rq);
        s64 delta = curr->vruntime - se->vruntime;

        if (delta > ideal_runtime)
            resched_task(rq_of(cfs_rq)->curr);
    }
}

static void resched_task(struct task_struct *p)
{
    int cpu;

    assert_raw_spin_locked(&task_rq(p)->lock);

    if (test_tsk_need_resched(p))
        return;

    set_tsk_need_resched(p);//set the rescheduling flag TIF_NEED_RESCHED

    cpu = task_cpu(p);
    if (cpu == smp_processor_id())
        return;

    /* NEED_RESCHED must be visible before we test polling */
    smp_mb();
    if (!tsk_is_polling(p))
        smp_send_reschedule(cpu);
}

Wakeup Preemption

When try_to_wake_up() or wake_up_new_task() is called to wake a process, the kernel invokes the scheduler class's check_preempt_curr method to check whether the newly woken process may preempt the currently running one.

CFS performs this preemption check in check_preempt_wakeup().

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se;
    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
    int sync = wake_flags & WF_SYNC;
    int scale = cfs_rq->nr_running >= sched_nr_latency;

    //if the newly woken process is a real-time process, request a reschedule immediately
    if (unlikely(rt_prio(p->prio)))
        goto preempt;

    if (unlikely(p->sched_class != &fair_sched_class))
        return;

    if (unlikely(se == pse))
        return;

    if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
        set_next_buddy(pse);

    /*
     * We can come here with TIF_NEED_RESCHED already set from new task
     * wake up path.
     */
    //if TIF_NEED_RESCHED is already set, there is no need to set it again
    if (test_tsk_need_resched(curr))
        return;

    /*
     * Batch and idle tasks do not preempt (their preemption is driven by
     * the tick):
     */
    if (unlikely(p->policy != SCHED_NORMAL))
        return;

    /* Idle tasks are by definition preempted by everybody. */
    //if the currently running process has the idle policy, request a reschedule directly
    if (unlikely(curr->policy == SCHED_IDLE))
        goto preempt;

    if (sched_feat(WAKEUP_SYNC) && sync)
        goto preempt;

    if (sched_feat(WAKEUP_OVERLAP) &&
            se->avg_overlap < sysctl_sched_migration_cost &&
            pse->avg_overlap < sysctl_sched_migration_cost)
        goto preempt;

    if (!sched_feat(WAKEUP_PREEMPT))
        return;

    update_curr(cfs_rq);
    find_matching_se(&se, &pse);
    BUG_ON(!pse);
    //request a reschedule if se->vruntime > pse->vruntime + gran,
    //where gran is a minimum granularity that keeps processes from switching so often that performance suffers
    if (wakeup_preempt_entity(se, pse) == 1)
        goto preempt;

    return;

preempt:
    resched_task(curr);
    /*
     * Only set the backward buddy when the current task is still
     * on the rq. This can happen when a wakeup gets interleaved
     * with schedule on the ->pre_schedule() or idle_balance()
     * point, either of which can * drop the rq lock.
     *
     * Also, during early boot the idle thread is in the fair class,
     * for obvious reasons its a bad idea to schedule back to it.
     */
    if (unlikely(!se->on_rq || curr == rq->idle))
        return;

    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
        set_last_buddy(se);
}

The Scheduling Entry Point

The schedule() Function

The interface the process scheduler exposes to the rest of the kernel is the schedule() function, defined in kernel/sched.c.
As the generic scheduling entry point, schedule() hides the details of the specific scheduler classes. It delegates to them through pick_next_task() to obtain the highest-priority process, and then calls context_switch() to perform the context switch.

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_sched_qs(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;

    release_kernel_lock(prev);
need_resched_nonpreemptible:

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock);
    update_rq_clock(rq);
    clear_tsk_need_resched(prev); //clear the TIF_NEED_RESCHED flag
    //if the current process is not runnable (TASK_RUNNING is 0) and was not preempted
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        //if a signal is pending (unhandled), make the process runnable again
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);//stop the process and remove it from the runqueue
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    put_prev_task(rq, prev); //tell the scheduler class that the running process is about to be replaced
    next = pick_next_task(rq);//pick the highest-priority process (task)

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        //perform the context switch
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    if (unlikely(reacquire_kernel_lock(current) < 0)) {
        prev = rq->curr;
        switch_count = &prev->nivcsw;
        goto need_resched_nonpreemptible;
    }

    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}
EXPORT_SYMBOL(schedule);

pick_next_task() walks the list of scheduler classes in order until it finds the highest-priority runnable process.

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    //if all runnable tasks on the queue belong to the CFS class, call its method directly
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }
    //walk the scheduler class list until the highest-priority process is found
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}

schedule() is generally invoked in the following situations (a blocking sketch follows the list):

  • Before returning to user space from a system call, when the rescheduling flag TIF_NEED_RESCHED is found to be set
  • Before returning to user space from an interrupt handler
  • On exit from an interrupt handler, before returning to kernel space
  • When kernel code becomes preemptible again
  • When a process explicitly calls schedule() in the kernel
  • When a process blocks in the kernel
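
The last two cases follow the classic blocking idiom in kernel code. A minimal sketch, in which event_ready stands for whatever condition the caller waits on (a real driver would use a wait queue):

//mark the task as sleeping BEFORE testing the condition, so a wakeup
//arriving in between is not lost; schedule() then deactivates the task
static void wait_for_event(void)
{
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (event_ready) //hypothetical condition, set by the waker
            break;
        schedule(); //sleep until woken by try_to_wake_up()
    }
    __set_current_state(TASK_RUNNING);
}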

Context Switching

A context switch switches execution from one runnable process to another; it involves switching the virtual memory mapping, the register state, and related information. It is performed by context_switch(), defined in kernel/sched.c.

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next)
{
    //......
    switch_mm(oldmm, mm, next);

    //......
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    //......
}

switch_mm(), declared in <asm/mmu_context.h>, switches from the previous process's virtual memory mapping to the new process's.

switch_to(), declared in <asm/system.h>, switches the processor state: it saves and restores the stack information, the processor registers, and any architecture-specific state.
