In spark源码阅读之scheduler模块①, we analyzed how DAGScheduler submits a Job, splits it into stages, and hands them to the TaskScheduler, finally calling TaskScheduler's submitTasks method. This post picks up from there and looks at what the TaskScheduler does after the DAGScheduler.
Submitting Tasks
TaskSchedulerImpl, the only implementation of TaskScheduler, receives the packaged TaskSet through its submitTasks method. The method takes the tasks out of the TaskSet, updates some of TaskSchedulerImpl's bookkeeping structures, and finally calls the backend's reviveOffers method to allocate resources. The source is as follows:
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks // take the tasks out of the incoming TaskSet
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets: mutable.HashMap[Int, TaskSetManager] =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
val conflictingTaskSet: Boolean = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
if (!isLocal && !hasReceivedTask) {
starvationTimer.scheduleAtFixedRate(new TimerTask() {
override def run() {
if (!hasLaunchedTask) {
logWarning("Initial job has not accepted any resources; " +
"check your cluster UI to ensure that workers are registered " +
"and have sufficient resources")
} else {
this.cancel()
}
}
}, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
}
hasReceivedTask = true
}
backend.reviveOffers() // call SchedulerBackend's reviveOffers to request resources
}
Allocating Resources
Calling SchedulerBackend's reviveOffers actually resolves to the reviveOffers of its subclass CoarseGrainedSchedulerBackend, whose body is a single statement:
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
It simply sends a ReviveOffers message to driverEndpoint.
When DriverEndpoint receives the ReviveOffers message, its handler is equally terse:
case ReviveOffers =>
makeOffers()
It calls makeOffers. We already met this method in the executor-module analysis: it selects suitable executors, schedules resources on them, and finally launches the Tasks. Here we focus on the resource-scheduling part; to see how Tasks are launched on the executors, jump to spark源码阅读之executor模块③.
Here is the source of makeOffers:
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map {
case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
scheduler.resourceOffers(workOffers)
}
if (!taskDescs.isEmpty) {
launchTasks(taskDescs)
}
}
The statement that actually schedules resources is scheduler.resourceOffers(workOffers).
resourceOffers is fairly long, so we will read it in three parts. The first part maintains various bookkeeping structures, as shown below:
/**
* Called by cluster manager to offer resources on slaves. We respond by asking our active task
* sets for tasks in order of priority. We fill each node with tasks in a round-robin manner so
* that tasks are balanced across the cluster.
*/
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
// Mark each slave as alive and remember its hostname
// Also track if new executor is added
var newExecAvail = false
for (o <- offers) {
if (!hostToExecutors.contains(o.host)) {
hostToExecutors(o.host) = new HashSet[String]()
}
if (!executorIdToRunningTaskIds.contains(o.executorId)) {
hostToExecutors(o.host) += o.executorId
executorAdded(o.executorId, o.host)
executorIdToHost(o.executorId) = o.host
executorIdToRunningTaskIds(o.executorId) = HashSet[Long]()
newExecAvail = true
}
for (rack <- getRackForHost(o.host)) { // maintain the host-to-rack mapping here
hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
}
}
// Before making any offers, remove any nodes from the blacklist whose blacklist has expired. Do
// this here to avoid a separate thread and added synchronization overhead, and also because
// updating the blacklist is only relevant when task offers are being made.
blacklistTrackerOpt.foreach(_.applyBlacklistTimeout())
val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
offers.filter { offer =>
!blacklistTracker.isNodeBlacklisted(offer.host) &&
!blacklistTracker.isExecutorBlacklisted(offer.executorId)
}
}.getOrElse(offers)
...
There is nothing special about this part; the one thing worth noting is that it records which rack each executor's host belongs to, which Spark uses later when applying the locality strategy.
Next, the second part starts allocating resources. It first shuffles the offers (i.e., the executors providing resources), then computes how many tasks each offer can run; the TaskSets are sorted by the scheduling policy into a queue, and if a new executor has become available, each TaskSetManager is notified through executorAdded() so it can recompute its locality levels. The source:
val shuffledOffers = shuffleOffers(filteredOffers)
// Build a list of tasks to assign to each worker.
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
val availableCpus = shuffledOffers.map(o => o.cores).toArray
val sortedTaskSets = rootPool.getSortedTaskSetQueue
for (taskSet <- sortedTaskSets) {
logDebug("parentName: %s, name: %s, runningTasks: %s".format(
taskSet.parent.name, taskSet.name, taskSet.runningTasks))
if (newExecAvail) {
taskSet.executorAdded()
}
}
Here tasks is a list of per-offer arrays; each array is sized to o.cores / CPUS_PER_TASK, the machine's available cores divided by the cores each task needs, which is exactly how many tasks that machine can accept.
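To make the sizing concrete, here is a tiny standalone sketch of the slot arithmetic (plain Scala, not Spark source; the numbers are made up for illustration):
// Standalone sketch: free cores on an offer divided by the cores each task needs
// (CPUS_PER_TASK comes from spark.task.cpus, which defaults to 1).
object TaskSlotArithmetic {
  def slots(freeCores: Int, cpusPerTask: Int): Int = freeCores / cpusPerTask

  def main(args: Array[String]): Unit = {
    println(slots(freeCores = 8, cpusPerTask = 1)) // 8 tasks fit on this offer
    println(slots(freeCores = 8, cpusPerTask = 2)) // only 4 if each task needs 2 cores
  }
}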
So which TaskSet gets resources first, and which worker's executor does a TaskSet go to? Both of these key questions hide behind the following line:
val sortedTaskSets = rootPool.getSortedTaskSetQueue
rootPool is an instance of Pool, whose constructor looks like this:
private[spark] class Pool(
val poolName: String,
val schedulingMode: SchedulingMode,
initMinShare: Int,
initWeight: Int)
extends Schedulable with Logging {
Here schedulingMode selects the scheduling mode; there are currently two: FIFO and FAIR.
initMinShare and initWeight are parameters used by these modes; strictly speaking they mainly serve FAIR.
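As an aside, the scheduling mode itself is chosen by the spark.scheduler.mode configuration, and FAIR's pool file can be pointed to with spark.scheduler.allocation.file. Below is a minimal sketch of selecting the mode on the application side; the master URL, app name and file path are placeholders, not taken from the source above:
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: selecting the scheduling mode via configuration.
// The master, app name and file path are placeholders.
object SchedulingModeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("scheduling-mode-demo")
      .setMaster("local[*]")
      .set("spark.scheduler.mode", "FAIR") // FIFO is the default
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional pool definitions
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 10).count() // a trivial job so the scheduler has something to schedule
    sc.stop()
  }
}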
After rootPool is instantiated, it is used to create the schedulableBuilder instance inside TaskSchedulerImpl's initialize method, whose source is shown below:
def initialize(backend: SchedulerBackend) {
this.backend = backend
schedulableBuilder = {
schedulingMode match {
case SchedulingMode.FIFO =>
new FIFOSchedulableBuilder(rootPool)
case SchedulingMode.FAIR =>
new FairSchedulableBuilder(rootPool, conf)
case _ =>
throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
s"$schedulingMode")
}
}
schedulableBuilder.buildPools()
}
As you can see, a different builder is instantiated depending on the SchedulingMode (FIFO or FAIR); let's look at each in turn.
The FIFO scheduling pool
private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
extends SchedulableBuilder with Logging {
override def buildPools() {
// nothing
}
override def addTaskSetManager(manager: Schedulable, properties: Properties) {
rootPool.addSchedulable(manager)
}
}
FIFO's buildPools is an empty implementation: TaskSetManagers are simply added straight into rootPool.
The FAIR scheduling pool
FAIR's buildPools is anything but empty: it builds a tree of scheduling pools from a local configuration file. Here is its buildPools method:
override def buildPools() {
var fileData: Option[(InputStream, String)] = None
try {
fileData = schedulerAllocFile.map { f => // read the user-specified configuration file
val fis = new FileInputStream(f)
logInfo(s"Creating Fair Scheduler pools from $f")
Some((fis, f))
}.getOrElse { // if no custom file is configured, look for the default one
val is: InputStream = Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
if (is != null) {
logInfo(s"Creating Fair Scheduler pools from default file: $DEFAULT_SCHEDULER_FILE")
Some((is, DEFAULT_SCHEDULER_FILE))
} else { // if neither is found, fall back to FIFO scheduling
logWarning("Fair Scheduler configuration file not found so jobs will be scheduled in " +
s"FIFO order. To use fair scheduling, configure pools in $DEFAULT_SCHEDULER_FILE or " +
s"set $SCHEDULER_ALLOCATION_FILE_PROPERTY to a file that contains the configuration.")
None
}
}
// build the pools from the configuration file
fileData.foreach { case (is, fileName) => buildFairSchedulerPool(is, fileName) }
} catch {
case NonFatal(t) =>
val defaultMessage = "Error while building the fair scheduler pools"
val message = fileData.map { case (is, fileName) => s"$defaultMessage from $fileName" }
.getOrElse(defaultMessage)
logError(message, t)
throw t
} finally {
fileData.foreach { case (is, fileName) => is.close() }
}
// finally create "default" pool
buildDefaultPool()
}
So in FAIR mode the scheduling pools are built from the parameters in a configuration file, which defaults to $SPARK_HOME/conf/fairscheduler.xml.template; to enable it, drop the .template suffix and edit the file, and also set the scheduling mode to FAIR. Its contents look like this:
<?xml version="1.0"?>
<allocations>
<pool name="production">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>2</weight>
<minShare>3</minShare>
</pool>
</allocations>
Each pool mainly carries two parameters, weight and minShare:
weight: the pool's weight; pools share the cluster's resources in proportion to it, each getting roughly weight / sum(weight) of the total, so a pool with weight 2 receives twice the resources of a pool with weight 1.
minShare: the minimum amount of resources (CPU cores) the pool needs. The fair scheduler first tries to give every pool at least its minShare, and only then distributes the remaining resources by weight.
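Once pools are defined, a job is routed into one of them by setting a thread-local property on the SparkContext. A short usage sketch, assuming an existing SparkContext sc running with spark.scheduler.mode=FAIR and the pools above:
// Run the next jobs submitted from this thread inside the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 1000).map(_ * 2).count()

// Clear the property so later jobs on this thread go back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)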
The two scheduling comparators
With the structure of the two kinds of pools understood, let's look at getSortedTaskSetQueue, which returns a sorted queue of TaskSetManagers:
override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
val sortedSchedulableQueue =
schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
for (schedulable <- sortedSchedulableQueue) {
sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
}
sortedTaskSetQueue
}
The key sorting statement is:
schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
So the ordering is determined by taskSetSchedulingAlgorithm's comparator.
The taskSetSchedulingAlgorithm instance is created as follows:
private val taskSetSchedulingAlgorithm: SchedulingAlgorithm = {
schedulingMode match {
case SchedulingMode.FAIR =>
new FairSchedulingAlgorithm()
case SchedulingMode.FIFO =>
new FIFOSchedulingAlgorithm()
case _ =>
val msg = s"Unsupported scheduling mode: $schedulingMode. Use FAIR or FIFO instead."
throw new IllegalArgumentException(msg)
}
}
FIFO and FAIR correspond to different SchedulingAlgorithm instances implementing different comparator logic; let's analyze each in turn.
The FIFO comparator
private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
val priority1 = s1.priority
val priority2 = s2.priority
var res = math.signum(priority1 - priority2)
if (res == 0) {
val stageId1 = s1.stageId
val stageId2 = s2.stageId
res = math.signum(stageId1 - stageId2)
}
res < 0
}
}
FIFO stands for first in, first out, and the comparator lives up to the name; the logic is simple (a standalone sketch of this ordering follows the list):
- Compare priority first (which is the job ID); the job with the smaller ID is scheduled first.
- Within the same job, the stage with the smaller Stage ID is scheduled first.
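A standalone sketch of the same ordering rule (plain Scala, not Spark classes; the priorities and stage IDs are invented):
// Sketch of FIFO ordering: sort by priority (the job id), then by stageId, ascending.
object FifoOrderDemo {
  case class Entry(priority: Int, stageId: Int, name: String)

  def fifoLessThan(s1: Entry, s2: Entry): Boolean = {
    var res = math.signum(s1.priority - s2.priority)
    if (res == 0) res = math.signum(s1.stageId - s2.stageId)
    res < 0
  }

  def main(args: Array[String]): Unit = {
    val queue = Seq(Entry(2, 5, "job2-stage5"), Entry(1, 3, "job1-stage3"), Entry(1, 1, "job1-stage1"))
    // Prints: job1-stage1, job1-stage3, job2-stage5
    println(queue.sortWith(fifoLessThan).map(_.name).mkString(", "))
  }
}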
The FAIR comparator
private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
val minShare1 = s1.minShare
val minShare2 = s2.minShare
val runningTasks1 = s1.runningTasks
val runningTasks2 = s2.runningTasks
val s1Needy = runningTasks1 < minShare1
val s2Needy = runningTasks2 < minShare2
val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
var compare = 0
if (s1Needy && !s2Needy) {
return true
} else if (!s1Needy && s2Needy) {
return false
} else if (s1Needy && s2Needy) {
compare = minShareRatio1.compareTo(minShareRatio2)
} else {
compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
}
if (compare < 0) {
true
} else if (compare > 0) {
false
} else {
s1.name < s2.name
}
}
}
The FAIR comparator's logic is as follows (see the standalone sketch further below):
- A pool whose running task count is below its minShare has higher priority: the scheduler first tries to satisfy every pool's minimum resource requirement.
- If both pools are still below their minShare, compare minShareRatio (runningTasks / minShare); the smaller ratio wins, i.e., the pool that has satisfied the smallest fraction of its minimum share is served first.
- If neither pool is below its minShare, compare runningTasks / weight; the smaller ratio wins, so with the same number of running tasks the pool with the larger weight has higher priority.
- If all of the above are equal, compare the pool names.
Since minShare and weight come from the configuration file, they can be customized by the user; within each pool, scheduling follows that pool's own schedulingMode from the configuration file, which is FIFO by default.
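A standalone sketch of the FAIR ordering rule applied to two hypothetical pools (plain Scala, not Spark classes; the task counts, minShare and weight values are invented):
// Sketch of FAIR ordering: needy pools (runningTasks < minShare) first, then by
// minShareRatio, then by runningTasks / weight, finally by name.
object FairOrderDemo {
  case class PoolStat(name: String, runningTasks: Int, minShare: Int, weight: Int)

  def fairLessThan(s1: PoolStat, s2: PoolStat): Boolean = {
    val s1Needy = s1.runningTasks < s1.minShare
    val s2Needy = s2.runningTasks < s2.minShare
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight
    if (s1Needy && !s2Needy) true
    else if (!s1Needy && s2Needy) false
    else {
      val compare =
        if (s1Needy && s2Needy) minShareRatio1.compareTo(minShareRatio2)
        else taskToWeightRatio1.compareTo(taskToWeightRatio2)
      if (compare != 0) compare < 0 else s1.name < s2.name
    }
  }

  def main(args: Array[String]): Unit = {
    // "production" has not reached its minShare yet, so it is scheduled first.
    val production = PoolStat("production", runningTasks = 1, minShare = 2, weight = 1)
    val test = PoolStat("test", runningTasks = 5, minShare = 3, weight = 2)
    println(Seq(test, production).sortWith(fairLessThan).map(_.name).mkString(", ")) // production, test
  }
}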
The locality strategy
Finally, let's look at the last part of TaskSchedulerImpl's resourceOffers method:
// Take each TaskSet in our scheduling order, and then offer it each node in increasing order
// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
// take the sorted TaskSet queue and offer resources in order of locality level
for (taskSet <- sortedTaskSets) {
var launchedAnyTask = false
var launchedTaskAtCurrentMaxLocality = false
for (currentMaxLocality <- taskSet.myLocalityLevels) {
do {
launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
launchedAnyTask |= launchedTaskAtCurrentMaxLocality
} while (launchedTaskAtCurrentMaxLocality)
}
if (!launchedAnyTask) {
taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
}
}
if (tasks.size > 0) {
hasLaunchedTask = true
}
return tasks
Here the loop for (currentMaxLocality <- taskSet.myLocalityLevels) implements the locality strategy; each TaskSet's locality preference order is:
PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY
Given a TaskSet's locality levels, the scheduler tries to launch its Tasks at the current level; if that fails for longer than a configurable time (set by spark.locality.wait), the locality level is relaxed by one step and the attempt is repeated, and so on.
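The wait can be tuned globally or per locality level; a minimal configuration sketch (the values here are only examples, matching the documented 3s default):
// Sketch: tuning the locality-wait timeouts before the scheduler relaxes a level.
val conf = new org.apache.spark.SparkConf()
  .set("spark.locality.wait", "3s")          // global default wait before downgrading
  .set("spark.locality.wait.process", "3s")  // wait spent at PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait spent at NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait spent at RACK_LOCAL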
Summary
This concludes the walkthrough of Spark core's scheduler module; what remains, loading and computing the Tasks and returning their results to the Driver, was already covered in the Executor module. Spark's scheduling layer first uses the high-level DAGScheduler to split a submitted Job into stages, cleverly exploiting the RDD Dependency relationships to recursively submit the resulting stages as packaged TaskSets; the lower-level TaskScheduler interface then takes those TaskSets and decides the scheduling order and resource assignment according to the scheduling policy and the locality strategy. The design is both elegant and efficient.