转载于:http://matt33.com/2020/03/15/flink-taskmanager-7/
TaskManager 启动流程
与 JobManager 类似,TaskManager 的启动类是 TaskManagerRunner
,大概的流程如下图所示:
TaskManager 启动的入口方法是 runTaskManager()
,它会首先初始化 TaskManager 一些相关的服务,比如:初始化 RpcService、初始化 HighAvailabilityServices 等等,这些都是为 TaskManager 服务的启动做相应的准备工作。其实 TaskManager 初始化主要分为下面两大块:
- TaskManager 相关 service 的初始化:比如:内存管理器、IO 管理器、TaskSlotTable(TaskSlot 的管理是在这里进行的)等,这里也包括 TaskExecutor 的初始化,注意这里对于一些需要启动的服务在这一步并没有启动;
- TaskExecutor 的启动:它会启动 TM 上相关的服务,Task 的提交和运行也是在 TaskExecutor 中处理的,上一步 TM 初始化的那些服务也是在 TaskExecutor 中使用的。
TM 的服务真正 Run 起来之后,核心流程还是在 TaskExecutor
中。
TaskManager 相关服务的初始化
这里,先从 TaskManager 的入口 runTaskManager()
来看 TaskManager 相关服务的初始化流程,总结来看流程如下:
// 1. 入口方法
runTaskManager()
// 2. 创建 TaskManagerRunner 对象
TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, resourceId);
// 3. 启动 TaskManager 服务
startTaskManager()
// 4. 初始化相关的服务
TaskManagerServices.fromConfiguration()
首先看下具体的代码实现:
// TaskManagerRunner.java
//note: 启动 TaskManagerRunner
public static void runTaskManager(Configuration configuration, ResourceID resourceId) throws Exception {
final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, resourceId);
taskManagerRunner.start();
}
//note: 初始化 TaskManagerRunner
public TaskManagerRunner(Configuration configuration, ResourceID resourceId) throws Exception {
this.configuration = checkNotNull(configuration);
this.resourceId = checkNotNull(resourceId);
//note: akka 超时设置
timeout = AkkaUtils.getTimeoutAsTime(configuration);
this.executor = java.util.concurrent.Executors.newScheduledThreadPool(
Hardware.getNumberCPUCores(),
new ExecutorThreadFactory("taskmanager-future"));
//note: HA 的配置及服务初始化
highAvailabilityServices = HighAvailabilityServicesUtils.createHighAvailabilityServices(
configuration,
executor,
HighAvailabilityServicesUtils.AddressResolution.TRY_ADDRESS_RESOLUTION);
//note: create rpc service
rpcService = createRpcService(configuration, highAvailabilityServices);
//note: 初始化心跳服务
HeartbeatServices heartbeatServices = HeartbeatServices.fromConfiguration(configuration);
//note: metrics 服务
metricRegistry = new MetricRegistryImpl(
MetricRegistryConfiguration.fromConfiguration(configuration),
ReporterSetup.fromConfiguration(configuration));
//note: 启动相应的 metrics 服务
final RpcService metricQueryServiceRpcService = MetricUtils.startMetricsRpcService(configuration, rpcService.getAddress());
metricRegistry.startQueryService(metricQueryServiceRpcService, resourceId);
//note: 初始化 blob 服务
blobCacheService = new BlobCacheService(
configuration, highAvailabilityServices.createBlobStore(), null
);
//note: 启动 TaskManager 服务及创建 TaskExecutor 对象
taskManager = startTaskManager(
this.configuration,
this.resourceId,
rpcService,
highAvailabilityServices,
heartbeatServices,
metricRegistry,
blobCacheService,
false,
this);
this.terminationFuture = new CompletableFuture<>();
this.shutdown = false;
//note: 周期性地输出内存相关的日志信息,直到 terminationFuture complete
MemoryLogger.startIfConfigured(LOG, configuration, terminationFuture);
}
在上面的流程中,初始化了一些最基本的服务,比如:rpc 服务,在方法的最后调用了 startTaskManager() 启动 TaskManager,其代码实现如下:
// TaskManagerRunner.java
//note: 创建并初始化 TaskExecutor 对象
public static TaskExecutor startTaskManager(
Configuration configuration,
ResourceID resourceID,
RpcService rpcService,
HighAvailabilityServices highAvailabilityServices,
HeartbeatServices heartbeatServices,
MetricRegistry metricRegistry,
BlobCacheService blobCacheService,
boolean localCommunicationOnly,
FatalErrorHandler fatalErrorHandler) throws Exception {
checkNotNull(configuration);
checkNotNull(resourceID);
checkNotNull(rpcService);
checkNotNull(highAvailabilityServices);
LOG.info("Starting TaskManager with ResourceID: {}", resourceID);
InetAddress remoteAddress = InetAddress.getByName(rpcService.getAddress());
//note: TM 服务相关的配置都维护在这个对象中,这里会把使用的相关参数解析并维护起来
TaskManagerServicesConfiguration taskManagerServicesConfiguration =
TaskManagerServicesConfiguration.fromConfiguration(
configuration,
resourceID,
remoteAddress,
EnvironmentInformation.getSizeOfFreeHeapMemoryWithDefrag(),
EnvironmentInformation.getMaxJvmHeapMemory(),
localCommunicationOnly);
//note: 初始化 TM 的 TaskManagerMetricGroup,并相应地初始化 TM 的基本状态(内存、CPU 等)监控
Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(
metricRegistry,
TaskManagerLocation.getHostName(remoteAddress),
resourceID,
taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());
//note: 初始化 TaskManagerServices(TM 相关服务的初始化都在这里)
TaskManagerServices taskManagerServices = TaskManagerServices.fromConfiguration(
taskManagerServicesConfiguration,
taskManagerMetricGroup.f1,
rpcService.getExecutor()); // TODO replace this later with some dedicated executor for io.
//note: TaskManager 相关的配置,主要用于 TaskExecutor 的初始化
TaskManagerConfiguration taskManagerConfiguration = TaskManagerConfiguration.fromConfiguration(configuration);
String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();
//note: 最后创建 TaskExecutor 对象
return new TaskExecutor(
rpcService,
taskManagerConfiguration,
highAvailabilityServices,
taskManagerServices,
heartbeatServices,
taskManagerMetricGroup.f0,
metricQueryServiceAddress,
blobCacheService,
fatalErrorHandler,
new PartitionTable<>());
}
这里,来着重看一下 TaskManagerServices.fromConfiguration() 这个方法,在这个方法初始了很多 TM 的服务,从下面的具体实现中也可以看出:
// TaskManagerServices.java
/**
* Creates and returns the task manager services.
* note:根据创建 TM 服务
*
* @param taskManagerServicesConfiguration task manager configuration
* @param taskManagerMetricGroup metric group of the task manager
* @param taskIOExecutor executor for async IO operations
* @return task manager components
* @throws Exception
*/
public static TaskManagerServices fromConfiguration(
TaskManagerServicesConfiguration taskManagerServicesConfiguration,
MetricGroup taskManagerMetricGroup,
Executor taskIOExecutor) throws Exception {
// pre-start checks
checkTempDirs(taskManagerServicesConfiguration.getTmpDirPaths());
//note: 创建 taskEventDispatcher
final TaskEventDispatcher taskEventDispatcher = new TaskEventDispatcher();
// start the I/O manager, it will create some temp directories.
//note: 创建 IO 管理器
final IOManager ioManager = new IOManagerAsync(taskManagerServicesConfiguration.getTmpDirPaths());
//note: 创建 ShuffleEnvironment 对象(默认是 NettyShuffleEnvironment)
final ShuffleEnvironment<?, ?> shuffleEnvironment = createShuffleEnvironment(
taskManagerServicesConfiguration,
taskEventDispatcher,
taskManagerMetricGroup);
final int dataPort = shuffleEnvironment.start();
//note: 创建 KvStateService 实例并启动
final KvStateService kvStateService = KvStateService.fromConfiguration(taskManagerServicesConfiguration);
kvStateService.start();
//note: 初始化 taskManagerLocation,记录 connection 信息
final TaskManagerLocation taskManagerLocation = new TaskManagerLocation(
taskManagerServicesConfiguration.getResourceID(),
taskManagerServicesConfiguration.getTaskManagerAddress(),
dataPort);
// this call has to happen strictly after the network stack has been initialized
//note: 初始化 MemoryManager
final MemoryManager memoryManager = createMemoryManager(taskManagerServicesConfiguration);
final long managedMemorySize = memoryManager.getMemorySize();
//note: 初始化 BroadcastVariableManager 对象
final BroadcastVariableManager broadcastVariableManager = new BroadcastVariableManager();
//note: 当前 TM 拥有的 slot 及每个 slot 的资源信息
final int numOfSlots = taskManagerServicesConfiguration.getNumberOfSlots();
final List<ResourceProfile> resourceProfiles =
Collections.nCopies(numOfSlots, computeSlotResourceProfile(numOfSlots, managedMemorySize));
//note: 注册一个超时(AKKA 超时设置)服务(在 TaskSlotTable 用于监控 slot 分配是否超时)
final TimerService<AllocationID> timerService = new TimerService<>(
new ScheduledThreadPoolExecutor(1),
taskManagerServicesConfiguration.getTimerServiceShutdownTimeout());
//note: 这里会维护 slot 相关列表
final TaskSlotTable taskSlotTable = new TaskSlotTable(resourceProfiles, timerService);
//note: 维护 jobId 与 JobManager connection 之间的关系
final JobManagerTable jobManagerTable = new JobManagerTable();
//note: 监控注册的 job 的 JobManger leader 信息
final JobLeaderService jobLeaderService = new JobLeaderService(taskManagerLocation, taskManagerServicesConfiguration.getRetryingRegistrationConfiguration());
final String[] stateRootDirectoryStrings = taskManagerServicesConfiguration.getLocalRecoveryStateRootDirectories();
final File[] stateRootDirectoryFiles = new File[stateRootDirectoryStrings.length];
for (int i = 0; i < stateRootDirectoryStrings.length; ++i) {
stateRootDirectoryFiles[i] = new File(stateRootDirectoryStrings[i], LOCAL_STATE_SUB_DIRECTORY_ROOT);
}
//note: 创建 TaskExecutorLocalStateStoresManager 对象:维护状态信息
final TaskExecutorLocalStateStoresManager taskStateManager = new TaskExecutorLocalStateStoresManager(
taskManagerServicesConfiguration.isLocalRecoveryEnabled(),
stateRootDirectoryFiles,
taskIOExecutor);
//note: 将上面初始化的这些服务,封装到一个 TaskManagerServices 对象中
return new TaskManagerServices(
taskManagerLocation,
memoryManager,
ioManager,
shuffleEnvironment,
kvStateService,
broadcastVariableManager,
taskSlotTable,
jobManagerTable,
jobLeaderService,
taskStateManager,
taskEventDispatcher);
}
看到这里,是否有点懵圈了,是不是感觉 TaskManager 实现还挺复杂的,但与 TaskManager 要做的功能相比,上面的实现还不够,真正在 TaskManager 中处理复杂繁琐工作的组件是 TaskExecutor,这个才是 TaskManager 的核心。
TaskExecutor 的启动
回顾一下文章最开始的流程图,TaskManagerRunner 调用 run() 方法之后,真正要启动的是 TaskExecutor 服务,其 onStart() 具体实现如下:
//note: 启动服务
@Override
public void onStart() throws Exception {
try {
//note: 启动 TM 的相关服务
startTaskExecutorServices();
} catch (Exception e) {
final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), e);
onFatalError(exception);
throw exception;
}
//note: 注册超时检测,如果超时还未注册完成,就抛出错误,启动失败
startRegistrationTimeout();
}
这里,主要分为两个部分:
startTaskExecutorServices(): 启动 TaskManager 相关的服务,结合流程图主要是四大块:
启动心跳服务;
向 Flink Master 的 ResourceManager 注册 TaskManager;
启动 TaskSlotTable 服务(TaskSlot 的维护主要在这个服务中);
启动 JobLeaderService 服务,它主要是监控各个作业 JobManager leader 的变化;
startRegistrationTimeout(): 启动注册超时的检测,默认是5 min,如果超过这个时间还没注册完成,就会抛出异常退出进程,启动失败。
TaskExecutor 启动的核心实现是在 startTaskExecutorServices() 中,其实现如下:
private void startTaskExecutorServices() throws Exception {
try {
//note: 启动心跳服务
startHeartbeatServices();
//note: 与集群的 ResourceManager 建立连接(并创建一个 listener)
// start by connecting to the ResourceManager
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
// tell the task slot table who's responsible for the task slot actions
//note: taskSlotTable 启动
taskSlotTable.start(new SlotActionsImpl());
// start the job leader service
//note: 启动 job leader 服务
jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());
fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
} catch (Exception e) {
handleStartTaskExecutorServicesException(e);
}
}
接下来,详细这块的实现。
1 启动心跳服务
TaskExecutor 启动的第一个服务就是 HeartbeatManager,这里会启动两个:
-
jobManagerHeartbeatManager
: 用于与 JobManager(如果 Job 有 task 在这个 TM 上,这个 Job 的 JobManager 就与 TaskManager 有心跳通信)之间的心跳通信管理,如果 timeout,这里会重连; -
resourceManagerHeartbeatManager
:用于与 ResourceManager 之间的通信管理,如果 timeout,这里也会重连。
// TaskExecutor.java
//note: 启动心跳服务
private void startHeartbeatServices() {
final ResourceID resourceId = taskExecutorServices.getTaskManagerLocation().getResourceID();
//note: 创建一个与 JM 通信的心跳管理器
jobManagerHeartbeatManager = heartbeatServices.createHeartbeatManager(
resourceId,
new JobManagerHeartbeatListener(),
getMainThreadExecutor(),
log);
//note: 创建一个与 RM 通信的心跳管理器
resourceManagerHeartbeatManager = heartbeatServices.createHeartbeatManager(
resourceId,
new ResourceManagerHeartbeatListener(),
getMainThreadExecutor(),
log);
}
2. 向 RM 注册 TM
TaskManger 向 ResourceManager 注册是通过 ResourceManagerLeaderListener 来完成的,它会监控 ResourceManager 的 leader 变化,如果有新的 leader 被选举出来,将会调用 notifyLeaderAddress() 方法去触发与 ResourceManager 的重连,其实现如下:
// TaskExecutor.java
/**
* The listener for leader changes of the resource manager.
* note:监控 ResourceManager leader 变化的 listener
*/
private final class ResourceManagerLeaderListener implements LeaderRetrievalListener {
//note: 如果 leader 被选举处理(包括挂掉之后重新选举),将会调用这个方法通知 TM
@Override
public void notifyLeaderAddress(final String leaderAddress, final UUID leaderSessionID) {
runAsync(
() -> notifyOfNewResourceManagerLeader(
leaderAddress,
ResourceManagerId.fromUuidOrNull(leaderSessionID)));
}
@Override
public void handleError(Exception exception) {
onFatalError(exception);
}
}
//note: 如果 RM 的 new leader 选举出来了,这里会新创建一个 ResourceManagerAddress 对象,并重新建立连接
private void notifyOfNewResourceManagerLeader(String newLeaderAddress, ResourceManagerId newResourceManagerId) {
resourceManagerAddress = createResourceManagerAddress(newLeaderAddress, newResourceManagerId);
reconnectToResourceManager(new FlinkException(String.format("ResourceManager leader changed to new address %s", resourceManagerAddress)));
}
//note: 重新与 ResourceManager 连接(可能是 RM leader 切换)
private void reconnectToResourceManager(Exception cause) {
closeResourceManagerConnection(cause);
//note: 注册超时检测,如果 timeout 还没注册成功,这里就会 failed
startRegistrationTimeout();
//note: 与 RM 重新建立连接
tryConnectToResourceManager();
}
//note: 建立与 ResourceManager 的连接
private void tryConnectToResourceManager() {
if (resourceManagerAddress != null) {
connectToResourceManager();
}
}
//note: 与 ResourceManager 建立连接
private void connectToResourceManager() {
assert(resourceManagerAddress != null);
assert(establishedResourceManagerConnection == null);
assert(resourceManagerConnection == null);
log.info("Connecting to ResourceManager {}.", resourceManagerAddress);
//note: 与 RM 建立连接
resourceManagerConnection =
new TaskExecutorToResourceManagerConnection(
log,
getRpcService(),
getAddress(),
getResourceID(),
taskManagerConfiguration.getRetryingRegistrationConfiguration(),
taskManagerLocation.dataPort(),
hardwareDescription,
resourceManagerAddress.getAddress(),
resourceManagerAddress.getResourceManagerId(),
getMainThreadExecutor(),
new ResourceManagerRegistrationListener());
resourceManagerConnection.start();
}
在上面的最后一步,创建了 TaskExecutorToResourceManagerConnection 对象,它启动后,会向 ResourceManager 注册 TM,具体的方法实现如下:
// TaskExecutorToResourceManagerConnection.java
@Override
protected CompletableFuture<RegistrationResponse> invokeRegistration(
ResourceManagerGateway resourceManager, ResourceManagerId fencingToken, long timeoutMillis) throws Exception {
Time timeout = Time.milliseconds(timeoutMillis);
return resourceManager.registerTaskExecutor(
taskExecutorAddress,
resourceID,
dataPort,
hardwareDescription,
timeout);
}
ResourceManager 在收到这个请求,会做相应的处理,主要要做的事情就是:先从缓存里移除旧的 TM 注册信息(如果之前存在的话),然后再更新缓存,并增加心跳监控,只有这些工作完成之后,TM 的注册才会被认为是成功的。
3. 启动 TaskSlotTable 服务
TaskSlotTable 从名字也可以看出,它主要是为 TaskSlot 服务的,它主要的功能有以下三点:
- 维护这个 TM 上所有 TaskSlot 与 Task、及 Job 的关系;
- 维护这个 TM 上所有 TaskSlot 的状态;
- TaskSlot 在进行 allocate/free 操作,通过 TimeService 做超时检测。
先看下 TaskSlotTable 是如何初始化的:
// TaskManagerServices.java
//note: 当前 TM 拥有的 slot 及每个 slot 的资源信息
//note: TM 的 slot 数由 taskmanager.numberOfTaskSlots 决定,默认是 1
final int numOfSlots = taskManagerServicesConfiguration.getNumberOfSlots();
final List<ResourceProfile> resourceProfiles =
Collections.nCopies(numOfSlots, computeSlotResourceProfile(numOfSlots, managedMemorySize));
//note: 注册一个超时(AKKA 超时设置)服务(在 TaskSlotTable 用于监控 slot 分配是否超时)
//note: 超时参数由 akka.ask.timeout 控制,默认是 10s
final TimerService<AllocationID> timerService = new TimerService<>(
new ScheduledThreadPoolExecutor(1),
taskManagerServicesConfiguration.getTimerServiceShutdownTimeout());
//note: 这里会维护 slot 相关列表
final TaskSlotTable taskSlotTable = new TaskSlotTable(resourceProfiles, timerService);
TaskSlotTable 的初始化,只需要两个变量:
resourceProfiles: TM 上每个 Slot 的资源信息;
timerService: 超时检测服务,来保证操作超时时做相应的处理。
TaskSlotTable 的启动流程如下:
// TaskExecutor.java
// tell the task slot table who's responsible for the task slot actions
//note: taskSlotTable 启动
taskSlotTable.start(new SlotActionsImpl());
//note: SlotActions 相关方法的实现
private class SlotActionsImpl implements SlotActions {
//note: 释放 slot 资源
@Override
public void freeSlot(final AllocationID allocationId) {
runAsync(() ->
freeSlotInternal(
allocationId,
new FlinkException("TaskSlotTable requested freeing the TaskSlot " + allocationId + '.')));
}
//note: 如果 slot 相关的操作(分配/释放)失败,这里将会调用这个方法
//note: 监控的手段是:操作前先注册一个 timeout 监控,操作完成后再取消这个监控,如果在这个期间 timeout 了,就会调用这个方法
//note: TimeService 的 key 是 AllocationID
@Override
public void timeoutSlot(final AllocationID allocationId, final UUID ticket) {
runAsync(() -> TaskExecutor.this.timeoutSlot(allocationId, ticket));
}
}
4. 启动 JobLeaderService 服务
TaskExecutor 启动的最后一步是,启动 JobLeader 服务,这个服务通过 JobLeaderListenerImpl 监控 Job 的 JobManager leader 的变化,如果 leader 被选举出来之后,这里将会与新的 JobManager leader 建立通信连接。
// TaskExecutor.java
// start the job leader service
//note: 启动 job leader 服务
jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());
//note: JobLeaderListener 的实现
private final class JobLeaderListenerImpl implements JobLeaderListener {
@Override
public void jobManagerGainedLeadership(
final JobID jobId,
final JobMasterGateway jobManagerGateway,
final JMTMRegistrationSuccess registrationMessage) {
//note: 建立与 JobManager 的连接
runAsync(
() ->
establishJobManagerConnection(
jobId,
jobManagerGateway,
registrationMessage));
}
@Override
public void jobManagerLostLeadership(final JobID jobId, final JobMasterId jobMasterId) {
log.info("JobManager for job {} with leader id {} lost leadership.", jobId, jobMasterId);
runAsync(() ->
closeJobManagerConnection(
jobId,
new Exception("Job leader for job id " + jobId + " lost leadership.")));
}
@Override
public void handleError(Throwable throwable) {
onFatalError(throwable);
}
}
到这里,TaskManager 的启动流程就梳理完了,TaskManager 在实现上整体的复杂度还是比较高的,毕竟它要做的事情是非常多的,下面的几个问题,将会进一步分析 TaskManager 内部的实现机制。
TaskManager 提供了哪些能力/功能?
要想知道 TaskManager 提供了哪些能力,个人认为有一个最简单有效的方法就是查看其对外提供的 API 接口,它向上层暴露哪些 API,这些 API 背后都是 TaskManager 能力的体现,TaskManager 对外的包括的 API 列表如下:
-
requestSlot()
: RM 向 TM 请求一个 slot 资源; -
requestStackTraceSample()
: 请求某个 task 在执行过程中的一个 stack trace 抽样; -
submitTask()
: JobManager 向 TM 提交 task; -
updatePartitions()
: 更新这个 task 对应的 Partition 信息; -
releasePartitions()
: 释放这个 job 的所有中间结果,比如 close 的时候触发; -
triggerCheckpoint()
: Checkpoint Coordinator 触发 task 的 checkpoint; -
confirmCheckpoint()
: Checkpoint Coordinator 通知 task 这个 checkpoint 完成; -
cancelTask()
: task 取消; -
heartbeatFromJobManager()
: 接收来自 JobManager 的心跳请求; -
heartbeatFromResourceManager()
: 接收来自 ResourceManager 的心跳请求; -
disconnectJobManager()
; -
disconnectResourceManager()
; -
freeSlot()
: JobManager 释放 Slot; -
requestFileUpload()
: 一些文件(log 等)的上传请求; -
requestMetricQueryServiceAddress()
: 请求 TM 的 metric query service 地址; -
canBeReleased()
: 检查 TM 是否可以被 realease;
把上面的 API 列表分分类,大概有以下几块:
- slot 的资源管理:slot 的分配/释放;
- task 运行:接收来自 JobManager 的 task 提交、也包括该 task 对应的 Partition(中间结果)信息;
- checkpoint 相关的处理;
- 心跳监控、连接建立等。
通常,可以任务 TaskManager 提供的功能主要是前三点,如下图所示:
TaskManager 提供的功能TaskManager 怎么发现 RM leader(在使用 ZK 做 HA 的情况下)?
这个是 Flink HA 内容,Flink HA 机制是有一套统一的框架,它跟这个问题(TM 如何维护 JobManager 的关系,如果 JobManager 挂掉,TM 会如何处理? )的原理是一样的,这里以 ResourceManager Leader 的发现为例简单介一下。
这里,我们以使用 Zookeeper 模式的情况来讲述,ZooKeeper 做 HA 是业内最常用的方案,Flink 在实现并没有使用 ZkClient
这个包,而是使用 curator
来做的(有兴趣可以看下这篇文章 跟着实例学习ZooKeeper的用法: 缓存)。
关于 Flink HA 的使用,可以参考官方文档——JobManager High Availability (HA)。这里 TaskExecutor 在注册完 ResourceManagerLeaderListener
后,如果 leader 被选举出来或者有节点有变化,就通过它的 notifyLeaderAddress()
方法来通知 TaskExecutor,核心还是利用了 ZK 的 watcher 机制。同理, JobManager leader 的处理也是一样。
TM Slot 资源是如何管理的?
TaskManager Slot 资源的管理主要是在 TaskSlotTable 中处理的,slot 资源的申请与释放都通过 它处理的,相关的流程如下图所示(图中只描述了主要逻辑,相关的异常处理没有展示在图中):
TaskManager slot 的分配与释放slot 的申请
这里先看下 slot 资源请求的处理,其实现如下:
// TaskExecutor.java
//note: slot 请求
@Override
public CompletableFuture<Acknowledge> requestSlot(
final SlotID slotId,
final JobID jobId,
final AllocationID allocationId,
final String targetAddress,
final ResourceManagerId resourceManagerId,
final Time timeout) {
// TODO: Filter invalid requests from the resource manager by using the instance/registration Id
log.info("Receive slot request {} for job {} from resource manager with leader id {}.",
allocationId, jobId, resourceManagerId);
try {
if (!isConnectedToResourceManager(resourceManagerId)) {
//note: 如果 TM 并没有跟这个 RM 通信,就抛出异常
final String message = String.format("TaskManager is not connected to the resource manager %s.", resourceManagerId);
log.debug(message);
throw new TaskManagerException(message);
}
if (taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
//note: Slot 状态是 free,还未分配出去
if (taskSlotTable.allocateSlot(slotId.getSlotNumber(), jobId, allocationId, taskManagerConfiguration.getTimeout())) {
log.info("Allocated slot for {}.", allocationId);
//note: allcate 成功
} else {
log.info("Could not allocate slot for {}.", allocationId);
throw new SlotAllocationException("Could not allocate slot.");
}
} else if (!taskSlotTable.isAllocated(slotId.getSlotNumber(), jobId, allocationId)) {
//note: slot 已经分配出去,但分配的并不是当前这个作业
final String message = "The slot " + slotId + " has already been allocated for a different job.";
log.info(message);
final AllocationID allocationID = taskSlotTable.getCurrentAllocation(slotId.getSlotNumber());
throw new SlotOccupiedException(message, allocationID, taskSlotTable.getOwningJob(allocationID));
}
if (jobManagerTable.contains(jobId)) {
//note: 如果 TM 已经有这个 JobManager 的 meta,这里会将这个 job 的 slot 分配再汇报给 JobManager 一次
offerSlotsToJobManager(jobId);
} else {
try {
//note: 监控这个作业 JobManager 的 leader 变化
jobLeaderService.addJob(jobId, targetAddress);
} catch (Exception e) {
// free the allocated slot
try {
taskSlotTable.freeSlot(allocationId);
} catch (SlotNotFoundException slotNotFoundException) {
// slot no longer existent, this should actually never happen, because we've
// just allocated the slot. So let's fail hard in this case!
onFatalError(slotNotFoundException);
}
// release local state under the allocation id.
localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
// sanity check
if (!taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
onFatalError(new Exception("Could not free slot " + slotId));
}
throw new SlotAllocationException("Could not add job to job leader service.", e);
}
}
} catch (TaskManagerException taskManagerException) {
return FutureUtils.completedExceptionally(taskManagerException);
}
return CompletableFuture.completedFuture(Acknowledge.get());
}
相应的处理逻辑如下:
首先检测这个这个 RM 是否当前建立连接的 RM,如果不是,就抛出相应的异常,需要等到 TM 连接上 RM 之后才能处理 RM 上的 slot 请求;
判断这个 slot 是否可以分配
如果 slot 是 FREE 状态,就进行分配(调用 TaskSlotTable 的 allocateSlot() 方法),如果分配失败,就抛出相应的异常;
如果 slot 已经分配,检查分配的是不是当前作业的 AllocationId,如果不是,也会抛出相应的异常,告诉 RM 这个 Slot 已经分配出去了;
如果 TM 已经有了这个 JobManager 的 meta,这里会将这个 job 在这个 TM 上的 slot 分配再重新汇报给 JobManager 一次;
而 TaskSlotTable 在处理 slot 的分配时,主要是根据内部缓存的信息做相应的检查,其 allocateSlot() 的方法的实现如下:
// TaskSlotTable.java
public boolean allocateSlot(int index, JobID jobId, AllocationID allocationId, Time slotTimeout) {
checkInit();
TaskSlot taskSlot = taskSlots.get(index);
//note: 分配这个 TaskSlot
boolean result = taskSlot.allocate(jobId, allocationId);
if (result) {
//note: 分配成功,记录到缓存中
// update the allocation id to task slot map
allocationIDTaskSlotMap.put(allocationId, taskSlot);
// register a timeout for this slot since it's in state allocated
timerService.registerTimeout(allocationId, slotTimeout.getSize(), slotTimeout.getUnit());
// add this slot to the set of job slots
Set<AllocationID> slots = slotsPerJob.get(jobId);
if (slots == null) {
slots = new HashSet<>(4);
slotsPerJob.put(jobId, slots);
}
slots.add(allocationId);
}
return result;
}
slot 的释放
这里再看下 Slot 的资源是如何释放的,代码实现如下:
// TaskExecutor.java
//note: 释放这个 slot 资源
@Override
public CompletableFuture<Acknowledge> freeSlot(AllocationID allocationId, Throwable cause, Time timeout) {
freeSlotInternal(allocationId, cause);
return CompletableFuture.completedFuture(Acknowledge.get());
}
//note: 将本地分配的 slot 释放掉(free the slot)
private void freeSlotInternal(AllocationID allocationId, Throwable cause) {
checkNotNull(allocationId);
log.debug("Free slot with allocation id {} because: {}", allocationId, cause.getMessage());
try {
final JobID jobId = taskSlotTable.getOwningJob(allocationId);
//note: 释放这个 slot
final int slotIndex = taskSlotTable.freeSlot(allocationId, cause);
if (slotIndex != -1) {
//note: 成功释放掉的情况下
if (isConnectedToResourceManager()) {
//note: 通知 ResourceManager 这个 slot 因为被释放了,所以可以变可用了
// the slot was freed. Tell the RM about it
ResourceManagerGateway resourceManagerGateway = establishedResourceManagerConnection.getResourceManagerGateway();
resourceManagerGateway.notifySlotAvailable(
establishedResourceManagerConnection.getTaskExecutorRegistrationId(),
new SlotID(getResourceID(), slotIndex),
allocationId);
}
if (jobId != null) {
closeJobManagerConnectionIfNoAllocatedResources(jobId);
}
}
} catch (SlotNotFoundException e) {
log.debug("Could not free slot for allocation id {}.", allocationId, e);
}
//note: 释放这个 allocationId 的相应状态信息
localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
}
总结一下,TaskExecutor 在处理 slot 释放请求的理逻辑如下:
先调用 TaskSlotTable 的 freeSlot() 方法,尝试释放这个 slot:
如果这个 slot 没有 task 在运行,那么 slot 是可以释放的(状态更新为 FREE);
先将 slot 状态更新为 RELEASING,然后再遍历这个 slot 上的 task,逐个将其标记为 failed;
如果 slot 被成功释放(状态是 FREE),这里将会通知 RM 这个 slot 现在又可用了;
更新缓存信息。