A typical Flink job is processed as follows: the client submits the job to the JobManager, which schedules the work onto the TaskManagers and coordinates its execution.
It is easy to see that the JobManager is a single point of failure (SPOF: Single Point Of Failure), so making Flink highly available mainly means making the JobManager highly available. Depending on how the Flink cluster is deployed, there is a Standalone and an On-YARN variant; this article covers the Standalone mode.
JobManager HA is implemented with Zookeeper, so a Zookeeper cluster has to be set up first. The HA metadata is also stored in HDFS, so a Hadoop cluster is required as well. Finally, the relevant Flink configuration files have to be modified.
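As a quick sanity check (assuming the Hadoop cluster is already up and the hdfs client is on the PATH), the HA storage directory configured in step 1 below can be created and listed in advance:
[root@hadoop2 ~]# hdfs dfs -mkdir -p /flink/ha
[root@hadoop2 ~]# hdfs dfs -ls /flink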
1. Modify conf/flink-conf.yaml
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: 10.108.4.203:2181,10.108.4.204:2181,10.108.4.205:2181
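Two more keys are optional; the values shown here are only examples based on the defaults described in the Flink documentation and mainly matter when several Flink clusters share one Zookeeper ensemble:
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one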
2. Modify conf/masters
List the nodes on which a JobManager should run, together with their Web UI ports:
10.108.4.202:8081
10.108.4.203:8081
3. Modify conf/zoo.cfg
server.1=10.108.4.203:2888:3888
server.2=10.108.4.204:2888:3888
server.3=10.108.4.205:2888:3888
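The server.X entries above are appended to the zoo.cfg template that ships with Flink; the rest of that file is roughly the following (values assumed from the bundled template, with dataDir pointed at a persistent location instead of the default /tmp):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181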
PS: After making the changes, use scp to sync flink-conf.yaml, masters, and zoo.cfg to the other nodes:
[root@hadoop2 conf]# scp flink-conf.yaml masters zoo.cfg root@hadoop3:/opt/flink-1.5.0/conf
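With more nodes to update, a small shell loop does the same thing (hadoop3 and hadoop4 are simply the remaining hosts in this example cluster):
[root@hadoop2 conf]# for h in hadoop3 hadoop4; do scp flink-conf.yaml masters zoo.cfg root@$h:/opt/flink-1.5.0/conf; done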
4. Start the Zookeeper service
[root@hadoop2 bin]# ./start-zookeeper-quorum.sh
Starting zookeeper daemon on host hadoop2.
Starting zookeeper daemon on host hadoop3.
Starting zookeeper daemon on host hadoop4.
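Each peer can be checked with Zookeeper's four-letter-word commands; for example (assuming nc is available), the stat command reports whether the node is the quorum leader or a follower:
[root@hadoop2 bin]# echo stat | nc 10.108.4.203 2181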
5. Start the HA cluster
[root@hadoop2 bin]# ./start-cluster.sh
Starting HA cluster with 2 masters.
Starting standalonesession daemon on host hadoop2.
Starting standalonesession daemon on host hadoop3.
Starting taskexecutor daemon on host hadoop3.
Starting taskexecutor daemon on host hadoop4.
As the output shows, two JobManagers have been started: one as Leader and one as Standby.
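This can also be verified with jps on each master node; in Flink 1.5 the session-mode JobManager shows up as StandaloneSessionClusterEntrypoint and each TaskManager as TaskManagerRunner (the class names may differ in other versions):
[root@hadoop2 bin]# jps | grep -E 'StandaloneSessionClusterEntrypoint|TaskManagerRunner'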
6. Stop the cluster
[root@hadoop2 bin]# ./stop-cluster.sh
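Note that stop-cluster.sh only stops the JobManagers and TaskManagers; the Zookeeper peers started in step 4 have their own stop script:
[root@hadoop2 bin]# ./stop-zookeeper-quorum.sh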
Testing HA
1. Access the Leader's Web UI:
2. Access the Standby's Web UI:
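A minimal failover test (a sketch, assuming the leader is currently running on hadoop2 and that this is Flink 1.5, where the JobManager process is StandaloneSessionClusterEntrypoint) is to kill the leading JobManager and then refresh the Standby's Web UI, which should take over as the new Leader after a short delay:
[root@hadoop2 bin]# kill -9 $(jps | grep StandaloneSessionClusterEntrypoint | awk '{print $1}')
Afterwards the killed JobManager can be brought back with ./jobmanager.sh start, and it rejoins the cluster as the new Standby.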
Official HA documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/jobmanager_high_availability.html