Kafka version: 0.10.2
Problem description
Roughly once a week, our production Kafka producer throws the exception below and stops; restarting it restores service. Monitoring shows that NIC traffic was steady before each incident, with no jitter. (Curious, isn't it?)
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Tracing the logs
server.log
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition
state-change.log
[2019-02-25 15:01:03,023] TRACE Broker 2 received LeaderAndIsr request PartitionState(controllerEpoch=8, leader=3, leaderEpoch=10, isr=[1, 2, 3], zkVersion=19, replicas=[1, 2, 3]) correlation id 202 from controller 1 epoch 8 for partition [__consumer_offsets,36] (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 handling LeaderAndIsr request correlationId 202 from controller 1 epoch 8 starting the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 stopped fetchers as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 truncated logs and checkpointed recovery boundaries for partition __consumer_offsets-36 as part of become-follower request with correlation id 202 from controller 1 epoch 8 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 started fetcher to new leader as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 completed LeaderAndIsr request correlationId 202 from controller 1 epoch 8 for the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,045] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:1,2,3,LeaderEpoch:10,ControllerEpoch:8),ReplicationFactor:3),AllReplicas:1,2,3) for partition __consumer_offsets-36 in response to UpdateMetadata request sent by controller 1 epoch 8 with correlation id 203 (state.change.logger)
Following the keywords received --> handling --> stopped --> truncated --> started --> completed LeaderAndIsr request --> cached leader info, we can see that partition leader election is actually very fast: in this trace it completes in roughly 22 ms (15:01:03,023 to 15:01:03,045), i.e. at millisecond scale.
Error analysis
Our producer code did not set the retries parameter, so each message is sent only once (the default in 0.10.2 is retries=0). When a leader election is in progress, the send cannot find the leader, fails with the exception above, and the program stops.
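To make the failure mode concrete, here is an illustrative sketch (not the real client code) of what the producer's built-in retry loop conceptually does when retries > 0: a transient error such as NotLeaderForPartitionException is swallowed and the send is re-attempted after a backoff, instead of failing the first time. The exception class and helper names here are invented for the sketch.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Stand-in for a retryable error like NotLeaderForPartitionException. */
    static class TransientSendException extends RuntimeException {
        TransientSendException(String msg) { super(msg); }
    }

    /** Retry a send up to `retries` extra times, sleeping `backoffMs`
     *  between attempts (the role played by retry.backoff.ms). */
    static <T> T sendWithRetries(Callable<T> send, int retries, long backoffMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return send.call();
            } catch (TransientSendException e) {
                if (attempt >= retries) throw e;   // retries exhausted: fail
                Thread.sleep(backoffMs);           // wait out the election
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a leader election: the first two attempts fail,
        // the third finds the new leader and succeeds.
        final int[] calls = {0};
        String result = sendWithRetries(() -> {
            if (++calls[0] < 3) throw new TransientSendException("not leader");
            return "ack";
        }, 3, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With retries=0 (our original configuration), the first "not leader" failure would propagate immediately, which matches the weekly crashes we observed.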
Solution
Set retries=3 on the producer so failed sends are retried up to three times (with a 100 ms pause between retries by default, controlled by retry.backoff.ms).
If you also need to guarantee message ordering, remember to set max.in.flight.requests.per.connection=1. This limits the number of unacknowledged requests the client may have on a single connection; with a value of 1, the client cannot send another request to the same broker until the previous one has been acknowledged. (Note: set this only when strict ordering is required, e.g. for binlog data.)
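The settings above can be sketched as a producer Properties object; this is a minimal example using the standard Kafka producer config key names, with a placeholder broker address. It only builds the Properties (so it runs without a broker); in real code it would be passed to new KafkaProducer<>(props).

```java
import java.util.Properties;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; replace with your cluster's.
        props.put("bootstrap.servers", "broker1:9092");
        // Retry failed sends three times instead of the 0.10.2 default of 0.
        props.put("retries", "3");
        // Pause between retries; 100 ms is already the default.
        props.put("retry.backoff.ms", "100");
        // Only when strict ordering matters (e.g. binlog data):
        // at most one unacknowledged request per connection.
        props.put("max.in.flight.requests.per.connection", "1");

        System.out.println(props.getProperty("retries"));
    }
}
```

These three keys are sufficient to survive a millisecond-scale leader election like the one traced in state-change.log above.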