Background
We received an alert: MySQL primary-replica replication delay too high (instance: XXXX), current value: 1032.00, threshold: 1000 - 100000000, time of occurrence: 12/30/2021 02:27:16.
Analysis
Normally, when replication delay appears, the first things to check are the replica's replication status and its active sessions. By the time we looked into it the next morning, the delay was already gone. Fortunately our monitoring and operations scripts are fairly complete: the monitoring script runs once a minute and, whenever it detects delay, records both of those pieces of information. From the logs it kept, the delay had already reached 9 seconds as early as 2021-12-30 02:02:48.
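For reference, such a per-minute check can be as simple as the sketch below, run from cron. It is only a sketch, not our actual production script: it assumes the mysql client can log in via ~/.my.cnf on the replica, and the log path is a made-up placeholder.
#!/bin/bash
# Minimal per-minute replication-delay check (paths are placeholders).
LOG=/mysql/scripts/monitor/replication_delay.log
TS=$(date '+%F %T')
# Pull Seconds_Behind_Master from SHOW SLAVE STATUS.
DELAY=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
# On any delay, snapshot the replication status and the active sessions.
if [ -n "$DELAY" ] && [ "$DELAY" != "NULL" ] && [ "$DELAY" -gt 0 ]; then
    {
        echo "=== $TS delay=${DELAY}s ==="
        mysql -e 'SHOW SLAVE STATUS\G'
        mysql -e 'SHOW FULL PROCESSLIST;'
    } >> "$LOG"
fi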
- Check the replication status
The replication status shows that both the IO thread and the SQL thread are running normally (both report Yes), Seconds_Behind_Master is 9 seconds, and the SQL thread state is Waiting for dependent transaction to commit, i.e. the worker is waiting for the transaction it depends on to commit.
[kingly@rhel7]$
mysql > show slave status\G
Relay_Master_Log_File: master-bin.039806
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 927883985
Relay_Log_Space: 929519379
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 9
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 10
Master_UUID: 157bcec0-fc94-11e9-a77d-48df372eb155
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960
Executed_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,
68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
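The Waiting for dependent transaction to commit state comes from the multi-threaded applier (LOGICAL_CLOCK parallelism), so it is also worth confirming how the parallel workers are configured. A quick check, with no particular values implied:
# Show the parallel-replication settings on the replica.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'slave_parallel_%';"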
- Check the active sessions
The active sessions show several DELETE statements stuck in the System lock state. The usual causes of System lock are: 1) a large number of small transactions, such as single-row updates against a big table; 2) a table without a primary key; 3) an InnoDB-lock-style conflict, where a session on the replica modifies the same rows the sql_thread is applying; 4) excessive IO pressure. In addition, the other SQL applier threads are all waiting for transactions to commit. (A sketch for ruling out the missing-primary-key case follows the process list below.)
[kingly@rhel7]$
mysql >show processlist;
2021-12-30 02:02:48 NULL NULL 114314 Waiting for master to send event NULL
2021-12-30 02:02:48 NULL NULL 0 Waiting for dependent transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 System lock DELETE FROM offlinenews_48 WHERE id=45702564
2021-12-30 02:02:48 NULL NULL 48 System lock DELETE FROM offlinenews_75 WHERE id=47490222
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 System lock DELETE FROM offlinenews_84 WHERE id=46076462
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 System lock DELETE FROM offlinenews_95 WHERE id=46705915
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 48 Waiting for preceding transaction to commit NULL
2021-12-30 02:02:48 NULL NULL 49 Waiting for an event from Coordinator NULL
2021-12-30 02:02:48 NULL NULL 49 Waiting for an event from Coordinator NULL
2021-12-30 02:02:48 NULL NULL 49 Waiting for an event from Coordinator NULL
2021-12-30 02:02:48 NULL NULL 55 Waiting for an event from Coordinator NULL
2021-12-30 02:02:48 NULL NULL 55 Waiting for an event from Coordinator NULL
2021-12-30 02:02:48 NULL NULL 55 Waiting for an event from Coordinator NULL
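Since one of the listed causes is a table without a primary key, that case can be ruled out with a query like the following sketch (here the DELETEs are single-row deletes by id, so it was quickly excluded):
# List user tables that have no primary key.
mysql -e "
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('mysql','sys','information_schema','performance_schema')
  AND c.constraint_type IS NULL;"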
The next step was therefore to determine whether the replica itself had an open, uncommitted transaction that was blocking the SQL applier threads and these DELETEs. Analyzing the binlog around the time the problem started, about 02:00 on 2021-12-30, showed no large transactions in that window, only a small number of single-row DELETEs by primary key, which should execute quickly. Moreover, the replica is set to read-only, so it normally runs no transactions of its own. The investigation then shifted to the server itself.
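One way to do that binlog check is a sketch like the following; the file name is the master binlog shown in the replication status above, and the time range is simply the window in question:
# Decode row events between 01:55 and 02:10 and scan them for large transactions.
mysqlbinlog --base64-output=decode-rows -vv \
  --start-datetime='2021-12-30 01:55:00' \
  --stop-datetime='2021-12-30 02:10:00' \
  master-bin.039806 | less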
- Check the server load
Analyzing the vmstat data collected by the deployed OSWBB scripts, we found that around 02:00 on 2021-12-30 only about 1-2 threads were runnable while a large number (40 or so) were blocked (the b column), which is highly abnormal. The CPU columns also show roughly 10% IO wait, indicating very heavy IO pressure on the server at the time.
zzz ***Thu Dec 30 02:02:35 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 42 126088 1259068 404 110934224 0 0 169 166 0 0 0 0 99 0 0
1 42 126088 1251800 404 110934512 0 0 612144 502 8380 4156 0 0 91 8 0
2 41 126088 1252672 404 110934576 0 0 620544 358 8844 4115 0 1 90 9 0
zzz ***Thu Dec 30 02:02:50 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 41 126088 1256476 404 110936800 0 0 169 166 0 0 0 0 99 0 0
1 41 126088 1249628 404 110937072 0 0 596992 2147 9037 4735 0 0 89 11 0
2 42 126088 1247176 404 110937536 0 0 630704 605 12100 6991 1 0 87 12 0
zzz ***Thu Dec 30 02:03:05 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 42 126088 1251024 404 110939968 0 0 169 166 0 0 0 0 99 0 0
3 37 126088 1242940 404 110940208 0 0 626448 475 8316 3920 0 0 89 10 0
1 42 126088 1246844 404 110940472 0 0 606976 11179 12130 8121 0 0 88 11 0
zzz ***Thu Dec 30 02:03:20 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 39 126088 1247112 404 110943920 0 0 169 166 0 0 0 0 99 0 0
1 43 126088 1240348 404 110944080 0 0 629248 442 9168 4325 0 0 88 11 0
1 41 126088 1242900 404 110944848 0 0 578576 498 7578 3871 0 0 89 10 0
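To pin down which device was saturated, one can also pull the iostat archives collected by OSWatcher, or take a live sample as in the sketch below (assuming the sysstat package is installed):
# Extended per-device statistics in MB/s, three 5-second samples.
iostat -xm 5 3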
- Check the database error log
The database error log also offered a clue: a large number of [Note] InnoDB: page_cleaner messages, indicating that dirty-page flushing was slow at the time, which further confirms that the server was under excessive IO pressure. (A sketch for checking the related flush settings follows the log excerpt below.)
2021-12-30T01:55:17.492848+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270201857; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682298959400 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:57:18.156330+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270245889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682611334700 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:59:20.712082+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270290945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683053354600 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T02:01:20.422439+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270330881; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683501041500 waited (count) when Workers occupied = 2554254 waited when Workers occupied = 6778342662900
2021-12-30T02:01:33.788727+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5527ms. The settings might not be optimal. (flushed=74 and evicted=0, during the time.)
2021-12-30T02:01:52.969122+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18181ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:02:36.051478+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13227ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:03:22.916208+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270349313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18770439067400 waited (count) when Workers occupied = 2554266 waited when Workers occupied = 6779684449500
2021-12-30T02:03:49.254684+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13291ms. The settings might not be optimal. (flushed=72 and evicted=0, during the time.)
2021-12-30T02:05:23.176365+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270361601; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18859577400400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:06:25.751734+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 19666ms. The settings might not be optimal. (flushed=70 and evicted=0, during the time.)
2021-12-30T02:07:38.989406+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18069ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:07:44.028187+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 141; events assigned = 270376961; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18970653888400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:08:22.735042+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 15828ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:09:32.283672+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 12917ms. The settings might not be optimal. (flushed=75 and evicted=0, during the time.)
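The page_cleaner notes are a prompt to check whether the InnoDB flush settings match what the storage can actually deliver. A quick look at the relevant variables (no particular values are implied, and in this incident the settings turned out not to be the culprit):
# Flush-related settings worth comparing against the storage's real throughput.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners';"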
The root cause
This is a high-spec server that runs only the database service, with no other applications, hosts a single MySQL instance, and stores its data on SAN storage (rather extravagant), so why would there be an IO bottleneck? Given when the problem occurred, the backup job came to mind. The backup schedule is as follows: an incremental backup runs at 02:00 every day.
[kingly@rhel7]$crontab -l
00 00 * * 7 sh /mysql/scripts/backup/mysql_backup.sh FULL 1>/dev/null 2>&1
00 2 * * * sh /mysql/scripts/backup/mysql_backup.sh INCR 1>/dev/null 2>&1
Why had this never been a problem before and only showed up now? The database holds several TB of data and the business has grown considerably of late. Because the backup takes a long time, and to keep its IO from affecting other systems on the same switch, we had, after discussing it with the storage team, rate-limited this server's port on the SAN switch. However, when xtrabackup starts a backup it copies the ibd files directly, producing a very large burst of IO. The server's own IO capability is more than sufficient, so that traffic piled up at the rate-limited switch port, which in turn slowed the server's writes to the SAN. Hence the massive IO wait on the server, the slow SQL-thread apply, and the replication delay.
So the cause of this MySQL replication delay was, in the end, insufficient IO capacity.
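One way to soften this kind of burst is to throttle the backup at the source instead of only at the switch port: xtrabackup's --throttle option caps the number of IO operations per second during the copy. A sketch, with the throttle value and target directory purely illustrative:
# Cap the backup copy at roughly 100 IO operations per second (value and path are illustrative).
xtrabackup --backup --throttle=100 --target-dir=/mysql/backup/throttled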
Summary
There are many possible causes of MySQL replication delay, and I will write up more cases when I get the chance. Delay caused by insufficient IO capacity is probably not something you see every day, but if your servers are modestly specced or weak on IO, it is worth keeping in mind.