MySQL Replication Delay Caused by Insufficient IO Capacity

Background

We received an alert: MySQL master-slave replication delay too high (instance: XXXX), current value: 1032.00, threshold: 1000 - 100000000, occurred at: 12/30/2021 02:27:16.

Analysis

Normally, when replication delay shows up, the first things to check are the replica's replication status and its active sessions. By the time we started analyzing the next morning, the delay was already gone. Fortunately, our monitoring and operations scripts are fairly complete: the monitoring script runs once a minute and, whenever it detects a delay, records both of those pieces of information. From the retained monitoring logs, the replica was already 9 seconds behind as early as 2021-12-30 02:02:48. (A sketch of this kind of check script follows.)
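
The actual monitoring script is not shown in this article; a minimal sketch of a per-minute check like this might look as follows. The paths, the 5-second threshold, and the assumption that credentials come from ~/.my.cnf are all illustrative:

#!/bin/bash
# replication_delay_check.sh -- hypothetical sketch; run from cron once a minute.
# Logs SHOW SLAVE STATUS and SHOW PROCESSLIST whenever the replica is behind.
LOGDIR=/mysql/scripts/monitor/log        # assumed log directory
THRESHOLD=5                              # assumed delay threshold (seconds)

delay=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')

if [ -n "$delay" ] && [ "$delay" != "NULL" ] && [ "$delay" -ge "$THRESHOLD" ]; then
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') delay=${delay}s ==="
        mysql -e "SHOW SLAVE STATUS\G"
        mysql -e "SHOW PROCESSLIST;"
    } >> "$LOGDIR/replication_delay.log"
fi

It would be installed via crontab with something like * * * * * sh /mysql/scripts/monitor/replication_delay_check.sh (path hypothetical).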

  • Check replication status

The replication status shows that both the IO thread and the SQL thread are running normally (Yes), Seconds_Behind_Master is 9 seconds, and the SQL thread state is "Waiting for dependent transaction to commit", i.e. it is waiting for a transaction to commit. (A quick GTID-based check of the apply backlog is sketched after the status output below.)

[kingly@rhel7]$ 
mysql > show slave status\G
        Relay_Master_Log_File: master-bin.039806
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 927883985
              Relay_Log_Space: 929519379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 9
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 157bcec0-fc94-11e9-a77d-48df372eb155
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960
            Executed_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,
68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: 
           Master_TLS_Version:
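
One way to quantify the apply backlog from the output above is to subtract Executed_Gtid_Set from Retrieved_Gtid_Set: the retrieved set ends at transaction 10186081960 while the executed set only reaches 10186081526, so roughly 430 transactions had been fetched but not yet applied. For example, pasting the two sets from the status output:

mysql > SELECT GTID_SUBTRACT(
            '157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960',
            '157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645'
        ) AS not_yet_applied;
-- returns 157bcec0-fc94-11e9-a77d-48df372eb155:10186081527-10186081960
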
  • Check active sessions

From the active sessions, several DELETE statements are all in the "System lock" state. Common causes of System lock are: 1) a large number of small transactions, such as single-row updates against a large table; 2) tables without a primary key; 3) InnoDB lock contention, where a session on the replica modifies the same rows as the sql_thread; 4) excessive IO pressure. In addition, the other SQL applier (worker) threads are all waiting for transactions to commit. (An equivalent performance_schema query is sketched after the listing below.)

[kingly@rhel7]$ 
mysql >show processlist;           
2021-12-30 02:02:48     NULL    NULL    114314  Waiting for master to send event        NULL
2021-12-30 02:02:48     NULL    NULL    0       Waiting for dependent transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_48 WHERE  id=45702564
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_75 WHERE  id=47490222
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_84 WHERE  id=46076462
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_95 WHERE  id=46705915
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
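
The same per-thread picture can also be pulled from performance_schema instead of SHOW PROCESSLIST; a hedged example, assuming MySQL 5.7 multi-threaded replication:

mysql > SELECT thread_id, name, processlist_state
        FROM performance_schema.threads
        WHERE name IN ('thread/sql/slave_io',
                       'thread/sql/slave_sql',
                       'thread/sql/slave_worker');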

The next step was therefore to check whether an uncommitted transaction on the replica was blocking the SQL thread's apply work and these DELETE operations. Analyzing the binlog around the time the problem started (about 02:00 on 2021-12-30) showed no short-lived large transactions, only a small number of single-row DELETEs by primary key, which should execute quickly. Moreover, the replica is configured read-only, so it normally runs no write transactions of its own. The investigation therefore shifted to the server itself.
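
The checks behind this step can be reproduced roughly as follows; the 10-second cutoff and the time window are illustrative, and the log file name is taken from the status output above purely for illustration (on the replica the corresponding relay log could be scanned instead):

[kingly@rhel7]$ mysql -e "SELECT trx_id, trx_state, trx_started, trx_query
                          FROM information_schema.innodb_trx
                          WHERE trx_started < NOW() - INTERVAL 10 SECOND;"    # long-running, uncommitted transactions
[kingly@rhel7]$ mysqlbinlog --base64-output=decode-rows -vv \
      --start-datetime="2021-12-30 01:55:00" --stop-datetime="2021-12-30 02:10:00" \
      master-bin.039806 | grep -ci "DELETE FROM"                              # rough count of row DELETEs in the window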

  • Check server load

Analyzing the vmstat samples collected by the deployed OSWBB (OSWatcher) scripts shows that around 02:00 on 2021-12-30 only about 1-2 threads were runnable while a large number (about 40) sat in the blocked (b) column, which is clearly abnormal. The CPU columns also show roughly 10% IO wait, meaning the server was under very heavy IO pressure at the time. (An iostat cross-check is sketched after the vmstat output below.)

zzz ***Thu Dec 30 02:02:35 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 42 126088 1259068    404 110934224    0    0   169   166    0    0  0  0 99  0  0
 1 42 126088 1251800    404 110934512    0    0 612144   502 8380 4156  0  0 91  8  0
 2 41 126088 1252672    404 110934576    0    0 620544   358 8844 4115  0  1 90  9  0
zzz ***Thu Dec 30 02:02:50 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 41 126088 1256476    404 110936800    0    0   169   166    0    0  0  0 99  0  0
 1 41 126088 1249628    404 110937072    0    0 596992  2147 9037 4735  0  0 89 11  0
 2 42 126088 1247176    404 110937536    0    0 630704   605 12100 6991  1  0 87 12  0
zzz ***Thu Dec 30 02:03:05 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 42 126088 1251024    404 110939968    0    0   169   166    0    0  0  0 99  0  0
 3 37 126088 1242940    404 110940208    0    0 626448   475 8316 3920  0  0 89 10  0
 1 42 126088 1246844    404 110940472    0    0 606976 11179 12130 8121  0  0 88 11  0
zzz ***Thu Dec 30 02:03:20 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 39 126088 1247112    404 110943920    0    0   169   166    0    0  0  0 99  0  0
 1 43 126088 1240348    404 110944080    0    0 629248   442 9168 4325  0  0 88 11  0
 1 41 126088 1242900    404 110944848    0    0 578576   498 7578 3871  0  0 89 10  0
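
OSWatcher typically collects iostat alongside vmstat, so device-level saturation can be confirmed for the same window; a hedged example, with the sampling interval chosen arbitrarily:

[kingly@rhel7]$ iostat -xm 5 3      # extended stats in MB/s, 3 samples at 5-second intervals

The columns to watch are rMB/s, await and %util for the SAN-backed device: %util pinned near 100 with await climbing would match the bi bursts of around 600,000 blocks/s (roughly 600 MB/s at 1 KiB per block) visible in the vmstat output above.
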
  • Check the database error log

The database error log also offers a clue: it contains a large number of "[Note] InnoDB: page_cleaner" messages, which indicate that dirty-page flushing was running slowly and further confirm that the server was under heavy IO pressure. (A couple of related sanity checks are sketched after the log excerpt below.)

2021-12-30T01:55:17.492848+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270201857; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682298959400 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:57:18.156330+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270245889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682611334700 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:59:20.712082+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270290945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683053354600 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T02:01:20.422439+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270330881; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683501041500 waited (count) when Workers occupied = 2554254 waited when Workers occupied = 6778342662900
2021-12-30T02:01:33.788727+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5527ms. The settings might not be optimal. (flushed=74 and evicted=0, during the time.)
2021-12-30T02:01:52.969122+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18181ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:02:36.051478+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13227ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:03:22.916208+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270349313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18770439067400 waited (count) when Workers occupied = 2554266 waited when Workers occupied = 6779684449500
2021-12-30T02:03:49.254684+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13291ms. The settings might not be optimal. (flushed=72 and evicted=0, during the time.)
2021-12-30T02:05:23.176365+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270361601; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18859577400400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:06:25.751734+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 19666ms. The settings might not be optimal. (flushed=70 and evicted=0, during the time.)
2021-12-30T02:07:38.989406+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18069ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:07:44.028187+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 141; events assigned = 270376961; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18970653888400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:08:22.735042+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 15828ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:09:32.283672+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 12917ms. The settings might not be optimal. (flushed=75 and evicted=0, during the time.)
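
When page_cleaner falls this far behind, it is also worth comparing the configured flushing budget with the actual dirty-page backlog, if only to rule out undersized settings (here the limit turned out to be IO throughput rather than the settings themselves):

mysql > SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';           -- configured flush IOPS budget (and its max)
mysql > SHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners';          -- number of page_cleaner threads
mysql > SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';   -- current dirty-page backlog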

Root cause

This is a high-spec server that runs nothing but the database service, hosts a single instance, and stores its data on SAN storage (rather luxurious), so why would there be an IO bottleneck? Given the time the problem occurred, the backup job came to mind. The backup schedule is shown below: an incremental backup runs at 02:00 every day.

[kingly@rhel7]$crontab -l
00 00 * * 7 sh /mysql/scripts/backup/mysql_backup.sh FULL 1>/dev/null 2>&1
00 2 * * *   sh /mysql/scripts/backup/mysql_backup.sh INCR 1>/dev/null 2>&1


Why did this only become a problem recently? The database holds several terabytes of data and the business has grown considerably of late. Because the backup takes a long time, and to keep its IO from affecting other systems behind the same switch, we had asked the storage team to rate-limit this server's switch port. However, when xtrabackup starts a backup it copies the .ibd files directly, producing a very large burst of IO. The server itself has plenty of IO capacity, so this traffic piled up at the rate-limited switch port, which in turn slowed the server's own writes to the SAN. That is why the server showed heavy IO wait, the SQL thread applied events slowly, and replication fell behind.
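
One way to smooth this burst at the source, rather than (or in addition to) limiting the switch port, is to throttle the backup itself. A hedged sketch, assuming the backup script wraps xtrabackup; the paths and the throttle/parallel values are illustrative and would need tuning against the available SAN bandwidth:

[kingly@rhel7]$ xtrabackup --backup --throttle=200 --parallel=2 \
      --target-dir=/mysql/backup/incr_$(date +%F) \
      --incremental-basedir=/mysql/backup/full_latest
# --throttle caps the number of read/write pairs per second during --backup,
# spreading the .ibd copy IO out instead of letting it slam the rate-limited port.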

So the cause of this MySQL replication delay was insufficient IO capacity.

Summary

There are many possible causes of MySQL replication delay, and I will write up more cases when I get the chance. Delay caused by insufficient IO capacity is probably less common in day-to-day operations, but if your servers are modestly provisioned or IO-constrained, it is worth keeping in mind.

================ You are also welcome to follow my articles on other platforms ================
WeChat Official Account | 老王聊MySQL
Zhihu | 老王聊MySQL | https://www.zhihu.com/people/old_king
CSDN | 老王聊MySQL | https://blog.csdn.net/wly619?spm=1010.2135.3001.5421
Jianshu | 老王聊MySQL | https://www.jianshu.com/u/80e4c9cd3af5
