1. First, look at the log messages
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Starting automatic rewriting of AOF on 107914% growth
Background append only file rewriting started by pid 4143
AOF rewrite child asks to stop sending diffs.
Parent agreed to stop sending diffs. Finalizing AOF...
Concatenating 0.00 MB of AOF diff received from parent.
SYNC append only file rewrite performed
AOF rewrite: 2 MB of memory used by copy-on-write
Background AOF rewrite terminated with success
Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
Background AOF rewrite finished successfully
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
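To get a feel for how often the warning recurs, you can count it directly in the Redis log. This is only a quick check; the log path below is an example and should be replaced with whatever your logfile directive points to:
grep -c "Asynchronous AOF fsync is taking too long" /var/log/redis/redis-server.log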
2. Check the relevant Redis configuration
appendonly yes # enable AOF persistence
appendfsync everysec # AOF fsync policy: flush to disk once per second
aof-use-rdb-preamble yes # use the mixed RDB + AOF format for rewrites
aof-load-truncated yes # when loading the AOF at startup, tolerate a truncated tail and load as many valid commands as possible
aof-rewrite-incremental-fsync yes # fsync the rewritten AOF in batches, making good use of sequential I/O
no-appendfsync-on-rewrite no # favor minimal data loss: with no, at most ~2s of data can be lost; with yes, up to ~30s
auto-aof-rewrite-min-size 67108864 # minimum AOF size to trigger a rewrite: 64 MB
auto-aof-rewrite-percentage 100 # rewrite when (aof_current_size - aof_base_size) / aof_base_size reaches 100%
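It is worth confirming what the running instance is actually using, in case the config file and the live settings have drifted apart. This is purely a read-back, not a change:
redis-cli CONFIG GET appendfsync
redis-cli CONFIG GET no-appendfsync-on-rewrite
redis-cli CONFIG GET auto-aof-rewrite-min-size
redis-cli CONFIG GET auto-aof-rewrite-percentage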
3. Check the monitoring and analyze the cause
Figure 1: combined with the monitoring you can see that the aof_delayed_fsync counter keeps climbing, which means the AOF fsync is continuously being delayed (blocked).
Figure 2: the AOF rewrite condition described above is already satisfied, so Redis is rewriting the AOF over and over.
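The same numbers shown in the monitoring can be read straight from Redis: aof_delayed_fsync counts how many times an fsync had to be deferred because the previous one had not finished, and aof_base_size / aof_current_size are the inputs to the rewrite-percentage formula above:
redis-cli INFO persistence | grep -E "aof_delayed_fsync|aof_base_size|aof_current_size|aof_rewrite_in_progress"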
Causes:
1. The client uses Redis as a queue and, worried about losing data, turned on AOF persistence. The values in the queue are large, roughly 30 KB each, even though overall memory usage on the monitoring does not look high.
2. These large writes pile up in the AOF file, so it quickly hits the rewrite threshold and Redis keeps rewriting it.
3. Because no-appendfsync-on-rewrite is set to no, Redis keeps fsyncing the AOF even while a rewrite is in progress; with rewrites running almost constantly and consuming the disk bandwidth, each fsync takes too long and the AOF writes get blocked.
no-appendfsync-on-rewrite no / appendfsync everysec
The AOF is flushed to disk roughly once per second. In practice the window is not exactly 1s: in the flush logic the main thread checks whether more than 2s have passed since the last completed fsync, so at most about 2s of data can be lost.
no-appendfsync-on-rewrite yes / appendfsync everysec = appendfsync no (while a rewrite is running)
In that case the data sitting in the OS cache is only written to disk when Linux's own sync/writeback runs, which by default happens about every 30s, so at most about 30s of data can be lost.
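If the blocking during rewrites hurts more than the larger data-loss window, the trade-off can be flipped. This is only a sketch of the alternative setting, with the risk spelled out in the comment:
no-appendfsync-on-rewrite yes # skip AOF fsync while a rewrite is running; avoids the blocking above, but up to ~30s of writes may be lost if Redis crashes during a rewrite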
4. Solution
Redis is best used as a cache. Using it as a queue and additionally relying on AOF for persistence is not recommended; it is better to move the queueing workload to dedicated message-queue middleware such as Kafka, RabbitMQ, or RocketMQ. If you still want to use Redis for this, disable AOF persistence and keep the values being pushed small, so that Redis does not get blocked. As for the data-loss concern, add an external compensation mechanism so that if Redis crashes or some other unexpected failure happens, the data can simply be re-pushed.
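If you do follow the suggestion to drop AOF for this workload, it can be turned off on the running instance and then persisted back to the config file. Note that CONFIG REWRITE only works when Redis was started from a config file:
redis-cli CONFIG SET appendonly no
redis-cli CONFIG REWRITE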