Background
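To keep data consistent across the nodes of a Cassandra cluster, repair has to be run regularly. Once the data set reaches a certain size, however, repair is far from trivial: it runs into all kinds of problems and often fails. This post collects some common errors and how to resolve them.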
The "Some repair failed" error
Running nodetool repair keyspace table may produce the following error message:
java.lang.RuntimeException: Repair job has failed with the error message: [2020-08-28 16:27:23,499] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
This message alone gives no hint about where the problem lies; the details have to be looked up in logs/system.log.
Use grep -A10 ERROR logs/system.log to inspect the log. If the error is a Validation failed error, for example:
... Validation failed in /10.10.10.45
at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:64) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:182) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:493) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:162) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.11.2.jar:3.11.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
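This means the data on node 10.10.10.45 is corrupted and failed validation. In that case, run scrub on that node to discard the corrupted data, then go back to the previous step and run repair again.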
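When the log is long, it can help to pull out just the endpoints that failed validation. The following is a minimal sketch, assuming the log lines have the "Validation failed in /<ip>" form shown above and that the nodes use IPv4 addresses; failed_validation_nodes is a hypothetical helper name, not part of Cassandra.

```shell
# Hypothetical helper: print the unique endpoints named in
# "Validation failed in /<ip>" lines of a Cassandra system.log.
# Assumes IPv4 addresses, as in the log excerpt above.
failed_validation_nodes() {
    local log_file="$1"
    grep -oE 'Validation failed in /[0-9]+(\.[0-9]+){3}' "$log_file" \
        | sed 's|.*/||' \
        | sort -u
}

# Example:
# failed_validation_nodes logs/system.log
```

Each address it prints is a candidate for nodetool scrub keyspace table, run on that node, before repair is attempted again.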
If it is some other error, for example:
ERROR [GossipTasks:1] 2020-08-28 15:44:16,628 RepairSession.java:338 - [repair #f22193b0-e900-11ea-aee6-ef48888d996a] session completed with the following error
java.io.IOException: Endpoint /10.10.10.45 died
at org.apache.cassandra.repair.RepairSession.convict(RepairSession.java:337) ~[apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:307) [apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:802) [apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:68) [apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:194) [apache-cassandra-3.11.2.jar:3.11.2]
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) [apache-cassandra-3.11.2.jar:3.11.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_171]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_171]
Errors of this kind show up when a node's disk or network is under heavy load. In that case, simply run repair again.
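Re-running by hand gets tedious on a busy cluster, so the retry can be scripted. This is only a sketch: retry is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes nodetool exits with a non-zero status when repair fails, which is worth verifying on your version.

```shell
# Hypothetical helper: run a command up to max_attempts times,
# sleeping delay_seconds between failed attempts.
retry() {
    local max_attempts="$1" delay_seconds="$2"
    shift 2
    local attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay_seconds}s" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
}

# Example (keyspace and table are placeholders, as in the text):
# retry 3 600 nodetool repair keyspace table
```

A generous delay gives the loaded node time to recover before the next attempt.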
Repair process hangs
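A repair can also appear to hang: the repair process is still there, but no node is running any compaction task (repair and scrub trigger compaction tasks on the nodes involved, which can be checked with the compactionstats command).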
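Checking every node by eye is error-prone, so the idle test can be scripted. The sketch below assumes that an idle node prints nothing except a "pending tasks: 0" line from nodetool compactionstats, and treats any other non-blank line as activity; compaction_idle is a hypothetical helper, and the exact output format should be confirmed against your Cassandra version.

```shell
# Hypothetical helper: read `nodetool compactionstats` output on stdin and
# succeed only if the node looks idle, i.e. the output contains
# "pending tasks: 0" and no other non-blank line.
compaction_idle() {
    awk '
        /^[[:space:]]*$/ { next }                                       # ignore blank lines
        /^pending tasks:[[:space:]]*0[[:space:]]*$/ { saw_zero = 1; next }
        { other = 1 }                                                   # anything else counts as activity
        END { exit (saw_zero && !other) ? 0 : 1 }
    '
}

# Example, run against one node (the address is the one from the logs above):
# nodetool -h 10.10.10.45 compactionstats | compaction_idle && echo "idle"
```

If every node in the cluster reports idle while the repair process is still running, the repair is probably stuck.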
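In that case, try switching the repair mode. For example, if a full repair with repair -full -pr hangs, try an incremental repair instead.
On 3.x it is best to use incremental repair; with no extra options, repair is incremental by default.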
Workflow summary
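- Check the logs to find out which node has the problem
- Run scrub on that node to discard the corrupted data
- Run repair again
- Run listsnapshots to view the snapshots (scrub creates a snapshot)
- Run clearsnapshot to remove the snapshots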
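
The steps above can be sketched as a small script, meant to be run on the node that failed validation. The function names and the NODETOOL override are hypothetical and exist only for illustration. Snapshot cleanup is kept as a separate step so the listing can be reviewed first; on 3.x, clearsnapshot with only a keyspace removes every snapshot of that keyspace, and newer versions may require a tag or an explicit "all" option.

```shell
# NODETOOL can be overridden for a dry run, e.g. NODETOOL="echo nodetool".
# It is expanded unquoted on purpose so that such an override splits into words.
NODETOOL="${NODETOOL:-nodetool}"

# Hypothetical wrapper: scrub, repair again, then list snapshots.
scrub_and_repair() {
    local keyspace="$1" table="$2"
    # Discard corrupted data on this node (scrub creates a snapshot).
    $NODETOOL scrub "$keyspace" "$table" || return 1
    # Re-run the repair that failed earlier.
    $NODETOOL repair "$keyspace" "$table" || return 1
    # Show the snapshots so the one created by scrub can be reviewed.
    $NODETOOL listsnapshots || return 1
}

# Hypothetical wrapper: remove the snapshots of one keyspace after review.
clear_snapshots() {
    local keyspace="$1"
    $NODETOOL clearsnapshot "$keyspace"
}

# Example (keyspace and table are placeholders):
# scrub_and_repair keyspace table && clear_snapshots keyspace
```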