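On one occasion we suddenly noticed that many image requests in production were failing with HTTP 500 responses. For legacy reasons, older image files are stored in MongoDB's GridFS, so a sudden burst of 500 errors on image retrieval most likely pointed to a problem in MongoDB.
I connected to the PRIMARY node with the mongo shell and checked the status of the replica set: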
ptcupload:PRIMARY> db.version()
2.4.8
ptcupload:PRIMARY> rs.status()
{
"set" : "ptcupload",
"date" : ISODate("2016-04-11T15:56:08Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "ptcupload-mongodb-001:27018",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 6355878,
"optime" : Timestamp(1460390155, 1),
"optimeDate" : ISODate("2016-04-11T15:55:55Z"),
"self" : true
},
{
"_id" : 1,
"name" : "ptcupload-mongodb-002:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 45351,
"optime" : Timestamp(1459863715, 2),
"optimeDate" : ISODate("2016-04-05T13:41:55Z"),
"lastHeartbeat" : ISODate("2016-04-11T15:56:08Z"),
"lastHeartbeatRecv" : ISODate("2016-04-11T15:56:08Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "syncThread: 10334 BSONObj size: 1853163520 (0x0008756E) is invalid. Size must be between 0 and 16793600(16MB) First element: que: ?type=105",
"syncingTo" : "ptcupload-mongodb-001:27018"
},
{
"_id" : 2,
"name" : "ptcupload-mongodb-003:37018",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 5992973,
"lastHeartbeat" : ISODate("2016-04-11T15:56:07Z"),
"lastHeartbeatRecv" : ISODate("2016-04-11T15:56:07Z"),
"pingMs" : 0
}
],
"ok" : 1
}
The replica set consists of one PRIMARY, one SECONDARY, and one ARBITER. The status output above shows that replication on the SECONDARY has failed:
"lastHeartbeatMessage" : "syncThread: 10334 BSONObj size: 1853163520 (0x0008756E) is invalid. Size must be between 0 and 16793600(16MB) First element: que: ?type=105"
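The SECONDARY's uptime is also only 45351 seconds (about 12.6 hours), far less than the PRIMARY's uptime. This suggests that the SECONDARY crashed and was restarted automatically about 12.6 hours earlier.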
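The same reading of the status output can be done mechanically. The following is a minimal sketch, not part of the original investigation: the helper name `summarize` is made up for illustration, and the sample object is trimmed from the `rs.status()` output above. It is written in ES5 style on the assumption that it may be pasted into an old mongo shell, where `rs.status()` could be passed in directly.

```javascript
// Sample trimmed from the rs.status() output above (only the fields used here).
var status = {
  members: [
    { name: "ptcupload-mongodb-001:27018", stateStr: "PRIMARY", uptime: 6355878,
      optimeDate: new Date("2016-04-11T15:55:55Z") },
    { name: "ptcupload-mongodb-002:27018", stateStr: "SECONDARY", uptime: 45351,
      optimeDate: new Date("2016-04-05T13:41:55Z") },
    { name: "ptcupload-mongodb-003:37018", stateStr: "ARBITER", uptime: 5992973 }
  ]
};

// Hypothetical helper: report uptime and replication lag (both in hours)
// for every data-bearing member, relative to the PRIMARY's optimeDate.
function summarize(status) {
  var primary = status.members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  })[0];
  return status.members
    .filter(function (m) { return m.optimeDate; }) // arbiters have no optime
    .map(function (m) {
      return {
        name: m.name,
        state: m.stateStr,
        // seconds -> hours, rounded to two decimals
        uptimeHours: Math.round(m.uptime / 36) / 100,
        // milliseconds -> hours, rounded to two decimals
        lagHours: Math.round((primary.optimeDate - m.optimeDate) / 36000) / 100
      };
    });
}

var report = summarize(status);
report.forEach(function (r) {
  console.log(r.name, r.state, "uptime(h):", r.uptimeHours, "lag(h):", r.lagHours);
});
```

For the SECONDARY this gives about 12.6 hours of uptime and roughly 146 hours of lag, meaning its last applied operation is about six days behind the PRIMARY. In the mongo shell, `print` can stand in for `console.log` if the latter is not available.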
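Our company's MongoDB databases are managed by dedicated database administrators, and as application-side users we have no permission to view the server logs, so investigating the root cause was not practical for us. There is, however, a way to repair this situation: remove the faulty SECONDARY from the replica set and add a fresh SECONDARY in its place.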
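The repair described above might be sketched as follows. This is an outline under assumptions, not the exact procedure the DBAs ran: `replaceSecondary` is a hypothetical wrapper, the replacement host name is invented, and the new node is assumed to start with an empty data directory so that it performs a full initial sync. `rs.remove()`, `rs.add()`, and `rs.status()` are the standard mongo shell replica set helpers; the `rs` object is passed in explicitly only so the function can be tried against a stub.

```javascript
// Hypothetical wrapper around the mongo shell's replica set helpers.
// Must be run against the PRIMARY, since only it accepts reconfiguration.
function replaceSecondary(rs, badHost, newHost) {
  rs.remove(badHost); // drop the broken member from the replica set config
  rs.add(newHost);    // add the replacement; with an empty data directory
                      // it copies all data from scratch (initial sync)
  // Return a compact view of the members for a quick sanity check.
  return rs.status().members.map(function (m) {
    return m.name + " " + m.stateStr;
  });
}

// In a mongo shell connected to the PRIMARY (the new host name is made up):
// replaceSecondary(rs, "ptcupload-mongodb-002:27018", "ptcupload-mongodb-004:27018");
```

On older releases a reconfiguration may briefly drop the shell connection, so it can be safer to run the three helpers one at a time and re-check `rs.status()` until the new member reports `SECONDARY`.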