Recently, after our project team rolled out a new business feature, the logs were flooded with exceptions. The entry that kept repeating looked like this:
2018-05-30 14:27:46,672 | WARN | monPool-worker-6 | AsyncGetHandler | 361 - org.opendaylight.snmp - 2.1.0-SNAPSHOT | Exception when sending GET request
org.snmp4j.MessageException: No buffer space available (sendto failed)
at org.snmp4j.MessageDispatcherImpl.sendPdu(MessageDispatcherImpl.java:526)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.Snmp.sendMessage(Snmp.java:1004)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.Snmp.send(Snmp.java:974)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.Snmp.send(Snmp.java:958)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.opendaylight.snmp.plugin.internal.AsyncGetHandler.sendRequest(AsyncGetHandler.java:178)[361:org.opendaylight.snmp:2.1.0-SNAPSHOT]
at org.opendaylight.snmp.plugin.internal.AsyncGetHandler.getRpcResponse(AsyncGetHandler.java:186)[361:org.opendaylight.snmp:2.1.0-SNAPSHOT]
at org.opendaylight.snmp.plugin.internal.SNMPImpl.snmpGet(SNMPImpl.java:128)[361:org.opendaylight.snmp:2.1.0-SNAPSHOT]
at org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)[124:org.opendaylight.mdsal.yang-binding:0.10.3.Carbon]
at org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)[124:org.opendaylight.mdsal.yang-binding:0.10.3.Carbon]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)[203:org.opendaylight.controller.sal-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)[203:org.opendaylight.controller.sal-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)[203:org.opendaylight.controller.sal-broker-impl:1.5.3.Carbon]
at Proxydb85dcd1_4bc5_4f96_b771_e4d4fd957676.invokeRpc(Unknown Source)[:]
at Proxy630aa470_14a8_4d02_96f2_cd4207c27a7a.invokeRpc(Unknown Source)[:]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke0(RpcServiceAdapter.java:65)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.access$000(RpcServiceAdapter.java:43)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter$RpcInvocationStrategy.invoke(RpcServiceAdapter.java:159)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke(RpcServiceAdapter.java:96)[205:org.opendaylight.controller.sal-binding-broker-impl:1.5.3.Carbon]
at com.sun.proxy.$Proxy170.snmpGet(Unknown Source)[361:org.opendaylight.snmp:2.1.0-SNAPSHOT]
at com.sdn.service.devdiscvry.impl.SnmpTask.getSnmpFuture(SnmpTask.java:249)[362:com.sdn.service.devdiscvry.impl:2.0.0.SNAPSHOT]
at com.sdn.service.devdiscvry.impl.SnmpTask.run(SnmpTask.java:95)[362:com.sdn.service.devdiscvry.impl:2.0.0.SNAPSHOT]
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)[:1.8.0_152]
at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1618)[:1.8.0_152]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)[:1.8.0_152]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)[:1.8.0_152]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)[:1.8.0_152]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)[:1.8.0_152]
Caused by: java.io.IOException: No buffer space available (sendto failed)
at java.net.PlainDatagramSocketImpl.send(Native Method)[:1.8.0_152]
at java.net.DatagramSocket.send(DatagramSocket.java:693)[:1.8.0_152]
at org.snmp4j.transport.DefaultUdpTransportMapping.sendMessage(DefaultUdpTransportMapping.java:117)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.transport.DefaultUdpTransportMapping.sendMessage(DefaultUdpTransportMapping.java:42)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.MessageDispatcherImpl.sendMessage(MessageDispatcherImpl.java:202)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
at org.snmp4j.MessageDispatcherImpl.sendPdu(MessageDispatcherImpl.java:503)[359:org.apache.servicemix.bundles.snmp4j:2.3.1.1]
... 28 more
2018-05-30 14:27:46,672 | INFO | monPool-worker-6 | SnmpTask | 362 - com.sdn.service.devdiscvry.impl - 2.0.0.SNAPSHOT | Devdiscvry: snmp 1.3.6.1.2.1.1.5 getwalk hostname nw_YQ01_192.168.151.171
2018-05-30 14:27:46,672 | ERROR | monPool-worker-6 | ExecutionList | 104 - com.google.guava - 18.0.0 | RuntimeException while executing runnable com.google.common.util.concurrent.Futures$6@4fa08f52 with executor INSTANCE
java.lang.NullPointerException
From the exception stack trace we can see that, while sending an SNMP GET message, the program logged a WARN for IOException: No buffer space available (sendto failed), and the code that ran afterwards then triggered the NullPointerException ERROR. The function that originally threw the exception is java.net.PlainDatagramSocketImpl.send(), which is the function that actually sends the UDP packet. Its declaration is:
protected native void send(DatagramPacket p) throws IOException;
This is a native method, which means the exception is raised from inside the JDK, i.e. from the underlying system call rather than from Java code.
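To make the call path concrete, here is a minimal sketch of my own (not the project's code) using a plain DatagramSocket, the same JDK path that snmp4j's DefaultUdpTransportMapping goes through; when the kernel's sendto() fails, the native layer turns the errno text into the IOException message we saw above:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpSendSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target address; snmp4j sends to UDP port 161 of each device.
        InetAddress target = InetAddress.getByName("192.168.151.171");
        byte[] payload = {0x30, 0x00};   // dummy bytes, not a real SNMP PDU

        try (DatagramSocket socket = new DatagramSocket()) {
            DatagramPacket packet = new DatagramPacket(payload, payload.length, target, 161);
            // This ends up in the native PlainDatagramSocketImpl.send(), i.e. the kernel's
            // sendto(). If sendto() returns -1 with ENOBUFS, the JDK raises
            // IOException("No buffer space available (sendto failed)").
            socket.send(packet);
        } catch (IOException e) {
            System.err.println("send failed at the kernel level: " + e.getMessage());
        }
    }
}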
A UDP packet is ultimately handed to the Linux kernel network stack through system calls such as send, write and sendto. Checking the error codes these calls can return, the two that are related to buffer space are:
ENOBUFS
The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
ENOMEM No memory available.
ENOBUFS means the output queue of a network interface is full; ENOMEM means the kernel needed to allocate memory somewhere but the memory resource (physical memory or a logical quota) was exhausted. Note that "No buffer space available" is exactly the strerror() text for ENOBUFS, which matches the message in the Java exception.
At the same time, checking the kernel log /var/log/messages showed that the kernel was printing a large number of kernel: Neighbour table overflow messages:
# tail -f /var/log/messages
Jun 6 16:37:29 sdn-dcn-controller03 kernel: __ratelimit: 1013 callbacks suppressed
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:29 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:34 sdn-dcn-controller03 kernel: __ratelimit: 101 callbacks suppressed
Jun 6 16:37:34 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:34 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:34 sdn-dcn-controller03 kernel: Neighbour table overflow.
Jun 6 16:37:34 sdn-dcn-controller03 kernel: Neighbour table overflow.
So the neighbour table had overflowed. Looking at the kernel source that prints the kernel: Neighbour table overflow message, we can see it is hit when the ARP table is full and no old entry can be released:
/* Neighbour tables are full and nothing
   can be released. Try to shrink route cache,
   it is most likely it holds some neighbour records.
 */
if (attempts-- > 0) {
        int saved_elasticity = ip_rt_gc_elasticity;
        int saved_int = ip_rt_gc_min_interval;
        ip_rt_gc_elasticity = 1;
        ip_rt_gc_min_interval = 0;
        rt_garbage_collect(&ipv4_dst_ops);
        ip_rt_gc_min_interval = saved_int;
        ip_rt_gc_elasticity = saved_elasticity;
        goto restart;
}
if (net_ratelimit())
        printk(KERN_WARNING "Neighbour table overflow.\n");
rt_drop(rt);
return -ENOBUFS;
Checking the contents of the ARP table several times, we found that its size often hovered somewhere above 1000 entries:
# arp -an | wc
1023 6238 61876
Next, the kernel parameters related to the ARP table size:
# cat /proc/sys/net/ipv4/neigh/default/gc_interval
30
# cat /proc/sys/net/ipv4/neigh/default/gc_stale_time
60
# cat /proc/sys/net/ipv4/neigh/default/gc_thresh1
128
# cat /proc/sys/net/ipv4/neigh/default/gc_thresh2
512
# cat /proc/sys/net/ipv4/neigh/default/gc_thresh3
1024
gc_interval is the interval at which the ARP garbage collector runs. gc_stale_time is the aging time of an ARP entry: if an entry has not been used for gc_stale_time, the garbage collector may remove it. gc_thresh1 is the threshold below which no garbage collection happens at all; gc_thresh2 is the soft limit above which garbage collection will run within 5 seconds; and gc_thresh3 is the hard maximum size of the ARP table. When the table reaches gc_thresh3, the kernel runs garbage collection and tries to release stale entries, and if nothing can be released it reports the Neighbour table overflow error.
With the current values, the ARP table holds at most 1024 entries and an entry ages out after 60 s. After discussing with the module's developers and going through the code, the background of the failure became clear: one module periodically sends SNMP GET requests, from many threads in parallel, to several thousand machines in the same LAN. Each target host needs its own ARP entry, so a limit of 1024 entries is clearly too small; the kernel reports Neighbour table overflow, and the application sees the No buffer space available (sendto failed) exception.
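The request pattern implied by the stack trace (CompletableFuture tasks on the common ForkJoinPool, one per device) looks roughly like the following sketch; the class and method names are placeholders of mine, not the actual SnmpTask code:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Simplified sketch of the problematic fan-out: one SNMP GET task per device is submitted
// in the same burst, so the kernel needs a separate ARP entry for every target at once.
public class SnmpFanOutSketch {
    public static void pollAll(List<String> deviceIps) {
        for (String ip : deviceIps) {                         // several thousand IPs in one LAN
            CompletableFuture.runAsync(() -> snmpGet(ip));    // runs on the common ForkJoinPool
        }
    }

    private static void snmpGet(String ip) {
        // Placeholder for the real SNMP GET (snmp4j sending over UDP port 161).
        // With gc_thresh3 = 1024 and thousands of concurrent targets, the underlying
        // sendto() starts failing with ENOBUFS once the neighbour table is full.
    }
}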
As a temporary fix, raise the ARP table hard limit gc_thresh3:
# echo "10240" > /proc/sys/net/ipv4/neigh/default/gc_thresh3
or make the change permanent:
# vim /etc/sysctl.conf
net.ipv4.neigh.default.gc_thresh3 = 10240
# sysctl -p
After the change, the exception shown at the beginning no longer appeared in the logs. Checking the ARP table size again, the peak was:
# ip nei | wc
5202 27946 273768
Suggested remedies
1 Increase the ARP table size (the gc_thresh* parameters).
2 Load-balance the devices across multiple controllers, so that each controller only sends SNMP GET requests to a subset of the devices and needs fewer ARP entries.
3 Within each polling cycle, spread the SNMP GET requests over several time slices, sending only part of the requests in each slice, and make the slice interval longer than the ARP entry aging time (see the sketch below).
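As a rough illustration of remedy 3 (the class name, batch size and pause below are assumptions for the sketch, not the project's actual code), the fan-out shown earlier could be throttled into batches whose size stays below gc_thresh3 and whose spacing exceeds gc_stale_time:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch of remedy 3: send the SNMP GETs in batches, so that at any moment the number of
// ARP entries in use stays below gc_thresh3, and pause longer than gc_stale_time between
// batches so that entries created by the previous batch can age out and be reclaimed.
public class ThrottledSnmpPollSketch {
    private static final int BATCH_SIZE = 500;              // assumption: comfortably below gc_thresh3
    private static final long BATCH_PAUSE_SECONDS = 90;     // assumption: longer than gc_stale_time (60 s)

    public static void pollInBatches(List<String> deviceIps) throws InterruptedException {
        for (int start = 0; start < deviceIps.size(); start += BATCH_SIZE) {
            List<String> batch = deviceIps.subList(start, Math.min(start + BATCH_SIZE, deviceIps.size()));
            List<CompletableFuture<Void>> inFlight = new ArrayList<>();
            for (String ip : batch) {
                inFlight.add(CompletableFuture.runAsync(() -> snmpGet(ip)));
            }
            // Wait for the whole batch before starting the next one.
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            TimeUnit.SECONDS.sleep(BATCH_PAUSE_SECONDS);     // let old ARP entries be garbage-collected
        }
    }

    private static void snmpGet(String ip) {
        // Placeholder for the real SNMP GET over UDP port 161 (e.g. via snmp4j).
    }
}

With the batch size kept well under the ARP table limit and the pause longer than the aging time, entries created by one batch can be reclaimed before the next batch starts, so the steady-state neighbour table stays small.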