记一次应用hang起问题处理
一. 程序主体
处理逻辑应用开起10个线程分批处理任务,主体代码如下:
private Executor executor = Executors.newFixedThreadPool(THREAD_SIZE);
for (LoanGroup group : groups) {
LOGGER.info("商户ID:{}转换资产开始...", group.getMerchantId());
Integer totalSize = loanMapper.countNeedCastLoans(group.getMerchantId());
Integer times = totalSize % BATCH_SIZE == 0 ? totalSize / BATCH_SIZE : totalSize / BATCH_SIZE + 1;
CountDownLatch countDownLatch = new CountDownLatch(times);
for (int i = 0; i < times; i++) {
final int time = i;
executor.execute(new Runnable() {
@Override
public void run() {
try {
processBatch(group.getMerchantId(), time);
} catch (Exception e) {
LOGGER.error("商户ID:{}开启多线程转换,转换批次:{}...错误:{}", group.getMerchantId(), time, e);
} finally {
countDownLatch.countDown();
}
}
});
}
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
二. 问题现象
应用执行过程中发现任务日志不再打印,且可以确定的是当前任务并没有执行完毕,通过jstat -gcutil 1000 10 采样垃圾回收情况发现连续10次打印结果都没有变化,通过jmap -heap pid 查看虚拟机内存使用情况,发现虚拟机内存还有足够的空间使用。后又通过jstack pid dump出线程栈信息如下:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):
"Attach Listener" #47 daemon prio=9 os_prio=31 tid=0x00007ffe38e7a000 nid=0x8707 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"pool-1-thread-10" #46 prio=5 os_prio=31 tid=0x00007ffe33e8f000 nid=0x8503 waiting on condition [0x00007000125a3000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007415c10d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
"pool-1-thread-9" #45 prio=5 os_prio=31 tid=0x00007ffe33167800 nid=0x8303 runnable [0x000070001249e000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:101)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:144)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:174)
- locked <0x0000000740cf8780> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3008)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2212)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3410)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2501)
- locked <0x0000000740cf36c8> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
- locked <0x0000000740cf36c8> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
- locked <0x0000000740cf36c8> (a com.mysql.jdbc.JDBC4Connection)
at com.alibaba.druid.pool.DruidPooledPreparedStatement.execute(DruidPooledPreparedStatement.java:493)
10个线程中6个线程处于waiting状态,其他4个线程全如thread-9一样处于RUNNABLE状态。而且处于RUNNABLE状态的线程全都阻塞在at com.mysql.jdbc.util.ReadAheadInputStream.fill处。
三.分析处理
根据现象猜测可能是网络抖动丢包引起jdbc客户端始终hang在套接字读取数据处。
排查jdbc数据库连接池配置以及mybatis的statement相关配置参数,发现并没有对statement的timeout参数有相关配置,而且整个处理逻辑中也没有用到事务。
定位问题应该就是引网络抖动引起而超时设置又没有配置引起
解决方法:增加mybatis.configuration.default-statement-timeout 配置
参考资料:
1.http://www.importnew.com/2466.html
2.http://www.cubrid.org/blog/understanding-jdbc-internals-and-timeout-configuration