一次同事反馈, 说线上访问某个存储的 Spark Executor 进程卡主不动了, 本该几分钟计算完成的任务, 跑了一个多小时都没有完成. 上去看 stacktrace 如下:
"Executor task launch worker-1" daemon prio=10 tid=0x00007fee81cee000 nid=0x2bfa in Object.wait() [0x00007fee63113000]
java.lang.Thread.State: RUNNABLE
at com.amazonaws.auth.internal.AWS4SignerRequestParams.<init>(AWS4SignerRequestParams.java:85)
at com.amazonaws.auth.AWS4Signer.sign(AWS4Signer.java:184)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:800)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3644)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3597)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:680)
at com.liulishuo.protoss.s3.S3ObjectSummaryReader.getNextObjectSummary(S3ObjectSummaryReader.java:62)
反编译线上对应的jar 包, AWS4SignerRequestParams.java:85
行代码如下:
85: this.formattedSigningDate = AWS4SignerUtils.formatDateStamp(this.signingDateTimeMilli);
那问题就来了: 这一行应该是调用 AWS4SignerUtils.formatDateStamp
方法进入到下一层 stack, 怎么会卡在这里? 而且线程的状态是 java.lang.Thread.State: RUNNABLE
和 Object.wait() [0x00007fee63113000]
. 百思不得其解.
交代一下线上环境:
- OS: Amazon Linux
- JDK: java-1.7.0-openjdk-1.7.0.99
接着使用 debuginfo-install java-1.7.0-openjdk-devel-1.7.0.99-2.6.5.0.66.amzn1.x86_64
安装 debuginfo 并使用 pstack 查看 native stacktrace:
Thread 21 (Thread 0x7fee63115700 (LWP 11258)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fee8843f59b in os::PlatformEvent::park (this=0x7fee70457b00) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:5820
#2 0x00007fee8842cbc5 in ObjectMonitor::wait (this=0x24d5528, millis=millis@entry=0, interruptible=interruptible@entry=false, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/objectMonitor.cpp:1552
#3 0x00007fee88539102 in ObjectSynchronizer::waitUninterruptibly (obj=..., millis=millis@entry=0, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/synchronizer.cpp:404
#4 0x00007fee8821a8b2 in waitUninterruptibly (__the_thread__=0x7fee81cee000, this=0x7fee631137e0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/synchronizer.hpp:159
#5 instanceKlass::initialize_impl (this_oop=..., __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:462
#6 0x00007fee8821ae01 in instanceKlass::initialize (this=0x7f26b7170, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:246
#7 0x00007fee8836485b in LinkResolver::resolve_static_call (result=..., resolved_klass=..., method_name=0x7fee8224b7b0, method_signature=0x7fee6db74880, current_klass=..., check_access=<optimized out>, initialize_class=true, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:722
#8 0x00007fee88369fbe in resolve_invokestatic (__the_thread__=0x7fee81cee000, index=6, pool=..., result=...) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1171
#9 LinkResolver::resolve_invoke (result=..., recv=..., recv@entry=..., pool=..., pool@entry=..., index=6, byte=byte@entry=Bytecodes::_invokestatic, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1143
#10 0x00007fee882512ba in InterpreterRuntime::resolve_invoke (thread=0x7fee81cee000, bytecode=Bytecodes::_invokestatic) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/interpreterRuntime.cpp:712
#11 0x00007fee7d01d91a in ?? ()
#12 0x00007fee63113bb8 in ?? ()
#13 0x00007fee7d01d8e3 in ?? ()
#14 0x00000154ecf04b93 in ?? ()
#15 0x000000070fb877f8 in ?? ()
#16 0x00000007261dba70 in ?? ()
#17 0x00007fee63113b78 in ?? ()
#18 0x00000007f26b57c7 in ?? ()
#19 0x00007fee63113bf0 in ?? ()
#20 0x00000007f26b6420 in ?? ()
#21 0x0000000000000000 in ?? ()
简单翻了一下 JDK 源码(之前没看过), 猜测如下:
- JVM 在
resolve_static_call
, 说明上述的 Java 代码的地方是正确的 -
instanceKlass::initialize
的时候被锁了, 难道是 ClassLoader 的死锁? - 可以知道的是, 包含上述 Java Class 的 jar 包在 Java 进程中有三个.
#6 0x00007fee8821ae01 in instanceKlass::initialize (this=0x7f26b7170, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:246
#7 0x00007fee8836485b in LinkResolver::resolve_static_call (result=..., resolved_klass=..., method_name=0x7fee8224b7b0, method_signature=0x7fee6db74880, current_klass=..., check_access=<optimized out>, initialize_class=true, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:722
#8 0x00007fee88369fbe in resolve_invokestatic (__the_thread__=0x7fee81cee000, index=6, pool=..., result=...) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1171
补上所有的stacktrace:
暂且写到这里. 原因未明.
-- EOF --