问题现象
- 复现步骤
- 手机(Android 7.0)
- Monkey测试,Camera2 app偶现native crash
定位分析
- 相关log
- tombstone
Revision: '0' ABI: 'arm' pid: 20403, tid: 20414, name: HeapTaskDaemon >>> com.android.camera2 <<< signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr -------- Abort message: 'art/runtime/thread_list.cc:621] Check failed: cur_val == 0 (cur_val=1, 0=0) ' r0 00000000 r1 00004fbe r2 00000006 r3 00000008 r4 a0c7f978 r5 00000006 r6 a0c7f920 r7 0000010c r8 a7085900 r9 a6fae8c7 sl a0c7f2ec fp 00000001 ip 0000000b sp a0c7f1f0 lr a7224597 pc a7226df4 cpsr 20010010 d0 0000000000000000 d1 0000000000000000 d2 0000000000000000 d3 0000000000000000 d4 0000000000000000 d5 0000000000000000 d6 0000000000000000 d7 0000000000000000 d8 0000000000000000 d9 0000000000000000 d10 0000000000000000 d11 0000000000000000 d12 0000000000000000 d13 0000000000000000 d14 0000000000000000 d15 0000000000000000 d16 2e74736973726570 d17 2e6761742e676f6c d18 a72360a3a723609b d19 a72360afa72360ab d20 656d72612c613776 d21 0065006d00720061 d22 656d72612c613776 d23 0000000000000000 d24 0000003f0000003f d25 0000000000000000 d26 0000000000000000 d27 0000000000000000 d28 0000000000000000 d29 0000000000000000 d30 0000000000000000 d31 00000000a70e7940 scr 80000093 backtrace: #00 pc 00049df4 /system/lib/libc.so (tgkill+12) #01 pc 00047593 /system/lib/libc.so (pthread_kill+34) #02 pc 0001d855 /system/lib/libc.so (raise+10) #03 pc 000193a1 /system/lib/libc.so (__libc_android_abort+34) #04 pc 00017014 /system/lib/libc.so (abort+4) #05 pc 003188f5 /system/lib/libart.so (_ZN3art7Runtime5AbortEv+252) #06 pc 000b4e79 /system/lib/libart.so (_ZN3art10LogMessageD2Ev+864) #07 pc 0033c0cd /system/lib/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+1400) #08 pc 0033c2cf /system/lib/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+374) #09 pc 0016ebc5 /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector11ScopedPauseC2EPS2_+32) #10 pc 0017385d /system/lib/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+160) #11 pc 0016e5d5 /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector3RunENS0_7GcCauseEb+248) #12 pc 00191aed /system/lib/libart.so (_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEb+2364) #13 pc 001973cd /system/lib/libart.so (_ZN3art2gc4Heap12ConcurrentGCEPNS_6ThreadEb+68) #14 pc 0019bf03 /system/lib/libart.so (_ZN3art2gc4Heap16ConcurrentGCTask3RunEPNS_6ThreadE+18) #15 pc 001b42c3 /system/lib/libart.so (_ZN3art2gc13TaskProcessor11RunAllTasksEPNS_6ThreadE+30) #16 pc 7413f047 /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2845000)
- tombstone
初步分析
从堆栈来看,GC时去做suspend ,但suspend过程中检测到异常并主动abort.
01-02 00:04:16.124 20403 20414 F art : art/runtime/thread_list.cc:621] Check failed: cur_val == 0 (cur_val=1, 0=0)
01-02 00:04:16.765 20403 20414 F art : art/runtime/runtime.cc:404] Runtime aborting...
01-02 00:04:16.765 20403 20414 F art : art/runtime/runtime.cc:404] Aborting thread:
01-02 00:04:16.765 20403 20414 F art : art/runtime/runtime.cc:404] "HeapTaskDaemon" prio=5 tid=8 WaitingPerformingGc
01-02 00:04:16.765 20403 20414 F art : art/runtime/runtime.cc:404] | group="" sCount=0 dsCount=0 obj=0x12c3e4c0 self=0xa7085900
01-02 00:04:16.765 20403 20414 F art : art/runtime/runtime.cc:404] | sysTid=20414 nice=0 cgrp=default sched=0/0 handle=0xa0c7f920
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] | state=R schedstat=( 0 0 0 ) utm=16 stm=2 core=2 HZ=100
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] | stack=0xa0b7d000-0xa0b7f000 stackSize=1038KB
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] | held mutexes= "abort lock"
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] native: #00 pc 0034e109 /system/lib/libart.so (_ZN3art15DumpNativeStackERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEEiP12BacktraceMapPKcPNS_9ArtMethodEPv+128)
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] native: #01 pc 0032ea05 /system/lib/libart.so (_ZNK3art6Thread9DumpStackERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEbP12BacktraceMap+308)
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] native: #02 pc 00321ba5 /system/lib/libart.so (_ZNK3art10AbortState10DumpThreadERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEPNS_6ThreadE+24)
01-02 00:04:16.766 20403 20414 F art : art/runtime/runtime.cc:404] native: #03 pc 00321a2d /system/lib/libart.so (_ZNK3art10AbortState4DumpERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+424)
深入分析
-
关键code如下
... ... while (true) { int32_t cur_val = pending_threads.LoadRelaxed(); if (LIKELY(cur_val > 0)) { #if ART_USE_FUTEXES if (futex(pending_threads.Address(), FUTEX_WAIT, cur_val, &wait_timeout, nullptr, 0) != 0) { // EAGAIN and EINTR both indicate a spurious failure, try again from the beginning. if ((errno != EAGAIN) && (errno != EINTR)) { if (errno == ETIMEDOUT) { LOG(kIsDebugBuild ? FATAL : ERROR) << "Unexpected time out during suspend all."; } else { PLOG(FATAL) << "futex wait failed for SuspendAllInternal()"; } } } else { cur_val = pending_threads.LoadRelaxed(); CHECK_EQ(cur_val, 0); ==>check failed break; } #else // Spin wait. This is likely to be slow, but on most architecture ART_USE_FUTEXES is set. #endif } else { CHECK_EQ(cur_val, 0); break; } } ... ...
CHECK_EQ(cur_val, 0);失败的原因在于当前获取的cur_val为1,表示当前仍有一个线程在运行,并未挂起,但Google原生逻辑认为此时所有的线程已suspend.
-
查看futex()用法如下
int futex(int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3); FUTEX_WAIT This operation atomically verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents specify the duration of the wait. (This interval will be rounded up to the system clock granular‐ ity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If timeout is NULL, the call blocks indefinitely. The arguments uaddr2 and val3 are ignored.
可看到,futex()正常返回0的时候表示uaddr监测的内存内容发生变化,但变化并不等同于val变为一个特定的值(如0).
-
另外,GC suspend慢的原因在于某些thread做了耗时操作,但目前log和debug 信息已无法确认哪些threads可能存在问题(线程均已suspend).
Thread[tid = 20403, name = main ]: flag = 1, state = 90 Thread[tid = 20408, name = Jit thread pool worker thread 0 ]: flag = 1, state = 89 Thread[tid = 20409, name = Signal Catcher ]: flag = 1, state = 81 Thread[tid = 20410, name = JDWP ]: flag = 1, state = 77 Thread[tid = 20411, name = ReferenceQueueDaemon ]: flag = 1, state = 71 Thread[tid = 20412, name = FinalizerDaemon ]: flag = 1, state = 71 Thread[tid = 20413, name = FinalizerWatchdogDaemon ]: flag = 1, state = 69 Thread[tid = 20414, name = HeapTaskDaemon ]: flag = 0, state = 74 Thread[tid = 20415, name = Binder:20403_1 ]: flag = 1, state = 89 Thread[tid = 20416, name = Binder:20403_2 ]: flag = 1, state = 89 Thread[tid = 20417, name = Profile Saver ]: flag = 1, state = 89 Thread[tid = 20425, name = pool-2-thread-1 ]: flag = 1, state = 71 Thread[tid = 20427, name = OnCreateAsync Task ]: flag = 1, state = 89 Thread[tid = 20428, name = CameraRequest Task ]: flag = 1, state = 89 Thread[tid = 20429, name = pool-2-thread-2 ]: flag = 1, state = 71 Thread[tid = 20430, name = pool-2-thread-3 ]: flag = 1, state = 71 Thread[tid = 20431, name = pool-2-thread-4 ]: flag = 1, state = 71 Thread[tid = 20432, name = pool-2-thread-5 ]: flag = 1, state = 71 Thread[tid = 20433, name = Camera Handler Thread ]: flag = 1, state = 89 Thread[tid = 20435, name = Camera2 Handler Thread ]: flag = 1, state = 89 Thread[tid = 20434, name = Camera Job Dispatch Thread ]: flag = 1, state = 71 Thread[tid = 20436, name = Camera Job Dispatch Thread ]: flag = 1, state = 71 Thread[tid = 20437, name = pool-5-thread-1 ]: flag = 1, state = 71 Thread[tid = 20438, name = AsyncTask #1 ]: flag = 1, state = 68 Thread[tid = 20442, name = dreamcamera ]: flag = 1, state = 89 Thread[tid = 20491, name = RenderThread ]: flag = 1, state = 89 Thread[tid = 20496, name = AsyncTask #2 ]: flag = 1, state = 90 Thread[tid = 20497, name = pool-3-thread-1 ]: flag = 1, state = 89 Thread[tid = 20499, name = <native thread without managed peer> ]: flag = 1, state = 89
Root Cause
- futex()系统调用正常返回后,当前监控的内存内容不一定为减少为0, 判断逻辑有误.移除后可再次循环判断是否所有线程已完成suspend.
解决方案
-
patch
- Google已于去年AOSP提交关于此bug的修复patch
- Android 7.x均未包含该patch,cherry pick过来即可
- Android 8.0已包含此patch
-
验证
编译通过,手机能正常启动和运行.