CTS/GTS问题分析4
问题初探
测试命令:
run cts -m CtsOsTestCases -t android.os.cts.SeccompTest#testIsolatedServicePolicy
报错堆栈:
07-24 00:50:08.627 2633 4112 I ActivityManager: Process android.os.cts (pid 13402) has died: vis SF
07-24 00:50:08.627 2022 2022 I Zygote : Process 13402 exited due to signal (31)
07-24 00:50:08.627 13379 13399 I TestRunner: failed: testIsolatedServicePolicy(android.os.cts.SeccompTest)
07-24 00:50:08.627 13379 13399 I TestRunner: ----- begin exception -----
07-24 00:50:08.628 2633 4112 I AutoStartManagerService: MIUILOG- Reject RestartService packageName :android.os.cts uid : 10148
07-24 00:50:08.628 5696 5801 D PowerKeeper.Event: notifyAMProcDied pacakageName: android.os.cts, pid:13402
07-24 00:50:08.628 13379 13399 I TestRunner: android.os.DeadObjectException
关键在Process 13402 exited due to signal (31) 由于产生了signal 31导致的问题,导致测试进程被杀死,signal 31代表SIGSYS,即调用了bad system call导致了case fail,但是抓了bugreport里面没有tombstone,看不到堆栈,找不到调用了什么syscall,以及哪里调用了syscall导致的问题
测试期间打开tracing_on,执行:
echo 1 > /d/tracing/events/signal/enable
echo 1 > /d/tracing/tracing_on
cat trace_pipe
echo 0 > /d/tracing/tracing_on
<...>-14399 [001] d..2 923.002473: signal_generate: sig=31 errno=1 code=458753 comm=Binder:14381_3 pid=14399 grp=0 res=0
<...>-14399 [001] d..2 923.002511: signal_deliver: sig=31 errno=1 code=458753 sa_handler=0 sa_flags=0
<...>-14381 [000] d..2 923.002953: signal_deliver: sig=9 errno=0 code=0 sa_handler=0 sa_flags=0
08-01 22:05:17.549 2342 5098 I ActivityManager: Start proc 14381:android.os.cts/u0i1 for service android.os.cts/.SeccompTest$IsolatedService caller=android.os.cts
关键点:一个测试case会引发两个进程,其中一个测试进程发出sig=31,导致测试进程android.os.cts被杀死,程序退出,但是这也只能证明确实是signal 31造成的影响,看不到进一步的信息
问题分析
但是还是看不到上面的调用栈,首先在seccomp_send_sigsys加log
static void seccomp_send_sigsys(int syscall, int reason)
{
struct siginfo info;
memset(&info, 0, sizeof(info));
info.si_signo = SIGSYS;
info.si_code = SYS_SECCOMP;
info.si_call_addr = (void __user *)KSTK_EIP(current);
info.si_errno = reason;
info.si_arch = syscall_get_arch();
info.si_syscall = syscall;
printk("syscall = %d",syscall);
dump_stack();
}
<4>[ 560.550830] syscall = 120
<4>[ 560.550888] CPU: 2 PID: 13457 Comm: Binder:13442_2 Tainted: G W O 3.18.31-perf-g23a16ea #1
<4>[ 560.550912] Hardware name: Qualcomm Technologies, Inc. MSM8953 QRD SKU3 (DT)
<0>[ 560.550927] Call trace:
<4>[ 560.550961] [<ffffffc00008a028>] dump_backtrace+0x0/0x24c
<4>[ 560.550981] [<ffffffc00008a294>] show_stack+0x20/0x28
<4>[ 560.551003] [<ffffffc000dca008>] dump_stack+0x80/0xa4
<4>[ 560.551025] [<ffffffc000142c84>] seccomp_send_sigsys+0xbc/0x104
<4>[ 560.551042] [<ffffffc0001436a4>] seccomp_phase1+0x154/0x22c
<4>[ 560.551060] [<ffffffc000143870>] __secure_computing+0x18/0x40
<4>[ 560.551081] [<ffffffc000088d20>] syscall_trace_enter+0x44/0x1e8
这样只能看到内核栈,不过更进一步了,看到应该是syscall = 120造成的错误,但是这样还是不太有用,必须要对SIGSYS信号有所响应,将调用栈完整的抓出来才行
那么首先想到的是模仿其他信号,注册SISGSYS信号,在tombstone中打出信息
http://gerrit.pt.miui.com/#/c/357188/
http://gerrit.pt.miui.com/#/c/357196/
用了如上两条change(system/core/debuggerd & bionic/linker),不起作用,原因参考tombstone与debuggerd相关流程;CTS问题分析4拓展-无法抓取tombstone的原因;
但是又发现SIGSYS信号是能响应coredump的
#define sig_kernel_coredump(sig) siginmask(sig, SIG_KERNEL_COREDUMP_MASK)
431#define SIG_KERNEL_COREDUMP_MASK (\
432 rt_sigmask(SIGQUIT) | rt_sigmask(SIGILL) | \
433 rt_sigmask(SIGTRAP) | rt_sigmask(SIGABRT) | \
434 rt_sigmask(SIGFPE) | rt_sigmask(SIGSEGV) | \
435 rt_sigmask(SIGBUS) | rt_sigmask(SIGSYS) | \
436 rt_sigmask(SIGXCPU) | rt_sigmask(SIGXFSZ) | \
437 SIGEMT_MASK
那么简单了,直接抓coredump就行了
adb shell setprop persist.debug.trace 1
adb shell setenforce 0
运行一遍测试,将/data/core里面的coredump取出来
然后按照下面的步骤操作
1.在http://eng.pt.miui.com/?r=eng&dir=/symbols下载对应rom的symbols,并解压到本地;
2.执行gdb,gdb在rom源码中(prebuilts/gdb/linux-x86/bin/gdb)可找到,执行 源码根目录/prebuilts/gdb/linux-x86/bin/gdb ./gdb
3.加载app_process执行程序,若为32位应用则加载app_process32,64位应用加载app_process64, 该可执行程序位于下载的symbols中, 在gdb中执行:
(gdb) file out/target/product/****/symbols/system/bin/app_process32
4.加载coredump,在gdb中执行:
(gdb) core !system!bin!app_process32.5358.dboxed_process0
5.设置symbols,在gdb中执行:
(gdb) set sysroot out/target/product/sagit/symbols/
6.此时debug环境已经搭建好,可以通过gdb来debug native crash
(gdb) bt
#0 sched_getscheduler () at bionic/libc/arch-arm64/syscalls/sched_getscheduler.S:9
#1 0x0000007f9d641d2c in android::os::statistics::PerfEventReporter::beginReportPerfEvent (env=env@entry=0x7f91c1c900, threadPriorityInfo=...)
at out/target/product/mido/obj/SHARED_LIBRARIES/libandroid_runtime_intermediates/android/os/statistics/PerfEventReporter.cpp:60
#2 0x0000007f9d642b10 in android::os::statistics::reportConditionSatisfied (env=0x7f91c1c900, monitorId=548042869504, awakenReason=1, beginUptimeMs=<optimized out>, endUptimeMs=<optimized out>)
at out/target/product/mido/obj/SHARED_LIBRARIES/libandroid_runtime_intermediates/android/os/statistics/JavaVMSupervisionCallBacks.cpp:317
#3 0x0000007f99b53ee8 in art::Monitor::ReportConditionSatisfied (self=0x7f91c35c00, monitor=<optimized out>, wait_awaken_ms=553640, wait_start_ms=<optimized out>, awaken_reason=<optimized out>) at art/runtime/monitor.cc:439
#4 art::Monitor::Wait (this=<optimized out>, self=<optimized out>, ms=<optimized out>, ns=<optimized out>, interruptShouldThrow=<optimized out>, why=art::kWaiting) at art/runtime/monitor.cc:1018
#5 0x000000007060d84c in java.lang.Thread.sleep () from /home/weijuncheng/cts/第五十八周/HQ-23018/out/target/product/mido/symbols/system/framework/arm64/boot.oat
#6 0x000000007060d688 in java.lang.Thread.sleep () from /home/weijuncheng/cts/第五十八周/HQ-23018/out/target/product/mido/symbols/system/framework/arm64/boot.oat
#7 0x000000007060d294 in java.lang.Thread.sleep () from /home/weijuncheng/cts/第五十八周/HQ-23018/out/target/product/mido/symbols/system/framework/arm64/boot.oat
#8 0x0000007f998acaec in art_quick_invoke_static_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1542
#9 0x0000007f998b9534 in art::ArtMethod::Invoke (this=0x6f520180, self=0x7f91c35c00, args=0x7f920d9a04, args_size=8, result=0x7f920d9bf8, shorty=0x702c1dc4 <oatdata+3218884> "VJ") at art/runtime/art_method.cc:291
#10 0x0000007f99a67c44 in art::interpreter::ArtInterpreterToCompiledCodeBridge (self=0x7f91c35c00, caller=0x7f8c5242c8, code_item=0x700939c0 <oatdata+932288>, shadow_frame=0x7f920d99c0, result=0x7f920d9bf8)
at art/runtime/interpreter/interpreter_common.cc:538f
这样调用栈就已经很明显了,调用thread.sleep时因为性能打点调用了syscall(sched_getscheduler)造成的问题
问题总结
生成tombstone的信号集和coredump的信号集不一样,前者是后者的子集;
问题拓展
调研为什么修改代码后tombstone不生成的原因
首先运行一遍测试,signal trace log如下:
Binder:28925_3-28943 [002] d..2 83514.980997: signal_generate: sig=31 errno=1 code=458753 comm=Binder:28925_3 pid=28943 grp=0 res=0
Binder:28925_3-28943 [002] d..2 83514.981028: signal_deliver: sig=31 errno=1 code=458753 sa_handler=7f7cd89d5c sa_flags=18000004
Binder:28925_3-28943 [002] d..2 83514.981146: signal_generate: sig=31 errno=1 code=458753 comm=Binder:28925_3 pid=28943 grp=0 res=0
Binder:28925_3-28943 [002] d..2 83514.981154: signal_deliver: sig=31 errno=1 code=458753 sa_handler=0 sa_flags=10000000
最终还是走到了信号的默认处理流程,生成coredump
根据sa_handler=7f7cd89d5c,在运行时通过cat /proc/$(pidof "android.os.cts")/maps找到
7f7cd83000-7f7ce2c000 r-xp 00000000 b3:18 749 /system/bin/linker64
7f7ce2c000-7f7ce2d000 r--p 00000000 00:00 0 [anon:atexit handlers]
7f7ce2d000-7f7ce30000 r--p 000a9000 b3:18 749 /system/bin/linker64
7f7ce30000-7f7ce31000 rw-p 000ac000 b3:18 749 /system/bin/linker64
信号处理函数在linker64中,再通过odjdump出其汇编代码,计算得到相关的handler,结果发现就是debuggerd_signal_handler
0000000000006d5c <__dl__ZL24debuggerd_signal_handleriP7siginfoPv>:
6d5c: a9bb67fa ldmibge fp!, {r1, r3, r4, r5, r6, r7, r8, r9, sl, sp, lr}
6d60: a9015ff8 stmdbge r1, {r3, r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
6d64: a90257f6 stmdbge r2, {r1, r2, r4, r5, r6, r7, r8, r9, sl, ip, lr}
6d68: a9034ff4 stmdbge r3, {r2, r4, r5, r6, r7, r8, r9, sl, fp, lr}
6d6c: a9047bfd stmdbge r4, {r0, r2, r3, r4, r5, r6, r7, r8, r9, fp, ip, sp, lr}
但是在其中加log根本没有被调用的迹象,这个相当奇怪;为了防止计算有误,这里还将注册的SIGSYS语句注释掉重新跑了一遍,再看signal trace log
Binder:7495_3-7514 [000] d..2 198.369068: signal_generate: sig=31 errno=1 code=458753 comm=Binder:7495_3 pid=7514 grp=0 res=0
Binder:7495_3-7514 [000] d..2 198.369097: signal_deliver: sig=31 errno=1 code=458753 sa_handler=0 sa_flags=0
<...>-7510 [005] d..2 198.369218: signal_deliver: sig=9 errno=0 code=0 sa_handler=0 sa_flags=0
Signal Catcher-7501 [000] d..2 198.369230: signal_deliver: sig=9 errno=0 code=0 sa_handler=0 sa_flags=0
发现果然handler相关的log没有了;
那么问题就变为,为什么注册了信号处理函数,在trace log中也打出来了,但好像并没有执行;添加的log没打出来,且最后没有生成tombstone