文章主要列出了常见的排查工具，也有部分线上问题案例；

JVM 问题排查

JDK工具包

javap

反编译工具
javac -verbose 类名.java
java -verbose 类名
javap -c 类名
javap -verbose 类名

image.png

jps

java进程查看工具，是ps的子集，比ps查看java进程更方便
jps -l 查看本机的所有java进程 -l 代表查看完整类名

image.png

jps -m 代表查看传给main方法的参数

image.png

jps -v 代表查看传给jvm的启动参数

jinfo

比jps更全面的jvm信息查看工具
常用 jinfo <pid> 或 jinfo -flags <pid>

image.png

jstack

如果java程序崩溃生成core文件，jstack工具可以用来获得core文件的java stack和native stack(-m参数)的信息，从而可以轻松地知道java程序是如何崩溃和在程序何处发生问题。

jstack工具还可以附属到正在运行的java程序中，看到当时运行的java程序的java stack和native stack的信息, 如果现在运行的java程序呈现hung的状态，jstack是非常有用的

jstat

主要用来查看jvm的gc状态，通过 jstat -options 查看可用参数
最常用的是 jstat -gcutil <pid> 1000 10
jstat使用详解

jmap

用来查看jvm的内存对象分布
最常用的 jmap -histo <pid>

image.png

[C is a char[] 因为string用的char数组，所以比较多
[S is a short[] 一般较少
[I is a int[] 一般较多
[B is a byte[] 最基础的类型，也会比较多
[[I is a int[][] 一般较少
最常用的 jmap -dump:format=b,file=outfile <pid>
生成的dump文件可以使用 jhat 简单分析

image.png

image.png

专业分析工具为MAT 下载链接
下载完成直接打开dump文件即可

jconsole

jdk提供的界面化监控工具，功能很强大
需要在jvm启动的时候添加参数：
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=8777
-Dcom.sun.management.jmxremote.rmi.port=8777
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

image.png

jvisualvm

jdk提供的比jconsole更加强大的界面监控套件
如果要监控远程主机，跟jconsole添加的jmx参数是一样的

image.png

jdb

java调试工具

image.png

JDK参考资料

JDK里工具包
https://blog.csdn.net/tzs_1041218129/article/details/61630981

关于arthas的使用
https://alibaba.github.io/arthas/quick-start.html

Linux问题排查

系统模块一览

linux一览.png

系统模块详解

cpu

user: 表示CPU执行用户进程的时间,通常期望用户空间CPU越高越好.
sys: 表示CPU在内核运行时间,系统CPU占用率高,表明系统某部分存在瓶颈.通常值越低越好.
wait: CPU在等待I/O操作完成所花费的时间.系统不应该花费大量时间来等待I/O操作,否则就说明I/O存在瓶颈.
hirq: 系统处理硬中断所花费的时间百分比
sirq: 系统处理软中断所花费的时间百分比
util: CPU总使用的时间百分比 util = 1 - idle - iowait - steal
nice: 系统调整进程优先级所花费的时间百分比
steal: 被强制等待（involuntary wait）虚拟CPU的时间,此时hypervisor在为另一个虚拟处理器服务
ncpu: CPU的总个数

采集方式
CPU的占用率计算,都是根据/proc/stat计数器文件而来,stat文件的内容基本格式是:
cpu是总的信息,cpu0,cpu1等是各个具体cpu的信息,共有8个值,单位是ticks,分别是：

User time, 67793686

Nice time, 1353560

System time, 66172807

Idle time, 4167536491

Waiting time, 2705057

Hard Irq time, 0

SoftIRQ time, 195975

Steal time, 609768
CPU总时间=user+system+nice+idle+iowait+irq+softirq+Stl
各个状态的占用=状态的cpu时间/CPU总时间＊100%
比较特殊的是CPU总使用率的计算(util),目前的算法是: util = 1 - idle - iowait - steal

memory

free: 空闲的物理内存的大小
used: 已经使用的内存大小
buff: buff使用的内存大小,buffer is something that has yet to be "written" to disk.
cach: 操作系统会把经常访问的东西放在cache中加快执行速度,A cache is something that has been "read" from the disk and stored for later use
total: 系统总的内存大小
util: 内存使用率 util = (total - free - buff - cache) / total * 100%

采集方法
内存的计数器在/proc/meminfo,里面有一些关键项
含义就不解释了,主要介绍一下内存使用率的计算算法: util = (total - free - buff - cache) / total * 100%

load

load1: 一分钟的系统平均负载
load5: 五分钟的系统平均负载
load15:十五分钟的系统平均负载
runq: 在采样时刻,运行队列的任务的数目,与/proc/stat的procs_running表示相同意思
plit: 在采样时刻,系统中活跃的任务的个数（不包括运行已经结束的任务）

采集方法
/proc/loadavg文件中保存的有负载相关的数据：
0.00 0.01 0.00 1/271 23741
分别是1分钟负载,五分钟负载,十五分钟负载,运行进程／总进程最大的pid

只需要采集前五个数据既可得到所有信息
注意:只有当系统负载除cpu核数>1的时候,系统负载较高

traffic

bytin: 入口流量byte/s
bytout: 出口流量byte/s
pktin: 入口pkt/s
pktout: 出口pkt/s

采集方法
流量的计数器信息来自:/proc/net/dev
只需要对以eth和em开头的网卡数据进行了采集,像lo这种网卡直接就忽略掉了,流量的单位是byte。

tcp

active:主动打开的tcp连接数目
pasive:被动打开的tcp连接数目
iseg: 收到的tcp报文数目
outseg:发出的tcp报文数目
EstRes:Number of resets that have occurred at ESTABLISHED
AtmpFa:Number of failed connection attempts
CurrEs:当前状态为ESTABLISHED的tcp连接数
retran:系统的重传率

采集方法
tcp的相关计数器文件是:/proc/net/snmp
我们主要关注其中的ActiveOpens/PassiveOpens/AttemptFails/EstabResets/CurrEstab/InSegs/OutSegs/RetransSegs
主要关注一下重传率的计算方式: retran = (RetransSegs－last RetransSegs) ／ (OutSegs－last OutSegs) * 100%

udp

idgm: 收到的udp报文数目
odgm: 发送的udp报文数目
noport:udp协议层接收到目的地址或目的端口不存在的数据包
idmerr:udp层接收到的无效数据包的个数

采集方法
UDP的数据来源文件和TCP一样,也是在/proc/net/snmp

io

rrqms: The number of read requests merged per second that were issued to the device.
wrqms: The number of write requests merged per second that were issued to the device.
%rrqm: The percentage of read requests merged together before being sent to the device.
%wrqm: The percentage of write requests merged together before being sent to the device.
rs: The number of read requests that were issued to the device per second.
ws: The number of write requests that were issued to the device per second.
rsecs: The number of sectors read from the device per second.
wsecs: The number of sectors written to the device per second.
rqsize:The average size (in megabytes) of the requests that were issued to the device.
rarqsz:The average size (in megabytes) of the read requests that were issued to the device.
warqsz:The average size (in megabytes) of the write requests that were issued to the device.
qusize:The average queue length of the requests that were issued to the device.
await: The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
rawait:The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
wawait:The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
svctm: The average service time (in milliseconds) for I/O requests that were issued to the device.
util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).Device saturation occurs when this value is close to 100%.

采集方法
IO的计数器文件是:/proc/diskstats,比如:
每一行字段的含义是:

major: 主设备号

minor: 次设备号,设备号是用来区分磁盘的类型和厂家信息

name: 设备名称

rd_ios: 读完成次数,number of issued reads. This is the total number of reads > > completed successfully

rd_merges: 合并读完成次数,为了效率可能会合并相邻的读和写.从而两次4K的读在它最终被处理到磁盘上之前可能会变成一次8K的读,才被计数（和排队）,因此只有一次I/O操作

rd_sectors: 读扇区的次数,number of sectors read. This is the total number of sectors read successfully.

rd_ticks: 读花费的毫秒数,number of milliseconds spent reading. This is the total number of milliseconds spent by all reads

wr_ios: 写完成次数,number of writes completed. This is the total number of writes completed successfully

wr_merges: 合并写完成次数,number of writes merged Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O.

wr_sectors: 写扇区次数,number of sectors written. This is the total number of sectors written successfully

wr_ticks: 写花费的毫秒数,number of milliseconds spent writing. This is the total number of milliseconds spent by all writes.

cur_ios: 正在处理的输入/输出请求数,number of I/Os currently in progress. The only field that should go to zero. Incremented as requests are given to appropriate request_queue_t and decremented as they finish.

ticks: 输入/输出操作花费的毫秒数

aveq: 输入/输出操作花费的加权毫秒数

通过这些计数器可以算出来上面的每个字段的值

注意:

扇区一般都是512字节,因此有的地方除以2了

ws是指真正落到io设备上的写次数, wrqpms是指系统调用合并的写次数, 它们之间的大小关系没有可比性,因为不知道多少请求能够被合并,比如发起了100个read系统调用,每个读4K,假如这100个都是连续的读,由于硬盘通常允许最大的request为256KB,那么block层会把这100个读请求合并成2个request,一个256KB,另一个144KB,rrqpm/s为100,因为100个request都发生了合并,不管它最后合并成几个；r/s为2,因为最后的request数为2

partition磁盘分区

bfree: 分区空闲的字节
bused: 分区使用中的字节
btotl: 分区总的大小
util: 分区使用率
ifree: 可用文件结点数
itotl: 文件结点总数
iutil: 文件结点使用率

采集方法
首先通过/etc/mtab获取到分区信息,然后通过statfs访问该分区的信息,查询文件系统相关信息,包含:
然后就可以计算出tsar需要的信息,分区的字节数＝块数＊块大小＝f_blocks * f_bsize

pcsw 进程切换

cswch: 进程切换次数
proc: 新建的进程数

采集方法
计数器在/proc/stat:
分别代表进程切换次数,以及进程数

proc 进程监控

user: 某个进程用户态cpu消耗
sys: 某个进程系统态cpu消耗
total:某个进程总的cpu消耗
mem: 某个进程的内存消耗百分比
RSS: 某个进程的虚拟内存消耗,这是驻留在物理内存的一部分.它没有交换到硬盘.它包括代码,数据和栈
read: 进程io读字节
write:进程的io写字节

采集方法
/proc/pid/stat:获取进程的cpu信息 /proc/pid/status:获取进程的mem信息 /proc/pid/io:获取进程的读写IO信息

nginx

accept:总共接收的新连接数目
handle:总共处理的连接数目
reqs:总共产生请求数目
active:活跃的连接数,等于read+write+wait
read:读取请求数据的连接数目
write:向用户写响应数据的连接数目
wait:长连接等待的连接数目
qps:每秒处理的请求数
rt:平均响应时间ms
sslqps:每秒处理的SSL请求数
spdyps:每秒处理的spdy请求数
sslhst:平均ssl握手时间ms

采集方法
需要确保nginx配置location,并且能够访问curl http://localhost/nginx_status得到数据;
通过访问curl http://localhost/traffic_status能够得到如下字段的数据localhost,0,0,2,2,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0
请求到的数据每个字段的含义是:

kv 计算得到的req_status_zone指令定义变量的值,此时为domain字段

bytes_in_total 从客户端接收流量总和

bytes_out_total 发送到客户端流量总和

conn_total 处理过的连接总数

req_total 处理过的总请求数

2xx 2xx请求的总数

3xx 3xx请求的总数

4xx 4xx请求的总数

5xx 5xx请求的总数

other 其他请求的总数

rt_total rt的总数

upstream_req 需要访问upstream的请求总数

upstream_rt 访问upstream的总rt

upstream_tries upstram总访问次数

200 200请求的总数

206 206请求的总数

302 302请求的总数

304 304请求的总数

403 403请求的总数

404 404请求的总数

416 416请求的总数

499 499请求的总数

500 500请求的总数

502 502请求的总数

503 503请求的总数

504 504请求的总数

508 508请求的总数

detail_other 非以上13种status code的请求总数

lvs

stat: lvs状态,1正常
conns: 总的连接数
pktin: 收到的包数
pktout:发出的包数
bytin: 收到的字节数
bytout:发出的字节数
total: lvs所有的 session 数量, 包含 local 和 sync
local: lvs本机转发的 session 数量
lact: local session 中处于 establish 状态的数量
linact: local session 中处于非 establish 状态的数量
sync: 其他lvs同步过来的 session 数量
sact: sync session 中处于 establish 状态的数量
sinact: sync session 中处于非 establish 状态的数量
templ: 会话保持(模板) session 的数量

采集方法

内核版 lvs: 访问lvs的统计文件:/proc/net/ip_vs_stats, /proc/net/ip_vs_conn_stats netframe lvs: 访问 lvs 的命令行工具: slb_admin -ln --total --dump, appctl -cas

排查问题实践

1. CPU问题排查

cpu问题，大部分是Java程序里有bug，比如死循环等
也有部分特殊场景导致的cpu问题，比如: 1.Code Cache满导致的 JIT失效，从而导致cpu飙高 2. Docker场景，由于售卖过渡导致 cpu的steal值很高；
下面举一个死循环的排查例子:

通过top 命令找到占cpu高的进程

image.png

通过top -Hp $pid 命令找到占cpu高的线程

image.png

通过jstack $pid 命令找到占cpu高的具体代码

image.png

2. 内存问题排查

系统内存一般不会出问题，因为java开发的场景，内存都分配给JVM自己管理了，大部分场景是JVM自己内存管理问题
这里给一下系统内存常用工具：

image.png

total:总计物理内存的大小。
used:已使用多大。
free:可用有多少。
Shared:多个进程共享的内存总额。
Buffers/cached:磁盘缓存的大小。

3. 网络问题排查

如果网络不通，一般是网卡故障、端口故障、未接入到链路等，直接找SRE解决即可；
需要掌握 ifconfig ethtool route nslookup dig traceroute telnet netstat iptables

如果是网络慢（服务端接口RT很短）:

查看是否DNS解析时间过长

查看路由过程中，哪些节点有瓶颈

查看带宽是否有瓶颈

网络较慢的排查事实上比网络不通的排查更有挑战，因为很多时候可能是运营商、DNS等的原因，这些故障常常不在我们的控制范围之内，只能收集证据向其反馈或对其进行投诉。

通过 traceroute 查看网络中每一条的延迟

通过iftop 查看本机哪些网络连接占用的带宽多

通过ethtool确认本机的网卡为万兆网卡

终极大杀器 tcpdump 抓包

4. Load问题排查

Load是指正在运行running和准备好运行runnable以及不可中断uninteruptable (D状态)的进程的总数。load average就是一定时间内的load数量。
Load 高是指：IO型的App load值大于cpu核心数量*3 ， CPU型的App load值大于cpu数；
Load 高的原因：

cpu 使用率高，导致处理不过来，很多进程等待调度，导致等待队列长

不可中断之睡眠状态，通常和设备驱动等待磁盘或网络 I/O 有关
由于处于不可中断之睡眠状态的进程不会处理任何信号，所以 kill -9 也杀不掉它。解决此类进程的办法只有两个:搞定不可中断之睡眠状态进程所等待的资源，使资源可用； kill the world——重启系统。

image.png

通过vmstat可以查看秒级别的Load：
r 列表示运行和等待cpu时间片的进程数，代表采样瞬间的load，如果长期大于cpu数量，说明cpu不足，需要增加cpu。
b 列表示在等待资源的进程数，比如正在等待I/O、或者内存交换等。
si、so 正常情况应该长期为零，否则内存不够用了
bi 从块设备读入数据的总量（读磁盘）（每秒kb）
bo 块设备写入数据的总量（写磁盘）（每秒kb）

有时候线上会出现持续几秒Load高导致超时问题，但是1分钟、5分钟 Load正常；

Linux参考资料

Linux 网络问题排查
https://www.cnblogs.com/Security-Darren/p/4700387.html
Linux万兆网卡测速
http://fcsan.blog.sohu.com/199858025.html
Linux查看网络流量的命令
https://www.zhihu.com/question/19862245
Linux Load高问题排查
https://blog.csdn.net/m0_38110132/article/details/84187399#t1
https://blog.csdn.net/u011183653/article/details/19489603
Linux free命令详解
https://www.cnblogs.com/peida/archive/2012/12/25/2831814.html
Linux 进程状态
https://liam.page/2020/01/10/the-states-of-processes-on-Linux/
阿里开源组件Tsar
https://github.com/alibaba/tsar/blob/master/info.md

8、线上问题排查

8、线上问题排查

JVM 问题排查

JDK工具包

javap

jps

jinfo

jstack

jstat

jmap

jconsole

jvisualvm

jdb

JDK参考资料

Linux问题排查

系统模块一览

系统模块详解

cpu

memory

load

traffic

tcp

udp

io

partition磁盘分区

pcsw 进程切换

proc 进程监控

nginx

lvs

排查问题实践

1. CPU问题排查

通过top 命令找到占cpu高的进程

通过top -Hp $pid 命令找到占cpu高的线程

通过jstack $pid 命令找到占cpu高的具体代码

2. 内存问题排查

3. 网络问题排查

4. Load问题排查

Linux参考资料