day21 - Process management (2)

1. Managing process state

While a program is running, we can use the kill command to send a signal to its process, for example to stop it.

List the signals that kill supports on the current system:

[root@ennan ~]# kill -l
 1) SIGHUP   2) SIGINT   3) SIGQUIT  4) SIGILL   5) SIGTRAP
 6) SIGABRT  7) SIGBUS   8) SIGFPE   9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG  24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF 28) SIGWINCH    29) SIGIO   30) SIGPWR
31) SIGSYS  34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

The most commonly used signals are 1, 9 and 15:

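Number	Signal	Meaning
1	SIGHUP	Usually used to make a service reload its configuration file
9	SIGKILL	Forcibly kill the process (cannot be caught or ignored)
15	SIGTERM	Terminate the process; this is the default signal sent by kill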
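A process can install its own handlers for most signals, which is why SIGHUP can mean "reload" and SIGTERM a clean exit, while SIGKILL cannot be handled at all. The following is a minimal bash sketch of that idea; demo_worker and its log file are hypothetical names used only for illustration and do not correspond to any real service.

```shell
#!/usr/bin/env bash
# demo_worker LOGFILE
# Hypothetical worker loop: records a "reload" on SIGHUP (1) and
# exits cleanly on SIGTERM (15). SIGKILL (9) cannot be trapped.
demo_worker() {
    local log=$1
    trap 'echo "reload" >> "$log"' HUP
    trap 'echo "terminate" >> "$log"; exit 0' TERM
    echo "started" >> "$log"
    while true; do
        # The shell runs a pending trap once this sleep returns.
        sleep 1
    done
}
```

Started with `demo_worker /tmp/worker.log &`, the worker keeps the same PID after `kill -1 $!`, and only exits after a plain `kill $!`.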

Use the kill command to signal a specific process. First look up its PID:

[root@ennan ~]# ps aux | grep vsftpd
root     28085  0.0  0.0  53264   572 ?        Ss   15:00   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root     28087  0.0  0.0 112708   980 pts/0    R+   15:00   0:00 grep --color=auto vsftpd

kill -1 sends the reload signal (SIGHUP). After changing a configuration file, use it to make the service reload its configuration:

[root@ennan ~]# kill -1 28085
[root@ennan ~]# ps aux | grep vsftpd
root     28085  0.0  0.0  53264   748 ?        Ss   15:00   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root     28113  0.0  0.0 112708   980 pts/0    R+   15:01   0:00 grep --color=auto vsftpd

Send the default termination signal (SIGTERM) to stop the running service:

[root@ennan ~]# kill 28085
[root@ennan ~]# ps aux | grep vsftpd
root     28153  0.0  0.0 112708   980 pts/0    R+   15:04   0:00 grep --color=auto vsftpd

kill -9 sends the force-kill signal (SIGKILL). Use it when the service cannot be stopped normally:

[root@ennan ~]# ps aux | grep vsftpd
root     28186  0.0  0.0  53264   576 ?        Ss   15:06   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root     28193  0.0  0.0 112708   976 pts/0    R+   15:06   0:00 grep --color=auto vsftpd
[root@ennan ~]# kill -9 28186
[root@ennan ~]# ps aux | grep vsftpd
root     28199  0.0  0.0 112708   980 pts/0    R+   15:07   0:00 grep --color=auto vsftpd

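kill -9 PID forcibly kills a process. Use it with caution on stateful processes such as mysql, because the process gets no chance to clean up before it dies.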
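One common way to follow this advice is to try SIGTERM first and fall back to SIGKILL only after a timeout. The sketch below assumes bash; stop_gracefully is a hypothetical helper name, and the 10-second default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# stop_gracefully PID [TIMEOUT_SECONDS]
# Hypothetical helper: send SIGTERM, wait up to TIMEOUT seconds,
# then fall back to SIGKILL. Returns 0 if the process exited on its
# own after SIGTERM (or was already gone), 1 if SIGKILL was needed.
stop_gracefully() {
    local pid=$1 timeout=${2:-10} waited=0
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    kill -0 "$pid" 2>/dev/null || return 0
    kill -TERM "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, `stop_gracefully 28186 30` would give the process 30 seconds to exit before forcing it.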

On Linux, besides kill, we can also manage processes with killall and pkill. Neither needs a PID: they end all processes that belong to the named service.
End all nginx processes with killall:

[root@ennan ~]# ps aux | grep nginx
root     28474  0.0  0.2 125100  2112 ?        Ss   15:21   0:00 nginx: master process /usr/sbin/ngin
nginx    28475  0.0  0.3 127572  3572 ?        S    15:21   0:00 nginx: worker process
root     28485  0.0  0.0 112708   980 pts/0    R+   15:21   0:00 grep --color=auto nginx
[root@ennan ~]# killall nginx
[root@ennan ~]# ps aux | grep nginx
root     28491  0.0  0.0 112708   980 pts/0    R+   15:22   0:00 grep --color=auto nginx

End all nginx processes with pkill:

[root@ennan ~]# ps aux | grep nginx
root     28420  0.0  0.2 125100  2116 ?        Ss   15:18   0:00 nginx: master process /usr/sbin/ngin
nginx    28421  0.0  0.3 127572  3576 ?        S    15:18   0:00 nginx: worker process
root     28433  0.0  0.0 112708   980 pts/0    R+   15:19   0:00 grep --color=auto nginx
[root@ennan ~]# pkill nginx
[root@ennan ~]# ps aux | grep nginx
root     28442  0.0  0.0 112708   980 pts/0    R+   15:19   0:00 grep --color=auto nginx

Use pkill to kick out a user who is logged in remotely. This terminates every process on pts/1, including bash, so the user is forcibly logged out.

[root@ennan ~]# w
 15:24:10 up 10 days,  5:44,  2 users,  load average: 0.00, 0.04, 0.05
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    124.127.202.190  14:59    2.00s  0.08s  0.00s w
root     pts/1    124.127.202.102  15:23   17.00s  0.02s  0.02s -bash
[root@ennan ~]# pkill -9 -t pts/1
# -t selects processes by their terminal

2. Managing background processes

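Processes normally run in the foreground of a terminal, and once the terminal is closed the process ends with it. To avoid this, we move the foreground process into the background, so that it keeps running even if we close the terminal.
In the early days people appended & to put a process in the background and then used jobs, bg and fg to inspect and control it, but this is cumbersome. In real production work, screen is more convenient.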
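For comparison, the older & approach can be sketched as follows. This example additionally uses nohup, which is not covered in these notes, so that the job ignores the hangup signal sent when the terminal closes. run_detached is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# run_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: start COMMAND in the background, immune to
# SIGHUP, with its output written to LOGFILE. Prints the PID.
run_detached() {
    local log=$1
    shift
    # nohup execs the command, so $! is the PID of the command itself.
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo $!
}
```

Usage might look like `pid=$(run_detached /tmp/wget.log wget <url>)`. Unlike screen, there is no terminal to reattach to afterwards; progress can only be followed through the log file.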

2.1 Install the tool

[root@ennan ~]# yum install screen -y

2.2 Open a screen window with a given name

[root@ennan ~]# screen -S wget_CentOS

2.3 Run commands in the new window

[root@ennan ~]# wget https://mirrors.aliyun.com/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1810.iso
--2019-08-22 15:45:14--  https://mirrors.aliyun.com/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1810.iso
Resolving mirrors.aliyun.com (mirrors.aliyun.com)... 183.2.199.241, 183.2.199.242, 183.2.199.243, ...
Connecting to mirrors.aliyun.com (mirrors.aliyun.com)|183.2.199.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4588568576 (4.3G) [application/octet-stream]
Saving to: ‘CentOS-7-x86_64-DVD-1810.iso’

 2% [>                                                            ] 106,655,380 12.1MB/s  eta 6m 0s

2.4 Detach from screen without terminating the task running inside it. Note: the screen window is only really closed when you type exit.

Ctrl+a, then d
[detached from 28852.wget_CentOS]

2.5 List the screen sessions that are currently running

[root@ennan ~]# screen -list
There is a screen on:
    28852.wget_CentOS   (Detached)
1 Socket in /var/run/screen/S-root.

2.6 Reattach to a running screen

[root@ennan ~]# screen -r wget_CentOS

or

[root@ennan ~]# screen -r 28852

3. Process priority

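Priority determines which process gets resources first: the CPU serves higher-priority processes first.
Different processes can be given different scheduling priorities when they are started.
A higher nice value means a lower priority, for example +19: the process readily yields CPU time to other processes.
A lower nice value means a higher priority, for example -20: the process is less inclined to yield the CPU.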

3.1 Viewing nice values
3.1.1 Use top to view nice values. NI: the actual nice level; the default is 0.

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                           
    1 root      20   0  125476   3116   1768 S  0.0  0.3   0:25.64 systemd                                                  
    5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H                      
    7 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0

3.1.2 Use ps to view process priority

[root@ennan ~]# ps axo command,nice
COMMAND                      NI
/usr/lib/systemd/systemd --   0
[kworker/0:0H]              -20
[rcu_sched]                   0
[lru-add-drain]             -20

3.2 Use nice to start a program with a given priority. Syntax: nice -n <priority> <command>

[root@ennan ~]# nice -n -5 vim test
[root@ennan ~]# ps axo command,nice | grep vim
vim test                     -5

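When nice sets a program's priority, the final value depends on the nice value of ssh, because the number passed to nice is added to the nice value the new process inherits from its login session. For example, if ssh currently has a nice of -20 and vim is started with a nice of 5, vim actually runs at -15.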
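This additive behaviour can be checked directly: run without arguments, nice prints the current nice value, so wrapping it in another nice shows the result of an adjustment. A small sketch, with effective_nice as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# effective_nice ADJUSTMENT
# Hypothetical helper: print the nice value a command would run at
# if started from this shell with `nice -n ADJUSTMENT`.
effective_nice() {
    # The inner `nice` (no arguments) just prints its own nice value.
    nice -n "$1" nice
}
```

In a shell whose own nice value is 0, `effective_nice 5` prints 5; in a session inherited from an sshd at -20 it would print -15. Negative adjustments require root.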

3.3 Use renice to change the priority of a running process. Syntax: renice -n <priority> <pid>
3.3.1 Change the nice value of ssh

[root@ennan ~]# ps axo pid,command,nice | grep ssh
 2318 /usr/sbin/sshd -D             0
27994 sshd: root@pts/0              0
29478 grep --color=auto ssh         0
[root@ennan ~]# renice -n -20 2318
2318 (process ID) old priority 0, new priority -20

3.3.2 After logging in again, the ssh processes all have a nice of -20

[root@ennan ~]# ps axo pid,command,nice | grep ssh
 2318 /usr/sbin/sshd -D           -20
29533 sshd: root@pts/2            -20
29591 sshd: root@pts/0            -20

To avoid having trouble connecting over ssh when the server is nearly unresponsive, you can set the nice of ssh to -20.

4. System load average

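Load average is, simply put, the number of active processes per unit of time. More precisely, it is the average number of processes in the runnable or uninterruptible state over a period of time. So it covers not only processes that are using the CPU but also those waiting for the CPU and those waiting for I/O.
Example: what does a load average of 2 mean on systems with 4, 2 and 1 CPUs?
Q1. On a 4-CPU system, the CPUs are 50% idle.
Q2. On a 2-CPU system, all CPUs are exactly fully occupied.
Q3. On a 1-CPU system, half of the processes cannot get CPU time.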

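When the load average exceeds 70% of the number of CPUs, you should start investigating the cause. Once the load gets too high, processes respond more slowly, which in turn affects the normal operation of services.
Example: suppose a system with 2 CPUs shows load averages of 2.73, 6.90 and 12.98.
Over the last 1 minute the load was about 136% of CPU capacity (2.73/2 ≈ 136%), that is, roughly 36% overloaded.
Over the last 5 minutes it was 345% of capacity (6.90/2 = 345%).
Over the last 15 minutes it was 649% of capacity (12.98/2 = 649%).
Looking at the overall trend, however, the system load is gradually coming down.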
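The arithmetic above can be scripted. The sketch below assumes a Linux system with /proc/loadavg, nproc and awk available; load_percent and report_load are hypothetical helper names.

```shell
#!/usr/bin/env bash
# load_percent LOAD CPUS
# Print a load average as a percentage of CPU capacity.
load_percent() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", load / cpus * 100 }'
}

# report_load
# Hypothetical report for the current machine: reads the 1, 5 and
# 15 minute load averages and relates them to the CPU count.
report_load() {
    local cpus one five fifteen rest
    cpus=$(nproc)
    read -r one five fifteen rest < /proc/loadavg
    echo "cpus=$cpus" \
         "1min=$(load_percent "$one" "$cpus")%" \
         "5min=$(load_percent "$five" "$cpus")%" \
         "15min=$(load_percent "$fifteen" "$cpus")%"
}
```

With the numbers from the example, `load_percent 2.73 2` prints 136.5 and `load_percent 12.98 2` prints 649.0. A value above 70.0 matches the rule of thumb for when to start investigating.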

Load average case studies

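Below we walk through three scenarios, using stress, mpstat and pidstat to find the root cause of a rising load average.
stress is a Linux stress-testing tool; here it plays the role of a misbehaving process that drives the load average up.
mpstat is a multi-core CPU analysis tool that shows real-time metrics for each CPU as well as the average across all CPUs.
pidstat is a widely used process analysis tool that shows real-time per-process metrics such as CPU, memory, I/O and context switches.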

If mpstat or pidstat on your system does not show the %wait column, upgrade the sysstat package:

[root@ennan ~]# wget http://pagesperso-orange.fr/sebastien.godard/sysstat-11.7.3-1.x86_64.rpm
[root@ennan ~]# rpm -Uvh sysstat-11.7.3-1.x86_64.rpm

Scenario 1: CPU-intensive process
In the first terminal, run stress to simulate 100% CPU usage:

[root@ennan ~]# stress --cpu 1 --timeout 600
stress: info: [30367] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
# --cpu spawns N worker processes, each repeatedly computing the square root of random numbers
# --timeout stops the run after N seconds

top shows CPU usage at 100% and the load average climbing steadily.

[root@ennan ~]# top
top - 17:33:47 up 10 days,  7:53,  3 users,  load average: 4.76, 2.14, 0.84
Tasks:  88 total,   4 running,  84 sleeping,   0 stopped,   0 zombie
%Cpu(s):100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1014888 total,   143192 free,   125864 used,   745832 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   671156 avail Mem

In a third terminal, run mpstat to watch how CPU usage changes:

[root@ennan ~]# mpstat -P ALL 5
# -P ALL monitors all CPUs; the trailing 5 prints one set of data every 5 seconds
Linux 3.10.0-957.21.3.el7.x86_64 (ennan)    08/22/2019  _x86_64_    (1 CPU)
# Single-core machine, so only the rows all and 0 appear; utilization is 100%
05:32:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:32:31 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:32:31 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Use pidstat to print one set of data after a 5-second interval:

[root@ennan ~]# pidstat -u 5 1
Linux 3.10.0-957.21.3.el7.x86_64 (ennan)    08/22/2019  _x86_64_    (1 CPU)
# The data below shows that stress uses the most CPU
05:49:09 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
05:49:14 PM     0      2389    0.00    0.20    0.00    0.00    0.20     0  NetworkManager
05:49:14 PM     0      2616    0.20    0.00    0.00    0.00    0.20     0  bcm-agent
05:49:14 PM     0     30644   99.00    0.00    0.00    1.20   99.00     0  stress
05:49:14 PM     0     30658    0.00    0.20    0.00    0.00    0.20     0  pidstat

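Analysis: the experiment shows both the load average and CPU usage climbing while iowait stays at 0. stress is using almost all of the CPU (99%), so we can conclude that the rising load average is caused by the high CPU usage of stress.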

For a CPU-intensive process, heavy CPU use drives the load average up; in this case the two move together.

Scenario 2: I/O-intensive process
Run stress again, this time simulating I/O pressure by calling sync in a loop:

[root@ennan ~]# stress --io 1 --timeout 600s

top shows the load average climbing steadily, and wa rises as well.

[root@ennan ~]# top
top - 18:27:55 up 10 days,  8:48,  3 users,  load average: 3.93, 1.82, 0.99
Tasks:  92 total,   2 running,  90 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us, 94.0 sy,  0.0 ni,  0.0 id,  4.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem :  1014888 total,   101312 free,   129256 used,   784320 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   667648 avail Mem

pidstat shows that stress is consuming most of the resources.

[root@ennan ~]# pidstat -u 5 1
# Print one set of data after a 5-second interval
Linux 3.10.0-957.21.3.el7.x86_64 (ennan)    08/22/2019  _x86_64_    (1 CPU)
06:33:55 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
06:34:00 PM     0     31173    0.00   20.96    0.00   59.28   20.96     0  stress
06:34:00 PM     0     31174    0.00   23.55    0.00   55.49   23.55     0  stress
06:34:00 PM     0     31175    0.20   31.54    0.00   63.67   31.74     0  stress
06:34:00 PM     0     31176    0.00   21.96    0.00   56.29   21.96     0  stress

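Analysis: the wa value in top, together with the high system time that stress spends in sync, points to I/O pressure as the cause of the higher load average. Note that %wait in pidstat is the time a task spends waiting for a CPU, not I/O wait.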

For an I/O-intensive process, waiting for I/O also raises the load average, but CPU usage is not necessarily high.

Scenario 3: many processes
Use stress to simulate 4 processes:

[root@ennan ~]# stress -c 4 --timeout 600

top shows the load average climbing steadily, and CPU utilization rises too.

[root@ennan ~]# top
top - 19:15:18 up 10 days,  9:35,  3 users,  load average: 8.77, 5.68, 3.19
Tasks:  89 total,   5 running,  83 sleeping,   0 stopped,   1 zombie
%Cpu(s): 99.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem :  1014888 total,    94984 free,   123008 used,   796896 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   674120 avail Mem

pidstat shows that stress is consuming most of the resources.

[root@ennan ~]# pidstat -u 5 1
# Print one set of data after a 5-second interval
Linux 3.10.0-957.21.3.el7.x86_64 (ennan)    08/22/2019  _x86_64_    (1 CPU)
07:15:58 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
07:16:03 PM     0     31503   24.95    0.00    0.00   75.25   24.95     0  stress
07:16:03 PM     0     31504   24.75    0.00    0.00   75.65   24.75     0  stress
07:16:03 PM     0     31505   25.15    0.00    0.00   76.05   25.15     0  stress
07:16:03 PM     0     31506   24.95    0.00    0.00   74.45   24.95     0  stress

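Analysis: 4 processes are competing for 1 CPU, and each spends about 75% of its time waiting for the CPU (the %wait column in the output above). These processes demand more than the CPU can deliver, which leaves the CPU overloaded.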

A large number of processes waiting to be scheduled on the CPU also raises the load average; in this case CPU usage is high as well.