Problem description
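When I run a very time-consuming job over SSH, such as training a hierarchical random graph model with Markov chain Monte Carlo, the process is killed as soon as the session disconnects. After some reading, I found that this is caused by the Linux signal mechanism, specifically SIGHUP, which is described as follows: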
If the process receiving SIGHUP is a Unix shell, then as part of job control it will often intercept the signal and ensure that all stopped processes are continued before sending the signal to child processes (more precisely, process groups, represented internally by the shell as a "job"), which by default terminates them.
This can be circumvented in two ways. Firstly, the Single UNIX Specification describes a shell utility called nohup, which can be used as a wrapper to start a program and make it ignore SIGHUP by default. Secondly, child process groups can be "disowned" by invoking disown with the job id, which removes the process group from the shell's job table (so they will not be sent SIGHUP), or (optionally) keeps them in the job table but prevents them from receiving SIGHUP on shell termination.
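A minimal sketch of the nohup route described above, not something from my actual training setup: it starts two background `sleep` processes as stand-ins for a real job, sends SIGHUP to both, and checks which one survives. The 10-second duration and the variable names are illustrative assumptions.

```shell
#!/bin/sh
# Two stand-in jobs: one plain, one wrapped in nohup.
sleep 10 &
plain_pid=$!
nohup sleep 10 >/dev/null 2>&1 &
nohup_pid=$!

# Give nohup a moment to set SIGHUP to "ignore" and exec sleep.
sleep 1

# Simulate the hangup that is delivered when the session ends.
kill -HUP "$plain_pid" "$nohup_pid"
sleep 1

# The plain job is terminated by the signal; wait reports 128 + signal number.
wait "$plain_pid"
plain_status=$?

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$nohup_pid" 2>/dev/null; then nohup_alive=yes; else nohup_alive=no; fi

echo "plain job exit status: $plain_status"
echo "nohup job still alive: $nohup_alive"

# Clean up the surviving job.
kill "$nohup_pid"
```

In an interactive bash session, the other route from the quote would be to start the job normally and then run `disown -h %1`, which keeps the job in the job table but marks it so that the shell does not forward SIGHUP to it.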
Solution
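The usual advice online is to use nohup. Running a program under nohup keeps it alive after the session closes, but when standard output is a terminal, nohup redirects the output to a file (nohup.out by default). My training job logs continuously and produces a large amount of data, so an ever-growing output file was clearly not acceptable.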
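For completeness, nohup only falls back to nohup.out when standard output is a terminal, so redirecting the output explicitly avoids the file. The sketch below is an assumption-laden illustration, not what I ended up using: the `sh -c` loop is a hypothetical stand-in for the noisy training command, and it runs in a temporary directory so nothing else is touched.

```shell
#!/bin/sh
# Work in a scratch directory so the check below is unambiguous.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical noisy job; all of its output is discarded explicitly,
# so nohup has no reason to create nohup.out.
nohup sh -c 'for i in 1 2 3; do echo "epoch $i"; done' >/dev/null 2>&1 &
wait $!

if [ -e nohup.out ]; then created=yes; else created=no; fi
echo "nohup.out created: $created"
```

The obvious trade-off is that the log is gone entirely, which is one reason a terminal multiplexer can be the more comfortable choice for a job whose output you still want to watch.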
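In the end I used tmux and ran the job in a new window. The tmux session keeps running on the server after the SSH connection drops.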
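A rough sketch of the tmux workflow, assuming tmux is installed on the remote machine. The session name and the `sleep 5` command are hypothetical placeholders for the real training command. The script starts a detached session, confirms that it exists, and then removes it again.

```shell
#!/bin/sh
# Hypothetical session name; the PID suffix avoids clashing with existing sessions.
session="train_$$"

if command -v tmux >/dev/null 2>&1; then
    # -d starts the session detached, so no terminal needs to stay attached to it.
    tmux new-session -d -s "$session" 'sleep 5'

    # has-session exits with status 0 if the named session exists.
    if tmux has-session -t "$session" 2>/dev/null; then
        status=running
    else
        status=missing
    fi

    # Remove the demo session again.
    tmux kill-session -t "$session" 2>/dev/null
else
    status=no-tmux
fi

echo "session status: $status"
```

Interactively, the equivalent would be `tmux new -s train` to open the session, running the job inside it, detaching with Ctrl-b d, and coming back later over a fresh SSH connection with `tmux attach -t train`.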