Symptom
The following appears in the filebeat log:
2019-04-23T14:28:30.304+0800 WARN transport/tcp.go:36 DNS lookup failure "systemlog-collect-2.novalocal": lookup systemlog-collect-2.novalocal: too many open files
2019-04-23T14:28:39.689+0800 ERROR pipeline/output.go:74 Failed to connect: lookup systemlog-collect-1.novalocal: too many open files
Check the configured max open files
For the process (this is the value that actually applies):
[root]# cat /proc/<pid>/limits | grep "Max open files"
Max open files 1024 4096 files
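For the login shell (ulimit reports the limits of the current shell session, which a service does not necessarily share):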
[root@hdp1-hadoop-datanode-6 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127966
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
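The per-process check above can also be done from a program. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name parseMaxOpenFiles and the optional PID argument are illustrative and not part of filebeat. The values are kept as strings because a limit may be reported as "unlimited".

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseMaxOpenFiles extracts the soft and hard values from one line of
// /proc/<pid>/limits. It returns ok=false for any other line.
// (Hypothetical helper, named here for illustration only.)
func parseMaxOpenFiles(line string) (soft, hard string, ok bool) {
	const prefix = "Max open files"
	if !strings.HasPrefix(line, prefix) {
		return "", "", false
	}
	fields := strings.Fields(strings.TrimPrefix(line, prefix))
	if len(fields) < 2 {
		return "", "", false
	}
	return fields[0], fields[1], true
}

func main() {
	// Assumption: the PID is passed as the first argument; default is this process.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open limits:", err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if soft, hard, ok := parseMaxOpenFiles(sc.Text()); ok {
			fmt.Printf("soft=%s hard=%s\n", soft, hard)
		}
	}
}
```

The first value is the soft limit, which is the one that triggers "too many open files"; the second is the hard limit, the ceiling an unprivileged process may raise the soft limit to.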
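Fix: raise the limit for filebeat
filebeat is managed by systemd, so the limit that applies to it comes from the service unit rather than from the login shell's ulimit. That is why the process shows 1024 while the shell shows 32768. Setting LimitNOFILE in the service unit is enough: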
[Unit]
Description=filebeat
Documentation=https://www.elastic.co/guide/en/beats/filebeat/current/index.html
Wants=network-online.target
After=network-online.target
[Service]
LimitNOFILE=32768
ExecStart=/usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Reload systemd and restart the service
systemctl daemon-reload
systemctl restart filebeat
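Check the number of open files with lsof
Only entries whose FD column starts with a digit count toward the limit; entries such as cwd, rtd, txt and mem are not file descriptors. NODE is the file's inode number. The same file can be opened several times, and each open produces its own descriptor.
Threads and processes: the process is the unit of resource allocation, and its threads share one descriptor table. Running lsof without options may list open files once per thread, but the FD column shows that these entries are duplicates.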
[root@hdp1-hadoop-datanode-6 ~]# lsof -c filebeat
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
filebeat 30309 root cwd DIR 253,0 4096 64 /
filebeat 30309 root rtd DIR 253,0 4096 64 /
filebeat 30309 root txt REG 253,0 47580429 36674332 /usr/share/filebeat/bin/filebeat
filebeat 30309 root mem REG 253,0 61752 1373 /usr/lib64/libnss_files-2.17.so
filebeat 30309 root mem REG 253,0 2116736 1355 /usr/lib64/libc-2.17.so
filebeat 30309 root mem REG 253,0 19344 1361 /usr/lib64/libdl-2.17.so
filebeat 30309 root mem REG 253,0 143352 1381 /usr/lib64/libpthread-2.17.so
filebeat 30309 root mem REG 253,0 155064 1348 /usr/lib64/ld-2.17.so
filebeat 30309 root 0r CHR 1,3 0t0 1028 /dev/null
filebeat 30309 root 1u unix 0xffff8800025a8000 0t0 348814743 socket
filebeat 30309 root 2u unix 0xffff8800025a8000 0t0 348814743 socket
filebeat 30309 root 3r CHR 1,9 0t0 1033 /dev/urandom
filebeat 30309 root 4u a_inode 0,9 0 5868 [eventpoll]
filebeat 30309 root 5w REG 253,0 86117783 768307 /matrix/data/logs/filebeat/filebeat
filebeat 30309 root 6u sock 0,7 0t0 2873047023 protocol: TCP
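The rule that only numbered FD entries count toward the limit can be sketched as a small filter over lsof output. This is an illustration under assumptions: the column layout is the one shown above (FD is the fourth column, no TID column), and the helper names countsTowardLimit and countNumericFDs are made up for this example. On Linux, listing /proc/<pid>/fd gives the same count without parsing lsof output.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countsTowardLimit reports whether an lsof FD value such as "5w" or "cwd"
// is a numbered descriptor. Only numbered descriptors count toward the limit.
func countsTowardLimit(fd string) bool {
	return fd != "" && fd[0] >= '0' && fd[0] <= '9'
}

// countNumericFDs counts numbered descriptors in lsof output.
// Assumption: columns are COMMAND PID USER FD ..., so FD is field index 3.
// If lsof prints a TID column, the index would have to be adjusted.
func countNumericFDs(lsofOutput string) int {
	n := 0
	for _, line := range strings.Split(lsofOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 4 && countsTowardLimit(fields[3]) {
			n++
		}
	}
	return n
}

func main() {
	// Linux-only alternative: each entry in /proc/self/fd is one open descriptor.
	// The count includes the descriptor used to read the directory itself.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read fd dir:", err)
		os.Exit(1)
	}
	fmt.Println("open descriptors:", len(entries))
}
```

Applied to the listing above, the filter would skip the cwd, rtd, txt and mem rows and count only the rows 0r through 6u.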
Using lsof
https://www.cnblogs.com/peida/archive/2013/02/26/2932972.html
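Common options:
-a AND the selection options together (the default is OR)
-c <name> list files opened by processes whose command name begins with <name>
-g [pgid] select by process group ID (PGID) and show the PGID column
-d <fd> list the processes using the given file descriptor
+d <dir> list open files directly under the directory
+D <dir> list open files under the directory, recursively
-N list NFS files (lowercase -n instead disables host name resolution)
-i <cond> list network files matching the condition (4, 6, protocol, :port, @ip)
-p <pid> list files opened by the given process ID
-u <user> list files opened by the given user (login name or UID)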
Setting max open files in Go
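package main

/*
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

// rlimit_init sets both the soft and the hard RLIMIT_NOFILE to 50000.
// Raising the hard limit above its current value requires root (CAP_SYS_RESOURCE).
int rlimit_init() {
    printf("setting rlimit\n");
    struct rlimit limit;
    if (getrlimit(RLIMIT_NOFILE, &limit) == -1) {
        perror("getrlimit error"); // stderr, includes the errno text
        return 1;
    }
    limit.rlim_cur = limit.rlim_max = 50000;
    if (setrlimit(RLIMIT_NOFILE, &limit) == -1) {
        perror("setrlimit error");
        return 1;
    }
    printf("set limit ok\n");
    return 0;
}
*/
import "C"

import "os"

func main() {
	// Exit non-zero when the limit could not be changed.
	if C.rlimit_init() != 0 {
		os.Exit(1)
	}
}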
Alternatively, use the syscall package (Linux):
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Read the current soft (Cur) and hard (Max) limits.
	var rlim syscall.Rlimit
	err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim)
	if err != nil {
		fmt.Fprintln(os.Stderr, "get rlimit error: "+err.Error())
		os.Exit(1)
	}
	// Raising Max above its current value requires root (CAP_SYS_RESOURCE).
	rlim.Cur = 50000
	rlim.Max = 50000
	err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim)
	if err != nil {
		fmt.Fprintln(os.Stderr, "set rlimit error: "+err.Error())
		os.Exit(1)
	}
}
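Both versions above set the hard limit as well, which fails with a permission error for an unprivileged process once 50000 exceeds the current hard limit. A possible variant, sketched here under the assumption of Linux (where syscall.Rlimit fields are uint64), raises only the soft limit and caps the request at the hard limit. The helper names clampToHard and raiseSoftLimit are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// clampToHard returns the soft limit to request: want, capped at hard.
func clampToHard(want, hard uint64) uint64 {
	if want > hard {
		return hard
	}
	return want
}

// raiseSoftLimit raises the soft RLIMIT_NOFILE toward want without touching
// the hard limit, so it needs no extra privileges. It returns the soft limit
// in effect afterwards.
func raiseSoftLimit(want uint64) (uint64, error) {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return 0, err
	}
	target := clampToHard(want, rlim.Max)
	if rlim.Cur >= target {
		// Never lower a soft limit that is already sufficient.
		return rlim.Cur, nil
	}
	rlim.Cur = target
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return 0, err
	}
	return rlim.Cur, nil
}

func main() {
	got, err := raiseSoftLimit(50000)
	if err != nil {
		fmt.Fprintln(os.Stderr, "raise soft limit:", err)
		os.Exit(1)
	}
	fmt.Println("soft limit for open files is now", got)
}
```

With the limits shown earlier for the filebeat process (soft 1024, hard 4096), this sketch would stop at 4096, so the systemd LimitNOFILE setting is still what lifts the ceiling.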