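1. Background
A physical host in the data center failed, which triggered the virtualization platform to migrate and restart the virtual machines that had been running on it. On one of the restarted VMs, some applications failed to start.
The application startup logs showed that a hostname could not be resolved, even though that hostname already had an entry in /etc/hosts: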
xx.xx.xx.xx oa.bogon.com
ping: oa.bogon.com: Name or service not known
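Running ping oa.bogon.com as the user that owns the service process confirmed that the name really did not resolve, while nslookup oa.bogon.com, which queries DNS directly, resolved it without trouble.
After switching to root with su - root, however, ping resolved the name normally.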
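The two tools differ in their lookup path: ping resolves names through the glibc name service switch (NSS), which consults /etc/nsswitch.conf and /etc/hosts, while nslookup sends its queries straight to the DNS servers. A minimal sketch for reproducing the NSS side without ping, assuming getent is installed (it ships with glibc); nss_resolves is a hypothetical helper name, not something from the original session:

```shell
# nss_resolves NAME: succeed if NAME resolves through glibc NSS,
# i.e. the nsswitch.conf -> /etc/hosts -> DNS path that ping also uses.
# (hypothetical helper, for illustration only)
nss_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Example: run once as the service account and once as root, then compare.
#   nss_resolves oa.bogon.com && echo "NSS ok" || echo "NSS failed"
```

If this check fails for the service account but succeeds for root, the problem is likely on the NSS side rather than in DNS.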
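2. Investigation
On a Linux server, /etc/hosts normally takes priority over DNS, so why was it not taking effect here?
The failure was conditional, though: it affected only ordinary users, and root was completely unaffected.
That naturally raised the suspicion that permissions on a resolver-related file, or on network access, were involved.
Use strace to trace the system calls made during name resolution under each user: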
# su - root
# strace -e trace=open ping oa.bogon.com
open("/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libonion.so", O_RDONLY|O_CLOEXEC) = 3
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libidn.so.11", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcrypto.so.10", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/etc/pki/tls/legacy-settings", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 4
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
open("/lib64/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 4
open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
PING oa.bogon.com (10.0.8.7) 56(84) bytes of data.
open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=2 ttl=64 time=0.044 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=3 ttl=64 time=0.044 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=4 ttl=64 time=0.043 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=5 ttl=64 time=0.042 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=6 ttl=64 time=0.045 ms
64 bytes from oa.bogon.com (10.0.8.7): icmp_seq=7 ttl=64 time=0.045 ms
strace: Process 18039 detached
# su - test
$ strace -e trace=open ping oa.bogon.com
open("/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libonion.so", O_RDONLY|O_CLOEXEC) = 3
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libidn.so.11", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcrypto.so.10", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/etc/pki/tls/legacy-settings", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
ping: socket: Operation not permitted
+++ exited with 2 +++
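The non-root trace above stops at "ping: socket: Operation not permitted" before any resolver file is opened. That is a side effect of tracing: a setuid or file-capability binary such as ping runs without its extra privileges when an unprivileged user starts it under strace, so it cannot create its ICMP socket. One way around this, sketched below, is to trace an unprivileged NSS client such as getent and filter the output. resolver_opens is a hypothetical helper, and tracing both open and openat is an assumption meant to cover older and newer glibc versions.

```shell
# resolver_opens: read strace output on stdin and keep only the open/openat
# calls that touch resolver configuration files.
# (hypothetical helper, for illustration only)
resolver_opens() {
    grep -E '^(open|openat)\(.*"/etc/(nsswitch\.conf|host\.conf|hosts|resolv\.conf)"'
}

# Example, run as the affected user (assumes strace and getent are installed):
#   strace -e trace=open,openat getent hosts oa.bogon.com 2>&1 | resolver_opens
```

If a file is unreadable for the traced user, the matching line should end in an error such as EACCES instead of a file descriptor.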
Focus on the following three files, all of which appear in the root trace:
/etc/hosts
/etc/host.conf
/etc/nsswitch.conf
$ ls -l /etc/hosts
-rw-r--r-- 1 root root 257 Jul 2 11:58 /etc/hosts
$ ls -l /etc/host.conf
-rw-r--r-- 1 root root 9 Jun 7 2013 /etc/host.conf
$ ls -l /etc/nsswitch.conf
-rw-rw----. 1 root root 1746 Mar 7 2019 /etc/nsswitch.conf
$ cat /etc/nsswitch.conf
cat: /etc/nsswitch.conf: Permission denied
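The same comparison can be scripted so that it does not depend on spotting an odd mode in ls output. A minimal sketch, meant to be run as the affected user rather than root (root passes the -r test regardless of the mode bits); check_readable is a hypothetical helper, and /etc/resolv.conf in the usage line is an addition beyond the three files listed above:

```shell
# check_readable FILE...: report which files the *current* user cannot read.
# Returns 1 if at least one file is unreadable.
# (hypothetical helper, for illustration only)
check_readable() {
    rc=0
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "ok         $f"
        else
            echo "UNREADABLE $f"
            rc=1
        fi
    done
    return $rc
}

# Example, run as the service account:
#   check_readable /etc/hosts /etc/host.conf /etc/nsswitch.conf /etc/resolv.conf
```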
3. Solution
# chmod 644 /etc/hosts
# chmod 644 /etc/host.conf
# chmod 644 /etc/nsswitch.conf
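Since only /etc/nsswitch.conf had the wrong mode in the listing above, the fix can also be written so that it changes a file only when needed and reports what it changed. A sketch assuming GNU stat (stat -c), as found on typical Linux distributions; ensure_mode is a hypothetical helper:

```shell
# ensure_mode MODE FILE: chmod FILE to MODE only if its current mode differs,
# and print the change that was made.
# (hypothetical helper, for illustration only; assumes GNU stat)
ensure_mode() {
    want="$1"
    file="$2"
    have=$(stat -c '%a' "$file") || return 1
    if [ "$have" != "$want" ]; then
        echo "$file: $have -> $want"
        chmod "$want" "$file"
    fi
}

# Example, run as root:
#   ensure_mode 644 /etc/nsswitch.conf
```

Printing the old mode leaves a record of which file had drifted, which helps when looking for the cause later.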
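nsswitch.conf (name service switch configuration) lives in /etc. It specifies which sources are consulted for each type of information and in what order, and it can also specify what action the system takes when a source succeeds or fails.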
$ cat /etc/nsswitch.conf
hosts: files dns myhostname
That is: search /etc/hosts first; if that fails, query the nameservers listed in /etc/resolv.conf; if that also fails, consult myhostname, which resolves the machine's own hostname.
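To see the lookup order a machine actually uses, the hosts: line can be extracted directly. A small sketch; hosts_order is a hypothetical helper, and it prints every token after hosts:, which includes action items such as [NOTFOUND=return] if the file uses them:

```shell
# hosts_order [FILE]: print the sources on the "hosts:" line, one per line,
# in the order they are consulted. FILE defaults to /etc/nsswitch.conf.
# (hypothetical helper, for illustration only)
hosts_order() {
    conf="${1:-/etc/nsswitch.conf}"
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break   # stop at a trailing comment
                print $i
            }
            exit
        }
    ' "$conf"
}

# Example (needs read permission on the file):
#   hosts_order
```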
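4. Post-mortem
The failure of the underlying physical host caused the VMs on it to be migrated and restarted by the virtualization platform. After the restart, /etc/nsswitch.conf on the VM had mode 660 instead of the default 644.
Without the ping result under root as a point of comparison, it could have taken a while to find a direction.
Tracing ping's system calls with strace as root revealed which files the resolver opens.
If an ordinary user has no read permission on /etc/nsswitch.conf, lookups for that user cannot use /etc/hosts.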
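Because root is not affected by a missing read bit, a check run from root (for example after a VM migration) has to inspect the mode bits instead of attempting a read. A sketch assuming GNU stat; audit_world_readable is a hypothetical helper:

```shell
# audit_world_readable FILE...: flag files that lack the read bit for "others".
# Returns 1 if at least one file is flagged.
# (hypothetical helper, for illustration only; assumes GNU stat)
audit_world_readable() {
    rc=0
    for f in "$@"; do
        # %A prints the mode as e.g. "-rw-r--r--"; character 8 is the
        # read bit for "others".
        case "$(stat -c '%A' "$f")" in
            ???????r*) ;;
            *) echo "NOT world-readable: $f"; rc=1 ;;
        esac
    done
    return $rc
}

# Example, run as root:
#   audit_world_readable /etc/hosts /etc/host.conf /etc/nsswitch.conf
```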
5. References
/etc/hosts entries not being used for non-root users
https://www.unixsherpa.com/solution/etchosts-entries-not-being-used-for-non-root-users/
Cannot resolve host as non-root user
https://serverfault.com/questions/637274/cannot-resolve-host-as-non-root-user
"Can't resolve host" as user, but works fine as root
https://www.linuxquestions.org/questions/linux-networking-3/can%27t-resolve-host-as-user-but-works-fine-as-root-494270/
nslookup-OK-but-ping-fail
https://plantegg.github.io/2019/01/09/nslookup-OK-but-ping-fail
Linux: can ping an IP address but cannot ping a hostname (fix)
https://www.cnblogs.com/gaoyuechen/p/8378138.html
The /etc/nsswitch.conf file on Linux
https://www.bbsmax.com/A/Ae5RaXXLJQ
https://blog.csdn.net/waqwn/article/details/51687719
System Administration Guide: Naming and Directory Services (DNS, NIS, and LDAP)
https://docs.oracle.com/cd/E24847_01/html/E22302/a12swit-22067.html
strace, an essential Linux tool, explained
https://www.cnblogs.com/johnny666888/p/12629216.html