Prometheus Operator is often described as the definitive monitoring solution for Kubernetes clusters, but the Prometheus Operator project alone no longer ships the complete stack; the complete solution is now kube-prometheus. Project address:
https://github.com/coreos/kube-prometheus
Installation
Download the software:
#git clone https://github.com/coreos/kube-prometheus.git
Inspect the manifest files:
#cd manifests
#ls
00namespace-namespace.yaml node-exporter-clusterRole.yaml
0prometheus-operator-0alertmanagerCustomResourceDefinition.yaml node-exporter-daemonset.yaml
0prometheus-operator-0prometheusCustomResourceDefinition.yaml node-exporter-serviceAccount.yaml
0prometheus-operator-0prometheusruleCustomResourceDefinition.yaml node-exporter-serviceMonitor.yaml
0prometheus-operator-0servicemonitorCustomResourceDefinition.yaml node-exporter-service.yaml
0prometheus-operator-clusterRoleBinding.yaml prometheus-adapter-apiService.yaml
0prometheus-operator-clusterRole.yaml prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
0prometheus-operator-deployment.yaml prometheus-adapter-clusterRoleBindingDelegator.yaml
0prometheus-operator-serviceAccount.yaml prometheus-adapter-clusterRoleBinding.yaml
0prometheus-operator-serviceMonitor.yaml prometheus-adapter-clusterRoleServerResources.yaml
0prometheus-operator-service.yaml prometheus-adapter-clusterRole.yaml
alertmanager-alertmanager.yaml prometheus-adapter-configMap.yaml
alertmanager-secret.yaml prometheus-adapter-deployment.yaml
alertmanager-serviceAccount.yaml prometheus-adapter-roleBindingAuthReader.yaml
alertmanager-serviceMonitor.yaml prometheus-adapter-serviceAccount.yaml
alertmanager-service.yaml prometheus-adapter-service.yaml
grafana-dashboardDatasources.yaml prometheus-clusterRoleBinding.yaml
grafana-dashboardDefinitions.yaml prometheus-clusterRole.yaml
grafana-dashboardSources.yaml prometheus-prometheus.yaml
grafana-deployment.yaml prometheus-roleBindingConfig.yaml
grafana-serviceAccount.yaml prometheus-roleBindingSpecificNamespaces.yaml
grafana-serviceMonitor.yaml prometheus-roleConfig.yaml
grafana-service.yaml prometheus-roleSpecificNamespaces.yaml
kube-state-metrics-clusterRoleBinding.yaml prometheus-rules.yaml
kube-state-metrics-clusterRole.yaml prometheus-serviceAccount.yaml
kube-state-metrics-deployment.yaml prometheus-serviceMonitorApiserver.yaml
kube-state-metrics-roleBinding.yaml prometheus-serviceMonitorCoreDNS.yaml
kube-state-metrics-role.yaml prometheus-serviceMonitorKubeControllerManager.yaml
kube-state-metrics-serviceAccount.yaml prometheus-serviceMonitorKubelet.yaml
kube-state-metrics-serviceMonitor.yaml prometheus-serviceMonitorKubeScheduler.yaml
kube-state-metrics-service.yaml prometheus-serviceMonitor.yaml
node-exporter-clusterRoleBinding.yaml prometheus-service.yaml
In prometheus-serviceMonitorKubelet.yaml, change the port from https-metrics to http-metrics and the scheme to http:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
  - port: http-metrics
    scheme: http  # many write-ups omit the scheme change
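The two edits can also be scripted with sed. A minimal sketch; it edits a stand-in fragment written inline so it runs anywhere, whereas on a real checkout you would point the same sed command at manifests/prometheus-serviceMonitorKubelet.yaml:

```shell
# Stand-in for the endpoint stanza of prometheus-serviceMonitorKubelet.yaml;
# on a real checkout, skip the heredoc and edit the manifest directly.
cat > kubelet-endpoint.yaml <<'EOF'
  - port: https-metrics
    scheme: https
EOF

# Switch the kubelet endpoint from the HTTPS port to plain HTTP.
sed -i 's/port: https-metrics/port: http-metrics/; s/scheme: https/scheme: http/' kubelet-endpoint.yaml
cat kubelet-endpoint.yaml
```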
In alertmanager-service.yaml, add a nodePort 30093 entry:
apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9093
    targetPort: web
    nodePort: 30093
  type: NodePort
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
In grafana-service.yaml, add a nodePort 32000 entry:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 32000
  type: NodePort
  selector:
    app: grafana
In prometheus-service.yaml, add a nodePort 30090 entry:
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
    nodePort: 30090
  type: NodePort
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
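One constraint on the three nodePort values chosen above: they must fall inside the apiserver's --service-node-port-range, which defaults to 30000-32767, or the Service is rejected. A small self-contained sketch of that check:

```shell
# Default --service-node-port-range is 30000-32767; a nodePort outside it
# makes the apiserver reject the Service.
in_node_port_range() {
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

for p in 30093 32000 30090; do   # the three nodePorts used above
  if in_node_port_range "$p"; then
    echo "nodePort $p: ok"
  else
    echo "nodePort $p: outside default range" >&2
  fi
done
```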
Create the resources. The first pass reports some resources as not existing because the CRDs are still being registered, so it is recommended to run the command twice:
#kubectl apply -f .
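The first pass fails for some objects because the CRDs created in the same batch are not yet established when the custom resources that use them are submitted. Instead of re-running by hand, a retry wrapper works too; this sketch substitutes a flaky stub for `kubectl apply -f .` so it runs without a cluster:

```shell
# Retry a command until it succeeds, up to a fixed number of attempts.
apply_with_retry() {
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge 5 ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  echo "succeeded on attempt $attempt"
}

# Stub standing in for 'kubectl apply -f .': fails once, then succeeds,
# mimicking CRDs becoming established between passes.
flaky_apply() {
  if [ ! -f .crds-established ]; then
    touch .crds-established
    return 1
  fi
  return 0
}

apply_with_retry flaky_apply   # real use: apply_with_retry kubectl apply -f .
```

Here the wrapper prints `succeeded on attempt 2`, matching the "run it twice" behaviour described above.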
Check the custom resource definitions (CRDs):
#kubectl get crd | grep coreos
alertmanagers.monitoring.coreos.com 2019-06-03T09:17:48Z
prometheuses.monitoring.coreos.com 2019-06-03T09:17:48Z
prometheusrules.monitoring.coreos.com 2019-06-03T09:17:48Z
servicemonitors.monitoring.coreos.com 2019-06-03T09:17:48Z
View the newly created pods:
#kubectl -n monitoring get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 2/2 Running 0 16h 10.244.196.134 node01 <none> <none>
alertmanager-main-1 2/2 Running 0 15h 10.244.241.204 ingressnode02 <none> <none>
alertmanager-main-2 2/2 Running 0 15h 10.244.114.4 node05 <none> <none>
grafana-69c7b8468d-l8p2b 1/1 Running 0 16h 10.244.17.198 prometheus01 <none> <none>
kube-state-metrics-65b5ccc84-kwfgh 4/4 Running 0 15h 10.244.17.199 prometheus01 <none> <none>
node-exporter-62mkc 2/2 Running 0 16h 22.22.3.235 master02 <none> <none>
node-exporter-6bsrb 2/2 Running 0 16h 22.22.3.239 node04 <none> <none>
node-exporter-8b5h8 2/2 Running 0 16h 22.22.3.241 prometheus01 <none> <none>
node-exporter-chssb 2/2 Running 0 16h 22.22.3.243 ingressnode02 <none> <none>
node-exporter-dwqkc 2/2 Running 0 16h 22.22.3.240 node05 <none> <none>
node-exporter-kf2cr 2/2 Running 0 16h 22.22.3.242 ingressnode01 <none> <none>
node-exporter-krsm4 2/2 Running 0 16h 22.22.3.238 node03 <none> <none>
node-exporter-lv4gx 2/2 Running 0 16h 22.22.3.236 node01 <none> <none>
node-exporter-v5f9v 2/2 Running 0 16h 22.22.3.234 master01 <none> <none>
node-exporter-zgsr2 2/2 Running 0 16h 22.22.3.237 node02 <none> <none>
prometheus-adapter-6c75d8686d-gq8bn 1/1 Running 0 16h 10.244.17.197 prometheus01 <none> <none>
prometheus-k8s-0 3/3 Running 1 16h 10.244.140.68 node02 <none> <none>
prometheus-k8s-1 3/3 Running 1 16h 10.244.248.198 node04 <none> <none>
prometheus-operator-74d449f6b4-q6bjn 1/1 Running 0 16h 10.244.17.196 prometheus01 <none> <none>
Confirm that all of the web UIs open correctly.
Configure Prometheus
Expand the Status menu and look at Targets: the two monitoring jobs shown have no targets at all, which traces back to their ServiceMonitor objects.
Look at prometheus-serviceMonitorKubeScheduler.yaml: its selector matches a Service label, but no Service in the kube-system namespace carries k8s-app=kube-scheduler.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler
Create a new prometheus-kubeSchedulerService.yaml:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler  # matches the selector in the ServiceMonitor
spec:
  selector:
    component: kube-scheduler  # consistent with the scheduler pod's label
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
Create the kube-scheduler service:
#kubectl apply -f prometheus-kubeSchedulerService.yaml
In the same way, create prometheus-kubeControllerManagerService.yaml:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
Create the kube-controller-manager service:
#kubectl apply -f prometheus-kubeControllerManagerService.yaml
Confirm that all targets now show as UP.
Configure Grafana
Log in with admin/admin and change the password.
The data source is already wired up to Prometheus.
Custom monitoring targets
Monitoring etcd serves as the example.
Save the required etcd certificates into a Secret object named etcd-certs:
# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
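Before wiring these files into Prometheus it is worth confirming that healthcheck-client.crt actually chains to ca.crt, using `openssl verify`. The sketch below generates throwaway certificates so it is self-contained; on a real control-plane node, run the final `openssl verify` line against the files under /etc/kubernetes/pki/etcd/ instead:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Throwaway CA standing in for /etc/kubernetes/pki/etcd/ca.crt.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=etcd-ca" \
  -keyout ca.key -out ca.crt -days 1 2>/dev/null

# Throwaway client certificate standing in for healthcheck-client.crt/.key.
openssl req -newkey rsa:2048 -nodes -subj "/CN=kube-etcd-healthcheck-client" \
  -keyout healthcheck-client.key -out client.csr 2>/dev/null
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out healthcheck-client.crt -days 1 2>/dev/null

# The actual check: prints 'healthcheck-client.crt: OK' if the chain is valid.
openssl verify -CAfile ca.crt healthcheck-client.crt
```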
Modify the Prometheus resource k8s by adding a secrets entry to prometheus-prometheus.yaml:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    beta.kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.7.2
Apply prometheus-prometheus.yaml:
#kubectl apply -f prometheus-prometheus.yaml
Check inside the pod that the certificates were mounted successfully:
# kubectl -n monitoring exec -it prometheus-k8s-0 /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
# ls -l /etc/prometheus/secrets/etcd-certs/
total 0
lrwxrwxrwx 1 root root 13 Jun 4 09:12 ca.crt -> ..data/ca.crt
lrwxrwxrwx 1 root root 29 Jun 4 09:12 healthcheck-client.crt -> ..data/healthcheck-client.crt
lrwxrwxrwx 1 root root 29 Jun 4 09:12 healthcheck-client.key -> ..data/healthcheck-client.key
/prometheus $ cat /etc/prometheus/secrets/etcd-certs/ca.crt
-----BEGIN CERTIFICATE-----
MIIC9zCCAd+gAwIBAgIJAMiN3pOWJVGOMA0GCSqGSIb3DQEBCwUAMBIxEDAOBgNV
BAMMB2V0Y2QtY2EwHhcNMTkwNTI3MDgzNDExWhcNMzkwNTIyMDgzNDExWjASMRAw
DgYDVQQDDAdldGNkLWNhMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
rG1xQcAwZ67XXG84PzqIIqoqnq/zM3Ru+02PELbzgiZ4MrNPte32vZuj6HK/JDDQ
nEirgnQQxQJ6OxvnDrFVwyxveNI8jrd+FRfuh2ae0NIiqkWk88O42OioACBW6cJA
hILpIcn066+E+t2vh/3TmqMduV8eY5p8VAwRT1B04fJAQVcr0sJh3JXExppbtdWL
Z0T25QTbbbZ/I6oxLMu/NkS171R5l397rSpD2ox0NV0GASoqiitffPznOHBPa1Zs
UwOlQnZlWaBM5XQHFhRQTG/Bxxhe45azmmPT3DGCpATk+/GnYDPnt4TSZiX9gZ6O
beRsGUzPDrX/LOEV/Uv+VQIDAQABo1AwTjAdBgNVHQ4EFgQUxQl8C8RdG+tU2U+T
gy901tOxUNUwHwYDVR0jBBgwFoAUxQl8C8RdG+tU2U+Tgy901tOxUNUwDAYDVR0T
BAUwAwEB/zANBgkqhkiG9w0BAQsFAAOCAQEAica5i0wN9ZuCICQOGwMcuVgadBqV
w4dOyP4EPyD2SKx3YpYREMGXOafYkrX2rWKqsCBqS9xUT34x2DQ4/KuoPY/Ee37h
pJ+/i47sq8pmiHxqQRUACyGA6SqWtcApfW62+O97qHnRtyUcCftKKLYEu3djzTJd
FOn6xPehbFzhL9H4tsiZ+kFaXqWDUbhSCAd/LeJ+dxzmOE+Rd0hsPHIyzdmWUKwe
CTkSaf9X4KPWjBUCqPzB/Td6Mz3HHg8zZo2FgkyI98a7c83rHl3aTfBJEi4LND8x
PTFwgOGNlZXa6OnUmkn/sHvoNc88EqDm/GjPI6xfLr7BSWE4jJCIwWROvg==
-----END CERTIFICATE-----
Create a ServiceMonitor etcd-k8s in prometheus-serviceMonitorEtcd.yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: etcd-k8s
  name: etcd-k8s
  namespace: monitoring
spec:
  endpoints:
  - port: port  # matches the port name in the etcd-k8s service below
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd
Apply prometheus-serviceMonitorEtcd.yaml:
#kubectl apply -f prometheus-serviceMonitorEtcd.yaml
Create the associated Service. Because etcd runs outside the cluster, the Endpoints object has to be created by hand. prometheus-service-etcd.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
spec:
  ports:
  - name: port
    port: 2379
    protocol: TCP
  type: ClusterIP
  clusterIP: None
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 22.22.3.231
    nodeName: etcd01
  - ip: 22.22.3.232
    nodeName: etcd02
  - ip: 22.22.3.233
    nodeName: etcd03
  ports:
  - name: port
    port: 2379
    protocol: TCP
Apply prometheus-service-etcd.yaml:
#kubectl apply -f prometheus-service-etcd.yaml
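A detail that is easy to get wrong here: Kubernetes attaches hand-made Endpoints to a Service purely by matching name and namespace, so the two documents in this file must agree on both. A quick sketch of that check; it writes a stripped-down stand-in copy of the manifest so it runs anywhere, but on a real checkout you would point the awk at prometheus-service-etcd.yaml itself:

```shell
# A hand-made Endpoints object is linked to its Service purely by having
# the same name and namespace. Stripped-down stand-in for the manifest:
cat > prometheus-service-etcd-check.yaml <<'EOF'
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
---
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
EOF

# Both documents must report exactly one distinct name.
names=$(awk '/^  name:/ {print $2}' prometheus-service-etcd-check.yaml | sort -u)
if [ "$(printf '%s\n' "$names" | wc -l)" -eq 1 ]; then
  echo "service and endpoints agree on name: $names"
fi
```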
Go to https://grafana.com/dashboards and find an etcd-related dashboard, for example:
https://grafana.com/dashboards/3070
Download the JSON file and import it into Grafana, selecting prometheus as the data source.
View the dashboard.
- A misconfigured Prometheus or ServiceMonitor resource can leave the pods prometheus-k8s-0 and prometheus-k8s-1 unhealthy, which makes the Prometheus UI unreachable; correcting the configuration restores them.