Monitoring the JVM with Prometheus + Grafana

Overview

[Figure: System architecture]

Components

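spring-actuator
Helps you monitor and manage a Spring Boot application: health checks, auditing, metrics, HTTP tracing, and so on
Exposes its data through JMX or HTTP endpoints
spring-boot-admin
Integrates with spring-actuator and provides a web UI
micrometer
A vendor-neutral metrics facade for JVM applications that works with many monitoring systems, much as SLF4J does for logging
Prometheus
A monitoring and alerting toolkit that stores multi-dimensional time series data
Graphite / InfluxDB / OpenTSDB
Time series databases
Grafana
A platform for visualizing and analyzing metrics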

Monitoring setup

spring-actuator + micrometer + Prometheus + Grafana + <WebHook>

Prometheus

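Features

A multi-dimensional data model: time series are identified by a metric name and key/value pairs (labels)
PromQL, a flexible query language
No dependency on distributed storage; each server node is autonomous
Time series are collected over HTTP using a pull model
Pushing time series is supported through an intermediary gateway
Targets are found through service discovery or static configuration
Multiple modes of graphing and dashboarding are supported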

Architecture

[Figure: Prometheus architecture]

Data collection

GET /actuator/prometheus

[Figure: Sample output of the metrics endpoint]
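The endpoint returns plain text in the Prometheus exposition format: one sample per line, with optional labels in braces, and comment lines starting with #. As a rough illustration of what Prometheus reads on each scrape, here is a minimal parsing sketch. The function name parse_metrics and the sample text are made up for this example, and the parser assumes no trailing timestamps and no escaped quotes inside label values.

```python
import re

# Matches one label pair such as area="heap"
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical sample; real output from /actuator/prometheus is much longer.
SAMPLE = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.048576E7
process_uptime_seconds 42.5
"""


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[:line.index("{")]
            label_part = line[line.index("{") + 1:line.rindex("}")]
            labels = dict(LABEL_RE.findall(label_part))
            value = line[line.rindex("}") + 1:].split()[0]
        else:
            name, value = line.split()[:2]
            labels = {}
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and labels becomes its own time series in Prometheus.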

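Querying data

Prometheus provides a web UI for running PromQL queries against the time series data. Any of the metric names shown above can be used directly as a query expression, and many query functions are supported; see https://prometheus.io/docs/prometheus/latest/querying/functions/

[Figure: PromQL query in the web UI]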
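The same queries can also be run outside the web UI through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below builds the request URL and pulls the values out of an instant-vector response. The helper names are invented for this example, and the base URL http://localhost:9090 is an assumption based on the default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


def instant_values(response):
    """Map each instance label to its current value in an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each result entry looks like {"metric": {...labels...}, "value": [timestamp, "1"]}
    return {r["metric"].get("instance", ""): float(r["value"][1]) for r in data["result"]}


def query(base_url, expr):
    """Run an instant query against a live Prometheus server (needs network access)."""
    with urlopen(build_query_url(base_url, expr), timeout=5) as resp:
        return instant_values(json.load(resp))


print(build_query_url("http://localhost:9090", "up != 1"))
```

Calling query("http://localhost:9090", "up") against a running server would return something like a mapping from each scraped instance to 1.0 or 0.0, depending on whether the last scrape succeeded.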

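Alerting rules

Rules defined in the rule files are evaluated against the time series data at every evaluation interval; the rules that have been loaded are listed in the web UI
Alerts are sent to the AlertManager component, which manages them centrally

Rule configuration: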

groups:
- name: Instances
  rules:
  - alert: InstanceDown
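    expr: up != 1  # rule expression, written in PromQL
    for: 1m  # the alert fires only after the condition has held for 1 minute; if a later evaluation within that window no longer matches, no alert is sent
    labels:  # labels attached to the alert
      severity: page # severity; can be used later to inhibit alerts that are not needed
      status: High
    annotations:  # descriptive information for the alert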
      description: "Application: {{ $labels.job }} Instance: {{ $labels.instance }} is Down ! ! !"
      value: '{{ $value }}'
      summary:  "Instance {{ $labels.instance }} down"

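Alert states

Inactive: the rule condition is not met; nothing is happening.
Pending: the condition is met, but not yet for the full duration set by the rule's for field
Firing: the condition has been met for the full duration. The alert is sent to the notification pipeline, processed, and a notification goes out
Resolved: the alert has cleared. This status appears only in the notification request payload and is not shown in the web UI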
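To make the effect of the for field concrete, the following sketch replays a series of rule evaluations and reports the state after each one. It is a simplified model of the behavior described above, not Prometheus code; the function name alert_states and the sample timeline are made up for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each evaluation.

    evaluations: list of (timestamp_in_seconds, condition_matched) pairs,
    in time order, one per rule evaluation.
    """
    states = []
    active_since = None  # time at which the condition first started matching
    for timestamp, matched in evaluations:
        if not matched:
            active_since = None  # condition cleared: the pending timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# Evaluations every 15s (as in evaluation_interval: 15s) with for: 1m.
timeline = [(0, False), (15, True), (30, True), (45, False),
            (60, True), (75, True), (90, True), (105, True), (120, True)]
for (timestamp, _), state in zip(timeline, alert_states(timeline, 60)):
    print(timestamp, state)
```

In this timeline the brief recovery at 45s resets the timer, so the alert only reaches firing at 120s, one full minute after the condition started matching again at 60s.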

AlertManager

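The alerting rules in the Prometheus server send alerts to AlertManager, which then manages them: silencing, inhibition, grouping, and sending notifications via email, on-call notification systems, and chat platforms
Here we use the WebHook method: alerts are posted to an endpoint of our choosing, where we can decide how to notify and whom, based on the alert data

Alert request payload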

{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "10.10.10.10:9000",
        "job": "app-1",
        "severity": "page",
        "status": "High"
      },
      "annotations": {
        "description": "Application: app-1 Instance: 10.10.10.10:9000 is Down ! ! !",
        "summary": "Instance 10.10.10.10:9000 down",
        "value": "0"
      },
      "startsAt": "2021-02-20T09:11:43.380766777Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://VM-102-32-centos:9090/graph?g0.expr=up+%21%3D+1\u0026g0.tab=1",
      "fingerprint": "8a9aadd8d34d09f7"
    },
    {
      "status": "resolved",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "10.10.10.10:9001",
        "job": "app-2",
        "severity": "page",
        "status": "High"
      },
      "annotations": {
        "description": "Application: app-2 Instance: 10.10.10.10:9001 is Down ! ! !",
        "summary": "Instance 10.10.10.10:9001 down",
        "value": "0"
      },
      "startsAt": "2021-02-20T09:11:28.380766777Z",
      "endsAt": "2021-02-20T09:13:43.380766777Z",
      "generatorURL": "http://VM-102-32-centos:9090/graph?g0.expr=up+%21%3D+1\u0026g0.tab=1",
      "fingerprint": "6070b8cb7389ffc2"
    }
  ],
  "groupLabels": {
    "alertname": "InstanceDown"
  },
  "commonLabels": {
    "alertname": "InstanceDown",
    "severity": "page",
    "status": "High"
  },
  "commonAnnotations": {
    "value": "0"
  },
  "externalURL": "http://VM-102-32-centos:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"InstanceDown\"}",
  "truncatedAlerts": 0
}
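On the receiving side, the WebHook endpoint only needs to accept a POST with this JSON body. Below is a minimal sketch using Python's standard library; the names summarize and AlertHandler are invented for this example, and what to do with each summary line (mail, chat message, and so on) is left open.

```python
import json
from http.server import BaseHTTPRequestHandler


def summarize(payload):
    """Turn an AlertManager webhook payload into one readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} {}: {}".format(
            alert.get("status", "unknown").upper(),
            labels.get("severity", "-"),
            labels.get("alertname", "-"),
            annotations.get("summary", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts the POST request sent by AlertManager and prints a summary."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)  # replace with the notification channel of your choice
        self.send_response(200)
        self.end_headers()
```

To try it out, the handler can be served with http.server.HTTPServer(("", 8080), AlertHandler).serve_forever(); port 8080 is an assumption and must match the url configured for the receiver.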

Grafana

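An open-source tool for visualizing large volumes of measurement data. It offers a powerful and elegant way to create, share, and explore data; a dashboard displays data from your different metric data sources

Data sources

Time series databases (Prometheus, Graphite, OpenTSDB, InfluxDB)
Document databases (ElasticSearch)
Distributed tracing systems (Jaeger, Zipkin, Tempo)
SQL (MySQL, PostgreSQL, SQL Server)
and more

Node dashboard

[Figure: Node dashboard]

Application health table

[Figure: Application health table]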

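Deployment

Add the spring-actuator and micrometer dependencies to the application. The prometheus endpoint is not exposed over HTTP by default, so it also has to be listed in the management.endpoints.web.exposure.include property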

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

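Install Prometheus & AlertManager

Download from the official site, unpack, and run directly; only prometheus and alertmanager are needed here
https://prometheus.io/download/
prometheus listens on port 9090 by default, alertmanager on 9093

[Figure: Directory layout]

Create the prometheus startup script start.sh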

nohup ./prometheus --web.enable-lifecycle 2>&1 &

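Create the prometheus hot-reload script reload.sh (the reload endpoint is only available because the server was started with --web.enable-lifecycle)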

curl -XPOST http://localhost:9090/-/reload

prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
    - "./rules/*.yml"
    
# Scrape configurations: one job per Spring Boot application.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'app-1'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['10.10.10.10:9000', '10.10.10.11:9000']

  - job_name: 'app-2'
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['10.10.10.11:9001']

Create the alertmanager startup script start.sh

nohup ./alertmanager 2>&1 &

Create the alertmanager hot-reload script reload.sh

curl -XPOST http://localhost:9093/-/reload

alertmanager.yml

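route:
  group_by: ['alertname']  # labels used to group alerts
  group_wait: 20s  # how long to wait after a new group appears, so that alerts in the same group can be merged into one notification
  group_interval: 5m  # interval between notifications for the same group, counted from the last notification sent
  repeat_interval: 3m  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'web.hook'  # name of the receiver that gets the alerts
receivers:
  - name: 'web.hook'
    webhook_configs:
    - send_resolved: false  # whether to notify when a firing alert recovers; defaults to true
      url: 'http://10.10.10.11:8080/xxx/xxx'  # alert notification endpoint
inhibit_rules:  # inhibition: once one problem has raised an alert, avoids flooding users with the follow-on alerts it causes
  - source_match:  # the alert that does the inhibiting
      severity: 'critical'
    target_match:  # the alert that gets inhibited
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']  # source and target must have the same values for these labels, otherwise inhibition does not apply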
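The inhibition rule is easier to follow as code. The sketch below is a simplified model of the matching logic, not AlertManager's implementation: alerts are represented only by their label dictionaries, the function names are made up, and a label missing on both sides is treated as equal (both read as an empty value).

```python
def matches(labels, matchers):
    """True if the alert carries every label/value pair in matchers."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Decide whether target is muted by some firing source alert under rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Every label listed in 'equal' must agree between source and target.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "InstanceDown", "severity": "critical", "instance": "10.10.10.10:9000"}
warning = {"alertname": "InstanceDown", "severity": "warning", "instance": "10.10.10.10:9000"}
print(is_inhibited(warning, [critical, warning], rule))
```

Note that the InstanceDown rule shown earlier is labeled severity: page, so as configured it matches neither side of this inhibition rule; the rule only takes effect for alerts labeled critical and warning.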

Install Grafana

Download from the official site, unpack, and run ./bin/grafana-server
https://grafana.com/grafana/download
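Default port: 3000

[Figure: Grafana directory layout]

Create a Prometheus data source

[Figure: Adding a data source in Grafana]

Import a dashboard

Dashboards can be imported from a file: search the official site for the dashboard you want, download it, and import it into your own server
https://grafana.com/grafana/dashboards/12856

[Figure: Importing a dashboard in Grafana]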

Last edited: 2021.03.05 15:11:20
