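## Overview

Goal: use recording rules to reduce query cost and standardize metric naming, and use alerting rules to implement tiered alerting and inhibition.

Applies to: governance of core metrics such as service latency, error rate, resource utilization, and queue lag.

## Core Concepts and Practice

Recording rules example (`rules/recording.yml`):

```yaml
groups:
  - name: service-latency
    rules:
      # P95 latency per job, estimated from the histogram buckets
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Fraction of requests answered with a 5xx status, per job
      - record: job:error_rate:ratio
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Alerting rules example (`rules/alerts.yml`):

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: job:error_rate:ratio > 0.05
        for: 10m
        labels:
          severity: critical
          service: api
        annotations:
          summary: "API error rate elevated"
          description: "Error rate above 5% for 10 minutes"
      - alert: HighLatencyP95
        expr: job:http_request_duration_seconds:p95 > 0.8
        for: 5m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "API P95 latency high"
          description: "P95 above 800 ms for 5 minutes"
```

Example of loading the rules in Prometheus:

```yaml
rule_files:
  - "rules/recording.yml"
  - "rules/alerts.yml"
```

Alertmanager routing and inhibition (`alertmanager.yml`). This example configures grouping and routing only; it defines no `inhibit_rules`:

```yaml
route:
  receiver: default
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    # Critical alerts go to the pager receiver; everything else falls through to default
    - matchers:
        - severity="critical"
      receiver: pager
receivers:
  - name: default
    webhook_configs:
      - url: "http://ops:8080/alerts"
  - name: pager
    # Slack delivery also needs api_url here or slack_api_url under global
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
```

## Validation and Monitoring

Rule validation:

```shell
promtool check rules rules/recording.yml
promtool check rules rules/alerts.yml
```

Runtime checks:

```shell
# List the recording rules Prometheus has loaded
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="recording")'
# List the alerts that are currently pending or firing
curl -s http://prometheus:9090/api/v1/alerts | jq .
```

Alert flow:

```shell
# Query current alerts in Alertmanager (address taken from --alertmanager.url or the amtool config file)
amtool alert
```

## Common Pitfalls

- Inconsistent recording rule names confuse downstream dashboards and alerts. Use a hierarchical prefix and a suffix that states the purpose, following the `level:metric:operations` pattern, as in `job:error_rate:ratio`.
- A `for` duration that is too short causes flapping and alert storms. Choose the duration from historical data for the metric.
- Leaving routing and inhibition unconfigured leads to duplicate notifications. Group by `severity`/`service` and inhibit related alerts.

## Conclusion

With recording rules and a tiered alerting strategy, Prometheus and Alertmanager can form a stable, maintainable monitoring system, with `promtool` and `amtool` checks guarding the quality of the configuration.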
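The `alertmanager.yml` example groups and routes alerts but contains no inhibition rules, although the pitfalls recommend inhibiting related alerts. The fragment below is a minimal sketch of one way to do that, not the author's configuration. It assumes Alertmanager 0.22 or later (for the `source_matchers`/`target_matchers` syntax) and assumes that a firing `critical` alert should mute `warning` alerts for the same `service`:

```yaml
# Hypothetical addition to alertmanager.yml (top level, next to route and receivers)
inhibit_rules:
  # While a critical alert is firing...
  - source_matchers:
      - severity="critical"
    # ...suppress notifications for warning alerts...
    target_matchers:
      - severity="warning"
    # ...but only when both alerts carry the same value for the service label
    equal: ['service']
```

With the alerting rules from this article, a firing `HighErrorRate` (critical, `service: api`) would suppress notifications for `HighLatencyP95` (warning, `service: api`), so the on-call engineer receives one page instead of two. The combined file can be validated with `amtool check-config alertmanager.yml` before it is reloaded.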
