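## Overview and Core Value

Prometheus provides time-series monitoring and rule evaluation. With recording and alerting rules plus Alertmanager, you can reliably monitor and alert on service availability, latency, and error rate.

## Key Rule Examples

### Recording Rules (precomputed aggregations)

```yaml
groups:
  - name: recording.rules
    rules:
      # P95 request latency per job, computed from histogram buckets
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      # Share of requests answered with a 5xx status, per job
      - record: job:http_error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```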
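
Rule groups only take effect once Prometheus loads the file that contains them. A minimal sketch of the relevant `prometheus.yml` fragment follows; the `rules/` directory and the file-name pattern are assumptions for illustration, not something the setup above prescribes.

```yaml
# prometheus.yml (fragment)
global:
  # How often rule groups are evaluated; 1m is also the Prometheus default.
  evaluation_interval: 1m

rule_files:
  # Hypothetical location; entries may be glob patterns.
  - "rules/*.rules.yml"
```

Groups can override the global interval with their own `interval` field if a particular aggregation needs to be fresher or cheaper.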
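### Alerting Rules (error rate and availability)

```yaml
groups:
  - name: alerts.rules
    rules:
      # Fires when the recorded 5xx ratio stays above 5% for 10 minutes
      - alert: HighErrorRate
        expr: job:http_error_rate > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High error rate
          description: "{{ $labels.job }} error rate has exceeded 5% for 10 minutes"
      # Fires when any 5xx, 401, or 403 responses persist for 5 minutes
      - alert: ServiceUnavailable
        expr: sum(rate(http_requests_total{status=~"5..|4(0[13])"}[5m])) by (job) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Service availability alert
          description: "{{ $labels.job }} is showing unavailable behavior and needs investigation"
```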
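
For these alerts to reach Alertmanager, Prometheus also needs to know where Alertmanager is running. The fragment below is a sketch of that wiring in `prometheus.yml`; the host name is a placeholder, and 9093 is Alertmanager's default listen port.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical address; replace with your Alertmanager instance.
            - "alertmanager.example.com:9093"
```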
### Alertmanager Routing

```yaml
route:
  # The root route must name a default receiver
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
receivers:
  - name: default
    webhook_configs:
      - url: http://ops.example.com/alerts
```
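
Because the alert rules attach a `severity` label, the routing tree can treat critical alerts differently from warnings. The sketch below is one possible extension, not part of the configuration above: the `oncall` receiver, its URL path, and the shorter repeat interval are all assumptions made for illustration.

```yaml
route:
  receiver: default
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    # Child route: matches alerts labeled severity="critical"
    - matchers:
        - severity="critical"
      receiver: oncall          # hypothetical receiver
      repeat_interval: 30m      # assumed; re-notify sooner for critical alerts
receivers:
  - name: default
    webhook_configs:
      - url: http://ops.example.com/alerts
  - name: oncall
    webhook_configs:
      - url: http://ops.example.com/alerts/critical   # hypothetical endpoint
```

Alerts that match no child route, such as those with `severity: warning`, fall through to the root route's `default` receiver.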
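## Parameters and Validation

Environment: `Prometheus v2.47+`, `Alertmanager v0.27+`.

Validation points:

- Recording rules reduce query cost, and dashboards respond faster.
- An error rate above 5% sustained for 10m fires the alert, and routing delivers it to the receiver.
- The P95 latency curve is available, and its trend matches what the business observes.

## Best Practices

- Use recording rules to precompute complex aggregations so queries read a cached result.
- Set `for` on alerts to avoid false positives from transient spikes.
- Align thresholds and windows with the service's indicator definitions (SLI/SLO).

## Conclusion

Recording and alerting rules, combined with Alertmanager routing, make it possible to build a stable, reliable monitoring and alerting system whose metrics and thresholds can be verified and audited.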
