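Overview and Core Value

Prometheus provides time-series monitoring and rule evaluation. With recording rules, alerting rules, and Alertmanager, it can reliably monitor and alert on service availability, latency, and error rate.

Key Rule Examples

Recording Rules (precomputed aggregations)

groups:
  - name: recording.rules
    rules:
      # p95 request latency per job, estimated from histogram buckets
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      # Fraction of requests per job that returned a 5xx status
      - record: job:http_error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)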
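
Rule groups only take effect once the Prometheus server configuration references the files that contain them. A minimal sketch of that wiring, assuming the groups are saved under a `rules/` directory next to `prometheus.yml` (the directory, the file naming pattern, and the 30s interval are illustrative assumptions, not values from this article):

```yaml
# prometheus.yml (fragment)
global:
  # How often rule groups are evaluated; 30s is an assumed value.
  evaluation_interval: 30s

rule_files:
  # Hypothetical layout: one file per group, e.g. rules/recording.rules.yml.
  # Glob patterns are allowed here.
  - rules/*.rules.yml
```

After a reload, dashboards can query the recorded series such as `job:http_error_rate` directly instead of re-running the full aggregation on every refresh.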

Alert Rules (error rate and availability)

groups:
  - name: alerts.rules
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High error rate
          description: "{{ $labels.job }} error rate has exceeded 5% for 10 minutes"
      - alert: ServiceUnavailable
        # Matches 5xx, 401, and 403 responses; any nonzero rate sustained for 5m fires
        expr: sum(rate(http_requests_total{status=~"5..|4(0[13])"}[5m])) by (job) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Service availability alert
          description: "{{ $labels.job }} is showing unavailable behavior and needs investigation"
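
The rules above can be checked offline with a `promtool` unit test before they are deployed. The following is a sketch under stated assumptions: the two groups are saved as `recording.rules.yml` and `alerts.rules.yml` (hypothetical file names), and the synthetic `api` job and its counter values are invented for the test. The two input counters grow identically, so exactly half of the requests are 5xx.

```yaml
# rules_test.yml (hypothetical name); run with: promtool test rules rules_test.yml
rule_files:
  - recording.rules.yml
  - alerts.rules.yml

evaluation_interval: 1m

# Evaluate the recording group first so the alert sees fresh recorded values.
group_eval_order:
  - recording.rules
  - alerts.rules

tests:
  - interval: 1m
    input_series:
      # Both counters increase by 60 per minute for 20 minutes.
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+60x20'
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+60x20'

    promql_expr_test:
      # Equal 5xx and 2xx rates give an error ratio of 0.5.
      - expr: job:http_error_rate
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_error_rate{job="api"}'
            value: 0.5

    alert_rule_test:
      # After 5 minutes the threshold is exceeded but `for: 10m` has not
      # elapsed, so HighErrorRate is still pending and nothing is firing.
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts: []
```

To also assert the firing state at a later `eval_time`, each entry in `exp_alerts` needs `exp_labels` and `exp_annotations` that match the rule file verbatim.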

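Alertmanager Routing

route:
  # The root route must name a receiver, otherwise Alertmanager rejects the config
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h

receivers:
  - name: default
    webhook_configs:
      - url: http://ops.example.com/alerts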
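
Because the alert rules set a `severity` label, the routing tree can send critical alerts to a different receiver than everything else. A possible extension is sketched below; the `oncall-pager` receiver, its URL, and the shorter repeat interval are hypothetical and not part of the original setup.

```yaml
route:
  receiver: default              # fallback for alerts that match no child route
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    # Child route: only alerts labeled severity="critical" (e.g. HighErrorRate)
    - matchers:
        - severity="critical"
      receiver: oncall-pager     # hypothetical receiver name
      repeat_interval: 30m       # assumed: re-notify more often for critical alerts

receivers:
  - name: default
    webhook_configs:
      - url: http://ops.example.com/alerts
  - name: oncall-pager
    webhook_configs:
      - url: http://ops.example.com/pager   # hypothetical endpoint
```

Warning-level alerts such as `ServiceUnavailable` match no child route and fall through to `default`.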

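Parameters and Validation

Environment: `Prometheus v2.47+`, `Alertmanager v0.27+`.

Validation points:
- Recording rules reduce query cost, and dashboards respond faster.
- An error rate above 5% sustained for 10m triggers the alert, and routing delivers it.
- The p95 latency curve is available, and its trend matches what the business observes.

Best Practices

- Use recording rules to precompute complex aggregations so queries read the stored result.
- Set `for` on alerts to avoid false positives from transient spikes.
- Align thresholds and windows with the service's indicator definitions (SLI/SLO).

Conclusion

Recording and alerting rules, combined with Alertmanager routing, make it possible to build a stable and reliable monitoring and alerting system whose metrics and thresholds can be verified and audited.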
