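Prometheus Metric Design and Alerting Rules in Practice

Metric Types and Design

- Counter: a cumulative value; name it with a `_total` suffix
- Gauge: an instantaneous value that can go up or down; suited to inventory levels and concurrency counts
- Histogram: bucketed observations; suited to latency distributions
- Summary: statistics computed locally in each instance; suited to in-application percentiles

Collection and Naming

- Use a clear label model, such as `service`, `endpoint`, and `status`
- Avoid high-cardinality labels (such as user IDs)

PromQL Examples

rate(http_requests_total{status=~"5.."}[5m])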
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
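
To make the second query less of a black box, here is a minimal Python sketch of how a quantile can be estimated from cumulative buckets by linear interpolation, which is the general approach `histogram_quantile` takes. The function name mirrors the PromQL function only for readability. The bucket values are made up for illustration, and the sketch assumes non-negative observations (such as latencies) and skips edge cases that Prometheus itself handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumption: observations are non-negative, so the first bucket
    starts at 0. The buckets must include the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    if upper == math.inf:
        # The quantile falls beyond the last finite bound, so the best
        # available answer is that bound (if one exists).
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower  # guard against dividing by zero for q == 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative bucket counts for bounds 0.1s, 0.5s, 1s, +Inf.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, sample_buckets))
```

With these hypothetical buckets the estimate lands between the 0.5 and 1.0 bounds. The result is only as precise as the bucket layout, so it helps to place bucket boundaries near the thresholds you intend to alert on.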

Alerting Rules

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "5xx ratio has exceeded 5% for 10 minutes"
      - alert: HighLatencyP95
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency too high"
          description: "P95 over the last 5 minutes exceeds 800ms"
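
The `for: 10m` clause means an alert does not fire the moment its expression becomes true. It first enters a pending state and only fires once the condition has held across evaluations for the full duration. The following Python sketch models that behavior under simplifying assumptions: `AlertState` is a hypothetical name, timestamps are plain seconds, and the sample error ratios are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Toy model of a single alert with a threshold and a `for` duration."""

    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first held

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if value > self.threshold:
            if self.active_since is None:
                self.active_since = now
            if now - self.active_since >= self.for_seconds:
                return "firing"
            return "pending"
        # The condition no longer holds, so the timer resets.
        self.active_since = None
        return "inactive"


# Hypothetical error ratios sampled every 5 minutes (300 s).
state = AlertState(threshold=0.05, for_seconds=600)
for ts, ratio in [(0, 0.08), (300, 0.07), (600, 0.06), (900, 0.01), (1200, 0.09)]:
    print(ts, ratio, state.evaluate(ts, ratio))
```

A single evaluation below the threshold resets the timer, which is why a longer `for` value trades detection speed for fewer alerts on short spikes.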

Alertmanager Routing Example

route:
  group_by: ['alertname']
  receiver: 'default'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'https://alert.example.com/hook'
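
The effect of `group_by: ['alertname']` is that alerts sharing the same `alertname` are batched into one notification instead of being sent individually. The Python sketch below illustrates the idea only; `group_alerts` is a hypothetical helper, the `service` label values are invented, and treating a missing label as an empty string is an assumption of this sketch.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption: a missing label is treated as an empty value.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts; the alert names come from the rules above.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "service": "api"}},
    {"labels": {"alertname": "HighErrorRate", "service": "web"}},
    {"labels": {"alertname": "HighLatencyP95", "service": "api"}},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Grouping by `alertname` alone merges alerts from every service into one notification. Adding a label such as `service` to `group_by` would split them per service, at the cost of more notifications.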
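
Verification and Monitoring

- Use the `/-/healthy` and `/-/ready` endpoints to check service status
- Use the rules page and `promtool` to validate alert rule syntax

Summary

Well-designed metrics and alerts provide stable observability and enable fast incident response.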
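
As a sketch of the health checks described under Verification and Monitoring, the following Python snippet probes both endpoints and reports whether each returned HTTP 200. The helper names are hypothetical, and the base URL `http://localhost:9090` is an assumption about where a Prometheus server might be reachable. The `fetch` parameter exists so the logic can be exercised without a running server.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 503 while the server is not ready
    except (URLError, OSError):
        return None


def check_prometheus(base_url, fetch=http_status):
    """Map each health endpoint to True if it answered with HTTP 200."""
    base = base_url.rstrip("/")
    return {
        path: fetch(base + path) == 200
        for path in ("/-/healthy", "/-/ready")
    }


if __name__ == "__main__":
    # Assumed address; adjust to your own deployment.
    print(check_prometheus("http://localhost:9090"))
```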
