Monitoring Guide: Prometheus, Grafana & OpenSearch

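This guide describes a monitoring stack built from free, open-source tools: Prometheus for metrics, Grafana for dashboards, and OpenSearch for log aggregation. Together they give you visibility into your deployment and a basis for alerting.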


1. Metrics Collection with Prometheus

Installation via Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus

Service Discovery

Annotate service pods so that Prometheus discovers and scrapes their metrics endpoint:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8080"
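
To make the effect of these annotations concrete, the Python sketch below mimics how an annotation-based scrape configuration typically turns them into a target URL. The function name `scrape_url` and the pod IP are hypothetical, and the defaults (`/metrics`, plain HTTP) follow common convention rather than anything this guide specifies. Prometheus itself does this through relabeling rules, not Python.

```python
from typing import Optional


def scrape_url(annotations: dict, pod_ip: str) -> Optional[str]:
    """Return the URL a scraper would request for this pod, or None if the pod has not opted in."""
    # Pods without prometheus.io/scrape: "true" are skipped entirely.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Assumed defaults: /metrics path, no explicit port.
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port")
    host = f"{pod_ip}:{port}" if port else pod_ip
    return f"http://{host}{path}"


# The annotations from the manifest fragment above, with a made-up pod IP.
example = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/actuator/prometheus",
    "prometheus.io/port": "8080",
}
print(scrape_url(example, "10.0.0.5"))
```

Note that the annotation values are strings, which is why the port and the scrape flag are quoted in the manifest.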

Key Metrics to Track

  • API Latency: HTTP request duration (p95/p99)
  • Error Rate: 4xx/5xx responses per minute
  • Kafka Consumer Lag: how far consumers trail the latest offsets, as a measure of event-processing delay
  • Pod Health: CPU/memory usage, restart count
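
The metrics listed above could be queried along the following lines. This is a sketch, not a prescribed setup: the metric names are assumptions (Spring Boot/Micrometer naming for HTTP metrics, which the `/actuator/prometheus` path suggests; `kafka_consumergroup_lag` from a Kafka exporter; `kube_pod_container_status_restarts_total` from kube-state-metrics), and the helper `instant_query_url` is hypothetical. Adjust the names to whatever your exporters actually expose.

```python
from urllib.parse import urlencode

# PromQL expressions for the key metrics. All metric and label names are
# assumptions about the exporters in use, not something this guide defines.
QUERIES = {
    "api_latency_p95": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(http_server_requests_seconds_bucket[5m])))"
    ),
    "error_rate_per_minute": (
        'sum(rate(http_server_requests_seconds_count{status=~"[45].."}[5m])) * 60'
    ),
    "kafka_consumer_lag": "sum by (consumergroup, topic) (kafka_consumergroup_lag)",
    "pod_restarts_1h": "increase(kube_pod_container_status_restarts_total[1h])",
}


def instant_query_url(base_url: str, name: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    query = urlencode({"query": QUERIES[name]})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


# Example: assumes Prometheus is reachable on localhost:9090 (e.g. via port-forward).
print(instant_query_url("http://localhost:9090", "kafka_consumer_lag"))
```

The same expressions can be pasted into Grafana panels or the Prometheus expression browser; building URLs is only useful for scripted checks.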

2. Dashboards with Grafana

Installation via Helm

helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana

Dashboard Setup

  • Import the Node Exporter Full dashboard (ID: 1860), a community dashboard for host-level metrics collected by Prometheus

  • Import Kafka overview dashboard (ID: 7607)

  • Custom panels:

    • P99 latency (histogram_quantile)
    • Error budget burn rate (observed error rate relative to the error rate the SLO allows)
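
To clarify what the two custom panels compute, here is a small Python sketch of the underlying arithmetic. `histogram_quantile` follows the linear-interpolation approach that PromQL's function of the same name uses on cumulative buckets. `burn_rate` takes burn rate in the common SRE sense (observed error ratio divided by the ratio the SLO permits), which is an assumption about what the panel is meant to show. The bucket values and the 99% SLO target are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a math.inf bucket.
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


def burn_rate(errors, total, slo_target):
    """Observed error ratio divided by the allowed error ratio (requires slo_target < 1)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo_target)


# Illustrative data: request durations in seconds, cumulative counts.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
print(burn_rate(errors=5, total=1000, slo_target=0.99))
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO window; values above 1 mean it runs out early.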

3. Log Aggregation with OpenSearch

Installation via Helm

helm repo add opensearch https://opensearch-project.github.io/helm-charts
helm install opensearch opensearch/opensearch
helm install dashboards opensearch/opensearch-dashboards

Filebeat DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.17.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: varlog
          hostPath:
            path: /var/log

Key Log Queries

  • Error logs: level:ERROR in last 5m
  • Slow requests: HTTP duration > 1s
  • Kafka errors: consumer group failures
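
The first two queries above might be expressed in OpenSearch query DSL roughly as follows. This is a sketch under stated assumptions: the field names `level`, `@timestamp`, and `http.duration_ms` are hypothetical and depend on how your logs are parsed and mapped, and the function names are illustrative. The Kafka query is omitted because it depends entirely on your log message format.

```python
import json


def recent_errors_query(minutes: int = 5) -> dict:
    """Match ERROR-level log entries from the last `minutes` minutes."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {"level": "ERROR"}},  # assumed field name
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        }
    }


def slow_requests_query(threshold_ms: int = 1000) -> dict:
    """Match requests slower than the threshold (duration field name is assumed)."""
    return {"query": {"range": {"http.duration_ms": {"gt": threshold_ms}}}}


# Print the request bodies; they can be sent to an index's _search endpoint.
print(json.dumps(recent_errors_query(), indent=2))
print(json.dumps(slow_requests_query(), indent=2))
```

In OpenSearch Dashboards the same filters can be entered in the search bar with a time picker, so the DSL form is mainly useful for scripted checks or saved monitors.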

  • Document Version: 1.0
  • Date: 2025-06-21
  • Author: ArturChernets