Uptime Monitoring - Downtime Notification Automation with Alertmanager

Detecting that a website or API has gone down before your customers do is a cornerstone of operational reliability. Instead of relying on manual checks, you can set up automated uptime monitoring that continuously runs HTTP, TCP, and DNS health checks and sends Slack, PagerDuty, or email notifications through Alertmanager when an outage occurs. This guide covers uptime monitoring with Prometheus Blackbox Exporter, alert rules, Alertmanager configuration, and on-call management.

Why Is Uptime Monitoring Critical?

Downtime directly translates to revenue loss. For an e-commerce site, it can mean thousands of dollars in lost sales per minute; for a SaaS application, it can mean SLA violations and financial penalties. With proactive monitoring, you can detect and respond to issues before users notice them.

Uptime Target        | Annual Allowed Downtime | Monthly Allowed Downtime
99% (two nines)      | 3 days 15 hours         | 7 hours 18 minutes
99.9% (three nines)  | 8 hours 46 minutes      | 43 minutes 50 seconds
99.99% (four nines)  | 52 minutes 36 seconds   | 4 minutes 23 seconds
99.999% (five nines) | 5 minutes 16 seconds    | 26 seconds

💡 Realistic Target: For most web applications, 99.9% (three nines) is a reasonable target. This allows approximately 43 minutes of downtime per month. Targets of 99.99% and above require redundant infrastructure, automatic failover, and a 24/7 on-call team.
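
The downtime budgets in the table follow from one formula. As a worked example for the 99.9% target, assuming an average year of 365.25 days and an average month of one twelfth of that (an assumption that reproduces the table's figures):

```latex
\begin{align*}
\text{allowed downtime} &= (1 - \text{uptime target}) \times \text{period} \\
\text{monthly at } 99.9\% &= (1 - 0.999) \times \frac{365.25 \times 24 \times 60}{12}\ \text{min}
  = 0.001 \times 43830\ \text{min} \approx 43.8\ \text{min} \\
\text{annual at } 99.9\% &= (1 - 0.999) \times 365.25 \times 24\ \text{h}
  = 0.001 \times 8766\ \text{h} \approx 8.77\ \text{h}
\end{align*}
```

43.8 minutes is about 43 minutes 50 seconds, and 8.77 hours is about 8 hours 46 minutes.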

Endpoint Monitoring with Blackbox Exporter

Prometheus Blackbox Exporter checks endpoint availability over HTTP, HTTPS, TCP, ICMP, and DNS. Because it probes your application from the outside (black-box), it approximates what a real user experiences.

docker-compose.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
    ports:
      - "9090:9090"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    ports:
      - "9115:9115"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

Blackbox Exporter Configuration

Define which checks the Blackbox Exporter performs using modules:

blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 301, 302]
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false

  tcp_connect:
    prober: tcp
    timeout: 5s

  dns_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: hosted.cloud
      query_type: A
      transport_protocol: udp

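  # For HTTPS targets the http prober exports probe_ssl_earliest_cert_expiry,
  # the metric used by the certificate expiry alert rule
  ssl_expiry:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200, 301, 302]
      tls_config:
        insecure_skip_verify: false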
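
Two further modules can be useful alongside the ones above. This is a sketch under assumptions, not part of the original setup: the module names http_health_body and icmp_ping and the expected body text "ok" are hypothetical. fail_if_body_not_matches_regexp makes an HTTP probe fail unless the response body matches, and the icmp prober covers the ICMP checks mentioned earlier.

```yaml
# Additional entries under the existing "modules:" key in blackbox.yml
modules:
  # Fails unless the response body contains the expected text
  # (hypothetical: a health endpoint that returns "ok")
  http_health_body:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "ok"
      preferred_ip_protocol: ip4

  # Ping-style reachability check
  icmp_ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
```

ICMP probes need elevated privileges on Linux (for example the CAP_NET_RAW capability), so the exporter container may need extra permissions before this module works.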

Prometheus Scrape Configuration

Define the scrape config for Prometheus to check target endpoints through the Blackbox Exporter:

prometheus.yml (excerpt)
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://hosted.cloud
          - https://api.hosted.cloud/health
          - https://admin.hosted.cloud
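    relabel_configs:
      # Pass the original target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115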
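
The excerpt above only covers the HTTP probe job. For alerts to reach Alertmanager, prometheus.yml also has to load the rule file and know where Alertmanager runs. The following is a minimal sketch of those remaining sections, assuming the service names and mount paths from the docker-compose.yml above. The 30s intervals and the TCP target db.internal.example:5432 are illustrative assumptions.

```yaml
# prometheus.yml (remaining sections; merge with the excerpt above)
global:
  scrape_interval: 30s       # how often each probe runs (assumed value)
  evaluation_interval: 30s   # how often alert rules are evaluated (assumed value)

rule_files:
  - /etc/prometheus/alert-rules.yml   # path mounted in docker-compose.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093       # Compose service name and port

scrape_configs:
  # Append this job to the existing scrape_configs list.
  # It uses the tcp_connect module defined in blackbox.yml.
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - db.internal.example:5432  # hypothetical host:port
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```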

Prometheus Alert Rules

Alert rules trigger Alertmanager notifications when specific conditions are met. Essential rules for uptime monitoring:

alert-rules.yml
groups:
  - name: uptime-alerts
    rules:
      # Endpoint completely unreachable
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes."

      # Slow response time (all HTTP phases combined over 2 seconds).
      # A single phase such as "transfer" covers only part of the request.
      - alert: HighResponseTime
        expr: sum by (instance, job) (probe_http_duration_seconds) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response: {{ $labels.instance }}"
          description: "{{ $labels.instance }} response time has been over 2s for 5 minutes."

      # SSL certificate expiring within 14 days
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring: {{ $labels.instance }}"
          description: "{{ $labels.instance }} SSL certificate expires in {{ $value | humanize }} days."

      # HTTP 5xx status code
      - alert: HTTPStatusError
        expr: probe_http_status_code >= 500
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx: {{ $labels.instance }}"
          description: "{{ $labels.instance }} is returning HTTP {{ $value }} status code."
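
One gap in the rules above: if the Blackbox Exporter itself is unreachable, probe_success is not recorded at all, so EndpointDown cannot fire. A possible complement, offered as a sketch, is a rule on Prometheus's built-in up metric. The group name, alert name, and the job name pattern blackbox-.* are assumptions.

```yaml
# Additional rule group for alert-rules.yml
groups:
  - name: uptime-meta-alerts
    rules:
      # The scrape of the exporter's /probe endpoint is failing
      - alert: BlackboxProbeScrapeFailed
        expr: up{job=~"blackbox-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Probe scrape failing: {{ $labels.instance }}"
          description: "Prometheus cannot scrape the Blackbox Exporter probe for {{ $labels.instance }}, so probe_success is not being recorded."
```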

⚠️ Important: Don't set the for duration too short, or temporary network fluctuations will generate false positives. As a starting point, wait at least 2 minutes for critical alerts and 5 minutes for warnings, then tune these values against how quickly your business needs to hear about a real outage.
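
The effect of the for duration can be checked before deploying, using promtool's rule unit tests. The sketch below assumes the EndpointDown rule exactly as written above and a hypothetical test file named alert-rules-test.yml in the same directory. It simulates a probe that starts failing at the 2 minute mark and checks that the alert is still pending at 3 minutes and firing at 5 minutes.

```yaml
# alert-rules-test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: healthy at 0m and 1m, failing from 2m onward
      - series: 'probe_success{instance="https://hosted.cloud", job="blackbox-http"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # Failing for only 1 minute: still pending, so no firing alert is expected
      - eval_time: 3m
        alertname: EndpointDown
      # Failing for more than 2 minutes: the alert fires
      - eval_time: 5m
        alertname: EndpointDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://hosted.cloud
              job: blackbox-http
            exp_annotations:
              summary: "Endpoint https://hosted.cloud down"
              description: "https://hosted.cloud has been unreachable for 2 minutes."
```

Run it with promtool test rules alert-rules-test.yml. promtool ships in the prom/prometheus image.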

Alertmanager Configuration

Alertmanager receives alerts from Prometheus, groups them, suppresses duplicate notifications, and routes them to the correct channel. Routing, grouping, and inhibition mechanisms prevent alert fatigue.

alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T.../B.../xxx'

route:
  receiver: slack-default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical
      repeat_interval: 1h
      continue: true

    # Critical and warning alerts also go to Slack
    - matchers:
        - severity =~ "critical|warning"
      receiver: slack-alerts

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: slack-alerts
    slack_configs:
      - channel: '#alerts-critical'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'

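  - name: pagerduty-critical
    pagerduty_configs:
      # service_key is for PagerDuty's "Prometheus" integration type (Events API v1).
      # For an Events API v2 integration, use routing_key instead; the severity
      # field is part of the v2 event payload.
      - service_key: 'your-pagerduty-integration-key'
        severity: critical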

inhibit_rules:
  # Suppress HighResponseTime when EndpointDown is active
  - source_matchers:
      - alertname = "EndpointDown"
    target_matchers:
      - alertname = "HighResponseTime"
    equal: [instance]
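
The configuration above covers Slack and PagerDuty. Email notifications could be added along the following lines. This is a sketch only: the SMTP host, addresses, credentials, and the receiver name email-oncall are placeholders, and the fragment is meant to be merged into the existing global, route, and receivers sections.

```yaml
# Additions to alertmanager.yml (placeholder host, addresses, and credentials)
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'change-me'

route:
  routes:
    # Send warnings to email; place this before the Slack route
    - matchers:
        - severity = "warning"
      receiver: email-oncall
      continue: true          # keep evaluating the routes that follow

receivers:
  - name: email-oncall
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true   # also notify when the alert clears
```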

Uptime Dashboard with Grafana

Visualize Blackbox Exporter metrics in Grafana to monitor the status of all your endpoints from a single dashboard. Key metrics:

  • probe_success: Whether the endpoint is reachable (1 = success, 0 = failure). Show it as a green/red status indicator in a Stat panel.
  • probe_http_duration_seconds: HTTP response time broken down by phase (resolve, connect, tls, processing, transfer). Analyze trends in a Time series panel.
  • Uptime percentage (PromQL): avg_over_time(probe_success[30d]) * 100 gives the uptime percentage over the last 30 days. Prometheus keeps 15 days of data by default, so a 30-day window needs a longer retention setting (--storage.tsdb.retention.time).
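
Evaluating a 30-day range on every dashboard refresh can be slow. One option, sketched here with a hypothetical file name and rule name, is to precompute the value with a recording rule and have Grafana query the recorded series instead.

```yaml
# uptime-recording-rules.yml (hypothetical file; list it under rule_files)
groups:
  - name: uptime-recording
    interval: 5m
    rules:
      # Fraction of successful probes over the last 30 days, per endpoint
      - record: instance:probe_success:avg_over_time_30d
        expr: avg_over_time(probe_success[30d])
```

In Grafana, the panel query then becomes instance:probe_success:avg_over_time_30d * 100. The result only reflects 30 days once Prometheus holds that much history.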

On-Call Management and Preventing Alert Fatigue

An effective alerting system should deliver the right information to the right person at the right time. Excessive alerting creates "alert fatigue" and causes teams to overlook real issues.

Severity Separation

Critical: immediate PagerDuty/phone. Warning: Slack notification. Info: dashboard only.
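
For the info tier, one way to keep alerts visible in the Prometheus and Alertmanager UIs without notifying anyone is a receiver with no integrations. The fragment below is a sketch that assumes your rules set a severity label of info. The receiver name "null" is a convention, and it must be quoted so YAML does not read it as an empty value.

```yaml
# Fragment of alertmanager.yml
route:
  receiver: slack-default
  routes:
    # Info alerts: no notification is sent
    - matchers:
        - severity = "info"
      receiver: "null"

receivers:
  - name: "null"   # no *_configs entries, so nothing is delivered
```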

Inhibition Rules

If a server is completely down, suppress all service alerts on that server. A single root cause notification is sufficient.
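
As a sketch of that idea: the rule below assumes a host-level alert named NodeDown and a node label shared by the host alert and the service alerts. Both names are hypothetical and do not come from the configuration earlier in this guide.

```yaml
# Fragment of alertmanager.yml
inhibit_rules:
  # While NodeDown fires for a node, mute every other alert on that node
  - source_matchers:
      - alertname = "NodeDown"     # hypothetical host-level alert
    target_matchers:
      - alertname != "NodeDown"
    equal: [node]                  # hypothetical shared label
```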

Silence and Maintenance

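Create Alertmanager silences for planned maintenance windows. Each silence has a start and end time and expires automatically at its end time, so set the end to match when the maintenance is expected to finish.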
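
Silences are created at runtime, typically through the Alertmanager UI or API. For maintenance that recurs on a fixed schedule, a configuration-based alternative is a named time interval that mutes a route. The sketch below assumes a weekly window on Sundays from 02:00 to 04:00 UTC. The interval name and schedule are hypothetical.

```yaml
# Fragment of alertmanager.yml
time_intervals:
  - name: sunday-maintenance
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
        location: 'UTC'

route:
  receiver: slack-default
  routes:
    # Warnings are not notified during the maintenance window
    - matchers:
        - severity = "warning"
      receiver: slack-alerts
      mute_time_intervals:
        - sunday-maintenance
```

Muting applies to child routes only, so the interval is attached to a route under the root rather than to the root route itself.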

For server metrics, check our Prometheus + Grafana guide. For centralized log management, see our ELK Stack guide. For SSL certificate management, review our Let's Encrypt Automatic Renewal guide. The Alertmanager Official Documentation and Blackbox Exporter GitHub are useful additional resources.

Frequently Asked Questions

Can I monitor internal services with Blackbox Exporter?

Yes, Blackbox Exporter is not limited to public endpoints. You can monitor internal APIs, database ports, and message queue connections with TCP probes. ClusterIP services in Kubernetes can also be checked.

How often should uptime checks run?

For critical services, 15-30 second intervals are recommended. For less critical services, 1-5 minutes is sufficient. Very frequent checks (less than 5 seconds) can put unnecessary load on the target service and generate false alarms.
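
In Prometheus, the check frequency is the scrape interval, which can be set per job. The sketch below splits the targets into two jobs. The job names and the split of targets between them are assumptions.

```yaml
# Fragment of prometheus.yml
scrape_configs:
  # Critical endpoints: probe every 15 seconds
  - job_name: "blackbox-http-critical"
    scrape_interval: 15s
    scrape_timeout: 10s        # must not exceed scrape_interval
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.hosted.cloud/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Less critical endpoints: probe every 2 minutes
  - job_name: "blackbox-http-standard"
    scrape_interval: 2m
    scrape_timeout: 10s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://admin.hosted.cloud
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```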

How do I control alert repetition in Alertmanager?

Use the repeat_interval parameter to determine how often the same alert is resent. 1 hour for critical alerts and 4 hours for warning alerts are reasonable values. Unresolved alerts are re-notified at this interval.

Can I perform uptime checks from multiple locations?

Yes, you can set up multi-location monitoring by installing Blackbox Exporter in different geographic locations. This helps detect regional network issues and reduces false alarm rates. Checking from at least 2-3 different locations is recommended.
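
A sketch of how this could look: one scrape job per probe location, each tagged with a region label, plus a rule that fires only when at least two locations see the failure. The exporter hostnames, region names, job names, and alert name are all hypothetical.

```yaml
# prometheus.yml: one job per probe location
scrape_configs:
  - job_name: "blackbox-http-eu"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: [https://hosted.cloud]
        labels:
          region: eu
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu.example.internal:9115

  - job_name: "blackbox-http-us"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: [https://hosted.cloud]
        labels:
          region: us
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us.example.internal:9115
---
# alert-rules.yml: fire only when at least two locations agree
groups:
  - name: uptime-multi-region
    rules:
      - alert: EndpointDownMultiRegion
        expr: count by (instance) (probe_success == 0) >= 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down from multiple locations"
          description: "{{ $labels.instance }} is failing probes from {{ $value }} locations."
```

With only two locations this rule requires both to fail. With three or more, a threshold of 2 tolerates one location having a local network problem.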

Conclusion

Continuously monitor your HTTP, TCP, and DNS endpoints with Prometheus Blackbox Exporter. Send notifications to the right team through the right channel during outages with Alertmanager. Prevent alert fatigue with inhibition rules and severity separation, and visualize your uptime metrics with Grafana dashboards.


Ahmet Yılmaz

Senior Infrastructure Engineer

With over 10 years of experience in cloud infrastructure and DevOps, Ahmet specializes in Kubernetes, Terraform, and high-availability architectures.
