
Uptime Monitoring - Downtime Notification Automation with Alertmanager
Ahmet Yılmaz
Senior Infrastructure Engineer
Detecting that a website or API has gone down before your customers do is a cornerstone of operational reliability. Instead of manual checks, you can set up automated uptime monitoring to continuously run HTTP, TCP, and DNS health checks, and send instant Slack, PagerDuty, or email notifications via Alertmanager when outages occur. This guide covers setting up uptime monitoring with Prometheus Blackbox Exporter, writing alert rules, configuring Alertmanager, and managing on-call.
Why Is Uptime Monitoring Critical?
Downtime directly translates to revenue loss. For an e-commerce site, it can mean thousands of dollars in lost sales per minute; for a SaaS application, it can mean SLA violations and financial penalties. With proactive monitoring, you can detect and respond to issues before users notice them.
| Uptime Target | Annual Allowed Downtime | Monthly Allowed Downtime |
|---|---|---|
| 99% (two nines) | 3 days 15 hours | 7 hours 18 minutes |
| 99.9% (three nines) | 8 hours 46 minutes | 43 minutes 50 seconds |
| 99.99% (four nines) | 52 minutes 36 seconds | 4 minutes 23 seconds |
| 99.999% (five nines) | 5 minutes 16 seconds | 26 seconds |
💡 Realistic Target: For most web applications, 99.9% (three nines) is a reasonable target. This allows approximately 43 minutes of downtime per month. Targets of 99.99% and above require redundant infrastructure, automatic failover, and a 24/7 on-call team.
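The figures in the table follow from a single multiplication. As a worked example for the 99.9% row, assuming an average year of 365.25 days (an assumption that reproduces the table's values; the period is not stated explicitly), with A as the availability target and T as the length of the period:

```latex
\[
\text{allowed downtime} = (1 - A)\cdot T
\]
\[
\text{Yearly: } T = 365.25 \times 24\ \text{h} = 8766\ \text{h}, \qquad
(1 - 0.999) \times 8766\ \text{h} = 8.766\ \text{h} \approx 8\ \text{h}\ 46\ \text{min}
\]
\[
\text{Monthly: } T = \frac{8766\ \text{h}}{12} = 730.5\ \text{h}, \qquad
(1 - 0.999) \times 730.5\ \text{h} = 0.7305\ \text{h} \approx 43\ \text{min}\ 50\ \text{s}
\]
```

Each additional nine divides the budget by ten, which is why four nines leave only about 4 minutes per month.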
Endpoint Monitoring with Blackbox Exporter
Prometheus Blackbox Exporter checks endpoint availability over HTTP, HTTPS, TCP, ICMP, and DNS. It approximates what a real user sees by probing your application from the outside (black-box). The Docker Compose file below runs Prometheus, Blackbox Exporter, and Alertmanager on a shared network:
```yaml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
    ports:
      - "9090:9090"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    ports:
      - "9115:9115"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
```
Blackbox Exporter Configuration
Define which checks the Blackbox Exporter performs using modules:
```yaml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 301, 302]
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: false

  tcp_connect:
    prober: tcp
    timeout: 5s

  dns_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: hosted.cloud
      query_type: A
      transport_protocol: udp

  ssl_expiry:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200, 301, 302]
      tls_config:
        insecure_skip_verify: false
```
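A status code alone can miss an application that answers 200 while reporting itself unhealthy. As a sketch, a module can also inspect the response body. The module name http_health_json and the JSON shape of the health response are assumptions for illustration; fail_if_body_not_matches_regexp is an option of the Blackbox Exporter HTTP prober.

```yaml
modules:
  http_health_json:              # hypothetical module name
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      # Fail the probe unless the body contains "status": "ok"
      # (assumed response format of the health endpoint)
      fail_if_body_not_matches_regexp:
        - '"status":\s*"ok"'
```

A module like this would be added next to the existing ones in blackbox.yml and selected through the module parameter of a scrape job.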
Prometheus Scrape Configuration
Define the scrape config for Prometheus to check target endpoints through the Blackbox Exporter:
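```yaml
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://hosted.cloud
          - https://api.hosted.cloud/health
          - https://admin.hosted.cloud
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting metrics
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115
```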
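The scrape job is only one part of prometheus.yml. For alerts to reach Alertmanager, Prometheus also needs to load the rule file and know where Alertmanager listens. A minimal sketch of those top-level keys, reusing the mount path and service name from the Docker Compose file; the 30s intervals are an assumption, not a recommendation from this guide:

```yaml
# prometheus.yml: top-level keys that sit alongside scrape_configs
global:
  scrape_interval: 30s        # assumed default probe frequency
  evaluation_interval: 30s    # how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alert-rules.yml   # path mounted in the Compose file

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093       # Compose service name and port
```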
Prometheus Alert Rules
Alert rules trigger Alertmanager notifications when specific conditions are met. Essential rules for uptime monitoring:
```yaml
groups:
  - name: uptime-alerts
    rules:
      # Endpoint completely unreachable
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes."

      # Slow response time (over 2 seconds).
      # Sums all HTTP phases; phase="transfer" alone would only cover
      # the time spent downloading the response body.
      - alert: HighResponseTime
        expr: sum without (phase) (probe_http_duration_seconds) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response: {{ $labels.instance }}"
          description: "{{ $labels.instance }} response time has been over 2s for 5 minutes."

      # SSL certificate expiring within 14 days
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring: {{ $labels.instance }}"
          description: "{{ $labels.instance }} SSL certificate expires in {{ $value | humanize }} days."

      # HTTP 5xx status code
      - alert: HTTPStatusError
        expr: probe_http_status_code >= 500
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx: {{ $labels.instance }}"
          description: "{{ $labels.instance }} is returning HTTP {{ $value }} status code."
```
⚠️ Important: Don't set the for duration too short, or temporary network fluctuations will generate false positives. As a starting point, wait at least 2 minutes for critical alerts and 5 minutes for warnings, then tune these values against how quickly your business needs to hear about a real outage.
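One way to see the effect of the for clause is a rule unit test, which can be run with promtool test rules. The sketch below assumes the test file sits next to alert-rules.yml; the file name alert-rules-test.yml and the one-minute sample spacing are choices made for illustration. The probe starts failing at minute 1, and the test expects nothing to fire at that point but expects EndpointDown to be firing by minute 4.

```yaml
# alert-rules-test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One healthy sample, then the probe fails from minute 1 onward
      - series: 'probe_success{instance="https://hosted.cloud", job="blackbox-http"}'
        values: '1 0 0 0 0'
    alert_rule_test:
      # Failing, but the 2m hold period has not elapsed: nothing fires yet
      - eval_time: 1m
        alertname: EndpointDown
        exp_alerts: []
      # Hold period satisfied: the alert is firing
      - eval_time: 4m
        alertname: EndpointDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://hosted.cloud
              job: blackbox-http
            exp_annotations:
              summary: "Endpoint https://hosted.cloud down"
              description: "https://hosted.cloud has been unreachable for 2 minutes."
```

Lengthening the for value in the rule file should make the second assertion fail, which is a quick check that the delay behaves as intended.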
Alertmanager Configuration
Alertmanager receives alerts from Prometheus, groups them, suppresses duplicate notifications, and routes them to the correct channel. Routing, grouping, and inhibition mechanisms prevent alert fatigue.
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T.../B.../xxx'

route:
  receiver: slack-default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
      repeat_interval: 1h
      continue: true    # keep evaluating the routes below
    # Critical and warning alerts also go to Slack
    - matchers:
        - severity=~"critical|warning"
      receiver: slack-alerts

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: slack-alerts
    slack_configs:
      - channel: '#alerts-critical'
        # Needed for the "good" color and RESOLVED title to ever appear
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'

  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: 'your-pagerduty-integration-key'
        severity: critical

inhibit_rules:
  # Suppress HighResponseTime when EndpointDown is active
  - source_matchers:
      - alertname="EndpointDown"
    target_matchers:
      - alertname="HighResponseTime"
    equal: [instance]
```
Uptime Dashboard with Grafana
Visualize Blackbox Exporter metrics in Grafana to monitor the status of all your endpoints from a single panel. Key metrics:
- probe_success: Is the endpoint reachable? 1 = success, 0 = failure. Create a green/red status indicator with a Stat panel.
- probe_http_duration_seconds: HTTP response time, reported separately for the resolve, connect, tls, processing, and transfer phases. Analyze trends with a Time series panel.
- Uptime Percentage (PromQL): Calculate the uptime percentage for the last 30 days with avg_over_time(probe_success[30d]) * 100.
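If the uptime query is used on several dashboards, it can be precomputed with recording rules. The sketch below is one possible setup; the rule names are hypothetical. Note that a 30-day window only returns a full result when Prometheus keeps at least 30 days of data, and the default retention is 15 days.

```yaml
groups:
  - name: uptime-recording
    rules:
      # Share of successful probes over the last 7 days, in percent
      - record: instance:probe_success:avg_7d_percent    # hypothetical name
        expr: avg_over_time(probe_success[7d]) * 100
      # Same calculation over 30 days (requires >= 30d retention)
      - record: instance:probe_success:avg_30d_percent   # hypothetical name
        expr: avg_over_time(probe_success[30d]) * 100
```

A Grafana Stat panel can then query the recorded series directly instead of re-evaluating the range query on every refresh.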
On-Call Management and Preventing Alert Fatigue
An effective alerting system should deliver the right information to the right person at the right time. Excessive alerting creates "alert fatigue" and causes teams to overlook real issues.
Severity Separation
Critical: immediate PagerDuty/phone. Warning: Slack notification. Info: dashboard only.
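A common way to keep info alerts on dashboards only is to route them to a receiver that has no integrations. The fragment below is a sketch to merge into alertmanager.yml; the severity value info and the receiver name "null" are assumptions, since the rules in this guide only define critical and warning.

```yaml
route:
  receiver: slack-default
  routes:
    # Matched first: info alerts stop here and trigger no notification
    - matchers:
        - severity="info"
      receiver: "null"
    # ... the critical and warning routes follow here

receivers:
  - name: "null"    # no *_configs entries, so nothing is sent
```

The alerts remain visible in the Prometheus and Alertmanager UIs and in Grafana, but nobody is paged or pinged for them.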
Inhibition Rules
If a server is completely down, suppress all service alerts on that server. A single root cause notification is sufficient.
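As a sketch of that idea, the rule below mutes every other warning or critical alert that shares a host label with a firing host-level alert. The alert name HostDown and the host label are hypothetical; they would have to exist in your own rules and targets.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="HostDown"            # hypothetical root-cause alert
    target_matchers:
      - alertname!="HostDown"           # never mute the root cause itself
      - severity=~"warning|critical"
    equal: [host]                       # hypothetical label shared by both sides
```

The equal list is what scopes the suppression: alerts from other hosts keep notifying as usual.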
Silence and Maintenance
Create Alertmanager silences for planned maintenance. Each silence has a start and an end time and expires on its own once the end time passes, so set the end time to match the maintenance window.
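Silences are created at runtime rather than in the configuration file. For maintenance that recurs on a fixed schedule, a different Alertmanager mechanism, time intervals, can mute a route from alertmanager.yml instead. The sketch below assumes a weekly window on Sundays from 02:00 to 04:00 UTC; the interval name and the schedule are illustrative.

```yaml
time_intervals:
  - name: weekly-maintenance          # hypothetical interval name
    time_intervals:
      - weekdays: ["sunday"]
        times:
          - start_time: "02:00"       # UTC unless a location is set
            end_time: "04:00"

route:
  receiver: slack-default
  routes:
    - matchers:
        - severity="warning"
      receiver: slack-alerts
      # Notifications for this route are held back during the window
      mute_time_intervals:
        - weekly-maintenance
```

One-off maintenance is still better served by a silence, since it needs no configuration change or reload.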
For server metrics, check our Prometheus + Grafana guide. For centralized log management, see our ELK Stack guide. For SSL certificate management, review our Let's Encrypt Automatic Renewal guide. The Alertmanager Official Documentation and Blackbox Exporter GitHub are useful additional resources.
Frequently Asked Questions
Can I monitor internal services with Blackbox Exporter?
Yes, Blackbox Exporter is not limited to public endpoints. You can monitor internal APIs, database ports, and message queue connections with TCP probes. ClusterIP services in Kubernetes can also be checked.
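A minimal sketch of such a job, reusing the tcp_connect module from blackbox.yml. The host names and ports are hypothetical, and the Blackbox Exporter must run somewhere that can reach them on the network.

```yaml
scrape_configs:
  - job_name: "blackbox-tcp-internal"
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres.internal:5432    # hypothetical database host:port
          - rabbitmq.internal:5672    # hypothetical message queue host:port
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```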
How often should uptime checks run?
For critical services, 15-30 second intervals are recommended. For less critical services, 1-5 minutes is sufficient. Very frequent checks (less than 5 seconds) can put unnecessary load on the target service and generate false alarms.
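In Prometheus, check frequency is the scrape_interval of the job, so targets with different needs can be split across jobs. The split below is an assumption for illustration: it treats the API health endpoint as critical and the admin site as less critical.

```yaml
scrape_configs:
  - job_name: "blackbox-http-critical"
    scrape_interval: 15s
    scrape_timeout: 10s               # must not exceed scrape_interval
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.hosted.cloud/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: "blackbox-http-standard"
    scrape_interval: 2m
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://admin.hosted.cloud
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Keep the for durations of the alert rules in mind when changing intervals: a rule needs several consecutive failed probes inside its hold period to be meaningful.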
How do I control alert repetition in Alertmanager?
Use the repeat_interval parameter to determine how often the same alert is resent. 1 hour for critical alerts and 4 hours for warning alerts are reasonable values. Unresolved alerts are re-notified at this interval.
Can I perform uptime checks from multiple locations?
Yes, you can set up multi-location monitoring by installing Blackbox Exporter in different geographic locations. This helps detect regional network issues and reduces false alarm rates. Checking from at least 2-3 different locations is recommended.
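With several probe locations, an alert can require agreement between them before firing. The rule below is a sketch that assumes each location's Blackbox Exporter is scraped by its own job, so the same instance value appears once per location; the alert name and the threshold of 2 are illustrative.

```yaml
groups:
  - name: multi-location-uptime
    rules:
      - alert: EndpointDownMultiLocation   # hypothetical alert name
        # Count, per endpoint, how many locations currently see a failed probe
        expr: count by (instance) (probe_success == 0) >= 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down from multiple locations"
          description: "{{ $value }} probe locations report {{ $labels.instance }} as unreachable."
```

A failure seen from a single location then points to a regional network problem rather than an outage of the service itself.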
Conclusion
Continuously monitor your HTTP, TCP, and DNS endpoints with Prometheus Blackbox Exporter. Send notifications to the right team through the right channel during outages with Alertmanager. Prevent alert fatigue with inhibition rules and severity separation, and visualize your uptime metrics with Grafana dashboards.
With over 10 years of experience in cloud infrastructure and DevOps, Ahmet specializes in Kubernetes, Terraform, and high-availability architectures.