Downtime Detection
What This Means
Downtime occurs when your website becomes inaccessible to users. Detecting downtime quickly is critical for minimizing impact on users, revenue, and search rankings.
Impact
- Lost revenue - Every minute of downtime costs money
- SEO impact - Prolonged or repeated downtime can lower search rankings
- User trust - Unreliable sites lose visitors
- Tracking gaps - Analytics data missing during outages
- SLA violations - Business and contractual impacts
Types of Downtime
| Type | Description | Detection |
|---|---|---|
| Full outage | Site completely inaccessible | HTTP request fails |
| Partial outage | Some pages/features unavailable | Specific endpoint monitoring |
| Degraded performance | Site accessible but slow | Response time monitoring |
| Regional outage | Issues in specific locations | Multi-location monitoring |
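The distinctions in the table can be sketched in code. The following is a minimal illustration, not a description of how any particular monitoring service works: the `ProbeResult` type, the `classify` function, and the 2-second slowness threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: responses slower than this count as degraded.
SLOW_THRESHOLD_SECONDS = 2.0


@dataclass
class ProbeResult:
    location: str            # where the probe ran, e.g. "us-east"
    endpoint: str            # URL path that was checked
    status: Optional[int]    # HTTP status, or None if the request failed
    elapsed_seconds: float   # time until the response (or the failure)


def is_failure(result: ProbeResult) -> bool:
    """A probe fails when no response arrived or the server returned 5xx."""
    return result.status is None or result.status >= 500


def classify(results: list[ProbeResult]) -> str:
    """Map a batch of probe results to one of the downtime types above."""
    failed = [r for r in results if is_failure(r)]
    if not failed:
        slow = any(r.elapsed_seconds > SLOW_THRESHOLD_SECONDS for r in results)
        return "degraded performance" if slow else "operational"
    if len(failed) == len(results):
        return "full outage"
    failed_locations = {r.location for r in failed}
    healthy_locations = {r.location for r in results if not is_failure(r)}
    # Locations where every probe failed, while others are healthy,
    # point to a regional problem rather than a broken page or feature.
    if failed_locations.isdisjoint(healthy_locations):
        return "regional outage"
    return "partial outage"
```

A real classifier would likely look at several consecutive probe rounds before deciding, since a single failed request can be a transient network blip.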
How to Monitor
External Monitoring Services
- UptimeRobot - Free tier available
- Pingdom - Advanced features
- StatusCake - Multi-location
- Better Uptime - Status pages included
What to Monitor
- Homepage - Primary availability check
- Key pages - Product pages, checkout, login
- API endpoints - Critical functionality
- Third-party services - Payment, CDN, etc.
Monitoring Configuration
# Example monitoring setup
checks:
  - name: Homepage
    url: https://example.com
    interval: 60  # seconds
    timeout: 30   # seconds
    alerts:
      - type: email
        address: ops@example.com
      - type: slack
        webhook: https://hooks.slack.com/...
  - name: API Health
    url: https://api.example.com/health
    interval: 30  # seconds
    expected_status: 200
    expected_content: "ok"
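To show what a single check from this configuration does, here is a minimal sketch using only the Python standard library. The function names `evaluate` and `run_check` are assumptions made for this example, and their parameters simply mirror the `timeout`, `expected_status`, and `expected_content` keys above. Hosted monitoring services implement this differently and add retries, scheduling, and alerting.

```python
import urllib.error
import urllib.request
from typing import Optional


def evaluate(status: int, body: str, expected_status: int = 200,
             expected_content: Optional[str] = None) -> bool:
    """Return True when the response matches the check's expectations."""
    if status != expected_status:
        return False
    if expected_content is not None and expected_content not in body:
        return False
    return True


def run_check(url: str, timeout: float = 30.0, expected_status: int = 200,
              expected_content: Optional[str] = None) -> bool:
    """Fetch the URL once and report whether the check passed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate(response.status, body, expected_status, expected_content)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still
        # available, but this sketch does not read the error body.
        return evaluate(err.code, "", expected_status, expected_content)
    except (urllib.error.URLError, TimeoutError, OSError):
        # DNS failure, refused connection, or timeout: treat as down.
        return False
```

For the API health check above, the call would be `run_check("https://api.example.com/health", expected_status=200, expected_content="ok")`, repeated on whatever interval the scheduler uses.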
Response Procedures
1. Immediate Response
When downtime is detected:
- Acknowledge alert - Prevent escalation
- Verify issue - Check from multiple locations
- Check status pages - Hosting, CDN, third-party services
- Begin diagnosis - Server logs, error messages
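The "check from multiple locations" step can be automated so that one failing probe does not page anyone on its own. The sketch below assumes a hypothetical `confirmed_down` helper that receives one pass/fail result per location. The majority quorum of 0.5 is an illustrative default, not a recommendation from any specific tool.

```python
def confirmed_down(results: dict[str, bool], quorum: float = 0.5) -> bool:
    """Decide whether an outage is confirmed across probe locations.

    `results` maps a location name to True if its check passed.
    The site counts as down when the share of failing locations
    is strictly greater than `quorum`.
    """
    if not results:
        # No data is not evidence of an outage.
        return False
    failures = sum(1 for passed in results.values() if not passed)
    return failures / len(results) > quorum
```

With three locations and the default quorum, two failures confirm the outage while a single failure does not, which filters out problems local to one probe.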
2. Communication
- Update the internal status channel
- Post to the public status page if the outage is prolonged
- Notify affected customers if necessary
3. Resolution
- Implement fix or failover
- Verify recovery from multiple locations
- Document incident and root cause
4. Post-Incident
- Conduct post-mortem analysis
- Implement preventive measures
- Update monitoring if gaps were identified
Status Page Best Practices
Create a public status page:
## Current Status
All systems operational ✅
## Components
- Website: Operational
- API: Operational
- Payments: Operational
- CDN: Operational
## Recent Incidents
- [Date] Brief description - Resolved
Hosting options:
- Statuspage.io
- Cachet (self-hosted)
- Upptime (GitHub-based)
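If the status page is generated rather than edited by hand, the template above can be produced from component states. This is a minimal sketch: `render_status_page` is a hypothetical helper, and the headline used when a component is not operational and the "No recent incidents" line are wording assumed for this example. The hosted tools listed above have their own formats and APIs.

```python
def render_status_page(components: dict[str, str], incidents: list[str]) -> str:
    """Render component states and incidents in the template's layout."""
    all_ok = all(state == "Operational" for state in components.values())
    # The non-operational headline is an assumed wording for this sketch.
    headline = ("All systems operational ✅" if all_ok
                else "Some systems are experiencing issues")
    lines = ["## Current Status", headline, "## Components"]
    for name, state in components.items():
        lines.append(f"- {name}: {state}")
    lines.append("## Recent Incidents")
    if incidents:
        lines.extend(f"- {entry}" for entry in incidents)
    else:
        lines.append("- No recent incidents")
    return "\n".join(lines)
```

Deriving the headline from the component list keeps the two from contradicting each other during an incident.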
Alerting Strategy
Severity Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical | Immediate | Full outage, data loss |
| High | < 15 min | Partial outage, degraded performance |
| Medium | < 1 hour | Non-critical feature failure |
| Low | Next business day | Minor issues, cosmetic bugs |
Alert Routing
- Critical: Phone call, SMS, Slack, Email
- High: Slack, Email
- Medium: Email, Ticket
- Low: Ticket only
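The routing rules above amount to a lookup table from severity to channels. The following sketch encodes them that way. The `Severity` enum, the `ROUTES` mapping, and the `route_alert` function are names invented for this example, and the `senders` callables stand in for real integrations (phone, SMS, Slack, email, ticketing) that are not shown.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Channels per severity, copied from the routing list above.
ROUTES: dict[Severity, list[str]] = {
    Severity.CRITICAL: ["phone", "sms", "slack", "email"],
    Severity.HIGH: ["slack", "email"],
    Severity.MEDIUM: ["email", "ticket"],
    Severity.LOW: ["ticket"],
}


def route_alert(severity: Severity, message: str,
                senders: dict[str, Callable[[str], None]]) -> list[str]:
    """Send the message on every channel configured for the severity.

    Returns the channels that were actually notified; channels without
    a registered sender are skipped.
    """
    notified = []
    for channel in ROUTES[severity]:
        send = senders.get(channel)
        if send is None:
            continue
        send(f"[{severity.value.upper()}] {message}")
        notified.append(channel)
    return notified
```

Keeping the mapping in one data structure makes it easy to review alongside the severity table and to change without touching the sending logic.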