High availability
High availability (HA) configurations ensure Grafana remains accessible even when individual instances fail. This guide covers HA deployment architectures, configuration, and operational considerations.
Understanding high availability mode
Grafana supports running multiple instances that share a common database. This configuration provides:
- Redundancy - Service continues if an instance fails
- Load distribution - Requests distributed across instances
- Zero-downtime updates - Rolling updates without service interruption
- Horizontal scalability - Add instances to handle increased load
High availability mode setting
The high_availability database setting controls how Grafana handles shared state:
When set to true, Grafana:
- Relies on the database for coordination between instances
- Uses database-based locking for background tasks
- Enables distributed caching mechanisms
When set to false (single instance mode), Grafana:
- Runs background tasks in-process
- Uses simpler, non-distributed algorithms
- Assumes single instance deployment
Default: true (refer to conf/defaults.ini:160)
Important: Only set to false if you run a single Grafana instance.
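As a minimal sketch, the setting could be written as follows. Placing it under the [database] section is an assumption based on its description above as a database setting; confirm the section and the default against conf/defaults.ini for your Grafana version.

```ini
# Sketch only: the [database] section placement is an assumption;
# verify it against conf/defaults.ini for your Grafana version.
[database]
# Keep this true whenever more than one Grafana instance shares the database.
high_availability = true
```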
Architecture components
A typical HA deployment includes the following components.
Multiple Grafana instances
Run two or more Grafana instances with:
- Identical configuration files
- Shared external database
- Shared session storage (optional but recommended)
- Same version and plugins
Shared database
Use an external database for shared state:
- PostgreSQL - Recommended for production
- MySQL - Suitable for production
- SQLite - Not supported for HA (single file limitation)
Load balancer
Distribute requests across instances:
- Layer 7 (HTTP) load balancing
- Health check endpoints (/api/health)
- Session affinity (optional, see session storage)
- TLS termination (recommended)
Session storage
Share sessions across instances:
- Redis - Recommended for production
- Memcached - Alternative distributed cache
- Database - Default, uses Grafana database
Configuration for high availability
Database configuration
Configure every Grafana instance to use the same shared PostgreSQL database. Wherever a configuration example shows <PASSWORD>, replace it with your database password.
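A minimal sketch of the [database] section for a shared PostgreSQL server might look like this. The host name, database name, user, and ssl_mode value are illustrative assumptions, not values taken from this guide. Use identical settings on every instance.

```ini
# Hypothetical host, database name, and user; adjust for your environment.
[database]
type = postgres
host = postgres.example.internal:5432
name = grafana
user = grafana
# If the password contains # or ; wrap it in triple quotes, e.g. """pa#ss"""
password = <PASSWORD>
# Assumed value; choose the SSL mode your PostgreSQL server requires.
ssl_mode = require
```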
MySQL example:
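The MySQL equivalent might look like the following sketch, again with a hypothetical host, database name, and user:

```ini
# Hypothetical host, database name, and user; adjust for your environment.
[database]
type = mysql
host = mysql.example.internal:3306
name = grafana
user = grafana
password = <PASSWORD>
```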
Session storage configuration
Configure Redis for shared sessions. Wherever a configuration example shows <REDIS_PASSWORD>, replace it with your Redis password.
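One way to express this is through the remote_cache section that the troubleshooting steps in this guide refer to. The Redis address, database index, and pool size below are illustrative assumptions. Check the connstr options against conf/defaults.ini for your version.

```ini
# Hypothetical Redis address; db and pool_size values are illustrative.
[remote_cache]
type = redis
connstr = addr=redis.example.internal:6379,pool_size=100,db=0,password=<REDIS_PASSWORD>,ssl=false
```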
Memcached example:
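A comparable sketch for Memcached, assuming a hypothetical server address:

```ini
# Hypothetical Memcached address; adjust for your environment.
[remote_cache]
type = memcached
connstr = memcached.example.internal:11211
```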
Refer to conf/defaults.ini:231 for cache configuration options.
Server configuration
Optionally configure each instance with a unique identifier, which is helpful for debugging.
Live features configuration
For real-time features such as live dashboards, configure a message bus.
Deployment architectures
Basic HA deployment
- 2+ Grafana instances (GF1, GF2)
- Load balancer distributing traffic
- Shared PostgreSQL database
Production HA deployment
- 3+ Grafana instances for resilience
- HA load balancer pair
- PostgreSQL with replication
- Redis HA for session storage
Kubernetes deployment
Load balancing configuration
Health checks
Configure your load balancer to use Grafana health endpoints:
Liveness check:
- Endpoint: /livez
- Expected status: 200 OK
- Use for: Detecting dead instances
Readiness check:
- Endpoint: /readyz
- Expected status: 200 OK
- Use for: Routing traffic only to ready instances
API health check:
- Endpoint: /api/health
- Expected status: 200 OK
- Use for: Comprehensive health verification
NGINX example
HAProxy example
Database high availability
The shared database is a critical component, so make it highly available as well.
PostgreSQL replication
Set up PostgreSQL streaming replication:
- Configure primary server for replication
- Set up one or more replica servers
- Use connection pooler (PgBouncer) for load distribution
- Configure automatic failover (Patroni, repmgr)
MySQL replication
Set up MySQL replication:
- Configure primary-replica replication
- Use ProxySQL or MySQL Router for connection routing
- Configure automatic failover with orchestration tools
Session affinity considerations
Session affinity (sticky sessions) is not required when using shared session storage (Redis/Memcached). Without shared storage, consider enabling session affinity.
With shared session storage:
- No session affinity needed
- True load distribution
- Better failover handling
Without shared session storage:
- Enable session affinity in load balancer
- Sessions lost on instance failure
- Less efficient load distribution
Scaling considerations
Horizontal scaling
Add more Grafana instances to handle increased load:
- Monitor CPU and memory usage
- Add instances when metrics exceed thresholds
- Instances are stateless (with shared storage)
- Scale based on request rate and user count
Database scaling
Scale the database for performance:
- Increase connection pool size per instance
- Add read replicas for query load
- Upgrade database server resources
- Optimize slow queries
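As a rough sketch of the connection pool tuning mentioned in the list above, the [database] section accepts pool settings such as the following. The numbers are illustrative assumptions, not recommendations from this guide.

```ini
# Illustrative values only; size the pool for your own workload.
[database]
# Upper bound on open connections from this instance to the database.
max_open_conn = 50
# Idle connections kept ready for reuse.
max_idle_conn = 10
# Seconds a connection may be reused before it is recycled.
conn_max_lifetime = 14400
```

Because every instance keeps its own pool, the database must accept the combined total. With three instances and max_open_conn = 50, for example, Grafana alone can open up to 150 connections.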
Cache scaling
Scale Redis/Memcached for session storage:
- Use Redis cluster for horizontal scaling
- Configure Redis persistence for durability
- Monitor cache hit rates
Operational procedures
Rolling updates
Perform zero-downtime updates:
- Update one instance at a time
- Wait for health checks to pass
- Verify functionality before updating next instance
- Roll back if issues detected
Monitoring HA deployments
Monitor these metrics across all instances:
- Instance health status
- Request distribution across instances
- Database connection pool usage
- Cache hit rates
- Response times per instance
- Error rates per instance
Backup in HA environments
Back up the shared database and configuration:
- Database backups cover all instances
- Back up configuration from one instance
- Test restore procedures regularly
- Consider point-in-time recovery for databases
Troubleshooting
Instance registration issues
If instances don’t recognize each other:
- Verify high_availability = true in configuration
- Check database connectivity from all instances
- Review logs for database connection errors
- Ensure instances use the same database schema version
Session persistence issues
If users experience session drops:
- Verify Redis/Memcached connectivity
- Check remote_cache configuration
- Monitor cache expiration settings
- Review load balancer session affinity settings
Load distribution problems
If traffic isn’t distributed evenly:
- Check load balancer algorithm (use leastconn)
- Verify health checks are working
- Review instance resource utilization
- Check for instance-specific errors
Limitations and constraints
- Database requirement - External database required (PostgreSQL or MySQL)
- SQLite not supported - Cannot use SQLite in HA mode
- Plugin data - File-based plugin data not automatically synchronized
- Image rendering - May require additional configuration for rendering service
- Configuration drift - Ensure configurations stay synchronized across instances
Next steps
- Set up monitoring for HA metrics
- Configure backup procedures for shared database
- Plan upgrade procedures for rolling updates