High Availability Setup
Production deployment patterns for enterprise-grade reliability and uptime
Achieve enterprise-grade reliability with CoAI.Dev's high availability deployment patterns. This guide covers redundancy, failover, disaster recovery, and operational practices for mission-critical AI service deployments.
Overview
High availability (HA) ensures your CoAI.Dev platform remains operational even during:
- Hardware failures: Server, storage, or network component failures
- Software issues: Application crashes or service degradation
- Maintenance windows: Planned updates and maintenance
- Traffic spikes: Sudden increases in user demand
- Regional outages: Cloud provider or datacenter incidents
Reliability Target
Properly configured HA setups can achieve 99.9%+ uptime (less than 8.76 hours downtime per year) with automatic recovery from most failure scenarios.
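The downtime budget behind an availability target is simple arithmetic. A quick sketch converting a target percentage into allowed downtime per year (the 99.99% figure is an extra illustration, not a claim about this setup):

```shell
# Convert an availability target into an annual downtime budget
availability=99.9
awk -v a="$availability" 'BEGIN {
  minutes_per_year = 365 * 24 * 60            # 525,600 minutes
  downtime = minutes_per_year * (1 - a / 100)
  printf "%.2f%% uptime allows %.1f minutes (%.2f hours) of downtime per year\n",
         a, downtime, downtime / 60
}'
# 99.9% -> 8.76 hours/year; 99.99% -> ~52.6 minutes/year
```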
Architecture Patterns
Multi-Tier Redundancy
Load Balanced Application Servers
```yaml
# Docker Compose HA Setup
version: '3.8'

services:
  # Load Balancer
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl/certs
    depends_on:
      - app1
      - app2
      - app3

  # Application Instances
  app1:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app1
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

  app2:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app2
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

  app3:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app3
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

volumes:
  shared-storage:
    # NFS shares are mounted through the local driver with nfs options
    driver: local
    driver_opts:
      type: nfs
      o: "addr=nfs-server,rw"
      device: ":/exports/chatnio"
```
Nginx Configuration:
```nginx
upstream chatnio_backend {
    least_conn;
    server app1:8000 max_fails=3 fail_timeout=30s;
    server app2:8000 max_fails=3 fail_timeout=30s;
    server app3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://chatnio_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Retry the next upstream on errors
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```
Kubernetes High Availability
Production-Ready K8s Deployment
Create Namespace and ConfigMap
```yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chatnio-prod
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chatnio-config
  namespace: chatnio-prod
data:
  app.yml: |
    database:
      host: mysql-primary.chatnio-prod.svc.cluster.local
      port: 3306
      database: chatnio
      charset: utf8mb4

    redis:
      sentinels:
        - host: sentinel-1.chatnio-prod.svc.cluster.local
          port: 26379
        - host: sentinel-2.chatnio-prod.svc.cluster.local
          port: 26379
        - host: sentinel-3.chatnio-prod.svc.cluster.local
          port: 26379
      master_name: mymaster

    logging:
      level: info
      format: json
```
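The Redis settings above assume a three-node Sentinel deployment monitoring a master named `mymaster`. A minimal `sentinel.conf` for each sentinel instance might look like the following; the `redis-primary` service name and the quorum of 2 are assumptions chosen to match the ConfigMap:

```
# sentinel.conf (one copy per sentinel instance)
port 26379
sentinel monitor mymaster redis-primary.chatnio-prod.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
```

With a quorum of 2, any two sentinels must agree the master is down before a failover begins, so the loss of a single sentinel does not block (or trigger) failover.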
Deploy Application with Redundancy
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatnio-app
  namespace: chatnio-prod
  labels:
    app: chatnio
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: chatnio
  template:
    metadata:
      labels:
        app: chatnio
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - chatnio
                topologyKey: kubernetes.io/hostname
      containers:
        - name: chatnio
          image: programzmh/chatnio:stable
          ports:
            - containerPort: 8000
          env:
            - name: CONFIG_FILE
              value: /config/app.yml
          envFrom:
            - secretRef:
                name: chatnio-secrets
          volumeMounts:
            - name: config
              mountPath: /config
            - name: storage
              mountPath: /storage
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      volumes:
        - name: config
          configMap:
            name: chatnio-config
        - name: storage
          persistentVolumeClaim:
            claimName: chatnio-storage
```
Configure Load Balancing
```yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: chatnio-service
  namespace: chatnio-prod
spec:
  selector:
    app: chatnio
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatnio-ingress
  namespace: chatnio-prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    # Rate limit: 100 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - yourdomain.com
      secretName: chatnio-tls
  rules:
    - host: yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chatnio-service
                port:
                  number: 80
```
Set Up Horizontal Pod Autoscaler
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chatnio-hpa
  namespace: chatnio-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chatnio-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
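The scaleDown policy above removes at most 50% of the current replicas per 60-second period, so the ramp from maxReplicas back to minReplicas is gradual rather than abrupt. A rough sketch of the worst-case ramp, ignoring the stabilization window and the HPA's exact rounding behavior:

```shell
# Approximate the scale-down ramp: at most 50% of replicas removed per 60s,
# never dropping below minReplicas (3)
awk 'BEGIN {
  replicas = 20; min = 3; t = 0
  while (replicas > min) {
    replicas = int(replicas / 2)      # remove up to 50%, rounded down
    if (replicas < min) replicas = min
    t += 60
    printf "t=%3ds -> %2d replicas\n", t, replicas
  }
}'
# t= 60s -> 10 replicas
# t=120s ->  5 replicas
# t=180s ->  3 replicas
```

Combined with the 300-second stabilization window, this keeps capacity in place long enough to absorb a follow-up traffic spike after the first one subsides.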
Disaster Recovery
Backup and Recovery Strategy
Automated MySQL Backup:
```bash
#!/bin/bash
# backup-mysql.sh
set -o pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/mysql"
DB_HOST="mysql-primary"
DB_NAME="chatnio"
DB_USER="backup_user"
DB_PASS="backup_password"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Perform backup
mysqldump -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --hex-blob \
  --quick \
  --lock-tables=false \
  "$DB_NAME" | gzip > "$BACKUP_DIR/chatnio_$DATE.sql.gz"

# Verify the dump succeeded before uploading or pruning
# (with pipefail, $? reflects a mysqldump failure, not just gzip's status)
if [ $? -eq 0 ]; then
  echo "Backup completed successfully: chatnio_$DATE.sql.gz"

  # Upload to cloud storage
  aws s3 cp "$BACKUP_DIR/chatnio_$DATE.sql.gz" s3://chatnio-backups/mysql/

  # Cleanup old backups (keep 30 days)
  find "$BACKUP_DIR" -name "*.sql.gz" -mtime +30 -delete

  # Send notification
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"✅ MySQL backup completed: chatnio_'$DATE'.sql.gz"}'
else
  echo "Backup failed!"
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"❌ MySQL backup failed!"}'
  exit 1
fi
```
Cron Schedule:
```bash
# Run backup daily at 2 AM
0 2 * * * /opt/scripts/backup-mysql.sh

# Run verification weekly
0 3 * * 0 /opt/scripts/verify-backup.sh
```
Monitoring and Alerting
Comprehensive Monitoring Stack
Deploy Prometheus and Grafana
```yaml
# monitoring-stack.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'chatnio'
        static_configs:
          - targets: ['chatnio-service:80']
        metrics_path: /metrics
        scrape_interval: 30s

      - job_name: 'mysql'
        static_configs:
          - targets: ['mysql-exporter:9104']

      - job_name: 'redis'
        static_configs:
          - targets: ['redis-exporter:9121']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-storage
```
Configure Alert Rules
```yaml
# alert-rules.yaml
groups:
  - name: chatnio.rules
    rules:
      - alert: HighCPUUsage
        expr: rate(cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: DatabaseDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL database is down"
          description: "MySQL database is not responding"

      - alert: HighMemoryUsage
        expr: memory_usage_bytes / memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 90%"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: ReplicationLag
        expr: mysql_slave_lag_seconds > 60
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag is high"
          description: "Replication lag is {{ $value }} seconds"
```
Set Up Alertmanager
```yaml
# alertmanager-config.yaml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@yourdomain.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#alerts'
        text: 'Alert: {{ .GroupLabels.alertname }}'

  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#critical-alerts'
        text: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: 'oncall@yourdomain.com'
        subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        body: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#warnings'
        text: '⚠️ Warning: {{ .GroupLabels.alertname }}'
```
Performance Optimization
High-Performance Configuration
Application Tuning:
```yaml
# High-performance environment variables
environment:
  - WORKER_THREADS=8
  - MAX_CONCURRENT_REQUESTS=500
  - CONNECTION_POOL_SIZE=50
  - CACHE_SIZE=1GB
  - LOG_LEVEL=warn
  - COMPRESSION_ENABLED=true
  - KEEP_ALIVE_TIMEOUT=65
  - REQUEST_TIMEOUT=30
```
Database Optimization:
```ini
# MySQL performance configuration (my.cnf, [mysqld] section)
# innodb_log_file_size and performance_schema are static variables: they
# cannot be changed with SET GLOBAL and require a server restart.
[mysqld]
innodb_buffer_pool_size = 4G
innodb_log_file_size    = 1G
max_connections         = 500
tmp_table_size          = 128M
max_heap_table_size     = 128M
# Note: the query cache (query_cache_size) was removed in MySQL 8.0

# Enable query performance insights
performance_schema = ON
slow_query_log     = ON
long_query_time    = 2
```
Redis Optimization:
```conf
# redis.conf optimizations
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
tcp-keepalive 300
timeout 300
tcp-backlog 511
```
Operational Procedures
Health Check Automation
```bash
#!/bin/bash
# health-check.sh

HEALTH_CHECK_URL="https://yourdomain.com/health"
CRITICAL_ENDPOINTS=(
  "https://yourdomain.com/api/v1/status"
  "https://yourdomain.com/api/v1/models"
)

# Check an endpoint and report its HTTP status
check_endpoint() {
  local url=$1
  local response
  response=$(curl -s -w "%{http_code}" -o /dev/null "$url")
  if [ "$response" -eq 200 ]; then
    echo "✅ $url - OK"
    return 0
  else
    echo "❌ $url - FAILED (HTTP $response)"
    return 1
  fi
}

# Check main health endpoint
if ! check_endpoint "$HEALTH_CHECK_URL"; then
  echo "🚨 Main health check failed!"
  # Trigger alert
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"🚨 CRITICAL: Main health check failed!"}'
  exit 1
fi

# Check critical endpoints
failed_count=0
for endpoint in "${CRITICAL_ENDPOINTS[@]}"; do
  if ! check_endpoint "$endpoint"; then
    ((failed_count++))
  fi
done

if [ "$failed_count" -gt 0 ]; then
  echo "⚠️ $failed_count critical endpoints failed"
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"⚠️ WARNING: '$failed_count' critical endpoints failed"}'
  exit 1
fi

echo "✅ All health checks passed"
```
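To catch failures quickly, the script above can run on a tight schedule, e.g. every minute from cron (the script path and log location are assumptions, matching the backup crontab's layout):

```
# Run the health check every minute and keep a log for auditing
* * * * * /opt/scripts/health-check.sh >> /var/log/health-check.log 2>&1
```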
Automated Failover
```bash
#!/bin/bash
# auto-failover.sh

PRIMARY_DB="mysql-primary"
REPLICA_DB="mysql-replica"
APP_NAMESPACE="chatnio-prod"

# Check primary database (assumes MySQL credentials in ~/.my.cnf)
if ! mysqladmin ping -h "$PRIMARY_DB" --silent; then
  echo "Primary database is down. Initiating failover..."

  # Promote the replica: stop replication, clear its replica config,
  # and make it writable
  mysql -h "$REPLICA_DB" \
    -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = OFF;"

  # Point the application at the new primary
  # (this overwrites app.yml; merge carefully if it holds other settings)
  kubectl patch configmap chatnio-config -n "$APP_NAMESPACE" \
    --patch '{"data":{"app.yml":"database:\n  host: mysql-replica\n"}}'

  # Restart application pods to pick up the new configuration
  kubectl rollout restart deployment/chatnio-app -n "$APP_NAMESPACE"

  # Send alert
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"🔄 Database failover completed. Primary: mysql-replica"}'

  echo "Failover completed"
else
  echo "Primary database is healthy"
fi
```
High availability deployment ensures your CoAI.Dev platform can handle failures gracefully and maintain service continuity. Regular testing of failover procedures and monitoring system health are essential for maintaining reliability.