High Availability Setup
Production deployment patterns for enterprise-grade reliability and uptime
Achieve enterprise-grade reliability with CoAI.Dev's high availability deployment patterns. This guide covers redundancy, failover, disaster recovery, and operational practices for mission-critical AI service deployments.
Overview
High availability (HA) ensures your CoAI.Dev platform remains operational even during:
- Hardware failures: Server, storage, or network component failures
- Software issues: Application crashes or service degradation
- Maintenance windows: Planned updates and maintenance
- Traffic spikes: Sudden increases in user demand
- Regional outages: Cloud provider or datacenter incidents
Reliability Target
Properly configured HA setups can achieve 99.9%+ uptime (less than 8.76 hours downtime per year) with automatic recovery from most failure scenarios.
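The downtime budget behind an availability target is simple arithmetic. A quick sketch converting a target percentage into allowed downtime per year (the 99.99% figure is an extra illustration, not a claim about this setup):

```shell
# Convert an availability target into an annual downtime budget
availability=99.9
awk -v a="$availability" 'BEGIN {
  minutes_per_year = 365 * 24 * 60            # 525,600 minutes
  downtime = minutes_per_year * (1 - a / 100)
  printf "%.2f%% uptime allows %.1f minutes (%.2f hours) of downtime per year\n",
         a, downtime, downtime / 60
}'
# 99.9% -> 8.76 hours/year; 99.99% -> ~52.6 minutes/year
```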
Architecture Patterns
Multi-Tier Redundancy
Load Balanced Application Servers
```yaml
# Docker Compose HA Setup
version: '3.8'

services:
  # Load Balancer
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl/certs
    depends_on:
      - app1
      - app2
      - app3

  # Application Instances
  app1:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app1
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

  app2:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app2
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

  app3:
    image: programzmh/chatnio:latest
    environment:
      - NODE_ID=app3
      - MYSQL_HOST=mysql-primary
      - REDIS_HOST=redis-primary
    volumes:
      - shared-storage:/storage
    restart: unless-stopped

volumes:
  shared-storage:
    # NFS shares are mounted through the local driver with nfs options
    driver: local
    driver_opts:
      type: nfs
      o: "addr=nfs-server,rw"
      device: ":/exports/chatnio"
```
Nginx Configuration:
```nginx
upstream chatnio_backend {
    least_conn;
    server app1:8000 max_fails=3 fail_timeout=30s;
    server app2:8000 max_fails=3 fail_timeout=30s;
    server app3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://chatnio_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Retry the next upstream on errors
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```
Kubernetes High Availability
Production-Ready K8s Deployment
Create Namespace and ConfigMap
```yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chatnio-prod
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chatnio-config
  namespace: chatnio-prod
data:
  app.yml: |
    database:
      host: mysql-primary.chatnio-prod.svc.cluster.local
      port: 3306
      database: chatnio
      charset: utf8mb4

    redis:
      sentinels:
        - host: sentinel-1.chatnio-prod.svc.cluster.local
          port: 26379
        - host: sentinel-2.chatnio-prod.svc.cluster.local
          port: 26379
        - host: sentinel-3.chatnio-prod.svc.cluster.local
          port: 26379
      master_name: mymaster

    logging:
      level: info
      format: json
```
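The Redis settings above assume a three-node Sentinel deployment monitoring a master named `mymaster`. A minimal `sentinel.conf` for each sentinel instance might look like the following; the `redis-primary` service name and the quorum of 2 are assumptions chosen to match the ConfigMap:

```
# sentinel.conf (one copy per sentinel instance)
port 26379
sentinel monitor mymaster redis-primary.chatnio-prod.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
```

With a quorum of 2, any two sentinels must agree the master is down before a failover begins, so the loss of a single sentinel does not block (or trigger) failover.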
Deploy Application with Redundancy
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatnio-app
  namespace: chatnio-prod
  labels:
    app: chatnio
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: chatnio
  template:
    metadata:
      labels:
        app: chatnio
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - chatnio
                topologyKey: kubernetes.io/hostname
      containers:
        - name: chatnio
          image: programzmh/chatnio:stable
          ports:
            - containerPort: 8000
          env:
            - name: CONFIG_FILE
              value: /config/app.yml
          envFrom:
            - secretRef:
                name: chatnio-secrets
          volumeMounts:
            - name: config
              mountPath: /config
            - name: storage
              mountPath: /storage
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      volumes:
        - name: config
          configMap:
            name: chatnio-config
        - name: storage
          persistentVolumeClaim:
            claimName: chatnio-storage
```
Configure Load Balancing
```yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: chatnio-service
  namespace: chatnio-prod
spec:
  selector:
    app: chatnio
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatnio-ingress
  namespace: chatnio-prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    # Rate limit: 100 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - yourdomain.com
      secretName: chatnio-tls
  rules:
    - host: yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chatnio-service
                port:
                  number: 80
```
Set Up Horizontal Pod Autoscaler
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chatnio-hpa
  namespace: chatnio-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chatnio-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
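The scaleDown policy above removes at most 50% of the current replicas per 60-second period, so the ramp from maxReplicas back to minReplicas is gradual rather than abrupt. A rough sketch of the worst-case ramp, ignoring the stabilization window and the HPA's exact rounding behavior:

```shell
# Approximate the scale-down ramp: at most 50% of replicas removed per 60s,
# never dropping below minReplicas (3)
awk 'BEGIN {
  replicas = 20; min = 3; t = 0
  while (replicas > min) {
    replicas = int(replicas / 2)      # remove up to 50%, rounded down
    if (replicas < min) replicas = min
    t += 60
    printf "t=%3ds -> %2d replicas\n", t, replicas
  }
}'
# t= 60s -> 10 replicas
# t=120s ->  5 replicas
# t=180s ->  3 replicas
```

Combined with the 300-second stabilization window, this keeps capacity in place long enough to absorb a follow-up traffic spike after the first one subsides.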
Disaster Recovery
Backup and Recovery Strategy
Automated MySQL Backup:
```bash
#!/bin/bash
# backup-mysql.sh
set -o pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/mysql"
DB_HOST="mysql-primary"
DB_NAME="chatnio"
DB_USER="backup_user"
DB_PASS="backup_password"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Perform backup
mysqldump -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --hex-blob \
  --quick \
  --lock-tables=false \
  "$DB_NAME" | gzip > "$BACKUP_DIR/chatnio_$DATE.sql.gz"

# Verify the dump succeeded before uploading or pruning
# (with pipefail, $? reflects a mysqldump failure, not just gzip's status)
if [ $? -eq 0 ]; then
  echo "Backup completed successfully: chatnio_$DATE.sql.gz"

  # Upload to cloud storage
  aws s3 cp "$BACKUP_DIR/chatnio_$DATE.sql.gz" s3://chatnio-backups/mysql/

  # Cleanup old backups (keep 30 days)
  find "$BACKUP_DIR" -name "*.sql.gz" -mtime +30 -delete

  # Send notification
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"✅ MySQL backup completed: chatnio_'$DATE'.sql.gz"}'
else
  echo "Backup failed!"
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"❌ MySQL backup failed!"}'
  exit 1
fi
```
Cron Schedule:
```bash
# Run backup daily at 2 AM
0 2 * * * /opt/scripts/backup-mysql.sh

# Run verification weekly
0 3 * * 0 /opt/scripts/verify-backup.sh
```
Monitoring and Alerting
Comprehensive Monitoring Stack
Deploy Prometheus and Grafana
```yaml
# monitoring-stack.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'chatnio'
        static_configs:
          - targets: ['chatnio-service:80']
        metrics_path: /metrics
        scrape_interval: 30s

      - job_name: 'mysql'
        static_configs:
          - targets: ['mysql-exporter:9104']

      - job_name: 'redis'
        static_configs:
          - targets: ['redis-exporter:9121']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-storage
```
Configure Alert Rules
```yaml
# alert-rules.yaml
groups:
  - name: chatnio.rules
    rules:
      - alert: HighCPUUsage
        expr: rate(cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: DatabaseDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL database is down"
          description: "MySQL database is not responding"

      - alert: HighMemoryUsage
        expr: memory_usage_bytes / memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 90%"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: ReplicationLag
        expr: mysql_slave_lag_seconds > 60
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag is high"
          description: "Replication lag is {{ $value }} seconds"
```
Set Up Alertmanager
```yaml
# alertmanager-config.yaml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@yourdomain.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#alerts'
        text: 'Alert: {{ .GroupLabels.alertname }}'

  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#critical-alerts'
        text: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: 'oncall@yourdomain.com'
        subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        body: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/your-webhook'
        channel: '#warnings'
        text: '⚠️ Warning: {{ .GroupLabels.alertname }}'
```
Performance Optimization
High-Performance Configuration
Application Tuning:
```yaml
# High-performance environment variables
environment:
  - WORKER_THREADS=8
  - MAX_CONCURRENT_REQUESTS=500
  - CONNECTION_POOL_SIZE=50
  - CACHE_SIZE=1GB
  - LOG_LEVEL=warn
  - COMPRESSION_ENABLED=true
  - KEEP_ALIVE_TIMEOUT=65
  - REQUEST_TIMEOUT=30
```
Database Optimization:
```ini
# MySQL performance configuration (my.cnf, [mysqld] section)
# innodb_log_file_size and performance_schema are static variables: they
# cannot be changed with SET GLOBAL and require a server restart.
[mysqld]
innodb_buffer_pool_size = 4G
innodb_log_file_size    = 1G
max_connections         = 500
tmp_table_size          = 128M
max_heap_table_size     = 128M
# Note: the query cache (query_cache_size) was removed in MySQL 8.0

# Enable query performance insights
performance_schema = ON
slow_query_log     = ON
long_query_time    = 2
```
Redis Optimization:
```conf
# redis.conf optimizations
maxmemory 2gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
tcp-keepalive 300
timeout 300
tcp-backlog 511
```
Operational Procedures
Health Check Automation
```bash
#!/bin/bash
# health-check.sh

HEALTH_CHECK_URL="https://yourdomain.com/health"
CRITICAL_ENDPOINTS=(
  "https://yourdomain.com/api/v1/status"
  "https://yourdomain.com/api/v1/models"
)

# Check an endpoint and report its HTTP status
check_endpoint() {
  local url=$1
  local response
  response=$(curl -s -w "%{http_code}" -o /dev/null "$url")
  if [ "$response" -eq 200 ]; then
    echo "✅ $url - OK"
    return 0
  else
    echo "❌ $url - FAILED (HTTP $response)"
    return 1
  fi
}

# Check main health endpoint
if ! check_endpoint "$HEALTH_CHECK_URL"; then
  echo "🚨 Main health check failed!"
  # Trigger alert
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"🚨 CRITICAL: Main health check failed!"}'
  exit 1
fi

# Check critical endpoints
failed_count=0
for endpoint in "${CRITICAL_ENDPOINTS[@]}"; do
  if ! check_endpoint "$endpoint"; then
    ((failed_count++))
  fi
done

if [ "$failed_count" -gt 0 ]; then
  echo "⚠️ $failed_count critical endpoints failed"
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"⚠️ WARNING: '$failed_count' critical endpoints failed"}'
  exit 1
fi

echo "✅ All health checks passed"
```
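To catch failures quickly, the script above can run on a tight schedule, e.g. every minute from cron (the script path and log location are assumptions, matching the backup crontab's layout):

```
# Run the health check every minute and keep a log for auditing
* * * * * /opt/scripts/health-check.sh >> /var/log/health-check.log 2>&1
```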
Automated Failover
```bash
#!/bin/bash
# auto-failover.sh

PRIMARY_DB="mysql-primary"
REPLICA_DB="mysql-replica"
APP_NAMESPACE="chatnio-prod"

# Check primary database (assumes MySQL credentials in ~/.my.cnf)
if ! mysqladmin ping -h "$PRIMARY_DB" --silent; then
  echo "Primary database is down. Initiating failover..."

  # Promote the replica: stop replication, clear its replica config,
  # and make it writable
  mysql -h "$REPLICA_DB" \
    -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = OFF;"

  # Point the application at the new primary
  # (this overwrites app.yml; merge carefully if it holds other settings)
  kubectl patch configmap chatnio-config -n "$APP_NAMESPACE" \
    --patch '{"data":{"app.yml":"database:\n  host: mysql-replica\n"}}'

  # Restart application pods to pick up the new configuration
  kubectl rollout restart deployment/chatnio-app -n "$APP_NAMESPACE"

  # Send alert
  curl -X POST "https://hooks.slack.com/your-webhook" \
    -d '{"text":"🔄 Database failover completed. Primary: mysql-replica"}'

  echo "Failover completed"
else
  echo "Primary database is healthy"
fi
```
High availability deployment ensures your CoAI.Dev platform can handle failures gracefully and maintain service continuity. Regular testing of failover procedures and monitoring system health are essential for maintaining reliability.