Custom Models

Private model integration and deployment for specialized AI capabilities

Integrate and deploy private AI models with CoAI.Dev for specialized capabilities, proprietary data training, and complete control over your AI infrastructure. This guide covers local model deployment, fine-tuning, and enterprise model management.

Overview

Custom model integration enables:

  • 🏠 Private Deployment: Host models on your own infrastructure
  • 🎯 Specialized Models: Deploy domain-specific or fine-tuned models
  • 🔒 Data Privacy: Keep sensitive data within your environment
  • 💰 Cost Control: Eliminate per-token costs for high-volume usage
  • ⚡ Performance Optimization: Optimize models for your specific use cases

Enterprise AI Control

Custom models provide complete control over your AI stack, enabling specialized capabilities while maintaining security, compliance, and cost predictability.

Supported Frameworks

Local AI Frameworks

Ollama Integration

Ollama provides easy local model deployment with minimal configuration.

Set Up the Ollama Server:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
 
# Start Ollama service
ollama serve
 
# Pull and run models
ollama pull llama2:7b
ollama pull codellama:13b
ollama pull mistral:7b
 
# List available models
ollama list

CoAI.Dev Channel Configuration:

{
  "channel_name": "Local Ollama",
  "channel_type": "ollama",
  "base_url": "http://localhost:11434",
  "models": [
    {
      "model_name": "llama2:7b",
      "model_display_name": "Llama 2 7B",
      "model_type": "text",
      "context_length": 4096,
      "pricing": {
        "input_tokens": 0,
        "output_tokens": 0
      }
    },
    {
      "model_name": "codellama:13b",
      "model_display_name": "Code Llama 13B",
      "model_type": "code",
      "context_length": 8192,
      "pricing": {
        "input_tokens": 0,
        "output_tokens": 0
      }
    }
  ]
}
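
Before saving the channel, it is worth confirming that the base_url actually responds. The snippet below is a minimal sanity check (assuming the requests library is installed and that the host and model tag match the configuration above): it lists the pulled models and sends one non-streaming prompt through Ollama's /api/generate endpoint.

# check_ollama.py - quick sanity check for the Ollama endpoint referenced above
import requests

BASE_URL = "http://localhost:11434"  # must match the channel's base_url

# List the models Ollama has pulled
tags = requests.get(f"{BASE_URL}/api/tags", timeout=10).json()
print("Installed models:", [m["name"] for m in tags.get("models", [])])

# Send one non-streaming prompt to confirm generation works end to end
resp = requests.post(
    f"{BASE_URL}/api/generate",
    json={"model": "llama2:7b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print("Sample response:", resp.json()["response"])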

Docker Deployment:

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
 
volumes:
  ollama-data:

Model Fine-Tuning

Fine-Tuning Workflow

Prepare Training Data

# prepare_data.py
import json
from datasets import Dataset
 
def prepare_chat_data(conversations):
    """Convert conversations to training format"""
    training_data = []
    
    for conversation in conversations:
        formatted_conv = {
            "instruction": conversation["user_message"],
            "input": conversation.get("context", ""),
            "output": conversation["assistant_message"]
        }
        training_data.append(formatted_conv)
    
    return training_data
 
# Example conversation data
conversations = [
    {
        "user_message": "Explain machine learning",
        "assistant_message": "Machine learning is a subset of AI...",
        "context": "Technical documentation context"
    }
]
 
# Prepare and save training data
training_data = prepare_chat_data(conversations)
with open("training_data.json", "w") as f:
    json.dump(training_data, f, indent=2)

Fine-Tune with LoRA

# fine_tune_lora.py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
 
# Load base model
model_name = "microsoft/DialoGPT-medium"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2-based tokenizers have no pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
 
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"]
)
 
# Apply LoRA to model
model = get_peft_model(model, lora_config)
 
# Load and tokenize dataset
dataset = load_dataset("json", data_files="training_data.json")
 
def tokenize_function(examples):
    # With batched=True each field is a list, so join instruction/output pairs element-wise
    texts = [
        instruction + " " + output
        for instruction, output in zip(examples["instruction"], examples["output"])
    ]
    return tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512
    )
 
tokenized_dataset = dataset.map(tokenize_function, batched=True)
 
# Training configuration (no evaluation split is prepared here, so evaluation is disabled)
training_args = TrainingArguments(
    output_dir="./fine_tuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    warmup_steps=500,
    logging_steps=10,
    save_steps=1000,
    save_total_limit=2,
)
 
# Collator builds causal-LM labels from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
 
# Create trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
 
trainer.train()
 
# Save fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
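
If you prefer to ship a single standalone checkpoint rather than base weights plus a LoRA adapter, PEFT can fold the adapter back into the base model. An optional sketch, continuing from the training script above (the merged output directory is just an example path):

# Optional: merge the LoRA adapter into the base weights for standalone deployment
merged_model = model.merge_and_unload()  # returns a plain transformers model
merged_model.save_pretrained("./fine_tuned_model_merged")
tokenizer.save_pretrained("./fine_tuned_model_merged")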

Deploy Fine-Tuned Model

# deploy_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
 
def load_fine_tuned_model(base_model_path, peft_model_path):
    """Load fine-tuned model with LoRA weights"""
    
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Load LoRA weights
    model = PeftModel.from_pretrained(base_model, peft_model_path)
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    
    return model, tokenizer
 
# Example usage
model, tokenizer = load_fine_tuned_model(
    "microsoft/DialoGPT-medium",
    "./fine_tuned_model"
)
 
def generate_response(prompt, max_length=100):
    """Generate response using fine-tuned model"""
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_length=max_length,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response[len(prompt):].strip()
 
# Test the model
response = generate_response("Explain quantum computing")
print(f"Model response: {response}")

Enterprise Model Management

Model Registry and Versioning

Centralized Model Registry

# model-registry.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-registry
data:
  registry.json: |
    {
      "models": [
        {
          "model_id": "company-chat-v1",
          "model_name": "Company Chat Assistant v1.0",
          "model_type": "text",
          "base_model": "llama-2-7b-chat",
          "fine_tuned": true,
          "training_data": "customer_support_2024",
          "created_at": "2024-01-15T10:00:00Z",
          "status": "active",
          "deployment": {
            "endpoint": "http://model-server:8000/v1/chat/completions",
            "replicas": 3,
            "gpu_memory": "8GB",
            "max_tokens": 4096
          },
          "performance": {
            "accuracy": 0.92,
            "latency_p95": "150ms",
            "throughput": "50 req/s"
          }
        }
      ]
    }

Model Registration API:

# model_registry.py
from datetime import datetime
import json
import uuid
 
class ModelRegistry:
    def __init__(self, registry_file="model_registry.json"):
        self.registry_file = registry_file
        self.models = self.load_registry()
    
    def register_model(self, model_info):
        """Register a new model version"""
        model_id = str(uuid.uuid4())
        model_entry = {
            "model_id": model_id,
            "registered_at": datetime.utcnow().isoformat(),
            **model_info
        }
        
        self.models[model_id] = model_entry
        self.save_registry()
        return model_id
    
    def get_model(self, model_id):
        """Get model information"""
        return self.models.get(model_id)
    
    def list_models(self, status=None):
        """List models with optional status filter"""
        if status:
            return {k: v for k, v in self.models.items() 
                   if v.get("status") == status}
        return self.models
    
    def update_model_status(self, model_id, status):
        """Update model status"""
        if model_id in self.models:
            self.models[model_id]["status"] = status
            self.save_registry()
    
    def save_registry(self):
        with open(self.registry_file, "w") as f:
            json.dump(self.models, f, indent=2)
    
    def load_registry(self):
        try:
            with open(self.registry_file, "r") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
 
# Usage example
registry = ModelRegistry()
 
# Register new model
model_id = registry.register_model({
    "model_name": "Customer Support Bot v2.0",
    "model_type": "text",
    "base_model": "llama-2-13b-chat",
    "fine_tuned": True,
    "deployment_config": {
        "gpu_memory": "16GB",
        "replicas": 2
    }
})
 
print(f"Registered model: {model_id}")

Security and Compliance

Model Security Best Practices

Security Considerations

Custom models require additional security measures to protect intellectual property and ensure safe operation in production environments.

Access Control:

# rbac-models.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: models
  name: model-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
 
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: model-operator-binding
  namespace: models
subjects:
- kind: User
  name: model-operator
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: model-operator
  apiGroup: rbac.authorization.k8s.io

Model Encryption:

# model_encryption.py
from cryptography.fernet import Fernet
import pickle
 
def encrypt_model(model, key=None):
    """Encrypt model weights"""
    if key is None:
        key = Fernet.generate_key()
    
    cipher = Fernet(key)
    
    # Serialize model
    model_bytes = pickle.dumps(model.state_dict())
    
    # Encrypt
    encrypted_model = cipher.encrypt(model_bytes)
    
    return encrypted_model, key
 
def decrypt_model(encrypted_model, key, model_class):
    """Decrypt and load model"""
    cipher = Fernet(key)
    
    # Decrypt
    model_bytes = cipher.decrypt(encrypted_model)
    
    # Deserialize
    state_dict = pickle.loads(model_bytes)
    
    # Load into model
    model = model_class()
    model.load_state_dict(state_dict)
    
    return model
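
A quick roundtrip check of the helpers above, assuming PyTorch is available (TinyClassifier is a throwaway stand-in for a real model class):

# Roundtrip test for encrypt_model / decrypt_model (TinyClassifier is illustrative only)
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 2)

    def forward(self, x):
        return self.linear(x)

original = TinyClassifier()
encrypted, key = encrypt_model(original)  # store the key in a secrets manager, never alongside the weights
restored = decrypt_model(encrypted, key, TinyClassifier)
print("Weights match:", torch.equal(original.linear.weight, restored.linear.weight))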

Cost Optimization

Resource Management

GPU Optimization:

# gpu-optimization.yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: model-server
    image: model-server:latest
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "16Gi"
        cpu: "4"
      requests:
        nvidia.com/gpu: 1
        memory: "8Gi"
        cpu: "2"
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
  nodeSelector:
    node-type: gpu-optimized
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Auto-scaling Configuration:

# model-autoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"

Custom model integration provides complete control over your AI infrastructure while enabling specialized capabilities. Start with local deployment using Ollama or LocalAI, then scale to enterprise-grade model management with proper versioning, monitoring, and security measures.