ai-stack-deployer/docs/LOGGING-PLAN.md
Oussama Douhou 2f4722acd0 feat: add comprehensive logging infrastructure
- Add Loki/Prometheus/Grafana stack in logging-stack/
- Add log-ingest service for receiving events from AI stacks
- Add Grafana dashboard with stack_name filtering
- Update Dokploy client with setApplicationEnv method
- Configure STACK_NAME env var for deployed stacks
- Add alerting rules for stack health monitoring
2026-01-10 13:22:46 +01:00

AI Stack Monitoring & Logging Plan

Overview

This plan defines a logging strategy for every AI stack deployed at *.ai.flexinit.nl. It is meant to enable:

  • Usage analytics and billing
  • Debugging and support
  • Security auditing
  • Performance optimization
  • User behavior insights

1. Log Categories

1.1 System Logs

| Log Type | Source | Content |
| --- | --- | --- |
| Container stdout/stderr | Docker | OpenCode server output, errors, startup |
| Health checks | Docker | Container health status over time |
| Resource metrics | cAdvisor/Prometheus | CPU, memory, network, disk I/O |

1.2 OpenCode Server Logs

| Log Type | Source | Content |
| --- | --- | --- |
| Server events | --print-logs | HTTP requests, WebSocket connections |
| Session lifecycle | OpenCode | Session start/end, duration |
| Tool invocations | OpenCode | Which tools used, success/failure |
| MCP connections | OpenCode | MCP server connects/disconnects |

1.3 AI Interaction Logs

| Log Type | Source | Content |
| --- | --- | --- |
| Prompts | OpenCode session | User messages (anonymized) |
| Responses | OpenCode session | AI responses (summarized) |
| Token usage | Provider API | Input/output tokens per request |
| Model selection | OpenCode | Which model used per request |
| Agent selection | oh-my-opencode | Which agent (Sisyphus, Oracle, etc.) |

1.4 User Activity Logs

| Log Type | Source | Content |
| --- | --- | --- |
| File operations | OpenCode tools | Read/write/edit actions |
| Bash commands | OpenCode tools | Commands executed |
| Git operations | OpenCode tools | Commits, pushes, branches |
| Web fetches | OpenCode tools | URLs accessed |

2. Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AI Stack Container                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   OpenCode   │  │  Fluent Bit  │  │  OpenTelemetry SDK   │  │
│  │   Server     │──│  (sidecar)   │──│  (instrumentation)   │  │
│  └──────────────┘  └──────┬───────┘  └──────────┬───────────┘  │
└────────────────────────────┼────────────────────┼───────────────┘
                             │                    │
                             ▼                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Central Logging Stack                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Loki      │  │  Prometheus  │  │      Tempo           │  │
│  │   (logs)     │  │  (metrics)   │  │    (traces)          │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         └─────────────────┼─────────────────────┘              │
│                           ▼                                     │
│                    ┌──────────────┐                             │
│                    │   Grafana    │                             │
│                    │ (dashboard)  │                             │
│                    └──────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

3. Implementation Plan

Phase 1: Container Logging (Week 1)

3.1.1 Docker Log Driver

# docker-compose addition for each stack
logging:
  driver: "fluentd"
  options:
    fluentd-address: "10.100.0.x:24224"
    tag: "ai-stack.{{.Name}}"
    fluentd-async: "true"

3.1.2 OpenCode Server Logs

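Modify the Dockerfile CMD so the server prints its logs and a copy is written to /var/log/opencode/server.log (the directory must already exist in the image):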

CMD ["sh", "-c", "opencode serve --hostname 0.0.0.0 --port 8080 --mdns --print-logs --log-level INFO 2>&1 | tee /var/log/opencode/server.log"]

3.1.3 Log Rotation

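# Install logrotate (refresh the package index first, then clean it up)
RUN apt-get update \
 && apt-get install -y --no-install-recommends logrotate \
 && rm -rf /var/lib/apt/lists/*
# Add logrotate config
COPY logrotate.conf /etc/logrotate.d/opencode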

Phase 2: Session & Prompt Logging (Week 2)

3.2.1 OpenCode Plugin for Logging

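Create a logging hook in oh-my-opencode:

// src/hooks/logging.ts
// Assumes a `Hook` type from oh-my-opencode plus the `logEvent` and `hash`
// helpers (logEvent is sketched in 3.2.2); this plan does not define them.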
export const loggingHook: Hook = {
  name: 'session-logger',
  
  onSessionStart: async (session) => {
    await logEvent({
      type: 'session_start',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      timestamp: new Date().toISOString()
    });
  },
  
  onMessage: async (message, session) => {
    await logEvent({
      type: 'message',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      role: message.role,
      // Hash content for privacy, log length
      contentHash: hash(message.content),
      contentLength: message.content.length,
      model: session.model,
      agent: session.agent,
      timestamp: new Date().toISOString()
    });
  },
  
  onToolUse: async (tool, args, result, session) => {
    await logEvent({
      type: 'tool_use',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      tool: tool.name,
      argsHash: hash(JSON.stringify(args)),
      success: !result.error,
      duration: result.duration,
      timestamp: new Date().toISOString()
    });
  }
};
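The hook above calls hash(...) without defining it. A minimal sketch of such a helper follows, assuming a keyed SHA-256 (HMAC) from Node's built-in crypto module. The LOG_HASH_SALT variable and its fallback value are hypothetical and not part of the plan; a keyed hash is used here only because unsalted hashes of short prompts can be matched against a dictionary.

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical env var: the plan does not define a salt for hashing.
const SALT: string = process.env.LOG_HASH_SALT ?? 'dev-only-salt';

/** Returns a hex-encoded HMAC-SHA256 digest of the content, keyed with SALT. */
function hash(content: string): string {
  return createHmac('sha256', SALT).update(content, 'utf8').digest('hex');
}
```

The same input always yields the same 64-character hex string, so repeated prompts can still be correlated without their text being stored.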

3.2.2 Log Destination Options

Option A: Centralized HTTP Endpoint

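async function logEvent(event: LogEvent): Promise<void> {
  try {
    const res = await fetch('https://logs.ai.flexinit.nl/ingest', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Fall back to '' so header values are always strings
        'X-Stack-Name': process.env.STACK_NAME ?? '',
        'X-API-Key': process.env.LOGGING_API_KEY ?? ''
      },
      body: JSON.stringify(event)
    });
    if (!res.ok) console.error(`log ingest failed: HTTP ${res.status}`);
  } catch (err) {
    // Logging must never break the session; report and continue
    console.error('log ingest unreachable', err);
  }
}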
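On the receiving side, the ingest endpoint would need to check the two headers sent by Option A. The sketch below shows one possible check. The function names and the idea of a single shared key are assumptions, not something the plan specifies. Header names are lower-cased because Node's HTTP server exposes incoming headers that way.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto';

type IngestHeaders = Record<string, string | undefined>;

/** Compares two secrets in constant time by comparing their SHA-256 digests. */
function sameSecret(a: string, b: string): boolean {
  const da = createHash('sha256').update(a).digest();
  const db = createHash('sha256').update(b).digest();
  return timingSafeEqual(da, db); // digests always have equal length
}

/** Returns the stack name if the request carries a valid key, otherwise null. */
function authorizeIngest(headers: IngestHeaders, expectedKey: string): string | null {
  const key = headers['x-api-key'];
  const stack = headers['x-stack-name'];
  if (!key || !stack) return null;
  return sameSecret(key, expectedKey) ? stack : null;
}
```

Hashing both values first avoids the length check that timingSafeEqual would otherwise enforce by throwing.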

Option B: Local File + Fluent Bit

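import { appendFile } from 'node:fs/promises';

async function logEvent(event: LogEvent): Promise<void> {
  // One JSON object per line (JSON Lines); Fluent Bit tails this file
  const logLine = JSON.stringify(event) + '\n';
  await appendFile('/var/log/opencode/events.jsonl', logLine);
}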

Phase 3: Metrics Collection (Week 3)

3.3.1 Prometheus Metrics Endpoint

Add to the OpenCode container:

// metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  sessionsTotal: new Counter({
    name: 'opencode_sessions_total',
    help: 'Total number of sessions',
    labelNames: ['stack_name']
  }),
  
  messagesTotal: new Counter({
    name: 'opencode_messages_total',
    help: 'Total messages processed',
    labelNames: ['stack_name', 'role', 'model', 'agent']
  }),
  
  tokensUsed: new Counter({
    name: 'opencode_tokens_total',
    help: 'Total tokens used',
    labelNames: ['stack_name', 'model', 'direction']
  }),
  
  toolInvocations: new Counter({
    name: 'opencode_tool_invocations_total',
    help: 'Tool invocations',
    labelNames: ['stack_name', 'tool', 'success']
  }),
  
  responseDuration: new Histogram({
    name: 'opencode_response_duration_seconds',
    help: 'AI response duration',
    labelNames: ['stack_name', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60, 120]
  }),
  
  activeSessions: new Gauge({
    name: 'opencode_active_sessions',
    help: 'Currently active sessions',
    labelNames: ['stack_name']
  })
};
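Prometheus histograms are cumulative: each observation increments every bucket whose upper bound is greater than or equal to the observed value, plus the implicit +Inf bucket. The dependency-free sketch below illustrates this for the responseDuration buckets listed above. The bucketsFor helper is hypothetical and only mimics the bookkeeping that prom-client does internally.

```typescript
// Bucket upper bounds copied from the responseDuration histogram above.
const BUCKETS: number[] = [0.5, 1, 2, 5, 10, 30, 60, 120];

/** Lists the `le` labels whose cumulative counter an observation increments. */
function bucketsFor(seconds: number): string[] {
  const hit = BUCKETS.filter((upper) => seconds <= upper).map(String);
  return [...hit, '+Inf']; // the +Inf bucket counts every observation
}

console.log(bucketsFor(3));   // ['5', '10', '30', '60', '120', '+Inf']
console.log(bucketsFor(300)); // ['+Inf']
```

A response slower than 120 seconds lands only in +Inf, so percentiles above that bound cannot be resolved with this bucket list.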

3.3.2 Expose Metrics Endpoint

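// Add to the container's HTTP server (assumes an Express-style `app`
// and `register` imported from prom-client, as in metrics.ts)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});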

Phase 4: Central Logging Infrastructure (Week 4)

3.4.1 Deploy Logging Stack

# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
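    volumes:
      - grafana-data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  loki-data:
  prometheus-data:
  grafana-data: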

3.4.2 Prometheus Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: 'ai-stacks'
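    # NOTE: DNS service discovery queries each name literally; a wildcard
    # such as 'ai-stack-*' is not expanded, so list one entry per stack
    # service (or generate the targets with file_sd_configs).
    dns_sd_configs:
      - names:
          - 'tasks.ai-stack-*'
        type: 'A'
        port: 9090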
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: stack_name

4. Data Schema

4.1 Event Log Schema (JSON Lines)

{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "stack_name": "john-dev",
  "session_id": "sess_abc123",
  "event_type": "message|tool_use|session_start|session_end|error",
  "data": {
    "role": "user|assistant",
    "model": "glm-4.7-free",
    "agent": "sisyphus",
    "tool": "bash",
    "tokens_in": 1500,
    "tokens_out": 500,
    "duration_ms": 2340,
    "success": true,
    "error_code": null
  }
}
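The schema above can be expressed as a TypeScript type with a runtime guard, so that malformed lines are rejected before they are shipped or stored. This is a sketch only: the names LogEvent, isLogEvent and parseLine are assumptions, and the data object is left loosely typed. Note that the hook in 3.2.1 emits camelCase keys (stackName, sessionId, type), so one of the two would need adapting to match.

```typescript
type EventType = 'message' | 'tool_use' | 'session_start' | 'session_end' | 'error';

interface LogEvent {
  timestamp: string;
  stack_name: string;
  session_id: string;
  event_type: EventType;
  data: Record<string, unknown>;
}

const EVENT_TYPES: readonly string[] = [
  'message', 'tool_use', 'session_start', 'session_end', 'error',
];

/** Type guard: validates the envelope fields and leaves `data` unchecked. */
function isLogEvent(value: unknown): value is LogEvent {
  if (typeof value !== 'object' || value === null) return false;
  const { timestamp, stack_name, session_id, event_type, data } =
    value as Record<string, unknown>;
  return (
    typeof timestamp === 'string' &&
    !Number.isNaN(Date.parse(timestamp)) &&
    typeof stack_name === 'string' &&
    typeof session_id === 'string' &&
    typeof event_type === 'string' &&
    EVENT_TYPES.includes(event_type) &&
    typeof data === 'object' &&
    data !== null
  );
}

/** Parses one JSON Lines record, returning null for malformed input. */
function parseLine(line: string): LogEvent | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return isLogEvent(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```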

4.2 Metrics Labels

| Metric | Labels |
| --- | --- |
| opencode_* | stack_name, role, model, agent, direction, tool, success |

5. Privacy & Security

5.1 Data Anonymization

  • Prompts: Hash content, store only length and word count
  • File paths: Anonymize to pattern (e.g., /home/user/project/src/*.ts)
  • Bash commands: Log the command name only; arguments may contain secrets
  • Env vars: Never log, redact from all outputs
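The anonymization rules above could be sketched as small pure functions. This is one possible reading of the bullets, not a settled design: the function names are hypothetical, paths are assumed to be POSIX-style, and the command parser only skips leading VAR=value assignments rather than handling full shell syntax.

```typescript
import { posix } from 'node:path';

/** Replaces the file name with a wildcard, keeping directory and extension. */
function anonymizePath(filePath: string): string {
  const dir = posix.dirname(filePath);
  const prefix = dir.endsWith('/') ? dir : `${dir}/`;
  return `${prefix}*${posix.extname(filePath)}`;
}

/** Returns only the executable name, skipping leading VAR=value assignments. */
function commandName(command: string): string {
  const tokens = command.trim().split(/\s+/);
  const first = tokens.find((t) => !/^[A-Za-z_][A-Za-z0-9_]*=/.test(t));
  return first ?? '';
}

/** Counts whitespace-separated words in a prompt. */
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}
```

For example, anonymizePath('/home/user/project/src/index.ts') gives '/home/user/project/src/*.ts', matching the pattern in the bullet, and commandName('API_KEY=secret curl https://example.com') gives 'curl'.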

5.2 Retention Policy

| Data Type | Retention | Storage |
| --- | --- | --- |
| Raw logs | 7 days | Loki |
| Aggregated metrics | 90 days | Prometheus |
| Session summaries | 1 year | PostgreSQL |
| Billing data | 7 years | PostgreSQL |

5.3 Access Control

  • Logs accessible only to platform admins
  • Users can request their own data export
  • Stack owners can view their stack's metrics in Grafana

6. Grafana Dashboards

6.1 Platform Overview

  • Total active stacks
  • Messages per hour (all stacks)
  • Token usage by model
  • Error rate
  • Top agents used

6.2 Per-Stack Dashboard

  • Session count over time
  • Token usage
  • Tool usage breakdown
  • Response time percentiles
  • Error log viewer

6.3 Alerts

# alerting-rules.yml
groups:
  - name: ai-stack-alerts
    rules:
      - alert: StackUnhealthy
        expr: up{job="ai-stacks"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stack {{ $labels.stack_name }} is down"
      
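      # opencode_errors_total is not among the metrics defined in 3.3.1
      # and must be added there for this rule to fire.
      - alert: HighErrorRate
        expr: rate(opencode_errors_total[5m]) > 0.1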
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.stack_name }}"
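To make the HighErrorRate threshold concrete: rate(...[5m]) is a per-second average over the window, so 0.1 corresponds to roughly 30 errors in 5 minutes. The sketch below reproduces that arithmetic with a deliberately simplified rate function. Unlike Prometheus, it ignores counter resets and window extrapolation, and the Sample type and function name are illustrative.

```typescript
interface Sample {
  t: number;     // sample time in seconds
  value: number; // counter reading at that time
}

/** Simplified per-second rate between the first and last sample. */
function simpleRate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.t - first.t);
}

// 30 errors in 300 s is exactly at the threshold, so `> 0.1` is false.
const atThreshold = simpleRate([{ t: 0, value: 100 }, { t: 300, value: 130 }]);
// 60 errors in 300 s is 0.2 per second, which would satisfy the expression.
const aboveThreshold = simpleRate([{ t: 0, value: 100 }, { t: 300, value: 160 }]);
console.log(atThreshold > 0.1, aboveThreshold > 0.1);
```

Because the rule also has for: 10m, the elevated rate would need to persist for ten minutes before the alert fires.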

7. Implementation Checklist

Phase 1: Container Logging

  • Set up Loki + Promtail on logging server
  • Configure Docker log driver for ai-stack containers
  • Add log rotation to Dockerfile
  • Verify logs flowing to Loki

Phase 2: Session Logging

  • Create logging hook in oh-my-opencode
  • Define event schema
  • Implement log shipping (HTTP or file-based)
  • Add session/message/tool logging

Phase 3: Metrics

  • Add prom-client to container
  • Expose /metrics endpoint
  • Configure Prometheus scraping
  • Create initial Grafana dashboards

Phase 4: Production Hardening

  • Implement data anonymization
  • Set up retention policies
  • Configure alerts
  • Document runbooks

8. Cost Estimates

| Component | Resource | Monthly Cost |
| --- | --- | --- |
| Loki | 50GB logs @ 7 days | ~$15 |
| Prometheus | 10GB metrics @ 90 days | ~$10 |
| Grafana | 1 instance | Free (OSS) |
| Log ingestion | Network | ~$5 |
| Total | | ~$30/month |

9. Next Steps

  1. Approve plan - Review and confirm approach
  2. Deploy logging infra - Loki/Prometheus/Grafana on dedicated server
  3. Modify Dockerfile - Add logging configuration
  4. Create oh-my-opencode hooks - Session/message/tool logging
  5. Build dashboards - Grafana visualizations
  6. Test with pilot stack - Validate before rollout
  7. Rollout to all stacks - Update deployer to include logging config