
Oussama Douhou 2f4722acd0 feat: add comprehensive logging infrastructure

- Add Loki/Prometheus/Grafana stack in logging-stack/
- Add log-ingest service for receiving events from AI stacks
- Add Grafana dashboard with stack_name filtering
- Update Dokploy client with setApplicationEnv method
- Configure STACK_NAME env var for deployed stacks
- Add alerting rules for stack health monitoring

2026-01-10 13:22:46 +01:00


AI Stack Monitoring & Logging Plan

Overview

Comprehensive logging strategy for all deployed AI stacks at *.ai.flexinit.nl to enable:

Usage analytics and billing
Debugging and support
Security auditing
Performance optimization
User behavior insights

1. Log Categories

1.1 System Logs

Log Type	Source	Content
Container stdout/stderr	Docker	OpenCode server output, errors, startup
Health checks	Docker	Container health status over time
Resource metrics	cAdvisor/Prometheus	CPU, memory, network, disk I/O

1.2 OpenCode Server Logs

Log Type	Source	Content
Server events	`--print-logs`	HTTP requests, WebSocket connections
Session lifecycle	OpenCode	Session start/end, duration
Tool invocations	OpenCode	Which tools used, success/failure
MCP connections	OpenCode	MCP server connects/disconnects

1.3 AI Interaction Logs

Log Type	Source	Content
Prompts	OpenCode session	User messages (anonymized)
Responses	OpenCode session	AI responses (summarized)
Token usage	Provider API	Input/output tokens per request
Model selection	OpenCode	Which model used per request
Agent selection	oh-my-opencode	Which agent (Sisyphus, Oracle, etc.)

1.4 User Activity Logs

Log Type	Source	Content
File operations	OpenCode tools	Read/write/edit actions
Bash commands	OpenCode tools	Commands executed
Git operations	OpenCode tools	Commits, pushes, branches
Web fetches	OpenCode tools	URLs accessed

2. Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AI Stack Container                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   OpenCode   │  │  Fluent Bit  │  │  OpenTelemetry SDK   │  │
│  │   Server     │──│  (sidecar)   │──│  (instrumentation)   │  │
│  └──────────────┘  └──────┬───────┘  └──────────┬───────────┘  │
└────────────────────────────┼────────────────────┼───────────────┘
                             │                    │
                             ▼                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Central Logging Stack                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Loki      │  │  Prometheus  │  │      Tempo           │  │
│  │   (logs)     │  │  (metrics)   │  │    (traces)          │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         └─────────────────┼─────────────────────┘              │
│                           ▼                                     │
│                    ┌──────────────┐                             │
│                    │   Grafana    │                             │
│                    │ (dashboard)  │                             │
│                    └──────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

3. Implementation Plan

Phase 1: Container Logging (Week 1)

3.1.1 Docker Log Driver

# docker-compose addition for each stack
logging:
  driver: "fluentd"
  options:
    fluentd-address: "10.100.0.x:24224"
    tag: "ai-stack.{{.Name}}"
    fluentd-async: "true"

3.1.2 OpenCode Server Logs

Modify Dockerfile CMD to capture structured logs:

CMD ["sh", "-c", "opencode serve --hostname 0.0.0.0 --port 8080 --mdns --print-logs --log-level INFO 2>&1 | tee /var/log/opencode/server.log"]

3.1.3 Log Rotation

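# Add logrotate config (refresh the package index first, otherwise the install fails on a clean image)
RUN apt-get update \
    && apt-get install -y --no-install-recommends logrotate \
    && rm -rf /var/lib/apt/lists/*
COPY logrotate.conf /etc/logrotate.d/opencode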

Phase 2: Session & Prompt Logging (Week 2)

3.2.1 OpenCode Plugin for Logging

Create a logging hook in oh-my-opencode (`logEvent` is defined in 3.2.2; the `Hook` type and the `hash` helper are assumed to exist):

// src/hooks/logging.ts
export const loggingHook: Hook = {
  name: 'session-logger',
  
  onSessionStart: async (session) => {
    await logEvent({
      type: 'session_start',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      timestamp: new Date().toISOString()
    });
  },
  
  onMessage: async (message, session) => {
    await logEvent({
      type: 'message',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      role: message.role,
      // Hash content for privacy, log length
      contentHash: hash(message.content),
      contentLength: message.content.length,
      model: session.model,
      agent: session.agent,
      timestamp: new Date().toISOString()
    });
  },
  
  onToolUse: async (tool, args, result, session) => {
    await logEvent({
      type: 'tool_use',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      tool: tool.name,
      argsHash: hash(JSON.stringify(args)),
      success: !result.error,
      duration: result.duration,
      timestamp: new Date().toISOString()
    });
  }
};
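The hook above calls `hash()` without defining it. A minimal sketch of such a helper follows, assuming a keyed HMAC-SHA256 is acceptable; a keyed digest is used here because a plain hash of a short prompt can be guessed by hashing candidate inputs. The `LOG_HASH_KEY` variable and the fallback key are hypothetical and not part of the plan.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical key source: LOG_HASH_KEY is an assumed env var, not defined in the plan.
const HASH_KEY: string = process.env.LOG_HASH_KEY ?? "dev-only-key";

/** Returns a hex HMAC-SHA256 digest of the input, so the plaintext itself is never logged. */
export function hash(content: string, key: string = HASH_KEY): string {
  return createHmac("sha256", key).update(content, "utf8").digest("hex");
}
```

The same input and key always give the same digest, so repeated prompts can still be counted without storing their text.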

3.2.2 Log Destination Options

Option A: Centralized HTTP Endpoint

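async function logEvent(event: LogEvent): Promise<void> {
  try {
    await fetch('https://logs.ai.flexinit.nl/ingest', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Fall back to empty strings so header values are always strings
        'X-Stack-Name': process.env.STACK_NAME ?? '',
        'X-API-Key': process.env.LOGGING_API_KEY ?? ''
      },
      body: JSON.stringify(event)
    });
  } catch (err) {
    // A logging failure must never break the user's session
    console.error('logEvent failed:', err);
  }
}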

Option B: Local File + Fluent Bit

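import { appendFile } from 'node:fs/promises';

async function logEvent(event: LogEvent): Promise<void> {
  // One JSON object per line (JSON Lines), picked up from the file by Fluent Bit
  const logLine = JSON.stringify(event) + '\n';
  await appendFile('/var/log/opencode/events.jsonl', logLine);
}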

Phase 3: Metrics Collection (Week 3)

3.3.1 Prometheus Metrics Endpoint

Add to OpenCode container:

// metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  sessionsTotal: new Counter({
    name: 'opencode_sessions_total',
    help: 'Total number of sessions',
    labelNames: ['stack_name']
  }),
  
  messagesTotal: new Counter({
    name: 'opencode_messages_total',
    help: 'Total messages processed',
    labelNames: ['stack_name', 'role', 'model', 'agent']
  }),
  
  tokensUsed: new Counter({
    name: 'opencode_tokens_total',
    help: 'Total tokens used',
    labelNames: ['stack_name', 'model', 'direction']
  }),
  
  toolInvocations: new Counter({
    name: 'opencode_tool_invocations_total',
    help: 'Tool invocations',
    labelNames: ['stack_name', 'tool', 'success']
  }),
  
  responseDuration: new Histogram({
    name: 'opencode_response_duration_seconds',
    help: 'AI response duration',
    labelNames: ['stack_name', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60, 120]
  }),
  
  activeSessions: new Gauge({
    name: 'opencode_active_sessions',
    help: 'Currently active sessions',
    labelNames: ['stack_name']
  })
};

3.3.2 Expose Metrics Endpoint

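// Add to container (assumes an Express-style `app`; `register` is the prom-client default registry)
import { register } from 'prom-client';

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});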

Phase 4: Central Logging Infrastructure (Week 4)

3.4.1 Deploy Logging Stack

# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
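    volumes:
      - grafana-data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  loki-data:
  prometheus-data:
  grafana-data: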

3.4.2 Prometheus Scrape Config

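# prometheus.yml
scrape_configs:
  - job_name: 'ai-stacks'
    dns_sd_configs:
      # DNS names are looked up as written; a wildcard such as 'tasks.ai-stack-*'
      # is not expanded, so list one record per stack (example name shown).
      - names:
          - 'tasks.ai-stack-john-dev'
        type: 'A'
        # Must match the port where the stack serves /metrics
        port: 9090
    relabel_configs:
      # Keep only the part after 'tasks.ai-stack-' as the stack_name label
      - source_labels: [__meta_dns_name]
        regex: 'tasks\.ai-stack-(.+)'
        target_label: stack_name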
4. Data Schema

4.1 Event Log Schema (JSON Lines)

{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "stack_name": "john-dev",
  "session_id": "sess_abc123",
  "event_type": "message|tool_use|session_start|session_end|error",
  "data": {
    "role": "user|assistant",
    "model": "glm-4.7-free",
    "agent": "sisyphus",
    "tool": "bash",
    "tokens_in": 1500,
    "tokens_out": 500,
    "duration_ms": 2340,
    "success": true,
    "error_code": null
  }
}
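The schema above could be enforced where events are received or read back. The following is a minimal sketch of a type and a line validator for it; the names `EventRecord` and `parseEventLine` are illustrative, and only the top-level fields are checked.

```typescript
const EVENT_TYPES = ["message", "tool_use", "session_start", "session_end", "error"] as const;
type EventType = (typeof EVENT_TYPES)[number];

interface EventRecord {
  timestamp: string;
  stack_name: string;
  session_id: string;
  event_type: EventType;
  data: Record<string, unknown>; // role, model, agent, tool, tokens_in, ...
}

// Parses one JSON Lines entry; returns null instead of throwing on malformed input.
export function parseEventLine(line: string): EventRecord | null {
  let value: unknown;
  try {
    value = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const { timestamp, stack_name, session_id, event_type, data } = value as Record<string, unknown>;
  const valid =
    typeof timestamp === "string" &&
    !Number.isNaN(Date.parse(timestamp)) &&
    typeof stack_name === "string" &&
    typeof session_id === "string" &&
    typeof event_type === "string" &&
    (EVENT_TYPES as readonly string[]).includes(event_type) &&
    typeof data === "object" &&
    data !== null;
  return valid ? (value as EventRecord) : null;
}
```

Returning `null` rather than throwing lets a reader skip a corrupt line and continue with the rest of the file.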

4.2 Metrics Labels

Metric	Labels
`opencode_*`	`stack_name`, `role`, `model`, `agent`, `direction`, `tool`, `success`

5. Privacy & Security

5.1 Data Anonymization

Prompts: Hash content, store only length and word count
File paths: Anonymize to pattern (e.g., /home/user/project/src/*.ts)
Bash commands: Log command name only, not arguments with secrets
Env vars: Never log, redact from all outputs
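The four rules above can be sketched as small pure functions. This is one possible reading of them, not the plan's implementation: all function names are hypothetical, and the six-character minimum in `redactEnv` is an arbitrary choice to avoid redacting common substrings.

```typescript
// Prompts: keep only size information, never the text itself.
export function promptStats(content: string): { length: number; wordCount: number } {
  const words = content.trim().split(/\s+/).filter(Boolean);
  return { length: content.length, wordCount: words.length };
}

// File paths: keep the directory, replace the file name with *.ext.
export function anonymizePath(path: string): string {
  const slash = path.lastIndexOf("/");
  const dir = path.slice(0, slash + 1);
  const file = path.slice(slash + 1);
  const dot = file.lastIndexOf(".");
  // dot > 0 excludes dotfiles such as ".env", which become a bare "*".
  return dir + (dot > 0 ? "*" + file.slice(dot) : "*");
}

// Bash commands: keep the command name only, dropping every argument.
export function commandName(command: string): string {
  const tokens = command.trim().split(/\s+/);
  // Skip leading VAR=value assignments such as `TOKEN=abc curl ...`.
  return tokens.find((t) => !/^[A-Za-z_][A-Za-z0-9_]*=/.test(t)) ?? "";
}

// Env vars: replace any known variable value that appears in the text.
export function redactEnv(text: string, env: Record<string, string | undefined>): string {
  let out = text;
  for (const [name, value] of Object.entries(env)) {
    if (value && value.length >= 6) out = out.split(value).join(`[REDACTED:${name}]`);
  }
  return out;
}
```

For example, `anonymizePath("/home/user/project/src/index.ts")` yields the pattern form shown in the list above.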

5.2 Retention Policy

Data Type	Retention	Storage
Raw logs	7 days	Loki
Aggregated metrics	90 days	Prometheus
Session summaries	1 year	PostgreSQL
Billing data	7 years	PostgreSQL

5.3 Access Control

Logs accessible only to platform admins
Users can request their own data export
Stack owners can view their stack's metrics in Grafana

6. Grafana Dashboards

6.1 Platform Overview

Total active stacks
Messages per hour (all stacks)
Token usage by model
Error rate
Top agents used

6.2 Per-Stack Dashboard

Session count over time
Token usage
Tool usage breakdown
Response time percentiles
Error log viewer
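Response time percentiles would come from the `opencode_response_duration_seconds` histogram, typically through a query such as `histogram_quantile(0.95, sum by (le) (rate(opencode_response_duration_seconds_bucket[5m])))`. The sketch below illustrates the underlying idea, linear interpolation inside the bucket that contains the target rank; it is a simplified illustration with made-up counts, not Prometheus's exact implementation.

```typescript
// Estimates a quantile from cumulative histogram buckets by linear interpolation
// inside the bucket that contains the target rank.
export function estimateQuantile(
  q: number,
  upperBounds: number[], // finite bucket upper bounds, ascending
  cumulativeCounts: number[], // observations at or below each bound
  total: number // all observations, including those above the last bound
): number {
  if (total === 0) return NaN;
  const rank = q * total;
  const i = cumulativeCounts.findIndex((c) => c >= rank);
  // Rank lies beyond the last finite bucket: report the highest finite bound.
  if (i === -1) return upperBounds[upperBounds.length - 1];
  const lower = i === 0 ? 0 : upperBounds[i - 1];
  const below = i === 0 ? 0 : cumulativeCounts[i - 1];
  const inBucket = cumulativeCounts[i] - below;
  if (inBucket === 0) return lower;
  return lower + (upperBounds[i] - lower) * ((rank - below) / inBucket);
}

const bounds = [0.5, 1, 2, 5, 10, 30, 60, 120]; // buckets from 3.3.1
const counts = [10, 30, 60, 90, 100, 100, 100, 100]; // made-up sample data
const p50 = estimateQuantile(0.5, bounds, counts, 100);
const p95 = estimateQuantile(0.95, bounds, counts, 100);
console.log(p50.toFixed(2), p95.toFixed(2));
```

Because the estimate can only interpolate within a bucket, its precision depends on how closely the bucket bounds match the latencies that actually occur.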

6.3 Alerts

# alerting-rules.yml
groups:
  - name: ai-stack-alerts
    rules:
      - alert: StackUnhealthy
        expr: up{job="ai-stacks"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stack {{ $labels.stack_name }} is down"
      
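      - alert: HighErrorRate
        # opencode_errors_total is not among the counters in 3.3.1 and must be added there.
        # rate() is per second: 0.1/s is about 30 errors per 5-minute window.
        expr: sum by (stack_name) (rate(opencode_errors_total[5m])) > 0.1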
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.stack_name }}"

7. Implementation Checklist

Phase 1: Container Logging

Set up Loki + Promtail on logging server
Configure Docker log driver for ai-stack containers
Add log rotation to Dockerfile
Verify logs flowing to Loki

Phase 2: Session Logging

Create logging hook in oh-my-opencode
Define event schema
Implement log shipping (HTTP or file-based)
Add session/message/tool logging

Phase 3: Metrics

Add prom-client to container
Expose /metrics endpoint
Configure Prometheus scraping
Create initial Grafana dashboards

Phase 4: Production Hardening

Implement data anonymization
Set up retention policies
Configure alerts
Document runbooks

8. Cost Estimates

Component	Resource	Monthly Cost
Loki	50GB logs @ 7 days	~$15
Prometheus	10GB metrics @ 90 days	~$10
Grafana	1 instance	Free (OSS)
Log ingestion	Network	~$5
Total		~$30/month

9. Next Steps

Approve plan - Review and confirm approach
Deploy logging infra - Loki/Prometheus/Grafana on dedicated server
Modify Dockerfile - Add logging configuration
Create oh-my-opencode hooks - Session/message/tool logging
Build dashboards - Grafana visualizations
Test with pilot stack - Validate before rollout
Rollout to all stacks - Update deployer to include logging config
