# AI Stack Monitoring & Logging Plan

## Overview

Comprehensive logging strategy for all deployed AI stacks at `*.ai.flexinit.nl` to enable:

- Usage analytics and billing
- Debugging and support
- Security auditing
- Performance optimization
- User behavior insights

---

## 1. Log Categories

### 1.1 System Logs

| Log Type | Source | Content |
|----------|--------|---------|
| Container stdout/stderr | Docker | OpenCode server output, errors, startup |
| Health checks | Docker | Container health status over time |
| Resource metrics | cAdvisor/Prometheus | CPU, memory, network, disk I/O |

### 1.2 OpenCode Server Logs

| Log Type | Source | Content |
|----------|--------|---------|
| Server events | `--print-logs` | HTTP requests, WebSocket connections |
| Session lifecycle | OpenCode | Session start/end, duration |
| Tool invocations | OpenCode | Which tools used, success/failure |
| MCP connections | OpenCode | MCP server connects/disconnects |

### 1.3 AI Interaction Logs

| Log Type | Source | Content |
|----------|--------|---------|
| Prompts | OpenCode session | User messages (anonymized) |
| Responses | OpenCode session | AI responses (summarized) |
| Token usage | Provider API | Input/output tokens per request |
| Model selection | OpenCode | Which model used per request |
| Agent selection | oh-my-opencode | Which agent (Sisyphus, Oracle, etc.) |

### 1.4 User Activity Logs

| Log Type | Source | Content |
|----------|--------|---------|
| File operations | OpenCode tools | Read/write/edit actions |
| Bash commands | OpenCode tools | Commands executed |
| Git operations | OpenCode tools | Commits, pushes, branches |
| Web fetches | OpenCode tools | URLs accessed |

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       AI Stack Container                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │   OpenCode   │  │  Fluent Bit  │  │  OpenTelemetry SDK   │   │
│  │   Server     │──│  (sidecar)   │──│  (instrumentation)   │   │
│  └──────────────┘  └──────┬───────┘  └──────────┬───────────┘   │
└───────────────────────────┼─────────────────────┼───────────────┘
                            │                     │
                            ▼                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Central Logging Stack                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │     Loki     │  │  Prometheus  │  │        Tempo         │   │
│  │    (logs)    │  │  (metrics)   │  │       (traces)       │   │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘   │
│         └─────────────────┼─────────────────────┘               │
│                           ▼                                     │
│                    ┌──────────────┐                             │
│                    │   Grafana    │                             │
│                    │ (dashboard)  │                             │
│                    └──────────────┘                             │
└─────────────────────────────────────────────────────────────────┘
```

---

## 3. Implementation Plan

### Phase 1: Container Logging (Week 1)

#### 3.1.1 Docker Log Driver

```yaml
# docker-compose addition for each stack
logging:
  driver: "fluentd"
  options:
    fluentd-address: "10.100.0.x:24224"
    tag: "ai-stack.{{.Name}}"
    fluentd-async: "true"
```

#### 3.1.2 OpenCode Server Logs

Modify the Dockerfile `CMD` so that server output goes to both stdout and a log file:

```dockerfile
# The log directory must exist before tee can write to it
RUN mkdir -p /var/log/opencode
CMD ["sh", "-c", "opencode serve --hostname 0.0.0.0 --port 8080 --mdns --print-logs --log-level INFO 2>&1 | tee /var/log/opencode/server.log"]
```

#### 3.1.3 Log Rotation

```dockerfile
# Add logrotate config (refresh the package index first so the install succeeds)
RUN apt-get update && apt-get install -y logrotate && rm -rf /var/lib/apt/lists/*
COPY logrotate.conf /etc/logrotate.d/opencode
```

### Phase 2: Session & Prompt Logging (Week 2)

#### 3.2.1 OpenCode Plugin for Logging

Create a logging hook in oh-my-opencode:

```typescript
// src/hooks/logging.ts
// Hook, logEvent, and hash are assumed to be provided elsewhere in the plugin.
export const loggingHook: Hook = {
  name: 'session-logger',

  onSessionStart: async (session) => {
    await logEvent({
      type: 'session_start',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      timestamp: new Date().toISOString()
    });
  },

  onMessage: async (message, session) => {
    await logEvent({
      type: 'message',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      role: message.role,
      // Hash content for privacy, log length
      contentHash: hash(message.content),
      contentLength: message.content.length,
      model: session.model,
      agent: session.agent,
      timestamp: new Date().toISOString()
    });
  },

  onToolUse: async (tool, args, result, session) => {
    await logEvent({
      type: 'tool_use',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      tool: tool.name,
      // Arguments may contain secrets, so only their hash is logged
      argsHash: hash(JSON.stringify(args)),
      success: !result.error,
      duration: result.duration,
      timestamp: new Date().toISOString()
    });
  }
};
```

#### 3.2.2 Log Destination Options

**Option A: Centralized HTTP Endpoint**

```typescript
async function logEvent(event: LogEvent) {
  await fetch('https://logs.ai.flexinit.nl/ingest', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Header values must be strings, so fall back to '' when a variable is unset
      'X-Stack-Name': process.env.STACK_NAME ?? '',
      'X-API-Key': process.env.LOGGING_API_KEY ?? ''
    },
    body: JSON.stringify(event)
  });
}
```

**Option B: Local File + Fluent Bit**

```typescript
import { promises as fs } from 'node:fs';

async function logEvent(event: LogEvent) {
  // One JSON object per line (JSON Lines), picked up by Fluent Bit
  const logLine = JSON.stringify(event) + '\n';
  await fs.appendFile('/var/log/opencode/events.jsonl', logLine);
}
```

### Phase 3: Metrics Collection (Week 3)

#### 3.3.1 Prometheus Metrics Endpoint

Add to the OpenCode container:

```typescript
// metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  sessionsTotal: new Counter({
    name: 'opencode_sessions_total',
    help: 'Total number of sessions',
    labelNames: ['stack_name']
  }),
  messagesTotal: new Counter({
    name: 'opencode_messages_total',
    help: 'Total messages processed',
    labelNames: ['stack_name', 'role', 'model', 'agent']
  }),
  tokensUsed: new Counter({
    name: 'opencode_tokens_total',
    help: 'Total tokens used',
    labelNames: ['stack_name', 'model', 'direction']
  }),
  toolInvocations: new Counter({
    name: 'opencode_tool_invocations_total',
    help: 'Tool invocations',
    labelNames: ['stack_name', 'tool', 'success']
  }),
  responseDuration: new Histogram({
    name: 'opencode_response_duration_seconds',
    help: 'AI response duration',
    labelNames: ['stack_name', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60, 120]
  }),
  activeSessions: new Gauge({
    name: 'opencode_active_sessions',
    help: 'Currently active sessions',
    labelNames: ['stack_name']
  })
};
```

#### 3.3.2 Expose Metrics Endpoint

```typescript
// Add to container (app is assumed to be an Express-style HTTP server)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```

### Phase 4: Central Logging Infrastructure (Week 4)

#### 3.4.1 Deploy Logging Stack

```yaml
# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana

# Named volumes referenced above must be declared at the top level
volumes:
  loki-data:
  prometheus-data:
  grafana-data:
```

#### 3.4.2 Prometheus Scrape Config

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'ai-stacks'
    dns_sd_configs:
      - names:
          - 'tasks.ai-stack-*'
        type: 'A'
        port: 9090
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: stack_name
```

---

## 4. Data Schema

### 4.1 Event Log Schema (JSON Lines)

```json
{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "stack_name": "john-dev",
  "session_id": "sess_abc123",
  "event_type": "message|tool_use|session_start|session_end|error",
  "data": {
    "role": "user|assistant",
    "model": "glm-4.7-free",
    "agent": "sisyphus",
    "tool": "bash",
    "tokens_in": 1500,
    "tokens_out": 500,
    "duration_ms": 2340,
    "success": true,
    "error_code": null
  }
}
```

### 4.2 Metrics Labels

| Metric | Labels |
|--------|--------|
| `opencode_*` | `stack_name`, `model`, `agent`, `tool`, `success` |

---

## 5. Privacy & Security

### 5.1 Data Anonymization

- **Prompts**: Hash content, store only length and word count
- **File paths**: Anonymize to pattern (e.g., `/home/user/project/src/*.ts`)
- **Bash commands**: Log command name only, not arguments with secrets
- **Env vars**: Never log, redact from all outputs

### 5.2 Retention Policy

| Data Type | Retention | Storage |
|-----------|-----------|---------|
| Raw logs | 7 days | Loki |
| Aggregated metrics | 90 days | Prometheus |
| Session summaries | 1 year | PostgreSQL |
| Billing data | 7 years | PostgreSQL |

### 5.3 Access Control

- Logs accessible only to platform admins
- Users can request their own data export
- Stack owners can view their stack's metrics in Grafana

---

## 6. Grafana Dashboards

### 6.1 Platform Overview

- Total active stacks
- Messages per hour (all stacks)
- Token usage by model
- Error rate
- Top agents used

### 6.2 Per-Stack Dashboard

- Session count over time
- Token usage
- Tool usage breakdown
- Response time percentiles
- Error log viewer

### 6.3 Alerts

```yaml
# alerting-rules.yml
groups:
  - name: ai-stack-alerts
    rules:
      - alert: StackUnhealthy
        expr: up{job="ai-stacks"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stack {{ $labels.stack_name }} is down"

      - alert: HighErrorRate
        expr: rate(opencode_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.stack_name }}"
```

---

## 7. Implementation Checklist

### Phase 1: Container Logging

- [x] Set up Loki + Promtail on logging server (using existing `logs.intra.flexinit.nl`)
- [x] Configure Docker log driver for ai-stack containers
- [x] Add log rotation to Dockerfile
- [x] Verify logs flowing to Loki

### Phase 2: Session Logging

- [x] Create logging hook in oh-my-opencode (`/home/odouhou/locale-projects/oh-my-opencode-free-fork/src/hooks/usage-logging/`)
- [x] Define event schema
- [x] Implement log shipping (HTTP-based via log-ingest service)
- [x] Add session/message/tool logging

### Phase 3: Metrics

- [x] Add prom-client to container (`docker/shared-config/metrics-exporter.ts`)
- [x] Expose /metrics endpoint (port 9090)
- [x] Configure Prometheus scraping (datasource added to Grafana)
- [x] Create initial Grafana dashboards (`/d/ai-stack-overview`)

### Phase 4: Production Hardening

- [x] Implement data anonymization (content hashed, not stored)
- [ ] Set up retention policies
- [ ] Configure alerts
- [ ] Document runbooks

### Deployed Components (2026-01-10)

- **Log-ingest service**: `http://ai-stack-log-ingest:3000/ingest` (dokploy-network)
- **Grafana dashboard**: https://logs.intra.flexinit.nl/d/ai-stack-overview
- **Datasource UIDs**: Loki (`af9a823s6iku8b`), Prometheus (`cf9r1fmfw9xxcf`)
- **BWS credentials**: `GRAFANA_OPENCODE_ACCESS_TOKEN` (id: `c77e58e3-fb34-41dc-9824-b3ce00da18a0`)

---

## 8. Cost Estimates

| Component | Resource | Monthly Cost |
|-----------|----------|--------------|
| Loki | 50GB logs @ 7 days | ~$15 |
| Prometheus | 10GB metrics @ 90 days | ~$10 |
| Grafana | 1 instance | Free (OSS) |
| Log ingestion | Network | ~$5 |
| **Total** | | **~$30/month** |

---

## 9. Next Steps

1. **Approve plan** - Review and confirm approach
2. **Deploy logging infra** - Loki/Prometheus/Grafana on dedicated server
3. **Modify Dockerfile** - Add logging configuration
4. **Create oh-my-opencode hooks** - Session/message/tool logging
5. **Build dashboards** - Grafana visualizations
6. **Test with pilot stack** - Validate before rollout
7. **Rollout to all stacks** - Update deployer to include logging config
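
For reference, the anonymization rules from section 5.1 could be sketched in TypeScript roughly as follows. This is a minimal illustration, not the deployed hook: the helper names (`hashContent`, `summarizePrompt`, `anonymizePath`, `commandName`) are hypothetical, SHA-256 is an assumed choice of hash, and the command parser is a naive whitespace split that does not understand shell quoting.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: SHA-256 hex digest, so content can be correlated but not read back.
export function hashContent(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}

// Hypothetical helper: what would be stored for a prompt instead of its text.
export function summarizePrompt(content: string): {
  contentHash: string;
  contentLength: number;
  wordCount: number;
} {
  const trimmed = content.trim();
  return {
    contentHash: hashContent(content),
    contentLength: content.length,
    wordCount: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
  };
}

// Hypothetical helper: keep the directory, replace the file name with "*" plus its extension.
export function anonymizePath(filePath: string): string {
  const slash = filePath.lastIndexOf('/');
  const dir = slash >= 0 ? filePath.slice(0, slash + 1) : '';
  const base = filePath.slice(slash + 1);
  const dot = base.lastIndexOf('.');
  // dot > 0 so that dotfiles such as ".env" do not expose their name as an "extension"
  const ext = dot > 0 ? base.slice(dot) : '';
  return `${dir}*${ext}`;
}

// Hypothetical helper: return only the command name, skipping leading VAR=value assignments.
// Naive: splits on whitespace and does not parse quotes or subshells.
export function commandName(command: string): string {
  const tokens = command.trim().split(/\s+/).filter((t) => t !== '');
  const first = tokens.find((t) => !/^[A-Za-z_][A-Za-z0-9_]*=/.test(t));
  return first ?? '';
}
```

A hook could call `summarizePrompt(message.content)` in place of logging the message body, and `commandName(args.command)` before recording a bash tool invocation. Whether directory names in paths also need masking depends on how sensitive project layouts are considered to be.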