AI Stack Monitoring & Logging Plan
Overview
Comprehensive logging strategy for all deployed AI stacks at *.ai.flexinit.nl to enable:
- Usage analytics and billing
- Debugging and support
- Security auditing
- Performance optimization
- User behavior insights
1. Log Categories
1.1 System Logs
| Log Type | Source | Content |
|---|---|---|
| Container stdout/stderr | Docker | OpenCode server output, errors, startup |
| Health checks | Docker | Container health status over time |
| Resource metrics | cAdvisor/Prometheus | CPU, memory, network, disk I/O |
1.2 OpenCode Server Logs
| Log Type | Source | Content |
|---|---|---|
| Server events | --print-logs | HTTP requests, WebSocket connections |
| Session lifecycle | OpenCode | Session start/end, duration |
| Tool invocations | OpenCode | Which tools used, success/failure |
| MCP connections | OpenCode | MCP server connects/disconnects |
1.3 AI Interaction Logs
| Log Type | Source | Content |
|---|---|---|
| Prompts | OpenCode session | User messages (anonymized) |
| Responses | OpenCode session | AI responses (summarized) |
| Token usage | Provider API | Input/output tokens per request |
| Model selection | OpenCode | Which model used per request |
| Agent selection | oh-my-opencode | Which agent (Sisyphus, Oracle, etc.) |
1.4 User Activity Logs
| Log Type | Source | Content |
|---|---|---|
| File operations | OpenCode tools | Read/write/edit actions |
| Bash commands | OpenCode tools | Commands executed |
| Git operations | OpenCode tools | Commits, pushes, branches |
| Web fetches | OpenCode tools | URLs accessed |
2. Architecture
┌────────────────────────────────────────────────────────────────┐
│                       AI Stack Container                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   OpenCode   │  │  Fluent Bit  │  │  OpenTelemetry SDK   │  │
│  │    Server    │──│  (sidecar)   │──│  (instrumentation)   │  │
│  └──────────────┘  └──────┬───────┘  └──────────┬───────────┘  │
└───────────────────────────┼─────────────────────┼──────────────┘
                            │                     │
                            ▼                     ▼
┌────────────────────────────────────────────────────────────────┐
│                     Central Logging Stack                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │     Loki     │  │  Prometheus  │  │        Tempo         │  │
│  │    (logs)    │  │  (metrics)   │  │       (traces)       │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         └─────────────────┼─────────────────────┘              │
│                           ▼                                    │
│                    ┌──────────────┐                            │
│                    │   Grafana    │                            │
│                    │ (dashboard)  │                            │
│                    └──────────────┘                            │
└────────────────────────────────────────────────────────────────┘
3. Implementation Plan
Phase 1: Container Logging (Week 1)
3.1.1 Docker Log Driver
# docker-compose addition for each stack
logging:
  driver: "fluentd"
  options:
    fluentd-address: "10.100.0.x:24224"
    tag: "ai-stack.{{.Name}}"
    fluentd-async: "true"
3.1.2 OpenCode Server Logs
Modify Dockerfile CMD to capture structured logs:
CMD ["sh", "-c", "opencode serve --hostname 0.0.0.0 --port 8080 --mdns --print-logs --log-level INFO 2>&1 | tee /var/log/opencode/server.log"]
3.1.3 Log Rotation
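# Add logrotate config (refresh the package index first so the install does not fail on a stale cache)
RUN apt-get update && apt-get install -y logrotate && rm -rf /var/lib/apt/lists/*
COPY logrotate.conf /etc/logrotate.d/opencode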
Phase 2: Session & Prompt Logging (Week 2)
3.2.1 OpenCode Plugin for Logging
Create logging hook in oh-my-opencode:
// src/hooks/logging.ts
export const loggingHook: Hook = {
  name: 'session-logger',

  onSessionStart: async (session) => {
    await logEvent({
      type: 'session_start',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      timestamp: new Date().toISOString()
    });
  },

  onMessage: async (message, session) => {
    await logEvent({
      type: 'message',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      role: message.role,
      // Hash content for privacy, log length
      contentHash: hash(message.content),
      contentLength: message.content.length,
      model: session.model,
      agent: session.agent,
      timestamp: new Date().toISOString()
    });
  },

  onToolUse: async (tool, args, result, session) => {
    await logEvent({
      type: 'tool_use',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      tool: tool.name,
      argsHash: hash(JSON.stringify(args)),
      success: !result.error,
      duration: result.duration,
      timestamp: new Date().toISOString()
    });
  }
};
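The hook above calls a `hash` function that the plan does not define. A minimal sketch of such a helper follows, assuming SHA-256 via Node's built-in `crypto` module; the algorithm choice and the `wordCount` companion are assumptions for illustration, not something the plan specifies.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper for the `hash(...)` calls in the hook above.
// SHA-256 is an assumption; the plan does not name an algorithm.
export function hash(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}

// Hypothetical companion: counts whitespace-separated words so that
// only size information, never the text itself, is logged.
export function wordCount(content: string): number {
  const trimmed = content.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}
```

An unsalted hash of a short, guessable prompt can be matched against a dictionary, so a keyed hash (HMAC with a per-deployment secret) may be worth considering if the hashes ever leave the platform.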
3.2.2 Log Destination Options
Option A: Centralized HTTP Endpoint
async function logEvent(event: LogEvent) {
  await fetch('https://logs.ai.flexinit.nl/ingest', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Stack-Name': process.env.STACK_NAME,
      'X-API-Key': process.env.LOGGING_API_KEY
    },
    body: JSON.stringify(event)
  });
}
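Option A awaits the HTTP call directly, so a slow or unreachable ingest endpoint could stall or throw inside a hook. One possible guard is sketched below: it races the send against a timeout and swallows failures. `Sender`, `safeLogEvent` and the 2000 ms default are hypothetical names and values, not part of OpenCode or oh-my-opencode.

```typescript
// Minimal sketch: never let a logging failure propagate into the session.
// `LogEvent` is simplified here; `Sender` and `safeLogEvent` are hypothetical.
type LogEvent = Record<string, unknown>;
type Sender = (event: LogEvent) => Promise<void>;

export async function safeLogEvent(
  event: LogEvent,
  send: Sender,
  timeoutMs = 2000, // assumed default, tune per deployment
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('log shipping timed out')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the send or the timeout.
    await Promise.race([send(event), timeout]);
    return true;
  } catch {
    // Drop the event rather than break the user's session.
    return false;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The `send` argument would wrap whichever transport is chosen, for example the `fetch` call from Option A. Returning a boolean lets the caller count dropped events without handling exceptions.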
Option B: Local File + Fluent Bit
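import { promises as fs } from 'fs';

async function logEvent(event: LogEvent) {
  // One JSON object per line (JSON Lines), picked up by Fluent Bit
  const logLine = JSON.stringify(event) + '\n';
  await fs.appendFile('/var/log/opencode/events.jsonl', logLine);
}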
Phase 3: Metrics Collection (Week 3)
3.3.1 Prometheus Metrics Endpoint
Add to OpenCode container:
// metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  sessionsTotal: new Counter({
    name: 'opencode_sessions_total',
    help: 'Total number of sessions',
    labelNames: ['stack_name']
  }),
  messagesTotal: new Counter({
    name: 'opencode_messages_total',
    help: 'Total messages processed',
    labelNames: ['stack_name', 'role', 'model', 'agent']
  }),
  tokensUsed: new Counter({
    name: 'opencode_tokens_total',
    help: 'Total tokens used',
    labelNames: ['stack_name', 'model', 'direction']
  }),
  toolInvocations: new Counter({
    name: 'opencode_tool_invocations_total',
    help: 'Tool invocations',
    labelNames: ['stack_name', 'tool', 'success']
  }),
  responseDuration: new Histogram({
    name: 'opencode_response_duration_seconds',
    help: 'AI response duration',
    labelNames: ['stack_name', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60, 120]
  }),
  activeSessions: new Gauge({
    name: 'opencode_active_sessions',
    help: 'Currently active sessions',
    labelNames: ['stack_name']
  })
};
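To make the link between logged events and these metrics concrete, the sketch below maps each event type to the metric names and label sets defined above. `StackEvent`, `MetricUpdate` and `updatesFor` are hypothetical names; the function is kept free of `prom-client` so it can be checked in isolation, and the caller would apply each update with the matching counter or gauge, for example `counter.inc(labels, value)`.

```typescript
// Sketch only: which metrics from metrics.ts each event would touch.
// All type and function names here are hypothetical.
interface MetricUpdate {
  metric: string;
  labels: Record<string, string>;
  value: number;
}

type StackEvent =
  | { type: 'session_start'; stackName: string }
  | { type: 'message'; stackName: string; role: string; model: string; agent: string }
  | { type: 'tool_use'; stackName: string; tool: string; success: boolean };

export function updatesFor(e: StackEvent): MetricUpdate[] {
  switch (e.type) {
    case 'session_start':
      // A new session bumps the lifetime counter and the active gauge.
      return [
        { metric: 'opencode_sessions_total', labels: { stack_name: e.stackName }, value: 1 },
        { metric: 'opencode_active_sessions', labels: { stack_name: e.stackName }, value: 1 },
      ];
    case 'message':
      return [
        {
          metric: 'opencode_messages_total',
          labels: { stack_name: e.stackName, role: e.role, model: e.model, agent: e.agent },
          value: 1,
        },
      ];
    case 'tool_use':
      return [
        {
          metric: 'opencode_tool_invocations_total',
          // Prometheus label values are strings, so the boolean is stringified.
          labels: { stack_name: e.stackName, tool: e.tool, success: String(e.success) },
          value: 1,
        },
      ];
  }
}
```

Keeping label values to small, bounded sets (stack, model, agent, tool) matters here, because every distinct combination creates a separate time series in Prometheus.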
3.3.2 Expose Metrics Endpoint
// Add to container (assumes an Express-style `app` and `register` imported from 'prom-client')
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
Phase 4: Central Logging Infrastructure (Week 4)
3.4.1 Deploy Logging Stack
# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana

# Named volumes used above must be declared at the top level
volumes:
  loki-data:
  prometheus-data:
  grafana-data:
3.4.2 Prometheus Scrape Config
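# prometheus.yml
scrape_configs:
  - job_name: 'ai-stacks'
    dns_sd_configs:
      # NOTE: DNS service discovery queries each name literally and does not
      # expand wildcards; the pattern below is a placeholder, so list each
      # stack's DNS name here or use another discovery mechanism.
      - names:
          - 'tasks.ai-stack-*'
        type: 'A'
        port: 9090
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: stack_name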
4. Data Schema
4.1 Event Log Schema (JSON Lines)
{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "stack_name": "john-dev",
  "session_id": "sess_abc123",
  "event_type": "message|tool_use|session_start|session_end|error",
  "data": {
    "role": "user|assistant",
    "model": "glm-4.7-free",
    "agent": "sisyphus",
    "tool": "bash",
    "tokens_in": 1500,
    "tokens_out": 500,
    "duration_ms": 2340,
    "success": true,
    "error_code": null
  }
}
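The schema above uses snake_case with a nested `data` object, while the hook in 3.2.1 emits flat camelCase fields. A possible mapping between the two is sketched below. The type and function names are hypothetical, and treating every `data` field as optional and `duration` as milliseconds are assumptions the plan does not confirm.

```typescript
// Sketch of the 4.1 schema as TypeScript types, plus a mapper from the
// hook's event shape. All names here are hypothetical.
type EventType = 'message' | 'tool_use' | 'session_start' | 'session_end' | 'error';

interface EventData {
  role?: 'user' | 'assistant';
  model?: string;
  agent?: string;
  tool?: string;
  tokens_in?: number;
  tokens_out?: number;
  duration_ms?: number;
  success?: boolean;
  error_code?: string | null;
}

interface EventLog {
  timestamp: string;
  stack_name: string;
  session_id: string;
  event_type: EventType;
  data: EventData;
}

// Flat camelCase shape, modelled on the logEvent calls in the hook.
interface HookEvent {
  type: EventType;
  stackName: string;
  sessionId: string;
  timestamp: string;
  role?: 'user' | 'assistant';
  model?: string;
  agent?: string;
  tool?: string;
  success?: boolean;
  duration?: number; // assumed to be milliseconds
}

export function toEventLog(e: HookEvent): EventLog {
  // Copy only the fields that are present so absent values stay absent.
  const data: EventData = {};
  if (e.role !== undefined) data.role = e.role;
  if (e.model !== undefined) data.model = e.model;
  if (e.agent !== undefined) data.agent = e.agent;
  if (e.tool !== undefined) data.tool = e.tool;
  if (e.success !== undefined) data.success = e.success;
  if (e.duration !== undefined) data.duration_ms = e.duration;
  return {
    timestamp: e.timestamp,
    stack_name: e.stackName,
    session_id: e.sessionId,
    event_type: e.type,
    data,
  };
}
```

Doing the conversion in one place, whether in the hook or in the ingest service, keeps Loki queries and dashboards tied to a single set of field names.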
4.2 Metrics Labels
| Metric | Labels |
|---|---|
| opencode_* | stack_name, model, agent, tool, success |
5. Privacy & Security
5.1 Data Anonymization
- Prompts: Hash content, store only length and word count
- File paths: Anonymize to pattern (e.g., /home/user/project/src/*.ts)
- Bash commands: Log command name only, not arguments with secrets
- Env vars: Never log, redact from all outputs
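The file-path and bash-command rules above could be implemented along the following lines. This is a sketch under stated assumptions: `commandName` and `anonymizePath` are hypothetical helpers, not taken from the deployed hook, and the command parser only looks at the first command of a line.

```typescript
// Hypothetical helpers illustrating the anonymization rules; not the deployed code.

// Keep only the executable name, skipping leading VAR=value assignments
// so inline secrets such as API_KEY=... are never logged.
export function commandName(commandLine: string): string {
  const tokens = commandLine.trim().split(/\s+/).filter((t) => t.length > 0);
  const first = tokens.find((t) => !/^[A-Za-z_][A-Za-z0-9_]*=/.test(t));
  return first ?? '';
}

// Replace the file name with a wildcard, keeping directory and extension.
export function anonymizePath(path: string): string {
  const slash = path.lastIndexOf('/');
  const dir = path.slice(0, slash + 1);
  const base = path.slice(slash + 1);
  const dot = base.lastIndexOf('.');
  // A leading dot (as in ".env") marks a hidden file, not an extension.
  const ext = dot > 0 ? base.slice(dot) : '';
  return `${dir}*${ext}`;
}
```

Pipelines and chained commands (`a | b`, `a && b`) would be reduced to the first command only, so a fuller version would need a real shell tokenizer. Directory names can also carry identifying information, which the pattern in the example above keeps.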
5.2 Retention Policy
| Data Type | Retention | Storage |
|---|---|---|
| Raw logs | 7 days | Loki |
| Aggregated metrics | 90 days | Prometheus |
| Session summaries | 1 year | PostgreSQL |
| Billing data | 7 years | PostgreSQL |
5.3 Access Control
- Logs accessible only to platform admins
- Users can request their own data export
- Stack owners can view their stack's metrics in Grafana
6. Grafana Dashboards
6.1 Platform Overview
- Total active stacks
- Messages per hour (all stacks)
- Token usage by model
- Error rate
- Top agents used
6.2 Per-Stack Dashboard
- Session count over time
- Token usage
- Tool usage breakdown
- Response time percentiles
- Error log viewer
6.3 Alerts
# alerting-rules.yml
groups:
  - name: ai-stack-alerts
    rules:
      - alert: StackUnhealthy
        expr: up{job="ai-stacks"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stack {{ $labels.stack_name }} is down"
      # NOTE: opencode_errors_total is not among the metrics defined in 3.3.1
      # and must be added there for this rule to fire.
      - alert: HighErrorRate
        expr: rate(opencode_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.stack_name }}"
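To give a feel for the `HighErrorRate` threshold, the sketch below computes a simplified per-second rate from counter samples. It is an approximation for illustration only: Prometheus' `rate()` also extrapolates to the window boundaries, which this version does not, and `Sample` and `simpleRate` are hypothetical names.

```typescript
// Simplified per-second increase of a counter over a window of samples.
// Handles counter resets; does not extrapolate like Prometheus does.
interface Sample {
  t: number; // timestamp in seconds
  v: number; // counter value at that time
}

export function simpleRate(samples: Sample[]): number {
  const [first, ...rest] = samples;
  if (first === undefined || rest.length === 0) return 0;
  let increase = 0;
  let prev = first;
  for (const s of rest) {
    // A drop in value means the counter restarted from zero.
    increase += s.v >= prev.v ? s.v - prev.v : s.v;
    prev = s;
  }
  const elapsed = prev.t - first.t;
  return elapsed > 0 ? increase / elapsed : 0;
}
```

Under this simplification, a rate of 0.1 per second corresponds to about 30 errors within a 5-minute window for a single series, sustained for the 10 minutes required by the `for` clause.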
7. Implementation Checklist
Phase 1: Container Logging
- Set up Loki + Promtail on logging server (using existing logs.intra.flexinit.nl)
- Configure Docker log driver for ai-stack containers
- Add log rotation to Dockerfile
- Verify logs flowing to Loki
Phase 2: Session Logging
- Create logging hook in oh-my-opencode (/home/odouhou/locale-projects/oh-my-opencode-free-fork/src/hooks/usage-logging/)
- Define event schema
- Implement log shipping (HTTP-based via log-ingest service)
- Add session/message/tool logging
Phase 3: Metrics
- Add prom-client to container (docker/shared-config/metrics-exporter.ts)
- Expose /metrics endpoint (port 9090)
- Configure Prometheus scraping (datasource added to Grafana)
- Create initial Grafana dashboards (/d/ai-stack-overview)
Phase 4: Production Hardening
- Implement data anonymization (content hashed, not stored)
- Set up retention policies
- Configure alerts
- Document runbooks
Deployed Components (2026-01-10)
- Log-ingest service: http://ai-stack-log-ingest:3000/ingest (dokploy-network)
- Grafana dashboard: https://logs.intra.flexinit.nl/d/ai-stack-overview
- Datasource UIDs: Loki (af9a823s6iku8b), Prometheus (cf9r1fmfw9xxcf)
- BWS credentials: GRAFANA_OPENCODE_ACCESS_TOKEN (id: c77e58e3-fb34-41dc-9824-b3ce00da18a0)
8. Cost Estimates
| Component | Resource | Monthly Cost |
|---|---|---|
| Loki | 50GB logs @ 7 days | ~$15 |
| Prometheus | 10GB metrics @ 90 days | ~$10 |
| Grafana | 1 instance | Free (OSS) |
| Log ingestion | Network | ~$5 |
| Total | | ~$30/month |
9. Next Steps
- Approve plan - Review and confirm approach
- Deploy logging infra - Loki/Prometheus/Grafana on dedicated server
- Modify Dockerfile - Add logging configuration
- Create oh-my-opencode hooks - Session/message/tool logging
- Build dashboards - Grafana visualizations
- Test with pilot stack - Validate before rollout
- Rollout to all stacks - Update deployer to include logging config