feat: add comprehensive logging infrastructure
- Add Loki/Prometheus/Grafana stack in logging-stack/
- Add log-ingest service for receiving events from AI stacks
- Add Grafana dashboard with stack_name filtering
- Update Dokploy client with setApplicationEnv method
- Configure STACK_NAME env var for deployed stacks
- Add alerting rules for stack health monitoring
435
docs/LOGGING-PLAN.md
Normal file
@@ -0,0 +1,435 @@
# AI Stack Monitoring & Logging Plan

## Overview

Comprehensive logging strategy for all deployed AI stacks at `*.ai.flexinit.nl` to enable:
- Usage analytics and billing
- Debugging and support
- Security auditing
- Performance optimization
- User behavior insights

---

## 1. Log Categories

### 1.1 System Logs
| Log Type | Source | Content |
|----------|--------|---------|
| Container stdout/stderr | Docker | OpenCode server output, errors, startup |
| Health checks | Docker | Container health status over time |
| Resource metrics | cAdvisor/Prometheus | CPU, memory, network, disk I/O |

### 1.2 OpenCode Server Logs
| Log Type | Source | Content |
|----------|--------|---------|
| Server events | `--print-logs` | HTTP requests, WebSocket connections |
| Session lifecycle | OpenCode | Session start/end, duration |
| Tool invocations | OpenCode | Which tools used, success/failure |
| MCP connections | OpenCode | MCP server connects/disconnects |

### 1.3 AI Interaction Logs
| Log Type | Source | Content |
|----------|--------|---------|
| Prompts | OpenCode session | User messages (anonymized) |
| Responses | OpenCode session | AI responses (summarized) |
| Token usage | Provider API | Input/output tokens per request |
| Model selection | OpenCode | Which model used per request |
| Agent selection | oh-my-opencode | Which agent (Sisyphus, Oracle, etc.) |

### 1.4 User Activity Logs
| Log Type | Source | Content |
|----------|--------|---------|
| File operations | OpenCode tools | Read/write/edit actions |
| Bash commands | OpenCode tools | Commands executed |
| Git operations | OpenCode tools | Commits, pushes, branches |
| Web fetches | OpenCode tools | URLs accessed |

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       AI Stack Container                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │   OpenCode   │  │  Fluent Bit  │  │  OpenTelemetry SDK   │   │
│  │    Server    │──│  (sidecar)   │──│  (instrumentation)   │   │
│  └──────────────┘  └──────┬───────┘  └──────────┬───────────┘   │
└───────────────────────────┼─────────────────────┼───────────────┘
                            │                     │
                            ▼                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Central Logging Stack                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │     Loki     │  │  Prometheus  │  │        Tempo         │   │
│  │    (logs)    │  │  (metrics)   │  │       (traces)       │   │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘   │
│         └─────────────────┼─────────────────────┘               │
│                           ▼                                     │
│                    ┌──────────────┐                             │
│                    │   Grafana    │                             │
│                    │  (dashboard) │                             │
│                    └──────────────┘                             │
└─────────────────────────────────────────────────────────────────┘
```

---

## 3. Implementation Plan

### Phase 1: Container Logging (Week 1)

#### 3.1.1 Docker Log Driver
```yaml
# docker-compose addition for each stack
logging:
  driver: "fluentd"
  options:
    fluentd-address: "10.100.0.x:24224"
    tag: "ai-stack.{{.Name}}"
    fluentd-async: "true"
```

#### 3.1.2 OpenCode Server Logs
Modify the Dockerfile `CMD` so that server output goes to both stdout and a log file:
```dockerfile
# The log directory must exist before tee can write to it
RUN mkdir -p /var/log/opencode
CMD ["sh", "-c", "opencode serve --hostname 0.0.0.0 --port 8080 --mdns --print-logs --log-level INFO 2>&1 | tee /var/log/opencode/server.log"]
```

#### 3.1.3 Log Rotation
```dockerfile
# Add logrotate config (refresh the package index first, or the install can fail)
RUN apt-get update && apt-get install -y logrotate
COPY logrotate.conf /etc/logrotate.d/opencode
```

### Phase 2: Session & Prompt Logging (Week 2)

#### 3.2.1 OpenCode Plugin for Logging
Create a logging hook in oh-my-opencode:

```typescript
// src/hooks/logging.ts
// Assumes a `Hook` type and the `logEvent` (3.2.2) and `hash` helpers are in scope.
export const loggingHook: Hook = {
  name: 'session-logger',

  onSessionStart: async (session) => {
    await logEvent({
      type: 'session_start',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      timestamp: new Date().toISOString()
    });
  },

  onMessage: async (message, session) => {
    await logEvent({
      type: 'message',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      role: message.role,
      // Hash content for privacy, log length
      contentHash: hash(message.content),
      contentLength: message.content.length,
      model: session.model,
      agent: session.agent,
      timestamp: new Date().toISOString()
    });
  },

  onToolUse: async (tool, args, result, session) => {
    await logEvent({
      type: 'tool_use',
      stackName: process.env.STACK_NAME,
      sessionId: session.id,
      tool: tool.name,
      argsHash: hash(JSON.stringify(args)),
      success: !result.error,
      duration: result.duration,
      timestamp: new Date().toISOString()
    });
  }
};
```

#### 3.2.2 Log Destination Options

**Option A: Centralized HTTP Endpoint**
```typescript
async function logEvent(event: LogEvent) {
  await fetch('https://logs.ai.flexinit.nl/ingest', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Env vars may be undefined; header values must be strings
      'X-Stack-Name': process.env.STACK_NAME ?? '',
      'X-API-Key': process.env.LOGGING_API_KEY ?? ''
    },
    body: JSON.stringify(event)
  });
}
```

**Option B: Local File + Fluent Bit**
```typescript
import { appendFile } from 'node:fs/promises';

async function logEvent(event: LogEvent) {
  // One JSON document per line (JSON Lines), picked up by Fluent Bit
  const logLine = JSON.stringify(event) + '\n';
  await appendFile('/var/log/opencode/events.jsonl', logLine);
}
```

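Both options await I/O on every event, so a slow or failing log destination would be felt inside the hook. The following is a minimal sketch of a batching wrapper that works with either transport, assuming the transport is passed in as a `send` callback. `createBufferedLogger`, `maxBatch`, and `droppedCount` are hypothetical names for illustration and are not part of OpenCode or oh-my-opencode.

```typescript
// Hypothetical batching wrapper around either transport (Option A or B).
// Assumption: `send` delivers one batch, e.g. one HTTP POST or one file append.
type LogEvent = Record<string, unknown>;
type Send = (batch: LogEvent[]) => Promise<void>;

export function createBufferedLogger(send: Send, maxBatch = 50) {
  let buffer: LogEvent[] = [];
  let dropped = 0;

  // Deliver everything buffered so far; never throws into the caller.
  async function flush(): Promise<void> {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    try {
      await send(batch);
    } catch {
      // Logging must not break a session: count the loss and move on.
      dropped += batch.length;
    }
  }

  // Queue one event and flush once the batch is full.
  async function logEvent(event: LogEvent): Promise<void> {
    buffer.push(event);
    if (buffer.length >= maxBatch) await flush();
  }

  return { logEvent, flush, droppedCount: () => dropped };
}
```

A caller would also need to invoke `flush()` on a timer and at session end, otherwise a partially filled batch stays in memory.
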
### Phase 3: Metrics Collection (Week 3)

#### 3.3.1 Prometheus Metrics Endpoint
Add to OpenCode container:

```typescript
// metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  sessionsTotal: new Counter({
    name: 'opencode_sessions_total',
    help: 'Total number of sessions',
    labelNames: ['stack_name']
  }),

  messagesTotal: new Counter({
    name: 'opencode_messages_total',
    help: 'Total messages processed',
    labelNames: ['stack_name', 'role', 'model', 'agent']
  }),

  tokensUsed: new Counter({
    name: 'opencode_tokens_total',
    help: 'Total tokens used',
    labelNames: ['stack_name', 'model', 'direction']
  }),

  toolInvocations: new Counter({
    name: 'opencode_tool_invocations_total',
    help: 'Tool invocations',
    labelNames: ['stack_name', 'tool', 'success']
  }),

  responseDuration: new Histogram({
    name: 'opencode_response_duration_seconds',
    help: 'AI response duration',
    labelNames: ['stack_name', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60, 120]
  }),

  activeSessions: new Gauge({
    name: 'opencode_active_sessions',
    help: 'Currently active sessions',
    labelNames: ['stack_name']
  })
};
```

#### 3.3.2 Expose Metrics Endpoint
```typescript
// Add to container
// Assumes an Express-style `app` and the `register` import from metrics.ts
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
```

### Phase 4: Central Logging Infrastructure (Week 4)

#### 3.4.1 Deploy Logging Stack
```yaml
# docker-compose.logging.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  loki-data:
  prometheus-data:
  grafana-data:
```

#### 3.4.2 Prometheus Scrape Config
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'ai-stacks'
    dns_sd_configs:
      # NOTE: DNS lookups do not expand wildcards; each stack's service
      # name would have to be listed (or generated) here.
      - names:
          - 'tasks.ai-stack-*'
        type: 'A'
        # Must match the port on which each stack serves /metrics
        port: 9090
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: stack_name
```

---

## 4. Data Schema

### 4.1 Event Log Schema (JSON Lines)
```json
{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "stack_name": "john-dev",
  "session_id": "sess_abc123",
  "event_type": "message|tool_use|session_start|session_end|error",
  "data": {
    "role": "user|assistant",
    "model": "glm-4.7-free",
    "agent": "sisyphus",
    "tool": "bash",
    "tokens_in": 1500,
    "tokens_out": 500,
    "duration_ms": 2340,
    "success": true,
    "error_code": null
  }
}
```

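The hook in 3.2.1 emits flat camelCase fields (`type`, `stackName`, `sessionId`), while this schema uses snake_case envelope fields with a nested `data` object. One possible mapping is sketched here; `HookEvent`, `WireEvent`, `toWireEvent`, and `toJsonLine` are hypothetical names, and the sketch assumes every non-envelope field belongs under `data`.

```typescript
// Hypothetical types: neither name comes from OpenCode or oh-my-opencode.
type EventType = 'message' | 'tool_use' | 'session_start' | 'session_end' | 'error';

interface HookEvent {
  type: EventType;
  stackName: string;
  sessionId: string;
  timestamp: string;
  [extra: string]: unknown; // role, model, agent, tool, duration, ...
}

interface WireEvent {
  timestamp: string;
  stack_name: string;
  session_id: string;
  event_type: EventType;
  data: Record<string, unknown>;
}

// Move the four envelope fields to the top level and everything else into `data`.
export function toWireEvent(event: HookEvent): WireEvent {
  const { type, stackName, sessionId, timestamp, ...data } = event;
  return {
    timestamp,
    stack_name: stackName,
    session_id: sessionId,
    event_type: type,
    data,
  };
}

// One JSON Lines record: a single-line JSON document followed by a newline.
export function toJsonLine(event: HookEvent): string {
  return JSON.stringify(toWireEvent(event)) + '\n';
}
```
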
### 4.2 Metrics Labels
| Metric | Labels |
|--------|--------|
| `opencode_*` | `stack_name`, `role`, `model`, `agent`, `direction`, `tool`, `success` |

---

## 5. Privacy & Security

### 5.1 Data Anonymization
- **Prompts**: Hash content, store only length and word count
- **File paths**: Anonymize to pattern (e.g., `/home/user/project/src/*.ts`)
- **Bash commands**: Log command name only, not arguments with secrets
- **Env vars**: Never log, redact from all outputs

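The first three rules above can be sketched with Node's standard library. The helper names (`describePrompt`, `anonymizePath`, `commandName`) are hypothetical, and SHA-256 is an assumption: the plan only refers to a `hash` function without naming an algorithm.

```typescript
import { createHash } from 'node:crypto';
import { posix as path } from 'node:path';

// Helper names are hypothetical; the plan itself only names a `hash` function.

// SHA-256 hex digest of the content, plus the length and word count the plan keeps.
export function describePrompt(content: string) {
  const trimmed = content.trim();
  return {
    contentHash: createHash('sha256').update(content, 'utf8').digest('hex'),
    contentLength: content.length,
    wordCount: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
  };
}

// Reduce a file path to its directory plus an extension pattern.
export function anonymizePath(filePath: string): string {
  return path.join(path.dirname(filePath), '*' + path.extname(filePath));
}

// Keep only the command name. Leading NAME=value assignments are skipped
// because they can carry secrets just like arguments can.
export function commandName(command: string): string {
  const tokens = command.trim().split(/\s+/);
  const name = tokens.find((t) => !/^[A-Za-z_][A-Za-z0-9_]*=/.test(t));
  return name ?? '';
}
```

An unsalted hash of a short prompt can be reversed by hashing candidate strings, so a keyed hash (HMAC with a server-side secret) may be the safer choice if prompts are often short.
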
### 5.2 Retention Policy
| Data Type | Retention | Storage |
|-----------|-----------|---------|
| Raw logs | 7 days | Loki |
| Aggregated metrics | 90 days | Prometheus |
| Session summaries | 1 year | PostgreSQL |
| Billing data | 7 years | PostgreSQL |

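Loki and Prometheus apply retention through their own configuration, but rows kept in PostgreSQL would need an explicit cleanup job. A minimal sketch of the expiry check such a job could use follows; `RETENTION_DAYS` and `isExpired` are hypothetical names, and treating a year as 365 days is an assumption.

```typescript
// Retention periods from the table above, in days.
// Assumption: 1 year = 365 days, so 7 years = 2555 days.
const RETENTION_DAYS = {
  raw_logs: 7,
  aggregated_metrics: 90,
  session_summaries: 365,
  billing_data: 7 * 365,
} as const;

type DataType = keyof typeof RETENTION_DAYS;

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a record created at `createdAt` is past its retention window at `now`.
export function isExpired(dataType: DataType, createdAt: Date, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > RETENTION_DAYS[dataType] * MS_PER_DAY;
}
```
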
### 5.3 Access Control
- Logs accessible only to platform admins
- Users can request their own data export
- Stack owners can view their stack's metrics in Grafana

---

## 6. Grafana Dashboards

### 6.1 Platform Overview
- Total active stacks
- Messages per hour (all stacks)
- Token usage by model
- Error rate
- Top agents used

### 6.2 Per-Stack Dashboard
- Session count over time
- Token usage
- Tool usage breakdown
- Response time percentiles
- Error log viewer

### 6.3 Alerts
```yaml
# alerting-rules.yml
groups:
  - name: ai-stack-alerts
    rules:
      - alert: StackUnhealthy
        expr: up{job="ai-stacks"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Stack {{ $labels.stack_name }} is down"

      - alert: HighErrorRate
        # opencode_errors_total is not among the metrics defined in 3.3.1
        # and would need to be added there
        expr: rate(opencode_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.stack_name }}"
```

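To make the `HighErrorRate` threshold concrete: a rate of 0.1 is measured in errors per second, which over the 5-minute window works out to roughly 30 errors. The sketch below shows that arithmetic on two counter samples. It is a simplification, not Prometheus's algorithm: `rate()` also corrects for counter resets and extrapolates to the window edges. `Sample`, `simpleRate`, and `exceedsThreshold` are hypothetical names.

```typescript
// Simplified per-second rate between two counter samples.
// Assumption: no counter reset inside the window.
interface Sample {
  timestampSec: number;
  value: number;
}

export function simpleRate(first: Sample, last: Sample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  if (elapsed <= 0) throw new RangeError('samples must be in increasing time order');
  return (last.value - first.value) / elapsed;
}

// Threshold from the HighErrorRate rule: more than 0.1 errors per second.
const THRESHOLD_PER_SEC = 0.1;

export function exceedsThreshold(first: Sample, last: Sample): boolean {
  return simpleRate(first, last) > THRESHOLD_PER_SEC;
}

// Example: 31 errors in 300 s is about 0.103/s and would trip the rule;
// 29 errors in 300 s is about 0.097/s and would not.
```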
---

## 7. Implementation Checklist

### Phase 1: Container Logging
- [ ] Set up Loki + Promtail on logging server
- [ ] Configure Docker log driver for ai-stack containers
- [ ] Add log rotation to Dockerfile
- [ ] Verify logs flowing to Loki

### Phase 2: Session Logging
- [ ] Create logging hook in oh-my-opencode
- [ ] Define event schema
- [ ] Implement log shipping (HTTP or file-based)
- [ ] Add session/message/tool logging

### Phase 3: Metrics
- [ ] Add prom-client to container
- [ ] Expose /metrics endpoint
- [ ] Configure Prometheus scraping
- [ ] Create initial Grafana dashboards

### Phase 4: Production Hardening
- [ ] Implement data anonymization
- [ ] Set up retention policies
- [ ] Configure alerts
- [ ] Document runbooks

---

## 8. Cost Estimates

| Component | Resource | Monthly Cost |
|-----------|----------|--------------|
| Loki | 50GB logs @ 7 days | ~$15 |
| Prometheus | 10GB metrics @ 90 days | ~$10 |
| Grafana | 1 instance | Free (OSS) |
| Log ingestion | Network | ~$5 |
| **Total** | | **~$30/month** |

---

## 9. Next Steps

1. **Approve plan** - Review and confirm approach
2. **Deploy logging infra** - Loki/Prometheus/Grafana on dedicated server
3. **Modify Dockerfile** - Add logging configuration
4. **Create oh-my-opencode hooks** - Session/message/tool logging
5. **Build dashboards** - Grafana visualizations
6. **Test with pilot stack** - Validate before rollout
7. **Rollout to all stacks** - Update deployer to include logging config