ai-stack-deployer/CLAUDE.md

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨


***YOU ARE REQUIRED TO FOLLOW THESE PRINCIPAL RULES***
- ***MUST*** USE SKILL TODO
- ***MUST*** FOLLOW YOUR TODO
- ***MUST*** USE DOCUMENTATION/REPOSITORIES AFTER 3 TRIES
- ***MUST*** PROPERLY TEST WHAT YOU ARE DOING
- ***NEVER*** ASSUME
- ***MUST*** BE SURE
- ***MUST*** DOCUMENT YOUR FINDINGS FOR THE NEXT TIME
- ***MUST*** CLEAN UP PROPERLY
- ***MUST*** USE/UPDATE YOUR TEST DOCUMENT

🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

## Project Overview

AI Stack Deployer is a self-service portal that deploys personal OpenCode AI coding assistant stacks. Users enter a name, and the system provisions containers via Dokploy to create a fully functional AI stack at `{name}.ai.flexinit.nl`. Wildcard DNS and SSL are pre-configured, so deployments only need to create the Dokploy project and application.

## Core Architecture

### Deployment Flow
The system orchestrates deployments through Dokploy, leveraging pre-configured infrastructure:

1. **Dokploy API** - Manages projects, applications, and container deployments
2. **Traefik** - Handles SSL termination and routing (pre-configured wildcard DNS and SSL)

Each deployment creates:
- Dokploy project: `ai-stack-{name}`
- Application with OpenCode server + ttyd terminal
- Domain configuration with automatic HTTPS (Traefik handles SSL via wildcard cert)

**Note**: DNS is pre-configured with wildcard `*.ai.flexinit.nl` → `144.76.116.169`. Individual DNS records are NOT created per deployment - Traefik routes based on hostname matching.

### Two Runtime Modes

1. **HTTP Server** (`src/index.ts`) - Hono-based API for web portal
   - **Fully implemented** production-ready web application
   - REST API endpoints for deployment management
   - Server-Sent Events (SSE) for real-time progress streaming
   - Static frontend serving (HTML/CSS/JS)
   - In-memory deployment state tracking
   - CORS and logging middleware

2. **MCP Server** (`src/mcp-server.ts`) - Model Context Protocol server
   - Development tool for Claude Code integration
   - Exposes deployment tools via stdio transport
   - Same deployment logic as HTTP server
   - Useful for testing and automation

### API Clients

**HetznerDNSClient** (`src/api/hetzner.ts`):
- Available for DNS management via Hetzner Cloud API
- **NOT used in deployment flow** (wildcard DNS already configured)
- Key methods: `createARecord()`, `recordExists()`, `findRRSetByName()`
- Could be used for manual DNS operations or testing

**DokployClient** (`src/api/dokploy.ts`):
- Orchestrates container deployments (primary deployment mechanism)
- Key flow: `createProject()` → `createApplication()` → `createDomain()` → `deployApplication()`
- Communicates with internal Dokploy at `http://10.100.0.20:3000`
- Traefik on Dokploy automatically handles SSL via pre-configured wildcard certificate
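
The key flow above can be sketched against a minimal interface. The four method names come from this document; the parameter and return shapes, the `DokployLike` interface, and the `FakeDokploy` class are assumptions made for illustration and may not match `src/api/dokploy.ts`.

```typescript
// Hypothetical shapes: the real DokployClient signatures may differ.
interface DokployLike {
  createProject(name: string): Promise<{ projectId: string }>;
  createApplication(projectId: string, name: string): Promise<{ applicationId: string }>;
  createDomain(applicationId: string, host: string): Promise<void>;
  deployApplication(applicationId: string): Promise<void>;
}

// Runs the four steps in order and returns the public URL of the stack.
async function deployStack(
  client: DokployLike,
  name: string,
  domainSuffix = "ai.flexinit.nl",
): Promise<{ projectId: string; applicationId: string; url: string }> {
  const host = `${name}.${domainSuffix}`;
  const { projectId } = await client.createProject(`ai-stack-${name}`);
  const { applicationId } = await client.createApplication(projectId, name);
  await client.createDomain(applicationId, host);
  await client.deployApplication(applicationId);
  return { projectId, applicationId, url: `https://${host}` };
}

// In-memory stand-in that records call order, so the flow can be
// exercised without real credentials.
class FakeDokploy implements DokployLike {
  calls: string[] = [];
  async createProject(name: string) {
    this.calls.push(`createProject:${name}`);
    return { projectId: "p1" };
  }
  async createApplication(projectId: string, name: string) {
    this.calls.push(`createApplication:${projectId}:${name}`);
    return { applicationId: "a1" };
  }
  async createDomain(applicationId: string, host: string) {
    this.calls.push(`createDomain:${applicationId}:${host}`);
  }
  async deployApplication(applicationId: string) {
    this.calls.push(`deployApplication:${applicationId}`);
  }
}
```

Because each step awaits the previous one, a failure in any step stops the flow and leaves the earlier resources in place.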

### HTTP API Endpoints

The HTTP server exposes the following endpoints:

**Health Check**:
- `GET /health` - Returns service health status

**Deployment**:
- `POST /api/deploy` - Start a new deployment
  - Body: `{ "name": "stack-name" }`
  - Returns: `{ deploymentId, url, statusEndpoint }`
- `GET /api/status/:deploymentId` - SSE stream for deployment progress
  - Events: `progress`, `complete`, `error`
- `GET /api/check/:name` - Check if a name is available
  - Returns: `{ available, valid, error? }`

**Frontend**:
- `GET /` - Serves the web UI (`src/frontend/index.html`)
- `GET /static/*` - Serves static assets (CSS, JS)
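
For reference, a Server-Sent Events frame consists of an `event:` line, a `data:` line, and a blank line. The helpers below are a hypothetical sketch of how the three event types listed above could be serialized and read back; the function names and the payload fields are assumptions, not the actual code in `src/index.ts`.

```typescript
type DeployEventName = "progress" | "complete" | "error";

// Serializes one SSE frame. JSON.stringify (without an indent argument)
// escapes newlines, so the payload fits on a single `data:` line.
function formatSseEvent(event: DeployEventName, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Parses a single frame produced by formatSseEvent.
// This is not a general-purpose SSE parser.
function parseSseEvent(frame: string): { event: string; data: unknown } {
  let event = "message"; // SSE default when no `event:` line is present
  const dataLines: string[] = [];
  for (const line of frame.split("\n")) {
    if (line.startsWith("event: ")) event = line.slice("event: ".length);
    else if (line.startsWith("data: ")) dataLines.push(line.slice("data: ".length));
  }
  return { event, data: JSON.parse(dataLines.join("\n")) };
}
```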

### State Management

Both servers track deployments in memory using a `Map`:
```typescript
interface DeploymentState {
  id: string;              // dep_{timestamp}_{random}
  name: string;            // normalized username
  status: 'initializing' | 'creating_project' | 'creating_application' |
          'deploying' | 'completed' | 'failed';
  url?: string;            // https://{name}.ai.flexinit.nl
  error?: string;
  projectId?: string;
  applicationId?: string;
  progress: number;        // 0-100 (HTTP server only)
  currentStep: string;     // Human-readable step (HTTP server only)
}
```

**Note**: State is in-memory only and lost on server restart. For production with persistence, implement database storage.
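
A minimal sketch of how such a `Map` could be used. The interface mirrors the one above; the helper names (`createDeploymentId`, `startDeployment`, `updateDeployment`) and the six-character random suffix are assumptions for illustration only.

```typescript
type DeploymentStatus =
  | "initializing" | "creating_project" | "creating_application"
  | "deploying" | "completed" | "failed";

interface DeploymentState {
  id: string;
  name: string;
  status: DeploymentStatus;
  url?: string;
  error?: string;
  projectId?: string;
  applicationId?: string;
  progress: number;
  currentStep: string;
}

// Process-local store: contents are lost when the server restarts.
const deployments = new Map<string, DeploymentState>();

// Builds an ID in the dep_{timestamp}_{random} shape.
function createDeploymentId(now: number = Date.now()): string {
  const random = Math.random().toString(36).slice(2, 8).padEnd(6, "0");
  return `dep_${now}_${random}`;
}

function startDeployment(name: string): DeploymentState {
  const state: DeploymentState = {
    id: createDeploymentId(),
    name,
    status: "initializing",
    progress: 0,
    currentStep: "Initializing",
  };
  deployments.set(state.id, state);
  return state;
}

// Merges a partial update into the stored state; the ID cannot be changed.
function updateDeployment(
  id: string,
  patch: Partial<Omit<DeploymentState, "id">>,
): DeploymentState {
  const current = deployments.get(id);
  if (!current) throw new Error(`Unknown deployment: ${id}`);
  const next: DeploymentState = { ...current, ...patch };
  deployments.set(id, next);
  return next;
}
```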

### Name Validation

Stack names must be:
- 3-20 characters
- Lowercase alphanumeric with hyphens
- Cannot start/end with hyphen
- Not in reserved list (admin, api, www, root, system, test, demo, portal)
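
The rules above could be checked roughly as follows. This is a sketch: the function name, the error messages, and the return shape are assumptions, and the real validator in the codebase may differ (for example, it may normalize input before checking).

```typescript
const DEFAULT_RESERVED_NAMES: readonly string[] = [
  "admin", "api", "www", "root", "system", "test", "demo", "portal",
];

interface NameCheck {
  valid: boolean;
  error?: string;
}

// Applies the four rules in order and reports the first violation.
function validateStackName(
  name: string,
  reserved: readonly string[] = DEFAULT_RESERVED_NAMES,
): NameCheck {
  if (name.length < 3 || name.length > 20) {
    return { valid: false, error: "Name must be 3-20 characters" };
  }
  if (!/^[a-z0-9-]+$/.test(name)) {
    return { valid: false, error: "Only lowercase letters, digits, and hyphens are allowed" };
  }
  if (name.startsWith("-") || name.endsWith("-")) {
    return { valid: false, error: "Name cannot start or end with a hyphen" };
  }
  if (reserved.indexOf(name) !== -1) {
    return { valid: false, error: "Name is reserved" };
  }
  return { valid: true };
}
```

For example, `validateStackName("my-stack-1")` passes, while `"ab"`, `"Alice"`, `"-abc"`, and `"admin"` are each rejected by a different rule.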

### Frontend Architecture

**Location**: `src/frontend/`
- `index.html` - Main UI with state machine (form, progress, success, error)
- `style.css` - Modern gradient design with animations
- `app.js` - Vanilla JavaScript with SSE client and real-time validation

**Features**:
- Real-time name availability checking
- Client-side validation with server verification
- SSE-powered live deployment progress tracking
- State machine: Form → Progress → Success/Error
- Responsive design (mobile-friendly)
- No framework dependencies (vanilla JS)
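
The Form → Progress → Success/Error state machine can be expressed as a small transition table. The sketch below is written in TypeScript for consistency with the rest of the codebase, although `app.js` itself is vanilla JavaScript; the event names and the `reset` transitions back to the form are assumptions.

```typescript
type UiState = "form" | "progress" | "success" | "error";
type UiEvent = "submit" | "complete" | "fail" | "reset";

// Allowed transitions; any event not listed for a state is ignored.
const transitions: Record<UiState, Partial<Record<UiEvent, UiState>>> = {
  form: { submit: "progress" },
  progress: { complete: "success", fail: "error" },
  success: { reset: "form" },
  error: { reset: "form" },
};

function nextState(state: UiState, event: UiEvent): UiState {
  const target = transitions[state][event];
  return target === undefined ? state : target;
}
```

Keeping the transitions in one table makes it harder to reach an inconsistent UI, such as showing the success view while a deployment is still streaming progress.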

## Development Commands

```bash
# Development server (HTTP API with hot reload)
bun run dev

# Production server (HTTP API)
bun run start

# MCP server (for Claude Code integration)
bun run mcp

# Type checking
bun run typecheck

# Build for production
bun run build

# Test API clients (requires valid credentials)
bun run src/test-clients.ts

# Docker commands
docker build -t ai-stack-deployer .
docker-compose up -d
docker-compose logs -f
docker-compose down
```

## Session Management

The project supports two types of Claude Code sessions:

**🤖 Built-in Sessions** (Automatic)
- Created automatically by Claude Code for every conversation
- Stored in `~/.claude/projects/.../`
- Resume with: `claude --session-id {uuid}` or `claude --continue`

**📁 Custom Sessions** (Optional, for organization)
- Created explicitly via `./scripts/claude-start.sh {name}`
- Enable named sessions and Graphiti Memory auto-integration
- Best for: feature development, bug fixes, multi-day work

### Commands

```bash
# List ALL sessions (both built-in and custom)
bash scripts/claude-session.sh list

# Create/resume custom named session
./scripts/claude-start.sh feature-http-api

# Delete a custom session
bash scripts/claude-session.sh delete feature-http-api

# Override permission mode (default: bypassPermissions)
CLAUDE_PERMISSION_MODE=prompt ./scripts/claude-start.sh feature-name
```

### Custom Session Benefits

**Automatic Configuration:**
- Permission mode: `bypassPermissions` (no permission prompts for file operations)
- Session ID: Persistent UUID throughout work session
- Environment variables: Auto-set for Graphiti Memory integration

**Environment Variables (Set Automatically):**
```bash
CLAUDE_SESSION_ID=550e8400-e29b-41d4-a716-446655440000
CLAUDE_SESSION_NAME=feature-http-api
CLAUDE_SESSION_START=2026-01-09 20:16:00
CLAUDE_SESSION_PROJECT=ai-stack-deployer
CLAUDE_SESSION_MCP_GROUP=project_ai_stack_deployer
```

**Graphiti Memory Integration:**
```javascript
// At session end, store learnings
graphiti-memory_add_memory({
  name: "Session: feature-http-api - 2026-01-09",
  episode_body: "Session ID: 550e8400. Implemented HTTP server endpoints for deploy API. Added SSE for progress updates. Tests passing.",
  group_id: "project_ai_stack_deployer"  // Auto-set from CLAUDE_SESSION_MCP_GROUP
})
```

**Storage:**
- Custom sessions: `$HOME/.claude/sessions/ai-stack-deployer/*.session`
- Built-in sessions: `~/.claude/projects/-home-odouhou-locale-projects-ai-stack-deployer/*.jsonl`

## Environment Variables

Required for deployment operations:
- `DOKPLOY_URL` - Dokploy API URL (http://10.100.0.20:3000)
- `DOKPLOY_API_TOKEN` - Dokploy API authentication token

Optional configuration:
- `PORT` - HTTP server port (default: 3000)
- `HOST` - HTTP server bind address (default: 0.0.0.0)
- `STACK_DOMAIN_SUFFIX` - Domain suffix for stacks (default: ai.flexinit.nl)
- `STACK_IMAGE` - Docker image for user stacks
- `RESERVED_NAMES` - Comma-separated list of forbidden names

Not used in deployment (available for testing/manual operations):
- `HETZNER_API_TOKEN` - Hetzner Cloud API token
- `HETZNER_ZONE_ID` - DNS zone ID (343733 for flexinit.nl)
- `TRAEFIK_IP` - Public IP (144.76.116.169) - only for reference

See `.env.example` for complete configuration template.
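
A sketch of how these variables could be read with the documented defaults. The `AppConfig` shape and `loadConfig` name are hypothetical, and treating an unset `RESERVED_NAMES` as an empty list is an assumption; the environment is passed in as a plain object so the function can be tested without touching the real process environment.

```typescript
interface AppConfig {
  dokployUrl: string;
  dokployApiToken: string;
  port: number;
  host: string;
  stackDomainSuffix: string;
  stackImage?: string;
  reservedNames: string[];
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Fail fast when a required variable is missing or empty.
  const missing = ["DOKPLOY_URL", "DOKPLOY_API_TOKEN"].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const rawPort = env["PORT"] === undefined ? "3000" : env["PORT"];
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }

  const rawReserved = env["RESERVED_NAMES"] === undefined ? "" : env["RESERVED_NAMES"];
  return {
    dokployUrl: env["DOKPLOY_URL"] as string,
    dokployApiToken: env["DOKPLOY_API_TOKEN"] as string,
    port,
    host: env["HOST"] || "0.0.0.0",
    stackDomainSuffix: env["STACK_DOMAIN_SUFFIX"] || "ai.flexinit.nl",
    stackImage: env["STACK_IMAGE"],
    reservedNames: rawReserved
      .split(",")
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0),
  };
}
```

In the server this could be called once at startup as `loadConfig(process.env)`.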

## MCP Server Integration

The MCP server is configured in `.mcp.json` and provides these tools:

- `deploy_stack` - Deploys a new AI stack (Dokploy orchestration only, no DNS creation)
- `check_deployment_status` - Query deployment progress by ID
- `list_deployments` - List all deployments in current session
- `check_name_availability` - Validate name before deployment
- `test_api_connections` - Verify Hetzner and Dokploy connectivity (both clients available for testing)

To test MCP functionality:
```bash
# Start MCP server
bun run mcp

# Test API connections
bun run src/test-clients.ts
```

## Key Implementation Details

### Error Handling
Both API clients throw errors on failure. The MCP server catches these and returns structured error responses. No automatic retry logic exists yet.

### Deployment Idempotency
- Dokploy projects: Searches for existing project by name before creating
- Creates only if not found
- No automatic cleanup on partial failures
- DNS is wildcard-based, so no per-deployment DNS operations needed
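
The search-then-create behaviour described above can be sketched as a find-or-create helper. The `ProjectApi` interface, its `listProjects` method, and the `InMemoryProjects` stand-in are hypothetical and exist only to make the example self-contained.

```typescript
interface Project {
  projectId: string;
  name: string;
}

// Hypothetical subset of a Dokploy client.
interface ProjectApi {
  listProjects(): Promise<Project[]>;
  createProject(name: string): Promise<Project>;
}

// Returns the existing project with this name, or creates one if none exists.
async function findOrCreateProject(
  api: ProjectApi,
  name: string,
): Promise<{ project: Project; created: boolean }> {
  const projects = await api.listProjects();
  const existing = projects.filter((p) => p.name === name)[0];
  if (existing) return { project: existing, created: false };
  return { project: await api.createProject(name), created: true };
}

// In-memory stand-in used to exercise the helper without credentials.
class InMemoryProjects implements ProjectApi {
  private projects: Project[] = [];
  async listProjects(): Promise<Project[]> {
    return this.projects.slice();
  }
  async createProject(name: string): Promise<Project> {
    const project = { projectId: `p${this.projects.length + 1}`, name };
    this.projects.push(project);
    return project;
  }
}
```

Note that check-then-create is not atomic: two concurrent requests for the same name can both miss the lookup and both attempt to create the project.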

### Concurrency
The MCP server handles one request at a time per invocation. No rate limiting or queue management exists yet.

### Security Notes
- All tokens in environment variables (never in code)
- Dokploy URL is internal-only (10.100.0.x network)
- No authentication on HTTP endpoints (portal will need auth)
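- Name validation restricts stack names to lowercase alphanumerics and hyphens, which keeps unsafe characters out of hostnames and Dokploy project names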

## Testing Strategy

Currently implemented:
- `src/test-clients.ts` - Manual testing of Hetzner and Dokploy clients
- Requires real API credentials in `.env`
- Note: Only Dokploy client is used in actual deployments

Missing (needs implementation):
- Unit tests for validation logic
- Integration tests for deployment flow
- Mock API clients for testing without credentials
- Health check monitoring
- Rollback on failures

## Common Patterns

### Adding a New MCP Tool
1. Define tool schema in `tools` array (src/mcp-server.ts:178)
2. Add case to switch statement in `CallToolRequestSchema` handler (src/mcp-server.ts:249)
3. Extract typed arguments: `const { arg } = args as { arg: Type }`
4. Return structured response with `content: [{ type: 'text', text: JSON.stringify(...) }]`
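
Steps 2 to 4 can be illustrated without the MCP SDK. In this sketch the tool name `echo_name`, its `stackName` argument, and the helper `textResult` are hypothetical; only the `content: [{ type: 'text', text: ... }]` result shape is taken from the steps above.

```typescript
// Shape of a text tool result, matching step 4.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function textResult(payload: unknown, isError = false): ToolResult {
  const result: ToolResult = {
    content: [{ type: "text", text: JSON.stringify(payload, null, 2) }],
  };
  if (isError) result.isError = true;
  return result;
}

// Stand-in for the switch statement inside the tool-call handler.
function handleToolCall(name: string, args: Record<string, unknown>): ToolResult {
  switch (name) {
    case "echo_name": {
      // Step 3: extract typed arguments, then verify them at runtime.
      const { stackName } = args as { stackName?: string };
      if (typeof stackName !== "string") {
        return textResult({ error: "stackName must be a string" }, true);
      }
      return textResult({ stackName, length: stackName.length });
    }
    default:
      return textResult({ error: `Unknown tool: ${name}` }, true);
  }
}
```

The cast in step 3 is not checked at runtime, which is why the sketch adds an explicit `typeof` guard before using the argument.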

### Adding HTTP Endpoints
1. Add route to Hono app in `src/index.ts`
2. Use API clients from `src/api/` directory
3. Return JSON with consistent error format
4. Consider adding SSE for long-running operations

### Extending API Clients
- Keep TypeScript interfaces at top of file
- Use `satisfies` for type-safe request bodies
- Throw descriptive errors (include API status codes)
- Add methods to client class, use `private request()` helper
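
A sketch of these conventions in one place. `ExampleClient`, the `/example/project` path, the `Bearer` header, and the injected `FetchLike` type are all assumptions made so the example runs without network access; they are not the Dokploy or Hetzner API.

```typescript
// Interfaces at the top of the file.
interface CreateProjectBody {
  name: string;
  description?: string;
}

// Minimal fetch-compatible signature so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

class ExampleClient {
  private baseUrl: string;
  private token: string;
  private fetchImpl: FetchLike;

  constructor(baseUrl: string, token: string, fetchImpl: FetchLike) {
    this.baseUrl = baseUrl;
    this.token = token;
    this.fetchImpl = fetchImpl;
  }

  // Shared helper: sends JSON and throws a descriptive error on non-2xx.
  private async request<T>(method: string, path: string, body?: unknown): Promise<T> {
    const headers = {
      "Content-Type": "application/json",
      Authorization: `Bearer ${this.token}`,
    };
    const init = body === undefined
      ? { method, headers }
      : { method, headers, body: JSON.stringify(body) };
    const res = await this.fetchImpl(`${this.baseUrl}${path}`, init);
    const text = await res.text();
    if (!res.ok) {
      throw new Error(`${method} ${path} failed with status ${res.status}: ${text}`);
    }
    return (text ? JSON.parse(text) : undefined) as T;
  }

  createProject(name: string): Promise<{ projectId: string }> {
    // `satisfies` checks the literal against the interface without widening it.
    const body = { name } satisfies CreateProjectBody;
    return this.request<{ projectId: string }>("POST", "/example/project", body);
  }
}
```

The global `fetch` is structurally compatible with `FetchLike`, so production code could pass it in directly.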

## Production Deployment

### Docker Build and Run

**Build Architecture**: The Dockerfile uses a hybrid approach to avoid AVX CPU requirements:

- **Build stage** (Node.js 20): Builds React client with Vite (no AVX required)
- **Runtime stage** (Bun 1.3): Runs the API server (Bun only needs AVX for builds, not runtime)

This approach ensures the Docker image builds successfully on all CPU architectures, including older systems and some cloud build environments that lack AVX support.

```bash
# Build the Docker image
docker build -t ai-stack-deployer:latest .

# Run with docker-compose (recommended)
docker-compose up -d

# Or run manually
docker run -d \
  --name ai-stack-deployer \
  -p 3000:3000 \
  --env-file .env \
  ai-stack-deployer:latest
```

**Note**: If you encounter "CPU lacks AVX support" errors during Docker builds, ensure you're using the latest Dockerfile which implements the Node.js/Bun hybrid build strategy.

### Deploying to Dokploy

1. **Prepare Environment**:
   - Ensure `.env` file has valid `DOKPLOY_API_TOKEN`
   - Verify `DOKPLOY_URL` points to internal Dokploy instance

2. **Build and Push Image** (if using custom registry):
   ```bash
   docker build -t your-registry/ai-stack-deployer:latest .
   docker push your-registry/ai-stack-deployer:latest
   ```

3. **Deploy via Dokploy UI**:
   - Create new project: `ai-stack-deployer-portal`
   - Create application from Docker image
   - Configure domain (e.g., `portal.ai.flexinit.nl`)
   - Set environment variables from `.env`
   - Deploy

4. **Verify Deployment**:
   ```bash
   curl https://portal.ai.flexinit.nl/health
   ```

### Health Monitoring

The application includes a `/health` endpoint that returns:
```json
{
  "status": "healthy",
  "timestamp": "2026-01-09T...",
  "version": "0.1.0",
  "service": "ai-stack-deployer",
  "activeDeployments": 0
}
```

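The Docker health check runs every 30 seconds. An unhealthy container is restarted only when the orchestrator acts on health status (Docker Swarm does; plain `docker run` and docker-compose only mark the container as unhealthy).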
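
A sketch of how the payload shown above could be assembled. The function names are hypothetical, and counting every non-terminal status as an active deployment is an assumption about how `activeDeployments` is derived.

```typescript
interface HealthPayload {
  status: "healthy";
  timestamp: string;
  version: string;
  service: string;
  activeDeployments: number;
}

// Assumption: a deployment is "active" until it is completed or failed.
function countActiveDeployments(statuses: readonly string[]): number {
  return statuses.filter((s) => s !== "completed" && s !== "failed").length;
}

function buildHealthPayload(
  statuses: readonly string[],
  now: Date = new Date(),
): HealthPayload {
  return {
    status: "healthy",
    timestamp: now.toISOString(),
    version: "0.1.0",
    service: "ai-stack-deployer",
    activeDeployments: countActiveDeployments(statuses),
  };
}
```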

## Infrastructure Dependencies

- **Wildcard DNS** - `*.ai.flexinit.nl` → `144.76.116.169` (pre-configured in Hetzner DNS)
- **Traefik** at 144.76.116.169 - Pre-configured wildcard SSL certificate for `*.ai.flexinit.nl`
- **Dokploy** at 10.100.0.20:3000 - Container orchestration platform (handles all deployments)
- **Docker image** - oh-my-opencode-free (OpenCode + ttyd terminal)

**Key Point**: Individual DNS records are NOT created per deployment. The wildcard DNS and SSL are already configured, so Traefik automatically routes `{name}.ai.flexinit.nl` to the correct container based on hostname matching.

## Logging Infrastructure

AI Stack logging integrates with the existing monitoring stack at `logs.intra.flexinit.nl`.

### Components

| Component | Location | Purpose |
|-----------|----------|---------|
| Log-ingest | `http://ai-stack-log-ingest:3000` (dokploy-network) | Receives events from AI stacks, pushes to Loki |
| Loki | `monitor-grafanaloki-qkj16i-loki-1` | Log storage |
| Grafana | https://logs.intra.flexinit.nl | Visualization |
| Dashboard | `/d/ai-stack-overview` | AI Stack metrics and logs |

### Datasource UIDs (Grafana)
- Loki: `af9a823s6iku8b`
- Prometheus: `cf9r1fmfw9xxcf`

### Configuration

AI stacks send logs via environment variable:
```
LOG_INGEST_URL=http://ai-stack-log-ingest:3000/ingest
```

### Local Development

The `logging-stack/` directory contains a standalone docker-compose for local testing:
```bash
cd logging-stack && docker-compose up -d
```

### Credentials

Grafana service account token stored in BWS:
- Key: `GRAFANA_OPENCODE_ACCESS_TOKEN`
- BWS ID: `c77e58e3-fb34-41dc-9824-b3ce00da18a0`

## CI/CD - Gitea Actions

The `oh-my-opencode-free` Docker image is built automatically via Gitea Actions on push to main.

### Check Workflow Status

**Web UI:**
```
https://git.app.flexinit.nl/oussamadouhou/oh-my-opencode-free/actions
```

**API:**
```bash
# Get token from BWS (key: GITEA_API_TOKEN)
GITEA_TOKEN="<token>"

# List recent runs with status
curl -s -H "Authorization: token $GITEA_TOKEN" \
  "https://git.app.flexinit.nl/api/v1/repos/oussamadouhou/oh-my-opencode-free/actions/runs?limit=5" | \
  jq '.workflow_runs[] | {run_number, status, conclusion, display_title, head_sha: .head_sha[0:7]}'
```

### API Response Fields

| Field | Values |
|-------|--------|
| `status` | `queued`, `in_progress`, `completed` |
| `conclusion` | `success`, `failure`, `cancelled`, `skipped` |
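
When consuming this response from a script, the two fields can be combined into a single label. The sketch below uses the field names from the `jq` filter above; the `WorkflowRun` type, the helper names, and the assumption that `conclusion` is `null` until a run completes are illustrative.

```typescript
type RunStatus = "queued" | "in_progress" | "completed";
type RunConclusion = "success" | "failure" | "cancelled" | "skipped";

interface WorkflowRun {
  run_number: number;
  status: RunStatus;
  conclusion: RunConclusion | null; // assumed to be null while the run is not completed
}

// `conclusion` is only meaningful once `status` is `completed`.
function describeRun(run: WorkflowRun): string {
  if (run.status !== "completed") return `#${run.run_number}: ${run.status}`;
  return `#${run.run_number}: ${run.conclusion === null ? "unknown" : run.conclusion}`;
}

function hasFailed(run: WorkflowRun): boolean {
  return run.status === "completed" && run.conclusion === "failure";
}
```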

### Credentials

- **GITEA_API_TOKEN** - Gitea API access (stored in BWS)

## Project Status

✅ **Completed**:
- HTTP Server with REST API and SSE streaming
- Frontend UI with real-time deployment tracking
- MCP Server for Claude Code integration
- Docker configuration for production deployment
- Full deployment orchestration via Dokploy API
- Name validation and availability checking
- Error handling and progress reporting

**Ready for Production Deployment**