
# PromptLooper

> The one who loops prompts — a universal LLM pipeline tuning workbench.

PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt × model × parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering.

It ships as a single Docker container (SQLite mode) for a zero-config quickstart, or as a Docker Compose stack (Postgres + Redis) for production use. An MCP server lets any AI agent drive PromptLooper programmatically: creating experiments, running sweeps, and reading results without human intervention.

---

## Problem Statement

Anyone building LLM-powered applications faces the same painful loop:

1. Write a system prompt
2. Pick a model and parameters (temperature, top_p, max_tokens, etc.)
3. Run it against sample data
4. Read the output and decide if it's "good enough"
5. Tweak something and repeat

This process is manual, unscientific, and wasteful. There's no way to:
- Systematically compare configurations side-by-side
- Know if you've already tested a particular combination
- Quantify "better" beyond gut feeling
- Let an agent handle the iteration while you steer from above
- Share optimized configurations between projects or team members

PromptLooper makes this process systematic, observable, cached, and agent-drivable.

---

## Target Users

| User | Use Case |
|------|----------|
| **Solo developer** | Tuning prompts for a side project, wants to try 5 models and find the sweet spot |
| **Team building RAG pipelines** | Optimizing chunking + embedding + retrieval + synthesis prompts across stages |
| **AI agent (via MCP)** | Autonomously running optimization sweeps, reporting back to human when done |
| **Prompt engineer** | A/B testing prompt variants at scale with quantified scoring |
| **Infrastructure team** | Benchmarking new models against existing baselines before migration |

---

## Core Concepts

### Experiment

A named configuration that defines:
- **Sample data**: Input documents, queries, or any text the pipeline will process
- **Pipeline stages**: One or more sequential stages, each with its own prompt template and model config
- **Evaluation criteria**: Scoring functions that grade the output
- **Parameter space**: What to vary (prompt text, model, temperature, top_p, chunk_size, etc.)

### Run

A single execution of one specific configuration within an experiment. A run captures:
- Full input configuration (prompt, model, all parameters)
- Raw LLM response(s)
- Timing data (latency, tokens in/out)
- Evaluation scores
- Configuration hash (for cache deduplication)

### Sweep

A batch of runs that systematically explores a parameter space. Types:
- **Grid sweep**: Every combination of specified parameter values
- **Random sweep**: Random sampling from parameter ranges
- **Guided sweep**: Agent-driven, where results from previous runs inform the next configuration to try
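
As a rough illustration of the first two sweep types, the sketch below expands a parameter space into concrete run configurations. The function names, the dictionary layout, and the model identifiers are assumptions made for this example, not part of the PromptLooper API.

```python
import itertools
import random
from typing import Any


def grid_sweep(space: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return every combination of the listed values (Cartesian product)."""
    names = sorted(space)  # fixed key order keeps the run order reproducible
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_sweep(
    ranges: dict[str, tuple[float, float]], n: int, seed: int = 0
) -> list[dict[str, float]]:
    """Draw n configurations uniformly from (low, high) ranges."""
    rng = random.Random(seed)  # seeded so a sweep can be repeated
    return [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n)
    ]


# Hypothetical parameter space: two models, temperature 0.1 to 0.9 in 0.2 steps.
space = {
    "model": ["qwen-72b", "qwen-32b"],
    "temperature": [0.1, 0.3, 0.5, 0.7, 0.9],
}
grid = grid_sweep(space)  # 2 models x 5 temperatures
samples = random_sweep({"temperature": (0.0, 1.0), "top_p": (0.5, 1.0)}, n=3)
```

A guided sweep would replace the up-front expansion with a loop that picks the next configuration after each scored run; that selection logic is left out here because the spec does not define it.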

### Scoring Function

A pluggable evaluation that takes (input, output, context) and returns a numeric score, stored normalized to the range 0.0–1.0. Built-in options:
- **Embedding similarity**: How semantically close is the output to a reference answer?
- **Length compliance**: Does the output meet length constraints?
- **Format compliance**: Does the output match expected structure (JSON, markdown, etc.)?
- **Keyword presence**: Do required terms appear in the output?
- **Human rating**: Manual thumbs-up/down or 1-5 star rating from the dashboard
- **LLM-as-judge**: Use a separate LLM call to evaluate quality (configurable judge prompt)
- **Custom function**: User-provided Python snippet or HTTP webhook
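
To make the (input, output, context) contract concrete, here is a minimal sketch of what a scorer interface could look like, with two simple scorers in the spirit of the built-ins above. The `Scorer` protocol, the class names, and the `keywords` context key are assumptions for illustration; the real plugin interface may differ.

```python
import json
from typing import Any, Protocol


class Scorer(Protocol):
    """Assumed plugin shape: a name plus a score() returning 0.0-1.0."""

    name: str

    def score(self, input_text: str, output: str, context: dict[str, Any]) -> float: ...


class KeywordPresence:
    name = "keyword_presence"

    def score(self, input_text: str, output: str, context: dict[str, Any]) -> float:
        # Fraction of required keywords found in the output (case-insensitive).
        keywords = context.get("keywords", [])
        if not keywords:
            return 1.0
        lowered = output.lower()
        hits = sum(1 for kw in keywords if kw.lower() in lowered)
        return hits / len(keywords)


class JsonFormatCompliance:
    name = "format_compliance"

    def score(self, input_text: str, output: str, context: dict[str, Any]) -> float:
        # 1.0 if the output parses as JSON, otherwise 0.0.
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        return 1.0


kw_score = KeywordPresence().score(
    "", "Use a compressor and EQ", {"keywords": ["compressor", "reverb"]}
)
```

Returning a value already in the 0.0–1.0 range keeps scorers directly comparable and lets them be combined with weights.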

### Project

A workspace that groups related experiments. Users can return to a project and pick up where they left off. Projects store:
- All experiments and their runs
- Saved "best" configurations
- Notes and annotations
- Export history

---

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_promptlooper (ub01)                               │
│  Network: promptlooper (172.33.0.0/24)                                   │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────────────────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │         FastAPI (API)            │  │
│  │  :5434      │  │  job queue  │  │  Experiments, Runs, Scoring,     │  │
│  │  experiments│  │  pub/sub    │  │  Projects, Auth, MCP Server      │  │
│  │  runs, cache│  │  live state │  │  WebSocket for live dashboard    │  │
│  └─────┬───────┘  └──────┬──────┘  └──────────────┬───────────────────┘  │
│        │                 │                        │                      │
│  ┌─────┴─────────────────┴────────────────────────┴───────────────────┐  │
│  │                      Celery Worker                                 │  │
│  │  Executes runs against target LLM endpoints                        │  │
│  │  Caches responses by config hash                                   │  │
│  │  Streams progress via Redis pub/sub                                │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                    Web UI (React + Vite)                           │  │
│  │  nginx → :8400                                                     │  │
│  │  Dashboard, Experiment Builder, Live Observability, Steering       │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                              │
                              │  HTTP (OpenAI-compatible)
                              ▼
              ┌───────────────────────────────┐
              │  Target LLM Endpoints          │
              │  OpenWebUI, vLLM, Ollama,      │
              │  OpenAI, Anthropic, any        │
              │  OpenAI-compatible API          │
              └───────────────────────────────┘
```

### Services (Production Compose)

| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| `promptlooper-db` | `postgres:16-alpine` | `5434 → 5432` | Primary data store |
| `promptlooper-redis` | `redis:7-alpine` | — | Celery broker + pub/sub for live dashboard |
| `promptlooper-api` | `Dockerfile` | `8000` | FastAPI REST API + MCP server |
| `promptlooper-worker` | `Dockerfile` | — | Celery worker (run execution) |
| `promptlooper-web` | `Dockerfile` | `8400 → 80` | React frontend (nginx) |

### Single Container Mode

When `DATABASE_URL` is not set, PromptLooper runs with:
- SQLite at `/data/promptlooper.db`
- In-process task queue (no Celery/Redis dependency)
- All services in one container on port 8400

```bash
docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper
```

---

## Data Model

### User
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| username | string | Unique; the first (admin) account is created during first-boot setup |
| password_hash | string | bcrypt |
| is_admin | bool | Default true for first user |
| created_at | timestamp | |

### Project
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| name | string | |
| description | text | Optional |
| owner_id | UUID | FK → User |
| created_at | timestamp | |
| updated_at | timestamp | |

### Experiment
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| project_id | UUID | FK → Project |
| name | string | |
| description | text | Optional |
| sample_data | JSONB | Input documents/queries |
| pipeline_stages | JSONB | Stage definitions with prompt templates |
| scoring_config | JSONB | Which scoring functions to use and their weights |
| parameter_space | JSONB | What to vary and ranges/options |
| status | enum | draft, running, paused, completed |
| created_at | timestamp | |
| updated_at | timestamp | |

### Run
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| experiment_id | UUID | FK → Experiment |
| config_hash | string(64) | SHA-256 of full configuration (for cache dedup) |
| config | JSONB | Complete configuration snapshot |
| status | enum | pending, running, completed, failed, cached |
| started_at | timestamp | |
| completed_at | timestamp | |
| duration_ms | int | Wall clock time |
| tokens_in | int | Total input tokens across all stages |
| tokens_out | int | Total output tokens |
| cost_estimate | decimal | Estimated cost based on model pricing |

### StageResult
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| run_id | UUID | FK → Run |
| stage_index | int | 0-based stage number |
| prompt_sent | text | Actual prompt after template rendering |
| response_raw | text | Raw LLM response |
| model_used | string | Model identifier |
| parameters | JSONB | Temperature, top_p, etc. |
| tokens_in | int | This stage |
| tokens_out | int | This stage |
| latency_ms | int | This stage |

### Score
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| run_id | UUID | FK → Run |
| scorer_name | string | e.g. "embedding_similarity", "human_rating" |
| value | float | Normalized 0.0–1.0 |
| metadata | JSONB | Scorer-specific details |
| created_at | timestamp | |

### ResponseCache
| Field | Type | Notes |
|-------|------|-------|
| config_hash | string(64) | PK — SHA-256 of (prompt + model + params + input) |
| response | text | Cached LLM response |
| model | string | |
| tokens_in | int | |
| tokens_out | int | |
| latency_ms | int | Original latency |
| created_at | timestamp | |
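
The table above maps naturally onto a get-or-call lookup. The following is a minimal sketch using the standard-library `sqlite3` module; the table name `response_cache`, the helper names, and the returned status strings are assumptions for this example rather than the actual schema or code.

```python
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Create the cache table if needed (columns follow the table above)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS response_cache ("
        " config_hash TEXT PRIMARY KEY,"
        " response TEXT NOT NULL,"
        " model TEXT,"
        " tokens_in INTEGER,"
        " tokens_out INTEGER,"
        " latency_ms INTEGER,"
        " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def get_or_call(
    conn: sqlite3.Connection,
    config_hash: str,
    model: str,
    call_llm: Callable[[], str],
) -> tuple[str, str]:
    """Return (response, status); the LLM is only called on a cache miss."""
    row = conn.execute(
        "SELECT response FROM response_cache WHERE config_hash = ?",
        (config_hash,),
    ).fetchone()
    if row is not None:
        return row[0], "cached"
    response = call_llm()
    conn.execute(
        "INSERT INTO response_cache (config_hash, response, model) VALUES (?, ?, ?)",
        (config_hash, response, model),
    )
    conn.commit()
    return response, "completed"
```

Passing the LLM call in as a function keeps the cache layer independent of any particular endpoint adapter.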

### WebhookConfig
| Field | Type | Notes |
|-------|------|-------|
| id | UUID | PK |
| event_type | string | One of the events listed under Webhook Events, e.g. experiment.completed, new_best_found, budget.exhausted, human_needed |
| url | string | Target URL |
| headers | JSONB | Optional auth headers |
| is_active | bool | |

---

## API Endpoints

### Auth
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/auth/setup` | First-boot admin account setup (username + password) |
| POST | `/api/v1/auth/login` | Login, returns JWT |
| GET | `/api/v1/auth/me` | Current user info |

### Admin
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/admin/settings` | System settings (guest access, default model, etc.) |
| PUT | `/api/v1/admin/settings` | Update settings |
| GET | `/api/v1/admin/stats` | System-wide stats (total runs, cache hit rate, etc.) |

### Projects
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/projects` | List projects |
| POST | `/api/v1/projects` | Create project |
| GET | `/api/v1/projects/{id}` | Project detail with experiment summaries |
| PUT | `/api/v1/projects/{id}` | Update project |
| DELETE | `/api/v1/projects/{id}` | Delete project and all experiments |

### Experiments
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/experiments` | List experiments (filter by project) |
| POST | `/api/v1/experiments` | Create experiment |
| GET | `/api/v1/experiments/{id}` | Experiment detail with run summaries |
| PUT | `/api/v1/experiments/{id}` | Update experiment config |
| DELETE | `/api/v1/experiments/{id}` | Delete experiment |
| POST | `/api/v1/experiments/{id}/sweep` | Start a sweep (grid, random, or guided) |
| POST | `/api/v1/experiments/{id}/pause` | Pause running sweep |
| POST | `/api/v1/experiments/{id}/resume` | Resume paused sweep |
| POST | `/api/v1/experiments/{id}/stop` | Stop sweep |

### Runs
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/experiments/{id}/runs` | List runs with scores (sortable, filterable) |
| GET | `/api/v1/runs/{id}` | Run detail with stage results |
| POST | `/api/v1/runs` | Execute a single run (ad-hoc) |
| POST | `/api/v1/runs/{id}/score` | Add human rating to a run |
| GET | `/api/v1/experiments/{id}/leaderboard` | Top runs ranked by weighted score |
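
The leaderboard ranks runs by a weighted score. One plausible reading, sketched below, is a weighted mean of the per-scorer values using the weights from `scoring_config`. The normalization by total weight and the in-memory run layout are assumptions; the spec does not fix the exact formula.

```python
from typing import Any


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over the scorers present on a run (assumed formula)."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights.get(name, 0.0) for name, value in scores.items())
    return weighted_sum / total_weight


def leaderboard(
    runs: list[dict[str, Any]], weights: dict[str, float], top_n: int = 10
) -> list[dict[str, Any]]:
    """Return the top_n runs, best weighted score first."""
    ranked = sorted(
        runs, key=lambda run: weighted_score(run["scores"], weights), reverse=True
    )
    return ranked[:top_n]


# Hypothetical runs and weights for illustration.
runs = [
    {"id": "a", "scores": {"embedding_similarity": 0.9, "format_compliance": 0.0}},
    {"id": "b", "scores": {"embedding_similarity": 0.7, "format_compliance": 1.0}},
]
weights = {"embedding_similarity": 0.5, "format_compliance": 0.5}
top = leaderboard(runs, weights)
```

Here run `b` ranks first: it trades a little similarity for valid formatting, which the equal weights reward.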

### Export
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/experiments/{id}/export/best` | Best config as JSON |
| GET | `/api/v1/experiments/{id}/export/env` | Best config as .env snippet |
| GET | `/api/v1/experiments/{id}/export/yaml` | Best config as YAML |
| GET | `/api/v1/experiments/{id}/export/report` | Full experiment report (markdown) |

### LLM Endpoints (Target Management)
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/endpoints` | List configured LLM endpoints |
| POST | `/api/v1/endpoints` | Add endpoint (URL, API key, label) |
| PUT | `/api/v1/endpoints/{id}` | Update endpoint |
| DELETE | `/api/v1/endpoints/{id}` | Remove endpoint |
| POST | `/api/v1/endpoints/{id}/test` | Test connectivity and list available models |

### Webhooks
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/webhooks` | List webhook configs |
| POST | `/api/v1/webhooks` | Create webhook |
| DELETE | `/api/v1/webhooks/{id}` | Remove webhook |

### WebSocket
| Path | Description |
|------|-------------|
| `/ws/experiments/{id}` | Live stream: run progress, scores, stage completions |
| `/ws/dashboard` | Global activity feed across all experiments |

### Health
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (DB connectivity, plus Redis when `REDIS_URL` is set) |

---

## MCP Server

PromptLooper exposes an MCP (Model Context Protocol) server so AI agents can drive it programmatically. The MCP server runs as part of the API service.

### MCP Tools

| Tool | Description |
|------|-------------|
| `create_project` | Create a new project workspace |
| `create_experiment` | Define an experiment with sample data, stages, and scoring |
| `configure_endpoint` | Add or update an LLM target endpoint |
| `run_single` | Execute one specific configuration and return results |
| `run_sweep` | Start a parameter sweep (grid/random/guided) |
| `get_leaderboard` | Get top N configurations ranked by score |
| `get_run_detail` | Get full details of a specific run |
| `export_best_config` | Export the best configuration in JSON/YAML/env format |
| `pause_sweep` | Pause a running sweep |
| `resume_sweep` | Resume a paused sweep |
| `add_human_score` | Rate a run's output |
| `get_experiment_status` | Check experiment progress |
| `list_models` | List available models across all configured endpoints |

### Example Agent Interaction

```
Agent: "Create a project called 'Chrysopedia Extraction' and an experiment
        that tests the stage3_extraction prompt against Qwen-72B and Qwen-32B,
        sweeping temperature from 0.1 to 0.9 in 0.2 increments.
        Use embedding similarity scoring against these reference outputs.
        Run a grid sweep."

PromptLooper MCP: [create_project] → [create_experiment] → [run_sweep]
                  → streams progress → [get_leaderboard]

Agent: "The top config uses Qwen-72B at temperature 0.3. Export it as
        a .env snippet I can drop into Chrysopedia."

PromptLooper MCP: [export_best_config format=env]
```

---

## Response Caching

Every LLM call is cached by a SHA-256 hash of:
- Prompt text (after template rendering)
- Model identifier
- All inference parameters (temperature, top_p, max_tokens, etc.)
- Input data

If an identical configuration has been run before, the cached response is returned instantly with `status: cached`. This means:
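- Re-running an experiment with different scoring functions triggers no new generation calls (scorers that call a model themselves, such as LLM-as-judge, still incur their own cost)
- A newly added scorer can be applied retroactively to all historical runs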
- Accidentally re-running a sweep wastes nothing
- Cache can be invalidated per-run or per-experiment if needed
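
A cache key of this kind is only stable if the inputs are serialized the same way every time. The sketch below shows one way to do that with canonical JSON (sorted keys, fixed separators) before hashing; the function name and payload layout are assumptions, not the actual implementation.

```python
import hashlib
import json
from typing import Any


def config_hash(prompt: str, model: str, params: dict[str, Any], input_data: Any) -> str:
    """SHA-256 hex digest over a canonical JSON encoding of the four inputs."""
    payload = {
        "prompt": prompt,      # rendered prompt text
        "model": model,        # model identifier
        "params": params,      # temperature, top_p, max_tokens, ...
        "input": input_data,   # sample data fed to the stage
    }
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h1 = config_hash("Summarize:", "qwen2.5-72b-instruct", {"temperature": 0.3, "top_p": 0.85}, "doc")
h2 = config_hash("Summarize:", "qwen2.5-72b-instruct", {"top_p": 0.85, "temperature": 0.3}, "doc")
h3 = config_hash("Summarize:", "qwen2.5-72b-instruct", {"temperature": 0.5, "top_p": 0.85}, "doc")
```

`h1` and `h2` are equal because only the key order differs, while `h3` differs because the temperature changed. The digest is 64 hex characters, matching the `string(64)` column in the data model.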

---

## Authentication Model

### First Boot
- App detects no users exist
- Presents a setup screen: create admin username + password
- The admin account is created and the user is logged in automatically

### Guest Access
- Admin can toggle `allow_guest_access` in settings
- Guests can view experiments and results (read-only)
- Guests cannot create experiments, run sweeps, or modify configs
- Default: guest access disabled

### API Authentication
- JWT tokens for the web UI
- API key (generated in admin settings) for programmatic access and MCP
- API key passed via `Authorization: Bearer <key>` header
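
For programmatic access, a client only needs to attach the key as a bearer token. The sketch below builds (but does not send) a request with the standard-library `urllib`; the base URL `http://localhost:8400`, the helper name, and the request body shape are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(
    base_url: str, path: str, api_key: str, body: Optional[dict[str, Any]] = None
) -> urllib.request.Request:
    """Build an authenticated request: POST with a JSON body, GET without one."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers=headers,
        method="POST" if data is not None else "GET",
    )


# Hypothetical call; the body fields accepted by the API are not specified here.
req = build_request(
    "http://localhost:8400", "/api/v1/projects", "secret-key", {"name": "demo"}
)
# urllib.request.urlopen(req) would send it to a running instance.
```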

---

## Real-Time Observability Dashboard

The dashboard is the primary user interface during active experimentation. It provides:

### Live Experiment View
- Progress bar: X of Y runs completed
- Token usage accumulator (running total)
- Cost estimate (based on configured model pricing)
- Cache hit rate for current sweep
- Estimated time remaining

### Side-by-Side Output Comparison
- Pick any two runs and diff their outputs
- Highlight differences in prompt, parameters, and response
- Score comparison overlay
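
The comparison behind this view can be approximated with the standard-library `difflib`. The following sketch diffs two runs' parameters and response text; the helper names and the run data are made up for the example, and the real dashboard may render differences quite differently.

```python
import difflib
from typing import Any


def diff_params(a: dict[str, Any], b: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each differing parameter to its (value in A, value in B)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def diff_text(a: str, b: str, label_a: str = "run A", label_b: str = "run B") -> str:
    """Unified diff of two outputs, line by line; empty string if identical."""
    lines = difflib.unified_diff(
        a.splitlines(), b.splitlines(), fromfile=label_a, tofile=label_b, lineterm=""
    )
    return "\n".join(lines)


param_changes = diff_params(
    {"temperature": 0.3, "top_p": 0.85},
    {"temperature": 0.7, "top_p": 0.85},
)
text_changes = diff_text("Intro\nUse a compressor.", "Intro\nUse a limiter.")
```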

### Leaderboard
- Real-time ranked list of runs by weighted score
- Sortable by any individual scorer
- Click to expand full run detail

### Steering Controls
- **Pause**: Halt the sweep after the current run completes; it can be resumed later
- **Fork**: Create a new experiment branching from current best, with modified parameters
- **Redirect**: Change remaining sweep parameters mid-flight
- **Approve**: Mark a configuration as "good enough" and export
- **Reject**: Exclude a run from leaderboard consideration

### Activity Timeline
- Chronological feed of events: run started, run completed, new best found, cache hit, error
- Filterable by event type

---

## Webhook Events

| Event | Payload | Trigger |
|-------|---------|---------|
| `experiment.started` | experiment_id, sweep config | Sweep begins |
| `experiment.completed` | experiment_id, best config, summary stats | All runs finished |
| `experiment.paused` | experiment_id, reason | Manual or budget pause |
| `new_best_found` | experiment_id, run_id, scores, config | New top-scoring run |
| `budget.exhausted` | experiment_id, token_count, cost | Token/cost budget hit |
| `human_needed` | experiment_id, reason, context | Agent requests human review |
| `run.failed` | run_id, error | Individual run error |
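
Dispatching an event amounts to selecting the active webhook configs for that event type and posting a JSON body to each. The sketch below prepares the deliveries without sending them; the envelope fields (`event`, `sent_at`, `data`), the helper names, and the sample URLs are assumptions, since the spec only lists the payload contents.

```python
import json
from datetime import datetime, timezone
from typing import Any


def matching_webhooks(configs: list[dict[str, Any]], event: str) -> list[dict[str, Any]]:
    """Select active webhook configs subscribed to the given event type."""
    return [c for c in configs if c.get("is_active") and c.get("event_type") == event]


def build_delivery(config: dict[str, Any], event: str, data: dict[str, Any]) -> dict[str, Any]:
    """Describe one HTTP POST; the envelope layout is an assumed shape."""
    body = {
        "event": event,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    # Per-webhook headers (e.g. auth) are merged over the default content type.
    headers = {"Content-Type": "application/json", **(config.get("headers") or {})}
    return {"url": config["url"], "headers": headers, "body": json.dumps(body)}


# Hypothetical webhook configs mirroring the WebhookConfig fields.
configs = [
    {"event_type": "new_best_found", "url": "https://example.com/hook",
     "headers": {"Authorization": "Bearer t0ken"}, "is_active": True},
    {"event_type": "new_best_found", "url": "https://example.com/old",
     "headers": None, "is_active": False},
    {"event_type": "run.failed", "url": "https://example.com/errors",
     "headers": None, "is_active": True},
]
deliveries = [
    build_delivery(c, "new_best_found", {"experiment_id": "exp-1", "run_id": "run-123"})
    for c in matching_webhooks(configs, "new_best_found")
]
```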

---

## Configuration Export Formats

### JSON
```json
{
  "model": "qwen2.5-72b-instruct",
  "endpoint": "http://chat.forgetyour.name/api",
  "temperature": 0.3,
  "top_p": 0.85,
  "max_tokens": 2048,
  "system_prompt": "You are a music production knowledge extractor...",
  "score": 0.87,
  "experiment": "chrysopedia-extraction-v2",
  "exported_at": "2026-04-06T12:00:00Z"
}
```

### .env
```bash
LLM_MODEL=qwen2.5-72b-instruct
LLM_API_URL=http://chat.forgetyour.name/api
LLM_TEMPERATURE=0.3
LLM_TOP_P=0.85
LLM_MAX_TOKENS=2048
# Score: 0.87 | Experiment: chrysopedia-extraction-v2
```

### YAML
```yaml
model: qwen2.5-72b-instruct
endpoint: http://chat.forgetyour.name/api
parameters:
  temperature: 0.3
  top_p: 0.85
  max_tokens: 2048
system_prompt: |
  You are a music production knowledge extractor...
metadata:
  score: 0.87
  experiment: chrysopedia-extraction-v2
  exported_at: 2026-04-06T12:00:00Z
```
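
The `.env` format is a flat rendering of the same fields as the JSON export. A minimal sketch of that conversion is shown below; the function name and the key-to-variable mapping are inferred from the examples above and are not a documented interface.

```python
from typing import Any

# Assumed mapping from JSON export keys to .env variable suffixes.
ENV_KEYS = [
    ("model", "MODEL"),
    ("endpoint", "API_URL"),
    ("temperature", "TEMPERATURE"),
    ("top_p", "TOP_P"),
    ("max_tokens", "MAX_TOKENS"),
]


def to_env(config: dict[str, Any], prefix: str = "LLM_") -> str:
    """Render a best-config dict as .env lines plus a trailing comment."""
    lines = [f"{prefix}{suffix}={config[key]}" for key, suffix in ENV_KEYS if key in config]
    if "score" in config and "experiment" in config:
        lines.append(f"# Score: {config['score']} | Experiment: {config['experiment']}")
    return "\n".join(lines)


best = {
    "model": "qwen2.5-72b-instruct",
    "endpoint": "http://chat.forgetyour.name/api",
    "temperature": 0.3,
    "top_p": 0.85,
    "max_tokens": 2048,
    "score": 0.87,
    "experiment": "chrysopedia-extraction-v2",
}
snippet = to_env(best)
```

The system prompt is left out of the `.env` rendering here, as in the example above, since multi-line values are awkward in that format.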

---

## Environment Variables

| Group | Variable | Default | Notes |
|-------|----------|---------|-------|
| **Database** | `DATABASE_URL` | (none → SQLite) | PostgreSQL connection string |
| **Redis** | `REDIS_URL` | (none → in-process) | Redis connection string |
| **Server** | `HOST` | `0.0.0.0` | Bind address |
| **Server** | `PORT` | `8400` | HTTP port |
| **Auth** | `JWT_SECRET` | (auto-generated) | JWT signing key |
| **Auth** | `API_KEY` | (none) | Static API key for programmatic access |
| **Defaults** | `DEFAULT_ENDPOINT_URL` | (none) | Pre-configured LLM endpoint |
| **Defaults** | `DEFAULT_ENDPOINT_KEY` | (none) | API key for default endpoint |
| **Limits** | `MAX_CONCURRENT_RUNS` | `4` | Parallel run limit |
| **Limits** | `MAX_TOKENS_PER_SWEEP` | `0` (unlimited) | Token budget per sweep |
| **Storage** | `DATA_DIR` | `/data` | SQLite DB + file storage location |
| **MCP** | `MCP_ENABLED` | `true` | Enable MCP server |
| **MCP** | `MCP_PORT` | `8401` | MCP server port |

---

## Docker Compose (Production — XPLTD Conventions)

- Project name: `xpltd_promptlooper`
- Network: `promptlooper` (`172.33.0.0/24`)
- Persistent data: `/vmPool/r/services/promptlooper_*`
- PostgreSQL port: `5434` (external)
- Web UI port: `8400` (external)

---

## Technology Stack

| Layer | Technology | Rationale |
|-------|-----------|-----------|
| **API** | Python 3.12 + FastAPI | Async, OpenAPI auto-gen, matches XPLTD conventions |
| **Task Queue** | Celery + Redis | Proven for background job execution, matches Chrysopedia |
| **Database** | PostgreSQL 16 (prod) / SQLite (single-container) | JSONB for flexible experiment configs |
| **Real-time** | WebSocket via FastAPI + Redis pub/sub | Sub-second dashboard updates |
| **Frontend** | React 18 + TypeScript + Vite | Real-time dashboard, matches Chrysopedia |
| **Styling** | Tailwind CSS | Fast iteration, utility-first |
| **MCP** | Python MCP SDK | Standard protocol for agent integration |
| **Container** | Multi-stage Docker build | Single image serves both API and frontend |

---

## Development & Deployment

### Local Development
```bash
git clone git@git.xpltd.co:xpltdco/promptlooper.git
cd promptlooper
cp .env.example .env
docker compose up -d promptlooper-db promptlooper-redis
cd backend && pip install -r requirements.txt
alembic upgrade head
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# In another terminal:
cd frontend && npm install && npm run dev
```

### Production Deployment (ub01)
```bash
ssh ub01
cd /vmPool/r/repos/xpltdco/promptlooper
git pull && docker compose build && docker compose up -d
```

### Project Structure
```
promptlooper/
├── backend/
│   ├── main.py                 # FastAPI entry point
│   ├── config.py               # Pydantic Settings
│   ├── models.py               # SQLAlchemy ORM
│   ├── schemas.py              # Pydantic request/response
│   ├── auth.py                 # JWT + API key auth
│   ├── worker.py               # Celery app config
│   ├── routers/
│   │   ├── auth.py
│   │   ├── projects.py
│   │   ├── experiments.py
│   │   ├── runs.py
│   │   ├── endpoints.py
│   │   ├── export.py
│   │   ├── webhooks.py
│   │   └── admin.py
│   ├── engine/
│   │   ├── runner.py           # Run execution logic
│   │   ├── sweep.py            # Sweep orchestration
│   │   ├── cache.py            # Response cache layer
│   │   ├── adapters/           # LLM endpoint adapters
│   │   │   ├── openai_compat.py
│   │   │   └── base.py
│   │   └── scorers/            # Pluggable scoring functions
│   │       ├── embedding.py
│   │       ├── format.py
│   │       ├── keyword.py
│   │       ├── llm_judge.py
│   │       └── base.py
│   ├── mcp/
│   │   ├── server.py           # MCP server implementation
│   │   └── tools.py            # MCP tool definitions
│   ├── websocket/
│   │   └── manager.py          # WebSocket connection management
│   └── tests/
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── Setup.tsx       # First-boot admin setup
│       │   ├── Login.tsx
│       │   ├── Dashboard.tsx   # Global activity
│       │   ├── Projects.tsx
│       │   ├── Experiment.tsx  # Experiment builder + config
│       │   ├── Live.tsx        # Real-time observability
│       │   ├── Compare.tsx     # Side-by-side run comparison
│       │   └── Admin.tsx       # System settings
│       ├── components/
│       │   ├── Leaderboard.tsx
│       │   ├── SteeringControls.tsx
│       │   ├── RunCard.tsx
│       │   ├── ScoreChart.tsx
│       │   └── Timeline.tsx
│       └── api/
├── docker/
│   ├── Dockerfile              # Multi-stage: API + frontend
│   └── nginx.conf
├── alembic/
├── docker-compose.yml
├── .env.example
├── CLAUDE.md
└── README.md
```