commit fc2e4cd7d102d1aed7f23b72da0b582c28895794 Author: John Lightner Date: Tue Apr 7 01:39:18 2026 -0500 MAESTRO: Initialize repository with README, .gitignore, and project files Add README.md with project description, quick-start instructions, and AGPL-3.0 license badge. Add .gitignore for Python, Node, and Docker artifacts. Include existing CLAUDE.md, spec, docker-compose.yml, and env.example. diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..94f133b --- /dev/null +++ b/.gitignore @@ -0,0 +1,57 @@ +# Python +__pycache__/ +*.py[cod] +*$py.class +*.egg-info/ +*.egg +dist/ +build/ +.eggs/ +*.whl +.venv/ +venv/ +env/ +.env +*.pyc +.pytest_cache/ +.mypy_cache/ +.ruff_cache/ +htmlcov/ +.coverage +.coverage.* + +# Node / Frontend +node_modules/ +frontend/dist/ +frontend/build/ +.npm +*.tsbuildinfo + +# Docker +docker/nginx.conf.bak + +# IDE +.vscode/ +.idea/ +*.swp +*.swo +*~ +.DS_Store + +# OS +Thumbs.db +Desktop.ini + +# Data (single-container mode) +*.db +/data/ + +# Alembic +alembic/versions/__pycache__/ + +# Auto Run Docs (Maestro working files) +Auto Run Docs/Working/ + +# Misc +*.log +*.bak diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..9e774ac --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,127 @@ +# CLAUDE.md — PromptLooper + +## What is this project? + +PromptLooper is a self-hosted LLM pipeline tuning workbench. It runs experiments across prompt × model × parameter combinations, caches every response, scores results, and surfaces optimal configurations through a real-time dashboard. It has an MCP server so AI agents can drive it programmatically. 
+ +## Repository + +- **Hosted at**: git.xpltd.co/xpltdco/promptlooper +- **XPLTD project name**: `xpltd_promptlooper` +- **Sister project**: Chrysopedia (git.xpltd.co/xpltdco/chrysopedia) — a knowledge extraction pipeline that is PromptLooper's first integration target + +## Tech Stack + +- **Backend**: Python 3.12, FastAPI, Celery, SQLAlchemy, Alembic +- **Frontend**: React 18, TypeScript, Vite, Tailwind CSS +- **Database**: PostgreSQL 16 (production) / SQLite (single-container mode) +- **Cache/Queue**: Redis 7 (production) / in-process (single-container) +- **Real-time**: WebSocket via FastAPI + Redis pub/sub +- **MCP**: Python MCP SDK +- **Container**: Multi-stage Docker build, nginx for frontend + +## XPLTD Conventions + +These are non-negotiable project conventions shared across all XPLTD projects: + +- Docker Compose project name: `xpltd_promptlooper` +- Dedicated bridge network: `promptlooper` (`172.33.0.0/24`) +- Persistent data bind mounts under `/vmPool/r/services/promptlooper_*` +- PostgreSQL on external port `5434` (internal `5432`) +- Web UI on port `8400` +- MCP server on port `8401` +- Container naming: `promptlooper-{service}` (e.g., `promptlooper-api`, `promptlooper-db`) + +## Key Architecture Decisions + +1. **No LLM runs inside PromptLooper itself** — it's purely an HTTP client that calls external LLM endpoints. The only exception is the optional "LLM-as-judge" scorer. +2. **Response caching by config hash** — SHA-256 of (prompt + model + params + input). Cache hits return instantly. This is critical for cost control. +3. **Single-container mode** — when `DATABASE_URL` is not set, use SQLite + in-process queue. Zero dependencies. +4. **WebSocket for real-time** — the dashboard connects via WebSocket to receive run progress, score updates, and steering events. +5. **Pluggable scorers** — all scoring functions implement a base class with `score(input, output, context) → float` signature. +6. 
**OpenAI-compatible adapter** — the LLM adapter layer speaks OpenAI's chat completions API. This covers OpenWebUI, vLLM, Ollama, and most providers. + +## File Organization + +``` +backend/ + main.py — FastAPI app, middleware, router mounting + config.py — Pydantic Settings from env vars + models.py — SQLAlchemy ORM models + schemas.py — Pydantic request/response schemas + auth.py — JWT + API key authentication + worker.py — Celery app configuration + routers/ — API endpoint handlers + engine/ — Core experiment execution logic + runner.py — Individual run execution + sweep.py — Sweep orchestration (grid/random/guided) + cache.py — Response cache layer + adapters/ — LLM endpoint adapters + scorers/ — Pluggable scoring functions + mcp/ — MCP server implementation + websocket/ — WebSocket connection management + +frontend/src/ + pages/ — Route-level components + components/ — Shared UI components + api/ — Typed API client functions +``` + +## Database Migrations + +Use Alembic. Same patterns as Chrysopedia: +```bash +alembic revision --autogenerate -m "describe_change" +alembic upgrade head +``` + +## Running Locally + +```bash +docker compose up -d promptlooper-db promptlooper-redis +cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000 +# Frontend in another terminal: +cd frontend && npm run dev +``` + +## Testing + +```bash +cd backend && pytest +cd frontend && npm test +``` + +## Important Patterns + +### Adding a new scorer +1. Create `backend/engine/scorers/my_scorer.py` +2. Implement `BaseScorer` with `name`, `score(input, output, context) → float` +3. Register in `backend/engine/scorers/__init__.py` +4. Add to frontend scorer picker component + +### Adding a new LLM adapter +1. Create `backend/engine/adapters/my_adapter.py` +2. Implement `BaseAdapter` with `complete(prompt, model, params) → response` +3. Register in `backend/engine/adapters/__init__.py` +4. 
Currently only OpenAI-compatible is implemented; all others should be edge cases + +### Adding a new MCP tool +1. Add tool definition in `backend/mcp/tools.py` +2. Implement handler in `backend/mcp/server.py` +3. Tools should map 1:1 to API endpoints where possible + +## Common Gotchas + +- Always hash the FULL config when checking cache — missing a single parameter means cache misses +- WebSocket connections must be cleaned up on disconnect — use the connection manager +- SQLite mode doesn't support concurrent writes — the in-process queue must be single-threaded +- Frontend must handle both WebSocket and polling fallback for environments where WS is blocked +- MCP server runs on a separate port from the main API + +## Deployment + +```bash +ssh ub01 +cd /vmPool/r/repos/xpltdco/promptlooper +git pull && docker compose build && docker compose up -d +``` diff --git a/README.md b/README.md new file mode 100644 index 0000000..0ee27ea --- /dev/null +++ b/README.md @@ -0,0 +1,65 @@ +# PromptLooper + +[![License: AGPL-3.0](https://img.shields.io/badge/License-AGPL--3.0-blue.svg)](https://www.gnu.org/licenses/agpl-3.0) +[![Status: Alpha](https://img.shields.io/badge/Status-Alpha-orange.svg)]() + +> The one who loops prompts — a universal LLM pipeline tuning workbench. + +PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt x model x parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering. + +It ships as a single Docker container (SQLite mode) for zero-config quickstart, or a Docker Compose stack (Postgres + Redis) for production use. An MCP server enables any AI agent to drive PromptLooper programmatically — creating experiments, running sweeps, and reading results without human intervention. 
+ +## Quick Start + +### Single Container (zero dependencies) + +```bash +docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper +``` + +Open `http://localhost:8400` — you'll be prompted to create an admin account on first boot. + +### Production (Docker Compose) + +```bash +git clone git@git.xpltd.co:xpltdco/promptlooper.git +cd promptlooper +cp env.example .env +# Edit .env — set POSTGRES_PASSWORD and JWT_SECRET at minimum +docker compose up -d +``` + +## Features + +- **Systematic experimentation** — grid, random, and guided sweeps across prompt x model x parameter space +- **Response caching** — SHA-256 deduplication means re-runs cost zero tokens +- **Pluggable scoring** — embedding similarity, format compliance, keyword presence, LLM-as-judge, human rating, custom webhooks +- **Real-time dashboard** — live progress, leaderboard, side-by-side comparison, steering controls +- **MCP server** — AI agents can create experiments, run sweeps, and export results programmatically +- **Single-container mode** — SQLite + in-process queue when no external dependencies are configured + +## Development + +```bash +# Start backing services +docker compose up -d promptlooper-db promptlooper-redis + +# Backend +cd backend && pip install -r requirements.txt +alembic upgrade head +uvicorn main:app --reload --host 0.0.0.0 --port 8000 + +# Frontend (separate terminal) +cd frontend && npm install && npm run dev +``` + +## Testing + +```bash +cd backend && pytest +cd frontend && npm test +``` + +## License + +[AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0.html) diff --git a/docker-compose.yml b/docker-compose.yml new file mode 100644 index 0000000..a69e3cc --- /dev/null +++ b/docker-compose.yml @@ -0,0 +1,106 @@ +name: xpltd_promptlooper + +networks: + promptlooper: + driver: bridge + ipam: + config: + - subnet: 172.33.0.0/24 + +services: + promptlooper-db: + image: postgres:16-alpine + container_name: promptlooper-db + restart: unless-stopped + 
networks: + - promptlooper + ports: + - "5434:5432" + environment: + POSTGRES_USER: ${POSTGRES_USER:-promptlooper} + POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?Set POSTGRES_PASSWORD in .env} + POSTGRES_DB: ${POSTGRES_DB:-promptlooper} + volumes: + - /vmPool/r/services/promptlooper_db:/var/lib/postgresql/data + healthcheck: + test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-promptlooper}"] + interval: 10s + timeout: 5s + retries: 5 + + promptlooper-redis: + image: redis:7-alpine + container_name: promptlooper-redis + restart: unless-stopped + networks: + - promptlooper + volumes: + - /vmPool/r/services/promptlooper_redis:/data + healthcheck: + test: ["CMD", "redis-cli", "ping"] + interval: 10s + timeout: 5s + retries: 5 + + promptlooper-api: + build: + context: . + dockerfile: docker/Dockerfile + target: api + container_name: promptlooper-api + restart: unless-stopped + networks: + - promptlooper + ports: + - "8401:8401" # MCP server + environment: + DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-promptlooper}:${POSTGRES_PASSWORD}@promptlooper-db:5432/${POSTGRES_DB:-promptlooper} + REDIS_URL: redis://promptlooper-redis:6379/0 + JWT_SECRET: ${JWT_SECRET:?Set JWT_SECRET in .env} + DEFAULT_ENDPOINT_URL: ${DEFAULT_ENDPOINT_URL:-} + DEFAULT_ENDPOINT_KEY: ${DEFAULT_ENDPOINT_KEY:-} + MAX_CONCURRENT_RUNS: ${MAX_CONCURRENT_RUNS:-4} + MAX_TOKENS_PER_SWEEP: ${MAX_TOKENS_PER_SWEEP:-0} + MCP_ENABLED: ${MCP_ENABLED:-true} + MCP_PORT: 8401 + depends_on: + promptlooper-db: + condition: service_healthy + promptlooper-redis: + condition: service_healthy + + promptlooper-worker: + build: + context: . 
+ dockerfile: docker/Dockerfile + target: api + container_name: promptlooper-worker + restart: unless-stopped + networks: + - promptlooper + command: celery -A backend.worker:app worker --loglevel=info --concurrency=${MAX_CONCURRENT_RUNS:-4} + environment: + DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-promptlooper}:${POSTGRES_PASSWORD}@promptlooper-db:5432/${POSTGRES_DB:-promptlooper} + REDIS_URL: redis://promptlooper-redis:6379/0 + DEFAULT_ENDPOINT_URL: ${DEFAULT_ENDPOINT_URL:-} + DEFAULT_ENDPOINT_KEY: ${DEFAULT_ENDPOINT_KEY:-} + MAX_CONCURRENT_RUNS: ${MAX_CONCURRENT_RUNS:-4} + depends_on: + promptlooper-db: + condition: service_healthy + promptlooper-redis: + condition: service_healthy + + promptlooper-web: + build: + context: . + dockerfile: docker/Dockerfile + target: web + container_name: promptlooper-web + restart: unless-stopped + networks: + - promptlooper + ports: + - "8400:80" + depends_on: + - promptlooper-api diff --git a/env.example b/env.example new file mode 100644 index 0000000..565d9e5 --- /dev/null +++ b/env.example @@ -0,0 +1,23 @@ +# PromptLooper — Environment Configuration +# Copy to .env and fill in required values + +# ── Database ────────────────────────────────────────────── +POSTGRES_USER=promptlooper +POSTGRES_PASSWORD= # REQUIRED: set a strong password +POSTGRES_DB=promptlooper + +# ── Auth ────────────────────────────────────────────────── +JWT_SECRET= # REQUIRED: generate with `openssl rand -hex 32` + +# ── Default LLM Endpoint (optional) ────────────────────── +# Pre-configure an LLM endpoint so users don't have to add one manually +DEFAULT_ENDPOINT_URL= # e.g. 
http://chat.forgetyour.name/api/v1 +DEFAULT_ENDPOINT_KEY= # API key for the default endpoint + +# ── Limits ──────────────────────────────────────────────── +MAX_CONCURRENT_RUNS=4 # Parallel run limit per sweep +MAX_TOKENS_PER_SWEEP=0 # 0 = unlimited; set a number to cap token spend + +# ── MCP Server ──────────────────────────────────────────── +MCP_ENABLED=true # Enable/disable MCP server for agent access +# MCP_PORT=8401 # MCP server port (set in docker-compose) diff --git a/promptlooper-spec.md b/promptlooper-spec.md new file mode 100644 index 0000000..3e84e9d --- /dev/null +++ b/promptlooper-spec.md @@ -0,0 +1,635 @@ +# PromptLooper + +> The one who loops prompts — a universal LLM pipeline tuning workbench. + +PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt × model × parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering. + +It ships as a single Docker container (SQLite mode) for zero-config quickstart, or a Docker Compose stack (Postgres + Redis) for production use. An MCP server enables any AI agent to drive PromptLooper programmatically — creating experiments, running sweeps, and reading results without human intervention. + +--- + +## Problem Statement + +Anyone building LLM-powered applications faces the same painful loop: + +1. Write a system prompt +2. Pick a model and parameters (temperature, top_p, max_tokens, etc.) +3. Run it against sample data +4. Read the output and decide if it's "good enough" +5. Tweak something and repeat + +This process is manual, unscientific, and wasteful. 
There's no way to: +- Systematically compare configurations side-by-side +- Know if you've already tested a particular combination +- Quantify "better" beyond gut feeling +- Let an agent handle the iteration while you steer from above +- Share optimized configurations between projects or team members + +PromptLooper makes this process systematic, observable, cached, and agent-drivable. + +--- + +## Target Users + +| User | Use Case | +|------|----------| +| **Solo developer** | Tuning prompts for a side project, wants to try 5 models and find the sweet spot | +| **Team building RAG pipelines** | Optimizing chunking + embedding + retrieval + synthesis prompts across stages | +| **AI agent (via MCP)** | Autonomously running optimization sweeps, reporting back to human when done | +| **Prompt engineer** | A/B testing prompt variants at scale with quantified scoring | +| **Infrastructure team** | Benchmarking new models against existing baselines before migration | + +--- + +## Core Concepts + +### Experiment + +A named configuration that defines: +- **Sample data**: Input documents, queries, or any text the pipeline will process +- **Pipeline stages**: 1-N sequential stages, each with its own prompt template and model config +- **Evaluation criteria**: Scoring functions that grade the output +- **Parameter space**: What to vary (prompt text, model, temperature, top_p, chunk_size, etc.) + +### Run + +A single execution of one specific configuration within an experiment. A run captures: +- Full input configuration (prompt, model, all parameters) +- Raw LLM response(s) +- Timing data (latency, tokens in/out) +- Evaluation scores +- Configuration hash (for cache deduplication) + +### Sweep + +A batch of runs that systematically explores a parameter space. 
Types: +- **Grid sweep**: Every combination of specified parameter values +- **Random sweep**: Random sampling from parameter ranges +- **Guided sweep**: Agent-driven, where results from previous runs inform the next configuration to try + +### Scoring Function + +A pluggable evaluation that takes (input, output, context) and returns a numeric score. Built-in options: +- **Embedding similarity**: How semantically close is the output to a reference answer? +- **Length compliance**: Does the output meet length constraints? +- **Format compliance**: Does the output match expected structure (JSON, markdown, etc.)? +- **Keyword presence**: Do required terms appear in the output? +- **Human rating**: Manual thumbs-up/down or 1-5 star rating from the dashboard +- **LLM-as-judge**: Use a separate LLM call to evaluate quality (configurable judge prompt) +- **Custom function**: User-provided Python snippet or HTTP webhook + +### Project + +A workspace that groups related experiments. Users can return to a project and pick up where they left off. 
Projects store: +- All experiments and their runs +- Saved "best" configurations +- Notes and annotations +- Export history + +--- + +## Architecture + +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ Docker Compose: xpltd_promptlooper (ub01) │ +│ Network: promptlooper (172.33.0.0/24) │ +│ │ +│ ┌────────────┐ ┌─────────────┐ ┌──────────────────────────────────┐ │ +│ │ PostgreSQL │ │ Redis │ │ FastAPI (API) │ │ +│ │ :5434 │ │ job queue │ │ Experiments, Runs, Scoring, │ │ +│ │ experiments│ │ pub/sub │ │ Projects, Auth, MCP Server │ │ +│ │ runs, cache│ │ live state │ │ WebSocket for live dashboard │ │ +│ └─────┬───────┘ └──────┬──────┘ └──────────────┬───────────────────┘ │ +│ │ │ │ │ +│ ┌─────┴─────────────────┴────────────────────────┴───────────────────┐ │ +│ │ Celery Worker │ │ +│ │ Executes runs against target LLM endpoints │ │ +│ │ Caches responses by config hash │ │ +│ │ Streams progress via Redis pub/sub │ │ +│ └────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────────────────┐ │ +│ │ Web UI (React + Vite) │ │ +│ │ nginx → :8400 │ │ +│ │ Dashboard, Experiment Builder, Live Observability, Steering │ │ +│ └────────────────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────────────┘ + │ + │ HTTP (OpenAI-compatible) + ▼ + ┌───────────────────────────────┐ + │ Target LLM Endpoints │ + │ OpenWebUI, vLLM, Ollama, │ + │ OpenAI, Anthropic, any │ + │ OpenAI-compatible API │ + └───────────────────────────────┘ +``` + +### Services (Production Compose) + +| Service | Image | Port | Purpose | +|---------|-------|------|---------| +| `promptlooper-db` | `postgres:16-alpine` | `5434 → 5432` | Primary data store | +| `promptlooper-redis` | `redis:7-alpine` | — | Celery broker + pub/sub for live dashboard | +| `promptlooper-api` | `Dockerfile` | `8000` | FastAPI REST API + MCP server | 
+| `promptlooper-worker` | `Dockerfile` | — | Celery worker (run execution) | +| `promptlooper-web` | `Dockerfile` | `8400 → 80` | React frontend (nginx) | + +### Single Container Mode + +When `DATABASE_URL` is not set, PromptLooper runs with: +- SQLite at `/data/promptlooper.db` +- In-process task queue (no Celery/Redis dependency) +- All services in one container on port 8400 + +```bash +docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper +``` + +--- + +## Data Model + +### User +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| username | string | Unique, "admin" created on first boot | +| password_hash | string | bcrypt | +| is_admin | bool | Default true for first user | +| created_at | timestamp | | + +### Project +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| name | string | | +| description | text | Optional | +| owner_id | UUID | FK → User | +| created_at | timestamp | | +| updated_at | timestamp | | + +### Experiment +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| project_id | UUID | FK → Project | +| name | string | | +| description | text | Optional | +| sample_data | JSONB | Input documents/queries | +| pipeline_stages | JSONB | Stage definitions with prompt templates | +| scoring_config | JSONB | Which scoring functions to use and their weights | +| parameter_space | JSONB | What to vary and ranges/options | +| status | enum | draft, running, paused, completed | +| created_at | timestamp | | +| updated_at | timestamp | | + +### Run +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| experiment_id | UUID | FK → Experiment | +| config_hash | string(64) | SHA-256 of full configuration (for cache dedup) | +| config | JSONB | Complete configuration snapshot | +| status | enum | pending, running, completed, failed, cached | +| started_at | timestamp | | +| completed_at | timestamp | | +| duration_ms | int | Wall clock time | +| 
tokens_in | int | Total input tokens across all stages | +| tokens_out | int | Total output tokens | +| cost_estimate | decimal | Estimated cost based on model pricing | + +### StageResult +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| run_id | UUID | FK → Run | +| stage_index | int | 0-based stage number | +| prompt_sent | text | Actual prompt after template rendering | +| response_raw | text | Raw LLM response | +| model_used | string | Model identifier | +| parameters | JSONB | Temperature, top_p, etc. | +| tokens_in | int | This stage | +| tokens_out | int | This stage | +| latency_ms | int | This stage | + +### Score +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| run_id | UUID | FK → Run | +| scorer_name | string | e.g. "embedding_similarity", "human_rating" | +| value | float | Normalized 0.0–1.0 | +| metadata | JSONB | Scorer-specific details | +| created_at | timestamp | | + +### ResponseCache +| Field | Type | Notes | +|-------|------|-------| +| config_hash | string(64) | PK — SHA-256 of (prompt + model + params + input) | +| response | text | Cached LLM response | +| model | string | | +| tokens_in | int | | +| tokens_out | int | | +| latency_ms | int | Original latency | +| created_at | timestamp | | + +### WebhookConfig +| Field | Type | Notes | +|-------|------|-------| +| id | UUID | PK | +| event_type | string | experiment.completed, new_best_found, budget.exhausted, human_needed | +| url | string | Target URL | +| headers | JSONB | Optional auth headers | +| is_active | bool | | + +--- + +## API Endpoints + +### Auth +| Method | Path | Description | +|--------|------|-------------| +| POST | `/api/v1/auth/setup` | First-boot admin password setup | +| POST | `/api/v1/auth/login` | Login, returns JWT | +| GET | `/api/v1/auth/me` | Current user info | + +### Admin +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/admin/settings` | System settings (guest access, 
default model, etc.) | +| PUT | `/api/v1/admin/settings` | Update settings | +| GET | `/api/v1/admin/stats` | System-wide stats (total runs, cache hit rate, etc.) | + +### Projects +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/projects` | List projects | +| POST | `/api/v1/projects` | Create project | +| GET | `/api/v1/projects/{id}` | Project detail with experiment summaries | +| PUT | `/api/v1/projects/{id}` | Update project | +| DELETE | `/api/v1/projects/{id}` | Delete project and all experiments | + +### Experiments +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/experiments` | List experiments (filter by project) | +| POST | `/api/v1/experiments` | Create experiment | +| GET | `/api/v1/experiments/{id}` | Experiment detail with run summaries | +| PUT | `/api/v1/experiments/{id}` | Update experiment config | +| DELETE | `/api/v1/experiments/{id}` | Delete experiment | +| POST | `/api/v1/experiments/{id}/sweep` | Start a sweep (grid, random, or guided) | +| POST | `/api/v1/experiments/{id}/pause` | Pause running sweep | +| POST | `/api/v1/experiments/{id}/resume` | Resume paused sweep | +| POST | `/api/v1/experiments/{id}/stop` | Stop sweep | + +### Runs +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/experiments/{id}/runs` | List runs with scores (sortable, filterable) | +| GET | `/api/v1/runs/{id}` | Run detail with stage results | +| POST | `/api/v1/runs` | Execute a single run (ad-hoc) | +| POST | `/api/v1/runs/{id}/score` | Add human rating to a run | +| GET | `/api/v1/experiments/{id}/leaderboard` | Top runs ranked by weighted score | + +### Export +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/experiments/{id}/export/best` | Best config as JSON | +| GET | `/api/v1/experiments/{id}/export/env` | Best config as .env snippet | +| GET | `/api/v1/experiments/{id}/export/yaml` | Best config as YAML | +| GET | 
`/api/v1/experiments/{id}/export/report` | Full experiment report (markdown) | + +### LLM Endpoints (Target Management) +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/endpoints` | List configured LLM endpoints | +| POST | `/api/v1/endpoints` | Add endpoint (URL, API key, label) | +| PUT | `/api/v1/endpoints/{id}` | Update endpoint | +| DELETE | `/api/v1/endpoints/{id}` | Remove endpoint | +| POST | `/api/v1/endpoints/{id}/test` | Test connectivity and list available models | + +### Webhooks +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/webhooks` | List webhook configs | +| POST | `/api/v1/webhooks` | Create webhook | +| DELETE | `/api/v1/webhooks/{id}` | Remove webhook | + +### WebSocket +| Path | Description | +|------|-------------| +| `/ws/experiments/{id}` | Live stream: run progress, scores, stage completions | +| `/ws/dashboard` | Global activity feed across all experiments | + +### Health +| Method | Path | Description | +|--------|------|-------------| +| GET | `/health` | Health check (DB + Redis connectivity) | + +--- + +## MCP Server + +PromptLooper exposes an MCP (Model Context Protocol) server so AI agents can drive it programmatically. The MCP server runs as part of the API service. 
+ +### MCP Tools + +| Tool | Description | +|------|-------------| +| `create_project` | Create a new project workspace | +| `create_experiment` | Define an experiment with sample data, stages, and scoring | +| `configure_endpoint` | Add or update an LLM target endpoint | +| `run_single` | Execute one specific configuration and return results | +| `run_sweep` | Start a parameter sweep (grid/random/guided) | +| `get_leaderboard` | Get top N configurations ranked by score | +| `get_run_detail` | Get full details of a specific run | +| `export_best_config` | Export the best configuration in JSON/YAML/env format | +| `pause_sweep` | Pause a running sweep | +| `resume_sweep` | Resume a paused sweep | +| `add_human_score` | Rate a run's output | +| `get_experiment_status` | Check experiment progress | +| `list_models` | List available models across all configured endpoints | + +### Example Agent Interaction + +``` +Agent: "Create a project called 'Chrysopedia Extraction' and an experiment + that tests the stage3_extraction prompt against Qwen-72B and Qwen-32B, + sweeping temperature from 0.1 to 0.9 in 0.2 increments. + Use embedding similarity scoring against these reference outputs. + Run a grid sweep." + +PromptLooper MCP: [create_project] → [create_experiment] → [run_sweep] + → streams progress → [get_leaderboard] + +Agent: "The top config uses Qwen-72B at temperature 0.3. Export it as + a .env snippet I can drop into Chrysopedia." + +PromptLooper MCP: [export_best_config format=env] +``` + +--- + +## Response Caching + +Every LLM call is cached by a SHA-256 hash of: +- Prompt text (after template rendering) +- Model identifier +- All inference parameters (temperature, top_p, max_tokens, etc.) +- Input data + +If an identical configuration has been run before, the cached response is returned instantly with `status: cached`. 
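As a rough illustration of the cache key described above, here is a minimal Python sketch. The `config_hash` function, the `InMemoryResponseCache` class, and the canonical-JSON serialization are assumptions made for this example; the spec only states that the key is a SHA-256 hash of the prompt, model, parameters, and input.

```python
import hashlib
import json
from typing import Any, Callable


def config_hash(prompt: str, model: str, params: dict[str, Any], input_data: Any) -> str:
    """Return a 64-character SHA-256 hex digest over the full run configuration.

    Hypothetical helper: the serialization scheme is an assumption, not the spec's.
    """
    payload = {"prompt": prompt, "model": model, "params": params, "input": input_data}
    # sort_keys plus fixed separators give a stable byte string regardless of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryResponseCache:
    """Toy stand-in for the ResponseCache table (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_call(self, key: str, call_llm: Callable[[], str]) -> tuple[str, str]:
        """Return (response, status): "cached" on a hit, "completed" after a real call."""
        if key in self._store:
            return self._store[key], "cached"
        response = call_llm()
        self._store[key] = response
        return response, "completed"


cache = InMemoryResponseCache()
key = config_hash(
    "Summarize the document.",                # prompt after template rendering
    "example-model",                          # hypothetical model identifier
    {"temperature": 0.3, "top_p": 0.85},
    "some input text",
)
first = cache.get_or_call(key, lambda: "fake response")
second = cache.get_or_call(key, lambda: "never called")
print(first[1], second[1])  # completed cached
```

Because the key depends only on these four inputs, an identical configuration maps to the same entry no matter which sweep or experiment requests it.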
This means: +- Re-running experiments with new scoring functions costs zero tokens +- Adding a new scorer retroactively evaluates all historical runs +- Accidentally re-running a sweep wastes nothing +- Cache can be invalidated per-run or per-experiment if needed + +--- + +## Authentication Model + +### First Boot +- App detects no users exist +- Presents a setup screen: create admin username + password +- Admin account is created, user is logged in + +### Guest Access +- Admin can toggle `allow_guest_access` in settings +- Guests can view experiments and results (read-only) +- Guests cannot create experiments, run sweeps, or modify configs +- Default: guest access disabled + +### API Authentication +- JWT tokens for the web UI +- API key (generated in admin settings) for programmatic access and MCP +- API key passed via `Authorization: Bearer ` header + +--- + +## Real-Time Observability Dashboard + +The dashboard is the primary user interface during active experimentation. It provides: + +### Live Experiment View +- Progress bar: X of Y runs completed +- Token usage accumulator (running total) +- Cost estimate (based on configured model pricing) +- Cache hit rate for current sweep +- Estimated time remaining + +### Side-by-Side Output Comparison +- Pick any two runs and diff their outputs +- Highlight differences in prompt, parameters, and response +- Score comparison overlay + +### Leaderboard +- Real-time ranked list of runs by weighted score +- Sortable by any individual scorer +- Click to expand full run detail + +### Steering Controls +- **Pause**: Stop the sweep after current run completes +- **Fork**: Create a new experiment branching from current best, with modified parameters +- **Redirect**: Change remaining sweep parameters mid-flight +- **Approve**: Mark a configuration as "good enough" and export +- **Reject**: Exclude a run from leaderboard consideration + +### Activity Timeline +- Chronological feed of events: run started, run completed, new best 
found, cache hit, error
+- Filterable by event type
+
+---
+
+## Webhook Events
+
+| Event | Payload | Trigger |
+|-------|---------|---------|
+| `experiment.started` | experiment_id, sweep config | Sweep begins |
+| `experiment.completed` | experiment_id, best config, summary stats | All runs finished |
+| `experiment.paused` | experiment_id, reason | Manual or budget pause |
+| `new_best_found` | experiment_id, run_id, scores, config | New top-scoring run |
+| `budget.exhausted` | experiment_id, token_count, cost | Token/cost budget hit |
+| `human_needed` | experiment_id, reason, context | Agent requests human review |
+| `run.failed` | run_id, error | Individual run error |
+
+---
+
+## Configuration Export Formats
+
+### JSON
+```json
+{
+  "model": "qwen2.5-72b-instruct",
+  "endpoint": "http://chat.forgetyour.name/api",
+  "temperature": 0.3,
+  "top_p": 0.85,
+  "max_tokens": 2048,
+  "system_prompt": "You are a music production knowledge extractor...",
+  "score": 0.87,
+  "experiment": "chrysopedia-extraction-v2",
+  "exported_at": "2026-04-06T12:00:00Z"
+}
+```
+
+### .env
+```bash
+LLM_MODEL=qwen2.5-72b-instruct
+LLM_API_URL=http://chat.forgetyour.name/api
+LLM_TEMPERATURE=0.3
+LLM_TOP_P=0.85
+LLM_MAX_TOKENS=2048
+# Score: 0.87 | Experiment: chrysopedia-extraction-v2
+```
+
+### YAML
+```yaml
+model: qwen2.5-72b-instruct
+endpoint: http://chat.forgetyour.name/api
+parameters:
+  temperature: 0.3
+  top_p: 0.85
+  max_tokens: 2048
+system_prompt: |
+  You are a music production knowledge extractor...
+metadata:
+  score: 0.87
+  experiment: chrysopedia-extraction-v2
+  exported_at: 2026-04-06T12:00:00Z
+```
+
+---
+
+## Environment Variables
+
+| Group | Variable | Default | Notes |
+|-------|----------|---------|-------|
+| **Database** | `DATABASE_URL` | (none → SQLite) | PostgreSQL connection string |
+| **Redis** | `REDIS_URL` | (none → in-process) | Redis connection string |
+| **Server** | `HOST` | `0.0.0.0` | Bind address |
+| **Server** | `PORT` | `8400` | HTTP port |
+| **Auth** | `JWT_SECRET` | (auto-generated) | JWT signing key |
+| **Auth** | `API_KEY` | (none) | Static API key for programmatic access |
+| **Defaults** | `DEFAULT_ENDPOINT_URL` | (none) | Pre-configured LLM endpoint |
+| **Defaults** | `DEFAULT_ENDPOINT_KEY` | (none) | API key for default endpoint |
+| **Limits** | `MAX_CONCURRENT_RUNS` | `4` | Parallel run limit |
+| **Limits** | `MAX_TOKENS_PER_SWEEP` | `0` (unlimited) | Token budget per sweep |
+| **Storage** | `DATA_DIR` | `/data` | SQLite DB + file storage location |
+| **MCP** | `MCP_ENABLED` | `true` | Enable MCP server |
+| **MCP** | `MCP_PORT` | `8401` | MCP server port |
+
+---
+
+## Docker Compose (Production — XPLTD Conventions)
+
+Project name: `xpltd_promptlooper`
+Network: `promptlooper` (`172.33.0.0/24`)
+Persistent data: `/vmPool/r/services/promptlooper_*`
+PostgreSQL port: `5434` (external)
+Web UI port: `8400` (external)
+
+---
+
+## Technology Stack
+
+| Layer | Technology | Rationale |
+|-------|-----------|-----------|
+| **API** | Python 3.12 + FastAPI | Async, OpenAPI auto-gen, matches XPLTD conventions |
+| **Task Queue** | Celery + Redis | Proven for background job execution, matches Chrysopedia |
+| **Database** | PostgreSQL 16 (prod) / SQLite (single-container) | JSONB for flexible experiment configs |
+| **Real-time** | WebSocket via FastAPI + Redis pub/sub | Sub-second dashboard updates |
+| **Frontend** | React 18 + TypeScript + Vite | Real-time dashboard, matches Chrysopedia |
+| **Styling** | Tailwind CSS | Fast iteration, utility-first |
+| **MCP** | Python MCP SDK | Standard protocol for agent integration |
+| **Container** | Multi-stage Docker build | Single image serves both API and frontend |
+
+---
+
+## Development & Deployment
+
+### Local Development
+```bash
+git clone git@git.xpltd.co:xpltdco/promptlooper.git
+cd promptlooper
+cp .env.example .env
+docker compose up -d promptlooper-db promptlooper-redis
+cd backend && pip install -r requirements.txt
+alembic upgrade head
+uvicorn main:app --reload --host 0.0.0.0 --port 8000
+# In another terminal:
+cd frontend && npm install && npm run dev
+```
+
+### Production Deployment (ub01)
+```bash
+ssh ub01
+cd /vmPool/r/repos/xpltdco/promptlooper
+git pull && docker compose build && docker compose up -d
+```
+
+### Project Structure
+```
+promptlooper/
+├── backend/
+│   ├── main.py                  # FastAPI entry point
+│   ├── config.py                # Pydantic Settings
+│   ├── models.py                # SQLAlchemy ORM
+│   ├── schemas.py               # Pydantic request/response
+│   ├── auth.py                  # JWT + API key auth
+│   ├── worker.py                # Celery app config
+│   ├── routers/
+│   │   ├── auth.py
+│   │   ├── projects.py
+│   │   ├── experiments.py
+│   │   ├── runs.py
+│   │   ├── endpoints.py
+│   │   ├── export.py
+│   │   ├── webhooks.py
+│   │   └── admin.py
+│   ├── engine/
+│   │   ├── runner.py            # Run execution logic
+│   │   ├── sweep.py             # Sweep orchestration
+│   │   ├── cache.py             # Response cache layer
+│   │   ├── adapters/            # LLM endpoint adapters
+│   │   │   ├── openai_compat.py
+│   │   │   └── base.py
+│   │   └── scorers/             # Pluggable scoring functions
+│   │       ├── embedding.py
+│   │       ├── format.py
+│   │       ├── keyword.py
+│   │       ├── llm_judge.py
+│   │       └── base.py
+│   ├── mcp/
+│   │   ├── server.py            # MCP server implementation
+│   │   └── tools.py             # MCP tool definitions
+│   ├── websocket/
+│   │   └── manager.py           # WebSocket connection management
+│   └── tests/
+├── frontend/
+│   └── src/
+│       ├── pages/
+│       │   ├── Setup.tsx        # First-boot admin setup
+│       │   ├── Login.tsx
+│       │   ├── Dashboard.tsx    # Global activity
+│       │   ├── Projects.tsx
+│       │   ├── Experiment.tsx   # Experiment builder + config
+│       │   ├── Live.tsx         # Real-time observability
+│       │   ├── Compare.tsx      # Side-by-side run comparison
+│       │   └── Admin.tsx        # System settings
+│       ├── components/
+│       │   ├── Leaderboard.tsx
+│       │   ├── SteeringControls.tsx
+│       │   ├── RunCard.tsx
+│       │   ├── ScoreChart.tsx
+│       │   └── Timeline.tsx
+│       └── api/
+├── docker/
+│   ├── Dockerfile               # Multi-stage: API + frontend
+│   └── nginx.conf
+├── alembic/
+├── docker-compose.yml
+├── .env.example
+├── CLAUDE.md
+└── README.md
+```
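
The JSON and .env examples under "Configuration Export Formats" carry the same fields under different names, so one export can in principle be derived from the other. The sketch below is illustrative only and is not part of this commit: `ENV_KEYS` and `to_env` are hypothetical names, and the key mapping is inferred from the two example exports rather than from any documented exporter.

```python
# Hypothetical mapping from JSON export keys to .env variable names,
# inferred from the example exports in the spec (an assumption).
ENV_KEYS = {
    "model": "LLM_MODEL",
    "endpoint": "LLM_API_URL",
    "temperature": "LLM_TEMPERATURE",
    "top_p": "LLM_TOP_P",
    "max_tokens": "LLM_MAX_TOKENS",
}


def to_env(config: dict) -> str:
    """Render a JSON-style export dict as .env text (illustrative helper)."""
    # Emit one KEY=value line per mapped field that is present in the config.
    lines = [f"{env}={config[key]}" for key, env in ENV_KEYS.items() if key in config]
    # Score and experiment name go into a trailing comment, as in the example.
    if "score" in config and "experiment" in config:
        lines.append(f"# Score: {config['score']} | Experiment: {config['experiment']}")
    return "\n".join(lines) + "\n"


example = {
    "model": "qwen2.5-72b-instruct",
    "endpoint": "http://chat.forgetyour.name/api",
    "temperature": 0.3,
    "top_p": 0.85,
    "max_tokens": 2048,
    "system_prompt": "You are a music production knowledge extractor...",
    "score": 0.87,
    "experiment": "chrysopedia-extraction-v2",
    "exported_at": "2026-04-06T12:00:00Z",
}
env_text = to_env(example)
```

Fields without a counterpart in the .env example (`system_prompt`, `exported_at`) are dropped here; whether the real exporter does the same is not stated in the spec.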