promptlooper/promptlooper-spec.md
John Lightner fc2e4cd7d1 MAESTRO: Initialize repository with README, .gitignore, and project files
Add README.md with project description, quick-start instructions, and
AGPL-3.0 license badge. Add .gitignore for Python, Node, and Docker
artifacts. Include existing CLAUDE.md, spec, docker-compose.yml, and
env.example.
2026-04-07 01:39:18 -05:00


PromptLooper

The one who loops prompts — a universal LLM pipeline tuning workbench.

PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt × model × parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering.

It ships as a single Docker container (SQLite mode) for zero-config quickstart, or a Docker Compose stack (Postgres + Redis) for production use. An MCP server enables any AI agent to drive PromptLooper programmatically — creating experiments, running sweeps, and reading results without human intervention.


Problem Statement

Anyone building LLM-powered applications faces the same painful loop:

  1. Write a system prompt
  2. Pick a model and parameters (temperature, top_p, max_tokens, etc.)
  3. Run it against sample data
  4. Read the output and decide if it's "good enough"
  5. Tweak something and repeat

This process is manual, unscientific, and wasteful. There's no way to:

  • Systematically compare configurations side-by-side
  • Know if you've already tested a particular combination
  • Quantify "better" beyond gut feeling
  • Let an agent handle the iteration while you steer from above
  • Share optimized configurations between projects or team members

PromptLooper makes this process systematic, observable, cached, and agent-drivable.


Target Users

| User | Use Case |
|------|----------|
| Solo developer | Tuning prompts for a side project, wants to try 5 models and find the sweet spot |
| Team building RAG pipelines | Optimizing chunking + embedding + retrieval + synthesis prompts across stages |
| AI agent (via MCP) | Autonomously running optimization sweeps, reporting back to human when done |
| Prompt engineer | A/B testing prompt variants at scale with quantified scoring |
| Infrastructure team | Benchmarking new models against existing baselines before migration |

Core Concepts

Experiment

A named configuration that defines:

  • Sample data: Input documents, queries, or any text the pipeline will process
  • Pipeline stages: 1-N sequential stages, each with its own prompt template and model config
  • Evaluation criteria: Scoring functions that grade the output
  • Parameter space: What to vary (prompt text, model, temperature, top_p, chunk_size, etc.)

Run

A single execution of one specific configuration within an experiment. A run captures:

  • Full input configuration (prompt, model, all parameters)
  • Raw LLM response(s)
  • Timing data (latency, tokens in/out)
  • Evaluation scores
  • Configuration hash (for cache deduplication)

Sweep

A batch of runs that systematically explores a parameter space. Types:

  • Grid sweep: Every combination of specified parameter values
  • Random sweep: Random sampling from parameter ranges
  • Guided sweep: Agent-driven, where results from previous runs inform the next configuration to try
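
The grid and random sweep types can be sketched as follows. This is a minimal illustration, not PromptLooper's actual sweep engine: the function names and the shape of the parameter-space dictionary are assumptions made for the example. Guided sweeps are left out because they depend on an agent choosing the next configuration.

```python
import itertools
import random


def grid_sweep(space: dict[str, list]) -> list[dict]:
    """Return every combination of the listed parameter values."""
    names = list(space)
    value_lists = (space[name] for name in names)
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def random_sweep(space: dict[str, list], n: int, seed: int = 0) -> list[dict]:
    """Draw n configurations, sampling each parameter independently."""
    rng = random.Random(seed)  # seeded so a sweep can be reproduced
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n)
    ]


# Hypothetical parameter space: two models, five temperatures.
space = {
    "model": ["qwen-72b", "qwen-32b"],
    "temperature": [0.1, 0.3, 0.5, 0.7, 0.9],
}
configs = grid_sweep(space)
print(len(configs))  # 2 models x 5 temperatures = 10 runs
```

A grid sweep grows multiplicatively with each added parameter, which is one reason a random or guided sweep may be preferable for large spaces.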

Scoring Function

A pluggable evaluation that takes (input, output, context) and returns a numeric score. Built-in options:

  • Embedding similarity: How semantically close is the output to a reference answer?
  • Length compliance: Does the output meet length constraints?
  • Format compliance: Does the output match expected structure (JSON, markdown, etc.)?
  • Keyword presence: Do required terms appear in the output?
  • Human rating: Manual thumbs-up/down or 1-5 star rating from the dashboard
  • LLM-as-judge: Use a separate LLM call to evaluate quality (configurable judge prompt)
  • Custom function: User-provided Python snippet or HTTP webhook
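
To make the (input, output, context) contract concrete, here is a hedged sketch of what a pluggable scorer could look like. The `Scorer` protocol, the class names, and the context keys (`required_keywords`, `min_chars`, `max_chars`) are hypothetical; the spec fixes only the three inputs and the numeric return value.

```python
from typing import Protocol


class Scorer(Protocol):
    """Assumed interface: any object with a name and a score() method."""

    name: str

    def score(self, input_text: str, output: str, context: dict) -> float: ...


class KeywordPresence:
    """Fraction of required keywords found in the output (case-insensitive)."""

    name = "keyword_presence"

    def score(self, input_text: str, output: str, context: dict) -> float:
        required = context.get("required_keywords", [])
        if not required:
            return 1.0  # nothing required, so nothing can be missing
        lowered = output.lower()
        hits = sum(1 for keyword in required if keyword.lower() in lowered)
        return hits / len(required)


class LengthCompliance:
    """1.0 if the output length falls inside the configured bounds, else 0.0."""

    name = "length_compliance"

    def score(self, input_text: str, output: str, context: dict) -> float:
        low = context.get("min_chars", 0)
        high = context.get("max_chars", float("inf"))
        return 1.0 if low <= len(output) <= high else 0.0


scorers: list[Scorer] = [KeywordPresence(), LengthCompliance()]
context = {"required_keywords": ["compressor", "reverb"], "max_chars": 200}
results = {s.name: s.score("", "Use a compressor and EQ", context) for s in scorers}
```

Both example scorers return values in the 0.0 to 1.0 range, matching the normalized `value` field of the Score table.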

Project

A workspace that groups related experiments. Users can return to a project and pick up where they left off. Projects store:

  • All experiments and their runs
  • Saved "best" configurations
  • Notes and annotations
  • Export history

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_promptlooper (ub01)                               │
│  Network: promptlooper (172.33.0.0/24)                                   │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────────────────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │         FastAPI (API)            │  │
│  │  :5434      │  │  job queue  │  │  Experiments, Runs, Scoring,     │  │
│  │  experiments│  │  pub/sub    │  │  Projects, Auth, MCP Server      │  │
│  │  runs, cache│  │  live state │  │  WebSocket for live dashboard    │  │
│  └─────┬───────┘  └──────┬──────┘  └──────────────┬───────────────────┘  │
│        │                 │                        │                      │
│  ┌─────┴─────────────────┴────────────────────────┴───────────────────┐  │
│  │                      Celery Worker                                 │  │
│  │  Executes runs against target LLM endpoints                        │  │
│  │  Caches responses by config hash                                   │  │
│  │  Streams progress via Redis pub/sub                                │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                    Web UI (React + Vite)                           │  │
│  │  nginx → :8400                                                     │  │
│  │  Dashboard, Experiment Builder, Live Observability, Steering       │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                              │
                              │  HTTP (OpenAI-compatible)
                              ▼
              ┌───────────────────────────────┐
              │  Target LLM Endpoints          │
              │  OpenWebUI, vLLM, Ollama,      │
              │  OpenAI, Anthropic, any        │
              │  OpenAI-compatible API          │
              └───────────────────────────────┘

Services (Production Compose)

| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| promptlooper-db | postgres:16-alpine | 5434 → 5432 | Primary data store |
| promptlooper-redis | redis:7-alpine | - | Celery broker + pub/sub for live dashboard |
| promptlooper-api | Dockerfile | 8000 | FastAPI REST API + MCP server |
| promptlooper-worker | Dockerfile | - | Celery worker (run execution) |
| promptlooper-web | Dockerfile | 8400 → 80 | React frontend (nginx) |

Single Container Mode

When DATABASE_URL is not set, PromptLooper runs with:

  • SQLite at /data/promptlooper.db
  • In-process task queue (no Celery/Redis dependency)
  • All services in one container on port 8400

docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper

Data Model

User

Field Type Notes
id UUID PK
username string Unique, "admin" created on first boot
password_hash string bcrypt
is_admin bool Default true for first user
created_at timestamp

Project

Field Type Notes
id UUID PK
name string
description text Optional
owner_id UUID FK → User
created_at timestamp
updated_at timestamp

Experiment

Field Type Notes
id UUID PK
project_id UUID FK → Project
name string
description text Optional
sample_data JSONB Input documents/queries
pipeline_stages JSONB Stage definitions with prompt templates
scoring_config JSONB Which scoring functions to use and their weights
parameter_space JSONB What to vary and ranges/options
status enum draft, running, paused, completed
created_at timestamp
updated_at timestamp

Run

Field Type Notes
id UUID PK
experiment_id UUID FK → Experiment
config_hash string(64) SHA-256 of full configuration (for cache dedup)
config JSONB Complete configuration snapshot
status enum pending, running, completed, failed, cached
started_at timestamp
completed_at timestamp
duration_ms int Wall clock time
tokens_in int Total input tokens across all stages
tokens_out int Total output tokens
cost_estimate decimal Estimated cost based on model pricing

StageResult

Field Type Notes
id UUID PK
run_id UUID FK → Run
stage_index int 0-based stage number
prompt_sent text Actual prompt after template rendering
response_raw text Raw LLM response
model_used string Model identifier
parameters JSONB Temperature, top_p, etc.
tokens_in int This stage
tokens_out int This stage
latency_ms int This stage

Score

Field Type Notes
id UUID PK
run_id UUID FK → Run
scorer_name string e.g. "embedding_similarity", "human_rating"
value float Normalized 0.0 to 1.0
metadata JSONB Scorer-specific details
created_at timestamp

ResponseCache

Field Type Notes
config_hash string(64) PK — SHA-256 of (prompt + model + params + input)
response text Cached LLM response
model string
tokens_in int
tokens_out int
latency_ms int Original latency
created_at timestamp

WebhookConfig

Field Type Notes
id UUID PK
event_type string One of the webhook event names, e.g. experiment.completed, new_best_found, budget.exhausted, human_needed
url string Target URL
headers JSONB Optional auth headers
is_active bool

API Endpoints

Auth

Method Path Description
POST /api/v1/auth/setup First-boot admin password setup
POST /api/v1/auth/login Login, returns JWT
GET /api/v1/auth/me Current user info

Admin

Method Path Description
GET /api/v1/admin/settings System settings (guest access, default model, etc.)
PUT /api/v1/admin/settings Update settings
GET /api/v1/admin/stats System-wide stats (total runs, cache hit rate, etc.)

Projects

Method Path Description
GET /api/v1/projects List projects
POST /api/v1/projects Create project
GET /api/v1/projects/{id} Project detail with experiment summaries
PUT /api/v1/projects/{id} Update project
DELETE /api/v1/projects/{id} Delete project and all experiments

Experiments

Method Path Description
GET /api/v1/experiments List experiments (filter by project)
POST /api/v1/experiments Create experiment
GET /api/v1/experiments/{id} Experiment detail with run summaries
PUT /api/v1/experiments/{id} Update experiment config
DELETE /api/v1/experiments/{id} Delete experiment
POST /api/v1/experiments/{id}/sweep Start a sweep (grid, random, or guided)
POST /api/v1/experiments/{id}/pause Pause running sweep
POST /api/v1/experiments/{id}/resume Resume paused sweep
POST /api/v1/experiments/{id}/stop Stop sweep

Runs

Method Path Description
GET /api/v1/experiments/{id}/runs List runs with scores (sortable, filterable)
GET /api/v1/runs/{id} Run detail with stage results
POST /api/v1/runs Execute a single run (ad-hoc)
POST /api/v1/runs/{id}/score Add human rating to a run
GET /api/v1/experiments/{id}/leaderboard Top runs ranked by weighted score

Export

Method Path Description
GET /api/v1/experiments/{id}/export/best Best config as JSON
GET /api/v1/experiments/{id}/export/env Best config as .env snippet
GET /api/v1/experiments/{id}/export/yaml Best config as YAML
GET /api/v1/experiments/{id}/export/report Full experiment report (markdown)

LLM Endpoints (Target Management)

Method Path Description
GET /api/v1/endpoints List configured LLM endpoints
POST /api/v1/endpoints Add endpoint (URL, API key, label)
PUT /api/v1/endpoints/{id} Update endpoint
DELETE /api/v1/endpoints/{id} Remove endpoint
POST /api/v1/endpoints/{id}/test Test connectivity and list available models

Webhooks

Method Path Description
GET /api/v1/webhooks List webhook configs
POST /api/v1/webhooks Create webhook
DELETE /api/v1/webhooks/{id} Remove webhook

WebSocket

Path Description
/ws/experiments/{id} Live stream: run progress, scores, stage completions
/ws/dashboard Global activity feed across all experiments

Health

Method Path Description
GET /health Health check (DB + Redis connectivity)

MCP Server

PromptLooper exposes an MCP (Model Context Protocol) server so AI agents can drive it programmatically. The MCP server runs as part of the API service.

MCP Tools

Tool Description
create_project Create a new project workspace
create_experiment Define an experiment with sample data, stages, and scoring
configure_endpoint Add or update an LLM target endpoint
run_single Execute one specific configuration and return results
run_sweep Start a parameter sweep (grid/random/guided)
get_leaderboard Get top N configurations ranked by score
get_run_detail Get full details of a specific run
export_best_config Export the best configuration in JSON/YAML/env format
pause_sweep Pause a running sweep
resume_sweep Resume a paused sweep
add_human_score Rate a run's output
get_experiment_status Check experiment progress
list_models List available models across all configured endpoints

Example Agent Interaction

Agent: "Create a project called 'Chrysopedia Extraction' and an experiment
        that tests the stage3_extraction prompt against Qwen-72B and Qwen-32B,
        sweeping temperature from 0.1 to 0.9 in 0.2 increments.
        Use embedding similarity scoring against these reference outputs.
        Run a grid sweep."

PromptLooper MCP: [create_project] → [create_experiment] → [run_sweep]
                  → streams progress → [get_leaderboard]

Agent: "The top config uses Qwen-72B at temperature 0.3. Export it as
        a .env snippet I can drop into Chrysopedia."

PromptLooper MCP: [export_best_config format=env]

Response Caching

Every LLM call is cached by a SHA-256 hash of:

  • Prompt text (after template rendering)
  • Model identifier
  • All inference parameters (temperature, top_p, max_tokens, etc.)
  • Input data

If an identical configuration has been run before, the cached response is returned instantly with status: cached. This means:

  • Re-running experiments with new scoring functions costs no additional generation tokens (an LLM-as-judge scorer still makes its own calls)
  • A newly added scorer can be applied retroactively to all historical runs
  • Accidentally re-running a sweep wastes nothing
  • Cache can be invalidated per-run or per-experiment if needed
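
The cache key described above could be derived along these lines. This is a sketch under stated assumptions: the spec names the four hashed inputs and SHA-256, but the canonical JSON serialization, the in-memory dictionary, and the `call_llm` callback are illustrative choices, not the documented implementation.

```python
import hashlib
import json


def config_hash(prompt: str, model: str, params: dict, input_data: str) -> str:
    """SHA-256 over a canonical JSON encoding of the four cache inputs."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params, "input": input_data},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache: dict[str, str] = {}  # stand-in for the ResponseCache table


def cached_call(prompt, model, params, input_data, call_llm):
    """Return (response, status); status is 'cached' on a cache hit."""
    key = config_hash(prompt, model, params, input_data)
    if key in cache:
        return cache[key], "cached"
    response = call_llm(prompt, model, params, input_data)
    cache[key] = response
    return response, "completed"


calls = []


def fake_llm(prompt, model, params, input_data):
    """Hypothetical LLM stub that records how often it is invoked."""
    calls.append(prompt)
    return "stub response"


first = cached_call("Summarize:", "qwen-72b", {"temperature": 0.3}, "doc", fake_llm)
second = cached_call("Summarize:", "qwen-72b", {"temperature": 0.3}, "doc", fake_llm)
```

Sorting keys before hashing means `{"temperature": 0.3, "top_p": 0.85}` and `{"top_p": 0.85, "temperature": 0.3}` map to the same cache entry. The hex digest is 64 characters long, which matches the `string(64)` column type in the data model.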

Authentication Model

First Boot

  • App detects no users exist
  • Presents a setup screen: create admin username + password
  • Admin account is created, user is logged in

Guest Access

  • Admin can toggle allow_guest_access in settings
  • Guests can view experiments and results (read-only)
  • Guests cannot create experiments, run sweeps, or modify configs
  • Default: guest access disabled
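
The guest rules above amount to a small permission check. The sketch below is only one way to express them: the role names and the idea of keying the decision on the HTTP method are assumptions, not part of the spec.

```python
# HTTP methods treated as modifying state (assumption for this sketch).
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def is_allowed(role: str, method: str, allow_guest_access: bool = False) -> bool:
    """Decide whether a request may proceed.

    role is a hypothetical label: "admin", "user", or "guest".
    """
    if role in ("admin", "user"):
        return True
    if role == "guest":
        # Guests are read-only, and only when the admin has enabled guest access.
        return allow_guest_access and method.upper() not in WRITE_METHODS
    return False


print(is_allowed("guest", "GET"))                           # guest access off by default
print(is_allowed("guest", "GET", allow_guest_access=True))  # read-only view permitted
```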

API Authentication

  • JWT tokens for the web UI
  • API key (generated in admin settings) for programmatic access and MCP
  • API key passed via Authorization: Bearer <key> header

Real-Time Observability Dashboard

The dashboard is the primary user interface during active experimentation. It provides:

Live Experiment View

  • Progress bar: X of Y runs completed
  • Token usage accumulator (running total)
  • Cost estimate (based on configured model pricing)
  • Cache hit rate for current sweep
  • Estimated time remaining

Side-by-Side Output Comparison

  • Pick any two runs and diff their outputs
  • Highlight differences in prompt, parameters, and response
  • Score comparison overlay

Leaderboard

  • Real-time ranked list of runs by weighted score
  • Sortable by any individual scorer
  • Click to expand full run detail
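
Ranking by weighted score could work roughly as follows. The spec says scorer weights live in `scoring_config` but does not define the formula, so the weighted average used here, the `rejected` flag, and the run dictionary shape are all assumptions.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of scorer values; scorers without a weight count as 0."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights.get(name, 0.0) for name, value in scores.items())
    return weighted_sum / total_weight


def leaderboard(runs: list[dict], weights: dict[str, float], top_n: int = 10) -> list[dict]:
    """Return the top_n runs by weighted score, skipping rejected runs."""
    eligible = [run for run in runs if not run.get("rejected", False)]
    ranked = sorted(
        eligible,
        key=lambda run: weighted_score(run["scores"], weights),
        reverse=True,
    )
    return ranked[:top_n]


runs = [
    {"id": "run-a", "scores": {"embedding_similarity": 0.9, "format_compliance": 0.5}},
    {"id": "run-b", "scores": {"embedding_similarity": 0.7, "format_compliance": 1.0}},
    {"id": "run-c", "scores": {"embedding_similarity": 1.0, "format_compliance": 1.0}, "rejected": True},
]
weights = {"embedding_similarity": 3.0, "format_compliance": 1.0}
top = leaderboard(runs, weights)
```

With these weights run-a scores (0.9 × 3 + 0.5 × 1) / 4 = 0.8 and run-b scores (0.7 × 3 + 1.0 × 1) / 4 = 0.775, so run-a ranks first; run-c is excluded because it was rejected.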

Steering Controls

  • Pause: Stop the sweep after current run completes
  • Fork: Create a new experiment branching from current best, with modified parameters
  • Redirect: Change remaining sweep parameters mid-flight
  • Approve: Mark a configuration as "good enough" and export
  • Reject: Exclude a run from leaderboard consideration

Activity Timeline

  • Chronological feed of events: run started, run completed, new best found, cache hit, error
  • Filterable by event type

Webhook Events

| Event | Payload | Trigger |
|-------|---------|---------|
| experiment.started | experiment_id, sweep config | Sweep begins |
| experiment.completed | experiment_id, best config, summary stats | All runs finished |
| experiment.paused | experiment_id, reason | Manual or budget pause |
| new_best_found | experiment_id, run_id, scores, config | New top-scoring run |
| budget.exhausted | experiment_id, token_count, cost | Token/cost budget hit |
| human_needed | experiment_id, reason, context | Agent requests human review |
| run.failed | run_id, error | Individual run error |
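
A webhook sender might validate and serialize payloads along these lines. The event names and their fields come from the table above; the snake_case spellings (`sweep_config`, `best_config`, `summary_stats`), the JSON envelope with `event`, `sent_at`, and `data` keys, and the function name are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone

# Required payload fields per event, following the Webhook Events table.
EVENT_FIELDS = {
    "experiment.started": ("experiment_id", "sweep_config"),
    "experiment.completed": ("experiment_id", "best_config", "summary_stats"),
    "experiment.paused": ("experiment_id", "reason"),
    "new_best_found": ("experiment_id", "run_id", "scores", "config"),
    "budget.exhausted": ("experiment_id", "token_count", "cost"),
    "human_needed": ("experiment_id", "reason", "context"),
    "run.failed": ("run_id", "error"),
}


def build_payload(event: str, **fields) -> str:
    """Check that the event is known and complete, then serialize it to JSON."""
    if event not in EVENT_FIELDS:
        raise ValueError(f"unknown event: {event}")
    missing = [name for name in EVENT_FIELDS[event] if name not in fields]
    if missing:
        raise ValueError(f"missing fields for {event}: {missing}")
    envelope = {
        "event": event,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "data": fields,
    }
    return json.dumps(envelope)


body = build_payload("run.failed", run_id="run-123", error="timeout")
```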

Configuration Export Formats

JSON

{
  "model": "qwen2.5-72b-instruct",
  "endpoint": "http://chat.forgetyour.name/api",
  "temperature": 0.3,
  "top_p": 0.85,
  "max_tokens": 2048,
  "system_prompt": "You are a music production knowledge extractor...",
  "score": 0.87,
  "experiment": "chrysopedia-extraction-v2",
  "exported_at": "2026-04-06T12:00:00Z"
}

.env

LLM_MODEL=qwen2.5-72b-instruct
LLM_API_URL=http://chat.forgetyour.name/api
LLM_TEMPERATURE=0.3
LLM_TOP_P=0.85
LLM_MAX_TOKENS=2048
# Score: 0.87 | Experiment: chrysopedia-extraction-v2
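
Producing the .env form from the JSON form is a straightforward mapping. The sketch below reproduces the example above; the function name is hypothetical, and the variable names are taken from the sample output rather than from any documented exporter.

```python
def to_env(config: dict) -> str:
    """Render an exported best-config dictionary as a .env snippet."""
    lines = [
        f"LLM_MODEL={config['model']}",
        f"LLM_API_URL={config['endpoint']}",
        f"LLM_TEMPERATURE={config['temperature']}",
        f"LLM_TOP_P={config['top_p']}",
        f"LLM_MAX_TOKENS={config['max_tokens']}",
        # Score and experiment name travel as a comment, as in the sample.
        f"# Score: {config['score']} | Experiment: {config['experiment']}",
    ]
    return "\n".join(lines)


best = {
    "model": "qwen2.5-72b-instruct",
    "endpoint": "http://chat.forgetyour.name/api",
    "temperature": 0.3,
    "top_p": 0.85,
    "max_tokens": 2048,
    "score": 0.87,
    "experiment": "chrysopedia-extraction-v2",
}
snippet = to_env(best)
```

Note that the sample .env output omits the system prompt; multi-line prompts do not fit the KEY=value format well, so this sketch leaves it out too.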

YAML

model: qwen2.5-72b-instruct
endpoint: http://chat.forgetyour.name/api
parameters:
  temperature: 0.3
  top_p: 0.85
  max_tokens: 2048
system_prompt: |
  You are a music production knowledge extractor...
metadata:
  score: 0.87
  experiment: chrysopedia-extraction-v2
  exported_at: 2026-04-06T12:00:00Z

Environment Variables

| Group | Variable | Default | Notes |
|-------|----------|---------|-------|
| Database | DATABASE_URL | (none → SQLite) | PostgreSQL connection string |
| Redis | REDIS_URL | (none → in-process) | Redis connection string |
| Server | HOST | 0.0.0.0 | Bind address |
| Server | PORT | 8400 | HTTP port |
| Auth | JWT_SECRET | (auto-generated) | JWT signing key |
| Auth | API_KEY | (none) | Static API key for programmatic access |
| Defaults | DEFAULT_ENDPOINT_URL | (none) | Pre-configured LLM endpoint |
| Defaults | DEFAULT_ENDPOINT_KEY | (none) | API key for default endpoint |
| Limits | MAX_CONCURRENT_RUNS | 4 | Parallel run limit |
| Limits | MAX_TOKENS_PER_SWEEP | 0 (unlimited) | Token budget per sweep |
| Storage | DATA_DIR | /data | SQLite DB + file storage location |
| MCP | MCP_ENABLED | true | Enable MCP server |
| MCP | MCP_PORT | 8401 | MCP server port |

Docker Compose (Production — XPLTD Conventions)

  • Project name: xpltd_promptlooper
  • Network: promptlooper (172.33.0.0/24)
  • Persistent data: /vmPool/r/services/promptlooper_*
  • PostgreSQL port: 5434 (external)
  • Web UI port: 8400 (external)


Technology Stack

| Layer | Technology | Rationale |
|-------|------------|-----------|
| API | Python 3.12 + FastAPI | Async, OpenAPI auto-gen, matches XPLTD conventions |
| Task Queue | Celery + Redis | Proven for background job execution, matches Chrysopedia |
| Database | PostgreSQL 16 (prod) / SQLite (single-container) | JSONB for flexible experiment configs |
| Real-time | WebSocket via FastAPI + Redis pub/sub | Sub-second dashboard updates |
| Frontend | React 18 + TypeScript + Vite | Real-time dashboard, matches Chrysopedia |
| Styling | Tailwind CSS | Fast iteration, utility-first |
| MCP | Python MCP SDK | Standard protocol for agent integration |
| Container | Multi-stage Docker build | Single image serves both API and frontend |

Development & Deployment

Local Development

git clone git@git.xpltd.co:xpltdco/promptlooper.git
cd promptlooper
cp .env.example .env
docker compose up -d promptlooper-db promptlooper-redis
cd backend && pip install -r requirements.txt
alembic upgrade head
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# In another terminal:
cd frontend && npm install && npm run dev

Production Deployment (ub01)

ssh ub01
cd /vmPool/r/repos/xpltdco/promptlooper
git pull && docker compose build && docker compose up -d

Project Structure

promptlooper/
├── backend/
│   ├── main.py                 # FastAPI entry point
│   ├── config.py               # Pydantic Settings
│   ├── models.py               # SQLAlchemy ORM
│   ├── schemas.py              # Pydantic request/response
│   ├── auth.py                 # JWT + API key auth
│   ├── worker.py               # Celery app config
│   ├── routers/
│   │   ├── auth.py
│   │   ├── projects.py
│   │   ├── experiments.py
│   │   ├── runs.py
│   │   ├── endpoints.py
│   │   ├── export.py
│   │   ├── webhooks.py
│   │   └── admin.py
│   ├── engine/
│   │   ├── runner.py           # Run execution logic
│   │   ├── sweep.py            # Sweep orchestration
│   │   ├── cache.py            # Response cache layer
│   │   ├── adapters/           # LLM endpoint adapters
│   │   │   ├── openai_compat.py
│   │   │   └── base.py
│   │   └── scorers/            # Pluggable scoring functions
│   │       ├── embedding.py
│   │       ├── format.py
│   │       ├── keyword.py
│   │       ├── llm_judge.py
│   │       └── base.py
│   ├── mcp/
│   │   ├── server.py           # MCP server implementation
│   │   └── tools.py            # MCP tool definitions
│   ├── websocket/
│   │   └── manager.py          # WebSocket connection management
│   └── tests/
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── Setup.tsx       # First-boot admin setup
│       │   ├── Login.tsx
│       │   ├── Dashboard.tsx   # Global activity
│       │   ├── Projects.tsx
│       │   ├── Experiment.tsx  # Experiment builder + config
│       │   ├── Live.tsx        # Real-time observability
│       │   ├── Compare.tsx     # Side-by-side run comparison
│       │   └── Admin.tsx       # System settings
│       ├── components/
│       │   ├── Leaderboard.tsx
│       │   ├── SteeringControls.tsx
│       │   ├── RunCard.tsx
│       │   ├── ScoreChart.tsx
│       │   └── Timeline.tsx
│       └── api/
├── docker/
│   ├── Dockerfile              # Multi-stage: API + frontend
│   └── nginx.conf
├── alembic/
├── docker-compose.yml
├── .env.example
├── CLAUDE.md
└── README.md