PromptLooper
The one who loops prompts — a universal LLM pipeline tuning workbench.
PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt × model × parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering.
It ships as a single Docker container (SQLite mode) for zero-config quickstart, or a Docker Compose stack (Postgres + Redis) for production use. An MCP server enables any AI agent to drive PromptLooper programmatically — creating experiments, running sweeps, and reading results without human intervention.
Problem Statement
Anyone building LLM-powered applications faces the same painful loop:
- Write a system prompt
- Pick a model and parameters (temperature, top_p, max_tokens, etc.)
- Run it against sample data
- Read the output and decide if it's "good enough"
- Tweak something and repeat
This process is manual, unscientific, and wasteful. There's no way to:
- Systematically compare configurations side-by-side
- Know if you've already tested a particular combination
- Quantify "better" beyond gut feeling
- Let an agent handle the iteration while you steer from above
- Share optimized configurations between projects or team members
PromptLooper makes this process systematic, observable, cached, and agent-drivable.
Target Users
| User | Use Case |
| --- | --- |
| Solo developer | Tuning prompts for a side project, wants to try 5 models and find the sweet spot |
| Team building RAG pipelines | Optimizing chunking + embedding + retrieval + synthesis prompts across stages |
| AI agent (via MCP) | Autonomously running optimization sweeps, reporting back to human when done |
| Prompt engineer | A/B testing prompt variants at scale with quantified scoring |
| Infrastructure team | Benchmarking new models against existing baselines before migration |
Core Concepts
Experiment
A named configuration that defines:
- Sample data: Input documents, queries, or any text the pipeline will process
- Pipeline stages: 1-N sequential stages, each with its own prompt template and model config
- Evaluation criteria: Scoring functions that grade the output
- Parameter space: What to vary (prompt text, model, temperature, top_p, chunk_size, etc.)
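As a rough illustration of how sequential stages might chain, the sketch below renders each stage's prompt template with the previous stage's output. The `Stage` class, the `$input` placeholder syntax, and the `call_llm` callable are assumptions made for this example, not PromptLooper's actual interfaces.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Callable


@dataclass
class Stage:
    """One pipeline stage (hypothetical shape, for illustration only)."""
    prompt_template: str
    model: str
    params: dict = field(default_factory=dict)


def run_pipeline(stages: list[Stage], sample: str,
                 call_llm: Callable[[str, str, dict], str]) -> list[str]:
    """Run stages in order, feeding each output into the next stage's template."""
    outputs = []
    current = sample
    for stage in stages:
        # Render the template; "$input" is an assumed placeholder name.
        prompt = Template(stage.prompt_template).substitute(input=current)
        current = call_llm(prompt, stage.model, stage.params)
        outputs.append(current)
    return outputs


def fake_llm(prompt: str, model: str, params: dict) -> str:
    """Stand-in for a real endpoint call so the sketch runs offline."""
    return f"[{model}] {prompt}"


stages = [
    Stage("Summarize: $input", "model-a", {"temperature": 0.3}),
    Stage("Extract keywords from: $input", "model-b"),
]
results = run_pipeline(stages, "raw document text", fake_llm)
```

Each entry in `results` corresponds to one stage, which mirrors how a run records a separate result per stage.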
Run
A single execution of one specific configuration within an experiment. A run captures:
- Full input configuration (prompt, model, all parameters)
- Raw LLM response(s)
- Timing data (latency, tokens in/out)
- Evaluation scores
- Configuration hash (for cache deduplication)
Sweep
A batch of runs that systematically explores a parameter space. Types:
- Grid sweep: Every combination of specified parameter values
- Random sweep: Random sampling from parameter ranges
- Guided sweep: Agent-driven, where results from previous runs inform the next configuration to try
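The grid and random sweep types can be sketched with the standard library. This is a minimal illustration, assuming the parameter space is a mapping from parameter name to a list of candidate values; the function names and the model identifiers are hypothetical, and the random variant here samples from discrete options rather than continuous ranges.

```python
import itertools
import random


def grid_sweep(space: dict[str, list]) -> list[dict]:
    """Return every combination of the listed parameter values."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_sweep(space: dict[str, list], n: int, seed: int = 0) -> list[dict]:
    """Return n configurations, each value drawn independently at random."""
    rng = random.Random(seed)  # seeded so a sweep can be reproduced
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


space = {
    "model": ["qwen-72b", "qwen-32b"],            # illustrative identifiers
    "temperature": [0.1, 0.3, 0.5, 0.7, 0.9],
}
grid = grid_sweep(space)             # 2 models x 5 temperatures = 10 configs
sampled = random_sweep(space, n=3)   # 3 randomly drawn configs
```

A grid grows multiplicatively with each added parameter, which is why a random or guided sweep tends to be the practical choice for larger spaces.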
Scoring Function
A pluggable evaluation that takes (input, output, context) and returns a numeric score. Built-in options:
- Embedding similarity: How semantically close is the output to a reference answer?
- Length compliance: Does the output meet length constraints?
- Format compliance: Does the output match expected structure (JSON, markdown, etc.)?
- Keyword presence: Do required terms appear in the output?
- Human rating: Manual thumbs-up/down or 1-5 star rating from the dashboard
- LLM-as-judge: Use a separate LLM call to evaluate quality (configurable judge prompt)
- Custom function: User-provided Python snippet or HTTP webhook
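To make the `(input, output, context)` contract concrete, here is a minimal sketch of what a pluggable scorer could look like, using keyword presence as the example. The `Scorer` protocol, the method name `score`, and the class layout are assumptions; the actual base class in the codebase may differ.

```python
from typing import Protocol


class Scorer(Protocol):
    """Assumed interface: take (input, output, context), return a score in 0.0-1.0."""
    name: str

    def score(self, input_text: str, output: str, context: dict) -> float: ...


class KeywordPresence:
    """Fraction of required keywords found in the output (case-insensitive)."""
    name = "keyword_presence"

    def __init__(self, required: list[str]):
        self.required = [k.lower() for k in required]

    def score(self, input_text: str, output: str, context: dict) -> float:
        if not self.required:
            return 1.0  # nothing required, so trivially satisfied
        text = output.lower()
        hits = sum(1 for keyword in self.required if keyword in text)
        return hits / len(self.required)


scorer = KeywordPresence(["reverb", "delay"])
value = scorer.score("", "Add Reverb to the vocal bus", {})
```

Returning a value already normalized to the 0.0 to 1.0 range keeps scorers comparable when they are later combined with weights.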
Project
A workspace that groups related experiments. Users can return to a project and pick up where they left off. Projects store:
- All experiments and their runs
- Saved "best" configurations
- Notes and annotations
- Export history
Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ Docker Compose: xpltd_promptlooper (ub01) │
│ Network: promptlooper (172.33.0.0/24) │
│ │
│ ┌────────────┐ ┌─────────────┐ ┌──────────────────────────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ FastAPI (API) │ │
│ │ :5434 │ │ job queue │ │ Experiments, Runs, Scoring, │ │
│ │ experiments│ │ pub/sub │ │ Projects, Auth, MCP Server │ │
│ │ runs, cache│ │ live state │ │ WebSocket for live dashboard │ │
│ └─────┬───────┘ └──────┬──────┘ └──────────────┬───────────────────┘ │
│ │ │ │ │
│ ┌─────┴─────────────────┴────────────────────────┴───────────────────┐ │
│ │ Celery Worker │ │
│ │ Executes runs against target LLM endpoints │ │
│ │ Caches responses by config hash │ │
│ │ Streams progress via Redis pub/sub │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Web UI (React + Vite) │ │
│ │ nginx → :8400 │ │
│ │ Dashboard, Experiment Builder, Live Observability, Steering │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
│
│ HTTP (OpenAI-compatible)
▼
┌───────────────────────────────┐
│ Target LLM Endpoints │
│ OpenWebUI, vLLM, Ollama, │
│ OpenAI, Anthropic, any │
│ OpenAI-compatible API │
└───────────────────────────────┘
Services (Production Compose)
| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| promptlooper-db | postgres:16-alpine | 5434 → 5432 | Primary data store |
| promptlooper-redis | redis:7-alpine | - | Celery broker + pub/sub for live dashboard |
| promptlooper-api | Dockerfile | 8000 | FastAPI REST API + MCP server |
| promptlooper-worker | Dockerfile | - | Celery worker (run execution) |
| promptlooper-web | Dockerfile | 8400 → 80 | React frontend (nginx) |
Single Container Mode
When DATABASE_URL is not set, PromptLooper runs with:
- SQLite at /data/promptlooper.db
- In-process task queue (no Celery/Redis dependency)
- All services in one container on port 8400
docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper
Data Model
User
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| username | string | Unique, "admin" created on first boot |
| password_hash | string | bcrypt |
| is_admin | bool | Default true for first user |
| created_at | timestamp | |
Project
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| name | string | |
| description | text | Optional |
| owner_id | UUID | FK → User |
| created_at | timestamp | |
| updated_at | timestamp | |
Experiment
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| project_id | UUID | FK → Project |
| name | string | |
| description | text | Optional |
| sample_data | JSONB | Input documents/queries |
| pipeline_stages | JSONB | Stage definitions with prompt templates |
| scoring_config | JSONB | Which scoring functions to use and their weights |
| parameter_space | JSONB | What to vary and ranges/options |
| status | enum | draft, running, paused, completed |
| created_at | timestamp | |
| updated_at | timestamp | |
Run
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| experiment_id | UUID | FK → Experiment |
| config_hash | string(64) | SHA-256 of full configuration (for cache dedup) |
| config | JSONB | Complete configuration snapshot |
| status | enum | pending, running, completed, failed, cached |
| started_at | timestamp | |
| completed_at | timestamp | |
| duration_ms | int | Wall clock time |
| tokens_in | int | Total input tokens across all stages |
| tokens_out | int | Total output tokens across all stages |
| cost_estimate | decimal | Estimated cost based on model pricing |
StageResult
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| run_id | UUID | FK → Run |
| stage_index | int | 0-based stage number |
| prompt_sent | text | Actual prompt after template rendering |
| response_raw | text | Raw LLM response |
| model_used | string | Model identifier |
| parameters | JSONB | Temperature, top_p, etc. |
| tokens_in | int | This stage |
| tokens_out | int | This stage |
| latency_ms | int | This stage |
Score
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| run_id | UUID | FK → Run |
| scorer_name | string | e.g. "embedding_similarity", "human_rating" |
| value | float | Normalized 0.0–1.0 |
| metadata | JSONB | Scorer-specific details |
| created_at | timestamp | |
ResponseCache
| Field | Type | Notes |
| --- | --- | --- |
| config_hash | string(64) | PK; SHA-256 of (prompt + model + params + input) |
| response | text | Cached LLM response |
| model | string | |
| tokens_in | int | |
| tokens_out | int | |
| latency_ms | int | Original latency |
| created_at | timestamp | |
WebhookConfig
| Field | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| event_type | string | e.g. experiment.completed, new_best_found, budget.exhausted, human_needed |
| url | string | Target URL |
| headers | JSONB | Optional auth headers |
| is_active | bool | |
API Endpoints
Auth
| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/auth/setup | First-boot admin password setup |
| POST | /api/v1/auth/login | Login, returns JWT |
| GET | /api/v1/auth/me | Current user info |
Admin
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/admin/settings | System settings (guest access, default model, etc.) |
| PUT | /api/v1/admin/settings | Update settings |
| GET | /api/v1/admin/stats | System-wide stats (total runs, cache hit rate, etc.) |
Projects
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/projects | List projects |
| POST | /api/v1/projects | Create project |
| GET | /api/v1/projects/{id} | Project detail with experiment summaries |
| PUT | /api/v1/projects/{id} | Update project |
| DELETE | /api/v1/projects/{id} | Delete project and all experiments |
Experiments
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/experiments | List experiments (filter by project) |
| POST | /api/v1/experiments | Create experiment |
| GET | /api/v1/experiments/{id} | Experiment detail with run summaries |
| PUT | /api/v1/experiments/{id} | Update experiment config |
| DELETE | /api/v1/experiments/{id} | Delete experiment |
| POST | /api/v1/experiments/{id}/sweep | Start a sweep (grid, random, or guided) |
| POST | /api/v1/experiments/{id}/pause | Pause running sweep |
| POST | /api/v1/experiments/{id}/resume | Resume paused sweep |
| POST | /api/v1/experiments/{id}/stop | Stop sweep |
Runs
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/experiments/{id}/runs | List runs with scores (sortable, filterable) |
| GET | /api/v1/runs/{id} | Run detail with stage results |
| POST | /api/v1/runs | Execute a single run (ad-hoc) |
| POST | /api/v1/runs/{id}/score | Add human rating to a run |
| GET | /api/v1/experiments/{id}/leaderboard | Top runs ranked by weighted score |
Export
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/experiments/{id}/export/best | Best config as JSON |
| GET | /api/v1/experiments/{id}/export/env | Best config as .env snippet |
| GET | /api/v1/experiments/{id}/export/yaml | Best config as YAML |
| GET | /api/v1/experiments/{id}/export/report | Full experiment report (markdown) |
LLM Endpoints (Target Management)
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/endpoints | List configured LLM endpoints |
| POST | /api/v1/endpoints | Add endpoint (URL, API key, label) |
| PUT | /api/v1/endpoints/{id} | Update endpoint |
| DELETE | /api/v1/endpoints/{id} | Remove endpoint |
| POST | /api/v1/endpoints/{id}/test | Test connectivity and list available models |
Webhooks
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/webhooks | List webhook configs |
| POST | /api/v1/webhooks | Create webhook |
| DELETE | /api/v1/webhooks/{id} | Remove webhook |
WebSocket
| Path | Description |
| --- | --- |
| /ws/experiments/{id} | Live stream: run progress, scores, stage completions |
| /ws/dashboard | Global activity feed across all experiments |
Health
| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check (DB + Redis connectivity) |
MCP Server
PromptLooper exposes an MCP (Model Context Protocol) server so AI agents can drive it programmatically. The MCP server runs as part of the API service.
MCP Tools
| Tool | Description |
| --- | --- |
| create_project | Create a new project workspace |
| create_experiment | Define an experiment with sample data, stages, and scoring |
| configure_endpoint | Add or update an LLM target endpoint |
| run_single | Execute one specific configuration and return results |
| run_sweep | Start a parameter sweep (grid/random/guided) |
| get_leaderboard | Get top N configurations ranked by score |
| get_run_detail | Get full details of a specific run |
| export_best_config | Export the best configuration in JSON/YAML/env format |
| pause_sweep | Pause a running sweep |
| resume_sweep | Resume a paused sweep |
| add_human_score | Rate a run's output |
| get_experiment_status | Check experiment progress |
| list_models | List available models across all configured endpoints |
Example Agent Interaction
Agent: "Create a project called 'Chrysopedia Extraction' and an experiment
that tests the stage3_extraction prompt against Qwen-72B and Qwen-32B,
sweeping temperature from 0.1 to 0.9 in 0.2 increments.
Use embedding similarity scoring against these reference outputs.
Run a grid sweep."
PromptLooper MCP: [create_project] → [create_experiment] → [run_sweep]
→ streams progress → [get_leaderboard]
Agent: "The top config uses Qwen-72B at temperature 0.3. Export it as
a .env snippet I can drop into Chrysopedia."
PromptLooper MCP: [export_best_config format=env]
Response Caching
Every LLM call is cached by a SHA-256 hash of:
- Prompt text (after template rendering)
- Model identifier
- All inference parameters (temperature, top_p, max_tokens, etc.)
- Input data
If an identical configuration has been run before, the cached response is returned instantly with status: cached. This means:
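- Re-running an experiment with new scoring functions spends no additional tokens on the pipeline's own LLM calls (an LLM-as-judge scorer still makes its separate judge calls)
- A newly added scorer can be applied retroactively to all historical runs, because their responses are already stored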
- Accidentally re-running a sweep wastes nothing
- Cache can be invalidated per-run or per-experiment if needed
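The cache key described above can be sketched as follows. The canonical JSON serialization (sorted keys, compact separators) is an assumption made so that logically identical configurations hash identically; the spec only states that the key is a SHA-256 over those four inputs. `InMemoryCache` is a hypothetical stand-in for the ResponseCache table.

```python
import hashlib
import json


def config_hash(prompt: str, model: str, params: dict, input_data: str) -> str:
    """SHA-256 hex digest over a canonical JSON encoding of the call inputs."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params, "input": input_data},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryCache:
    """Hypothetical in-memory stand-in for the persistent response cache."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, key: str, call):
        """Return (response, status); only invoke `call` on a cache miss."""
        if key in self._store:
            return self._store[key], "cached"
        response = call()
        self._store[key] = response
        return response, "completed"


key_a = config_hash("Summarize: doc", "model-a", {"temperature": 0.3, "top_p": 0.85}, "doc")
key_b = config_hash("Summarize: doc", "model-a", {"top_p": 0.85, "temperature": 0.3}, "doc")

cache = InMemoryCache()
first = cache.get_or_call(key_a, lambda: "response text")
second = cache.get_or_call(key_b, lambda: "never called")
```

Because the digest is 64 hex characters, it fits the `string(64)` column used for `config_hash`.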
Authentication Model
First Boot
- App detects no users exist
- Presents a setup screen: create admin username + password
- Admin account is created, user is logged in
Guest Access
- Admin can toggle allow_guest_access in settings
- Guests can view experiments and results (read-only)
- Guests cannot create experiments, run sweeps, or modify configs
- Default: guest access disabled
API Authentication
- JWT tokens for the web UI
- API key (generated in admin settings) for programmatic access and MCP
- API key passed via Authorization: Bearer <key> header
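For programmatic access, a request carrying the API key might be built like this with the standard library. The base URL `http://localhost:8400` and the key value are placeholders; the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str, api_key: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a request with the API key in the Authorization header."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method="POST" if body is not None else "GET",
    )
    request.add_header("Authorization", f"Bearer {api_key}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# Placeholder host and key; urllib.request.urlopen(req) would actually send it.
req = build_request("http://localhost:8400", "/api/v1/projects", "example-key")
```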
Real-Time Observability Dashboard
The dashboard is the primary user interface during active experimentation. It provides:
Live Experiment View
- Progress bar: X of Y runs completed
- Token usage accumulator (running total)
- Cost estimate (based on configured model pricing)
- Cache hit rate for current sweep
- Estimated time remaining
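Two of these live metrics are simple to compute, as the sketch below suggests. The pricing table is entirely hypothetical (prices in USD per million tokens for a made-up model), and treating only `completed` and `cached` runs as the denominator for cache hit rate is an assumption.

```python
from decimal import Decimal

# Hypothetical pricing: (input price, output price) in USD per 1M tokens.
PRICING = {"example-model": (Decimal("0.50"), Decimal("1.50"))}


def estimate_cost(model: str, tokens_in: int, tokens_out: int) -> Decimal:
    """Estimate cost from token counts and the configured per-model pricing."""
    price_in, price_out = PRICING[model]
    million = Decimal(1_000_000)
    return (Decimal(tokens_in) * price_in + Decimal(tokens_out) * price_out) / million


def cache_hit_rate(statuses: list[str]) -> float:
    """Share of finished runs that were served from cache."""
    finished = [s for s in statuses if s in ("completed", "cached")]
    if not finished:
        return 0.0
    return sum(1 for s in finished if s == "cached") / len(finished)


cost = estimate_cost("example-model", tokens_in=2_000_000, tokens_out=1_000_000)
rate = cache_hit_rate(["completed", "cached", "cached", "failed", "pending"])
```

`Decimal` is used for cost to match the `decimal` type of the `cost_estimate` field and to avoid float rounding in running totals.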
Side-by-Side Output Comparison
- Pick any two runs and diff their outputs
- Highlight differences in prompt, parameters, and response
- Score comparison overlay
Leaderboard
- Real-time ranked list of runs by weighted score
- Sortable by any individual scorer
- Click to expand full run detail
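One plausible way to rank runs by weighted score is a weighted mean over the configured scorers, sketched below. The spec does not define the exact formula, so the normalization by total weight and the choice to treat a missing scorer as 0.0 are assumptions, as is the shape of the run dictionaries.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of scorer values; scorers absent from a run count as 0.0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return weighted_sum / total_weight


def leaderboard(runs: list[dict], weights: dict[str, float], top_n: int = 10) -> list[dict]:
    """Return the top_n runs ordered by descending weighted score."""
    ranked = sorted(runs, key=lambda r: weighted_score(r["scores"], weights), reverse=True)
    return ranked[:top_n]


weights = {"embedding_similarity": 0.75, "format_compliance": 0.25}
runs = [
    {"id": "a", "scores": {"embedding_similarity": 0.9, "format_compliance": 0.5}},
    {"id": "b", "scores": {"embedding_similarity": 0.7, "format_compliance": 1.0}},
]
top = leaderboard(runs, weights, top_n=1)
```

Since every scorer value is normalized to 0.0 to 1.0, the weighted mean stays in the same range regardless of how the weights are scaled.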
Steering Controls
- Pause: Stop the sweep after current run completes
- Fork: Create a new experiment branching from current best, with modified parameters
- Redirect: Change remaining sweep parameters mid-flight
- Approve: Mark a configuration as "good enough" and export
- Reject: Exclude a run from leaderboard consideration
Activity Timeline
- Chronological feed of events: run started, run completed, new best found, cache hit, error
- Filterable by event type
Webhook Events
| Event | Payload | Trigger |
| --- | --- | --- |
| experiment.started | experiment_id, sweep config | Sweep begins |
| experiment.completed | experiment_id, best config, summary stats | All runs finished |
| experiment.paused | experiment_id, reason | Manual or budget pause |
| new_best_found | experiment_id, run_id, scores, config | New top-scoring run |
| budget.exhausted | experiment_id, token_count, cost | Token/cost budget hit |
| human_needed | experiment_id, reason, context | Agent requests human review |
| run.failed | run_id, error | Individual run error |
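On the receiving side, a consumer could dispatch on the event name. The sketch below assumes an envelope of the form `{"event": ..., "data": {...}}`; that envelope is hypothetical, and only the event names and payload fields come from the table above.

```python
import json
from typing import Callable, Optional

HANDLERS: dict[str, Callable[[dict], str]] = {}


def on(event_type: str):
    """Decorator that registers a handler for one event type."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[event_type] = fn
        return fn
    return register


@on("new_best_found")
def handle_new_best(data: dict) -> str:
    return f"new best run {data['run_id']} in experiment {data['experiment_id']}"


@on("run.failed")
def handle_run_failed(data: dict) -> str:
    return f"run {data['run_id']} failed: {data['error']}"


def dispatch(raw_body: bytes) -> Optional[str]:
    """Parse a webhook body and route it; unknown events are ignored."""
    message = json.loads(raw_body)
    handler = HANDLERS.get(message.get("event"))
    return handler(message.get("data", {})) if handler else None


result = dispatch(b'{"event": "run.failed", "data": {"run_id": "r1", "error": "timeout"}}')
```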
Configuration Export Formats
JSON
{
  "model": "qwen2.5-72b-instruct",
  "endpoint": "http://chat.forgetyour.name/api",
  "temperature": 0.3,
  "top_p": 0.85,
  "max_tokens": 2048,
  "system_prompt": "You are a music production knowledge extractor...",
  "score": 0.87,
  "experiment": "chrysopedia-extraction-v2",
  "exported_at": "2026-04-06T12:00:00Z"
}
.env
LLM_MODEL=qwen2.5-72b-instruct
LLM_API_URL=http://chat.forgetyour.name/api
LLM_TEMPERATURE=0.3
LLM_TOP_P=0.85
LLM_MAX_TOKENS=2048
# Score: 0.87 | Experiment: chrysopedia-extraction-v2
YAML
model: qwen2.5-72b-instruct
endpoint: http://chat.forgetyour.name/api
parameters:
  temperature: 0.3
  top_p: 0.85
  max_tokens: 2048
system_prompt: |
  You are a music production knowledge extractor...
metadata:
  score: 0.87
  experiment: chrysopedia-extraction-v2
  exported_at: 2026-04-06T12:00:00Z
Environment Variables
| Group | Variable | Default | Notes |
| --- | --- | --- | --- |
| Database | DATABASE_URL | (none → SQLite) | PostgreSQL connection string |
| Redis | REDIS_URL | (none → in-process) | Redis connection string |
| Server | HOST | 0.0.0.0 | Bind address |
| Server | PORT | 8400 | HTTP port |
| Auth | JWT_SECRET | (auto-generated) | JWT signing key |
| Auth | API_KEY | (none) | Static API key for programmatic access |
| Defaults | DEFAULT_ENDPOINT_URL | (none) | Pre-configured LLM endpoint |
| Defaults | DEFAULT_ENDPOINT_KEY | (none) | API key for default endpoint |
| Limits | MAX_CONCURRENT_RUNS | 4 | Parallel run limit |
| Limits | MAX_TOKENS_PER_SWEEP | 0 (unlimited) | Token budget per sweep |
| Storage | DATA_DIR | /data | SQLite DB + file storage location |
| MCP | MCP_ENABLED | true | Enable MCP server |
| MCP | MCP_PORT | 8401 | MCP server port |
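The fallback behaviour in this table (no `DATABASE_URL` means SQLite, no `REDIS_URL` means an in-process queue) could be resolved roughly as shown below. This is a stdlib-only sketch; the `Settings` dataclass and `load_settings` function are hypothetical, and the SQLAlchemy-style SQLite URL is an assumption about how the path is consumed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: Optional[str]       # None means the in-process task queue
    port: int
    max_concurrent_runs: int
    max_tokens_per_sweep: int      # 0 means unlimited
    single_container: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve settings from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    data_dir = env.get("DATA_DIR", "/data")
    database_url = env.get("DATABASE_URL") or None
    single_container = database_url is None
    if single_container:
        # Fall back to SQLite under DATA_DIR (assumed SQLAlchemy URL form).
        database_url = f"sqlite:///{data_dir}/promptlooper.db"
    return Settings(
        database_url=database_url,
        redis_url=env.get("REDIS_URL") or None,
        port=int(env.get("PORT", "8400")),
        max_concurrent_runs=int(env.get("MAX_CONCURRENT_RUNS", "4")),
        max_tokens_per_sweep=int(env.get("MAX_TOKENS_PER_SWEEP", "0")),
        single_container=single_container,
    )


defaults = load_settings({})
production = load_settings({
    "DATABASE_URL": "postgresql://user:pass@promptlooper-db:5432/promptlooper",
    "REDIS_URL": "redis://promptlooper-redis:6379/0",
})
```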
Docker Compose (Production — XPLTD Conventions)
Project name: xpltd_promptlooper
Network: promptlooper (172.33.0.0/24)
Persistent data: /vmPool/r/services/promptlooper_*
PostgreSQL port: 5434 (external)
Web UI port: 8400 (external)
Technology Stack
| Layer | Technology | Rationale |
| --- | --- | --- |
| API | Python 3.12 + FastAPI | Async, OpenAPI auto-gen, matches XPLTD conventions |
| Task Queue | Celery + Redis | Proven for background job execution, matches Chrysopedia |
| Database | PostgreSQL 16 (prod) / SQLite (single-container) | JSONB for flexible experiment configs |
| Real-time | WebSocket via FastAPI + Redis pub/sub | Sub-second dashboard updates |
| Frontend | React 18 + TypeScript + Vite | Real-time dashboard, matches Chrysopedia |
| Styling | Tailwind CSS | Fast iteration, utility-first |
| MCP | Python MCP SDK | Standard protocol for agent integration |
| Container | Multi-stage Docker build | Single image serves both API and frontend |
Development & Deployment
Local Development
git clone git@git.xpltd.co:xpltdco/promptlooper.git
cd promptlooper
cp .env.example .env
docker compose up -d promptlooper-db promptlooper-redis
cd backend && pip install -r requirements.txt
alembic upgrade head
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# In another terminal:
cd frontend && npm install && npm run dev
Production Deployment (ub01)
ssh ub01
cd /vmPool/r/repos/xpltdco/promptlooper
git pull && docker compose build && docker compose up -d
Project Structure
promptlooper/
├── backend/
│   ├── main.py                  # FastAPI entry point
│   ├── config.py                # Pydantic Settings
│   ├── models.py                # SQLAlchemy ORM
│   ├── schemas.py               # Pydantic request/response
│   ├── auth.py                  # JWT + API key auth
│   ├── worker.py                # Celery app config
│   ├── routers/
│   │   ├── auth.py
│   │   ├── projects.py
│   │   ├── experiments.py
│   │   ├── runs.py
│   │   ├── endpoints.py
│   │   ├── export.py
│   │   ├── webhooks.py
│   │   └── admin.py
│   ├── engine/
│   │   ├── runner.py            # Run execution logic
│   │   ├── sweep.py             # Sweep orchestration
│   │   ├── cache.py             # Response cache layer
│   │   ├── adapters/            # LLM endpoint adapters
│   │   │   ├── openai_compat.py
│   │   │   └── base.py
│   │   └── scorers/             # Pluggable scoring functions
│   │       ├── embedding.py
│   │       ├── format.py
│   │       ├── keyword.py
│   │       ├── llm_judge.py
│   │       └── base.py
│   ├── mcp/
│   │   ├── server.py            # MCP server implementation
│   │   └── tools.py             # MCP tool definitions
│   ├── websocket/
│   │   └── manager.py           # WebSocket connection management
│   └── tests/
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── Setup.tsx        # First-boot admin setup
│       │   ├── Login.tsx
│       │   ├── Dashboard.tsx    # Global activity
│       │   ├── Projects.tsx
│       │   ├── Experiment.tsx   # Experiment builder + config
│       │   ├── Live.tsx         # Real-time observability
│       │   ├── Compare.tsx      # Side-by-side run comparison
│       │   └── Admin.tsx        # System settings
│       ├── components/
│       │   ├── Leaderboard.tsx
│       │   ├── SteeringControls.tsx
│       │   ├── RunCard.tsx
│       │   ├── ScoreChart.tsx
│       │   └── Timeline.tsx
│       └── api/
├── docker/
│   ├── Dockerfile               # Multi-stage: API + frontend
│   └── nginx.conf
├── alembic/
├── docker-compose.yml
├── .env.example
├── CLAUDE.md
└── README.md