John Lightner fc2e4cd7d1 MAESTRO: Initialize repository with README, .gitignore, and project files

Add README.md with project description, quick-start instructions, and
AGPL-3.0 license badge. Add .gitignore for Python, Node, and Docker
artifacts. Include existing CLAUDE.md, spec, docker-compose.yml, and
env.example.

2026-04-07 01:39:18 -05:00


PromptLooper

The one who loops prompts — a universal LLM pipeline tuning workbench.

PromptLooper is a self-hosted tool for systematically optimizing LLM prompts, model selection, and inference parameters. It runs experiments across prompt × model × parameter combinations, caches every response, scores results against pluggable evaluation functions, and surfaces the best configurations through a real-time observability dashboard with human-in-the-loop steering.

It ships as a single Docker container (SQLite mode) for zero-config quickstart, or a Docker Compose stack (Postgres + Redis) for production use. An MCP server enables any AI agent to drive PromptLooper programmatically — creating experiments, running sweeps, and reading results without human intervention.

Problem Statement

Anyone building LLM-powered applications faces the same painful loop:

Write a system prompt
Pick a model and parameters (temperature, top_p, max_tokens, etc.)
Run it against sample data
Read the output and decide if it's "good enough"
Tweak something and repeat

This process is manual, unscientific, and wasteful. There's no way to:

Systematically compare configurations side-by-side
Know if you've already tested a particular combination
Quantify "better" beyond gut feeling
Let an agent handle the iteration while you steer from above
Share optimized configurations between projects or team members

PromptLooper makes this process systematic, observable, cached, and agent-drivable.

Target Users

User	Use Case
Solo developer	Tuning prompts for a side project, wants to try 5 models and find the sweet spot
Team building RAG pipelines	Optimizing chunking + embedding + retrieval + synthesis prompts across stages
AI agent (via MCP)	Autonomously running optimization sweeps, reporting back to human when done
Prompt engineer	A/B testing prompt variants at scale with quantified scoring
Infrastructure team	Benchmarking new models against existing baselines before migration

Core Concepts

Experiment

A named configuration that defines:

Sample data: Input documents, queries, or any text the pipeline will process
Pipeline stages: 1-N sequential stages, each with its own prompt template and model config
Evaluation criteria: Scoring functions that grade the output
Parameter space: What to vary (prompt text, model, temperature, top_p, chunk_size, etc.)
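As a rough illustration of how sequential pipeline stages could fit together, the sketch below renders each stage's prompt template and feeds the response into the next stage. It is an assumption-laden sketch, not PromptLooper's runner: the `{input}` placeholder syntax, the `run_pipeline` name, the stage dict keys, and the idea that each stage consumes the previous stage's output are all hypothetical. The result keys borrow field names from the StageResult table.

```python
def run_pipeline(stages, sample, call_llm):
    """Run stages in order, feeding each stage's output into the next (assumed behavior)."""
    text = sample
    results = []
    for index, stage in enumerate(stages):
        # Hypothetical template syntax: "{input}" marks where the stage input goes.
        prompt = stage["template"].format(input=text)
        response = call_llm(prompt, stage["model"], stage["params"])
        results.append(
            {"stage_index": index, "prompt_sent": prompt, "response_raw": response}
        )
        text = response
    return results


def echo_llm(prompt, model, params):
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"[{model}] {prompt}"


stages = [
    {"template": "Extract facts: {input}", "model": "m1", "params": {"temperature": 0.3}},
    {"template": "Summarize: {input}", "model": "m2", "params": {"temperature": 0.1}},
]
results = run_pipeline(stages, "raw notes", echo_llm)
```

Swapping `echo_llm` for a function that calls a real endpoint would leave the stage loop unchanged.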

Run

A single execution of one specific configuration within an experiment. A run captures:

Full input configuration (prompt, model, all parameters)
Raw LLM response(s)
Timing data (latency, tokens in/out)
Evaluation scores
Configuration hash (for cache deduplication)

Sweep

A batch of runs that systematically explores a parameter space. Types:

Grid sweep: Every combination of specified parameter values
Random sweep: Random sampling from parameter ranges
Guided sweep: Agent-driven, where results from previous runs inform the next configuration to try
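The grid and random sweep types can be sketched in a few lines. This is a minimal illustration rather than PromptLooper's sweep engine: the function names and the shape of the `space` dict are assumptions, and the random sweep here samples from discrete value lists instead of continuous ranges to keep the example short.

```python
import itertools
import random


def grid_sweep(space):
    """Yield every combination of the listed parameter values."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_sweep(space, n, seed=0):
    """Sample n configurations, choosing each parameter independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in sorted(space.items())} for _ in range(n)]


# Hypothetical parameter space: two models, temperature 0.1 to 0.9 in 0.2 steps.
space = {
    "model": ["Qwen-72B", "Qwen-32B"],
    "temperature": [0.1, 0.3, 0.5, 0.7, 0.9],
}
grid = list(grid_sweep(space))      # 2 models x 5 temperatures
sample = random_sweep(space, n=3)   # 3 randomly chosen configurations
```

A grid sweep grows multiplicatively with every added parameter, which is one reason a random or guided sweep may be preferable for larger spaces.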

Scoring Function

A pluggable evaluation that takes (input, output, context) and returns a numeric score. Built-in options:

Embedding similarity: How semantically close is the output to a reference answer?
Length compliance: Does the output meet length constraints?
Format compliance: Does the output match expected structure (JSON, markdown, etc.)?
Keyword presence: Do required terms appear in the output?
Human rating: Manual thumbs-up/down or 1-5 star rating from the dashboard
LLM-as-judge: Use a separate LLM call to evaluate quality (configurable judge prompt)
Custom function: User-provided Python snippet or HTTP webhook
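To make the (input, output, context) contract concrete, here is a hedged sketch of what a keyword-presence scorer might look like. The function name, the `required_keywords` context key, and the case-insensitive matching are assumptions for illustration; the only parts taken from this document are the three-argument shape and the normalized 0.0 to 1.0 score.

```python
def keyword_presence(input_text, output, context):
    """Return the fraction of required keywords found in the output (0.0 to 1.0)."""
    keywords = context.get("required_keywords", [])
    if not keywords:
        return 1.0  # nothing is required, so nothing is missing
    lowered = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


score = keyword_presence(
    "Summarize the mixing notes.",
    "Use a compressor before the EQ.",
    {"required_keywords": ["compressor", "EQ", "reverb"]},
)
```

Here two of the three required terms appear, so `score` is two thirds. A custom scorer supplied as a Python snippet would presumably follow a similar shape.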

Project

A workspace that groups related experiments. Users can return to a project and pick up where they left off. Projects store:

All experiments and their runs
Saved "best" configurations
Notes and annotations
Export history

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_promptlooper (ub01)                               │
│  Network: promptlooper (172.33.0.0/24)                                   │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────────────────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │         FastAPI (API)            │  │
│  │  :5434      │  │  job queue  │  │  Experiments, Runs, Scoring,     │  │
│  │  experiments│  │  pub/sub    │  │  Projects, Auth, MCP Server      │  │
│  │  runs, cache│  │  live state │  │  WebSocket for live dashboard    │  │
│  └─────┬───────┘  └──────┬──────┘  └──────────────┬───────────────────┘  │
│        │                 │                        │                      │
│  ┌─────┴─────────────────┴────────────────────────┴───────────────────┐  │
│  │                      Celery Worker                                 │  │
│  │  Executes runs against target LLM endpoints                        │  │
│  │  Caches responses by config hash                                   │  │
│  │  Streams progress via Redis pub/sub                                │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                    Web UI (React + Vite)                           │  │
│  │  nginx → :8400                                                     │  │
│  │  Dashboard, Experiment Builder, Live Observability, Steering       │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                              │
                              │  HTTP (OpenAI-compatible)
                              ▼
              ┌───────────────────────────────┐
              │  Target LLM Endpoints          │
              │  OpenWebUI, vLLM, Ollama,      │
              │  OpenAI, Anthropic, any        │
              │  OpenAI-compatible API          │
              └───────────────────────────────┘

Services (Production Compose)

Service	Image	Port	Purpose
`promptlooper-db`	`postgres:16-alpine`	`5434 → 5432`	Primary data store
`promptlooper-redis`	`redis:7-alpine`	—	Celery broker + pub/sub for live dashboard
`promptlooper-api`	`Dockerfile`	`8000`	FastAPI REST API + MCP server
`promptlooper-worker`	`Dockerfile`	—	Celery worker (run execution)
`promptlooper-web`	`Dockerfile`	`8400 → 80`	React frontend (nginx)

Single Container Mode

When DATABASE_URL is not set, PromptLooper runs with:

SQLite at /data/promptlooper.db
In-process task queue (no Celery/Redis dependency)
All services in one container on port 8400

docker run -p 8400:8400 -v promptlooper-data:/data ghcr.io/xpltdco/promptlooper

Data Model

User

Field	Type	Notes
id	UUID	PK
username	string	Unique, "admin" created on first boot
password_hash	string	bcrypt
is_admin	bool	Default true for first user
created_at	timestamp

Project

Field	Type	Notes
id	UUID	PK
name	string
description	text	Optional
owner_id	UUID	FK → User
created_at	timestamp
updated_at	timestamp

Experiment

Field	Type	Notes
id	UUID	PK
project_id	UUID	FK → Project
name	string
description	text	Optional
sample_data	JSONB	Input documents/queries
pipeline_stages	JSONB	Stage definitions with prompt templates
scoring_config	JSONB	Which scoring functions to use and their weights
parameter_space	JSONB	What to vary and ranges/options
status	enum	draft, running, paused, completed
created_at	timestamp
updated_at	timestamp

Run

Field	Type	Notes
id	UUID	PK
experiment_id	UUID	FK → Experiment
config_hash	string(64)	SHA-256 of full configuration (for cache dedup)
config	JSONB	Complete configuration snapshot
status	enum	pending, running, completed, failed, cached
started_at	timestamp
completed_at	timestamp
duration_ms	int	Wall clock time
tokens_in	int	Total input tokens across all stages
tokens_out	int	Total output tokens
cost_estimate	decimal	Estimated cost based on model pricing

StageResult

Field	Type	Notes
id	UUID	PK
run_id	UUID	FK → Run
stage_index	int	0-based stage number
prompt_sent	text	Actual prompt after template rendering
response_raw	text	Raw LLM response
model_used	string	Model identifier
parameters	JSONB	Temperature, top_p, etc.
tokens_in	int	This stage
tokens_out	int	This stage
latency_ms	int	This stage

Score

Field	Type	Notes
id	UUID	PK
run_id	UUID	FK → Run
scorer_name	string	e.g. "embedding_similarity", "human_rating"
value	float	Normalized 0.0–1.0
metadata	JSONB	Scorer-specific details
created_at	timestamp

ResponseCache

Field	Type	Notes
config_hash	string(64)	PK — SHA-256 of (prompt + model + params + input)
response	text	Cached LLM response
model	string
tokens_in	int
tokens_out	int
latency_ms	int	Original latency
created_at	timestamp

WebhookConfig

Field	Type	Notes
id	UUID	PK
event_type	string	experiment.completed, new_best_found, budget.exhausted, human_needed
url	string	Target URL
headers	JSONB	Optional auth headers
is_active	bool

API Endpoints

Auth

Method	Path	Description
POST	`/api/v1/auth/setup`	First-boot admin password setup
POST	`/api/v1/auth/login`	Login, returns JWT
GET	`/api/v1/auth/me`	Current user info

Admin

Method	Path	Description
GET	`/api/v1/admin/settings`	System settings (guest access, default model, etc.)
PUT	`/api/v1/admin/settings`	Update settings
GET	`/api/v1/admin/stats`	System-wide stats (total runs, cache hit rate, etc.)

Projects

Method	Path	Description
GET	`/api/v1/projects`	List projects
POST	`/api/v1/projects`	Create project
GET	`/api/v1/projects/{id}`	Project detail with experiment summaries
PUT	`/api/v1/projects/{id}`	Update project
DELETE	`/api/v1/projects/{id}`	Delete project and all experiments

Experiments

Method	Path	Description
GET	`/api/v1/experiments`	List experiments (filter by project)
POST	`/api/v1/experiments`	Create experiment
GET	`/api/v1/experiments/{id}`	Experiment detail with run summaries
PUT	`/api/v1/experiments/{id}`	Update experiment config
DELETE	`/api/v1/experiments/{id}`	Delete experiment
POST	`/api/v1/experiments/{id}/sweep`	Start a sweep (grid, random, or guided)
POST	`/api/v1/experiments/{id}/pause`	Pause running sweep
POST	`/api/v1/experiments/{id}/resume`	Resume paused sweep
POST	`/api/v1/experiments/{id}/stop`	Stop sweep

Runs

Method	Path	Description
GET	`/api/v1/experiments/{id}/runs`	List runs with scores (sortable, filterable)
GET	`/api/v1/runs/{id}`	Run detail with stage results
POST	`/api/v1/runs`	Execute a single run (ad-hoc)
POST	`/api/v1/runs/{id}/score`	Add human rating to a run
GET	`/api/v1/experiments/{id}/leaderboard`	Top runs ranked by weighted score
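The leaderboard ranks runs by weighted score. This document does not spell out how the weights are combined, so the sketch below assumes a weighted mean over the scorers listed in the experiment's scoring config. The function names and the run dict layout are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted mean of scorer values; scorers without a weight are ignored."""
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        return 0.0
    weighted = sum(value * weights.get(name, 0.0) for name, value in scores.items())
    return weighted / total


def leaderboard(runs, weights, top_n=3):
    """Rank runs by weighted score, highest first."""
    ranked = sorted(
        runs, key=lambda run: weighted_score(run["scores"], weights), reverse=True
    )
    return ranked[:top_n]


weights = {"embedding_similarity": 0.7, "format_compliance": 0.3}
runs = [
    {"id": "run-a", "scores": {"embedding_similarity": 0.9, "format_compliance": 0.5}},
    {"id": "run-b", "scores": {"embedding_similarity": 0.6, "format_compliance": 1.0}},
]
top = leaderboard(runs, weights)
```

With these weights, `run-a` scores 0.78 and `run-b` scores 0.72, so `run-a` ranks first even though `run-b` has perfect format compliance.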

Export

Method	Path	Description
GET	`/api/v1/experiments/{id}/export/best`	Best config as JSON
GET	`/api/v1/experiments/{id}/export/env`	Best config as .env snippet
GET	`/api/v1/experiments/{id}/export/yaml`	Best config as YAML
GET	`/api/v1/experiments/{id}/export/report`	Full experiment report (markdown)

LLM Endpoints (Target Management)

Method	Path	Description
GET	`/api/v1/endpoints`	List configured LLM endpoints
POST	`/api/v1/endpoints`	Add endpoint (URL, API key, label)
PUT	`/api/v1/endpoints/{id}`	Update endpoint
DELETE	`/api/v1/endpoints/{id}`	Remove endpoint
POST	`/api/v1/endpoints/{id}/test`	Test connectivity and list available models

Webhooks

Method	Path	Description
GET	`/api/v1/webhooks`	List webhook configs
POST	`/api/v1/webhooks`	Create webhook
DELETE	`/api/v1/webhooks/{id}`	Remove webhook

WebSocket

Path	Description
`/ws/experiments/{id}`	Live stream: run progress, scores, stage completions
`/ws/dashboard`	Global activity feed across all experiments

Health

Method	Path	Description
GET	`/health`	Health check (DB connectivity, plus Redis when configured)

MCP Server

PromptLooper exposes an MCP (Model Context Protocol) server so AI agents can drive it programmatically. The MCP server runs as part of the API service.

MCP Tools

Tool	Description
`create_project`	Create a new project workspace
`create_experiment`	Define an experiment with sample data, stages, and scoring
`configure_endpoint`	Add or update an LLM target endpoint
`run_single`	Execute one specific configuration and return results
`run_sweep`	Start a parameter sweep (grid/random/guided)
`get_leaderboard`	Get top N configurations ranked by score
`get_run_detail`	Get full details of a specific run
`export_best_config`	Export the best configuration in JSON/YAML/env format
`pause_sweep`	Pause a running sweep
`resume_sweep`	Resume a paused sweep
`add_human_score`	Rate a run's output
`get_experiment_status`	Check experiment progress
`list_models`	List available models across all configured endpoints

Example Agent Interaction

Agent: "Create a project called 'Chrysopedia Extraction' and an experiment
        that tests the stage3_extraction prompt against Qwen-72B and Qwen-32B,
        sweeping temperature from 0.1 to 0.9 in 0.2 increments.
        Use embedding similarity scoring against these reference outputs.
        Run a grid sweep."

PromptLooper MCP: [create_project] → [create_experiment] → [run_sweep]
                  → streams progress → [get_leaderboard]

Agent: "The top config uses Qwen-72B at temperature 0.3. Export it as
        a .env snippet I can drop into Chrysopedia."

PromptLooper MCP: [export_best_config format=env]

Response Caching

Every LLM call is cached by a SHA-256 hash of:

Prompt text (after template rendering)
Model identifier
All inference parameters (temperature, top_p, max_tokens, etc.)
Input data

If an identical configuration has been run before, the cached response is returned instantly with status: cached. This means:

Re-running experiments with new scoring functions costs zero tokens
A newly added scorer can evaluate all historical runs retroactively, using their cached responses
Accidentally re-running a sweep wastes nothing
Cache can be invalidated per-run or per-experiment if needed
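The caching scheme above can be sketched as follows. This is an illustration under stated assumptions, not the project's cache layer: the canonical-JSON encoding, the in-memory `cache` dict, and the function names are hypothetical. Only the hashed fields and the `cached` status come from this document.

```python
import hashlib
import json


def config_hash(prompt, model, params, input_data):
    """SHA-256 over a canonical JSON encoding of the cache key fields."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params, "input": input_data},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # stand-in for the ResponseCache table


def cached_call(prompt, model, params, input_data, call_llm):
    """Return (response, status); the model is only called on a cache miss."""
    key = config_hash(prompt, model, params, input_data)
    if key in cache:
        return cache[key], "cached"
    response = call_llm(prompt, model, params, input_data)
    cache[key] = response
    return response, "completed"


calls = []


def fake_llm(prompt, model, params, input_data):
    """Stand-in for a real LLM call; records each invocation."""
    calls.append(model)
    return "stub response"


first = cached_call("Summarize:", "qwen-72b", {"temperature": 0.3, "top_p": 0.85}, "doc", fake_llm)
second = cached_call("Summarize:", "qwen-72b", {"top_p": 0.85, "temperature": 0.3}, "doc", fake_llm)
```

The second call lists the same parameters in a different order and still hits the cache, because the keys are sorted before hashing. Any change to the prompt, model, parameters, or input yields a different hash and therefore a fresh call.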

Authentication Model

First Boot

App detects no users exist
Presents a setup screen: create admin username + password
Admin account is created, user is logged in

Guest Access

Admin can toggle allow_guest_access in settings
Guests can view experiments and results (read-only)
Guests cannot create experiments, run sweeps, or modify configs
Default: guest access disabled

API Authentication

JWT tokens for the web UI
API key (generated in admin settings) for programmatic access and MCP
API key passed via Authorization: Bearer <key> header
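A programmatic client would attach the API key as a bearer token. The sketch below only builds the request object and does not send it. The base URL assumes the single-container default port, and `example-key` is a placeholder, not a real credential.

```python
import urllib.request


def build_request(base_url, path, api_key):
    """Build (but do not send) an authenticated GET request."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Assumed host and port; adjust to wherever the API is actually reachable.
req = build_request("http://localhost:8400", "/api/v1/projects", "example-key")
```

Passing `req` to `urllib.request.urlopen` would perform the call against a running instance.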

Real-Time Observability Dashboard

The dashboard is the primary user interface during active experimentation. It provides:

Live Experiment View

Progress bar: X of Y runs completed
Token usage accumulator (running total)
Cost estimate (based on configured model pricing)
Cache hit rate for current sweep
Estimated time remaining
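The live metrics could be derived from finished runs roughly as shown below. This is a simplified sketch with several assumptions: cached runs are treated as costing no new tokens, the time estimate multiplies the mean duration of executed runs by the number of remaining runs, and concurrency is ignored. The function name and return keys are hypothetical; the run fields follow the Run table.

```python
def sweep_progress(total_runs, finished):
    """Summarize sweep progress from a list of finished-run dicts."""
    done = len(finished)
    executed = [r for r in finished if r["status"] != "cached"]
    cached = done - len(executed)
    durations = [r["duration_ms"] for r in executed]
    avg_ms = sum(durations) / len(durations) if durations else 0.0
    return {
        "completed": done,
        "total": total_runs,
        # Assumption: cache hits spend no new tokens, so only executed runs count.
        "tokens": sum(r["tokens_in"] + r["tokens_out"] for r in executed),
        "cache_hit_rate": cached / done if done else 0.0,
        # Naive estimate: remaining runs take the average executed duration each.
        "eta_ms": avg_ms * (total_runs - done),
    }


finished = [
    {"status": "completed", "duration_ms": 1200, "tokens_in": 500, "tokens_out": 200},
    {"status": "cached", "duration_ms": 0, "tokens_in": 500, "tokens_out": 200},
    {"status": "completed", "duration_ms": 800, "tokens_in": 400, "tokens_out": 100},
]
progress = sweep_progress(10, finished)
```

A real estimate would likely also account for `MAX_CONCURRENT_RUNS` and for the chance that remaining runs are cache hits.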

Side-by-Side Output Comparison

Pick any two runs and diff their outputs
Highlight differences in prompt, parameters, and response
Score comparison overlay

Leaderboard

Real-time ranked list of runs by weighted score
Sortable by any individual scorer
Click to expand full run detail

Steering Controls

Pause: Stop the sweep after current run completes
Fork: Create a new experiment branching from current best, with modified parameters
Redirect: Change remaining sweep parameters mid-flight
Approve: Mark a configuration as "good enough" and export
Reject: Exclude a run from leaderboard consideration

Activity Timeline

Chronological feed of events: run started, run completed, new best found, cache hit, error
Filterable by event type

Webhook Events

Event	Payload	Trigger
`experiment.started`	experiment_id, sweep config	Sweep begins
`experiment.completed`	experiment_id, best config, summary stats	All runs finished
`experiment.paused`	experiment_id, reason	Manual or budget pause
`new_best_found`	experiment_id, run_id, scores, config	New top-scoring run
`budget.exhausted`	experiment_id, token_count, cost	Token/cost budget hit
`human_needed`	experiment_id, reason, context	Agent requests human review
`run.failed`	run_id, error	Individual run error
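For illustration, a webhook delivery for one of these events might be assembled like this. The wire format is not specified in this document, so the `{"event": ..., "payload": ...}` envelope is an assumption, as are the function name, the target URL, and the `X-Token` header. The payload fields follow the `new_best_found` row above. The request is built but not sent.

```python
import json
import urllib.request


def build_webhook_request(url, event, payload, headers=None):
    """Build (but do not send) a JSON POST for a webhook event."""
    body = json.dumps({"event": event, "payload": payload}).encode("utf-8")
    all_headers = {"Content-Type": "application/json", **(headers or {})}
    return urllib.request.Request(url, data=body, headers=all_headers, method="POST")


hook = build_webhook_request(
    "http://localhost:9000/hooks/promptlooper",  # hypothetical receiver
    "new_best_found",
    {
        "experiment_id": "exp-1",
        "run_id": "run-7",
        "scores": {"embedding_similarity": 0.87},
        "config": {"model": "qwen2.5-72b-instruct", "temperature": 0.3},
    },
    headers={"X-Token": "example-secret"},  # optional auth header from WebhookConfig
)
```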

Configuration Export Formats

JSON

{
  "model": "qwen2.5-72b-instruct",
  "endpoint": "http://chat.forgetyour.name/api",
  "temperature": 0.3,
  "top_p": 0.85,
  "max_tokens": 2048,
  "system_prompt": "You are a music production knowledge extractor...",
  "score": 0.87,
  "experiment": "chrysopedia-extraction-v2",
  "exported_at": "2026-04-06T12:00:00Z"
}

.env

LLM_MODEL=qwen2.5-72b-instruct
LLM_API_URL=http://chat.forgetyour.name/api
LLM_TEMPERATURE=0.3
LLM_TOP_P=0.85
LLM_MAX_TOKENS=2048
# Score: 0.87 | Experiment: chrysopedia-extraction-v2

YAML

model: qwen2.5-72b-instruct
endpoint: http://chat.forgetyour.name/api
parameters:
  temperature: 0.3
  top_p: 0.85
  max_tokens: 2048
system_prompt: |
  You are a music production knowledge extractor...
metadata:
  score: 0.87
  experiment: chrysopedia-extraction-v2
  exported_at: 2026-04-06T12:00:00Z
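The three formats carry the same data, so one can be derived from another. As a hedged sketch, the function below renders the JSON export as the .env snippet shown above. The variable-name mapping is read off the two examples, and `to_env` itself is a hypothetical helper, not part of the documented API.

```python
def to_env(config):
    """Render an exported JSON config as .env lines."""
    mapping = [
        ("LLM_MODEL", "model"),
        ("LLM_API_URL", "endpoint"),
        ("LLM_TEMPERATURE", "temperature"),
        ("LLM_TOP_P", "top_p"),
        ("LLM_MAX_TOKENS", "max_tokens"),
    ]
    lines = [f"{env_name}={config[key]}" for env_name, key in mapping]
    lines.append(f"# Score: {config['score']} | Experiment: {config['experiment']}")
    return "\n".join(lines)


exported = {
    "model": "qwen2.5-72b-instruct",
    "endpoint": "http://chat.forgetyour.name/api",
    "temperature": 0.3,
    "top_p": 0.85,
    "max_tokens": 2048,
    "system_prompt": "You are a music production knowledge extractor...",
    "score": 0.87,
    "experiment": "chrysopedia-extraction-v2",
    "exported_at": "2026-04-06T12:00:00Z",
}
env_text = to_env(exported)
```

The system prompt is left out of the .env output, matching the example above; multi-line prompts do not fit comfortably in that format.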

Environment Variables

Group	Variable	Default	Notes
Database	`DATABASE_URL`	(none → SQLite)	PostgreSQL connection string
Redis	`REDIS_URL`	(none → in-process)	Redis connection string
Server	`HOST`	`0.0.0.0`	Bind address
Server	`PORT`	`8400`	HTTP port
Auth	`JWT_SECRET`	(auto-generated)	JWT signing key
Auth	`API_KEY`	(none)	Static API key for programmatic access
Defaults	`DEFAULT_ENDPOINT_URL`	(none)	Pre-configured LLM endpoint
Defaults	`DEFAULT_ENDPOINT_KEY`	(none)	API key for default endpoint
Limits	`MAX_CONCURRENT_RUNS`	`4`	Parallel run limit
Limits	`MAX_TOKENS_PER_SWEEP`	`0` (unlimited)	Token budget per sweep
Storage	`DATA_DIR`	`/data`	SQLite DB + file storage location
MCP	`MCP_ENABLED`	`true`	Enable MCP server
MCP	`MCP_PORT`	`8401`	MCP server port
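A sketch of how these variables and defaults might be read is shown below. The project structure lists Pydantic Settings in `config.py`; this example uses only the standard library so it stays self-contained, and the `Settings` class, `load_settings`, and `_flag` are hypothetical names. The auth and default-endpoint variables are omitted for brevity.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    database_url: Optional[str]
    redis_url: Optional[str]
    host: str
    port: int
    max_concurrent_runs: int
    max_tokens_per_sweep: int  # 0 means unlimited
    data_dir: str
    mcp_enabled: bool
    mcp_port: int

    @property
    def single_container(self):
        """No DATABASE_URL means SQLite single-container mode."""
        return self.database_url is None


def _flag(value, default):
    """Parse a boolean-like environment value."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    return Settings(
        database_url=env.get("DATABASE_URL") or None,
        redis_url=env.get("REDIS_URL") or None,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8400")),
        max_concurrent_runs=int(env.get("MAX_CONCURRENT_RUNS", "4")),
        max_tokens_per_sweep=int(env.get("MAX_TOKENS_PER_SWEEP", "0")),
        data_dir=env.get("DATA_DIR", "/data"),
        mcp_enabled=_flag(env.get("MCP_ENABLED"), True),
        mcp_port=int(env.get("MCP_PORT", "8401")),
    )


defaults = load_settings({})
production = load_settings(
    {"DATABASE_URL": "postgresql://user:pass@promptlooper-db:5432/promptlooper"}
)
```

With an empty environment the sketch falls back to single-container mode; setting `DATABASE_URL` switches it to the Postgres-backed configuration.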

Docker Compose (Production — XPLTD Conventions)

Project name: xpltd_promptlooper
Network: promptlooper (172.33.0.0/24)
Persistent data: /vmPool/r/services/promptlooper_*
PostgreSQL port: 5434 (external)
Web UI port: 8400 (external)

Technology Stack

Layer	Technology	Rationale
API	Python 3.12 + FastAPI	Async, OpenAPI auto-gen, matches XPLTD conventions
Task Queue	Celery + Redis	Proven for background job execution, matches Chrysopedia
Database	PostgreSQL 16 (prod) / SQLite (single-container)	JSONB for flexible experiment configs
Real-time	WebSocket via FastAPI + Redis pub/sub	Sub-second dashboard updates
Frontend	React 18 + TypeScript + Vite	Real-time dashboard, matches Chrysopedia
Styling	Tailwind CSS	Fast iteration, utility-first
MCP	Python MCP SDK	Standard protocol for agent integration
Container	Multi-stage Docker build	Single image serves both API and frontend

Development & Deployment

Local Development

git clone git@git.xpltd.co:xpltdco/promptlooper.git
cd promptlooper
cp .env.example .env
docker compose up -d promptlooper-db promptlooper-redis
cd backend && pip install -r requirements.txt
alembic upgrade head
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# In another terminal:
cd frontend && npm install && npm run dev

Production Deployment (ub01)

ssh ub01
cd /vmPool/r/repos/xpltdco/promptlooper
git pull && docker compose build && docker compose up -d

Project Structure

promptlooper/
├── backend/
│   ├── main.py                 # FastAPI entry point
│   ├── config.py               # Pydantic Settings
│   ├── models.py               # SQLAlchemy ORM
│   ├── schemas.py              # Pydantic request/response
│   ├── auth.py                 # JWT + API key auth
│   ├── worker.py               # Celery app config
│   ├── routers/
│   │   ├── auth.py
│   │   ├── projects.py
│   │   ├── experiments.py
│   │   ├── runs.py
│   │   ├── endpoints.py
│   │   ├── export.py
│   │   ├── webhooks.py
│   │   └── admin.py
│   ├── engine/
│   │   ├── runner.py           # Run execution logic
│   │   ├── sweep.py            # Sweep orchestration
│   │   ├── cache.py            # Response cache layer
│   │   ├── adapters/           # LLM endpoint adapters
│   │   │   ├── openai_compat.py
│   │   │   └── base.py
│   │   └── scorers/            # Pluggable scoring functions
│   │       ├── embedding.py
│   │       ├── format.py
│   │       ├── keyword.py
│   │       ├── llm_judge.py
│   │       └── base.py
│   ├── mcp/
│   │   ├── server.py           # MCP server implementation
│   │   └── tools.py            # MCP tool definitions
│   ├── websocket/
│   │   └── manager.py          # WebSocket connection management
│   └── tests/
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── Setup.tsx       # First-boot admin setup
│       │   ├── Login.tsx
│       │   ├── Dashboard.tsx   # Global activity
│       │   ├── Projects.tsx
│       │   ├── Experiment.tsx  # Experiment builder + config
│       │   ├── Live.tsx        # Real-time observability
│       │   ├── Compare.tsx     # Side-by-side run comparison
│       │   └── Admin.tsx       # System settings
│       ├── components/
│       │   ├── Leaderboard.tsx
│       │   ├── SteeringControls.tsx
│       │   ├── RunCard.tsx
│       │   ├── ScoreChart.tsx
│       │   └── Timeline.tsx
│       └── api/
├── docker/
│   ├── Dockerfile              # Multi-stage: API + frontend
│   └── nginx.conf
├── alembic/
├── docker-compose.yml
├── .env.example
├── CLAUDE.md
└── README.md
