Created tests/test_engine_core.py with 52 tests covering webhook dispatch
engine (sync+async delivery, retries, dispatch), format scorer structure/length
edge cases, cache hash determinism with nested/special chars, adapter mock call
tracking, grid sweep combo verification, scorer integration with known inputs,
and EventBus. Engine coverage improved from 83% to 90%, webhooks.py from 27%
to 99%.
Full webhook system: CRUD endpoints (list/filter/get/create/update/delete),
WebhookDelivery model for delivery audit trail, dispatch engine with 3-attempt
retry and exponential backoff, Celery task integration with sync fallback,
and webhook firing hooks in runner.py and sweep.py event paths.
- WebSocketManager in backend/websocket/manager.py with per-experiment and global subscriptions
- Redis pub/sub bridge (sync + async) broadcasting events to relevant WebSocket clients
- Deque-based replay buffers with since_ts/limit filtering for reconnection support
- Runtime subscribe/unsubscribe and stats API
- Enhanced /ws endpoint in main.py with subscribe/unsubscribe/replay actions
- 35 tests in test_ws_manager.py, all passing
Four fully authenticated endpoints at /api/export/experiments/{id}/:
- /best: Returns best config as JSON with weighted score and metadata
- /env: Flattened KEY=VALUE format with metadata comments
- /yaml: Simple YAML serialization (no external dependency)
- /report: Full markdown report with config space, top N configs,
score distributions, token usage, and timing stats
34 tests in test_export.py covering all endpoints, auth, 404s, and helpers.
Updated test_routers.py to expect 401 (auth required) instead of 501 (stub).
Custom SVG-based charts with no external dependencies. Scatter plot for
score vs parameter value, bar chart for top N configs comparison, line
chart for score progression over time. Interactive tooltips, click
callbacks, chart type switching, dark mode support. 30 tests added.
- Two-column run selectors with experiment→run cascading dropdowns and URL state sync
- Config diff with color-coded same/changed/added/removed entries using key-level comparison
- Line-level LCS response diff with added/removed/same highlighting
- Score comparison with overlaid indigo/emerald bars per scorer
- Pick Winner buttons submit human_preference score via API
- Full RunCard detail view for each run side by side
- 15 tests added (5 diff helper unit tests + 10 component integration tests)
- App.test.tsx updated to mock experiments.list for ComparePage route
- List runs with filtering by experiment, status, and score range plus pagination
- Get run detail with eager-loaded stage results and scores
- Ad-hoc single run creation with Celery/sync dispatch
- Human scoring endpoint (POST /{id}/score)
- Leaderboard endpoint with configurable weighted scoring from experiment scoring_config
- Added AdHocRunCreate, LeaderboardEntry, LeaderboardResponse schemas
- 25 tests in test_runs.py, all passing (503 total tests passing)
Add complete experiments API: list (with project filter), get, create, update,
delete, plus sweep lifecycle (start/pause/resume/stop/status). Adds
SweepRequest and SweepStatusResponse schemas. Sweep dispatch routes through
Celery with synchronous fallback for single-container mode. Redis flags control
pause/resume/stop; direct DB updates used when Redis unavailable. 34 tests.
Extracted inline SteeringControls from LivePage into standalone component.
Added Fork button (modal to clone experiment config), Export Best dropdown
(JSON/YAML/.env download), and estimated time remaining stat. LivePage
updated to import the new component. 33 tests added, all 284 tests pass.
- Add LLMEndpoint model to models.py with encrypted api_key field
- Create encryption.py with Fernet symmetric encryption (key derived from JWT_SECRET via PBKDF2)
- Implement full endpoints router: list, get, create, update, delete + test_connection
- Test endpoint calls adapter.test_connection() and list_models()
- API keys never exposed in responses; has_api_key boolean flag added
- 25 tests in test_endpoints.py, all 444 tests passing
Extract the inline LeaderboardTable from LivePage into a standalone
Leaderboard component with click-to-expand detail rows, sortable
columns, smooth slide-in animation for new entries, and a subtle
glow effect on the best run. 29 tests added.
Created engine/tasks.py with:
- execute_run and execute_sweep Celery tasks registered via autodiscover
- SyncTaskResult class mimicking Celery AsyncResult for in-process mode
- dispatch_run/dispatch_sweep helpers that route to Celery or sync based on config
- Proper async-to-sync bridging for the async engine functions
- 17 tests covering task execution, sync fallback, error handling, and Celery dispatch
Full LivePage implementation with 60/40 split layout:
- Left column: Activity Timeline with color-coded event cards (run.started, run.completed, new_best_found, cache_hit, run.failed), event type filtering, and auto-scroll toggle
- Right column: Leaderboard table with sortable columns, best-run highlighting, and status badges; Steering Controls with pause/resume/stop (with confirmation dialogs), progress bar, token counter, cost estimate, and cache hit rate
- WebSocket integration with exponential backoff reconnect, connection status indicator, and experiment subscription
- 35 tests covering loading/error states, WebSocket events, timeline filtering, leaderboard updates, progress tracking, and steering control interactions
Built standalone PromptEditor with transparent-textarea overlay for syntax
highlighting of Jinja2 expressions, statements, and comments. Includes
clickable variable sidebar for insertion and preview panel with sample data
substitution. Integrated into ExperimentPage PipelineStageCard. 27 tests added.
Adds backend/engine/sweep.py with three sweep strategies:
- GridSweep: exhaustive enumeration of all parameter combinations
- RandomSweep: N random samples from parameter ranges (list, min/max, step)
- GuidedSweep: top-K exploitation + random exploration from previous results
Features: bounded parallelism via asyncio.Semaphore, token budget enforcement,
Redis-based pause/resume/stop control flags, sweep-level event publishing.
36 tests in test_sweep.py covering config generation, helpers, and full sweep execution.
Build the full Experiment Builder (ExperimentPage.tsx) with: basic info form,
sample data input (text/JSON/file upload), pipeline stage builder with template
variables and preview, scoring configuration with enable toggles and weight
sliders, parameter space definition (fixed/range/options types), and action
buttons (Save Draft, Run Single, Start Sweep). Supports both creating new
experiments and editing existing ones. 20 tests added.
Adds backend/engine/runner.py with run_single() that iterates pipeline stages,
renders Jinja2 prompt templates with stage history context, checks/stores response
cache, calls LLM adapters, runs configured scorers, creates StageResult and Score
records, and publishes progress events via Redis pub/sub or in-process EventBus.
Includes 21 passing tests covering all execution paths.
Add OpenAICompatAdapter that works with any OpenAI-compatible API endpoint
(OpenWebUI, vLLM, Ollama, OpenAI, Anthropic via proxy). Features:
- Async HTTP calls via httpx with configurable timeout
- Chat completions format with system + user messages
- Token usage parsing from responses
- Exponential backoff retries (configurable, default 3 attempts)
- Both streaming (SSE) and non-streaming modes
- Model listing and connection testing
- 21 tests covering construction, request building, response parsing,
retry logic, and error handling
- Full setup form with username, password, confirm password
- Auth detection on mount (redirects if already authenticated)
- Client-side validation (empty username, short password, mismatch)
- Server error handling (409 conflict, network errors)
- Welcoming UI with gradient background, dark mode support
- 9 new tests covering all states and error paths
- Updated App.test.tsx to handle async SetupPage rendering
- Added @testing-library/user-event dependency
Define the LLM adapter interface in backend/engine/adapters/base.py with
async methods complete(), list_models(), and test_connection(). The
AdapterResponse dataclass holds response text, token counts, latency,
model name, and raw metadata. Includes 11 tests covering instantiation
guards, concrete subclass behavior, and dataclass semantics.
Create docker/entrypoint.sh to run alembic migrations on API startup.
Create backend/worker.py with Celery app config for the compose worker service.
Fix README single-container port (8000) and add production compose documentation.
Add 27 tests (stack integration + worker) verifying all Docker/compose artifacts
are present, consistent, and the /health endpoint responds correctly.
Add SetupPage, LoginPage, DashboardPage, ProjectsPage, ExperimentPage, LivePage,
ComparePage, and AdminPage as placeholder components. Wire up react-router-dom routing
in App.tsx with BrowserRouter in main.tsx. Unknown routes redirect to dashboard.
Install vitest + @testing-library/react and add 9 routing tests. Build passes cleanly.
FastAPI application with:
- CORS middleware (permissive for dev)
- /health endpoint checking DB and Redis connectivity
- /ws WebSocket endpoint with ConnectionManager for real-time updates
- Async lifespan hooks for DB engine and Redis init/teardown
- get_db dependency for session management
- Dynamic router mounting that silently skips missing router modules
- 10 tests covering all endpoints and utilities
All 13 environment variables from the spec defined with proper defaults.
SQLite fallback when DATABASE_URL is unset, in-process queue flag when
REDIS_URL is unset, JWT_SECRET auto-generation, empty API_KEY normalization.
13 unit tests covering all configuration paths.
Three-stage Dockerfile: frontend-build (Node 20), api (Python 3.12 + uvicorn),
web (nginx 1.27). nginx.conf proxies /api and /ws to the API service with
WebSocket upgrade support. Includes backend/requirements.txt with all Python
deps, frontend scaffolding (Vite + React + TypeScript + Tailwind), and
placeholder alembic files for Docker COPY compatibility.
Fixed DATABASE_URL to use standard postgresql:// scheme, hardcoded DB
credentials for dev simplicity, added API_KEY pass-through, set worker
working_dir, and made JWT_SECRET optional with dev default. All 5 services:
db (:5434), redis, api (MCP :8401), worker (Celery), web (:8400).
Includes all 13 env vars organized into 7 groups: Database, Redis,
Server, Auth, Default LLM Endpoint, Limits, Storage, and MCP.
Production-only variables are commented out; single-container defaults
work out of the box.
Set up all directories from the spec's Project Structure section:
- backend/ with routers/, engine/adapters/, engine/scorers/, mcp/,
websocket/, tests/ (all with __init__.py)
- frontend/src/ with pages/, components/, api/ (.gitkeep)
- docker/ (.gitkeep)
- alembic/versions/ (.gitkeep)
Add README.md with project description, quick-start instructions, and
AGPL-3.0 license badge. Add .gitignore for Python, Node, and Docker
artifacts. Include existing CLAUDE.md, spec, docker-compose.yml, and
env.example.