Full webhook system: CRUD endpoints (list/filter/get/create/update/delete),
WebhookDelivery model for delivery audit trail, dispatch engine with 3-attempt
retry and exponential backoff, Celery task integration with sync fallback,
and webhook firing hooks in runner.py and sweep.py event paths.
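
The 3-attempt retry with exponential backoff could look roughly like the sketch below. The function name, the record fields, and the 1-second base delay are assumptions for illustration only; the real dispatch engine and WebhookDelivery model may differ.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


def deliver_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 1.0,  # assumed value, not taken from the commit
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Call send() up to max_attempts times, doubling the delay after each failure.

    send() returns an HTTP status code or raises. One audit record is
    returned per attempt (a hypothetical stand-in for WebhookDelivery rows).
    """
    records: list[dict] = []
    for attempt in range(1, max_attempts + 1):
        status: Optional[int] = None
        error: Optional[str] = None
        try:
            status = send()
        except Exception as exc:  # network errors count as failed attempts
            error = str(exc)
        ok = status is not None and 200 <= status < 300
        records.append({"attempt": attempt, "status": status, "ok": ok, "error": error})
        if ok:
            break
        if attempt < max_attempts:
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return records
```

Injecting `sleep` keeps the backoff schedule testable without real waiting.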
- WebSocketManager in backend/websocket/manager.py with per-experiment and global subscriptions
- Redis pub/sub bridge (sync + async) broadcasting events to relevant WebSocket clients
- Deque-based replay buffers with since_ts/limit filtering for reconnection support
- Runtime subscribe/unsubscribe and stats API
- Enhanced /ws endpoint in main.py with subscribe/unsubscribe/replay actions
- 35 tests in test_ws_manager.py, all passing
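
A deque-based replay buffer with since_ts/limit filtering might be sketched as follows. The class name, the strict "newer than since_ts" comparison, and the "keep the most recent `limit` events" rule are assumptions; the actual WebSocketManager may behave differently.

```python
from __future__ import annotations

from collections import deque
from typing import Any, Optional


class ReplayBuffer:
    """Bounded event history for clients that reconnect (hypothetical sketch)."""

    def __init__(self, maxlen: int = 100) -> None:
        # A deque with maxlen silently drops the oldest event when full.
        self._events: deque = deque(maxlen=maxlen)

    def append(self, ts: float, payload: Any) -> None:
        self._events.append((ts, payload))

    def replay(
        self, since_ts: Optional[float] = None, limit: Optional[int] = None
    ) -> list:
        # Keep only events strictly newer than since_ts (assumed semantics).
        events = [e for e in self._events if since_ts is None or e[0] > since_ts]
        if limit is not None:
            # Return the most recent `limit` events; limit=0 yields nothing.
            events = events[-limit:] if limit > 0 else []
        return events
```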
Four fully authenticated endpoints at /api/export/experiments/{id}/:
- /best: Returns best config as JSON with weighted score and metadata
- /env: Flattened KEY=VALUE format with metadata comments
- /yaml: Simple YAML serialization (no external dependency)
- /report: Full markdown report with config space, top N configs,
  score distributions, token usage, and timing stats
34 tests in test_export.py covering all endpoints, auth, 404s, and helpers.
Updated test_routers.py to expect 401 (auth required) instead of 501 (stub).
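
As an illustration of the flattened KEY=VALUE format served by /env, a nested config could be flattened like this. The helper name and the naming rules (upper-casing, underscores as separators) are assumptions, not the export router's actual behavior.

```python
from __future__ import annotations


def flatten_env(config: dict, prefix: str = "") -> list[str]:
    """Flatten a nested config dict into KEY=VALUE lines (hypothetical helper)."""
    lines: list[str] = []
    for key, value in config.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        # Normalize to a conventional environment-variable name.
        name = name.upper().replace("-", "_").replace(".", "_")
        if isinstance(value, dict):
            # Nested sections contribute their parent key as a prefix.
            lines.extend(flatten_env(value, name))
        else:
            lines.append(f"{name}={value}")
    return lines
```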
- List runs with filtering by experiment, status, and score range plus pagination
- Get run detail with eager-loaded stage results and scores
- Ad-hoc single run creation with Celery/sync dispatch
- Human scoring endpoint (POST /{id}/score)
- Leaderboard endpoint with configurable weighted scoring from experiment scoring_config
- Added AdHocRunCreate, LeaderboardEntry, LeaderboardResponse schemas
- 25 tests in test_runs.py, all passing (503 total tests passing)
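
The weighted leaderboard scoring can be pictured with the sketch below. It assumes scoring_config reduces to a mapping of scorer name to weight and that the result is a weight-normalized average; the run dict layout and both function names are hypothetical.

```python
from __future__ import annotations


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weight-normalized average over the scorers present in `scores`."""
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        return 0.0  # no applicable weights, so nothing to rank on
    return sum(scores[name] * w for name, w in used.items()) / total


def leaderboard(runs: list[dict], weights: dict[str, float], top_n: int = 10) -> list[dict]:
    """Rank runs by weighted score, best first."""
    scored = [(weighted_score(r["scores"], weights), r["id"]) for r in runs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"rank": rank, "run_id": run_id, "score": score}
        for rank, (score, run_id) in enumerate(scored[:top_n], start=1)
    ]
```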
Add complete experiments API: list (with project filter), get, create, update,
delete, plus sweep lifecycle (start/pause/resume/stop/status). Adds
SweepRequest and SweepStatusResponse schemas. Sweep dispatch routes through
Celery with synchronous fallback for single-container mode. Redis flags control
pause/resume/stop; direct DB updates are used when Redis is unavailable. 34 tests.
- Add LLMEndpoint model to models.py with encrypted api_key field
- Create encryption.py with Fernet symmetric encryption (key derived from JWT_SECRET via PBKDF2)
- Implement full endpoints router: list, get, create, update, delete + test_connection
- test_connection endpoint calls adapter.test_connection() and adapter.list_models()
- API keys never exposed in responses; has_api_key boolean flag added
- 25 tests in test_endpoints.py, all 444 tests passing
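
The key derivation and response masking might look like the standard-library sketch below. A Fernet key is 32 random bytes encoded as URL-safe base64, so PBKDF2 output of that length can be encoded into one. The salt, iteration count, and both function names are assumptions; encryption.py itself is not shown in this commit message.

```python
import base64
import hashlib


def derive_fernet_key(secret: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a Fernet-format key (URL-safe base64 of 32 bytes) from a secret.

    The salt and iteration count are illustrative assumptions.
    """
    raw = hashlib.pbkdf2_hmac(
        "sha256", secret.encode("utf-8"), salt, iterations, dklen=32
    )
    return base64.urlsafe_b64encode(raw)


def public_view(record: dict) -> dict:
    """Drop the api_key from a response payload and expose only a boolean flag."""
    public = {k: v for k, v in record.items() if k != "api_key"}
    public["has_api_key"] = bool(record.get("api_key"))
    return public
```

The resulting bytes could then be passed to `cryptography.fernet.Fernet(key)` for the actual encrypt and decrypt calls.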
Created engine/tasks.py with:
- execute_run and execute_sweep Celery tasks registered via autodiscover
- SyncTaskResult class mimicking Celery AsyncResult for in-process mode
- dispatch_run/dispatch_sweep helpers that route to Celery or sync based on config
- Proper async-to-sync bridging for the async engine functions
- 17 tests covering task execution, sync fallback, error handling, and Celery dispatch
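
A minimal sketch of the in-process fallback follows. It mirrors a few attributes of Celery's AsyncResult (id, status, ready(), successful(), get()) and bridges async to sync with asyncio.run(); the helper name `run_sync` is hypothetical, and the sketch assumes no event loop is already running in the calling thread.

```python
import asyncio
import uuid
from typing import Any, Awaitable, Callable, Optional


class SyncTaskResult:
    """Already-finished result exposing a small subset of AsyncResult's API."""

    def __init__(self, result: Any = None, error: Optional[BaseException] = None) -> None:
        self.id = str(uuid.uuid4())
        self._result = result
        self._error = error

    @property
    def status(self) -> str:
        return "SUCCESS" if self._error is None else "FAILURE"

    def ready(self) -> bool:
        return True  # sync execution has completed by the time this object exists

    def successful(self) -> bool:
        return self._error is None

    def get(self, timeout: Optional[float] = None) -> Any:
        # timeout is accepted for signature compatibility and ignored.
        if self._error is not None:
            raise self._error
        return self._result


def run_sync(coro_fn: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any) -> SyncTaskResult:
    """Run an async engine function to completion and wrap its outcome."""
    try:
        return SyncTaskResult(result=asyncio.run(coro_fn(*args, **kwargs)))
    except Exception as exc:
        return SyncTaskResult(error=exc)
```

Callers can then treat Celery and sync dispatch uniformly, since both return an object with `get()`.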
Adds backend/engine/sweep.py with three sweep strategies:
- GridSweep: exhaustive enumeration of all parameter combinations
- RandomSweep: N random samples from parameter ranges (list, min/max, step)
- GuidedSweep: top-K exploitation + random exploration from previous results
Features: bounded parallelism via asyncio.Semaphore, token budget enforcement,
Redis-based pause/resume/stop control flags, sweep-level event publishing.
36 tests in test_sweep.py covering config generation, helpers, and full sweep execution.
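
Grid enumeration and Semaphore-bounded parallelism can be sketched as below. The function names and the shape of the parameter space (name mapped to a list of values) are assumptions; token budgets and pause/resume/stop flags are left out.

```python
from __future__ import annotations

import asyncio
import itertools
from typing import Any, Awaitable, Callable


def grid_configs(space: dict[str, list]) -> list[dict]:
    """Exhaustively enumerate every combination of parameter values."""
    keys = list(space)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(space[k] for k in keys))
    ]


async def run_bounded(
    configs: list[dict],
    run_one: Callable[[dict], Awaitable[Any]],
    max_parallel: int = 2,
) -> list:
    """Run run_one(config) for every config, at most max_parallel at a time."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(cfg: dict) -> Any:
        async with sem:  # blocks while max_parallel runs are in flight
            return await run_one(cfg)

    # gather() returns results in the same order as the input configs.
    return await asyncio.gather(*(guarded(c) for c in configs))
```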
Adds backend/engine/runner.py with run_single() that iterates pipeline stages,
renders Jinja2 prompt templates with stage history context, checks and populates
the response cache, calls LLM adapters, runs configured scorers, creates StageResult and Score
records, and publishes progress events via Redis pub/sub or in-process EventBus.
Includes 21 passing tests covering all execution paths.
Add OpenAICompatAdapter that works with any OpenAI-compatible API endpoint
(OpenWebUI, vLLM, Ollama, OpenAI, Anthropic via proxy). Features:
- Async HTTP calls via httpx with configurable timeout
- Chat completions format with system + user messages
- Token usage parsing from responses
- Exponential backoff retries (configurable, default 3 attempts)
- Both streaming (SSE) and non-streaming modes
- Model listing and connection testing
- 21 tests covering construction, request building, response parsing,
  retry logic, and error handling
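
For the streaming mode, OpenAI-style chat completion streams send `data: {json}` lines and end with `data: [DONE]`. The sketch below extracts the text deltas from such lines; the function name is hypothetical, and other compatible servers may deviate from this shape.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE chat completion lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

With httpx, the input lines would typically come from iterating a streamed response line by line.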
Define the LLM adapter interface in backend/engine/adapters/base.py with
async methods complete(), list_models(), and test_connection(). The
AdapterResponse dataclass holds response text, token counts, latency,
model name, and raw metadata. Includes 11 tests covering instantiation
guards, concrete subclass behavior, and dataclass semantics.
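
The interface described above could be sketched with an abstract base class. The AdapterResponse field names, the `BaseAdapter` class name, and the `EchoAdapter` subclass are assumptions for illustration; only the three method names come from the commit message.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class AdapterResponse:
    # Field names are illustrative; base.py may name them differently.
    text: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    model: str = ""
    raw: dict = field(default_factory=dict)


class BaseAdapter(ABC):
    """Abstract LLM adapter; instantiating it directly raises TypeError."""

    @abstractmethod
    async def complete(self, prompt: str, **params) -> AdapterResponse: ...

    @abstractmethod
    async def list_models(self) -> list[str]: ...

    @abstractmethod
    async def test_connection(self) -> bool: ...


class EchoAdapter(BaseAdapter):
    """Toy concrete subclass that echoes the prompt back."""

    async def complete(self, prompt: str, **params) -> AdapterResponse:
        return AdapterResponse(text=prompt, model="echo")

    async def list_models(self) -> list[str]:
        return ["echo"]

    async def test_connection(self) -> bool:
        return True
```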