promptlooper/Auto Run Docs/02a-backend-engine.md
John Lightner 5a1d029b9b MAESTRO: Add comprehensive engine test suite achieving 90% coverage
Created tests/test_engine_core.py with 52 tests covering webhook dispatch
engine (sync+async delivery, retries, dispatch), format scorer structure/length
edge cases, cache hash determinism with nested/special chars, adapter mock call
tracking, grid sweep combo verification, scorer integration with known inputs,
and EventBus. Engine coverage improved from 83% to 90%, webhooks.py from 27%
to 99%.
2026-04-07 03:45:24 -05:00

54 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 2a — Backend Engine
Implement the core experiment execution engine: LLM adapters, response caching, run execution, sweep orchestration, and scoring framework. This is the heart of PromptLooper.
- [x] Implement backend/engine/adapters/base.py defining the BaseAdapter abstract class with methods: complete(prompt, model, params) → AdapterResponse (containing response text, tokens_in, tokens_out, latency_ms), list_models() → list of model identifiers, and test_connection() → bool. Define the AdapterResponse dataclass.
- [x] Implement backend/engine/adapters/openai_compat.py as the primary adapter. It should work with any OpenAI-compatible API (OpenWebUI, vLLM, Ollama, OpenAI, Anthropic via proxy). Use httpx for async HTTP calls. Support chat completions format with system + user messages. Parse token usage from the response. Handle errors gracefully with retries (3 attempts, exponential backoff). Support both streaming and non-streaming modes.
- [x] Implement backend/engine/cache.py with the ResponseCache layer. Key function: compute_config_hash(prompt, model, params, input_data) → SHA-256 hex string. Methods: get(config_hash) → CachedResponse or None, put(config_hash, response, metadata). In SQLite mode, use the ResponseCache table directly. In Postgres mode, same table but with connection pooling. Include a cache_stats() method returning hit rate, total entries, and storage size.
- [x] Implement backend/engine/runner.py for individual run execution. The run_single function should: (1) iterate through pipeline stages, (2) render prompt templates with Jinja2 (allowing previous stage output as context), (3) check cache before calling LLM, (4) call the LLM adapter if cache miss, (5) store response in cache, (6) create StageResult records, (7) run all configured scorers, (8) create Score records, (9) update Run status and timing, (10) publish progress events via Redis pub/sub (or in-process event bus).
<!-- Completed: Implemented run_single with all 10 requirements, EventBus (Redis + in-process), Jinja2 templating. 21 tests in test_runner.py, all passing. -->
- [x] Implement backend/engine/sweep.py for sweep orchestration. Support three sweep types: GridSweep (enumerate all combinations from parameter_space), RandomSweep (sample N random configs from parameter ranges), GuidedSweep (use previous results to inform next config — start with top-K exploitation + random exploration). The sweep runner should: respect MAX_CONCURRENT_RUNS for parallelism, track token budget and stop at MAX_TOKENS_PER_SWEEP, emit WebSocket events for each run completion, handle pause/resume/stop via Redis flags.
<!-- Completed: Implemented all 3 sweep types (grid/random/guided), bounded parallelism via asyncio.Semaphore, token budget enforcement, Redis-based pause/resume/stop flags, sweep-level events. 36 tests in test_sweep.py, all passing. -->
- [x] Implement backend/engine/scorers/base.py defining the BaseScorer abstract class with: name property, score(input_data, output, context) → float (0.0 to 1.0), and an optional async variant. The context dict should include the experiment config, stage results, and any reference data.
<!-- Completed: BaseScorer ABC with name property, score() abstract method, score_async() default implementation. 9 tests in test_scorer_base.py, all passing. -->
- [x] Implement backend/engine/scorers/embedding.py — uses a configurable embedding endpoint (Ollama nomic-embed-text or any OpenAI-compatible embedding API) to compute cosine similarity between output and reference answer. Normalize to 0.01.0 range.
<!-- Completed: EmbeddingScorer using httpx async calls to OpenAI-compatible /embeddings endpoint, cosine similarity normalized to [0,1]. Reads reference from context["reference"]. 19 tests in test_scorer_embedding.py, all passing. -->
- [x] Implement backend/engine/scorers/format.py — checks if output matches expected format. Supports: json (valid JSON parse), markdown (has headers, lists), length (within min/max token count), structure (matches a provided JSON schema).
<!-- Completed: FormatScorer with 4 format checks (json, markdown, length, structure). JSON schema validation via jsonschema library with basic fallback. 38 tests in test_scorer_format.py, all passing. -->
- [x] Implement backend/engine/scorers/keyword.py — checks for presence/absence of required keywords in output. Configurable with required_present and required_absent lists. Score = (found / required) ratio.
<!-- Completed: KeywordScorer with required_present/required_absent lists, case-sensitive option, combined ratio scoring. 37 tests in test_scorer_keyword.py, all passing. -->
- [x] Implement backend/engine/scorers/llm_judge.py — sends the output to a separate LLM with a configurable judge prompt and asks for a 1-10 rating. Parses the numeric score from the response. This scorer requires an LLM call so it should be clearly marked as "costs tokens" in the UI. Cache the judge's response too.
<!-- Completed: LLMJudgeScorer with configurable judge prompt, 1-10 rating parsing via regex, normalized to 0.0-1.0. COSTS_TOKENS class marker for UI. Optional ResponseCacheLayer integration for caching judge responses. Retries with exponential backoff. 36 tests in test_scorer_llm_judge.py, all passing. -->
- [x] Wire up the Celery worker in backend/worker.py. Define tasks: execute_run(run_id), execute_sweep(experiment_id, sweep_config). Configure Celery to use Redis as broker. In single-container mode (no Redis), implement a simple synchronous fallback that runs tasks in-process.
<!-- Completed: Created engine/tasks.py with execute_run and execute_sweep Celery tasks (autodiscovered via worker.py). SyncTaskResult class mimics AsyncResult for fallback. dispatch_run/dispatch_sweep helpers route to Celery or sync execution based on settings.use_in_process_queue. 17 tests in test_tasks.py, all passing. -->
- [x] Implement backend/routers/endpoints.py fully — CRUD for LLM endpoint configurations. The test endpoint should call adapter.test_connection() and adapter.list_models() and return the results. Store endpoint configs in the database with encrypted API keys (Fernet symmetric encryption, key derived from JWT_SECRET).
<!-- Completed: Full CRUD (list/get/create/update/delete) + test_connection endpoint. LLMEndpoint model added to models.py. Fernet encryption via encryption.py (PBKDF2 key derivation from JWT_SECRET). API keys never exposed in responses; has_api_key boolean flag added to EndpointResponse. 25 tests in test_endpoints.py, all passing. -->
- [x] Implement backend/routers/experiments.py fully — CRUD plus sweep control. POST /experiments/{id}/sweep should validate the sweep config, create Run records for all configurations, and dispatch to Celery. Pause/resume/stop should set Redis flags that the sweep runner checks between runs.
<!-- Completed: Full CRUD (list with project filter, get, create, update, delete) + sweep control (start/pause/resume/stop + status). SweepRequest/SweepStatusResponse schemas added. Sweep dispatch via Celery/sync fallback. Redis flags for pause/resume/stop, with single-container mode fallback. 34 tests in test_experiments.py, all passing. -->
- [x] Implement backend/routers/runs.py fully — list runs with filtering (by experiment, status, score range), get run detail with stage results and scores, POST for ad-hoc single runs, and POST /{id}/score for human ratings. Include the leaderboard endpoint that returns top N runs ranked by weighted score.
<!-- Completed: Full runs router with list (filter by experiment/status/score range + pagination), detail (eager-loaded stage results + scores), ad-hoc run creation with dispatch, human scoring POST, and leaderboard with configurable weighted scoring from experiment scoring_config. Added AdHocRunCreate, LeaderboardEntry, LeaderboardResponse schemas. 25 tests in test_runs.py, all passing. -->
- [x] Implement backend/routers/export.py — export best config in JSON, .env, and YAML formats as defined in the spec. Include metadata (score, experiment name, timestamp). The report endpoint should generate a markdown summary of the experiment: config space explored, top 5 configs, score distributions, token usage, timing stats.
<!-- Completed: Full export router with 4 endpoints: /best (JSON with weighted score, metadata), /env (flattened KEY=VALUE with comments), /yaml (simple serializer, no PyYAML dependency), /report (markdown with config space, top N configs, score distributions, token usage, timing stats). Auth required on all endpoints. 34 tests in test_export.py, all passing. -->
- [x] Implement backend/websocket/manager.py — WebSocket connection manager that: maintains active connections per experiment and globally, receives Redis pub/sub messages and broadcasts to relevant connections, handles connection/disconnection cleanly, supports reconnection with message replay (last N events).
<!-- Completed: WebSocketManager with per-experiment and global subscriptions, Redis pub/sub bridge (sync + async), deque-based replay buffers with since_ts/limit filtering, clean disconnect cleanup, runtime subscribe/unsubscribe, stats API. Integrated into main.py with enhanced /ws endpoint supporting subscribe/unsubscribe/replay actions and query-param-based initial subscriptions. 35 tests in test_ws_manager.py, all passing. -->
- [x] Implement backend/routers/webhooks.py — CRUD for webhook configs. When events occur (in runner.py and sweep.py), dispatch webhook calls asynchronously via Celery. Include retry logic (3 attempts) and log delivery status.
<!-- Completed: Full CRUD (list with event_type filter, get, create, update, delete) + webhook dispatch engine (engine/webhooks.py). WebhookDelivery model for delivery logging. Dispatch integrated into runner.py (run.completed, run.failed) and sweep.py (sweep.completed) via fire_webhooks Celery task with sync fallback. 3 retries with exponential backoff, delivery status logged per attempt. 30 tests in test_webhooks.py, all passing. -->
- [x] Write tests for the core engine: test cache hash determinism, test adapter mock calls, test scorer implementations with known inputs, test sweep configuration generation (grid should produce correct number of combos, random should respect ranges). Aim for >80% coverage on engine/ directory.
<!-- Completed: Created tests/test_engine_core.py with 52 tests covering: webhook engine (get_active_webhooks, _log_delivery, deliver_webhook sync+async, dispatch_webhooks sync+async), format scorer structure/length edge cases, cache hash edge cases, adapter mock call tracking, grid combo count verification, scorer integration with known inputs, EventBus. Coverage on engine/ went from 83% to 90%. engine/webhooks.py went from 27% to 99%. 309 tests passing (3 pre-existing failures in test_tasks.py unrelated to this work). -->