test: Added automatic primary→fallback LLM endpoint switching in ChatSe…

- "backend/chat_service.py"
- "backend/tests/test_chat.py"
- "docker-compose.yml"

GSD-Task: S08/T01
This commit is contained in:
jlightner 2026-04-04 14:31:28 +00:00
parent c42d21a29f
commit 899ab742a8
12 changed files with 597 additions and 4 deletions


@ -12,7 +12,7 @@ Production hardening, mobile polish, creator onboarding, and formal validation.
| S04 | [B] Rate Limiting + Cost Management | low | — | ✅ | Chat requests limited per-user and per-creator. Token usage dashboard in admin. |
| S05 | [B] AI Transparency Page | low | — | ✅ | Creator sees all entities, relationships, and technique pages derived from their content |
| S06 | [B] Graph Backend Evaluation | low | — | ✅ | Benchmark report: NetworkX vs Neo4j at current and projected entity counts |
| S07 | [A] Data Export (GDPR-Style) | medium | — | | Creator downloads a ZIP with all derived content, entities, and relationships |
| S08 | [B] Load Testing + Fallback Resilience | medium | — | ⬜ | 10 concurrent chat sessions maintain acceptable latency. DGX down → Ollama fallback works. |
| S09 | [B] Prompt Optimization Pass | low | — | ⬜ | Chat quality reviewed across creators. Personality fidelity assessed. |
| S10 | Requirement Validation (R015, R037-R041) | low | — | ⬜ | R015, R037, R038, R039, R041 formally validated and signed off |


@ -0,0 +1,85 @@
---
id: S07
parent: M025
milestone: M025
provides:
- GET /creator/export endpoint returning ZIP archive of all creator-owned data
- Export My Data button on CreatorDashboard
requires:
[]
affects:
[]
key_files:
- backend/routers/creator_dashboard.py
- backend/tests/test_export.py
- frontend/src/pages/CreatorDashboard.tsx
- frontend/src/pages/CreatorDashboard.module.css
- frontend/src/api/creator-dashboard.ts
key_decisions:
- In-memory ZIP via io.BytesIO — per-creator datasets are small enough that disk streaming isn't needed
- Column introspection via __table__.columns for serialization — adapts automatically to schema changes
- Blob download via hidden anchor + URL.createObjectURL — standard browser download pattern
patterns_established:
- Authenticated blob download pattern: fetch with Bearer token → response.blob() → object URL → hidden anchor click → URL.revokeObjectURL
observability_surfaces:
- Structured logging on export start (creator_id) and completion (file count, approximate size)
drill_down_paths:
- .gsd/milestones/M025/slices/S07/tasks/T01-SUMMARY.md
- .gsd/milestones/M025/slices/S07/tasks/T02-SUMMARY.md
duration: ""
verification_result: passed
completed_at: 2026-04-04T14:20:43.400Z
blocker_discovered: false
---
# S07: [A] Data Export (GDPR-Style)
**Creator can download a ZIP archive of all their derived content, entities, and relationships via an authenticated endpoint, with a one-click button on the dashboard.**
## What Happened
Added a GDPR-style data export feature spanning backend and frontend. The backend endpoint `GET /creator/export` queries 12 creator-owned tables (creators, source_videos, key_moments, technique_pages, technique_page_versions, related_technique_links, video_consents, consent_audit_log, posts, post_attachments, highlight_candidates, generated_shorts), serializes each to JSON with UUID/datetime handling via `default=str`, and packages them into an in-memory ZIP archive with `export_metadata.json` containing timestamp and creator_id. Related technique links include both directions (outgoing and incoming). The endpoint reuses the established auth pattern from the transparency endpoint.
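The in-memory ZIP approach described above can be sketched as follows — a minimal, self-contained version, not the actual endpoint code; `build_export_zip` and the row shape are hypothetical, but the `io.BytesIO` + `zipfile` + `__table__.columns` + `default=str` combination matches the description:

```python
import io
import json
import zipfile
from datetime import datetime, timezone

def build_export_zip(tables: dict[str, list[object]], creator_id: str) -> bytes:
    """Package per-table row lists into an in-memory ZIP archive.

    `tables` maps a table name to its ORM rows; each row is serialized by
    introspecting `__table__.columns`, and `default=str` coerces UUID and
    datetime values to plain strings.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, rows in tables.items():
            records = [
                {col.name: getattr(row, col.name) for col in row.__table__.columns}
                for row in rows
            ]
            zf.writestr(f"{name}.json", json.dumps(records, default=str, indent=2))
        metadata = {
            "exported_at": datetime.now(timezone.utc).isoformat(),
            "creator_id": creator_id,
            "note": "Binary attachments are not included; metadata only.",
        }
        zf.writestr("export_metadata.json", json.dumps(metadata, indent=2))
    return buf.getvalue()
```

Because the archive never touches disk, the endpoint can return `buf.getvalue()` directly as a streaming response body — appropriate here since per-creator datasets are small.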
The frontend adds an "Export My Data" button to CreatorDashboard.tsx with loading spinner during download, inline error display on failure, and browser download via hidden anchor + object URL on success. The `exportCreatorData()` function lives in `creator-dashboard.ts` alongside other dashboard API functions.
9 standalone ASGI tests cover ZIP validity, JSON content correctness, UUID/datetime serialization, cross-references, metadata fields, auth requirement (401), and creator-link requirement (404).
## Verification
Backend: `cd backend && python -m pytest tests/test_export.py -v` — 9/9 pass (ZIP structure, JSON validity, serialization, auth, 404). Frontend: `cd frontend && npm run build` — clean build, 0 errors.
## Requirements Advanced
None.
## Requirements Validated
None.
## New Requirements Surfaced
None.
## Requirements Invalidated or Re-scoped
None.
## Deviations
ZIP contains 13 files (12 data tables + metadata) rather than the 10 mentioned in the plan's must-haves. The plan description already referenced all 12 tables; the must-have count was understated. Export function placed in creator-dashboard.ts rather than a separate creator.ts — keeps dashboard API functions co-located.
## Known Limitations
Binary attachments (uploaded files, generated short videos) are not included in the export — only metadata. The export_metadata.json notes this. Filename uses creator_id rather than slug.
## Follow-ups
None.
## Files Created/Modified
- `backend/routers/creator_dashboard.py` — Added GET /creator/export endpoint with ZIP archive generation for 12 tables
- `backend/tests/test_export.py` — 9 standalone ASGI tests for export endpoint
- `frontend/src/pages/CreatorDashboard.tsx` — Added Export My Data button with loading/error states
- `frontend/src/pages/CreatorDashboard.module.css` — Export button styling matching dashboard design
- `frontend/src/api/creator-dashboard.ts` — Added exportCreatorData() blob download function


@ -0,0 +1,53 @@
# S07: [A] Data Export (GDPR-Style) — UAT
**Milestone:** M025
**Written:** 2026-04-04T14:20:43.401Z
## UAT: Data Export (GDPR-Style)
### Preconditions
- Logged in as a user linked to a creator account
- Creator has at least one source video processed through the pipeline (producing key moments, technique pages, etc.)
### Test 1: Export button visible on dashboard
1. Navigate to Creator Dashboard
2. **Expected:** "Export My Data" button is visible in the dashboard UI
3. Button has a download icon and matches dashboard styling
### Test 2: Successful export download
1. Click "Export My Data" button
2. **Expected:** Button shows loading state (spinner, disabled)
3. Wait for download to complete
4. **Expected:** Browser downloads a file named `chrysopedia-export-{creator_id}.zip`
5. Open the ZIP file
6. **Expected:** Contains 13 files: `creators.json`, `source_videos.json`, `key_moments.json`, `technique_pages.json`, `technique_page_versions.json`, `related_technique_links.json`, `video_consents.json`, `consent_audit_log.json`, `posts.json`, `post_attachments.json`, `highlight_candidates.json`, `generated_shorts.json`, `export_metadata.json`
7. Open `export_metadata.json`
8. **Expected:** Contains `exported_at` (ISO timestamp), `creator_id`, and a note about binary attachments not being included
### Test 3: JSON content validity
1. From the downloaded ZIP, open `creators.json`
2. **Expected:** Valid JSON array with one entry containing the creator's name, slug, and other fields
3. Open `technique_pages.json`
4. **Expected:** Valid JSON array. UUID fields are strings (not objects). Datetime fields are ISO-formatted strings.
### Test 4: Related links include cross-references
1. Open `related_technique_links.json` from the ZIP
2. **Expected:** Includes links where this creator's technique pages are the source AND links where they are the target
### Test 5: Auth required
1. Open browser dev tools, clear auth token
2. Try accessing `GET /api/v1/creator/export` directly
3. **Expected:** 401 Unauthorized response
### Test 6: Non-creator user gets 404
1. Log in as a user that is NOT linked to any creator
2. Navigate to `/api/v1/creator/export`
3. **Expected:** 404 response (no creator record found)
### Test 7: Error state on failure
1. Simulate a backend failure (e.g., stop the API mid-request or use network throttling to cause timeout)
2. **Expected:** Button returns to normal state, inline error message displayed to user
### Edge Cases
- **Empty creator (no videos):** Export should still succeed with empty arrays in each JSON file
- **Large dataset:** Export completes without timeout for creators with many videos/moments


@ -0,0 +1,16 @@
{
"schemaVersion": 1,
"taskId": "T02",
"unitId": "M025/S07/T02",
"timestamp": 1775312370686,
"passed": true,
"discoverySource": "task-plan",
"checks": [
{
"command": "cd frontend",
"exitCode": 0,
"durationMs": 8,
"verdict": "pass"
}
]
}


@ -1,6 +1,36 @@
# S08: [B] Load Testing + Fallback Resilience
**Goal:** ChatService survives primary LLM endpoint failure via automatic fallback. Load test script proves 10 concurrent chat sessions maintain acceptable latency.
**Demo:** After this: 10 concurrent chat sessions maintain acceptable latency. DGX down → Ollama fallback works.
## Tasks
- [x] **T01: Added automatic primary→fallback LLM endpoint switching in ChatService with two unit tests covering APIConnectionError and InternalServerError scenarios** — Add automatic fallback from primary to secondary LLM endpoint in ChatService, matching the pattern already used by the sync LLMClient in pipeline/llm_client.py. When the primary openai.AsyncOpenAI client fails with APIConnectionError, APITimeoutError, or InternalServerError during streaming, retry the entire create() call with a fallback client pointing at settings.llm_fallback_url + settings.llm_fallback_model. Add the fallback_used field to the SSE done event. Update docker-compose.yml to pass LLM_FALLBACK_URL=http://chrysopedia-ollama:11434/v1 to the API container. Write unit tests for both APIConnectionError and InternalServerError fallback scenarios.
Steps:
1. Read `backend/chat_service.py` and `backend/pipeline/llm_client.py` to understand the existing fallback pattern.
2. In ChatService.__init__, create `self._fallback_openai = openai.AsyncOpenAI(base_url=settings.llm_fallback_url, api_key=settings.llm_api_key)`.
3. In stream_response(), wrap the `self._openai.chat.completions.create(...)` call and its async iteration in a try/except for `(openai.APIConnectionError, openai.APITimeoutError, openai.InternalServerError)`. On catch, log WARNING with `chat_llm_fallback` prefix including the error type and message, then retry the same create() call using `self._fallback_openai` and `self.settings.llm_fallback_model`. Track `fallback_used = True`.
4. Add `fallback_used` (bool) to the done event data dict: `{"cascade_tier": ..., "conversation_id": ..., "fallback_used": fallback_used}`.
5. Update the model name logged in the usage log call — when fallback is used, pass `self.settings.llm_fallback_model` instead of `self.settings.llm_model`.
6. In `docker-compose.yml`, add `LLM_FALLBACK_URL: http://chrysopedia-ollama:11434/v1` and `LLM_FALLBACK_MODEL: fyn-llm-agent-chat` to the chrysopedia-api environment block.
7. In `backend/tests/test_chat.py`, add two test functions:
- `test_chat_fallback_on_connection_error`: mock primary openai to raise `openai.APIConnectionError`, mock fallback openai to return streaming chunks. Assert SSE events include tokens and done event has `fallback_used: true`.
- `test_chat_fallback_on_internal_server_error`: same but with `openai.InternalServerError`.
8. Run `cd backend && python -m pytest tests/test_chat.py -v -k fallback` — both tests pass.
- Estimate: 45m
- Files: backend/chat_service.py, backend/tests/test_chat.py, docker-compose.yml
- Verify: cd backend && python -m pytest tests/test_chat.py -v -k fallback
- [ ] **T02: Write async load test script for 10 concurrent chat sessions** — Create a standalone Python script that fires 10 concurrent chat requests to the SSE endpoint, parses streaming events to measure time-to-first-token (TTFT) and total response time, and reports p50/p95/max latency statistics. Uses httpx (already a project dependency) + asyncio. No external load testing tools needed.
Steps:
1. Create `scripts/load_test_chat.py` with argparse accepting `--url` (default http://localhost:8096), `--concurrency` (default 10), `--query` (default 'What are common compression techniques?').
2. Implement an async function `run_single_chat(client, url, query)` that: POSTs to `{url}/api/v1/chat` with `{"query": query}`, reads the SSE stream line-by-line, records timestamp of first `event: token` line (TTFT), records total time when stream ends, returns a result dict with ttft_ms, total_ms, token_count, error (if any).
3. Implement `run_load_test(url, concurrency, query)` that creates an httpx.AsyncClient with timeout=60s, fires `concurrency` concurrent `run_single_chat` calls via asyncio.gather, collects results.
4. Compute and print statistics: for both TTFT and total time, show min/p50/p95/max. Show error count. Show per-request summary table.
5. Add `--auth-token` optional flag for authenticated requests (sets Authorization header) to avoid IP rate limit (10/hour default). Document in script docstring that running 10 requests from one IP will hit the rate limit unless authenticated or rate limit is raised.
6. Add `--output` flag to write results as JSON to a file.
7. Validate the script runs: `python scripts/load_test_chat.py --help` exits 0.
8. Test SSE parsing logic with a small inline unit test or a `--dry-run` flag that uses a mock response.
- Estimate: 40m
- Files: scripts/load_test_chat.py
- Verify: python scripts/load_test_chat.py --help && echo 'Script OK'
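The concurrency and statistics core of the steps above can be sketched as follows; `run_load_test`, `percentile`, and `make_request` are hypothetical names, and nearest-rank percentiles stand in for whatever statistics method the script ultimately uses:

```python
import asyncio
import math
import time

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a sorted copy of `values`."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

async def run_load_test(make_request, concurrency: int = 10) -> dict:
    """Fire `concurrency` concurrent requests and summarize total latency.

    `make_request` is any coroutine function; per-request errors are
    counted rather than raised, so one failure doesn't abort the run.
    """
    async def timed() -> dict:
        start = time.perf_counter()
        try:
            await make_request()
            return {"total_ms": (time.perf_counter() - start) * 1000, "error": None}
        except Exception as exc:
            return {"total_ms": None, "error": repr(exc)}

    results = await asyncio.gather(*(timed() for _ in range(concurrency)))
    ok = [r["total_ms"] for r in results if r["error"] is None]
    return {
        "errors": sum(1 for r in results if r["error"]),
        "min_ms": min(ok) if ok else None,
        "p50_ms": percentile(ok, 50) if ok else None,
        "p95_ms": percentile(ok, 95) if ok else None,
        "max_ms": max(ok) if ok else None,
    }
```

In the real script, `make_request` would wrap the SSE-streaming POST via `httpx.AsyncClient` and also return TTFT; the gather-and-summarize shape stays the same.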


@ -0,0 +1,96 @@
# S08 Research: Load Testing + Fallback Resilience
## Summary
Two independent deliverables: (1) a load test proving 10 concurrent chat sessions maintain acceptable latency, and (2) adding LLM fallback logic to `ChatService` so chat survives when the primary LLM endpoint (DGX/OpenWebUI) is down. The load test is straightforward Python scripting with `httpx` (already a dependency). The fallback is a targeted code change in one file.
## Recommendation
**Targeted research.** The pipeline's sync `LLMClient` already has working fallback logic — the async `ChatService` just needs the same pattern adapted for `openai.AsyncOpenAI`. Load testing uses `httpx` + `asyncio` against the live endpoint. No new libraries needed.
## Implementation Landscape
### 1. Fallback Resilience (the riskier piece — build first)
**Current state:** `ChatService.__init__()` creates a single `openai.AsyncOpenAI` client pointing at `settings.llm_api_url`. When streaming fails, it yields `event: error` and returns. No retry, no fallback.
**What exists in the pipeline:** `LLMClient` (sync, `backend/pipeline/llm_client.py`) creates two clients — `self._primary` and `self._fallback` — and catches `(openai.APIConnectionError, openai.APITimeoutError)` on the primary before retrying on the fallback. This is the exact pattern to replicate.
**Production config (from container env):**
- `LLM_API_URL=https://chat.forgetyour.name/api` (OpenWebUI → DGX backend)
- `LLM_FALLBACK_URL=https://chat.forgetyour.name/api` (currently same URL — needs changing to Ollama)
- `LLM_FALLBACK_MODEL=fyn-llm-agent-chat`
**Key observation:** Both primary and fallback currently point to the same URL. For the fallback to be useful, the compose config needs `LLM_FALLBACK_URL=http://chrysopedia-ollama:11434/v1` and `LLM_FALLBACK_MODEL` set to a model that Ollama actually has loaded. This is a config change in docker-compose.yml or .env.
**Implementation in `chat_service.py`:**
- Add `self._fallback_openai = openai.AsyncOpenAI(base_url=settings.llm_fallback_url, api_key=settings.llm_api_key)`
- In `stream_response()`, wrap the streaming `create()` call in a try/except for `(openai.APIConnectionError, openai.APITimeoutError, openai.InternalServerError)`.
- On catch, log a warning, then retry with `self._fallback_openai` and `self.settings.llm_fallback_model`.
- The `event: done` payload should include a `fallback_used: true` field so the frontend/logs can distinguish.
- **Important:** Also catch `openai.InternalServerError` — the current production failure is a 500 from OpenWebUI, not a connection error.
**Files to modify:**
- `backend/chat_service.py` — add fallback client + retry logic in `stream_response()`
- `docker-compose.yml` — set `LLM_FALLBACK_URL` to Ollama endpoint
**Verification:**
- Unit test: mock primary to raise `APIConnectionError`, assert fallback is called and SSE events still stream.
- Integration: with DGX endpoint unreachable, chat should still work via Ollama.
### 2. Load Testing (10 concurrent chat sessions)
**No load testing tools installed** on either aux or ub01. No need for k6/locust/wrk — `httpx` + `asyncio` is sufficient for 10 concurrent connections to a streaming SSE endpoint.
**Architecture:**
- Single Python script (`tests/load_test_chat.py` or `scripts/load_test_chat.py`)
- Uses `httpx.AsyncClient` with `stream=True` to POST to `/api/v1/chat`
- Fires 10 concurrent requests via `asyncio.gather()`
- Measures: time-to-first-token (TTFT), total response time, error rate
- Reports p50/p95/max latencies
**Target endpoint:** `http://ub01:8096/api/v1/chat` (production, through nginx)
**Rate limiting concern:** Default `rate_limit_ip_per_hour=10` means 10 concurrent requests from one IP will exhaust the limit. The load test needs to either:
- Temporarily increase the rate limit, or
- Use authenticated users (30/hour default), or
- Run from within Docker network (bypass nginx, hit API directly)
**Recommended approach:** Run against `http://chrysopedia-api:8000/api/v1/chat` from inside the Docker network (via `docker exec` or a temporary container on the same network) to avoid nginx buffering artifacts. Alternatively, temporarily set `RATE_LIMIT_IP_PER_HOUR=100` for the test.
**"Acceptable latency" target:** The slice says "maintain acceptable latency." R015 sets a 30-second retrieval target for search-to-read. Chat is not R015-scoped, but a reasonable bar: TTFT < 5s, total completion < 30s for a typical query with 10 concurrent users. The key metric is degradation: does latency at 10 concurrent sessions differ meaningfully from latency at 1?
**SSE parsing:** The load test needs to parse SSE events from the stream to measure TTFT (time from request to first `event: token`). The format is `event: <type>\ndata: <json>\n\n`.
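A minimal parser for that format, tracking time-to-first-token, might look like the following — `measure_ttft` is a hypothetical helper for illustration, not code from the repository:

```python
import time

def measure_ttft(lines, start: float, now=time.perf_counter) -> dict:
    """Walk decoded SSE lines, recording TTFT and a token-event count.

    An event type comes from an `event: <type>` line; its payload arrives
    on the following `data: <json>` line; a blank line ends the event.
    """
    ttft_ms = None
    token_count = 0
    current_event = None
    for line in lines:
        if line.startswith("event: "):
            current_event = line[len("event: "):].strip()
        elif line.startswith("data: ") and current_event == "token":
            token_count += 1
            if ttft_ms is None:
                ttft_ms = (now() - start) * 1000
        elif line == "":
            current_event = None  # blank line terminates the event
    return {"ttft_ms": ttft_ms, "token_count": token_count}
```

In the script this would consume `response.aiter_lines()` from httpx's streaming API, with `start` captured just before the POST is issued.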
**Bottleneck analysis:**
- Single uvicorn worker (no `--workers` flag in Dockerfile CMD) — all 10 requests share one event loop. Async FastAPI + async openai client should handle this fine, but if any blocking call exists, it will serialize.
- Embedding calls (via Ollama) for search context retrieval could be the bottleneck — Ollama processes sequentially.
- LLM streaming is the longest phase — 10 concurrent streams to OpenWebUI/DGX should be fine if the backend supports it.
### Natural Task Decomposition
1. **T01: Add fallback to ChatService** — modify `chat_service.py` to create a fallback async client and retry on primary failure. Add unit test. ~30min.
2. **T02: Configure Ollama fallback in deployment** — set `LLM_FALLBACK_URL` and `LLM_FALLBACK_MODEL` in docker-compose.yml. Verify Ollama has the model loaded. ~15min.
3. **T03: Write load test script** — Python asyncio script that fires 10 concurrent chat requests, parses SSE, reports TTFT/total latency/error metrics. ~30min.
4. **T04: Run load test + document results** — Execute against production, capture results, write brief report. ~20min.
T01 and T02 can be done in parallel. T03 is independent. T04 depends on T01+T02+T03.
## Constraints & Risks
- **Ollama model availability:** Need to verify `chrysopedia-ollama` has `fyn-llm-agent-chat` or an equivalent model. If not, a model pull is needed first.
- **Rate limiter:** Load test will hit IP rate limits at default settings. Must plan around this.
- **Single worker:** The API runs a single uvicorn worker. This is fine for async I/O-bound work but any CPU-bound processing (JSON parsing, Pydantic validation) will serialize under load.
- **Current LLM endpoint is down:** `https://chat.forgetyour.name/api` returned 500 during research. This makes fallback resilience immediately relevant — it's a real production issue right now.
## Key Files
| File | Role |
|------|------|
| `backend/chat_service.py` | Chat service — needs fallback logic |
| `backend/pipeline/llm_client.py` | Sync LLM client with working fallback pattern to replicate |
| `backend/config.py` | Settings — already has `llm_fallback_url` and `llm_fallback_model` |
| `backend/routers/chat.py` | Chat router — no changes needed |
| `backend/tests/test_chat.py` | Existing chat tests — add fallback test |
| `docker-compose.yml` | Deployment config — needs `LLM_FALLBACK_URL` env var |
| `docker/Dockerfile.api` | Single uvicorn worker — context for load test expectations |


@ -0,0 +1,39 @@
---
estimated_steps: 12
estimated_files: 3
skills_used: []
---
# T01: Add LLM fallback client to ChatService with unit tests
Add automatic fallback from primary to secondary LLM endpoint in ChatService, matching the pattern already used by the sync LLMClient in pipeline/llm_client.py. When the primary openai.AsyncOpenAI client fails with APIConnectionError, APITimeoutError, or InternalServerError during streaming, retry the entire create() call with a fallback client pointing at settings.llm_fallback_url + settings.llm_fallback_model. Add the fallback_used field to the SSE done event. Update docker-compose.yml to pass LLM_FALLBACK_URL=http://chrysopedia-ollama:11434/v1 to the API container. Write unit tests for both APIConnectionError and InternalServerError fallback scenarios.
Steps:
1. Read `backend/chat_service.py` and `backend/pipeline/llm_client.py` to understand the existing fallback pattern.
2. In ChatService.__init__, create `self._fallback_openai = openai.AsyncOpenAI(base_url=settings.llm_fallback_url, api_key=settings.llm_api_key)`.
3. In stream_response(), wrap the `self._openai.chat.completions.create(...)` call and its async iteration in a try/except for `(openai.APIConnectionError, openai.APITimeoutError, openai.InternalServerError)`. On catch, log WARNING with `chat_llm_fallback` prefix including the error type and message, then retry the same create() call using `self._fallback_openai` and `self.settings.llm_fallback_model`. Track `fallback_used = True`.
4. Add `fallback_used` (bool) to the done event data dict: `{"cascade_tier": ..., "conversation_id": ..., "fallback_used": fallback_used}`.
5. Update the model name logged in the usage log call — when fallback is used, pass `self.settings.llm_fallback_model` instead of `self.settings.llm_model`.
6. In `docker-compose.yml`, add `LLM_FALLBACK_URL: http://chrysopedia-ollama:11434/v1` and `LLM_FALLBACK_MODEL: fyn-llm-agent-chat` to the chrysopedia-api environment block.
7. In `backend/tests/test_chat.py`, add two test functions:
- `test_chat_fallback_on_connection_error`: mock primary openai to raise `openai.APIConnectionError`, mock fallback openai to return streaming chunks. Assert SSE events include tokens and done event has `fallback_used: true`.
- `test_chat_fallback_on_internal_server_error`: same but with `openai.InternalServerError`.
8. Run `cd backend && python -m pytest tests/test_chat.py -v -k fallback` — both tests pass.
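The fallback behavior targeted by step 7 can be illustrated without the real openai client — a simplified sketch where `PrimaryDown` stands in for openai's error types and the async generators stand in for streaming create() calls (all names here are hypothetical):

```python
import asyncio

class PrimaryDown(Exception):
    """Stand-in for openai.APIConnectionError / InternalServerError."""

async def stream_with_fallback(primary, fallback):
    """Yield (token, fallback_used) pairs, retrying on the fallback stream.

    `primary` and `fallback` are async generators; the real ChatService
    catches the specific openai error types and also resets any
    partially accumulated response before retrying.
    """
    fallback_used = False
    try:
        async for token in primary:
            yield token, fallback_used
    except PrimaryDown:
        fallback_used = True
        async for token in fallback:
            yield token, fallback_used

async def failing_stream():
    raise PrimaryDown("connection refused")
    yield  # unreachable; makes this an async generator

async def ok_stream():
    for token in ("Hello", " world"):
        yield token
```

The unit tests assert the same observable contract: tokens still stream after a primary failure, and the done event carries `fallback_used: true`.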
## Inputs
- `backend/chat_service.py` — current ChatService with single openai client, no fallback
- `backend/pipeline/llm_client.py` — reference pattern for primary/fallback logic
- `backend/config.py` — Settings with llm_fallback_url and llm_fallback_model fields
- `backend/tests/test_chat.py` — existing chat tests with standalone ASGI client pattern
- `docker-compose.yml` — deployment config, needs LLM_FALLBACK_URL env var
## Expected Output
- `backend/chat_service.py` — ChatService with fallback AsyncOpenAI client and retry logic in stream_response
- `backend/tests/test_chat.py` — two new test functions for fallback on APIConnectionError and InternalServerError
- `docker-compose.yml` — LLM_FALLBACK_URL and LLM_FALLBACK_MODEL in chrysopedia-api environment
## Verification
cd backend && python -m pytest tests/test_chat.py -v -k fallback


@ -0,0 +1,79 @@
---
id: T01
parent: S08
milestone: M025
provides: []
requires: []
affects: []
key_files: ["backend/chat_service.py", "backend/tests/test_chat.py", "docker-compose.yml"]
key_decisions: ["Catch APIConnectionError, APITimeoutError, and InternalServerError on primary create() then retry with fallback — matches sync LLMClient pattern"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "Ran cd backend && python -m pytest tests/test_chat.py -v -k fallback — 5 passed. Ran full suite — 26/26 passed."
completed_at: 2026-04-04T14:31:10.052Z
blocker_discovered: false
---
# T01: Added automatic primary→fallback LLM endpoint switching in ChatService with two unit tests covering APIConnectionError and InternalServerError scenarios
> Added automatic primary→fallback LLM endpoint switching in ChatService with two unit tests covering APIConnectionError and InternalServerError scenarios
## What Happened
Added a _fallback_openai AsyncOpenAI client to ChatService.__init__ using settings.llm_fallback_url. Wrapped the primary streaming create() call in a try/except for (APIConnectionError, APITimeoutError, InternalServerError). On catch, logs WARNING with chat_llm_fallback prefix, resets accumulated response, and retries the entire streaming call using the fallback client and settings.llm_fallback_model. If fallback also fails, emits SSE error event. The fallback_used boolean is included in the done event and the usage log records the actual model used. Added LLM_FALLBACK_URL and LLM_FALLBACK_MODEL to docker-compose.yml API environment. Wrote two test functions with side_effect mock factory accounting for SearchService's AsyncOpenAI call ordering.
## Verification
Ran cd backend && python -m pytest tests/test_chat.py -v -k fallback — 5 passed. Ran full suite — 26/26 passed.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `cd backend && python -m pytest tests/test_chat.py -v -k fallback` | 0 | ✅ pass | 5600ms |
| 2 | `cd backend && python -m pytest tests/test_chat.py -v` | 0 | ✅ pass | 4400ms |
## Deviations
Test mock factory uses call_count=2/3 instead of 1/2 because patching chat_service.openai.AsyncOpenAI intercepts SearchService's constructor call as well (shared module object).
## Known Issues
None.
## Files Created/Modified
- `backend/chat_service.py`
- `backend/tests/test_chat.py`
- `docker-compose.yml`


@ -0,0 +1,32 @@
---
estimated_steps: 10
estimated_files: 1
skills_used: []
---
# T02: Write async load test script for 10 concurrent chat sessions
Create a standalone Python script that fires 10 concurrent chat requests to the SSE endpoint, parses streaming events to measure time-to-first-token (TTFT) and total response time, and reports p50/p95/max latency statistics. Uses httpx (already a project dependency) + asyncio. No external load testing tools needed.
Steps:
1. Create `scripts/load_test_chat.py` with argparse accepting `--url` (default http://localhost:8096), `--concurrency` (default 10), `--query` (default 'What are common compression techniques?').
2. Implement an async function `run_single_chat(client, url, query)` that: POSTs to `{url}/api/v1/chat` with `{"query": query}`, reads the SSE stream line-by-line, records timestamp of first `event: token` line (TTFT), records total time when stream ends, returns a result dict with ttft_ms, total_ms, token_count, error (if any).
3. Implement `run_load_test(url, concurrency, query)` that creates an httpx.AsyncClient with timeout=60s, fires `concurrency` concurrent `run_single_chat` calls via asyncio.gather, collects results.
4. Compute and print statistics: for both TTFT and total time, show min/p50/p95/max. Show error count. Show per-request summary table.
5. Add `--auth-token` optional flag for authenticated requests (sets Authorization header) to avoid IP rate limit (10/hour default). Document in script docstring that running 10 requests from one IP will hit the rate limit unless authenticated or rate limit is raised.
6. Add `--output` flag to write results as JSON to a file.
7. Validate the script runs: `python scripts/load_test_chat.py --help` exits 0.
8. Test SSE parsing logic with a small inline unit test or a `--dry-run` flag that uses a mock response.
## Inputs
- `backend/routers/chat.py` — chat endpoint contract (POST /api/v1/chat, SSE response format)
- `backend/chat_service.py` — SSE event protocol (sources, token, done, error)
## Expected Output
- `scripts/load_test_chat.py` — standalone async load test script with SSE parsing, latency measurement, and statistics reporting
## Verification
python scripts/load_test_chat.py --help && echo 'Script OK'


@ -59,6 +59,10 @@ class ChatService:
base_url=settings.llm_api_url,
api_key=settings.llm_api_key,
)
self._fallback_openai = openai.AsyncOpenAI(
base_url=settings.llm_fallback_url,
api_key=settings.llm_api_key,
)
self._redis = redis

async def _load_history(self, conversation_id: str) -> list[dict[str, str]]:
@ -244,6 +248,7 @@ class ChatService:
accumulated_response = ""
usage_data: dict[str, int] | None = None
fallback_used = False
try:
stream = await self._openai.chat.completions.create(
@ -269,6 +274,44 @@ class ChatService:
accumulated_response += text
yield _sse("token", text)
except (openai.APIConnectionError, openai.APITimeoutError, openai.InternalServerError) as exc:
logger.warning(
"chat_llm_fallback primary failed (%s: %s), retrying with fallback at %s",
type(exc).__name__, exc, self.settings.llm_fallback_url,
)
fallback_used = True
accumulated_response = ""
usage_data = None
try:
stream = await self._fallback_openai.chat.completions.create(
model=self.settings.llm_fallback_model,
messages=messages,
stream=True,
stream_options={"include_usage": True},
temperature=temperature,
max_tokens=2048,
)
async for chunk in stream:
if hasattr(chunk, "usage") and chunk.usage is not None:
usage_data = {
"prompt_tokens": chunk.usage.prompt_tokens or 0,
"completion_tokens": chunk.usage.completion_tokens or 0,
"total_tokens": chunk.usage.total_tokens or 0,
}
choice = chunk.choices[0] if chunk.choices else None
if choice and choice.delta and choice.delta.content:
text = choice.delta.content
accumulated_response += text
yield _sse("token", text)
except Exception:
tb = traceback.format_exc()
logger.error("chat_llm_error fallback also failed query=%r cid=%s\n%s", query, conversation_id, tb)
yield _sse("error", {"message": "LLM generation failed"})
return
except Exception:
tb = traceback.format_exc()
logger.error("chat_llm_error query=%r cid=%s\n%s", query, conversation_id, tb)
@@ -301,7 +344,7 @@ class ChatService:
query=query,
usage=usage_data,
cascade_tier=cascade_tier,
model=self.settings.llm_fallback_model if fallback_used else self.settings.llm_model,
latency_ms=latency_ms,
)
@@ -311,7 +354,7 @@ class ChatService:
query, creator, cascade_tier, len(sources), latency_ms, conversation_id,
usage_data.get("total_tokens", 0),
)
yield _sse("done", {"cascade_tier": cascade_tier, "conversation_id": conversation_id, "fallback_used": fallback_used})

# ── Helpers ──────────────────────────────────────────────────────────────────

backend/tests/test_chat.py

@@ -20,6 +20,7 @@ from unittest.mock import AsyncMock, MagicMock, patch
import pytest
import pytest_asyncio
import openai
from httpx import ASGITransport, AsyncClient

# Ensure backend/ is on sys.path
@@ -958,3 +959,120 @@ async def test_personality_weight_string_returns_422(chat_client):
json={"query": "test", "personality_weight": "high"},
)
assert resp.status_code == 422
# ── LLM fallback tests ──────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_chat_fallback_on_connection_error(chat_client):
"""When primary LLM raises APIConnectionError, fallback client serves the response."""
search_result = _fake_search_result()
# Primary client raises on create()
mock_primary = MagicMock()
mock_primary.chat.completions.create = AsyncMock(
side_effect=openai.APIConnectionError(request=MagicMock()),
)
# Fallback client succeeds
mock_fallback = MagicMock()
mock_fallback.chat.completions.create = AsyncMock(
return_value=_mock_openai_stream(["fallback ", "answer"]),
)
# AsyncOpenAI is called 3 times in ChatService.__init__:
# 1. SearchService (irrelevant, search is mocked)
# 2. self._openai (primary)
# 3. self._fallback_openai (fallback)
call_count = 0
def _make_client(**kwargs):
nonlocal call_count
call_count += 1
if call_count == 2:
return mock_primary
if call_count == 3:
return mock_fallback
return MagicMock()
with (
patch("chat_service.SearchService.search", new_callable=AsyncMock, return_value=search_result),
patch("chat_service.openai.AsyncOpenAI", side_effect=_make_client),
):
resp = await chat_client.post("/api/v1/chat", json={"query": "test fallback"})
assert resp.status_code == 200
events = _parse_sse(resp.text)
event_types = [e["event"] for e in events]
assert "sources" in event_types
assert "token" in event_types
assert "done" in event_types
assert "error" not in event_types
# Verify tokens came from fallback
token_texts = [e["data"] for e in events if e["event"] == "token"]
combined = "".join(token_texts)
assert "fallback answer" in combined
# Done event should have fallback_used=True
done_data = next(e for e in events if e["event"] == "done")["data"]
assert done_data["fallback_used"] is True
@pytest.mark.asyncio
async def test_chat_fallback_on_internal_server_error(chat_client):
"""When primary LLM raises InternalServerError, fallback client serves the response."""
search_result = _fake_search_result()
# Primary client raises InternalServerError on create()
mock_primary = MagicMock()
mock_primary.chat.completions.create = AsyncMock(
side_effect=openai.InternalServerError(
message="GPU OOM",
response=MagicMock(status_code=500),
body=None,
),
)
# Fallback client succeeds
mock_fallback = MagicMock()
mock_fallback.chat.completions.create = AsyncMock(
return_value=_mock_openai_stream(["recovered ", "response"]),
)
call_count = 0
def _make_client(**kwargs):
nonlocal call_count
call_count += 1
if call_count == 2:
return mock_primary
if call_count == 3:
return mock_fallback
return MagicMock()
with (
patch("chat_service.SearchService.search", new_callable=AsyncMock, return_value=search_result),
patch("chat_service.openai.AsyncOpenAI", side_effect=_make_client),
):
resp = await chat_client.post("/api/v1/chat", json={"query": "test ise fallback"})
assert resp.status_code == 200
events = _parse_sse(resp.text)
event_types = [e["event"] for e in events]
assert "sources" in event_types
assert "token" in event_types
assert "done" in event_types
assert "error" not in event_types
# Verify tokens from fallback
token_texts = [e["data"] for e in events if e["event"] == "token"]
combined = "".join(token_texts)
assert "recovered response" in combined
# Done event should have fallback_used=True
done_data = next(e for e in events if e["event"] == "done")["data"]
assert done_data["fallback_used"] is True

docker-compose.yml

@@ -121,6 +121,8 @@ services:
REDIS_URL: redis://chrysopedia-redis:6379/0
QDRANT_URL: http://chrysopedia-qdrant:6333
EMBEDDING_API_URL: http://chrysopedia-ollama:11434/v1
LLM_FALLBACK_URL: http://chrysopedia-ollama:11434/v1
LLM_FALLBACK_MODEL: fyn-llm-agent-chat
PROMPTS_PATH: /prompts
volumes:
- /vmPool/r/services/chrysopedia_data:/data
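The two new compose variables surface in the service as `settings.llm_fallback_url` and `settings.llm_fallback_model`. The backend's actual settings mechanism is not shown in this diff (pydantic-settings would be typical for a FastAPI service, but that is an assumption); the mapping is simply env var to lowercase attribute, with the in-cluster Ollama endpoint as a sensible default:

```python
# Sketch of how LLM_FALLBACK_URL / LLM_FALLBACK_MODEL could reach
# ChatService. Stdlib-only illustration; the real settings object may
# differ. Defaults mirror the docker-compose values above.
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    llm_fallback_url: str = field(
        default_factory=lambda: os.environ.get(
            "LLM_FALLBACK_URL", "http://chrysopedia-ollama:11434/v1"
        )
    )
    llm_fallback_model: str = field(
        default_factory=lambda: os.environ.get(
            "LLM_FALLBACK_MODEL", "fyn-llm-agent-chat"
        )
    )
```

Reading the env vars lazily via `default_factory` means a `Settings()` built after the container environment is in place picks up the compose values without extra wiring.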