Add "Monitoring"

2026-04-04 00:04:24 -06:00 · ae5b8ca475
commit ae5b8ca475
parent 85fdfe03b0
1 changed file with 133 additions and 0 deletions
--- a/Monitoring.md
+++ b/Monitoring.md
@@ -0,0 +1,133 @@
 # Monitoring
 ## Automated Health Probe
 Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.
 **Location:** `forgejo-optimize/chrysopedia-health-probe/` (on dev01)
 ### How It Works
 ```
 dev01 (probe.py)
  ├── SSH → ub01: docker logs, docker inspect, df
  ├── SSH → ub01: docker exec curl (health endpoints)
  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
 ```
 1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
 2. Runs 14 pattern detectors (see table below)
 3. Checks health endpoints and response times
 4. Checks container restart counts and OOM status
 5. Checks disk space on `/vmPool`
 6. Deduplicates against existing open Forgejo issues
 7. Files new issues or adds comments to existing ones
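Step 1 can be sketched as follows. This is a minimal illustration, not the code in `probe.py`: the function names `build_log_command` and `fetch_logs` are hypothetical, and it assumes key-based SSH access from dev01 to ub01. `docker logs --since 30m` is the standard Docker flag for a relative time window.

```python
import shlex
import subprocess
from typing import List

SSH_HOST = "ub01"    # host running the Docker stack
LOG_WINDOW = "30m"   # matches the 30-minute window in step 1


def build_log_command(container: str, window: str = LOG_WINDOW) -> List[str]:
    """Build the ssh command that pulls recent logs for one container."""
    # The remote side runs through a shell, so quote anything interpolated.
    remote = (
        f"docker logs --since {shlex.quote(window)} "
        f"{shlex.quote(container)} 2>&1"
    )
    return ["ssh", SSH_HOST, remote]


def fetch_logs(container: str, timeout: int = 60) -> str:
    """Run the command and return combined stdout/stderr (hypothetical helper)."""
    result = subprocess.run(
        build_log_command(container),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failing container should not crash the probe
    )
    return result.stdout
```

Keeping command construction separate from execution makes the command easy to inspect in a dry run.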
 ### Running the Probe
 ```bash
 # On dev01:
 cd ~/projects/forgejo-optimize/chrysopedia-health-probe
 # Dry run — shows detections without filing issues
 python3 probe.py --dry-run --verbose
 # Live run — files/comments on Forgejo issues
 python3 probe.py
 ```
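The two flags shown above could be wired up roughly like this. It is a sketch only: the real argument handling in `probe.py` is not shown on this page, and `parse_args` and `should_file_issues` are hypothetical names.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the probe's command-line flags (assumed layout)."""
    parser = argparse.ArgumentParser(prog="probe.py")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="show detections without filing or commenting on issues",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="print extra detail about each detection",
    )
    return parser.parse_args(argv)


def should_file_issues(args: argparse.Namespace) -> bool:
    """Only a live run is allowed to touch the Forgejo API for writes."""
    return not args.dry_run
```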
 ### Detection Matrix
 | Pattern | Severity | Source | Labels |
 |---------|----------|--------|--------|
 | Python traceback | high | all container logs | `bug` |
 | HTTP 500 | high | api/web logs | `bug` |
 | HTTP 502/503/504 | high | web logs | `infra` |
 | DB connection error | high | api/worker logs | `database`, `infra` |
 | Qdrant error | high | api logs | `search`, `infra` |
 | Redis error | high | api/worker logs | `infra` |
 | OOM killed | high | docker inspect | `infra` |
 | Celery task failure | medium | worker logs | `celery`, `bug` |
 | Celery task timeout | medium | worker logs | `celery`, `performance` |
 | Slow response (>2s) | medium | health endpoints | `performance` |
 | Container restart | medium | docker inspect | `infra` |
 | Pipeline stage failure | medium | worker logs | `pipeline`, `bug` |
 | LightRAG unreachable | medium | health endpoint | `infra` |
 | Disk space >85% | low | `df /vmPool` | `infra` |
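A few rows of the matrix can be expressed as regex-based detectors. The sketch below is illustrative: the regular expressions are assumptions about the log format (an access-log style line such as `"GET /x HTTP/1.1" 500`), not the patterns the probe actually uses, and the per-source scoping from the table is omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detector:
    name: str                 # "Pattern" column
    severity: str             # "Severity" column
    labels: Tuple[str, ...]   # "Labels" column
    regex: re.Pattern         # assumed pattern, not taken from probe.py


DETECTORS = [
    Detector("Python traceback", "high", ("bug",),
             re.compile(r"Traceback \(most recent call last\)")),
    Detector("HTTP 500", "high", ("bug",),
             re.compile(r'HTTP/\d(?:\.\d)?"\s+500\b')),
    Detector("HTTP 502/503/504", "high", ("infra",),
             re.compile(r'HTTP/\d(?:\.\d)?"\s+50[234]\b')),
]


def detect(log_text: str) -> List[Tuple[Detector, str]]:
    """Return each detector that fired, with the first matching log line."""
    hits = []
    lines = log_text.splitlines()
    for detector in DETECTORS:
        for line in lines:
            if detector.regex.search(line):
                hits.append((detector, line))
                break  # one hit per detector is enough for a detection
    return hits
```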
 ### Deduplication
 Each detection gets a **fingerprint** (`{container}::{pattern}`) embedded in the issue title as `[fp:xxx]`.
 - If an open issue with the same fingerprint exists → a **comment** is added instead of filing a new issue
 - If that issue has accumulated 5+ auto-comments without being resolved → it is **escalated** with the `triage-needed` label
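The rules above reduce to a small decision function. This is a sketch under stated assumptions: the page does not say what `xxx` in `[fp:xxx]` contains, so a short SHA-1 prefix of the fingerprint is assumed here, and the issue dictionaries (`number`, `title`, `auto_comments`) are a hypothetical shape rather than the Forgejo API response.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

ESCALATION_THRESHOLD = 5  # "5+ auto-comments" from the rule above


def fingerprint(container: str, pattern: str) -> str:
    """Fingerprint format documented above: {container}::{pattern}."""
    return f"{container}::{pattern}"


def fp_tag(container: str, pattern: str) -> str:
    """Title tag; hashing to 8 hex chars is an assumption, not confirmed."""
    raw = fingerprint(container, pattern).encode("utf-8")
    return f"[fp:{hashlib.sha1(raw).hexdigest()[:8]}]"


def decide(tag: str, open_issues: List[Dict]) -> Tuple[str, Optional[int]]:
    """Return ("create" | "comment" | "escalate", issue number or None)."""
    for issue in open_issues:
        if tag in issue["title"]:
            # Counts auto-comments that exist before this run (assumption).
            if issue["auto_comments"] >= ESCALATION_THRESHOLD:
                return ("escalate", issue["number"])
            return ("comment", issue["number"])
    return ("create", None)
```

Carrying the tag in the title means one listing of open issues is enough to deduplicate a whole run.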
 ### Issue Labels
 The probe uses these labels (auto-created on the repo):
 | Label | Color | Purpose |
 |-------|-------|--------|
 | `auto-detected` | blue | Filed by health probe |
 | `bug` | red | Application error |
 | `performance` | yellow | Slow/high latency |
 | `infra` | purple | Infrastructure issue |
 | `celery` | light yellow | Worker/task issue |
 | `database` | green | PostgreSQL issue |
 | `search` | blue | Qdrant/embedding issue |
 | `pipeline` | light blue | LLM pipeline issue |
 | `triage-needed` | pink | Needs human review |
 | `agent-fixable` | light blue | Claude agent can fix |
 | `severity:high/medium/low` | red/pink/light | Impact level |
 ### Issue Template
 Auto-filed issues follow this structure:
 ```markdown
 ## Auto-Detected: {Category}
 **Detected:** {timestamp}
 **Container:** {container_name}
 **Severity:** {high|medium|low}
 ### Evidence
 {log snippet — max 30 lines}
 ### Context
 - Container uptime / restart count
 ### Suggested Investigation
 1. Step 1
 2. Step 2
 3. Step 3
 ```
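Filling in the template might look like the sketch below. `render_issue_body` is a hypothetical helper, and keeping the *last* 30 log lines (rather than the first) is an assumption; the page only states the 30-line cap.

```python
from datetime import datetime, timezone
from typing import List, Optional

MAX_EVIDENCE_LINES = 30  # cap stated in the template
SEVERITIES = ("high", "medium", "low")


def render_issue_body(
    category: str,
    container: str,
    severity: str,
    log_lines: List[str],
    restart_count: int,
    steps: List[str],
    detected: Optional[datetime] = None,
) -> str:
    """Render the Markdown body of an auto-filed issue."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    when = detected or datetime.now(timezone.utc)
    evidence = log_lines[-MAX_EVIDENCE_LINES:]  # most recent lines (assumption)
    numbered = [f"{n}. {step}" for n, step in enumerate(steps, start=1)]
    lines = [
        f"## Auto-Detected: {category}",
        f"**Detected:** {when.isoformat()}",
        f"**Container:** {container}",
        f"**Severity:** {severity}",
        "### Evidence",
        *evidence,
        "### Context",
        f"- Restart count: {restart_count}",
        "### Suggested Investigation",
        *numbered,
    ]
    return "\n".join(lines)
```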
 ## Bugfix Agent
 A scheduled Claude agent can triage auto-detected issues:
 **Prompt:** `forgejo-optimize/chrysopedia-bugfix-agent-prompt.md`
 **Workflow:**
 1. Lists open issues labeled both `auto-detected` and `agent-fixable`
 2. For each (oldest first, max 3 per run):
    - SSHs to ub01, reads logs, inspects source code
    - Classifies as: agent-fixable, needs-human, transient, or resolved
    - If fixable: creates branch, commits fix, opens PR
    - If not: posts root-cause analysis comment
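The selection rule in steps 1 and 2 (both labels, oldest first, at most 3 per run) can be sketched as below. The issue dictionaries are a simplified, hypothetical shape with `labels` as a list of names and `created_at` as an ISO 8601 string; the agent itself is driven by the prompt file, not by this code.

```python
from datetime import datetime
from typing import Dict, List

REQUIRED_LABELS = {"auto-detected", "agent-fixable"}
MAX_ISSUES_PER_RUN = 3
# Outcomes the agent assigns after investigating an issue.
CLASSIFICATIONS = ("agent-fixable", "needs-human", "transient", "resolved")


def _created(issue: Dict) -> datetime:
    """Parse created_at; a trailing 'Z' is normalised for older Pythons."""
    return datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))


def select_issues(open_issues: List[Dict]) -> List[Dict]:
    """Pick the issues one agent run should work on."""
    eligible = [
        issue for issue in open_issues
        if REQUIRED_LABELS <= set(issue["labels"])  # must carry both labels
    ]
    return sorted(eligible, key=_created)[:MAX_ISSUES_PER_RUN]
```

Capping each run keeps a single scheduled invocation bounded even when the backlog is large.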
 ## Manual Monitoring
 | What | How |
 |------|-----|
 | Container status | `ssh ub01 "docker ps --filter name=chrysopedia"` |
 | API health | `ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"` |
 | LightRAG health | `ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"` |
 | Resource usage | `ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' \| grep chrysopedia"` |
 | Recent errors | `ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 \| grep -i error"` |
 | Disk space | `ssh ub01 "df -h /vmPool"` |
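The disk check can also be scripted against the `df` output. The sketch below assumes the usual two-line `df` layout (header, then one row for `/vmPool`) and a filesystem name without spaces; `parse_use_percent` and `disk_over_threshold` are hypothetical helpers, with the 85% threshold taken from the Detection Matrix.

```python
from typing import Optional

DISK_THRESHOLD_PERCENT = 85  # "Disk space >85%" in the Detection Matrix


def parse_use_percent(df_output: str) -> Optional[int]:
    """Extract the Use% value from the first data row of df output."""
    rows = [row for row in df_output.strip().splitlines() if row.strip()]
    if len(rows) < 2:
        return None  # header only, or empty output
    for field in rows[1].split():
        if field.endswith("%") and field[:-1].isdigit():
            return int(field[:-1])
    return None


def disk_over_threshold(
    df_output: str, threshold: int = DISK_THRESHOLD_PERCENT
) -> bool:
    """True when usage is strictly above the threshold."""
    used = parse_use_percent(df_output)
    return used is not None and used > threshold
```

The function takes text rather than running `df` itself, so it works on output captured from `ssh ub01 "df -h /vmPool"`.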
 ---
 *See also: [[Deployment]], [[Architecture]]*