From ae5b8ca475283acd7cfa5f74ceba40720481f8d5 Mon Sep 17 00:00:00 2001
From: xpltd_admin
Date: Sat, 4 Apr 2026 00:04:24 -0600
Subject: [PATCH] Add "Monitoring"

---
 Monitoring.md | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 Monitoring.md

diff --git a/Monitoring.md b/Monitoring.md
new file mode 100644
index 0000000..b6ea87c
--- /dev/null
+++ b/Monitoring.md
@@ -0,0 +1,133 @@
+# Monitoring
+
+## Automated Health Probe
+
+Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects issues, and files them as Forgejo issues.
+
+**Location:** `forgejo-optimize/chrysopedia-health-probe/` (on dev01)
+
+### How It Works
+
+```
+dev01 (probe.py)
+  ├── SSH → ub01: docker logs, docker inspect, df
+  ├── SSH → ub01: docker exec curl (health endpoints)
+  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
+```
+
+1. Pulls last 30 minutes of container logs from all 10 Chrysopedia containers
+2. Runs 14 pattern detectors (see table below)
+3. Checks health endpoints and response times
+4. Checks container restart counts and OOM status
+5. Checks disk space on `/vmPool`
+6. Deduplicates against existing open Forgejo issues
+7. Files new issues or adds comments to existing ones
+
+### Running the Probe
+
+```bash
+# On dev01:
+cd ~/projects/forgejo-optimize/chrysopedia-health-probe
+
+# Dry run — shows detections without filing issues
+python3 probe.py --dry-run --verbose
+
+# Live run — files/comments on Forgejo issues
+python3 probe.py
+```
+
+### Detection Matrix
+
+| Pattern | Severity | Source | Labels |
+|---------|----------|--------|--------|
+| Python traceback | high | all container logs | `bug` |
+| HTTP 500 | high | api/web logs | `bug` |
+| HTTP 502/503/504 | high | web logs | `infra` |
+| DB connection error | high | api/worker logs | `database`, `infra` |
+| Qdrant error | high | api logs | `search`, `infra` |
+| Redis error | high | api/worker logs | `infra` |
+| OOM killed | high | docker inspect | `infra` |
+| Celery task failure | medium | worker logs | `celery`, `bug` |
+| Celery task timeout | medium | worker logs | `celery`, `performance` |
+| Slow response (>2s) | medium | health endpoints | `performance` |
+| Container restart | medium | docker inspect | `infra` |
+| Pipeline stage failure | medium | worker logs | `pipeline`, `bug` |
+| LightRAG unreachable | medium | health endpoint | `infra` |
+| Disk space >85% | low | `df /vmPool` | `infra` |
+
+### Deduplication
+
+Each detection gets a **fingerprint** (`{container}::{pattern}`) embedded in the issue title as `[fp:xxx]`.
+
+- If an open issue with the same fingerprint exists → **comment** added instead of new issue
+- If the same issue has 5+ auto-comments without resolution → **escalated** with `triage-needed` label
+
+### Issue Labels
+
+The probe uses these labels (auto-created on the repo):
+
+| Label | Color | Purpose |
+|-------|-------|--------|
+| `auto-detected` | blue | Filed by health probe |
+| `bug` | red | Application error |
+| `performance` | yellow | Slow/high latency |
+| `infra` | purple | Infrastructure issue |
+| `celery` | light yellow | Worker/task issue |
+| `database` | green | PostgreSQL issue |
+| `search` | blue | Qdrant/embedding issue |
+| `pipeline` | light blue | LLM pipeline issue |
+| `triage-needed` | pink | Needs human review |
+| `agent-fixable` | light blue | Claude agent can fix |
+| `severity:high/medium/low` | red/pink/light | Impact level |
+
+### Issue Template
+
+Auto-filed issues follow this structure:
+
+```markdown
+## Auto-Detected: {Category}
+
+**Detected:** {timestamp}
+**Container:** {container_name}
+**Severity:** {high|medium|low}
+
+### Evidence
+{log snippet — max 30 lines}
+
+### Context
+- Container uptime / restart count
+
+### Suggested Investigation
+1. Step 1
+2. Step 2
+3. Step 3
+```
+
+## Bugfix Agent
+
+A scheduled Claude agent can triage auto-detected issues:
+
+**Prompt:** `forgejo-optimize/chrysopedia-bugfix-agent-prompt.md`
+
+**Workflow:**
+1. Lists open issues labeled `auto-detected` + `agent-fixable`
+2. For each (oldest first, max 3 per run):
+   - SSHs to ub01, reads logs, inspects source code
+   - Classifies as: agent-fixable, needs-human, transient, or resolved
+   - If fixable: creates branch, commits fix, opens PR
+   - If not: posts root-cause analysis comment
+
+## Manual Monitoring
+
+| What | How |
+|------|-----|
+| Container status | `ssh ub01 "docker ps --filter name=chrysopedia"` |
+| API health | `ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"` |
+| LightRAG health | `ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"` |
+| Resource usage | `ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' \| grep chrysopedia"` |
+| Recent errors | `ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 \| grep -i error"` |
+| Disk space | `ssh ub01 "df -h /vmPool"` |
+
+---
+
+*See also: [[Deployment]], [[Architecture]]*
\ No newline at end of file
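
Note on the Deduplication section of the page added above: the fingerprint rule may be easier to review with a small sketch. The following is not taken from `probe.py`; it is a minimal illustration under stated assumptions. The function names, the `auto_comments` field, the `http_500` pattern id, and the choice to embed the raw `{container}::{pattern}` string in the `[fp:...]` tag are all hypothetical (the real probe may store a hash or short id instead).

```python
import re
from typing import Optional

# Assumption: the tag holds the raw "{container}::{pattern}" string.
FP_TAG = re.compile(r"\[fp:([^\]]+)\]")

# "5+ auto-comments" threshold, as stated in the Deduplication section.
ESCALATE_AFTER = 5


def fingerprint(container: str, pattern: str) -> str:
    """Build the {container}::{pattern} fingerprint for a detection."""
    return f"{container}::{pattern}"


def tag_title(summary: str, fp: str) -> str:
    """Append the fingerprint tag to an issue title."""
    return f"{summary} [fp:{fp}]"


def extract_fp(title: str) -> Optional[str]:
    """Return the fingerprint embedded in a title, or None if absent."""
    match = FP_TAG.search(title)
    return match.group(1) if match else None


def decide(fp: str, open_issues: list) -> str:
    """Return "file", "comment", or "escalate" for a new detection.

    open_issues is a hypothetical list of dicts with "title" and
    "auto_comments" keys, standing in for Forgejo API results.
    """
    for issue in open_issues:
        if extract_fp(issue["title"]) == fp:
            if issue["auto_comments"] >= ESCALATE_AFTER:
                return "escalate"
            return "comment"
    return "file"
```

With this shape, a detection for `chrysopedia-api` with no matching open issue yields `"file"`, a match with fewer than five auto-comments yields `"comment"`, and a match at or above the threshold yields `"escalate"`. Whether escalation also adds a comment is not specified in the page, so the sketch leaves that to the caller.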