From ae5b8ca475283acd7cfa5f74ceba40720481f8d5 Mon Sep 17 00:00:00 2001
From: xpltd_admin <admin@xpltd.co>
Date: Sat, 4 Apr 2026 00:04:24 -0600
Subject: [PATCH] Add "Monitoring"

---
 Monitoring.md | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 Monitoring.md

diff --git a/Monitoring.md b/Monitoring.md
new file mode 100644
index 0000000..b6ea87c
--- /dev/null
+++ b/Monitoring.md
@@ -0,0 +1,133 @@
+# Monitoring
+
+## Automated Health Probe
+
+Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects issues, and files them as Forgejo issues.
+
+**Location:** `forgejo-optimize/chrysopedia-health-probe/` (on dev01)
+
+### How It Works
+
+```
+dev01 (probe.py)
+  ├── SSH → ub01: docker logs, docker inspect, df
+  ├── SSH → ub01: docker exec curl (health endpoints)
+  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
+```
+
+1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
+2. Runs 14 pattern detectors (see table below)
+3. Checks health endpoints and response times
+4. Checks container restart counts and OOM status
+5. Checks disk space on `/vmPool`
+6. Deduplicates against existing open Forgejo issues
+7. Files new issues or adds comments to existing ones
+
+### Running the Probe
+
+```bash
+# On dev01:
+cd ~/projects/forgejo-optimize/chrysopedia-health-probe
+
+# Dry run: shows detections without filing issues
+python3 probe.py --dry-run --verbose
+
+# Live run: files new Forgejo issues or comments on existing ones
+python3 probe.py
+```
+
+### Detection Matrix
+
+| Pattern | Severity | Source | Labels |
+|---------|----------|--------|--------|
+| Python traceback | high | all container logs | `bug` |
+| HTTP 500 | high | api/web logs | `bug` |
+| HTTP 502/503/504 | high | web logs | `infra` |
+| DB connection error | high | api/worker logs | `database`, `infra` |
+| Qdrant error | high | api logs | `search`, `infra` |
+| Redis error | high | api/worker logs | `infra` |
+| OOM killed | high | docker inspect | `infra` |
+| Celery task failure | medium | worker logs | `celery`, `bug` |
+| Celery task timeout | medium | worker logs | `celery`, `performance` |
+| Slow response (>2s) | medium | health endpoints | `performance` |
+| Container restart | medium | docker inspect | `infra` |
+| Pipeline stage failure | medium | worker logs | `pipeline`, `bug` |
+| LightRAG unreachable | medium | health endpoint | `infra` |
+| Disk usage >85% | low | `df /vmPool` | `infra` |
+
+### Deduplication
+
+Each detection gets a **fingerprint** (`{container}::{pattern}`) embedded in the issue title as `[fp:xxx]`.
+
+- If an open issue with the same fingerprint exists → a **comment** is added instead of filing a new issue
+- If the same issue has 5+ auto-comments without resolution → it is **escalated** with the `triage-needed` label
+
+### Issue Labels
+
+The probe uses these labels (auto-created on the repo):
+
+| Label | Color | Purpose |
+|-------|-------|--------|
+| `auto-detected` | blue | Filed by health probe |
+| `bug` | red | Application error |
+| `performance` | yellow | Slow/high latency |
+| `infra` | purple | Infrastructure issue |
+| `celery` | light yellow | Worker/task issue |
+| `database` | green | PostgreSQL issue |
+| `search` | blue | Qdrant/embedding issue |
+| `pipeline` | light blue | LLM pipeline issue |
+| `triage-needed` | pink | Needs human review |
+| `agent-fixable` | light blue | Claude agent can fix |
+| `severity:high/medium/low` | red/pink/light | Impact level |
+
+### Issue Template
+
+Auto-filed issues follow this structure:
+
+```markdown
+## Auto-Detected: {Category}
+
+**Detected:** {timestamp}
+**Container:** {container_name}
+**Severity:** {high|medium|low}
+
+### Evidence
+{log snippet — max 30 lines}
+
+### Context
+- Container uptime / restart count
+
+### Suggested Investigation
+1. Step 1
+2. Step 2
+3. Step 3
+```
+
+## Bugfix Agent
+
+A scheduled Claude agent can triage auto-detected issues:
+
+**Prompt:** `forgejo-optimize/chrysopedia-bugfix-agent-prompt.md`
+
+**Workflow:**
+1. Lists open issues labeled `auto-detected` + `agent-fixable`
+2. For each issue (oldest first, at most 3 per run):
+   - SSHs to ub01, reads logs, inspects source code
+   - Classifies it as: agent-fixable, needs-human, transient, or resolved
+   - If fixable: creates a branch, commits a fix, opens a PR
+   - If not: posts a root-cause analysis comment
+
+## Manual Monitoring
+
+| What | How |
+|------|-----|
+| Container status | `ssh ub01 "docker ps --filter name=chrysopedia"` |
+| API health | `ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"` |
+| LightRAG health | `ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"` |
+| Resource usage | `ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' \| grep chrysopedia"` |
+| Recent errors | `ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 \| grep -i error"` |
+| Disk space | `ssh ub01 "df -h /vmPool"` |
+
+---
+
+*See also: [[Deployment]], [[Architecture]]*
\ No newline at end of file