Add "Monitoring"
parent
85fdfe03b0
commit
ae5b8ca475
1 changed files with 133 additions and 0 deletions
133
Monitoring.md
Normal file
# Monitoring

## Automated Health Probe

Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.

**Location:** `forgejo-optimize/chrysopedia-health-probe/` (on dev01)

### How It Works

```
dev01 (probe.py)
├── SSH → ub01: docker logs, docker inspect, df
├── SSH → ub01: docker exec curl (health endpoints)
└── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
```

1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
2. Runs 14 pattern detectors (see table below)
3. Checks health endpoints and response times
4. Checks container restart counts and OOM status
5. Checks disk space on `/vmPool`
6. Deduplicates against existing open Forgejo issues
7. Files new issues or adds comments to existing ones

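Step 1 above can be illustrated with a minimal sketch. This is not taken from `probe.py`, whose source is not shown here; the function names `build_log_command` and `fetch_logs` are hypothetical, and the sketch assumes the probe shells out to `ssh` and uses the `--since` option of `docker logs` with a relative duration.

```python
import shlex
import subprocess


def build_log_command(host, container, minutes=30):
    """Build an ssh command that pulls recent logs for one container.

    Hypothetical helper: assumes key-based SSH access to `host` and that
    stderr should be merged into stdout on the remote side.
    """
    remote = f"docker logs --since {minutes}m {shlex.quote(container)} 2>&1"
    return ["ssh", host, remote]


def fetch_logs(host, container, minutes=30, timeout=60):
    """Run the command and return the log lines (empty list on failure)."""
    result = subprocess.run(
        build_log_command(host, container, minutes),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failing container should not crash the probe
    )
    return result.stdout.splitlines() if result.returncode == 0 else []
```

Keeping command construction separate from execution makes the command easy to inspect in a dry run without touching ub01.
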
### Running the Probe

```bash
# On dev01:
cd ~/projects/forgejo-optimize/chrysopedia-health-probe

# Dry run: shows detections without filing issues
python3 probe.py --dry-run --verbose

# Live run: files/comments on Forgejo issues
python3 probe.py
```

### Detection Matrix

| Pattern | Severity | Source | Labels |
|---------|----------|--------|--------|
| Python traceback | high | all container logs | `bug` |
| HTTP 500 | high | api/web logs | `bug` |
| HTTP 502/503/504 | high | web logs | `infra` |
| DB connection error | high | api/worker logs | `database`, `infra` |
| Qdrant error | high | api logs | `search`, `infra` |
| Redis error | high | api/worker logs | `infra` |
| OOM killed | high | docker inspect | `infra` |
| Celery task failure | medium | worker logs | `celery`, `bug` |
| Celery task timeout | medium | worker logs | `celery`, `performance` |
| Slow response (>2s) | medium | health endpoints | `performance` |
| Container restart | medium | docker inspect | `infra` |
| Pipeline stage failure | medium | worker logs | `pipeline`, `bug` |
| LightRAG unreachable | medium | health endpoint | `infra` |
| Disk space >85% | low | `df /vmPool` | `infra` |

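As an illustration of how log-based rows in the matrix might be expressed, here is a sketch covering three of them. The regular expressions are assumptions about the log format (a Python traceback header and an access-log line ending in `HTTP/1.1" <status>`); the probe's real patterns are not documented here, and `Detector` and `detect` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    name: str
    severity: str
    labels: tuple
    regex: re.Pattern


# Assumed patterns for illustration only; adjust to the actual log format.
DETECTORS = [
    Detector("Python traceback", "high", ("bug",),
             re.compile(r"Traceback \(most recent call last\)")),
    Detector("HTTP 500", "high", ("bug",),
             re.compile(r'HTTP/\d(?:\.\d)?"\s+500\b')),
    Detector("HTTP 502/503/504", "high", ("infra",),
             re.compile(r'HTTP/\d(?:\.\d)?"\s+50[234]\b')),
]


def detect(log_lines):
    """Return each detector that matches at least one log line."""
    return [
        det for det in DETECTORS
        if any(det.regex.search(line) for line in log_lines)
    ]
```

Each detector fires at most once per log window in this sketch, which fits the one-fingerprint-per-container-and-pattern model described under Deduplication.
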
### Deduplication

Each detection gets a **fingerprint** (`{container}::{pattern}`) embedded in the issue title as `[fp:xxx]`.

- If an open issue with the same fingerprint exists → a **comment** is added instead of filing a new issue
- If the same issue has 5+ auto-comments without resolution → it is **escalated** with the `triage-needed` label

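The decision logic above can be sketched as follows. Several details are assumptions: the text does not say how `xxx` is derived from `{container}::{pattern}`, so this sketch uses a truncated SHA-1 hex digest, and the issue dictionaries (with `title` and `auto_comments` keys) are a hypothetical stand-in for whatever the Forgejo API client returns.

```python
import hashlib
import re

ESCALATION_THRESHOLD = 5  # "5+ auto-comments" from the rule above
FP_TAG = re.compile(r"\[fp:([0-9a-f]+)\]")


def fingerprint(container, pattern):
    """Short, stable tag for a (container, pattern) pair (assumed scheme)."""
    raw = f"{container}::{pattern}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:8]


def decide(fp, open_issues):
    """Return 'create', 'comment', or 'escalate' for a detection."""
    for issue in open_issues:
        match = FP_TAG.search(issue["title"])
        if match and match.group(1) == fp:
            if issue["auto_comments"] >= ESCALATION_THRESHOLD:
                return "escalate"
            return "comment"
    return "create"
```

Because the tag lives in the issue title, the lookup needs only the list of open issues and no separate state file.
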
### Issue Labels

The probe uses these labels (auto-created on the repo):

| Label | Color | Purpose |
|-------|-------|---------|
| `auto-detected` | blue | Filed by health probe |
| `bug` | red | Application error |
| `performance` | yellow | Slow/high latency |
| `infra` | purple | Infrastructure issue |
| `celery` | light yellow | Worker/task issue |
| `database` | green | PostgreSQL issue |
| `search` | blue | Qdrant/embedding issue |
| `pipeline` | light blue | LLM pipeline issue |
| `triage-needed` | pink | Needs human review |
| `agent-fixable` | light blue | Claude agent can fix |
| `severity:high/medium/low` | red/pink/light | Impact level |

### Issue Template

Auto-filed issues follow this structure:

```markdown
## Auto-Detected: {Category}

**Detected:** {timestamp}
**Container:** {container_name}
**Severity:** {high|medium|low}

### Evidence
{log snippet, max 30 lines}

### Context
- Container uptime / restart count

### Suggested Investigation
1. Step 1
2. Step 2
3. Step 3
```

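A minimal sketch of filling in this template is shown below. The function name and parameters are hypothetical, and keeping the *last* 30 log lines is an assumption; the template only states the 30-line cap.

```python
MAX_EVIDENCE_LINES = 30  # cap stated in the template


def render_issue_body(category, timestamp, container, severity,
                      log_lines, context, steps):
    """Render an issue body following the template structure above."""
    # Assumption: the most recent lines are the most relevant evidence.
    evidence = "\n".join(log_lines[-MAX_EVIDENCE_LINES:])
    parts = [
        f"## Auto-Detected: {category}",
        "",
        f"**Detected:** {timestamp}",
        f"**Container:** {container}",
        f"**Severity:** {severity}",
        "",
        "### Evidence",
        evidence,
        "",
        "### Context",
        *[f"- {item}" for item in context],
        "",
        "### Suggested Investigation",
        *[f"{n}. {step}" for n, step in enumerate(steps, start=1)],
    ]
    return "\n".join(parts)
```
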
## Bugfix Agent

A scheduled Claude agent can triage auto-detected issues:

**Prompt:** `forgejo-optimize/chrysopedia-bugfix-agent-prompt.md`

**Workflow:**
1. Lists open issues labeled `auto-detected` + `agent-fixable`
2. For each (oldest first, max 3 per run):
   - SSHs to ub01, reads logs, inspects source code
   - Classifies as: agent-fixable, needs-human, transient, or resolved
   - If fixable: creates branch, commits fix, opens PR
   - If not: posts root-cause analysis comment

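The selection rule in the workflow (both labels present, oldest first, at most 3 per run) can be sketched as below. The issue dictionaries with `labels` and `created_at` keys are a hypothetical shape, not the Forgejo API response format, and `select_issues` is an illustrative name.

```python
from datetime import datetime

REQUIRED_LABELS = {"auto-detected", "agent-fixable"}
MAX_PER_RUN = 3


def select_issues(open_issues):
    """Pick the oldest eligible issues, capped at MAX_PER_RUN."""
    eligible = [
        issue for issue in open_issues
        if REQUIRED_LABELS <= set(issue["labels"])  # must carry both labels
    ]
    eligible.sort(key=lambda issue: issue["created_at"])
    return eligible[:MAX_PER_RUN]
```

The per-run cap bounds how much work a single scheduled run can take on, so a burst of detections is worked through over several runs.
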
## Manual Monitoring

| What | How |
|------|-----|
| Container status | `ssh ub01 "docker ps --filter name=chrysopedia"` |
| API health | `ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"` |
| LightRAG health | `ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"` |
| Resource usage | `ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' \| grep chrysopedia"` |
| Recent errors | `ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 \| grep -i error"` |
| Disk space | `ssh ub01 "df -h /vmPool"` |

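The disk-space check in the table can also be evaluated against the 85% threshold from the detection matrix. The sketch below parses `df` output that has already been captured as text; it assumes the POSIX output format (`df -P /vmPool`), where the fifth column is the capacity percentage, and the function names are hypothetical.

```python
DISK_THRESHOLD_PERCENT = 85  # threshold from the detection matrix


def disk_usage_percent(df_output):
    """Extract the capacity percentage from `df -P <path>` output."""
    lines = [ln for ln in df_output.strip().splitlines() if ln.strip()]
    fields = lines[-1].split()  # last line is the filesystem row
    return int(fields[4].rstrip("%"))


def disk_alert(df_output, threshold=DISK_THRESHOLD_PERCENT):
    """True when usage is strictly above the threshold."""
    return disk_usage_percent(df_output) > threshold
```
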
---

*See also: [[Deployment]], [[Architecture]]*