Add "Monitoring"

Commit ae5b8ca475 (parent 85fdfe03b0), 1 changed file with 133 additions and 0 deletions: Monitoring.md (normal file)

# Monitoring

## Automated Health Probe

Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.

**Location:** `forgejo-optimize/chrysopedia-health-probe/` (on dev01)

### How It Works

```
dev01 (probe.py)
  ├── SSH → ub01: docker logs, docker inspect, df
  ├── SSH → ub01: docker exec curl (health endpoints)
  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
```

The probe:

1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
2. Runs 14 pattern detectors (see the Detection Matrix below)
3. Checks health endpoints and response times
4. Checks container restart counts and OOM status
5. Checks disk space on `/vmPool`
6. Deduplicates against existing open Forgejo issues
7. Files new issues or adds comments to existing ones
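
Step 1 could be sketched roughly as follows. This is a minimal illustration and not the code in `probe.py`: the function names and the 60-second timeout are assumptions. Only the SSH hop to ub01, the use of `docker logs`, and the 30-minute window come from the description above.

```python
import shlex
import subprocess


def docker_logs_command(container: str, since: str = "30m", host: str = "ub01") -> list[str]:
    """Build the ssh command that pulls recent logs for one container."""
    # 2>&1 merges stderr into stdout, because containers often log to stderr.
    remote = f"docker logs --since {shlex.quote(since)} {shlex.quote(container)} 2>&1"
    return ["ssh", host, remote]


def fetch_logs(container: str, since: str = "30m", host: str = "ub01") -> str:
    """Run the command and return the combined log text (hypothetical helper).

    The return code is not checked here; a real probe would likely treat a
    failed SSH call as a detection of its own.
    """
    result = subprocess.run(
        docker_logs_command(container, since, host),
        capture_output=True,
        text=True,
        timeout=60,  # assumed limit, not taken from the probe
    )
    return result.stdout
```

Calling `fetch_logs("chrysopedia-api")` would return the text that the pattern detectors then scan.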

### Running the Probe

```bash
# On dev01:
cd ~/projects/forgejo-optimize/chrysopedia-health-probe

# Dry run: shows detections without filing issues
python3 probe.py --dry-run --verbose

# Live run: files/comments on Forgejo issues
python3 probe.py
```

### Detection Matrix

| Pattern | Severity | Source | Labels |
|---------|----------|--------|--------|
| Python traceback | high | all container logs | `bug` |
| HTTP 500 | high | api/web logs | `bug` |
| HTTP 502/503/504 | high | web logs | `infra` |
| DB connection error | high | api/worker logs | `database`, `infra` |
| Qdrant error | high | api logs | `search`, `infra` |
| Redis error | high | api/worker logs | `infra` |
| OOM killed | high | docker inspect | `infra` |
| Celery task failure | medium | worker logs | `celery`, `bug` |
| Celery task timeout | medium | worker logs | `celery`, `performance` |
| Slow response (>2s) | medium | health endpoints | `performance` |
| Container restart | medium | docker inspect | `infra` |
| Pipeline stage failure | medium | worker logs | `pipeline`, `bug` |
| LightRAG unreachable | medium | health endpoint | `infra` |
| Disk space >85% | low | `df /vmPool` | `infra` |
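
A log-based detector from the matrix can be thought of as a regular expression plus the severity and labels to attach. The sketch below covers three of the rows. The regular expressions are assumptions about what the log lines look like (a standard Python traceback header, an access-log line with the status code after the quoted request, and Celery's "raised unexpected" message); the patterns actually used by `probe.py` may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    pattern: str             # name as listed in the Detection Matrix
    severity: str            # "high", "medium" or "low"
    labels: tuple[str, ...]  # Forgejo labels to attach
    regex: re.Pattern        # assumed log signature, not taken from probe.py


DETECTORS = [
    Detector("Python traceback", "high", ("bug",),
             re.compile(r"Traceback \(most recent call last\)")),
    Detector("HTTP 502/503/504", "high", ("infra",),
             re.compile(r'" (502|503|504) ')),
    Detector("Celery task failure", "medium", ("celery", "bug"),
             re.compile(r"Task \S+ raised unexpected")),
]


def detect(log_text: str) -> list[Detector]:
    """Return every detector whose signature appears in the log text."""
    return [d for d in DETECTORS if d.regex.search(log_text)]
```

Keeping the detectors in a plain list makes it straightforward to add a row to the matrix without touching the scanning logic.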

### Deduplication

Each detection gets a **fingerprint** (`{container}::{pattern}`), which is embedded in the issue title as `[fp:xxx]`.

- If an open issue with the same fingerprint exists → a **comment** is added instead of filing a new issue
- If the same issue has 5+ auto-comments without resolution → it is **escalated** with the `triage-needed` label
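
The two rules above amount to a small decision function. The sketch below is illustrative only: it assumes the raw fingerprint string is what appears after `fp:` in the title (the real probe may embed a hash or shortened form instead), and the issue dictionaries with `title` and `auto_comments` keys are a hypothetical shape, not the Forgejo API schema.

```python
import re

ESCALATION_THRESHOLD = 5  # "5+ auto-comments" from the rule above
FP_TAG = re.compile(r"\[fp:([^\]]+)\]")


def fingerprint(container: str, pattern: str) -> str:
    """Build the {container}::{pattern} fingerprint."""
    return f"{container}::{pattern}"


def decide(fp: str, open_issues: list[dict]) -> str:
    """Return "new_issue", "comment" or "escalate" for one detection."""
    for issue in open_issues:
        match = FP_TAG.search(issue["title"])
        if match and match.group(1) == fp:
            if issue["auto_comments"] >= ESCALATION_THRESHOLD:
                return "escalate"
            return "comment"
    return "new_issue"
```

For example, a detection whose fingerprint matches an open issue with two auto-comments yields `"comment"`, while one with no matching open issue yields `"new_issue"`.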

### Issue Labels

The probe uses these labels (auto-created on the repo):

| Label | Color | Purpose |
|-------|-------|---------|
| `auto-detected` | blue | Filed by health probe |
| `bug` | red | Application error |
| `performance` | yellow | Slow/high latency |
| `infra` | purple | Infrastructure issue |
| `celery` | light yellow | Worker/task issue |
| `database` | green | PostgreSQL issue |
| `search` | blue | Qdrant/embedding issue |
| `pipeline` | light blue | LLM pipeline issue |
| `triage-needed` | pink | Needs human review |
| `agent-fixable` | light blue | Claude agent can fix |
| `severity:high/medium/low` | red/pink/light | Impact level |

### Issue Template

Auto-filed issues follow this structure:

```markdown
## Auto-Detected: {Category}

**Detected:** {timestamp}
**Container:** {container_name}
**Severity:** {high|medium|low}

### Evidence
{log snippet, max 30 lines}

### Context
- Container uptime / restart count

### Suggested Investigation
1. Step 1
2. Step 2
3. Step 3
```
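
Filling in the first part of this template could look roughly like the sketch below. The function name and parameters are hypothetical, and keeping the last 30 log lines (rather than the first 30) is an assumption; the template only states the maximum. The Context and Suggested Investigation sections are left out for brevity.

```python
from datetime import datetime, timezone

MAX_EVIDENCE_LINES = 30  # "max 30 lines" from the template


def render_issue_body(category, container, severity, log_lines, detected=None):
    """Render the header fields and Evidence section of an auto-filed issue."""
    detected = detected or datetime.now(timezone.utc)
    # Assumption: the most recent lines are the most useful evidence.
    evidence = "\n".join(log_lines[-MAX_EVIDENCE_LINES:])
    return (
        f"## Auto-Detected: {category}\n\n"
        f"**Detected:** {detected.isoformat()}\n"
        f"**Container:** {container}\n"
        f"**Severity:** {severity}\n\n"
        f"### Evidence\n{evidence}\n"
    )
```

Passing `detected` explicitly keeps the output reproducible, which helps when comparing a dry run against a live run.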

## Bugfix Agent

A scheduled Claude agent can triage auto-detected issues:

**Prompt:** `forgejo-optimize/chrysopedia-bugfix-agent-prompt.md`

**Workflow:**
1. Lists open issues labeled `auto-detected` + `agent-fixable`
2. For each (oldest first, max 3 per run):
   - SSHs to ub01, reads logs, inspects source code
   - Classifies the issue as agent-fixable, needs-human, transient, or resolved
   - If fixable: creates a branch, commits the fix, opens a PR
   - If not: posts a root-cause analysis comment
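
The selection rule in the workflow (both labels, oldest first, at most three per run) can be illustrated with a short sketch. The issue dictionaries here are a hypothetical, simplified shape with `state`, `labels`, and a `created_at` datetime; they are not the Forgejo API response format, and the agent itself works from a prompt rather than from this code.

```python
from datetime import datetime

REQUIRED_LABELS = {"auto-detected", "agent-fixable"}
MAX_PER_RUN = 3


def select_issues(issues: list[dict]) -> list[dict]:
    """Pick the open issues the agent would look at in one run."""
    eligible = [
        issue for issue in issues
        if issue["state"] == "open" and REQUIRED_LABELS <= set(issue["labels"])
    ]
    # Oldest first, then cap at the per-run limit.
    eligible.sort(key=lambda issue: issue["created_at"])
    return eligible[:MAX_PER_RUN]
```

Capping the batch keeps each scheduled run short and leaves newer issues for the next run.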

## Manual Monitoring

| What | How |
|------|-----|
| Container status | `ssh ub01 "docker ps --filter name=chrysopedia"` |
| API health | `ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"` |
| LightRAG health | `ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"` |
| Resource usage | `ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' \| grep chrysopedia"` |
| Recent errors | `ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 \| grep -i error"` |
| Disk space | `ssh ub01 "df -h /vmPool"` |
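
When checking disk space by hand, the number to compare against the probe's 85% threshold is the use-percentage column of the `df` output. A minimal sketch of that comparison is shown below; the function names are hypothetical, and it assumes the usual single data line with the percentage in the fifth column, as printed by `df -h` or `df -P` for one mount point.

```python
DISK_THRESHOLD_PERCENT = 85  # from the "Disk space >85%" row of the Detection Matrix


def disk_usage_percent(df_output: str) -> int:
    """Extract the use percentage from `df` output for a single mount point."""
    data_line = df_output.strip().splitlines()[-1]  # skip the header row
    capacity = data_line.split()[4]                 # e.g. "87%"
    return int(capacity.rstrip("%"))


def disk_alert(df_output: str, threshold: int = DISK_THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the threshold."""
    return disk_usage_percent(df_output) > threshold
```

The input would be the text returned by `ssh ub01 "df -h /vmPool"`.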

---

*See also: [[Deployment]], [[Architecture]]*