Monitoring
xpltd_admin edited this page 2026-04-04 00:04:24 -06:00
Automated Health Probe
Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.
Location: forgejo-optimize/chrysopedia-health-probe/ (on dev01)
How It Works
dev01 (probe.py)
├── SSH → ub01: docker logs, docker inspect, df
├── SSH → ub01: docker exec curl (health endpoints)
└── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)
- Pulls last 30 minutes of container logs from all 10 Chrysopedia containers
- Runs 14 pattern detectors (see table below)
- Checks health endpoints and response times
- Checks container restart counts and OOM status
- Checks disk space on /vmPool
- Deduplicates against existing open Forgejo issues
- Files new issues or adds comments to existing ones
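The log-collection step in the list above could look roughly like the sketch below. It is an illustration only, not the contents of probe.py: the function names and the use of `shlex.join` are assumptions. The host `ub01`, the 30-minute window, and the `chrysopedia-api` container name are taken from this page.

```python
import shlex
import subprocess

HOST = "ub01"         # Docker host named on this page
WINDOW_MINUTES = 30   # log window named on this page


def build_log_command(container, minutes=WINDOW_MINUTES, host=HOST):
    """Return the argv that fetches recent logs for one container over SSH."""
    # Quote the remote command so the remote shell sees each argument intact.
    remote = shlex.join(["docker", "logs", "--since", f"{minutes}m", container])
    return ["ssh", host, remote]


def fetch_logs(container):
    """Run the command and return the combined output (hypothetical helper)."""
    result = subprocess.run(
        build_log_command(container),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # docker logs writes to both streams
        text=True,
        timeout=60,
        check=False,
    )
    return result.stdout
```

Building the command separately from running it keeps the SSH call easy to inspect in a dry run.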
Running the Probe
# On dev01:
cd ~/projects/forgejo-optimize/chrysopedia-health-probe
# Dry run — shows detections without filing issues
python3 probe.py --dry-run --verbose
# Live run — files/comments on Forgejo issues
python3 probe.py
Detection Matrix
| Pattern | Severity | Source | Labels |
|---|---|---|---|
| Python traceback | high | all container logs | bug |
| HTTP 500 | high | api/web logs | bug |
| HTTP 502/503/504 | high | web logs | infra |
| DB connection error | high | api/worker logs | database, infra |
| Qdrant error | high | api logs | search, infra |
| Redis error | high | api/worker logs | infra |
| OOM killed | high | docker inspect | infra |
| Celery task failure | medium | worker logs | celery, bug |
| Celery task timeout | medium | worker logs | celery, performance |
| Slow response (>2s) | medium | health endpoints | performance |
| Container restart | medium | docker inspect | infra |
| Pipeline stage failure | medium | worker logs | pipeline, bug |
| LightRAG unreachable | medium | health endpoint | infra |
| Disk space >85% | low | df /vmPool | infra |
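As a rough illustration of how log-based rows in the matrix might be implemented, the sketch below pairs each pattern with a regular expression, severity, and labels. The `Detector` class, the detector names, and the regexes are assumptions, not the probe's actual code. The HTTP patterns in particular assume an access-log style line where the status code follows the quoted request.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    name: str
    severity: str
    labels: tuple
    regex: re.Pattern


# Three of the 14 rows, with assumed regexes.
DETECTORS = [
    Detector("python-traceback", "high", ("bug",),
             re.compile(r"Traceback \(most recent call last\):")),
    Detector("http-500", "high", ("bug",),
             re.compile(r'"\s500\s')),
    Detector("http-502-503-504", "high", ("infra",),
             re.compile(r'"\s50[234]\s')),
]


def detect(log_text):
    """Return (detector, matching_lines) for every detector that fired."""
    lines = log_text.splitlines()
    hits = []
    for detector in DETECTORS:
        matching = [line for line in lines if detector.regex.search(line)]
        if matching:
            hits.append((detector, matching))
    return hits
```

Rows sourced from `docker inspect`, health endpoints, or `df` would need their own checks rather than a regex over log text.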
Deduplication
Each detection gets a fingerprint ({container}::{pattern}) embedded in the issue title as [fp:xxx].
- If an open issue with the same fingerprint exists → comment added instead of new issue
- If the same issue has 5+ auto-comments without resolution → escalated with the triage-needed label
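The deduplication rules above can be sketched as follows. This page does not say how the `xxx` in `[fp:xxx]` is derived, so the short SHA-1 hash here is an assumption, as are the function names and the simplified issue records (dicts with `title` and `auto_comments`). Escalating once an issue already has five auto-comments is one reading of the "5+" rule.

```python
import hashlib
import re

ESCALATION_THRESHOLD = 5  # "5+ auto-comments" from this page
FP_TAG = re.compile(r"\[fp:([0-9a-f]+)\]")


def fingerprint(container, pattern):
    """Hash '{container}::{pattern}' into a short tag (hash choice is assumed)."""
    raw = f"{container}::{pattern}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:8]


def decide(fp, open_issues):
    """Return ('create' | 'comment' | 'escalate', matching_issue_or_None)."""
    for issue in open_issues:
        match = FP_TAG.search(issue["title"])
        if match and match.group(1) == fp:
            if issue["auto_comments"] >= ESCALATION_THRESHOLD:
                return ("escalate", issue)
            return ("comment", issue)
    return ("create", None)
```

Because the fingerprint depends only on the container and pattern, repeated detections of the same problem map to the same open issue.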
Issue Labels
The probe uses these labels (auto-created on the repo):
| Label | Color | Purpose |
|---|---|---|
| auto-detected | blue | Filed by health probe |
| bug | red | Application error |
| performance | yellow | Slow/high latency |
| infra | purple | Infrastructure issue |
| celery | light yellow | Worker/task issue |
| database | green | PostgreSQL issue |
| search | blue | Qdrant/embedding issue |
| pipeline | light blue | LLM pipeline issue |
| triage-needed | pink | Needs human review |
| agent-fixable | light blue | Claude agent can fix |
| severity:high/medium/low | red/pink/light | Impact level |
Issue Template
Auto-filed issues follow this structure:
## Auto-Detected: {Category}
**Detected:** {timestamp}
**Container:** {container_name}
**Severity:** {high|medium|low}
### Evidence
{log snippet — max 30 lines}
### Context
- Container uptime / restart count
### Suggested Investigation
1. Step 1
2. Step 2
3. Step 3
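A minimal sketch of filling in this template is shown below. The function name and parameters are hypothetical. Keeping the most recent 30 lines, rather than the first 30, is an assumption about how the "max 30 lines" limit is applied.

```python
from datetime import datetime, timezone

MAX_EVIDENCE_LINES = 30  # limit stated in the template


def render_issue_body(category, container, severity, log_lines,
                      uptime, restart_count, steps, detected=None):
    """Build the issue body from the template fields (illustrative only)."""
    detected = detected or datetime.now(timezone.utc)
    evidence = log_lines[-MAX_EVIDENCE_LINES:]  # assumed: keep the newest lines
    parts = [
        f"## Auto-Detected: {category}",
        f"**Detected:** {detected.isoformat()}",
        f"**Container:** {container}",
        f"**Severity:** {severity}",
        "### Evidence",
        *evidence,
        "### Context",
        f"- Container uptime: {uptime} / restart count: {restart_count}",
        "### Suggested Investigation",
        *[f"{n}. {step}" for n, step in enumerate(steps, start=1)],
    ]
    return "\n".join(parts)
```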
Bugfix Agent
A scheduled Claude agent can triage auto-detected issues:
Prompt: forgejo-optimize/chrysopedia-bugfix-agent-prompt.md
Workflow:
- Lists open issues labeled auto-detected + agent-fixable
- For each (oldest first, max 3 per run):
  - SSHs to ub01, reads logs, inspects source code
  - Classifies as: agent-fixable, needs-human, transient, or resolved
  - If fixable: creates branch, commits fix, opens PR
  - If not: posts root-cause analysis comment
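The selection rule in the workflow above (both labels, oldest first, at most three per run) could be expressed like this. The issue records are simplified, hypothetical dicts with `state`, `labels` as a list of names, and an ISO 8601 `created_at`. The real Forgejo API response has a richer shape.

```python
from datetime import datetime

REQUIRED_LABELS = {"auto-detected", "agent-fixable"}
MAX_PER_RUN = 3


def select_issues(issues, max_per_run=MAX_PER_RUN):
    """Pick open issues carrying both labels, oldest first, capped per run."""
    eligible = [
        issue for issue in issues
        if issue["state"] == "open" and REQUIRED_LABELS <= set(issue["labels"])
    ]
    eligible.sort(key=lambda issue: datetime.fromisoformat(issue["created_at"]))
    return eligible[:max_per_run]
```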
Manual Monitoring
| What | How |
|---|---|
| Container status | ssh ub01 "docker ps --filter name=chrysopedia" |
| API health | ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health" |
| LightRAG health | ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health" |
| Resource usage | ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' | grep chrysopedia" |
| Recent errors | ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 | grep -i error" |
| Disk space | ssh ub01 "df -h /vmPool" |
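To turn the disk-space check into a pass/fail result against the 85% threshold from the detection matrix, the output of `df` can be parsed as sketched below. The function names are hypothetical. The sketch assumes POSIX-format output (`df -P /vmPool`), where the capacity is the fifth column of the second line, and a filesystem name without spaces.

```python
DISK_THRESHOLD_PERCENT = 85  # threshold from the detection matrix


def disk_usage_percent(df_output):
    """Extract the capacity percentage from `df -P <mount>` output."""
    lines = df_output.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("unexpected df output")
    capacity = lines[1].split()[4]  # e.g. "87%"
    return int(capacity.rstrip("%"))


def disk_alert(df_output, threshold=DISK_THRESHOLD_PERCENT):
    """Return True when usage is strictly above the threshold."""
    return disk_usage_percent(df_output) > threshold
```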
See also: Deployment, Architecture