xpltd_admin edited this page 2026-04-04 00:04:24 -06:00

Monitoring

Automated Health Probe

Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.

Location: forgejo-optimize/chrysopedia-health-probe/ (on dev01)

How It Works

dev01 (probe.py)
  ├── SSH → ub01: docker logs, docker inspect, df
  ├── SSH → ub01: docker exec curl (health endpoints)
  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)

On each run, the probe:

  1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
  2. Runs 14 pattern detectors (see Detection Matrix)
  3. Checks health endpoints and response times
  4. Checks container restart counts and OOM status
  5. Checks disk space on /vmPool
  6. Deduplicates against existing open Forgejo issues
  7. Files new issues or adds comments to existing ones
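
The log-pulling step could look roughly like the sketch below. It is not taken from probe.py: the function names, the timeout, and the exact `docker logs --since 30m` invocation are assumptions chosen to match the 30-minute window described above.

```python
import shlex
import subprocess

HOST = "ub01"       # Docker host named on this page
LOG_WINDOW = "30m"  # the probe inspects the last 30 minutes of logs


def build_log_command(container: str, host: str = HOST, since: str = LOG_WINDOW) -> list[str]:
    """Return the argv that fetches recent logs for one container over SSH."""
    # Quote the container name because the remote side runs the string in a shell.
    remote = f"docker logs --since {since} {shlex.quote(container)}"
    return ["ssh", host, remote]


def fetch_logs(container: str, timeout: float = 60.0) -> str:
    """Run the SSH command and return stdout and stderr merged into one string.

    docker logs replays the container's stdout and stderr separately,
    so both streams are captured together here.
    """
    result = subprocess.run(
        build_log_command(container),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
        check=False,  # a failing container should not abort the whole run
    )
    return result.stdout
```

Keeping the command construction separate from its execution makes the SSH invocation easy to inspect in a dry run without touching ub01.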

Running the Probe

# On dev01:
cd ~/projects/forgejo-optimize/chrysopedia-health-probe

# Dry run — shows detections without filing issues
python3 probe.py --dry-run --verbose

# Live run — files/comments on Forgejo issues
python3 probe.py

Detection Matrix

Pattern                 Severity  Source              Labels
Python traceback        high      all container logs  bug
HTTP 500                high      api/web logs        bug
HTTP 502/503/504        high      web logs            infra
DB connection error     high      api/worker logs     database, infra
Qdrant error            high      api logs            search, infra
Redis error             high      api/worker logs     infra
OOM killed              high      docker inspect      infra
Celery task failure     medium    worker logs         celery, bug
Celery task timeout     medium    worker logs         celery, performance
Slow response (>2s)     medium    health endpoints    performance
Container restart       medium    docker inspect      infra
Pipeline stage failure  medium    worker logs         pipeline, bug
LightRAG unreachable    medium    health endpoint     infra
Disk space >85%         low       df /vmPool          infra
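
As an illustration of how log-based rows of this matrix might be expressed in code, the sketch below models three of them as regex detectors. The `Detector` class, the `scan` function, and the regular expressions are hypothetical; in particular, the HTTP status expressions assume an access-log format in which the status code follows the quoted request line.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    pattern: str             # name from the Pattern column
    severity: str            # high, medium, or low
    labels: tuple[str, ...]  # labels from the Labels column
    regex: re.Pattern[str]   # hypothetical expression, not the probe's own


DETECTORS = [
    Detector("Python traceback", "high", ("bug",),
             re.compile(r"Traceback \(most recent call last\):")),
    Detector("HTTP 500", "high", ("bug",),
             re.compile(r'" 500 ')),
    Detector("HTTP 502/503/504", "high", ("infra",),
             re.compile(r'" 50[234] ')),
]


def scan(log_text: str) -> list[tuple[Detector, list[str]]]:
    """Return every detector that matched, paired with its matching lines."""
    lines = log_text.splitlines()
    hits = []
    for detector in DETECTORS:
        matched = [line for line in lines if detector.regex.search(line)]
        if matched:
            hits.append((detector, matched))
    return hits
```

The matching lines returned by `scan` can serve as the evidence snippet when an issue is filed.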

Deduplication

Each detection gets a fingerprint ({container}::{pattern}) embedded in the issue title as [fp:xxx].

  • If an open issue with the same fingerprint exists → a comment is added instead of filing a new issue
  • If an issue accumulates 5 or more auto-comments without being resolved → it is escalated with the triage-needed label
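
The deduplication rules can be sketched as follows. The page does not say how the `xxx` in `[fp:xxx]` is derived, so the truncated SHA-1 digest here is an assumption, as are the function names and the simplified issue dictionaries (`title`, `auto_comments`).

```python
import hashlib
import re

FP_RE = re.compile(r"\[fp:([0-9a-f]+)\]")
ESCALATE_AFTER = 5  # auto-comments before triage-needed is applied


def fingerprint(container: str, pattern: str) -> str:
    """Derive a short, stable id from the {container}::{pattern} key.

    Assumption: an 8-character SHA-1 prefix; the real probe may differ.
    """
    key = f"{container}::{pattern}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


def decide(fp: str, open_issues: list[dict]) -> str:
    """Return 'new_issue', 'comment', or 'escalate' for a detection."""
    for issue in open_issues:
        match = FP_RE.search(issue["title"])
        if match and match.group(1) == fp:
            if issue.get("auto_comments", 0) >= ESCALATE_AFTER:
                return "escalate"
            return "comment"
    return "new_issue"
```

Because the fingerprint depends only on the container and the pattern, repeated detections of the same problem map to the same open issue regardless of the log text.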

Issue Labels

The probe uses these labels (auto-created on the repo):

Label                     Color           Purpose
auto-detected             blue            Filed by health probe
bug                       red             Application error
performance               yellow          Slow/high latency
infra                     purple          Infrastructure issue
celery                    light yellow    Worker/task issue
database                  green           PostgreSQL issue
search                    blue            Qdrant/embedding issue
pipeline                  light blue      LLM pipeline issue
triage-needed             pink            Needs human review
agent-fixable             light blue      Claude agent can fix
severity:high/medium/low  red/pink/light  Impact level

Issue Template

Auto-filed issues follow this structure:

## Auto-Detected: {Category}

**Detected:** {timestamp}
**Container:** {container_name}
**Severity:** {high|medium|low}

### Evidence
{log snippet — max 30 lines}

### Context
- Container uptime / restart count

### Suggested Investigation
1. Step 1
2. Step 2
3. Step 3
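
A body in this shape could be rendered with a small helper like the one below. This is a sketch under assumptions: the function and parameter names are hypothetical, the Context section is reduced to the restart count, and keeping the last 30 log lines (rather than the first 30) is a guess at how the limit is applied.

```python
from datetime import datetime, timezone

MAX_EVIDENCE_LINES = 30  # limit stated in the template


def render_issue_body(category, container, severity, log_lines,
                      restart_count, steps, detected=None):
    """Fill the auto-filed issue template and return it as one string."""
    detected = detected or datetime.now(timezone.utc)
    # Assumption: the most recent lines are the most useful evidence.
    evidence = log_lines[-MAX_EVIDENCE_LINES:]
    parts = [
        f"## Auto-Detected: {category}",
        "",
        f"**Detected:** {detected.isoformat()}",
        f"**Container:** {container}",
        f"**Severity:** {severity}",
        "",
        "### Evidence",
        *evidence,
        "",
        "### Context",
        f"- Restart count: {restart_count}",
        "",
        "### Suggested Investigation",
        *[f"{n}. {step}" for n, step in enumerate(steps, start=1)],
    ]
    return "\n".join(parts)
```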

Bugfix Agent

A scheduled Claude agent can triage auto-detected issues:

Prompt: forgejo-optimize/chrysopedia-bugfix-agent-prompt.md

Workflow:

  1. Lists open issues labeled auto-detected + agent-fixable
  2. For each (oldest first, max 3 per run):
    • SSHs to ub01, reads logs, inspects source code
    • Classifies as: agent-fixable, needs-human, transient, or resolved
    • If fixable: creates branch, commits fix, opens PR
    • If not: posts root-cause analysis comment
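
The selection step of this workflow (both labels present, oldest first, at most 3 per run) can be sketched as below. The issue dictionaries are a simplified, hypothetical shape with label names as plain strings; a real Forgejo API response nests labels as objects, so some mapping would be needed first.

```python
REQUIRED_LABELS = {"auto-detected", "agent-fixable"}
MAX_PER_RUN = 3


def select_issues(issues: list[dict]) -> list[dict]:
    """Pick the open issues the agent should handle in this run."""
    eligible = [
        issue for issue in issues
        if issue["state"] == "open"
        and REQUIRED_LABELS <= set(issue["labels"])
    ]
    # Assumption: created_at values are ISO 8601 strings in the same
    # timezone, so plain string ordering is chronological.
    eligible.sort(key=lambda issue: issue["created_at"])
    return eligible[:MAX_PER_RUN]
```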

Manual Monitoring

Check             Command
Container status  ssh ub01 "docker ps --filter name=chrysopedia"
API health        ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"
LightRAG health   ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"
Resource usage    ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' | grep chrysopedia"
Recent errors     ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 | grep -i error"
Disk space        ssh ub01 "df -h /vmPool"
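
For scripting the disk check rather than reading it by eye, the output of `df` can be parsed as in the sketch below. It assumes the POSIX output format from `df -P /vmPool`, where the fifth column is the capacity percentage, and reuses the 85% threshold from the Detection Matrix; the function names are hypothetical.

```python
DISK_THRESHOLD_PERCENT = 85  # threshold from the Detection Matrix


def parse_df_capacity(df_output: str) -> int:
    """Return the used-capacity percentage from `df -P <path>` output."""
    # The last non-empty line describes the filesystem; the header precedes it.
    fields = df_output.strip().splitlines()[-1].split()
    return int(fields[4].rstrip("%"))


def disk_over_threshold(df_output: str,
                        threshold: int = DISK_THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the threshold."""
    return parse_df_capacity(df_output) > threshold
```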

See also: Deployment, Architecture