Monitoring

Automated Health Probe

Chrysopedia has an automated health probe that monitors the Docker stack on ub01, detects problems, and files them as Forgejo issues.

Location: forgejo-optimize/chrysopedia-health-probe/ (on dev01)

How It Works

dev01 (probe.py)
  ├── SSH → ub01: docker logs, docker inspect, df
  ├── SSH → ub01: docker exec curl (health endpoints)
  └── HTTPS → git.xpltd.co: Forgejo API (files/comments on issues)

1. Pulls the last 30 minutes of container logs from all 10 Chrysopedia containers
2. Runs 14 pattern detectors (see table below)
3. Checks health endpoints and response times
4. Checks container restart counts and OOM status
5. Checks disk space on /vmPool
6. Deduplicates against existing open Forgejo issues
7. Files new issues or adds comments to existing ones
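The log-collection step can be sketched as follows. This is a minimal illustration, not the contents of probe.py: the function names `build_log_command` and `fetch_logs` are hypothetical, and the sketch assumes the probe shells out to `ssh` and uses `docker logs --since` to bound the window.

```python
import subprocess

LOG_WINDOW = "30m"  # the 30-minute window described above


def build_log_command(host: str, container: str, window: str = LOG_WINDOW) -> list[str]:
    """Return the argv that fetches recent logs for one container over SSH."""
    # 2>&1 merges stderr into stdout, since docker logs replays both streams.
    remote = f"docker logs --since {window} {container} 2>&1"
    return ["ssh", host, remote]


def fetch_logs(host: str, container: str) -> str:
    """Run the command and return the log text (requires SSH access to host)."""
    result = subprocess.run(
        build_log_command(host, container),
        capture_output=True,
        text=True,
        timeout=60,   # hypothetical timeout so a hung SSH session cannot stall the run
        check=False,  # a failing container should not abort the whole probe
    )
    return result.stdout
```

Keeping command construction separate from execution makes the command easy to inspect in a dry run without touching ub01.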

Running the Probe

# On dev01:
cd ~/projects/forgejo-optimize/chrysopedia-health-probe

# Dry run — shows detections without filing issues
python3 probe.py --dry-run --verbose

# Live run — files/comments on Forgejo issues
python3 probe.py

Detection Matrix

Pattern	Severity	Source	Labels
Python traceback	high	all container logs	`bug`
HTTP 500	high	api/web logs	`bug`
HTTP 502/503/504	high	web logs	`infra`
DB connection error	high	api/worker logs	`database`, `infra`
Qdrant error	high	api logs	`search`, `infra`
Redis error	high	api/worker logs	`infra`
OOM killed	high	docker inspect	`infra`
Celery task failure	medium	worker logs	`celery`, `bug`
Celery task timeout	medium	worker logs	`celery`, `performance`
Slow response (>2s)	medium	health endpoints	`performance`
Container restart	medium	docker inspect	`infra`
Pipeline stage failure	medium	worker logs	`pipeline`, `bug`
LightRAG unreachable	medium	health endpoint	`infra`
Disk space >85%	low	`df /vmPool`	`infra`
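One way to represent rows of this matrix in code is a list of detector records, each pairing a regular expression with a severity and labels. The sketch below covers three rows only. The regular expressions are assumptions (they presume access-log lines such as `"GET /x HTTP/1.1" 500 12`), and the names `Detector`, `DETECTORS`, and `detect` are hypothetical rather than taken from probe.py.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    name: str
    regex: re.Pattern
    severity: str
    labels: tuple


# Illustrative subset of the matrix; the patterns are assumptions.
DETECTORS = [
    Detector("Python traceback",
             re.compile(r"Traceback \(most recent call last\)"), "high", ("bug",)),
    Detector("HTTP 500",
             re.compile(r'HTTP/\d\.\d" 500\b'), "high", ("bug",)),
    Detector("HTTP 502/503/504",
             re.compile(r'HTTP/\d\.\d" 50[234]\b'), "high", ("infra",)),
]


def detect(log_text: str) -> list:
    """Return (detector, matching_lines) pairs for every detector that fires."""
    lines = log_text.splitlines()
    hits = []
    for det in DETECTORS:
        matching = [line for line in lines if det.regex.search(line)]
        if matching:
            hits.append((det, matching))
    return hits
```

Each hit keeps its matching lines so they can later serve as evidence in the filed issue.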

Deduplication

Each detection gets a fingerprint ({container}::{pattern}) embedded in the issue title as [fp:xxx].

- If an open issue with the same fingerprint already exists, the probe adds a comment instead of opening a new issue.
- If an issue accumulates 5 or more auto-comments without being resolved, it is escalated with the triage-needed label.
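The deduplication rules can be sketched as a small decision function. Several details here are assumptions: the page does not say how `xxx` in `[fp:xxx]` is derived, so this sketch uses the first 8 hex characters of a SHA-1 hash. The issue dictionaries (`title`, `auto_comments`) are a hypothetical shape, and the escalation check is assumed to run against the count of existing auto-comments.

```python
import hashlib
import re

FP_TAG = re.compile(r"\[fp:([0-9a-f]+)\]")
ESCALATE_AFTER = 5  # "5+ auto-comments" from the rule above


def fingerprint(container: str, pattern: str) -> str:
    """Derive a short tag from {container}::{pattern} (hashing is an assumption)."""
    raw = f"{container}::{pattern}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:8]


def decide(fp: str, open_issues: list) -> str:
    """Return 'create', 'comment', or 'escalate' for a new detection."""
    for issue in open_issues:
        match = FP_TAG.search(issue["title"])
        if match and match.group(1) == fp:
            if issue["auto_comments"] >= ESCALATE_AFTER:
                return "escalate"  # comment and add the triage-needed label
            return "comment"
    return "create"
```

Because the fingerprint depends only on the container and the pattern, repeated occurrences of the same fault map to the same issue regardless of the log text.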

Issue Labels

The probe uses these labels (auto-created on the repo):

Label	Color	Purpose
`auto-detected`	blue	Filed by health probe
`bug`	red	Application error
`performance`	yellow	Slow/high latency
`infra`	purple	Infrastructure issue
`celery`	light yellow	Worker/task issue
`database`	green	PostgreSQL issue
`search`	blue	Qdrant/embedding issue
`pipeline`	light blue	LLM pipeline issue
`triage-needed`	pink	Needs human review
`agent-fixable`	light blue	Claude agent can fix
`severity:high/medium/low`	red/pink/light	Impact level

Issue Template

Auto-filed issues follow this structure:

## Auto-Detected: {Category}

**Detected:** {timestamp}
**Container:** {container_name}
**Severity:** {high|medium|low}

### Evidence
{log snippet — max 30 lines}

### Context
- Container uptime / restart count

### Suggested Investigation
1. Step 1
2. Step 2
3. Step 3
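Filling this template might look like the sketch below. The function name and parameters are hypothetical. The template caps evidence at 30 lines but does not say which lines are kept, so the sketch assumes the most recent 30.

```python
from datetime import datetime, timezone

MAX_EVIDENCE_LINES = 30  # cap stated in the template above


def render_issue_body(category, container, severity, log_lines,
                      restart_count, steps, detected=None):
    """Fill the issue template; all parameter names are hypothetical."""
    detected = detected or datetime.now(timezone.utc)
    evidence = log_lines[-MAX_EVIDENCE_LINES:]  # assumption: keep the newest lines
    parts = [
        f"## Auto-Detected: {category}",
        "",
        f"**Detected:** {detected.isoformat()}",
        f"**Container:** {container}",
        f"**Severity:** {severity}",
        "",
        "### Evidence",
        *evidence,
        "",
        "### Context",
        f"- Restart count: {restart_count}",
        "",
        "### Suggested Investigation",
        *[f"{n}. {step}" for n, step in enumerate(steps, start=1)],
    ]
    return "\n".join(parts)
```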

Bugfix Agent

A scheduled Claude agent can triage auto-detected issues:

Prompt: forgejo-optimize/chrysopedia-bugfix-agent-prompt.md

Workflow:

1. Lists open issues labeled auto-detected + agent-fixable
2. For each (oldest first, max 3 per run):
   - SSHs to ub01, reads logs, inspects source code
   - Classifies as: agent-fixable, needs-human, transient, or resolved
   - If fixable: creates branch, commits fix, opens PR
   - If not: posts root-cause analysis comment
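The selection rule in this workflow (both labels present, oldest first, at most 3 per run) can be sketched as below. The issue dictionaries are a simplified, hypothetical shape with label names as plain strings. The sketch also assumes `created_at` values are ISO 8601 timestamps in a single timezone, so string order matches chronological order.

```python
MAX_ISSUES_PER_RUN = 3
REQUIRED_LABELS = {"auto-detected", "agent-fixable"}


def select_issues(issues: list) -> list:
    """Pick the oldest open issues that carry both required labels."""
    eligible = [
        issue for issue in issues
        if issue["state"] == "open" and REQUIRED_LABELS <= set(issue["labels"])
    ]
    # ISO 8601 strings in one timezone sort chronologically as plain strings.
    eligible.sort(key=lambda issue: issue["created_at"])
    return eligible[:MAX_ISSUES_PER_RUN]
```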

Manual Monitoring

What	How
Container status	`ssh ub01 "docker ps --filter name=chrysopedia"`
API health	`ssh ub01 "docker exec chrysopedia-api curl -s http://localhost:8000/health"`
LightRAG health	`ssh ub01 "docker exec chrysopedia-api curl -s http://chrysopedia-lightrag:9621/health"`
Resource usage	`ssh ub01 "docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}' | grep chrysopedia"`
Recent errors	`ssh ub01 "docker logs chrysopedia-api --tail 50 --since 1h 2>&1 | grep -i error"`
Disk space	`ssh ub01 "df -h /vmPool"`
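The disk-space check can also be scripted by parsing the `df` output and comparing it with the 85% threshold from the detection matrix. This is a minimal sketch: the function names are hypothetical, and it assumes the last line of output describes /vmPool and contains a single percentage column.

```python
DISK_THRESHOLD_PERCENT = 85  # threshold from the detection matrix


def disk_usage_percent(df_output: str) -> int:
    """Extract the Use% value from the last line of df output."""
    last_line = df_output.strip().splitlines()[-1]
    for field in last_line.split():
        if field.endswith("%"):
            return int(field.rstrip("%"))
    raise ValueError("no percentage column found in df output")


def disk_alert(df_output: str) -> bool:
    """True when usage is strictly above the threshold."""
    return disk_usage_percent(df_output) > DISK_THRESHOLD_PERCENT
```

The output of `ssh ub01 "df -h /vmPool"` can be passed to `disk_alert` directly.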

See also: Deployment, Architecture
