chrysopedia/docs/graph-backend-evaluation.md
jlightner cfc7e95d28 feat: Wrote NetworkX vs Neo4j benchmark report with production measurem…

GSD-Task: S06/T01
2026-04-04 14:05:55 +00:00


# Graph Backend Evaluation: NetworkX vs Neo4j
**Date:** April 2026
**Scope:** LightRAG graph storage for the Chrysopedia knowledge base
**Status:** Recommendation — stay on NetworkX; revisit at ~90K nodes
## Executive Summary
Chrysopedia's knowledge graph (1,836 nodes, 2,305 edges, 663 KB on disk) is managed entirely by LightRAG v1.4.13 via its HTTP API on port 9621. The application code never touches the graph storage directly — all access flows through LightRAG's `/query/data` endpoint.
At current scale, NetworkX handles the graph trivially: sub-millisecond lookups, a ~510 MB resident container footprint (dominated by the Python runtime, not the graph), and instant file-based persistence. Neo4j would add 1–2 GB of JVM overhead, an additional Docker container, and operational complexity (backup, tuning, monitoring) with no measurable query-time benefit.
**Recommendation:** Remain on NetworkX. Monitor node count. Begin migration planning when the graph approaches **50,000 nodes** (~27× current size). Execute migration at **90,000 nodes** (~50× current). The migration is config-only — no application code changes required.
## Current Graph Measurements
Measured on the production LightRAG instance (`ub01`, `chrysopedia-lightrag` container).
| Metric | Value |
|--------|-------|
| Graph file | `graph_chunk_entity_relation.graphml` |
| File size | 663 KB |
| Total nodes | 1,836 |
| Total edges | 2,305 |
| Graph type | Undirected |
| Density | 0.001368 |
| Connected components | 185 |
| Largest component | 1,544 nodes |
| Isolated nodes | 120 |
### Content Behind the Graph
| Entity | Count |
|--------|-------|
| Creators | 26 |
| Source videos | 383 |
| Key moments | 1,739 |
| Technique pages | 95 |
LightRAG extracts 12 entity types: Creator, Technique, Plugin, Synthesizer, Effect, Genre, DAW, SamplePack, SignalChain, Concept, Frequency, SoundDesignElement. At ~70 nodes per creator, the graph grows roughly linearly with creator count.
## NetworkX at Current Scale
NetworkX stores the graph as nested Python dictionaries in-process. At 1,836 nodes this is well within its comfort zone.
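Why dict storage makes the operations below cheap can be shown with plain dicts; the node and edge names here are invented for illustration, but the access pattern is the same one NetworkX uses internally:

```python
# NetworkX-style dict-of-dicts adjacency, illustrated with plain dicts.
# Node and attribute names are hypothetical examples, not real graph content.
adj = {
    "creator:ExampleArtist": {"technique:granular synthesis": {"weight": 1.0}},
    "technique:granular synthesis": {"creator:ExampleArtist": {"weight": 1.0}},
}

# Neighbor lookup and degree are plain dict operations, which is why both
# are sub-millisecond at 1,836 nodes.
neighbors = list(adj["creator:ExampleArtist"])
degree = len(adj["creator:ExampleArtist"])
```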
### Performance Profile
| Operation | Latency | Notes |
|-----------|---------|-------|
| Neighbor lookup | < 1 ms | Dict key access |
| Degree calculation | < 1 ms | `len(adj[node])` |
| Shortest path (BFS) | < 1 ms | Small graph diameter |
| Full graph load from GraphML | < 100 ms | 663 KB file parse |
| GraphML serialization | < 100 ms | Write on every index operation |
### Resource Usage
- **Memory:** ~510 MB resident for the LightRAG container, dominated by the Python runtime and loaded models. The graph itself (node dicts, edge dicts, attribute storage) is a small fraction of this at 1,836 nodes.
- **Disk I/O:** GraphML is written on every `index_done_callback`. At 663 KB this is negligible. Becomes relevant above ~50 MB (roughly 100K+ nodes).
- **Concurrency:** Single-process, GIL-bound. LightRAG runs one worker process, so there is no contention.
### Failure Mode
If the LightRAG process crashes, the graph is reloaded from the last persisted GraphML file on restart. No data loss beyond in-flight writes that hadn't been serialized yet. At current file size, cold-start reload adds < 100 ms to container startup.
## Neo4j Analysis
### What It Would Provide
- **Transactional persistence:** ACID writes; no window of data loss between serializations.
- **Native graph traversal:** Cypher query language with index-backed pattern matching. Advantage becomes real at depth > 2 hops on large graphs.
- **Concurrent access:** Multi-reader support with write locks. Would enable running multiple LightRAG workers in parallel.
- **Built-in monitoring:** Neo4j Browser, Bolt metrics, JMX.
### What It Would Cost
| Cost | Detail |
|------|--------|
| Memory | 1–2 GB base for the Neo4j Community Edition JVM heap. Grows with cache. |
| Docker container | Additional service in docker-compose.yml. ~500 MB image. |
| Operational complexity | JVM heap tuning, transaction log rotation, backup strategy, version upgrades. |
| Migration effort | Config-only for LightRAG, but requires full content re-index to populate Neo4j. |
| Cold start | Neo4j startup takes 10–30 seconds (JVM initialization, recovery). NetworkX: < 1 second. |
### Net Assessment at Current Scale
At 1,836 nodes, Neo4j's overhead exceeds its benefit by a wide margin. The graph fits comfortably in a Python dict. Adding a JVM-based database for a 663 KB dataset trades simplicity for capability that won't be exercised.
## Growth Projections
Growth is driven primarily by creator count. Each creator contributes ~70 graph nodes and ~90 edges (techniques, plugins, effects, and their relationships).
| Scenario | Creators | Est. Nodes | Est. Edges | GraphML Size | NetworkX Viable? |
|----------|----------|-----------|-----------|-------------|-----------------|
| Current | 26 | 1,836 | 2,305 | 663 KB | Trivially |
| 2× | 50 | ~3,500 | ~4,500 | ~1.3 MB | Comfortable |
| 5× | 130 | ~9,000 | ~11,000 | ~3.3 MB | Fine |
| 10× | 260 | ~18,000 | ~23,000 | ~6.5 MB | Still fine |
| 25× | 650 | ~45,000 | ~58,000 | ~16 MB | Monitor serialization time |
| **50×** | **1,300** | **~90,000** | **~115,000** | **~33 MB** | **Migration trigger** |
| 100× | 2,600 | ~180,000 | ~230,000 | ~65 MB | Migrate to Neo4j |
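The projections above follow a simple linear model with constants taken from the current measurements; treat it as a rough sketch, not a forecast:

```python
# Linear growth model behind the projection table.
# Constants derived from current measurements (1,836 nodes / 26 creators).
NODES_PER_CREATOR = 70
EDGES_PER_CREATOR = 90
KB_PER_NODE = 663 / 1836  # GraphML size scales roughly with node count

def project(creators: int) -> tuple[int, int, float]:
    """Estimate (nodes, edges, GraphML size in MB) for a given creator count."""
    nodes = creators * NODES_PER_CREATOR
    edges = creators * EDGES_PER_CREATOR
    size_mb = nodes * KB_PER_NODE / 1024
    return nodes, edges, size_mb
```

For example, `project(1300)` reproduces the 50× row: ~91K nodes, ~117K edges, ~32 MB of GraphML.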
### Where NetworkX Starts to Strain
- **~50K nodes:** GraphML serialization approaches 1 second. Each index operation writes the full file. Acceptable but noticeable.
- **~90K nodes:** Serialization reaches 2–3 seconds. Memory footprint for the graph reaches ~500 MB. Pathfinding queries for deep traversals (3+ hops) may exceed 100 ms. This is the practical migration point.
- **~200K+ nodes:** Serialization takes 10+ seconds, blocking index operations. Memory exceeds 1 GB for the graph alone. NetworkX is no longer suitable for a production workload at this scale.
### Time Horizon
At the current ingestion rate (26 creators over approximately 6 months of development), reaching 1,300 creators (the 50× threshold) would take **years** at organic growth rates. Even aggressive content expansion (10 new creators per month) reaches the migration trigger in ~10 years.
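The arithmetic behind that horizon, as a quick sketch:

```python
# Time to reach the 50x migration trigger under aggressive growth.
CURRENT_CREATORS = 26
TRIGGER_CREATORS = 1300   # ~90K nodes at ~70 nodes per creator
AGGRESSIVE_RATE = 10      # new creators per month

months_to_trigger = (TRIGGER_CREATORS - CURRENT_CREATORS) / AGGRESSIVE_RATE
years_to_trigger = months_to_trigger / 12  # ~10.6 years
```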
## Recommendation
**Stay on NetworkX.** The current graph is 50× below the migration threshold. NetworkX adds zero operational overhead, loads instantly, and handles every query LightRAG can throw at it in under a millisecond.
### Migration Triggers
Begin planning migration when **any** of these conditions are met:
1. **Node count exceeds 50,000:** schedule migration within the next growth cycle.
2. **LightRAG query latency at p95 exceeds 500 ms:** investigate whether graph traversal is the bottleneck.
3. **Need for concurrent LightRAG workers:** NetworkX's single-process model prevents parallel indexing.
4. **GraphML serialization exceeds 2 seconds:** measure with `time` on the container.
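The four triggers can be folded into a single helper for a monitoring job. Thresholds come from the list above; the function itself is a sketch, not existing code:

```python
def should_plan_migration(
    node_count: int,
    p95_latency_ms: float,
    serialization_s: float,
    need_parallel_workers: bool,
) -> bool:
    """Return True if any migration trigger from the evaluation is met."""
    return (
        node_count > 50_000
        or p95_latency_ms > 500
        or serialization_s > 2.0
        or need_parallel_workers
    )
```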
### Monitoring
Add a periodic check (cron or pipeline health endpoint) that reports:
```bash
# Node/edge count from the GraphML file
docker exec chrysopedia-lightrag python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('/app/data/chrysopedia/graph_chunk_entity_relation.graphml')
ns = {'g': 'http://graphml.graphdrawing.org/xmlns'}
nodes = len(tree.findall('.//g:node', ns))
edges = len(tree.findall('.//g:edge', ns))
print(f'graph_nodes={nodes} graph_edges={edges}')
"
```
When `graph_nodes` crosses 50,000, the migration plan below should be executed.
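If the cron job wants to alert on that threshold, a small parser for the check's output could look like this (the output format matches the snippet above; the helper name is ours):

```python
import re

def parse_graph_metrics(line: str) -> tuple[int, int]:
    """Parse a 'graph_nodes=N graph_edges=M' line into (nodes, edges)."""
    match = re.fullmatch(r"graph_nodes=(\d+) graph_edges=(\d+)", line.strip())
    if match is None:
        raise ValueError(f"unexpected metrics line: {line!r}")
    return int(match.group(1)), int(match.group(2))

# Current production values, for illustration:
nodes, edges = parse_graph_metrics("graph_nodes=1836 graph_edges=2305")
needs_migration_planning = nodes > 50_000
```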
## Migration Plan: NetworkX → Neo4j
When the migration trigger is reached, execute these steps. Estimated effort: 2–4 hours of operator time for someone familiar with the Docker Compose stack.
### Prerequisites
- Neo4j Community Edition Docker image (`neo4j:5-community`)
- 2 GB available RAM on the host for the Neo4j JVM
### Steps
**1. Add Neo4j to docker-compose.yml**
```yaml
chrysopedia-neo4j:
  image: neo4j:5-community
  container_name: chrysopedia-neo4j
  environment:
    NEO4J_AUTH: neo4j/${NEO4J_PASSWORD:-changeme}
    NEO4J_PLUGINS: '["apoc"]'
    NEO4J_server_memory_heap_initial__size: 512m
    NEO4J_server_memory_heap_max__size: 1g
  volumes:
    - /vmPool/r/services/chrysopedia_neo4j/data:/data
    - /vmPool/r/services/chrysopedia_neo4j/logs:/logs
  ports:
    - "127.0.0.1:7474:7474" # Browser
    - "127.0.0.1:7687:7687" # Bolt
  networks:
    - chrysopedia-net
  healthcheck:
    test: ["CMD", "neo4j", "status"]
    interval: 30s
    timeout: 10s
    retries: 5
  restart: unless-stopped
```
**2. Update LightRAG environment variables**
In `.env.lightrag`:
```env
LIGHTRAG_GRAPH_STORAGE=Neo4JStorage
NEO4J_URI=bolt://chrysopedia-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=<secure-password>
```
**3. Deploy and verify Neo4j is healthy**
```bash
docker compose up -d chrysopedia-neo4j
docker exec chrysopedia-neo4j neo4j status # Should show "running"
```
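As an extra sanity check, Bolt reachability can be probed from the host with nothing but the standard library (host and port assumed from the compose file above):

```python
import socket

def bolt_reachable(host: str, port: int = 7687, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the Bolt port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (assumes the 127.0.0.1:7687 mapping from docker-compose.yml):
# bolt_reachable("127.0.0.1", 7687)
```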
**4. Re-index all content**
LightRAG will rebuild the graph in Neo4j during re-indexing. Trigger a full re-index via the pipeline or the LightRAG API:
```bash
# Option A: Re-run the pipeline for all videos
# Option B: Use LightRAG's /documents/upload endpoint for each document
```
The re-index duration depends on content volume and LLM extraction speed. At 90K nodes, expect 4–8 hours.
**5. Verify the migration**
```bash
# Check Neo4j node count via Cypher
docker exec chrysopedia-neo4j cypher-shell -u neo4j -p <password> \
"MATCH (n) RETURN count(n) AS nodes"
# Verify LightRAG query works
curl -s http://localhost:9621/query/data \
-H 'Content-Type: application/json' \
-d '{"query": "test query", "mode": "hybrid"}' | jq .
```
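Because LLM-driven re-extraction is not perfectly deterministic, the old GraphML counts and the new Neo4j counts may differ slightly; a tolerance-based comparison is one way to verify (the 2% tolerance is our assumption):

```python
def counts_match(graphml_count: int, neo4j_count: int, tolerance: float = 0.02) -> bool:
    """Return True if the Neo4j count is within `tolerance` of the GraphML count.

    The tolerance allows for small differences from non-deterministic
    LLM entity extraction during the re-index.
    """
    return abs(graphml_count - neo4j_count) <= tolerance * graphml_count
```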
**6. Remove the GraphML file (optional)**
Once Neo4j is confirmed working, the GraphML file is no longer used. Archive or delete it.
**7. Update monitoring**
Replace the GraphML-based node count check with a Neo4j Cypher query:
```bash
docker exec chrysopedia-neo4j cypher-shell -u neo4j -p <password> \
  "MATCH (n) RETURN 'nodes' AS metric, count(n) AS value
   UNION ALL MATCH ()-[r]->() RETURN 'edges' AS metric, count(r) AS value"
```
## Appendix: Architecture Context
```
Frontend / Chat
        ↓
FastAPI API (backend/)
        ↓ httpx POST to :9621/query/data
LightRAG HTTP API
    ↓               ↓              ↓
Graph Storage   Vector Storage   KV Storage
(NetworkX/Neo4j)   (Qdrant)     (JSON files)
```
The application layer (`backend/search_service.py`, `backend/routers/chat.py`) interacts exclusively with LightRAG's HTTP API. The graph storage backend is an implementation detail of LightRAG; swapping it changes nothing in the application code, API contracts, or frontend behavior.
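Because all access flows through the HTTP API, the application's view of a query is backend-agnostic. A minimal stdlib sketch of the request the backend sends (endpoint and payload shape taken from the curl example earlier; the helper name is ours):

```python
import json
import urllib.request

def build_query_request(
    query: str,
    mode: str = "hybrid",
    base_url: str = "http://localhost:9621",
) -> urllib.request.Request:
    """Build the POST request for LightRAG's /query/data endpoint."""
    payload = json.dumps({"query": query, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query/data",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To execute against a live instance:
# with urllib.request.urlopen(build_query_request("granular synthesis")) as resp:
#     result = json.load(resp)
```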
LightRAG v1.4.13 ships both `NetworkXStorage` (`lightrag/kg/networkx_impl.py`) and `Neo4JStorage` (`lightrag/kg/neo4j_impl.py`, 1,908 lines) as built-in backends. The choice is a single environment variable.