chrysopedia/docs/graph-backend-evaluation.md
jlightner cfc7e95d28 feat: Wrote NetworkX vs Neo4j benchmark report with production measurem…
- "docs/graph-backend-evaluation.md"

GSD-Task: S06/T01
2026-04-04 14:05:55 +00:00


Graph Backend Evaluation: NetworkX vs Neo4j

Date: April 2026
Scope: LightRAG graph storage for the Chrysopedia knowledge base
Status: Recommendation — stay on NetworkX; revisit at ~90K nodes

Executive Summary

Chrysopedia's knowledge graph (1,836 nodes, 2,305 edges, 663 KB on disk) is managed entirely by LightRAG v1.4.13 via its HTTP API on port 9621. The application code never touches the graph storage directly — all access flows through LightRAG's /query/data endpoint.

At current scale, NetworkX handles the graph trivially: sub-millisecond lookups, ~5–10 MB of resident memory for the graph itself, and instant file-based persistence. Neo4j would add 1–2 GB of JVM overhead, an additional Docker container, and operational complexity (backup, tuning, monitoring) with no measurable query-time benefit.

Recommendation: Remain on NetworkX. Monitor node count. Begin migration planning when the graph approaches 50,000 nodes (~27× current size). Execute migration at 90,000 nodes (~50× current). The migration is config-only — no application code changes required.

Current Graph Measurements

Measured on the production LightRAG instance (ub01, chrysopedia-lightrag container).

| Metric | Value |
| --- | --- |
| Graph file | graph_chunk_entity_relation.graphml |
| File size | 663 KB |
| Total nodes | 1,836 |
| Total edges | 2,305 |
| Graph type | Undirected |
| Density | 0.001368 |
| Connected components | 185 |
| Largest component | 1,544 nodes |
| Isolated nodes | 120 |
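The density figure can be cross-checked from the node and edge counts: for an undirected graph, density is 2E / (N(N−1)). A quick sanity check in Python (standalone arithmetic, not LightRAG code):

```python
# Undirected graph density: 2E / (N * (N - 1))
nodes, edges = 1_836, 2_305
density = 2 * edges / (nodes * (nodes - 1))
print(round(density, 6))  # 0.001368, matching the measured value
```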

Content Behind the Graph

| Entity | Count |
| --- | --- |
| Creators | 26 |
| Source videos | 383 |
| Key moments | 1,739 |
| Technique pages | 95 |

LightRAG extracts 12 entity types: Creator, Technique, Plugin, Synthesizer, Effect, Genre, DAW, SamplePack, SignalChain, Concept, Frequency, SoundDesignElement. At ~70 nodes per creator, the graph grows roughly linearly with creator count.

NetworkX at Current Scale

NetworkX stores the graph as nested Python dictionaries in-process. At 1,836 nodes this is well within its comfort zone.

Performance Profile

| Operation | Latency | Notes |
| --- | --- | --- |
| Neighbor lookup | < 1 ms | Dict key access |
| Degree calculation | < 1 ms | `len(adj[node])` |
| Shortest path (BFS) | < 1 ms | Small graph diameter |
| Full graph load from GraphML | < 100 ms | 663 KB file parse |
| GraphML serialization | < 100 ms | Write on every index operation |

Resource Usage

  • Memory: ~5–10 MB resident for the in-memory graph (node dicts, edge dicts, attribute storage). The LightRAG container's total footprint is dominated by the Python runtime and loaded models, not the graph.
  • Disk I/O: GraphML is written on every index_done_callback. At 663 KB this is negligible. Becomes relevant above ~50 MB (roughly 100K+ nodes).
  • Concurrency: Single-process, GIL-bound. LightRAG runs one worker process, so there is no contention.
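To get a feel for how serialization cost scales, a rough stdlib-only benchmark can generate a synthetic GraphML file of a given size and time the write. This is a sketch: real LightRAG serialization goes through NetworkX and includes node/edge attributes, so actual numbers will be somewhat higher.

```python
import tempfile
import time
import xml.etree.ElementTree as ET

def write_synthetic_graphml(n_nodes: int, n_edges: int, path: str) -> float:
    """Write a bare-bones GraphML file and return elapsed seconds."""
    start = time.perf_counter()
    root = ET.Element('graphml', xmlns='http://graphml.graphdrawing.org/xmlns')
    graph = ET.SubElement(root, 'graph', edgedefault='undirected')
    for i in range(n_nodes):
        ET.SubElement(graph, 'node', id=f'n{i}')
    for i in range(n_edges):
        # Arbitrary topology; only element count matters for write timing.
        ET.SubElement(graph, 'edge', source=f'n{i % n_nodes}',
                      target=f'n{(i + 1) % n_nodes}')
    ET.ElementTree(root).write(path)
    return time.perf_counter() - start

with tempfile.NamedTemporaryFile(suffix='.graphml') as f:
    elapsed = write_synthetic_graphml(1_836, 2_305, f.name)
    print(f'{elapsed * 1000:.1f} ms')  # well under 100 ms at current scale
```

Re-running with the projected 90K-node counts gives a quick read on whether the 2-second serialization trigger is approaching.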

Failure Mode

If the LightRAG process crashes, the graph is reloaded from the last persisted GraphML file on restart. No data loss beyond in-flight writes that hadn't been serialized yet. At current file size, cold-start reload adds < 100 ms to container startup.

Neo4j Analysis

What It Would Provide

  • Transactional persistence: ACID writes — no window of data loss between serializations.
  • Native graph traversal: Cypher query language with index-backed pattern matching. Advantage becomes real at depth > 2 hops on large graphs.
  • Concurrent access: Multi-reader support with write locks. Would enable running multiple LightRAG workers in parallel.
  • Built-in monitoring: Neo4j Browser, Bolt metrics, JMX.

What It Would Cost

| Cost | Detail |
| --- | --- |
| Memory | 1–2 GB base for the Neo4j Community Edition JVM heap. Grows with cache. |
| Docker container | Additional service in docker-compose.yml. ~500 MB image. |
| Operational complexity | JVM heap tuning, transaction log rotation, backup strategy, version upgrades. |
| Migration effort | Config-only for LightRAG, but requires a full content re-index to populate Neo4j. |
| Cold start | Neo4j startup takes 10–30 seconds (JVM initialization, recovery). NetworkX: < 1 second. |

Net Assessment at Current Scale

At 1,836 nodes, Neo4j's overhead exceeds its benefit by a wide margin. The graph fits comfortably in a Python dict. Adding a JVM-based database for a 663 KB dataset trades simplicity for capability that won't be exercised.

Growth Projections

Growth is driven primarily by creator count. Each creator contributes ~70 graph nodes and ~90 edges (techniques, plugins, effects, and their relationships).
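This linear model can be written down directly. A sketch, where the per-creator constants are rounded averages from the current graph and are assumptions, not guarantees:

```python
# Rounded empirical averages from the current graph (assumptions):
NODES_PER_CREATOR = 70  # ~1,836 nodes / 26 creators
EDGES_PER_CREATOR = 90  # ~2,305 edges / 26 creators

def project(creators: int) -> tuple[int, int]:
    """Estimate node and edge counts for a given creator count."""
    return creators * NODES_PER_CREATOR, creators * EDGES_PER_CREATOR

print(project(1_300))  # (91000, 117000), near the 50x migration trigger
```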

| Scenario | Creators | Est. Nodes | Est. Edges | GraphML Size | NetworkX Viable? |
| --- | --- | --- | --- | --- | --- |
| Current | 26 | 1,836 | 2,305 | 663 KB | Trivially |
| 2× | 50 | ~3,500 | ~4,500 | ~1.3 MB | Comfortable |
| 5× | 130 | ~9,000 | ~11,000 | ~3.3 MB | Fine |
| 10× | 260 | ~18,000 | ~23,000 | ~6.5 MB | Still fine |
| 25× | 650 | ~45,000 | ~58,000 | ~16 MB | Monitor serialization time |
| 50× | 1,300 | ~90,000 | ~115,000 | ~33 MB | ⚠️ Migration trigger |
| 100× | 2,600 | ~180,000 | ~230,000 | ~65 MB | Migrate to Neo4j |

Where NetworkX Starts to Strain

  • ~50K nodes: GraphML serialization approaches 1 second. Each index operation writes the full file. Acceptable but noticeable.
  • ~90K nodes: Serialization takes 2–3 seconds. Memory footprint reaches ~500 MB. Pathfinding queries for deep traversals (3+ hops) may exceed 100 ms. This is the practical migration point.
  • ~200K+ nodes: Serialization takes 10+ seconds, blocking index operations. Memory exceeds 1 GB for the graph alone. NetworkX is no longer suitable for a production workload at this scale.

Time Horizon

At the current ingestion rate (26 creators over approximately 6 months of development), reaching 1,300 creators (the 50× threshold) would take years at organic growth rates. Even aggressive content expansion (10 new creators per month) reaches the migration trigger in ~10 years.
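The arithmetic behind that horizon, using the assumed aggressive rate of 10 new creators per month:

```python
current, target = 26, 1_300  # creators now vs. at the 50x trigger
rate = 10                    # assumed new creators per month (aggressive)
months = (target - current) / rate
print(f'{months:.0f} months (~{months / 12:.1f} years)')  # 127 months (~10.6 years)
```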

Recommendation

Stay on NetworkX. The current graph is roughly 1/50th of the migration threshold. NetworkX adds zero operational overhead, loads instantly, and handles every query LightRAG can throw at it in under a millisecond.

Migration Triggers

Begin planning migration when any of these conditions are met:

  1. Node count exceeds 50,000 — schedule migration within the next growth cycle.
  2. LightRAG query latency at p95 exceeds 500 ms — investigate whether graph traversal is the bottleneck.
  3. Need for concurrent LightRAG workers — NetworkX's single-process model prevents parallel indexing.
  4. GraphML serialization exceeds 2 seconds — measure with `time` inside the container.

Monitoring

Add a periodic check (cron or pipeline health endpoint) that reports:

# Node/edge count from the GraphML file
docker exec chrysopedia-lightrag python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('/app/data/chrysopedia/graph_chunk_entity_relation.graphml')
ns = {'g': 'http://graphml.graphdrawing.org/xmlns'}
nodes = len(tree.findall('.//g:node', ns))
edges = len(tree.findall('.//g:edge', ns))
print(f'graph_nodes={nodes} graph_edges={edges}')
"

When graph_nodes crosses 50,000, the migration plan below should be executed.
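A thin wrapper can turn that output into an alert. A sketch: the 50,000 threshold comes from the migration triggers above, and the parsing assumes the exact `graph_nodes=N graph_edges=M` format printed by the check.

```python
import re

MIGRATION_TRIGGER = 50_000  # nodes; from the migration triggers above

def parse_metrics(line: str) -> tuple[int, int]:
    """Parse the 'graph_nodes=N graph_edges=M' line emitted by the check."""
    m = re.fullmatch(r'graph_nodes=(\d+) graph_edges=(\d+)', line.strip())
    if m is None:
        raise ValueError(f'unexpected metrics line: {line!r}')
    return int(m.group(1)), int(m.group(2))

nodes, edges = parse_metrics('graph_nodes=1836 graph_edges=2305')
status = 'PLAN MIGRATION' if nodes >= MIGRATION_TRIGGER else 'ok'
print(f'{status}: {nodes} nodes, {edges} edges')  # ok: 1836 nodes, 2305 edges
```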

Migration Plan: NetworkX → Neo4j

When the migration trigger is reached, execute these steps. Estimated effort: 2–4 hours for an operator familiar with the Docker Compose stack.

Prerequisites

  • Neo4j Community Edition Docker image (neo4j:5-community)
  • 2 GB available RAM on the host for the Neo4j JVM

Steps

1. Add Neo4j to docker-compose.yml

chrysopedia-neo4j:
  image: neo4j:5-community
  container_name: chrysopedia-neo4j
  environment:
    NEO4J_AUTH: neo4j/${NEO4J_PASSWORD:-changeme}
    NEO4J_PLUGINS: '["apoc"]'
    NEO4J_server_memory_heap_initial__size: 512m
    NEO4J_server_memory_heap_max__size: 1g
  volumes:
    - /vmPool/r/services/chrysopedia_neo4j/data:/data
    - /vmPool/r/services/chrysopedia_neo4j/logs:/logs
  ports:
    - "127.0.0.1:7474:7474"   # Browser
    - "127.0.0.1:7687:7687"   # Bolt
  networks:
    - chrysopedia-net
  healthcheck:
    test: ["CMD", "neo4j", "status"]
    interval: 30s
    timeout: 10s
    retries: 5
  restart: unless-stopped

2. Update LightRAG environment variables

In .env.lightrag:

LIGHTRAG_GRAPH_STORAGE=Neo4JStorage
NEO4J_URI=bolt://chrysopedia-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=<secure-password>

3. Deploy and verify Neo4j is healthy

docker compose up -d chrysopedia-neo4j
docker exec chrysopedia-neo4j neo4j status  # Should show "running"

4. Re-index all content

LightRAG will rebuild the graph in Neo4j during re-indexing. Trigger a full re-index via the pipeline or the LightRAG API:

# Option A: Re-run the pipeline for all videos
# Option B: Use LightRAG's /documents/upload endpoint for each document

The re-index duration depends on content volume and LLM extraction speed. At 90K nodes, expect 4–8 hours.

5. Verify the migration

# Check Neo4j node count via Cypher
docker exec chrysopedia-neo4j cypher-shell -u neo4j -p <password> \
  "MATCH (n) RETURN count(n) AS nodes"

# Verify LightRAG query works
curl -s http://localhost:9621/query/data \
  -H 'Content-Type: application/json' \
  -d '{"query": "test query", "mode": "hybrid"}' | jq .

6. Remove the GraphML file (optional)

Once Neo4j is confirmed working, the GraphML file is no longer used. Archive or delete it.

7. Update monitoring

Replace the GraphML-based node count check with a Neo4j Cypher query:

docker exec chrysopedia-neo4j cypher-shell -u neo4j -p <password> \
  "MATCH (n) RETURN count(n) AS nodes UNION ALL MATCH ()-[r]-() RETURN count(r) AS edges"

Appendix: Architecture Context

Frontend / Chat
    ↓
FastAPI API (backend/)
    ↓ httpx POST to :9621/query/data
LightRAG HTTP API
    ↓                    ↓                  ↓
Graph Storage        Vector Storage      KV Storage
(NetworkX/Neo4j)     (Qdrant)            (JSON files)

The application layer (backend/search_service.py, backend/routers/chat.py) interacts exclusively with LightRAG's HTTP API. The graph storage backend is an implementation detail of LightRAG — swapping it changes nothing in the application code, API contracts, or frontend behavior.
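For illustration, the backend's call into LightRAG might look like the following stdlib-only sketch. The endpoint and payload shape mirror the curl example in the migration plan; the exact body the backend sends (and the `mode` value) are assumptions.

```python
import json
import urllib.request

LIGHTRAG_URL = 'http://localhost:9621/query/data'  # port from the diagram above

def build_query_request(text: str, mode: str = 'hybrid') -> urllib.request.Request:
    """Build (but do not send) a POST to LightRAG's /query/data endpoint."""
    body = json.dumps({'query': text, 'mode': mode}).encode('utf-8')
    return urllib.request.Request(
        LIGHTRAG_URL,
        data=body,
        headers={'Content-Type': 'application/json'},
    )

req = build_query_request('sidechain compression techniques')
print(req.get_method(), req.full_url)  # POST http://localhost:9621/query/data
```

Because the graph backend sits behind this HTTP boundary, nothing in the request changes when NetworkX is swapped for Neo4j.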

LightRAG v1.4.13 ships both NetworkXStorage (lightrag/kg/networkx_impl.py) and Neo4JStorage (lightrag/kg/neo4j_impl.py, 1,908 lines) as built-in backends. The choice is a single environment variable.