NAS-Backed Storage Tiering
Status
accepted — Phase 1 is complete and backup rollout has begun; remaining migration and Mac Studio readiness work is still tracked below.
Context
The NAS (three-body, Asustor ADM OS) has 57TB free on HDD RAID5 and 1.78TB on NVMe RAID1, sitting mostly idle. Meanwhile, the Mac Mini’s internal SSD stores everything. When the Mac Studio arrives (April 10, 2026), both machines will need shared access to the same data. The NAS is the natural shared storage layer — reachable by both machines over 10GbE LAN.
Hardware
- NAS: Asustor (ADM OS, Linux-based, Docker capable)
- HDD tier: 8x ~10TB RAID5 (md1) → 64TB usable, 57TB free. ~150-250 MB/s sequential
- NVMe tier: 2x Seagate IronWolf ZP2000NM30002 2TB in RAID1 (md2) → 1.78TB usable. 824 MiB/s local write
- Network: 10GbE LAN, MTU 9000 (jumbo frames) on NAS eth2 + Mac en0
- NFS performance (tuned): Write ~660 MiB/s, Read ~1,000 MiB/s over NFS
- NVMe dirs:
/volume2/data/{typesense,redis,models,backups,sessions,transcripts,otel}
Soak Test Results (all gates passed)
- ✅ G1: Zero NFS failures over 48h
- ✅ G2: md2 RAID1 resync complete — [2/2] [UU]
- ✅ G3: Local NVMe write 824 MiB/s ≥ 700 MiB/s threshold
NFS Tuning Applied
| Setting | Before | After | Impact |
|---|---|---|---|
| MTU | 1500 | 9000 (jumbo frames) | Fewer packet headers |
| rsize/wsize | 32KB | 1MB | Fewer NFS round-trips |
| readahead | 16 | 128 | Aggressive prefetch |
| noatime | off | on | Skip access time updates |
Persistent via LaunchDaemons: com.joel.mtu-jumbo.plist (MTU) + com.joel.mount-nas-nvme.plist (NFS mount).
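As a sketch, the mount LaunchDaemon is along these lines (the label and mount source/target match this ADR, but the exact structure of the real com.joel.mount-nas-nvme.plist may differ):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.joel.mount-nas-nvme</string>
  <key>ProgramArguments</key>
  <array>
    <string>/sbin/mount</string>
    <string>-t</string><string>nfs</string>
    <string>-o</string>
    <string>resvport,rw,soft,intr,nfsvers=3,tcp,rsize=1048576,wsize=1048576,readahead=128,noatime</string>
    <string>three-body:/volume2/data</string>
    <string>/Volumes/nas-nvme</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
```

Loaded once via `sudo launchctl load -w /Library/LaunchDaemons/com.joel.mount-nas-nvme.plist`, it re-mounts on every boot.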
Decision
A three-tier storage strategy that places each dataset on the right medium for its access pattern.
Tier 1: SSD (Hot) — Latency-Sensitive
Lives on the local machine’s SSD. Sub-millisecond access required.
| Data | Reason |
|---|---|
| Redis persistence | Write-ahead log, pub/sub latency |
| Inngest state | Step execution depends on fast state reads |
| Active git repos (~/Code/) | Constant random I/O from builds, LSP, agents |
| Vault (~/Vault/) | Obsidian reads on every note switch |
> Typesense: Currently Tier 1 (local SSD via k8s PVC). NAS NVMe performance (660 MiB/s write, 1 GB/s read) makes NAS viable for Typesense if shared access between machines is needed. Decision deferred to Mac Studio arrival — evaluate after testing Typesense directly on NAS NVMe.
Tier 2: NAS NVMe (Warm-Hot) — Shared Fast Storage
Mounted via NFS at /Volumes/nas-nvme. Both Mac Mini and Mac Studio access the same data.
| Data | NAS Path | Access Pattern |
|---|---|---|
| Model weights (ONNX, Whisper) | /volume2/data/models/ | Download once, read on cold start |
| Typesense snapshots | /volume2/data/backups/ | Nightly snapshot |
| Session transcripts | /volume2/data/sessions/ | Rotate from local after 7 days |
| Meeting transcripts | /volume2/data/transcripts/ | Write after meeting, read on recall |
| OTEL event archives | /volume2/data/otel/ | Rotated from Typesense after 90 days |
Tier 3: NAS HDD (Cold) — Deep Archive
Mounted via NFS at /Volumes/three-body. Rarely accessed.
| Data | NAS Path | Retention |
|---|---|---|
| Video archives | /volume1/home/joel/video/ | Already here. Permanent |
| Old daily logs | /volume1/joelclaw/archive/memory/ | Rotate from local after 30 days |
| slog JSONL archives | /volume1/joelclaw/archive/slog/ | Rotate monthly |
| Qdrant data dump (post-retirement) | /volume1/joelclaw/archive/qdrant/ | One-time, keep 90 days |
| Books | /volume1/home/joel/books/ | Permanent |
NFS Mount Setup (macOS Tahoe)
macOS Tahoe has a read-only root volume (/mnt is not writable), so mount points go under /Volumes/. vifs cannot create /etc/fstab, which leaves LaunchDaemons as the persistent-mount solution.
Mount points:
- /Volumes/nas-nvme → three-body:/volume2/data (NVMe RAID1)
- /Volumes/three-body → <private-ip>:/volume1/joelclaw (HDD RAID5)
Mount options: resvport,rw,soft,intr,nfsvers=3,tcp,rsize=1048576,wsize=1048576,readahead=128,noatime
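For manual testing, the equivalent one-shot mount command can be assembled from the same source, target, and options. This sketch prints the command rather than executing it, since mounting requires root and a reachable NAS:

```shell
#!/bin/sh
# Assemble the tuned NFS mount command from this ADR's options
# (dry run: echoes the command instead of running it).
OPTS="resvport,rw,soft,intr,nfsvers=3,tcp,rsize=1048576,wsize=1048576,readahead=128,noatime"
SRC="three-body:/volume2/data"
DST="/Volumes/nas-nvme"
CMD="mount -t nfs -o $OPTS $SRC $DST"
echo "$CMD"
```

Run the echoed command with `sudo` (after `mkdir -p /Volumes/nas-nvme`) to mount interactively.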
Critical macOS NFS notes:
- soft,intr required — hard mounts (the default) hang permanently on NAS restart and need a full Mac reboot to clear
- nfsvers=3 — NFSv4 has issues with Asustor ADM
- UID mismatch: NAS joel=1002, Mac joel=501. no_root_squash + chmod 777 on NFS dirs as workaround
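On the NAS side, the matching export entries look roughly like this (the exact ADM exports syntax and option set are assumptions; the CIDR is kept as the placeholder the ADR uses):

```
/volume2/data <private-subnet>(rw,async,insecure,no_root_squash)
/volume1/joelclaw <private-subnet>(rw,async,insecure,no_root_squash)
```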
Backup Inngest Functions (not raw crontab)
All backup jobs run as Inngest cron functions for observability, retry, and gateway alerting.
| Function | Schedule | Action |
|---|---|---|
| system/backup.typesense | Daily 3 AM | Typesense snapshot API → rsync to NAS NVMe |
| system/backup.redis | Daily 3:30 AM | BGSAVE → kubectl cp → NAS NVMe |
| system/backup.convex | Daily 4 AM | Convex export → NAS NVMe |
| system/rotate.sessions | Weekly Sunday | Move transcripts older than 7 days to NAS |
| system/rotate.logs | Monthly 1st | Archive daily logs + slog to NAS HDD |
| system/rotate.otel | Monthly 1st | Export old otel_events as JSONL → NAS, delete from Typesense |
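The rotation jobs amount to dated-file moves. A hypothetical shell equivalent of the system/rotate.sessions logic (the real job runs as an Inngest function; the helper name and paths here are illustrative):

```shell
#!/bin/sh
# rotate_sessions SRC DEST: move files last modified more than 7 days
# ago from SRC into DEST (stand-in for system/rotate.sessions).
rotate_sessions() {
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  # -mtime +7 matches files whose modification time is over 7 days old
  find "$src" -type f -mtime +7 -exec mv {} "$dest"/ \;
}
```

Usage on the Mac Mini would be something like `rotate_sessions ~/.sessions /Volumes/nas-nvme/sessions`.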
Mac Studio Migration (April 2026)
- Same NFS mounts at /Volumes/nas-nvme and /Volumes/three-body — identical paths
- Same LaunchDaemons — copy plists, adjust interface name for MTU if different
- Tier 1 data is per-machine (each has its own SSD)
- Tier 2/3 data is shared via NAS — both machines see the same files
- Setup checklist: ~/Vault/docs/mac-setup-checklist.md
Implementation Progress
Phase 1: NAS Setup ✅
- NVMe RAID1 created (md2, 2x 2TB, 1.78TB usable)
- Btrfs filesystem on NVMe volume
- NFS exports configured (<private-subnet> CIDR)
- NFS mount on Mac Mini with LaunchDaemon
- Jumbo frames (MTU 9000) on both ends
- NFS tuning: rsize/wsize 1MB, readahead 128, noatime
- Soak test: all 3 gates passed
- Performance verified: 660 MiB/s write, 1 GB/s read
Phase 2: Backup Jobs ✅
- system/backup.typesense Inngest function
- system/backup.redis Inngest function
- system/backup.convex Inngest function
- system/rotate.sessions Inngest function
- system/rotate.logs Inngest function
- system/rotate.otel Inngest function
- Typesense snapshot cleanup + retention pruning after successful NAS sync (keep latest N snapshots)
- NFS mount health check in core health slice
- Inngest runtime self-healing dispatch for Inngest/Worker degradation (domain: inngest-runtime)
- Backup failure router with AI-powered retry decisions
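The "keep latest N snapshots" pruning can be sketched as follows (hypothetical helper; the production version lives inside system/backup.typesense and the default of 2 mirrors TYPESENSE_SNAPSHOT_RETENTION_COUNT documented later in this ADR):

```shell
#!/bin/sh
# prune_snapshots DIR KEEP: delete all but the KEEP most recently
# modified snapshot directories under DIR.
prune_snapshots() {
  dir="$1"
  keep="${2:-2}"
  # ls -1t lists entries newest-first; tail skips the KEEP newest
  ls -1t "$dir" | tail -n +"$((keep + 1))" | while read -r name; do
    rm -rf "${dir:?}/$name"
  done
}
```

The `${dir:?}` expansion aborts rather than expanding to `/` if the directory argument is accidentally empty.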
Phase 3: Tier 2 Migration
- Move model weights to NAS NVMe, local cache symlinks
- Video pipeline: switch from SSH to NFS mount for writes
- Expand Typesense PVC from 5Gi to 50Gi
Phase 4: Mac Studio Prep (before April 10)
- Mac setup checklist documented (~/Vault/docs/mac-setup-checklist.md)
- Test NFS access from second machine
- Plan Tier 1 data seeding (Typesense re-index from snapshots)
Consequences
Positive
- 57TB HDD + 1.78TB NVMe of durable shared storage
- Both machines access same data via NFS — no rsync/duplication
- Backups are automatic, observable (Inngest), and off-machine
- NFS performance (660 MiB/s write, 1 GB/s read) viable for most workloads
- Session transcripts and OTEL events preserved indefinitely
Negative
- NFS adds operational complexity (mounts, connectivity)
- NAS unavailability degrades Tier 2 access (soft mount returns errors, doesn’t hang)
- macOS NFS client historically flaky — mitigated by soft,intr + LaunchDaemon retry
- UID mismatch workaround (chmod 777) is not ideal
Risks
- NAS disk failure → RAID5 (HDD) and RAID1 (NVMe) provide redundancy. Monitor via ADM alerts
- Off-LAN access via Tailscale relay is much slower than 10GbE LAN → only affects the laptop, not always-on nodes
- NFS file locking is unreliable with concurrent writers → mitigated by ensuring only one machine writes to any given path
Audit (2026-02-22)
- Status normalized to accepted (from partially-implemented) to match canonical ADR taxonomy while preserving the staged rollout state.
- Operational evidence reviewed from system/system-log.jsonl:
  - 2026-02-21T15:22:40.529Z (tool: nas-nvme) NVMe RAID1 + directory topology created per ADR.
  - 2026-02-21T15:40:58.023Z (tool: nas-nfs) NFS exports + mounts configured for /volume2/data and /volume1/joelclaw.
  - 2026-02-21T21:50:17.009Z (tool: nas-nfs) MTU/NFS tuning deployed with measured throughput gains.
  - 2026-02-21T22:05:38.918Z (action: deploy, tool: nas-backup) Phase 2 backup/rotation crons deployed to NAS NVMe.
- Phase 3/4 migration and validation tasks remain open in this ADR, so status is not upgraded to implemented.
Addendum (2026-02-27): Connectivity Incident & Hardening
Incident: Tailscale Subnet Route Hijacked LAN Traffic
NAS went unreachable from Panda over LAN. Root cause: Tailscale on the NAS was advertising 192.168.1.0/24 as a subnet route, which injected a routing table rule (table 52, priority 5270) that sent LAN response packets through tailscale0 instead of eth2. Since all LAN devices run Tailscale directly, the subnet route was redundant. Removed from --advertise-routes in the Docker-based Tailscale container.
NAS Boot Persistence: S99local-tuning
ASUSTOR ADM lacks systemd. Created /usr/local/etc/init.d/S99local-tuning (installed via setup-tuning.sh on NAS) to persist across reboots:
- ip link set eth2 mtu 9000 (jumbo frames)
- hdparm -B 255 /dev/sda /dev/sdb /dev/sdh (IronWolf APM — drives report “not supported” but the command is harmless)
Missing LaunchDaemon: three-body HDD Volume
Only com.joel.mount-nas-nvme.plist existed. Added com.joel.mount-three-body.plist to auto-mount /Volumes/three-body → three-body:/volume1/joelclaw with full tuned NFS options on Panda reboot.
NFS Mount Health Check
Added NFS mount probes to check-system-health.ts core health slice. Probes stat both /Volumes/nas-nvme and /Volumes/three-body with a 5s timeout. Stale or missing mounts now surface as degraded in gateway alerts before backup jobs fail silently.
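The probe itself is simple; a shell equivalent of what the check-system-health.ts slice does (the real implementation is TypeScript, and `timeout` here is GNU coreutils, not installed on macOS by default — both are assumptions of this sketch):

```shell
#!/bin/sh
# probe_mount PATH: succeed iff PATH can be stat'ed within 5 seconds.
# A hung NFS path blocks stat indefinitely; the timeout converts that
# hang into a detectable failure instead of a silent stall.
probe_mount() {
  timeout 5 stat "$1" >/dev/null 2>&1
}
```

Something like `probe_mount /Volumes/nas-nvme && probe_mount /Volumes/three-body || echo degraded` is the shape of the check.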
Updated Phase 2 Status
Backup functions (system/backup.typesense, system/backup.redis, system/rotate.sessions, system/rotate.logs, system/rotate.otel) are all implemented and running in production with dual-transport failover (local NFS + remote SSH) and AI-powered failure routing.
Addendum (2026-03-01): DiskPressure RCA + Snapshot Retention Hardening
Incident Summary
Inngest instability was traced to cluster-level DiskPressure on the single Talos node (Colima). inngest-0 was evicted for ephemeral-storage pressure while Inngest/worker logic itself remained healthy.
Root Cause
system/backup.typesense created snapshots under Typesense local PVC storage (/data/snapshots) and copied them to NAS, but did not prune the in-pod source snapshots. Snapshot accumulation consumed node image filesystem capacity and triggered kubelet eviction pressure.
Hardening Applied
- system/backup.typesense now supports configurable snapshot roots with primary→fallback behavior.
- Runtime Typesense is kept NAS-independent (no hard NFS mount required for pod startup).
- Backup transport now follows local mount → remote SSH/SCP → local queue fallback (NAS_BACKUP_QUEUE_ROOT) when NAS is unavailable.
- Post-sync cleanup now deletes the newly created in-pod snapshot path.
- Snapshot root pruning now keeps only the latest configured count (TYPESENSE_SNAPSHOT_RETENTION_COUNT, default 2).
- Health automation now dispatches system/self.healing.requested for domain: inngest-runtime when Inngest or Worker checks degrade.
- New handler system/self-healing.inngest-runtime validates runtime health, runs joelclaw inngest restart-worker --register --wait-ms 1500 when needed, and re-checks status with OTEL evidence.
Design Note
Primary Typesense data remains local Tier 1 by design (latency + reduced dependency blast radius). NAS remains the durability tier for backups. Snapshot roots can be redirected to an alternate mounted path when available, but retention is enforced regardless of root choice. Runtime fallback policy now follows ADR-0187 (local/remote/queued degradation contract).
References
- ADR-0029: Colima + Talos k8s (where PVCs live)
- ADR-0082: Typesense unified search (primary Tier 1 consumer)
- ADR-0087: Observability pipeline (otel_events rotation to NAS)
- ~/Vault/docs/mac-setup-checklist.md: Full machine setup guide
- ~/.joelclaw/NEIGHBORHOOD.md: NAS specs and network topology