ADR-0088 (shipped)

NAS-Backed Storage Tiering

2026-02-21T00:00:00.000Z

Status

accepted. Phases 1, 2, and 2.5 are complete; the Tier 2 migration (Phase 3) and Mac Studio readiness work (Phase 4) remain open and are tracked below.

Context

The NAS (three-body, Asustor ADM OS) has 57TB free on HDD RAID5 and 1.78TB on NVMe RAID1, sitting mostly idle. Meanwhile, the Mac Mini’s internal SSD stores everything. When the Mac Studio arrives (April 10, 2026), both machines will need shared access to the same data. The NAS is the natural shared storage layer — reachable by both machines over 10GbE LAN.

Hardware

NAS: Asustor (ADM OS, Linux-based, Docker capable)
HDD tier: 8x ~10TB RAID5 (md1) → 64TB usable, 57TB free. ~150-250 MB/s sequential
NVMe tier: 2x Seagate IronWolf ZP2000NM30002 2TB in RAID1 (md2) → 1.78TB usable. 824 MiB/s local write
Network: 10GbE LAN, MTU 9000 (jumbo frames) on NAS eth2 + Mac en0
NFS performance (tuned): ~660 MiB/s write, ~1,000 MiB/s read
NVMe dirs: /volume2/data/{typesense,redis,models,backups,sessions,transcripts,otel}

Soak Test Results (all gates passed)

✅ G1: Zero NFS failures over 48h
✅ G2: md2 RAID1 resync complete — [2/2] [UU]
✅ G3: Local NVMe write 824 MiB/s ≥ 700 MiB/s threshold
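
The three gates above can be expressed as a small check. This is a minimal sketch, not the script that was actually run: the `SoakSample` shape and the `evaluateGates` name are hypothetical, and it assumes the md2 status text is read from `/proc/mdstat`. Only the zero-failure requirement, the `[2/2] [UU]` marker, and the 700 MiB/s threshold come from this ADR.

```typescript
// Hypothetical shape for the measurements collected during the soak test.
interface SoakSample {
  nfsFailures: number;     // NFS errors observed during the 48h window (G1)
  mdstatLine: string;      // md2 status text, e.g. "[2/2] [UU]" (G2)
  localWriteMiBps: number; // measured local NVMe write throughput (G3)
}

const WRITE_THRESHOLD_MIBPS = 700; // G3 threshold stated in this ADR

function evaluateGates(s: SoakSample): { g1: boolean; g2: boolean; g3: boolean; allPassed: boolean } {
  const g1 = s.nfsFailures === 0;
  // RAID1 is fully synced when both members are present and up.
  const g2 = /\[2\/2\]\s*\[UU\]/.test(s.mdstatLine);
  const g3 = s.localWriteMiBps >= WRITE_THRESHOLD_MIBPS;
  return { g1, g2, g3, allPassed: g1 && g2 && g3 };
}

const soakResult = evaluateGates({ nfsFailures: 0, mdstatLine: "[2/2] [UU]", localWriteMiBps: 824 });
console.log(soakResult.allPassed); // true
```

A degraded array such as `[2/1] [U_]` fails G2 in this sketch, which blocks the overall result even when throughput is fine.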

NFS Tuning Applied

| Setting | Before | After | Impact |
| --- | --- | --- | --- |
| MTU | 1500 | 9000 (jumbo frames) | Fewer packet headers |
| rsize/wsize | 32KB | 1MB | Fewer NFS round-trips |
| readahead | 16 | 128 | Aggressive prefetch |
| noatime | off | on | Skip access time updates |

Persistent via LaunchDaemons: com.joel.mtu-jumbo.plist (MTU) + com.joel.mount-nas-nvme.plist (NFS mount).
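
To make the "fewer packet headers" effect of the MTU change concrete, here is a rough worked calculation. It is a sketch under simplifying assumptions: 40 bytes of IPv4 + TCP headers per packet with no TCP options, and no accounting for Ethernet framing or NFS/RPC headers. The function name is hypothetical.

```typescript
// Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
const IP_TCP_HEADER_BYTES = 40;

// Number of packets needed to carry one transfer at a given MTU.
function packetsPerTransfer(transferBytes: number, mtu: number): number {
  const payloadPerPacket = mtu - IP_TCP_HEADER_BYTES;
  return Math.ceil(transferBytes / payloadPerPacket);
}

const ONE_MIB = 1024 * 1024; // matches the 1MB rsize/wsize setting

console.log(packetsPerTransfer(ONE_MIB, 1500)); // 719
console.log(packetsPerTransfer(ONE_MIB, 9000)); // 118
```

Under these assumptions a single 1 MiB NFS write needs roughly six times fewer packets with jumbo frames.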

Decision

Adopt a three-tier storage strategy that places each class of data on the medium matching its access pattern.

Tier 1: SSD (Hot) — Latency-Sensitive

Lives on the local machine’s SSD. Sub-millisecond access required.

| Data | Reason |
| --- | --- |
| Redis persistence | Write-ahead log, pub/sub latency |
| Inngest state | Step execution depends on fast state reads |
| Active git repos (`~/Code/`) | Constant random I/O from builds, LSP, agents |
| Vault (`~/Vault/`) | Obsidian reads on every note switch |

> Typesense: Currently Tier 1 (local SSD via k8s PVC). NAS NVMe performance (660 MiB/s write, 1 GB/s read) makes NAS viable for Typesense if shared access between machines is needed. Decision deferred to Mac Studio arrival — evaluate after testing Typesense directly on NAS NVMe.

Tier 2: NAS NVMe (Warm-Hot) — Shared Fast Storage

Mounted via NFS at /Volumes/nas-nvme. Both Mac Mini and Mac Studio access the same data.

| Data | NAS Path | Access Pattern |
| --- | --- | --- |
| Model weights (ONNX, Whisper) | `/volume2/data/models/` | Download once, read on cold start |
| Typesense snapshots | `/volume2/data/backups/` | Nightly snapshot |
| Session transcripts | `/volume2/data/sessions/` | Rotate from local after 7 days |
| Meeting transcripts | `/volume2/data/transcripts/` | Write after meeting, read on recall |
| OTEL event archives | `/volume2/data/otel/` | Rotated from Typesense after 90 days |
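
The age-based rotation rules in this table (for example, session transcripts moving off local disk after 7 days) reduce to a cutoff comparison. The following is a minimal sketch of that selection step only; `FileEntry` and `selectForRotation` are hypothetical names, and the real rotation functions may decide differently (by filename date, for instance).

```typescript
// Hypothetical description of a local file that is a rotation candidate.
interface FileEntry {
  path: string;
  modifiedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the files whose last modification is older than maxAgeDays.
function selectForRotation(files: FileEntry[], maxAgeDays: number, now: Date): FileEntry[] {
  const cutoff = now.getTime() - maxAgeDays * DAY_MS;
  return files.filter((f) => f.modifiedAt.getTime() < cutoff);
}

const sessions: FileEntry[] = [
  { path: "sessions/recent.jsonl", modifiedAt: new Date("2026-02-27T00:00:00Z") },
  { path: "sessions/old.jsonl", modifiedAt: new Date("2026-02-10T00:00:00Z") },
];

// With a 7-day window, only sessions/old.jsonl is selected.
const due = selectForRotation(sessions, 7, new Date("2026-03-01T00:00:00Z"));
console.log(due.map((f) => f.path));
```

Passing `now` explicitly keeps the selection deterministic, which makes the rule easy to test without touching the filesystem.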

Tier 3: NAS HDD (Cold) — Deep Archive

Mounted via NFS at /Volumes/three-body. Rarely accessed.

| Data | NAS Path | Retention |
| --- | --- | --- |
| Video archives | `/volume1/home/joel/video/` | Already here. Permanent |
| Old daily logs | `/volume1/joelclaw/archive/memory/` | Rotate from local after 30 days |
| slog JSONL archives | `/volume1/joelclaw/archive/slog/` | Rotate monthly |
| Qdrant data dump (post-retirement) | `/volume1/joelclaw/archive/qdrant/` | One-time, keep 90 days |
| Books | `/volume1/home/joel/books/` | Permanent |

NFS Mount Setup (macOS Tahoe)

macOS Tahoe has read-only root (/mnt not writable). Use /Volumes/ for mount points. vifs cannot create /etc/fstab. LaunchDaemons are the persistent mount solution.

Mount points:

/Volumes/nas-nvme → three-body:/volume2/data (NVMe RAID1)
/Volumes/three-body → <private-ip>:/volume1/joelclaw (HDD RAID5)

Mount options: resvport,rw,soft,intr,nfsvers=3,tcp,rsize=1048576,wsize=1048576,readahead=128,noatime
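
As an illustration of how these options fit together, the sketch below assembles the arguments for macOS `mount_nfs`. It assumes the LaunchDaemon ultimately runs a command of this shape; the plist contents are not shown in this ADR, and `NfsMount` and `mountArgs` are hypothetical names.

```typescript
// Hypothetical description of one NFS mount.
interface NfsMount {
  source: string;     // host:/export
  mountPoint: string; // local directory under /Volumes
}

// Option list copied from this ADR.
const NFS_OPTIONS: string[] = [
  "resvport", "rw", "soft", "intr", "nfsvers=3", "tcp",
  "rsize=1048576", "wsize=1048576", "readahead=128", "noatime",
];

// Builds the argument vector: -o <options> <source> <mountPoint>
function mountArgs(m: NfsMount): string[] {
  return ["-o", NFS_OPTIONS.join(","), m.source, m.mountPoint];
}

const nvme: NfsMount = { source: "three-body:/volume2/data", mountPoint: "/Volumes/nas-nvme" };
console.log(["mount_nfs"].concat(mountArgs(nvme)).join(" "));
```

Keeping the options in one array means both mounts share the same tuned settings and cannot drift apart.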

Critical macOS NFS notes:

soft,intr is required: hard mounts (the default) hang permanently when the NAS restarts and need a full Mac reboot to clear
nfsvers=3: NFSv4 has issues with Asustor ADM
UID mismatch: NAS joel=1002, Mac joel=501. Workaround: no_root_squash on the export plus chmod 777 on the NFS dirs

Backup Inngest Functions (not raw crontab)

All backup jobs run as Inngest cron functions for observability, retry, and gateway alerting.

| Function | Schedule | Action |
| --- | --- | --- |
| `system/backup.typesense` | Daily 3 AM | Typesense snapshot API → rsync to NAS NVMe |
| `system/backup.redis` | Daily 3:30 AM | BGSAVE → kubectl cp → NAS NVMe |
| `system/backup.convex` | Daily 4 AM | Convex export → NAS NVMe |
| `system/rotate.sessions` | Weekly Sunday | Move transcripts older than 7 days to NAS |
| `system/rotate.logs` | Monthly 1st | Archive daily logs + slog to NAS HDD |
| `system/rotate.otel` | Monthly 1st | Export old otel_events as JSONL → NAS, delete from Typesense |
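
The schedules in this table map naturally onto five-field cron expressions, which is the form a cron-triggered function would typically be given. The mapping below is an illustrative sketch, not the deployed configuration: the time of day for the weekly and monthly jobs is not stated in this ADR, so midnight is an assumption, as is the time zone.

```typescript
// Function IDs are from this ADR; the cron strings are an illustrative mapping.
const BACKUP_SCHEDULES: Record<string, string> = {
  "system/backup.typesense": "0 3 * * *",  // daily at 3:00 AM
  "system/backup.redis": "30 3 * * *",     // daily at 3:30 AM
  "system/backup.convex": "0 4 * * *",     // daily at 4:00 AM
  "system/rotate.sessions": "0 0 * * 0",   // Sundays; midnight is an assumption
  "system/rotate.logs": "0 0 1 * *",       // 1st of month; midnight is an assumption
  "system/rotate.otel": "0 0 1 * *",       // 1st of month; midnight is an assumption
};

// Cheap sanity check: a standard cron expression has exactly five fields.
function isFiveFieldCron(expr: string): boolean {
  return expr.trim().split(/\s+/).length === 5;
}

const invalid = Object.keys(BACKUP_SCHEDULES).filter((id) => !isFiveFieldCron(BACKUP_SCHEDULES[id]));
console.log(invalid.length); // 0
```

Staggering the daily jobs by 30 minutes, as the table does, keeps the snapshot, Redis, and Convex exports from competing for the same NFS bandwidth.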

Mac Studio Migration (April 2026)

Same NFS mounts at /Volumes/nas-nvme and /Volumes/three-body — identical paths
Same LaunchDaemons — copy plists, adjust interface name for MTU if different
Tier 1 data is per-machine (each has its own SSD)
Tier 2/3 data is shared via NAS — both machines see the same files
Setup checklist: ~/Vault/docs/mac-setup-checklist.md

Implementation Progress

Phase 1: NAS Setup ✅

NVMe RAID1 created (md2, 2x 2TB, 1.78TB usable)
Btrfs filesystem on NVMe volume
NFS exports configured (<private-subnet> CIDR)
NFS mount on Mac Mini with LaunchDaemon
Jumbo frames (MTU 9000) on both ends
NFS tuning: rsize/wsize 1MB, readahead 128, noatime
Soak test: all 3 gates passed
Performance verified: 660 MiB/s write, 1 GB/s read

Phase 2: Backup Jobs ✅

Phase 2.5: k8s ↔ NAS Networking ✅ (2026-03-20)

Diagnosed route: VZ NAT on eth0 doesn’t forward LAN traffic properly
Fix: route 192.168.1.0/24 via col0 bridge (192.168.64.1) — macOS IP forwarding already enabled
Persisted via Colima provision script (runs on colima start)
Also added to colima-tunnel script (runs on tunnel restart, covers warm resume)
NFS PV/PVC created: nas-nvme → 192.168.1.163:/volume2/data (1.5TB, RWX)
Verified: k8s pods can mount NAS NVMe, read/write confirmed
Fixed minio PV hostname→IP (minio-nfs-pv still uses non-existent /volume1/joelclaw path — separate issue)
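
The added route only helps if the NFS PV target actually falls inside the routed subnet. The sketch below checks that 192.168.1.163 is covered by 192.168.1.0/24 while the bridge address 192.168.64.1 is not. It is a plain IPv4 CIDR membership test with hypothetical helper names, not part of the Colima provision script.

```typescript
// Converts dotted-quad IPv4 text to an unsigned integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  const valid = parts.length === 4 &&
    parts.every((p) => !isNaN(p) && Math.floor(p) === p && p >= 0 && p <= 255);
  if (!valid) throw new Error("invalid IPv4 address: " + ip);
  return parts.reduce((acc, p) => acc * 256 + p, 0);
}

// True when ip lies inside the network described by cidr (e.g. "192.168.1.0/24").
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  const base = cidr.slice(0, slash);
  const bits = Number(cidr.slice(slash + 1));
  const size = Math.pow(2, 32 - bits);                     // addresses in the block
  const start = Math.floor(ipv4ToInt(base) / size) * size; // network address
  const addr = ipv4ToInt(ip);
  return addr >= start && addr < start + size;
}

console.log(inCidr("192.168.1.163", "192.168.1.0/24")); // true
console.log(inCidr("192.168.64.1", "192.168.1.0/24"));  // false
```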

Phase 3: Tier 2 Migration

Move model weights to NAS NVMe, local cache symlinks
Video pipeline: switch from SSH to NFS mount for writes
Expand Typesense PVC from 5Gi to 50Gi

Phase 4: Mac Studio Prep (before April 10)

Mac setup checklist documented (~/Vault/docs/mac-setup-checklist.md)
Test NFS access from second machine
Plan Tier 1 data seeding (Typesense re-index from snapshots)

Consequences

Positive

57TB HDD + 1.78TB NVMe of durable shared storage
Both machines access same data via NFS — no rsync/duplication
Backups are automatic, observable (Inngest), and off-machine
NFS performance (660 MiB/s write, 1 GB/s read) viable for most workloads
Session transcripts and OTEL events preserved indefinitely

Negative

NFS adds operational complexity (mounts, connectivity)
NAS unavailability degrades Tier 2 access (soft mount returns errors, doesn’t hang)
macOS NFS client historically flaky — mitigated by soft,intr + LaunchDaemon retry
UID mismatch workaround (chmod 777) is not ideal

Risks

NAS disk failure → RAID5 (HDD) and RAID1 (NVMe) provide redundancy. Monitor via ADM alerts
Off-LAN access via Tailscale relay much slower than 10GbE LAN → only affects laptop, not always-on nodes
NFS file locking with concurrent writers → only one machine writes to any given path

Audit (2026-02-22)

Status normalized to accepted (from partially-implemented) to match canonical ADR taxonomy while preserving the staged rollout state.
Operational evidence reviewed from system/system-log.jsonl:
- 2026-02-21T15:22:40.529Z (tool: nas-nvme) NVMe RAID1 + directory topology created per ADR.
- 2026-02-21T15:40:58.023Z (tool: nas-nfs) NFS exports + mounts configured for /volume2/data and /volume1/joelclaw.
- 2026-02-21T21:50:17.009Z (tool: nas-nfs) MTU/NFS tuning deployed with measured throughput gains.
- 2026-02-21T22:05:38.918Z (action: deploy, tool: nas-backup) Phase 2 backup/rotation crons deployed to NAS NVMe.
Phase 3/4 migration and validation tasks remain open in this ADR, so status is not upgraded to implemented.

Addendum (2026-02-27): Connectivity Incident & Hardening

Incident: Tailscale Subnet Route Hijacked LAN Traffic

NAS went unreachable from Panda over LAN. Root cause: Tailscale on the NAS was advertising 192.168.1.0/24 as a subnet route, which injected a routing table rule (table 52, priority 5270) that sent LAN response packets through tailscale0 instead of eth2. Since all LAN devices run Tailscale directly, the subnet route was redundant. Removed from --advertise-routes in the Docker-based Tailscale container.

NAS Boot Persistence: S99local-tuning

ASUSTOR ADM lacks systemd. Created /usr/local/etc/init.d/S99local-tuning (installed via setup-tuning.sh on NAS) to persist across reboots:

ip link set eth2 mtu 9000 (jumbo frames)
hdparm -B 255 /dev/sda /dev/sdb /dev/sdh (IronWolf APM — drives report “not supported” but command is harmless)

Missing LaunchDaemon: three-body HDD Volume

Only com.joel.mount-nas-nvme.plist existed. Added com.joel.mount-three-body.plist to auto-mount /Volumes/three-body → three-body:/volume1/joelclaw with full tuned NFS options on Panda reboot.

NFS Mount Health Check

Added NFS mount probes to check-system-health.ts core health slice. Probes stat both /Volumes/nas-nvme and /Volumes/three-body with a 5s timeout. Stale or missing mounts now surface as degraded in gateway alerts before backup jobs fail silently.
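
A probe of this kind can be sketched as a stat call raced against a timer. The code below is an illustration, not the contents of check-system-health.ts: `probeMount` and `MountStatus` are hypothetical names, and the stat function is injected so the sketch stays self-contained (in Node it could be something like `fs.promises.stat`). Only the 5s timeout and the `degraded` outcome come from this ADR.

```typescript
type MountStatus = "ok" | "degraded";

// Resolves "ok" if statFn succeeds within timeoutMs, otherwise "degraded".
// A missing path (statFn rejects) and a stale mount (statFn never settles)
// are both reported as degraded.
function probeMount(
  path: string,
  statFn: (p: string) => Promise<unknown>,
  timeoutMs: number = 5000,
): Promise<MountStatus> {
  return new Promise<MountStatus>((resolve) => {
    const timer = setTimeout(() => resolve("degraded"), timeoutMs);
    statFn(path).then(
      () => { clearTimeout(timer); resolve("ok"); },
      () => { clearTimeout(timer); resolve("degraded"); },
    );
  });
}

// Example with a stand-in stat function that succeeds immediately.
probeMount("/Volumes/nas-nvme", () => Promise.resolve({})).then((s) => console.log(s)); // ok
```

The timer matters because a stale NFS mount tends to block rather than fail, so a plain stat without a deadline could stall the health check itself.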

Updated Phase 2 Status

Backup functions (system/backup.typesense, system/backup.redis, system/rotate.sessions, system/rotate.logs, system/rotate.otel) are all implemented and running in production with dual-transport failover (local NFS + remote SSH) and AI-powered failure routing.

Addendum (2026-03-01): DiskPressure RCA + Snapshot Retention Hardening

Incident Summary

Inngest instability was traced to cluster-level DiskPressure on the single Talos node (Colima). inngest-0 was evicted for ephemeral-storage pressure while Inngest/worker logic itself remained healthy.

Root Cause

system/backup.typesense created snapshots under Typesense local PVC storage (/data/snapshots) and copied them to NAS, but did not prune the in-pod source snapshots. Snapshot accumulation consumed node image filesystem capacity and triggered kubelet eviction pressure.

Hardening Applied

system/backup.typesense now supports configurable snapshot roots with primary→fallback behavior.
Runtime Typesense is kept NAS-independent (no hard NFS mount required for pod startup).
Backup transport now follows local mount → remote SSH/SCP → local queue fallback (NAS_BACKUP_QUEUE_ROOT) when NAS is unavailable.
Post-sync cleanup now deletes the newly created in-pod snapshot path.
Snapshot root pruning now keeps only the latest configured count (TYPESENSE_SNAPSHOT_RETENTION_COUNT, default 2).
Health automation now dispatches system/self.healing.requested for domain: inngest-runtime when Inngest or Worker checks degrade.
New handler system/self-healing.inngest-runtime validates runtime health, runs joelclaw inngest restart-worker --register --wait-ms 1500 when needed, and re-checks status with OTEL evidence.
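
The transport order described above (local mount, then remote SSH/SCP, then the local queue) is an ordered fallback chain. Here is a minimal sketch of that control flow with stubbed transports; the `Transport` type, the transport labels, and the file name are hypothetical, and the real function's error handling is not documented in this ADR.

```typescript
// Hypothetical transport abstraction: send() rejects when the transport is unavailable.
type Transport = { name: string; send: (file: string) => Promise<void> };

// Tries each transport in order and returns the name of the first one that succeeds.
async function sendWithFallback(file: string, transports: Transport[]): Promise<string> {
  const errors: string[] = [];
  for (const t of transports) {
    try {
      await t.send(file);
      return t.name;
    } catch (err) {
      errors.push(`${t.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all transports failed: ${errors.join("; ")}`);
}

// Stubs standing in for the real transports; here the NFS mount is down.
const transports: Transport[] = [
  { name: "local-mount", send: async () => { throw new Error("NFS mount unavailable"); } },
  { name: "remote-ssh", send: async () => {} },
  { name: "local-queue", send: async () => {} }, // would write under NAS_BACKUP_QUEUE_ROOT
];

sendWithFallback("typesense-snapshot.tar", transports).then((used) => console.log(used)); // remote-ssh
```

Collecting the per-transport errors rather than discarding them keeps the final failure message useful for alerting when every path is down.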

Design Note

Primary Typesense data remains local Tier 1 by design (latency + reduced dependency blast radius). NAS remains the durability tier for backups. Snapshot roots can be redirected to an alternate mounted path when available, but retention is enforced regardless of root choice. Runtime fallback policy now follows ADR-0187 (local/remote/queued degradation contract).
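
The retention rule (keep only the newest N snapshots, default 2, whatever the snapshot root) can be sketched as a pure selection function. The `Snapshot` shape and the function name are hypothetical; the default of 2 mirrors `TYPESENSE_SNAPSHOT_RETENTION_COUNT` as described in this addendum.

```typescript
// Hypothetical snapshot descriptor; createdAt is a millisecond timestamp.
interface Snapshot {
  path: string;
  createdAt: number;
}

// Returns the snapshots that should be deleted, keeping the newest retentionCount.
function snapshotsToPrune(snapshots: Snapshot[], retentionCount: number = 2): Snapshot[] {
  const newestFirst = snapshots.slice().sort((a, b) => b.createdAt - a.createdAt);
  return newestFirst.slice(Math.max(0, retentionCount));
}

const existing: Snapshot[] = [
  { path: "/data/snapshots/s1", createdAt: 1 },
  { path: "/data/snapshots/s2", createdAt: 2 },
  { path: "/data/snapshots/s3", createdAt: 3 },
  { path: "/data/snapshots/s4", createdAt: 4 },
];

// With the default retention of 2, the two oldest snapshots (s2 and s1) are pruned.
console.log(snapshotsToPrune(existing).map((s) => s.path));
```

Because the function takes only the snapshot list, the same rule applies unchanged to the primary root and any fallback root.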

References

ADR-0029: Colima + Talos k8s (where PVCs live)
ADR-0082: Typesense unified search (primary Tier 1 consumer)
ADR-0087: Observability pipeline (otel_events rotation to NAS)
~/Vault/docs/mac-setup-checklist.md: Full machine setup guide
~/.joelclaw/NEIGHBORHOOD.md: NAS specs and network topology