# NAS Degradation Local/Temp/Queue Fallback Contract
Status: accepted
Date: 2026-03-01
Updated: 2026-03-01
Deciders: Joel Hooks, Panda
Related: ADR-0088 (storage tiering), ADR-0138 (self-healing backup), ADR-0153 (docs API)
## Context
NAS availability is variable. When NAS is unreachable, critical runtime services must not fail closed.
Observed failure mode:
- A hard NFS mount in `typesense-0` blocked pod startup (`ContainerCreating` + `FailedMount`), which cascaded into docs search socket failures and repeated `docs-ingest` failures.
This violates the system rule that infra failures must degrade gracefully and visibly.
## Decision
Adopt a mandatory NAS degradation contract for all NAS-coupled operations.
### 1) Runtime isolation (hard rule)
Critical runtime pods must not have hard NAS mount startup dependencies.
- Typesense and other core services must boot from local state first.
- NAS is durability/transport, not process liveness.
### 2) Three-tier NAS write fallback (hard rule)
All NAS write flows must follow this order:
- local mount path (fast path; e.g. `/Volumes/three-body`)
- direct remote copy (SSH/SCP to NAS path)
- local queue spool when NAS is unavailable (deferred sync)
Queue spool path default: `/tmp/joelclaw/nas-queue` (override with `NAS_BACKUP_QUEUE_ROOT`).
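The tier order above can be sketched as a single write routine. This is a minimal illustration, not the actual implementation: `writeThreeTier` and its injected `tryLocal`/`tryRemote` callbacks are hypothetical names, chosen so real local-mount and SSH/SCP transports can be plugged in.

```typescript
import { promises as fs } from "node:fs";
import path from "node:path";

type TransportMode = "local" | "remote" | "queued";

interface WriteResult {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string;
  reason?: string;
}

// Queue spool root, overridable per the contract.
const QUEUE_ROOT = process.env.NAS_BACKUP_QUEUE_ROOT ?? "/tmp/joelclaw/nas-queue";

// Try each tier in order; NAS unavailability never throws, because the
// local queue spool is the terminal fallback.
async function writeThreeTier(
  relPath: string,
  data: Buffer,
  tryLocal: (absPath: string, data: Buffer) => Promise<void>,
  tryRemote: (relPath: string, data: Buffer) => Promise<void>,
): Promise<WriteResult> {
  let attempts = 0;
  try {
    attempts++; // Tier 1: local mount fast path
    await tryLocal(path.join("/Volumes/three-body", relPath), data);
    return { transportMode: "local", transportAttempts: attempts };
  } catch { /* fall through to remote */ }
  try {
    attempts++; // Tier 2: direct remote copy (SSH/SCP)
    await tryRemote(relPath, data);
    return { transportMode: "remote", transportAttempts: attempts };
  } catch (err) {
    attempts++; // Tier 3: spool locally for deferred sync
    const queuedPath = path.join(QUEUE_ROOT, relPath);
    await fs.mkdir(path.dirname(queuedPath), { recursive: true });
    await fs.writeFile(queuedPath, data);
    return {
      transportMode: "queued",
      transportAttempts: attempts,
      queuedPath,
      reason: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Returning the mode/attempts/path metadata from the write itself keeps tier 3 from being silent, which feeds directly into the observability rule below.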
### 3) Degraded-mode observability (hard rule)
Every fallback write must persist:
- `transportMode`: `local | remote | queued`
- `transportAttempts`
- `queuedPath` and reason when mode is `queued`
Degraded/queued operation without telemetry is non-compliant.
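Under these rules, a compliance check over the persisted telemetry record reduces to a small predicate. The `TransportTelemetry` shape and `isCompliant` helper below are illustrative names; only the field names come from the contract above.

```typescript
type TransportMode = "local" | "remote" | "queued";

// Telemetry that every fallback write must persist.
interface TransportTelemetry {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string; // required when mode is "queued"
  reason?: string;     // required when mode is "queued"
}

// Degraded/queued operation without telemetry is non-compliant:
// a queued record must carry both its spool path and a reason.
function isCompliant(t: TransportTelemetry): boolean {
  if (t.transportAttempts < 1) return false;
  if (t.transportMode === "queued") {
    return Boolean(t.queuedPath) && Boolean(t.reason);
  }
  return true;
}
```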
### 4) Replay requirement
Queued artifacts must be replayed by a dedicated flush path (follow-up work).
Minimum replay behavior:
- idempotent copy from queue root to canonical NAS destination
- deterministic success/failure accounting
- OTEL events for `queued`, `flushed`, and `flush_failed`
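A minimal flush pass satisfying the replay requirements could look like the sketch below. `flushQueue` and its injected `copyToNas`/`emit` callbacks are assumptions (the replay worker is follow-up work, not existing code); recursive `readdir` requires a recent Node.js.

```typescript
import { promises as fs } from "node:fs";
import path from "node:path";

interface FlushAccounting {
  flushed: string[];
  failed: string[];
}

// Walk the queue root, copy each spooled artifact to its canonical NAS
// destination, and delete the spool copy only after a confirmed copy.
// Re-running after a partial failure retries only what remains, so the
// pass is idempotent, and the returned arrays give deterministic accounting.
async function flushQueue(
  queueRoot: string,
  copyToNas: (relPath: string, absPath: string) => Promise<void>,
  emit: (event: "flushed" | "flush_failed", relPath: string) => void,
): Promise<FlushAccounting> {
  const acc: FlushAccounting = { flushed: [], failed: [] };
  const entries = await fs.readdir(queueRoot, {
    recursive: true,
    withFileTypes: true,
  });
  for (const e of entries) {
    if (!e.isFile()) continue;
    const abs = path.join(e.parentPath, e.name);
    const rel = path.relative(queueRoot, abs);
    try {
      await copyToNas(rel, abs);
      await fs.unlink(abs); // spool copy removed only on success
      acc.flushed.push(rel);
      emit("flushed", rel);
    } catch {
      acc.failed.push(rel);
      emit("flush_failed", rel);
    }
  }
  return acc;
}
```

Unlinking after the copy (rather than before) is what makes replay safe: a crash mid-flush leaves the artifact queued, and the next run picks it up again.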
## Implementation sequence (vector clock)
- Remove hard NAS mount dependency from runtime Typesense pod startup.
- Add queue fallback tier to NAS backup transport (`local -> remote -> queued`).
- Surface queued transport metadata in backup result/OTEL payloads.
- Add replay worker/command for queue flush (next increment).
## Consequences

### Good
- NAS outages no longer block core pod startup.
- Backup operations stay durable via deferred queue instead of hard-fail.
- Degradation state becomes explicit and searchable.
### Tradeoffs
- Queue introduces eventual-consistency windows for NAS durability.
- Requires queue replay hygiene to avoid backlog growth.
## Compliance
Any new NAS-integrated flow that does not define local/remote/queued fallback behavior is non-compliant.