ADR-0187

NAS Degradation Local/Temp/Queue Fallback Contract

Status: accepted
Date: 2026-03-01
Updated: 2026-03-01
Deciders: Joel Hooks, Panda
Related: ADR-0088 (storage tiering), ADR-0138 (self-healing backup), ADR-0153 (docs API)

Context

NAS availability is variable. When NAS is unreachable, critical runtime services must not fail closed.

Observed failure mode:

  • A hard NFS mount in typesense-0 left the pod stuck in ContainerCreating with FailedMount events, which cascaded into docs search socket failures and repeated docs-ingest failures.

This violates the system rule that infra failures must degrade gracefully and visibly.

Decision

Adopt a mandatory NAS degradation contract for all NAS-coupled operations.

1) Runtime isolation (hard rule)

Critical runtime pods must not have hard NAS mount startup dependencies.

  • Typesense and other core services must boot from local state first.
  • NAS is durability/transport, not process liveness.

2) Three-tier NAS write fallback (hard rule)

All NAS write flows must follow this order:

  1. local mount path (fast path; e.g. /Volumes/three-body)
  2. direct remote copy (SSH/SCP to NAS path)
  3. local queue spool when NAS is unavailable (deferred sync)

Queue spool path default:

  • /tmp/joelclaw/nas-queue (override with NAS_BACKUP_QUEUE_ROOT)
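The three-tier order can be sketched as a single write function. This is a minimal illustration, not the actual transport: `writeWithFallback`, `tryLocal`, and `tryRemote` are hypothetical names, and the real local/remote implementations (mount write, SSH/SCP) are passed in as callbacks.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

type TransportMode = "local" | "remote" | "queued";

interface WriteResult {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string;
  reason?: string;
}

// Queue root default per this ADR, overridable via NAS_BACKUP_QUEUE_ROOT.
const QUEUE_ROOT =
  process.env.NAS_BACKUP_QUEUE_ROOT ?? "/tmp/joelclaw/nas-queue";

async function writeWithFallback(
  relPath: string,
  data: Buffer,
  tryLocal: (relPath: string, data: Buffer) => Promise<void>,
  tryRemote: (relPath: string, data: Buffer) => Promise<void>,
): Promise<WriteResult> {
  let attempts = 0;

  // Tier 1: local mount fast path (e.g. /Volumes/three-body).
  try {
    attempts++;
    await tryLocal(relPath, data);
    return { transportMode: "local", transportAttempts: attempts };
  } catch {
    // fall through to tier 2
  }

  // Tier 2: direct remote copy (SSH/SCP to the NAS path).
  try {
    attempts++;
    await tryRemote(relPath, data);
    return { transportMode: "remote", transportAttempts: attempts };
  } catch {
    // fall through to tier 3
  }

  // Tier 3: spool into the local queue for deferred sync.
  attempts++;
  const queuedPath = path.join(QUEUE_ROOT, relPath);
  await fs.mkdir(path.dirname(queuedPath), { recursive: true });
  await fs.writeFile(queuedPath, data);
  return {
    transportMode: "queued",
    transportAttempts: attempts,
    queuedPath,
    reason: "nas_unreachable",
  };
}
```

Note the return value already carries the telemetry fields section 3 requires, so the caller cannot end up queued without a record of it.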

3) Degraded-mode observability (hard rule)

Every fallback write must persist:

  • transportMode: local | remote | queued
  • transportAttempts
  • queuedPath and reason when mode is queued

Degraded/queued operation without telemetry is non-compliant.
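One way to make "non-compliant" checkable is a small validator over the persisted fields. The record shape below mirrors section 3; the `isCompliant` function itself is an illustrative sketch, not an existing API.

```typescript
type TransportMode = "local" | "remote" | "queued";

interface TransportTelemetry {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string;
  reason?: string;
}

// A result is compliant only if attempts were recorded and, for queued
// mode, both queuedPath and reason are present.
function isCompliant(t: TransportTelemetry): boolean {
  if (t.transportAttempts < 1) return false;
  if (t.transportMode === "queued") {
    return Boolean(t.queuedPath && t.reason);
  }
  return true;
}
```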

4) Replay requirement

Queued artifacts must be replayed by a dedicated flush path (follow-up work).

Minimum replay behavior:

  • idempotent copy from queue root to canonical NAS destination
  • deterministic success/failure accounting
  • OTEL events for queued, flushed, and flush_failed
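The minimum replay behavior can be sketched as a flush pass over the queue root. This is an assumed shape for the follow-up worker: `flushQueue` and `copyToNas` are hypothetical names, and the real transport (and OTEL emission) would be injected where the comments indicate.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

interface FlushAccounting {
  flushed: string[];
  failed: string[];
}

async function flushQueue(
  queueRoot: string,
  copyToNas: (relPath: string, data: Buffer) => Promise<void>,
): Promise<FlushAccounting> {
  const acc: FlushAccounting = { flushed: [], failed: [] };

  // Collect every queued artifact under the queue root.
  const walk = async (dir: string): Promise<string[]> => {
    const out: string[] = [];
    for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) out.push(...(await walk(full)));
      else out.push(full);
    }
    return out;
  };

  // Sorted order keeps success/failure accounting deterministic.
  for (const full of (await walk(queueRoot)).sort()) {
    const rel = path.relative(queueRoot, full);
    try {
      // Idempotent: the copy overwrites the canonical NAS destination,
      // so re-running after a partial failure converges to the same state.
      await copyToNas(rel, await fs.readFile(full));
      await fs.unlink(full); // remove only after a confirmed copy
      acc.flushed.push(rel); // emit OTEL "flushed" event here
    } catch {
      acc.failed.push(rel); // leave in queue; emit "flush_failed" here
    }
  }
  return acc;
}
```

Failed artifacts stay in place, so the next flush pass retries them without extra bookkeeping.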

Implementation sequence (vector clock)

  1. Remove hard NAS mount dependency from runtime Typesense pod startup.
  2. Add queue fallback tier to NAS backup transport (local -> remote -> queued).
  3. Surface queued transport metadata in backup result/OTEL payloads.
  4. Add replay worker/command for queue flush (next increment).

Consequences

Good

  • NAS outages no longer block core pod startup.
  • Backup operations stay durable via deferred queue instead of hard-fail.
  • Degradation state becomes explicit and searchable.

Tradeoffs

  • Queue introduces eventual-consistency windows for NAS durability.
  • Requires queue replay hygiene to avoid backlog growth.

Compliance

Any new NAS-integrated flow that does not define local/remote/queued fallback behavior is non-compliant.