ADR-0187

NAS Degradation Local/Temp/Queue Fallback Contract

Status: accepted
Date: 2026-03-01
Updated: 2026-03-01
Deciders: Joel Hooks, Panda
Related: ADR-0088 (storage tiering), ADR-0138 (self-healing backup), ADR-0153 (docs API)

Context

NAS availability is variable. When NAS is unreachable, critical runtime services must not fail closed.

Observed failure mode:

  • A hard NFS mount in typesense-0 left the pod stuck in ContainerCreating with FailedMount events, which cascaded into docs search socket failures and repeated docs-ingest failures.

This violates the system rule that infra failures must degrade gracefully and visibly.

Decision

Adopt a mandatory NAS degradation contract for all NAS-coupled operations.

1) Runtime isolation (hard rule)

Critical runtime pods must not have hard NAS mount startup dependencies.

  • Typesense and other core services must boot from local state first.
  • NAS is durability/transport, not process liveness.

2) Three-tier NAS write fallback (hard rule)

All NAS write flows must follow this order:

  1. local mount path (fast path; e.g. /Volumes/three-body)
  2. direct remote copy (SSH/SCP to NAS path)
  3. local queue spool when NAS is unavailable (deferred sync)

Queue spool path default:

  • /tmp/joelclaw/nas-queue (override with NAS_BACKUP_QUEUE_ROOT)
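The three-tier order can be sketched as a single write function. This is a minimal illustration, not the actual transport: `writeWithFallback`, `tryLocal`, and `tryRemote` are hypothetical names, and the real local/remote implementations (mount write, SSH/SCP) are passed in as callbacks.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

type TransportMode = "local" | "remote" | "queued";

interface WriteResult {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string;
  reason?: string;
}

// Queue root default per this ADR, overridable via NAS_BACKUP_QUEUE_ROOT.
const QUEUE_ROOT =
  process.env.NAS_BACKUP_QUEUE_ROOT ?? "/tmp/joelclaw/nas-queue";

async function writeWithFallback(
  relPath: string,
  data: Buffer,
  tryLocal: (relPath: string, data: Buffer) => Promise<void>,
  tryRemote: (relPath: string, data: Buffer) => Promise<void>,
): Promise<WriteResult> {
  let attempts = 0;

  // Tier 1: local mount fast path (e.g. /Volumes/three-body).
  try {
    attempts++;
    await tryLocal(relPath, data);
    return { transportMode: "local", transportAttempts: attempts };
  } catch {
    // fall through to tier 2
  }

  // Tier 2: direct remote copy (SSH/SCP to the NAS path).
  try {
    attempts++;
    await tryRemote(relPath, data);
    return { transportMode: "remote", transportAttempts: attempts };
  } catch {
    // fall through to tier 3
  }

  // Tier 3: spool into the local queue for deferred sync.
  attempts++;
  const queuedPath = path.join(QUEUE_ROOT, relPath);
  await fs.mkdir(path.dirname(queuedPath), { recursive: true });
  await fs.writeFile(queuedPath, data);
  return {
    transportMode: "queued",
    transportAttempts: attempts,
    queuedPath,
    reason: "nas_unreachable",
  };
}
```

Note the return value already carries the telemetry fields section 3 requires, so the caller cannot end up queued without a record of it.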

3) Degraded-mode observability (hard rule)

Every fallback write must persist:

  • transportMode: local | remote | queued
  • transportAttempts
  • queuedPath and reason when mode is queued

Degraded/queued operation without telemetry is non-compliant.
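One way to make "non-compliant" checkable is a small validator over the persisted fields. The record shape below mirrors section 3; the `isCompliant` function itself is an illustrative sketch, not an existing API.

```typescript
type TransportMode = "local" | "remote" | "queued";

interface TransportTelemetry {
  transportMode: TransportMode;
  transportAttempts: number;
  queuedPath?: string;
  reason?: string;
}

// A result is compliant only if attempts were recorded and, for queued
// mode, both queuedPath and reason are present.
function isCompliant(t: TransportTelemetry): boolean {
  if (t.transportAttempts < 1) return false;
  if (t.transportMode === "queued") {
    return Boolean(t.queuedPath && t.reason);
  }
  return true;
}
```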

4) Replay requirement

Queued artifacts must be replayed by a dedicated flush path (follow-up work).

Minimum replay behavior:

  • idempotent copy from queue root to canonical NAS destination
  • deterministic success/failure accounting
  • OTEL events for queued, flushed, and flush_failed
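The minimum replay behavior can be sketched as a flush pass over the queue root. This is an assumed shape for the follow-up worker: `flushQueue` and `copyToNas` are hypothetical names, and the real transport (and OTEL emission) would be injected where the comments indicate.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

interface FlushAccounting {
  flushed: string[];
  failed: string[];
}

async function flushQueue(
  queueRoot: string,
  copyToNas: (relPath: string, data: Buffer) => Promise<void>,
): Promise<FlushAccounting> {
  const acc: FlushAccounting = { flushed: [], failed: [] };

  // Collect every queued artifact under the queue root.
  const walk = async (dir: string): Promise<string[]> => {
    const out: string[] = [];
    for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) out.push(...(await walk(full)));
      else out.push(full);
    }
    return out;
  };

  // Sorted order keeps success/failure accounting deterministic.
  for (const full of (await walk(queueRoot)).sort()) {
    const rel = path.relative(queueRoot, full);
    try {
      // Idempotent: the copy overwrites the canonical NAS destination,
      // so re-running after a partial failure converges to the same state.
      await copyToNas(rel, await fs.readFile(full));
      await fs.unlink(full); // remove only after a confirmed copy
      acc.flushed.push(rel); // emit OTEL "flushed" event here
    } catch {
      acc.failed.push(rel); // leave in queue; emit "flush_failed" here
    }
  }
  return acc;
}
```

Failed artifacts stay in place, so the next flush pass retries them without extra bookkeeping.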

Implementation sequence (vector clock)

  1. Remove hard NAS mount dependency from runtime Typesense pod startup.
  2. Add queue fallback tier to NAS backup transport (local -> remote -> queued).
  3. Surface queued transport metadata in backup result/OTEL payloads.
  4. Add replay worker/command for queue flush (next increment).

Consequences

Good

  • NAS outages no longer block core pod startup.
  • Backup operations stay durable via deferred queue instead of hard-fail.
  • Degradation state becomes explicit and searchable.

Tradeoffs

  • Queue introduces eventual-consistency windows for NAS durability.
  • Requires queue replay hygiene to avoid backlog growth.

Compliance

Any new NAS-integrated flow that does not define local/remote/queued fallback behavior is non-compliant.