
Incident response

::: danger Restricted
ADMINISTRATOR only.
:::

What to do when something is on fire.

Triage order

  1. Is the app down? → check /api/health. If non-200, the Worker itself is broken.
  2. Is the DB down? → /api/health returns 200 but every other endpoint 500s. Check D1 console.
  3. Is AI down? → app loads, but /api/ai-status shows all-failed. Likely a single provider (rotate key) or the budget guard tripped.
  4. Is one vessel broken? → other vessels load fine, one returns 500. Likely a phantom column or a partial migration. See v2.31.0.30.
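The same order can be run as a quick probe. A minimal sketch using only the endpoints named above; the /api/vessels data endpoint and the response handling are assumptions:

```bash
#!/usr/bin/env bash
# Triage probe in the order above; stops at the first failing layer.
BASE=https://pmsplanner.com

code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE/api/health")
[ "$code" = "200" ] || { echo "1) Worker broken: /api/health -> $code"; exit 1; }

# any known data endpoint works here; /api/vessels is a placeholder
code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE/api/vessels")
[ "$code" = "200" ] || { echo "2) DB suspect: health OK, data endpoint -> $code"; exit 1; }

echo "3) AI layer:"
curl -s "$BASE/api/ai-status"
```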

Known-incident playbooks

"ReferenceError: X is not defined" — entire SPA white-screened

Symptom. Every page load throws a console ReferenceError, white screen, no nav.

Cause. vite build shipped a bundle missing an import. Happens when tsc is not gating the build.

Fix.

  1. git revert the offending commit
  2. Run scripts/deploy.sh to ship the prior version
  3. Reproduce locally with tsc --noEmit — it should error
  4. Add the missing import, re-deploy

Prevention. scripts/deploy.sh runs tsc --noEmit before vite build since v2.29.5.
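A minimal sketch of that gate, assuming a plain npm toolchain; the real scripts/deploy.sh also runs the post-deploy login probe described below:

```bash
#!/usr/bin/env bash
set -euo pipefail    # any failing step aborts the deploy

npx tsc --noEmit     # type-check only; a missing import fails here, not in production
npx vite build
npx wrangler deploy
```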

D1_ERROR: no such column: X

Symptom. One endpoint 500s, others fine. Worker log shows the SQLITE error text.

Cause. A SELECT references a column that was never added by a migration, or a column was dropped without updating every SELECT site.

Fix.

  1. grep -rn "SELECT.*X" src/ — find every binding site
  2. Either drop the column from the SELECT (if the column was always phantom — see v2.31.0.30 source_documents.classification), or write a migration to add it (sketched after this list)
  3. Add a unit test that exercises the endpoint against a fresh D1 fixture
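Option 2 of the fix as commands. All names below (table, column, migration file) are hypothetical placeholders:

```bash
# find every binding site (X = the column from the error)
grep -rn "SELECT.*X" src/

# if the column should exist: add it via a migration (hypothetical names)
cat > migrations/00NN_add_x.sql <<'SQL'
ALTER TABLE some_table ADD COLUMN x TEXT;
SQL
wrangler d1 migrations apply pms-db --remote
```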

KB orphan heal returns errorSamples[] with SQLITE errors

Symptom. POST /api/admin/kb-orphan-heal { dryRun: true } returns scanned=N healed=0 errors=N with errorSamples[].error containing a SQL error text.

Cause. Schema drift in ucs_master_list — a column rename (name → component_name per migration 0011) broke the heal SQL.

Fix. See v2.31.0.20 — src/kb-orphan-heal.js candidate prefilter must use the current column name. Add an active-version filter (version_id IN (SELECT id FROM ucs_foundation_versions WHERE is_active=1)) so codes from old foundation versions can never become heal targets.
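To confirm the heal can only target active-version codes, the filter from the fix can be run directly against D1. A sketch, assuming version_id lives on ucs_master_list as the prefilter implies:

```bash
# count candidate codes scoped to the active foundation version
wrangler d1 execute pms-db --remote --command \
  "SELECT COUNT(*) FROM ucs_master_list
   WHERE version_id IN (SELECT id FROM ucs_foundation_versions WHERE is_active=1)"
```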

CL Build "Master list id=null not found"

Symptom. Start CL Build returns 422 / 500 with this string after the operator completes Admin → Import.

Cause. The CL Skeleton runner only consulted the legacy master_lists table, not ucs_foundation_versions(is_active=1).

Fix. v2.31.0.16 introduced src/cl-skeleton/active-master-resolver.js (resolveActiveMaster + listActiveMasters) which prefers ucs_foundation_versions and falls back to master_lists. v2.31.0.17 wired the runner to consume the GLOBAL UCS Manifest Master directly from ucs_foundation_files (kind=master_full).
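Before retrying a CL Build, you can check that an active foundation version exists at all; the id column is implied by the active-version filter used elsewhere on this page:

```bash
# the resolver prefers ucs_foundation_versions(is_active=1); confirm one is present
wrangler d1 execute pms-db --remote --command \
  "SELECT id FROM ucs_foundation_versions WHERE is_active=1"
```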

Login probe fails after deploy

Symptom. scripts/deploy.sh exits 3.

Cause. Often a CSP / cookie / origin regression rather than auth itself.

Fix.

  1. curl -i -X POST https://pmsplanner.com/api/auth/login -d '{"username":"admin","password":"Spb812"}'
  2. If 200 with Set-Cookie, the deploy script's curl is wrong; check /tmp/pms-cookies.txt writability
  3. If 401, the users.password_hash was clobbered — restore from R2 backup
  4. If 5xx, Worker error — check the Worker log
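Step 2's jar check, spelled out. The Content-Type header and the -c cookie-jar flag are assumptions about how the deploy script invokes curl:

```bash
# reproduce the probe with an explicit cookie jar
curl -si -X POST https://pmsplanner.com/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"Spb812"}' \
  -c /tmp/pms-cookies.txt

ls -l /tmp/pms-cookies.txt    # confirm the jar path is writable and was written
```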

AI provider single-failure cascading to all-failed swarm

Symptom. /api/ai-status shows everything red even though only one provider had a real outage.

Cause. Some chains have hard fan-out without per-provider isolation; one external 5xx propagates.

Fix. Per-provider AbortSignal.timeout in src/gemini.js plus a Promise.race wrapper in src/ai-router.js (v2.30.0). The Workers AI binding can't take an AbortSignal, so the router races .run() against a setTimeout reject instead. All timeout errors emit timeout: <Nms>, so ai_calls rows can distinguish timeouts from external 5xx.
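The race itself lives in the Worker source; from the outside, the timeout marker can be picked out of the live log. A sketch, assuming the timeout: <Nms> text lands verbatim in the tail output:

```bash
# stream live Worker logs and filter for the timeout marker vs other errors
wrangler tail --format pretty | grep -E 'timeout: [0-9]+ms'
```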

Budget guard tripped — 429 storm

Symptom. Users see "AI temporarily unavailable" toasts; ai-status thread shows budget cap reached.

Fix. Wait until UTC midnight (the cap resets). Or temporarily raise the cap — but expect a real bill. The sentinel-deduped ai-status thread prevents spam at every overrun.

RAG eval recall@5 ≥ 5pp regression

Symptom. Daily 02:00 UTC cron sends an ai-status thread + Postmark email.

Triage.

  1. What changed yesterday? Diff against the prior day's eval set output.
  2. Is mandatory_class backfill complete? POST /api/admin/backfill/rag-chunks-mandatory-class { dryRun: true } — if totalUpdated > 0, run for real (command sketch after this list).
  3. Did a UCS cascade run? If yes, the rag-cascade pass (F-5 milestone) hasn't shipped yet — old codes resolve via code_history but rag-chunks still hold old codes.
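Step 2 as a command. Auth is omitted, and the dryRun: false re-run shape is an assumption:

```bash
# dry-run the mandatory_class backfill
curl -s -X POST https://pmsplanner.com/api/admin/backfill/rag-chunks-mandatory-class \
  -H 'Content-Type: application/json' \
  -d '{"dryRun": true}'
# if totalUpdated > 0 in the response, repeat with {"dryRun": false}
```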

Restoring from backup

R2 backups at backups/pms-db-YYYY-MM-DD.sql (since 2026-04-23). Retention 30 days.

```bash
# list
wrangler r2 object list pms --prefix backups/

# download
wrangler r2 object get pms backups/pms-db-2026-04-27.sql > snapshot.sql

# inspect
head -100 snapshot.sql

# restore (this is destructive — confirm first)
wrangler d1 execute pms-db --remote --file=snapshot.sql
```

Restore is destructive

A restore overwrites the live DB with the snapshot. Schedule a maintenance window, post an ai-status notice, and verify the snapshot is from the right date (the file name is the export date, not the data date).
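Before any restore, take a fresh export so the restore itself is reversible. wrangler d1 export is a standard command; the file name is just a suggestion:

```bash
# snapshot the live DB immediately before restoring
wrangler d1 export pms-db --remote --output "pre-restore-$(date -u +%F).sql"
```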

Escalation

There is no second engineer. The owner is the sole dev. If a problem cannot be resolved, the path is:

  1. Pause writes — flip the MAINTENANCE_MODE flag (App.tsx maintenance banner)
  2. Take a fresh D1 export to R2 (commands after this list)
  3. Restore from the most recent clean R2 snapshot
  4. Open a fresh deploy session with full handoff context
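Steps 2 and 3 scripted. Bucket and prefix match the backups above; treat the exact invocation as a sketch:

```bash
# 2. fresh D1 export, pushed to R2 alongside the nightly backups
STAMP=$(date -u +%F)
wrangler d1 export pms-db --remote --output "/tmp/pms-db-$STAMP.sql"
wrangler r2 object put "pms/backups/pms-db-$STAMP.sql" --file "/tmp/pms-db-$STAMP.sql"

# 3. restore from the most recent clean snapshot (destructive; see above)
wrangler d1 execute pms-db --remote --file=snapshot.sql
```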

RAPAX PMS Help · v2.31.0.26 · released 2026-04-28