# Incident response
::: danger Restricted
Administrator access required.
:::
What to do when something is on fire.
## Triage order
- Is the app down? → check `/api/health`. If non-200, the Worker itself is broken.
- Is the DB down? → `/api/health` returns 200 but every other endpoint 500s. Check the D1 console.
- Is AI down? → the app loads, but `/api/ai-status` shows all-failed. Likely a single provider (rotate the key) or the budget guard tripped.
- Is one vessel broken? → other vessels load fine, one returns 500. Likely a phantom column or a partial migration. See v2.31.0.30.
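The triage order above can be sketched as a decision function. This is a hypothetical shape for illustration (the function name and input fields are mine; the real checks are HTTP probes against the endpoints named above):

```javascript
// Hypothetical triage classifier: takes pre-gathered probe results and
// returns the first matching diagnosis, in the checklist's order.
function triage({ healthStatus, otherEndpointsOk, aiAllFailed, brokenVessels }) {
  if (healthStatus !== 200) return "worker-down";    // /api/health non-200
  if (!otherEndpointsOk) return "db-down";           // health 200, rest 500
  if (aiAllFailed) return "ai-down";                 // provider key or budget guard
  if (brokenVessels > 0) return "vessel-broken";     // phantom column / partial migration
  return "healthy";
}
```

The ordering matters: a dead Worker makes every later probe meaningless, so it short-circuits first.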
## Known-incident playbooks
"ReferenceError: X is not defined" — entire SPA white-screened
**Symptom.** Every page load throws a console `ReferenceError`, white screen, no nav.
**Cause.** `vite build` shipped a bundle missing an import. Happens when `tsc` is not gating the build.
**Fix.**

- `git revert` the offending commit
- Run `scripts/deploy.sh` to ship the prior version
- Reproduce locally with `tsc --noEmit` — it should error
- Add the missing import, re-deploy
**Prevention.** `scripts/deploy.sh` runs `tsc --noEmit` before `vite build` since v2.29.5.
### `D1_ERROR: no such column: X`
**Symptom.** One endpoint 500s, the others are fine. The Worker log shows the SQLite error text.
**Cause.** A SELECT references a column that was never added by a migration, or a column was dropped without updating every SELECT site.
**Fix.**

- `grep -rn "SELECT.*X" src/` — find every binding site
- Either drop the column from the SELECT (if the column was always phantom — see v2.31.0.30, `source_documents.classification`), or write a migration to add it
- Add a unit test that exercises the endpoint against a fresh D1 fixture
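A cheap pre-deploy guard against phantom columns is to diff the columns a SELECT names against the live schema. A minimal sketch (the helper name and the plain-array inputs are mine; in practice the schema side would come from `PRAGMA table_info`):

```javascript
// Return the columns a query selects that do not exist in the schema.
// selectedColumns: column names pulled from the SELECT under review.
// schemaColumns: column names actually present in the table.
function findPhantomColumns(selectedColumns, schemaColumns) {
  const known = new Set(schemaColumns.map((c) => c.toLowerCase()));
  return selectedColumns.filter((c) => !known.has(c.toLowerCase()));
}
```

Example: `findPhantomColumns(["id", "classification"], ["id", "title"])` returns `["classification"]` — exactly the class of drift that produced the v2.31.0.30 incident.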
### KB orphan heal returns `errorSamples[]` with SQLite errors
**Symptom.** `POST /api/admin/kb-orphan-heal { dryRun: true }` returns `scanned=N healed=0 errors=N`, with `errorSamples[].error` containing SQL error text.
**Cause.** Schema drift in `ucs_master_list` — a column rename (e.g. `name` → `component_name` per migration 0011) broke the heal SQL.
**Fix.** See v2.31.0.20 — the `src/kb-orphan-heal.js` candidate prefilter must use the current column name. Add an active-version filter (`version_id IN (SELECT id FROM ucs_foundation_versions WHERE is_active=1)`) so codes from old foundation versions can never become heal targets.
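Putting the two fixes together, the candidate prefilter ends up shaped roughly like this. This is an illustrative reconstruction from the table and column names above, not the shipped query:

```javascript
// Sketch of the heal candidate prefilter: current column name
// (component_name, post-migration-0011) plus the active-version filter,
// so only codes from the active foundation version are heal targets.
function healCandidateSql() {
  return `SELECT id, code, component_name
FROM ucs_master_list
WHERE version_id IN (
  SELECT id FROM ucs_foundation_versions WHERE is_active = 1
)`;
}
```

If `errorSamples[]` ever fills with `no such column` errors again, the first suspect is this string drifting from the live schema.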
### CL Build "Master list id=null not found"
**Symptom.** Start CL Build returns 422 / 500 with this string after the operator completes Admin → Import.
**Cause.** The CL Skeleton runner only consulted the legacy `master_lists` table, not `ucs_foundation_versions` (`is_active=1`).
**Fix.** v2.31.0.16 introduced `src/cl-skeleton/active-master-resolver.js` (`resolveActiveMaster` + `listActiveMasters`), which prefers `ucs_foundation_versions` and falls back to `master_lists`. v2.31.0.17 wired the runner to consume the GLOBAL UCS Manifest Master directly from `ucs_foundation_files` (`kind=master_full`).
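The prefer-then-fall-back logic can be sketched as a pure function. The plain-array inputs and the return shape are mine for illustration; the real resolver queries D1:

```javascript
// Prefer the active ucs_foundation_versions row; fall back to the legacy
// master_lists table; return null when neither exists (which is exactly
// what surfaces as "Master list id=null not found").
function resolveActiveMaster(foundationVersions, legacyMasterLists) {
  const active = foundationVersions.find((v) => v.is_active === 1);
  if (active) return { source: "ucs_foundation_versions", id: active.id };
  const legacy = legacyMasterLists[0];
  if (legacy) return { source: "master_lists", id: legacy.id };
  return null;
}
```

The `null` branch is the diagnostic payoff: if it fires after an Import, the import wrote to neither table, and the bug is upstream of the resolver.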
### Login probe fails after deploy
**Symptom.** `scripts/deploy.sh` exits 3.
**Cause.** Often a CSP / cookie / origin regression rather than auth itself.
**Fix.**

- `curl -i -X POST https://pmsplanner.com/api/auth/login -d '{"username":"admin","password":"Spb812"}'`
- If 200 with `Set-Cookie`, the deploy script's curl is wrong; check `/tmp/pms-cookies.txt` writability
- If 401, the `users.password_hash` was clobbered — restore from R2 backup
- If 5xx, Worker error — check the Worker log
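The three branches above reduce to a small decision table. A hypothetical sketch (function name and return strings are mine, chosen to mirror the bullets):

```javascript
// Map the manual curl's outcome to the next action from the playbook above.
function classifyLoginProbe(status, hasSetCookie) {
  if (status === 200 && hasSetCookie) return "check-deploy-script-curl"; // /tmp/pms-cookies.txt writability
  if (status === 401) return "restore-password-hash-from-r2";
  if (status >= 500) return "check-worker-log";
  return "investigate"; // anything else (e.g. 403 from CSP/origin rules)
}
```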
### AI provider single-failure cascading to all-failed swarm
**Symptom.** `/api/ai-status` shows everything red even though only one provider had a real outage.
**Cause.** Some chains have hard fan-out without per-provider isolation; one external 5xx propagates.
**Fix.** Per-provider `AbortSignal.timeout` in `src/gemini.js` plus a `Promise.race` wrapper in `src/ai-router.js` (v2.30.0). The Workers AI binding can't take an `AbortSignal`, so we race `.run()` against a `setTimeout` reject. All timeout errors emit `timeout: <Nms>` so `ai_calls` rows distinguish timeouts from external 5xx.
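The race-against-a-rejecting-timer pattern looks roughly like this. A minimal sketch, assuming the `timeout: <N>ms` message format above; the wrapper name is mine:

```javascript
// Race any promise (e.g. the Workers AI binding's .run(), which cannot
// take an AbortSignal) against a timer that rejects after `ms`. The error
// message carries "timeout: <N>ms" so logs can tell timeouts from 5xx.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timeout: ${ms}ms`)), ms);
  });
  // Clear the timer either way so the losing branch doesn't leak.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `Promise.race` does not cancel the losing call — the provider request keeps running to completion in the background; the race only caps how long the chain waits for it.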
### Budget guard tripped — 429 storm
**Symptom.** Users see "AI temporarily unavailable" toasts; the ai-status thread shows the budget cap was reached.
**Fix.** Wait until UTC midnight (the cap resets), or temporarily raise the cap — but expect a real bill. The sentinel-deduped ai-status thread prevents a fresh alert from being posted at every overrun.
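If you choose to wait it out, the remaining time is a one-liner. The helper name is mine:

```javascript
// Milliseconds until the next UTC midnight, when the budget cap resets.
function msUntilUtcMidnight(now = new Date()) {
  const nextMidnight = Date.UTC(
    now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1
  );
  return nextMidnight - now.getTime();
}
```

`Date.UTC` normalizes the day overflow (e.g. Dec 31 + 1 rolls into Jan 1), so no month/year arithmetic is needed.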
### RAG eval recall@5 regression of ≥ 5pp
**Symptom.** The daily 02:00 UTC cron sends an ai-status thread plus a Postmark email.
**Triage.**

- What changed yesterday? Diff against the prior day's eval set output.
- Is the `mandatory_class` backfill complete? `POST /api/admin/backfill/rag-chunks-mandatory-class { dryRun: true }` — if `totalUpdated > 0`, run for real.
- Did a UCS cascade run? If yes, the rag-cascade pass (F-5 milestone) hasn't shipped yet — old codes resolve via `code_history`, but rag-chunks still hold old codes.
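For reference when diffing eval outputs, recall@k here is assumed to mean the fraction of eval questions whose expected chunk appears in the top-k retrieved chunks (the metric definition and function shape are my reconstruction, not the shipped eval code):

```javascript
// recall@k over an eval run. Each result pairs the expected chunk id with
// the retrieved chunk ids in rank order.
function recallAtK(results, k = 5) {
  const hits = results.filter(({ expected, retrieved }) =>
    retrieved.slice(0, k).includes(expected)
  ).length;
  return hits / results.length;
}
```

A "≥ 5pp regression" then means yesterday's `recallAtK(...)` minus today's is at least 0.05.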
## Restoring from backup
R2 backups live at `backups/pms-db-YYYY-MM-DD.sql` (since 2026-04-23). Retention is 30 days.
```sh
# list
wrangler r2 object list pms --prefix backups/

# download
wrangler r2 object get pms backups/pms-db-2026-04-27.sql > snapshot.sql

# inspect
head -100 snapshot.sql

# restore (this is destructive — confirm first)
wrangler d1 execute pms-db --remote --file=snapshot.sql
```

::: warning Restore is destructive
A restore overwrites the live DB with the snapshot. Schedule a maintenance window, post an ai-status notice, and verify the snapshot is from the right date (the file name is the export date, not the data date).
:::
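Since the file name carries the export date, it can be sanity-checked before restoring. An illustrative helper (name and return shape are mine) that parses the key and checks it against the 30-day retention window:

```javascript
// Parse backups/pms-db-YYYY-MM-DD.sql and report how old the export is.
// Returns null if the key doesn't match the backup naming scheme.
function snapshotAge(key, today) {
  const m = /pms-db-(\d{4}-\d{2}-\d{2})\.sql$/.exec(key);
  if (!m) return null;
  const exported = new Date(`${m[1]}T00:00:00Z`);
  const days = Math.floor((today - exported) / 86400000);
  return { date: m[1], days, withinRetention: days <= 30 };
}
```

Anything reporting `withinRetention: false` should already have aged out of the bucket — if you are about to restore it, re-check which snapshot you downloaded.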
## Escalation
There is no second engineer. The owner is the sole dev. If a problem cannot be resolved, the path is:
- Pause writes — flip the `MAINTENANCE_MODE` flag (App.tsx maintenance banner)
- Take a fresh D1 export to R2
- Restore from the most recent clean R2 snapshot
- Open a fresh deploy session with full handoff context