February 3, 2026
A solo founder has to design for recovery
Things will break. The question is not whether the system is perfect. The question is whether you can understand it, fix it, and keep moving without a team of specialists.
Perfection vs recoverability
Perfect architecture is a luxury. Recoverability is a requirement when you are the on-call rotation.
Design on the assumption that things fail: deploys go wrong, APIs change, disks fill, credentials expire. The goal is not zero incidents. It is a short time-to-understanding and a fix you can ship before the day is gone.
Observability for one person
You need logs you can search, errors that point to a file, and alerts that wake you only when action is required. Fancy tracing is optional; clarity is not.
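Here is a minimal sketch of what "errors that point to a file" can look like with Python's standard logging module. The logger name `billing` and the `charge` function are made up for illustration; the point is only that every line carries a level, a source file, and a line number you can grep for.

```python
import logging


def build_logger(name, stream=None):
    """Return a logger whose lines are searchable and name their source location."""
    handler = logging.StreamHandler(stream)  # stream=None falls back to stderr
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s %(filename)s:%(lineno)d %(message)s"
    ))
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output in this one handler
    return logger


def charge(logger, order_id):
    """Hypothetical operation that fails, to show what the error line looks like."""
    try:
        raise TimeoutError("payment provider timed out")  # stand-in failure
    except TimeoutError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("charge failed order_id=%s", order_id)
```

Putting identifiers such as `order_id=...` in the message as key=value pairs is a convention I find easy to search, not a requirement of the library.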
A one-page runbook beats a diagram nobody opens: how to roll back, where secrets live, what “healthy” looks like.
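The "what healthy looks like" part of the runbook can also live in a small script, so the definition is something you run rather than remember. The sketch below is an assumption-heavy example: the two checks (free disk space, days until a credential expires) and their thresholds are illustrative choices, not a recommended set.

```python
import shutil
from datetime import date, timedelta


def disk_ok(path="/", min_free_fraction=0.10):
    """True if at least min_free_fraction of the disk holding path is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


def credential_ok(expires_on, today=None, min_days_left=14):
    """True if the credential still has at least min_days_left days before expiry."""
    today = today or date.today()
    return expires_on - today >= timedelta(days=min_days_left)


def health_report(checks):
    """Take {check name: passed?} and return (healthy, sorted failing names)."""
    failing = sorted(name for name, ok in checks.items() if not ok)
    return (not failing, failing)
```

A call might look like `health_report({"disk": disk_ok(), "api_key": credential_ok(date(2026, 6, 1))})`, where the expiry date is a placeholder you would read from wherever your secrets actually live.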
After something breaks
Capture what happened in plain language. Patch the immediate fire. Then add the guardrail that would have made the failure boring next time. That is usually a test, a limit, or a check, not a new platform.
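As one example of a guardrail in the "check" category, suppose the incident was a deploy that started without a required setting and failed hours later. A fail-fast startup check turns that into an immediate, obvious error. The setting names below are hypothetical.

```python
# Hypothetical setting names; replace with whatever your service needs.
REQUIRED_SETTINGS = ("DATABASE_URL", "PAYMENT_API_KEY")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required setting names that are absent or blank in env."""
    return [name for name in required if not str(env.get(name, "")).strip()]


def check_startup(env):
    """Raise immediately if any required setting is missing."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
```

Calling `check_startup(os.environ)` as the first line of the entry point means a bad deploy fails at boot, while you are still watching it.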
I have kept products alive by favoring systems I can re-derive from docs and code layout, not tribal knowledge.
How to apply it
- Practice rollback before you need it.
- Alert on user-visible failure, not on every blip.
- Time-box incident follow-up: fix, guardrail, document.
- Delete paths you cannot explain to future-you.
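To make "alert on user-visible failure, not on every blip" concrete, here is a rough sketch of a rule that pages only when the failure rate over a recent window of requests crosses a threshold. The class name, window size, and threshold are all assumptions for illustration; a hosted monitoring tool would express the same idea in its own configuration.

```python
from collections import deque


class ErrorRateAlert:
    """Signal an alert only when recent failures are sustained, not on one blip."""

    def __init__(self, window=20, threshold=0.5):
        self.results = deque(maxlen=window)  # keeps only the last `window` outcomes
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome; return True if it is time to page."""
        self.results.append(ok)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        failures = sum(1 for r in self.results if not r)
        return failures / len(self.results) >= self.threshold
```

With `window=4` and `threshold=0.5`, a single failed request among successes stays quiet, while two failures in the last four requests trigger the alert.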