Designing a Multi-Tenant LLM Inference Platform, Part 2
Scaling a serving cell when cold starts take minutes: sizing warm spare from forecast error, model-local standby, draining, and failing honestly when the KV cache is gone.