If managing virtual machines gets cumbersome at scale, it
probably won’t surprise you that Google hit this problem early on,
nearly ten years ago, in fact. If you’ve ever
had to manage more than a few dozen VMs, this will be familiar to
you. Now imagine the problems of managing and coordinating
millions of VMs.
At that scale, you start to rethink the problem entirely, and that’s
exactly what happened. If your plan for scale was to have a staggeringly
large fleet of identical things that could be interchanged at a
moment’s notice, then did it really matter if any one of them failed?
Just mark it as bad, clean it up, and replace it.
Using that lens, the challenge shifts from configuration management
to orchestration, scheduling, and isolation. A failure of one computing
unit must not take down another (isolation), resources should be
reasonably well balanced geographically to distribute load (orchestration),
and you need to detect and replace failures near instantaneously