Nonstop IT and the myth of zero-downtime

Barry Morris

Downtime cost estimates generally produce arresting numbers, but for many CIOs they don’t really tell the story. Your downtime costs could be measured in hundreds of dollars per hour or millions of dollars per hour, and your downtime risks could be measured in billions of dollars. An automated trading system that is down during a market correction event could quickly represent a multi-billion dollar loss. But in truth downtime is no longer about the costs and risks of crisis. It has become the new norm: “All systems up, all the time.” Welcome to nonstop IT.

So we have become quite excited about redundant microservices that neatly avoid overload by “just adding nodes,” and that enable graceful node failure by seamlessly redirecting load to live nodes. Elastic microservices address both capacity adjustment and transparent failover at the application layer. Nice.

Nonstop IT is also how BlueGreen has gone from a classic Crayola color to a model for continuous deployment of applications, the Julia Child approach of one-I-prepared-earlier. Switching over between full application stacks at the web service level is an elegantly blunt solution for live application evolution.

So for all greenfield applications that are stateless, the recipe for nonstop IT is in place. For everything else there is, of course, a much bigger problem. The world of databases, specifically, is considerably more challenging.

“Best practices” for zero-downtime with databases are “best” not because they are good ideas, but because of a paucity of alternatives. Planned database downtime and unplanned database downtime are the hardest challenges as we strive for the plenty-of-nines service-level agreement (SLA).

High availability (HA) solutions for traditional relational database management systems (RDBMS) are complex and fragile. The RDBMS administrator is faced with a spectrum of alternatives, with a set of trade-offs for each. A typical answer to live upgrades is to use Oracle Data Guard to create an active standby and bravely leap from old to new. In the best case, these help to reduce downtime but are not architected as zero-downtime solutions. And naturally, the people and tools costs can be quite significant.

The dream is a database system with:

  • No single point of failure (SPOF)
  • Dynamic provisioning
  • Push-button backup and restore
  • Online maintenance (upgrades, schema changes, storage management, etc.)
  • Easy automation of administrative tasks
  • A Jobs-esque insane simplicity of usage

You might call that a “cloud-native database.”

If the nonstop IT imperative leads away from traditional single-server databases and toward cloud-native databases, then what realistic alternative directions are emerging for an organization building and migrating applications for elastic infrastructures? The answers really fall into three categories:

  1. Database as a service (DBaaS)
  2. NoSQL solutions
  3. Elastic SQL solutions

1  2  Next Page