We take the reliability of our hardware and software platforms as a given. We don’t often think about how this level of reliability is possible. We have never really stressed these platforms; we use only a fraction of the theoretical capabilities of the technologies that we deploy. This is about to change.
As technologies such as grid computing and virtualization become more mainstream, the usage rates of our three main resources (storage, compute cycles, and connectivity) will increase. These rates will soon approach the theoretical limits of capacity. This may lead us to face unexpected reliability issues, and we will need better monitoring and instrumentation tools to see them coming.
One of my clients inadvertently stepped into this uncharted territory. A combination of hardware, software, and other components that had been performing flawlessly was suddenly unable to meet a threshold. This led to a series of failures, which eventually triggered an unrecognized loop condition.
As a result, one of the company’s core business processes was suspended for two days. The cause was not some obscure combination of technologies; it was extremely high usage levels. The client questioned its vendors and discovered that the products had never been tested at such high usage.
The manufacturers had concluded that such testing was not cost-effective, and they did not state the maximum usage level up to which they could make quality assurance or capability claims. My client was able to reproduce the problem using a different combination of technologies at the same usage level.
This threshold-of-failure phenomenon has been seen in other high-use technology combinations. One example is a client who runs virtualized servers in an environment that consumes more than 95 percent of the available hardware capacity.
The company is now experiencing unexpected failure events, which are difficult to pinpoint and hard to prevent. This might seem like an unusually high usage level, and virtualization software vendors do recommend that users stay at no more than 80 percent to maintain stable operations. However, infrastructure managers will be under increasing financial pressure to push their usage toward 100 percent. These examples are a reminder that technology strategists and CIOs need to ask a fundamental question: How do vendors make reliability claims?
It seems that these claims rest about 50 percent on physical testing (run to destruction) and about 50 percent on simulation and modeling based on physical test data. That is fine as long as you stay within the performance envelope. But what does it tell you about life at or near the edge of theoretical performance? The emerging answers are “not enough” and “not a lot.”
Next time you are thinking about pushing your infrastructure harder to meet budget constraints, ask your suppliers about the limits of their products and adjust your plans accordingly.