Over the past decade, enterprise business processes have changed fundamentally. Among these changes, none has been more profound than the increased reliance on IT systems to support business-critical applications.

For many of today’s datacentres, throughput has evolved into a monetised commodity. The datacentre and its availability no longer support just the internal needs of the organisation; they have become essential to companies whose customers pay a premium for continuous access to a variety of applications and resources.

In today’s enterprise datacentre, proactive support, in the form of well-managed configurations, patches, and firmware levels, is often considered the key to preventing unwanted downtime. But while these are critical to supporting today’s complex virtual environments, they are only part of the preventive maintenance story. The other side of proactive support covers factors so obvious that they tend to be overlooked completely.

Indeed, my conversations with datacentre managers have repeatedly shown that a lack of attention to these seemingly obvious issues is the common theme that underpins so many significant outages. So what exactly are these uncomplicated ‘complexities’ that exist in the modern datacentre environment and what steps should be taken to minimise the unwanted downtime they so often cause?

Well, sitting at the forefront of any datacentre manager’s agenda should be the proper design and implementation of all power and cooling processes. A customer interviewed by IDC recently relayed a story about a significant downtime event that could easily have been averted. The customer’s IT department was undertaking server consolidation, which involved running numerous servers in parallel until all replication was complete. However, IT had not communicated its plan to the facilities department, meaning the datacentre did not have enough power to run all these systems at the same time, resulting in a catastrophic power outage.

Clean power is another key component in keeping a datacentre running smoothly. Spikes in the incoming electrical supply can have a devastating effect, particularly if that power is not properly “cleaned”. Effectively detecting and recovering from power system failures or fluctuations is crucial to organisations of all sizes. By integrating a comprehensive monitoring solution, including battery and branch circuit monitoring, IT staff can quickly identify, isolate, and address power equipment issues before they cause any significant damage.

Change-control processes must also be routinely enforced in the modern datacentre to ensure all system changes are assessed and approved before implementation. This can only be accomplished with a formal set of procedures and processes that follow generally accepted guidelines for change and configuration management. Indeed, all work that takes place within the datacentre should have a written procedure, with a library of such procedures made readily available for all scheduled maintenance, corrective maintenance, and installation activities.

In addition, every action in a mission-critical environment must be documented. And the documentation must provide value by measuring an expected result, creating a foundation for corrective actions, or promoting proactive, continuous improvement. Vendor turnover documentation is a vital component of the operation, but just as important for ensuring smooth systems operations is the availability of accurate network and system topology drawings. Information such as equipment lists, maintenance scopes of work, and maintenance schedules seems simple, but it often turns up missing, inaccurate, or inadequate when needed.

Similarly, when installing any asset in a datacentre, be sure to label and document every aspect of the equipment. Proper installation and cabling of assets are a must; a cable that is pulled or accidentally tripped over can easily cause a datacentre outage. I have heard countless stories of engineers causing damage and outages by powering off the wrong server, network device, or storage array because of improper documentation and labelling. And once all assets have been properly labelled and installed, IT staff must diligently ensure that the datacentre is kept clean and dust-free at all times.

Perhaps most importantly of all, datacentre access should be monitored, documented, and restricted to those with relevant training. Given the variety and complexity of devices in a datacentre, it is crucial to have well-trained staff on hand, and anyone accessing the datacentre should be educated on all policies and procedures for ensuring proper conduct in the environment. Human error is one of the major causes of datacentre outages, and such incidents can only be reduced through proper education and proper observance of datacentre procedures.

As the demands placed on organisations intensify, datacentres will inevitably become even more complicated to manage than they are today. As a result, CIOs must draw up a clear shortlist of tasks that should never be compromised in the datacentre environment. This includes incorporating sufficient power and cooling, implementing proper documentation, ensuring scrupulous datacentre maintenance and cleanliness, and utilising only specified, well-trained personnel.

Following these simple but often overlooked datacentre procedures can help reduce unwanted downtime and save the enterprise thousands, if not millions, of dollars each year. And while the tasks involved may seem self-evident, they will only grow in importance as the complexity of the datacentre increases.

The columnist is group vice-president and regional managing director for the Middle East, Africa, and Turkey at global ICT market intelligence and advisory firm International Data Corporation (IDC).