With increasing demands for mission-critical system reliability, prevention of data centre and edge computing downtime is far better than a reactive cure. Anticipating that an outage or power event could happen at any time, and taking steps to minimise the likelihood of disruption, is of course preferable to dealing with any potential problems after they have arisen, but what steps can operators take in the quest to mitigate downtime?
The Uptime Institute’s 10th annual data centre survey found that outages are occurring with disturbing frequency and are becoming both more damaging and expensive. One third of survey participants, for example, admitted to experiencing a major outage in the last 12 months, and one in six claimed it had cost them more than $1m (~£731K).
What’s interesting is that 75% of survey respondents stated that downtime could have been prevented with better management, training and improved processes. Consequently, a well thought out and meticulously followed proactive maintenance, or digital servicing, program is essential to safeguard against the risk of IT or network malfunction.
Whether downtime is caused by neglect, equipment failure, human error, or wear and tear of the IT, power or cooling equipment, the possibility of a major event can never be completely removed. However, with a smart and diligent approach to planning, we can at least reduce its impact on the business if and when such an event occurs.
Traditional servicing models
Today, preventative maintenance typically means regular inspection and testing of mission-critical systems, including timely replacement of consumables such as batteries, which are crucial to ensure the reliability of uninterruptible power supplies (UPS). Other critical replacements can include capacitors, filters and humidifier cylinders, updates or replacement of intelligent software and firmware, alongside a detailed and up-to-date record of the status of any such repairs with key stakeholders.
In the past, much of the physical maintenance was calendar based, with manual checks made at regular, specified intervals – quarterly, half yearly or annually, for example. As always, there was a trade-off between cost and reliability, determined by the frequency of such checks and the criticality of the systems. In many cases, the more frequent the support, the more reliable the operation but the greater the cost, which could often be a key deterrent to decision makers. If budgets were tight, then maintenance was less frequent, which undoubtedly created risk for business and mission-critical assets.
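The calendar-based model described above can be sketched in a few lines: service visits are projected forward from the last visit at a fixed interval, regardless of equipment condition. The interval and dates below are illustrative assumptions, not any specific service contract.

```python
from datetime import date, timedelta

def next_service_dates(last_service: date, interval_days: int, count: int = 4) -> list[date]:
    """Project upcoming service visits from a fixed calendar interval."""
    return [last_service + timedelta(days=interval_days * i) for i in range(1, count + 1)]

# A roughly quarterly (91-day) schedule from a last visit on 1 March 2021.
# Visits happen on these dates whether or not the equipment needs attention.
schedule = next_service_dates(date(2021, 3, 1), 91)
```

The rigidity is the point: nothing in this schedule responds to the actual health of the equipment, which is the gap that condition-based approaches address.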
Evolving your service plan
More recently, one of the major issues has been the adoption of edge computing: with a greater number of systems deployed over a larger geographical area, it is no longer practical or cost effective to have service personnel on site. Therefore, end-users have begun to look to technology as an enabler to overcome this issue.
During the last decade, software systems have had to quickly evolve, especially to meet the accelerated demands for digital transformation experienced during the last year. Now, with the growing availability and capabilities of intelligent management software, and the proliferation of distributed IT, remote management of multiple sites from a single software application is becoming more common.
Not only does such software provide a cost-effective means of monitoring the status of distributed assets in real time, it helps to move the practice of preventative maintenance from a traditional ‘calendar-based’ model to a ‘condition-based’ model.
Further, armed with immediate status information, maintenance visits for data centres and critical IT can be kept to a minimum, ensuring costs remain manageable. Instead, via proactive alarms and data-driven algorithms, customers can check the health and status of their equipment at any moment, ensuring any issues are proactively rectified.
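A minimal sketch of such a proactive alarm: telemetry readings are checked against health thresholds, and a visit is triggered only for components that have actually degraded. The metric names and threshold values here are illustrative assumptions, not any vendor’s telemetry schema.

```python
# Hypothetical alert thresholds for common consumables mentioned above;
# the metrics and limits are invented for illustration.
ALERT_THRESHOLDS = {
    "battery_internal_resistance_mohm": 45.0,  # rising resistance signals an ageing UPS battery
    "capacitor_esr_mohm": 120.0,               # ESR drift indicates capacitor wear
    "filter_pressure_drop_pa": 250.0,          # high pressure drop suggests a clogged filter
}

def check_asset(readings: dict[str, float]) -> list[str]:
    """Return the metrics that have crossed their alert threshold."""
    return [
        metric
        for metric, limit in ALERT_THRESHOLDS.items()
        if readings.get(metric, 0.0) > limit
    ]

# A site whose UPS battery has degraded but whose other components are healthy:
# only the battery is flagged, so only one targeted visit is scheduled.
alerts = check_asset({
    "battery_internal_resistance_mohm": 52.3,
    "capacitor_esr_mohm": 80.0,
    "filter_pressure_drop_pa": 110.0,
})
```

Contrast with the calendar model: no visit occurs for the healthy capacitors and filters, which is where the cost saving comes from.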
Not only does this drive operational efficiency and uptime, it reduces cost, especially where time was previously spent maintaining systems that were in acceptable working order. Due diligence remains absolutely critical, but utilising IoT-enabled technology to streamline costs, operations and maintenance removes much of the headache from the traditional approach, and allows digital servicing to be performed at the time when it is most beneficial.
Digital twins and condition-based maintenance
Condition-based maintenance is considered the next evolution in digital servicing and utilises artificial intelligence (AI) and big data analytics to continuously provide an estimate of how well an individual component or a specific system is performing. This data-driven approach provides end-users and external service partners, such as trusted vendors and managed service providers (MSPs), with up-to-the-minute insight and schedules intervention only when necessary. Data is crucial: the more information gathered by such remote monitoring systems, the more accurate the algorithms can be in calculating performance and determining whether a component is in need of repair or replacement.
The growing adoption of intelligent systems alongside big data analytics, AI and machine learning is producing more sophisticated calculation of critical components such as battery life expectancy in UPS. One such example is the evolution of digital twin technology, in which the provision of a digital replica of a physical asset, deployed in a data centre or edge environment, can be modelled and analysed.
Monitoring the difference in performance between the actual and digital systems can alert an operator if the performance of the real-world system has degraded below expectations, and should a component fail or be in need of maintenance or replacement, critical service teams can be dispatched to quickly remedy the situation.
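The comparison described above can be sketched simply: the measured value from the physical asset is checked against the digital twin’s prediction, and an alert is raised when the relative deviation exceeds a tolerance. The model, metric and tolerance below are assumptions for illustration only.

```python
def twin_deviation(measured: float, predicted: float) -> float:
    """Relative deviation of the physical asset from its digital replica."""
    return abs(measured - predicted) / abs(predicted)

def degraded(measured: float, predicted: float, tolerance: float = 0.05) -> bool:
    """True when the real system has drifted more than `tolerance` (5% by
    default, an illustrative choice) from its digital twin's prediction."""
    return twin_deviation(measured, predicted) > tolerance

# The twin predicts 96% UPS efficiency; the live system reports 89%.
# A ~7% relative deviation exceeds the 5% tolerance, so an alert fires
# and a service team can be dispatched.
needs_attention = degraded(measured=89.0, predicted=96.0)
```

In practice the twin would predict many correlated signals and the tolerance would be tuned per metric, but the principle is the same: the twin supplies the expected baseline that a calendar never could.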
Frankly, the condition-based approach offers a more cost-efficient solution to maintaining a system’s health without impairing reliability. However, the next evolutionary state in preventive methodologies is risk-informed maintenance, which builds on the analytical capabilities of modern approaches via next-generation data centre infrastructure management (DCIM) software. This enables maintenance to be planned based on an assessment of risks, effects of failure and calculated costs.
Such a strategy attempts to balance the Probability of Failure (PoF) and Consequences of Failure (CoF) of each asset. The promise of risk-informed maintenance is that interventions across a number of sites or installations can be prioritised when maintenance resources are limited. However, to ensure reliability, such an approach depends on an operator’s ability to accurately calculate the probability of failure, meaning software is absolutely crucial to the process.
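At its simplest, balancing PoF against CoF means ranking each asset by its expected loss, risk = PoF × CoF, and spending scarce maintenance resources on the highest-risk items first. The fleet below, with its PoF estimates and CoF costs, is an invented example.

```python
def prioritise(assets: list[dict]) -> list[dict]:
    """Order assets so the highest expected loss (PoF x CoF) comes first."""
    return sorted(assets, key=lambda a: a["pof"] * a["cof"], reverse=True)

# Hypothetical fleet: PoF is an annual failure probability, CoF an outage
# cost in pounds. All figures are illustrative.
fleet = [
    {"site": "edge-01", "asset": "UPS battery", "pof": 0.30, "cof": 50_000},   # risk 15,000
    {"site": "dc-core", "asset": "CRAC unit",   "pof": 0.05, "cof": 400_000},  # risk 20,000
    {"site": "edge-07", "asset": "fan tray",    "pof": 0.60, "cof": 5_000},    # risk  3,000
]

plan = prioritise(fleet)
```

Note how the ranking differs from intuition: the CRAC unit, despite being the least likely to fail, tops the list because its consequences dwarf the others. This is exactly why an accurate PoF estimate, and hence the software producing it, is crucial.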
As of today, a fully risk-informed maintenance program is probably beyond the means of most mission-critical facility operators. Nonetheless, the proliferation of data being captured by today’s systems and the continuous improvement in data analytics and AI are leading the industry toward a situation where risk-informed maintenance will, in future, become the norm – improving operational efficiency as well as the reliability of mission-critical infrastructure.
However, what’s crucial to remember is that expertise is essential, and no matter how technology is used to reduce downtime and mitigate risk, none will ever remove the need for technically competent maintenance staff to effect any changes and upgrades. Especially when, despite all precautions, a critical malfunction occurs.
As technologies continue to evolve, the human element becomes ever more critical, and when transitioning from preventative to remedial modes, an organisation will always depend on its ability to collaborate with expert technical personnel. That much remains unconditional.
by Wendy Torell, Senior Research Analyst, Schneider Electric Data Centre Science Centre