With increasing demands for
mission-critical system reliability, prevention of data centre and edge
computing downtime is far better than a reactive cure. Anticipating that an
outage or power event could happen at any time, and taking steps to minimise
the likelihood of disruption, is of course preferable to dealing with any
potential problems after they have arisen, but what steps can operators take in
the quest to mitigate downtime?
The Uptime Institute’s 10th annual data centre survey found that outages are occurring
with disturbing frequency and are becoming both more damaging and expensive.
One third of survey participants, for example, admitted to experiencing a major
outage in the last 12 months, and one in six reported that it had proved significantly costly.
What’s interesting is
that 75% of survey respondents stated that downtime could have been prevented
with better management, training and improved processes. Consequently, a well thought out and meticulously followed proactive
maintenance, or digital servicing, program is essential to safeguard against the
risk of IT or network malfunction.
Whether downtime is caused through neglect,
failure, human error, or through wear and tear of the IT, power or cooling
equipment, one can never completely remove the possibility of a major event.
However, with a smart and diligent approach to planning, operators can at least reduce
its impact on the business if and when such an event occurs.
Today, preventative maintenance typically
means regular inspection and testing of mission-critical systems, including timely
replacement of consumables such as batteries, which are crucial to ensure the
reliability of uninterruptible power supplies (UPS). Other critical
replacements can include capacitors, filters and humidifier cylinders, together with updates or
replacement of intelligent software and firmware, alongside keeping a detailed,
up-to-date record of the status of any such repairs, shared with key stakeholders.
In the past, much of the physical
maintenance was calendar based, with manual checks being made at regular,
specified intervals. The trade-off between cost and reliability was typically
determined by the interval between checks and the criticality of the
systems: inspections might take place quarterly, half-yearly or
annually. In many cases, the more frequent the support, the more reliable the
operation but the greater the cost, which could often be a key deterrent to
decision makers. If budgets were tight, then maintenance was less frequent,
which undoubtedly creates risk for business and mission-critical assets.
Remote monitoring and your service plan
More recently, one of the major issues has
been the adoption of edge computing, where with a greater number of systems
deployed over a larger geographical area, it is no longer practical or cost
effective to have service personnel on site. Therefore end-users have begun to
look to technology as an enabler to overcome this issue.
During the last decade, software systems have had to evolve quickly, especially to meet the accelerated demands for digital transformation experienced over the past year. Now, with the growing availability and capabilities of intelligent management software, and the proliferation of distributed IT, remote management of multiple sites from a single software application is becoming more common.
Not only does such software provide a
cost-effective means of monitoring the status of distributed assets in real time, it helps to move the practice of preventative
maintenance from a traditional ‘calendar-based’ model to a condition-based model.
Further, armed with immediate status information, maintenance visits for data centres and critical IT can be kept to a minimum, ensuring costs remain manageable. Instead, via proactive alarms and data-driven algorithms, customers can check the health and status of their equipment at any moment, ensuring any issues are proactively rectified.
Not only does this drive operational efficiency and uptime, it reduces cost, especially where time was previously spent maintaining systems that were already in acceptable working order. Due diligence remains critical, but utilising IoT-enabled technology to streamline costs, operations and maintenance removes much of the headache from the traditional approach, and allows digital servicing to be performed at the time when it is most beneficial.
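As a minimal sketch of the proactive-alarm idea described above: a monitoring application collects readings from distributed sites and dispatches engineers only where a reading breaches a limit, rather than on a fixed calendar. The metric names, threshold values and site names here are purely illustrative assumptions, not vendor specifications.

```python
# Condition-triggered maintenance sketch: visit only sites whose live
# readings breach a threshold, instead of visiting on a fixed calendar.
# All metrics and limits below are hypothetical examples.
THRESHOLDS = {
    "battery_temp_c": 40.0,      # sustained heat shortens battery life
    "fan_vibration_mm_s": 7.1,   # rising vibration suggests bearing wear
    "capacitor_esr_pct": 150.0,  # ESR drift vs. nominal signals ageing
}

def visits_needed(sites: dict[str, dict[str, float]]) -> list[str]:
    """Return only the sites whose readings exceed a threshold."""
    flagged = []
    for site, readings in sites.items():
        if any(readings.get(metric, 0.0) > limit
               for metric, limit in THRESHOLDS.items()):
            flagged.append(site)
    return flagged

sites = {
    "edge-01": {"battery_temp_c": 31.5, "fan_vibration_mm_s": 2.3},
    "edge-02": {"battery_temp_c": 43.2, "fan_vibration_mm_s": 1.9},
}
print(visits_needed(sites))  # ['edge-02']: only one site warrants a visit
```

In practice the thresholds would come from equipment specifications and the readings from the remote monitoring platform, but the dispatch logic follows the same shape.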
Digital twins and condition-based maintenance
Condition-based maintenance is considered
the next evolution in digital servicing and utilises artificial intelligence
(AI) and big data analytics to continuously provide an estimate of how well an
individual component or a specific system is performing. This data-driven
approach provides end-users and external service partners, such as trusted vendors
and managed service providers (MSPs), with up-to-the-minute insight and schedules
intervention only when necessary. Data is crucial: the more information gathered
by such remote monitoring systems, the more accurately the algorithms can
calculate performance and determine whether a system is in need of repair or replacement.
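To make the battery example concrete, here is a deliberately simple health estimate built on one widely cited rule of thumb: the service life of a VRLA battery roughly halves for every 10 °C of sustained operation above 25 °C. The safety margin and figures are illustrative assumptions; a production algorithm would use far richer telemetry.

```python
# Rule-of-thumb battery health sketch (illustrative, not a vendor model).
def battery_life_years(design_life_years: float, avg_temp_c: float) -> float:
    """VRLA service life roughly halves per 10 degC above 25 degC."""
    derating = 2 ** ((avg_temp_c - 25.0) / 10.0)
    return design_life_years / derating

def needs_replacement(age_years: float, design_life_years: float,
                      avg_temp_c: float, margin: float = 0.8) -> bool:
    """Flag a battery once it has consumed a safety margin (default 80%)
    of its temperature-adjusted expected life."""
    expected = battery_life_years(design_life_years, avg_temp_c)
    return age_years >= margin * expected

# A 10-year-design battery run at a sustained 35 degC is only expected
# to last about 5 years, so a 4.5-year-old unit is already due.
print(battery_life_years(10.0, 35.0))      # 5.0
print(needs_replacement(4.5, 10.0, 35.0))  # True
```

The point is not the particular formula but the pattern: continuously gathered data feeds a model, and the model, not the calendar, decides when intervention is scheduled.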
The growing adoption of intelligent systems
alongside big data analytics, AI and machine learning is producing more
sophisticated calculations of critical parameters, such as battery life
expectancy in UPS systems. One such example is the evolution of digital twin
technology, in which the provision of a digital replica of a physical asset,
deployed in a data centre or edge environment, can be modelled and analysed.
Monitoring the difference in performance
between the actual and digital systems can alert an operator if the performance
of the real-world system has degraded below expectations, and should a
component fail or be in need of maintenance or replacement, critical service
teams can be dispatched to quickly remedy the situation.
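The comparison between the real system and its digital replica can be sketched as follows. The efficiency curve here is a hypothetical stand-in for the twin's model; an actual twin would be calibrated against commissioning and telemetry data rather than hard-coded.

```python
# Digital-twin deviation sketch: alert when the physical system
# underperforms its modelled replica by more than a tolerance.
def expected_efficiency(load_pct: float) -> float:
    """Toy twin model: a hypothetical UPS efficiency curve peaking
    near mid-load. Purely illustrative, not a real product curve."""
    return 0.97 - 0.00002 * (load_pct - 60.0) ** 2

def degradation_alert(load_pct: float, measured_efficiency: float,
                      tolerance: float = 0.01) -> bool:
    """True when measured performance has degraded below the twin's
    prediction by more than the allowed tolerance."""
    return expected_efficiency(load_pct) - measured_efficiency > tolerance

print(degradation_alert(60.0, 0.965))  # False: within tolerance
print(degradation_alert(60.0, 0.950))  # True: dispatch a service team
```

This mirrors the text above: the twin supplies the expectation, live monitoring supplies the measurement, and the gap between them is what triggers a maintenance call.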
The condition-based approach offers
a more cost-efficient means of maintaining a system’s health without impairing
reliability. However, the next evolutionary step in preventative methodologies
is risk-informed maintenance, which builds on the analytical capabilities of
modern approaches via next-generation data centre infrastructure management
(DCIM) software. This enables maintenance to be planned based on an assessment
of risks, effects of failure and calculated costs.
Such a strategy attempts to balance the
Probability of Failure (PoF) and Consequences of Failure (CoF) of each asset;
the promise of risk-informed maintenance is that interventions across a
number of sites or installations can be prioritised even when maintenance
resources are limited. However, to ensure reliability, such an approach will depend on an
operator’s ability to accurately calculate the probability of failure, meaning
software is absolutely crucial to the process.
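The PoF/CoF balancing act reduces to a simple prioritisation once both quantities are estimated. The sketch below assumes hypothetical assets and figures; the hard part in reality, as noted above, is calculating the probability of failure accurately in the first place.

```python
# Risk-informed prioritisation sketch: risk = PoF x CoF, and a limited
# visit budget is spent on the highest-risk assets first.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    pof: float  # Probability of Failure over the planning window (0-1)
    cof: float  # Consequence of Failure, e.g. downtime cost

def prioritise(assets: list[Asset], visit_budget: int) -> list[str]:
    """Rank assets by risk and return the ones that get a visit."""
    ranked = sorted(assets, key=lambda a: a.pof * a.cof, reverse=True)
    return [a.name for a in ranked[:visit_budget]]

fleet = [
    Asset("hq-ups-a",    pof=0.02, cof=500_000),  # risk 10,000
    Asset("edge-ups-7",  pof=0.15, cof=20_000),   # risk  3,000
    Asset("branch-crac", pof=0.05, cof=400_000),  # risk 20,000
]
print(prioritise(fleet, visit_budget=2))  # ['branch-crac', 'hq-ups-a']
```

Note how a low-probability, high-consequence asset can outrank a failure-prone but low-impact one: that is precisely the trade-off a purely calendar- or condition-based plan cannot express.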
As of today, a fully risk-informed
maintenance program is probably beyond the means of most mission-critical
facility operators. Nonetheless, the proliferation of data being captured by
today’s systems and the continuous improvement in data analytics and AI are leading
the industry toward a situation where risk-informed maintenance will, in
future, become the norm – improving operational efficiency as well as the
reliability of mission-critical infrastructure.
However, what’s crucial to remember is that
expertise is essential: no matter how technology is used to reduce downtime
and mitigate risk, it will never remove the need for technically competent
maintenance staff to effect changes and upgrades, especially when, despite
all precautions, a critical malfunction occurs.
As technologies continue to evolve, the
human element becomes ever more critical, and when transitioning from preventative
to remedial modes, operators will always depend on their ability to collaborate with expert
technical personnel. That much remains unconditional.
by Wendy Torell, Senior Research Analyst, Schneider Electric Data Centre Science Centre