By Peter Krüssel, managing partner - global chapter head communication industries, Detecon
Telecommunications networks are among the critical infrastructures of a society. Resilience and security from failures ensure high stability, especially during times of crisis. Audits and professional safety concepts are some of the key pillars to assure this stability.
There is no end to the reports about serious crises: the financial crisis, the euro crisis, the migration crisis, the climate crisis, the COVID-19 crisis, the globalization crisis, and now the war in Ukraine. The vehemence of this uninterrupted series of profound disruptions has given rise to fundamental questions about how we do business and live our lives, what political actions we take, how we can maintain our prosperity and competitiveness, and how societies can become more resilient and self-sufficient.
The global COVID-19 crisis strikingly revealed how dependent individual nations and economic sectors are on globally dispersed supply chains and how quickly previously unknown shortages can arise and affect entire industries.
All industries are affected by these developments, the telecommunications industry and network operators in particular. The COVID-19 crisis proved that well-developed network infrastructures, which enable digital collaboration when working from home, are the key prerequisite for business and government to continue functioning largely trouble-free even during acute crises.
Telecommunications — the backbone of a modern society
Telecommunications networks are among the critical infrastructures of a society. They serve as the backbone of a modern, digitalized society. Their uninterrupted performance capability at a high level is a prerequisite for the functioning of the economy, the government, and our private lives.
Telecommunications companies are also aware of their social responsibility. In Germany, for example, they quickly initiated numerous support measures in response to the flow of refugees from Ukraine. The list of actions includes direct donations from network operators, job offers for Ukrainian refugees, the provision of an efficient Wi-Fi infrastructure free of charge to refugee camps, offers of free phone calls and text messages to and from Ukraine, free calls to Ukraine from public phone booths, the waiving of international roaming charges, the distribution of free prepaid SIM cards to refugees, the activation of a text-based donation function for aid organizations, the short-term expansion of capacities in mobile networks along the Ukrainian border, and even the provision of Ukrainian TV channels free of charge in IPTV packages.
Network resilience audits as protection from crisis scenarios
Looking beyond the acute situation in Ukraine and the current and possible future requirements from government authorities, telecommunications providers are called upon to make long-term and systematic preparations for dealing with threat scenarios, to secure the resilience of their infrastructures, to shorten their response times, and to review their supply chains. The possible threat scenarios are many and complex. They range from earthquakes and extreme weather events such as floods and storms (hurricane season in the Caribbean, monsoon-related flooding in Asia) to armed conflicts, criminal cyberattacks, industrial policy activities (trade restrictions, embargoes of certain vendors), supply chain disruptions (both as results of other factors and as causes of issues in their own right), accidents of all kinds (e.g., the blockage of the Suez Canal by the “Ever Given” in March 2021, which also affected the supply situation for telcos), and acts of sabotage prompted by a wide range of motives (for instance, 5G towers in several countries have been damaged by conspiracy theorists who claimed that the new network standard was responsible for the spread of the COVID-19 virus). In view of these risks, the assessment of the threats and the catalog of countermeasures must be reviewed and updated regularly.
The focus of these actions is on securing the resilience of telecommunications networks. The goal of network resilience audits is to identify and eliminate weaknesses, above all the so-called “single points of failure”, and thereby improve the network's protection against failure. Two areas must be considered first and foremost: equipment resilience and network resilience. Network operators bear full responsibility for network resilience, while equipment resilience is designed into active network components by their manufacturers.
Equipment resilience refers to the inherent technical redundancy concept behind individual network elements procured from network equipment vendors. Network elements from reputable manufacturers usually come with internal redundancy features such as redundant power supplies or processors that ensure uninterrupted operation in the event of partial failures. Moreover, many active components today meet standards that secure the operational integrity of the equipment even when subjected to shocks and vibrations, for example those that occur during more powerful earthquakes. Reputable manufacturers already offer largely fail-safe equipment, as their market interests alone give them sufficient incentive to do so. The selection of the more resilient equipment is handled by the network operator, who must weigh decision criteria (economic aspects, quality, reliability, performance, etc.) that are partly in conflict with one another. A special challenge arises in constellations in which the vendor is not only a supplier of components to the network operator, but also assumes responsibility for the design of the networks, acts as a supplier of turnkey projects, or even serves as a managed service provider.
Network resilience, in turn, concerns how the procured network elements are connected and interact with one another to ensure that completely redundant end customer services can be provided with a high level of security from failure. The concept typically distinguishes among three areas: network topology, network capacity, and functions to increase security from failure.
The underlying network topology plays a decisive role in the enhancement of protection from network failures, especially in the aggregation network and backbone network areas that connect critical network elements with one another. Every network area must be planned with a degree of redundancy that prevents major service failures from the loss of single network elements or even entire sites. Key elements include redundancy of network elements, geo-redundancy as the duplication of critical network elements at various sites, and connection of the network elements via redundant transmission networks to prevent any “single points of failure” from escalating into extensive network failures. In addition to the advisable redundancies, the selection of the site itself and the quality of the installations (e.g., plug fuses, labeling, grounding of the components, etc.) are of immense importance and a basic prerequisite for security from failure.
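The search for “single points of failure” in a topology can be illustrated with a small sketch. The following Python example is a simplified illustration and not an audit tool: it assumes the network can be modeled as an undirected graph of sites or network elements that is connected to begin with, and the node names (core-a, agg-1, and so on) are hypothetical. It removes each node in turn and checks whether the remaining nodes can still reach one another.

```python
from collections import deque


def reachable(adjacency, start, removed):
    """Return all nodes reachable from `start` while `removed` is out of service."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links):
    """List nodes whose loss splits the network (assumes it is connected to begin with)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    spofs = []
    for candidate in sorted(adjacency):
        remaining = [n for n in adjacency if n != candidate]
        if not remaining:
            continue
        # If some remaining node cannot be reached, the candidate is a SPOF.
        if len(reachable(adjacency, remaining[0], candidate)) < len(remaining):
            spofs.append(candidate)
    return spofs


# Hypothetical topology: a four-node ring plus one access site with a single uplink.
ring = [("core-a", "core-b"), ("core-b", "agg-2"),
        ("agg-2", "agg-1"), ("agg-1", "core-a")]
with_spur = ring + [("agg-1", "access-1")]

print(single_points_of_failure(ring))       # []
print(single_points_of_failure(with_spur))  # ['agg-1']
```

In this example the ring itself tolerates the loss of any one node, whereas the access site attached through a single uplink makes agg-1 a single point of failure. A real audit would also have to consider links, shared ducts, power supply, and site-level dependencies, which this sketch leaves out.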
Networks with redundant network topology must also provide sufficient reserve capacities that are able to assume the transport and handling (e.g., safety inspections) of the network load if and when individual network elements fail. As a rule, two scenarios are distinguished during capacity planning: sufficient reserve capacity to handle the failure of a single network element, and sufficient reserve capacity to handle the failure of a complete site, i.e., of all the network elements located at that site (geo-redundancy).
Even redundant networks with sufficient reserve capacity can suffer significant network failures if no automatic network functions for the prevention of failures are in place. Such functions are therefore the third focal point of network resilience audits. If, for example, one redundant network element fails, its redundant counterpart should seamlessly and automatically take over the function of the failed element. If connections among network elements fail, the traffic must be diverted automatically. As with the analysis of reserve capacities, the analysis of automatic network functions considers two scenarios: preventing a general failure if one network element fails, and preventing a general failure if all network elements at one site fail. In the absence of such automatic functions, the network must be reconfigured manually to restore end customer services if individual network elements fail. Manual reconfiguration takes time, however, and prolongs the outage.
The described concepts of network resilience should be integrated into a holistic security concept, business continuity management (BCM), that identifies and analyzes the potential threats facing a company, assesses their impact on the company, and proactively defines countermeasures. For network operators, the focal points of BCM are technical recovery capability and technical support processes. BCM and emergency management are closely related. In broad terms, BCM can be viewed as the totality of the planning activities an organization conducts in advance of incidents in order to heighten protection from the failures such incidents cause. Emergency management encompasses the activities that are carried out after the occurrence of an incident, and in this sense, it is an integral component of BCM.
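The two capacity planning scenarios can be expressed as a simple check. The sketch below is an illustration under strong assumptions: it treats a group of network elements as one pool that shares the load freely, it checks the worst case (loss of the largest element or the largest site), and the element names, site names, and figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    site: str
    capacity_gbps: float


def survives_element_failure(elements, peak_load_gbps):
    """True if the pool carries the peak load after losing its largest element."""
    total = sum(e.capacity_gbps for e in elements)
    largest = max(e.capacity_gbps for e in elements)
    return total - largest >= peak_load_gbps


def survives_site_failure(elements, peak_load_gbps):
    """True if the pool carries the peak load after losing its largest site."""
    total = sum(e.capacity_gbps for e in elements)
    per_site = {}
    for e in elements:
        per_site[e.site] = per_site.get(e.site, 0.0) + e.capacity_gbps
    return total - max(per_site.values()) >= peak_load_gbps


# Hypothetical pool: four gateways of 100 Gbit/s each, two per site.
gateways = [
    Element("gw-1", "site-north", 100.0),
    Element("gw-2", "site-north", 100.0),
    Element("gw-3", "site-south", 100.0),
    Element("gw-4", "site-south", 100.0),
]
peak_load = 250.0  # assumed peak load in Gbit/s

print(survives_element_failure(gateways, peak_load))  # True  (300 >= 250)
print(survives_site_failure(gateways, peak_load))     # False (200 < 250)
```

The example pool has enough reserve for the failure of one element but not for the failure of a whole site, which is exactly the kind of gap the second scenario is meant to reveal.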
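Automatic diversion of traffic can be pictured as recomputing a path around the failed connection. The following sketch is only a conceptual illustration: in a live network this task is performed by the routing and protection mechanisms of the network elements themselves, not by a script. The four-node ring and the node names A to D are hypothetical.

```python
from collections import deque


def find_path(links, source, target, failed_links=()):
    """Return a shortest hop-count path that avoids failed links, or None."""
    failed = {frozenset(link) for link in failed_links}
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue  # skip connections that are currently down
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back from the target to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical ring of four nodes.
ring_links = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

print(find_path(ring_links, "A", "C"))                              # ['A', 'B', 'C']
print(find_path(ring_links, "A", "C", [("A", "B")]))                # ['A', 'D', 'C']
print(find_path(ring_links, "A", "C", [("A", "B"), ("D", "A")]))    # None
```

With one failed link the traffic is diverted the other way around the ring; with two failed links at node A no path remains, and only repair or manual intervention can restore the service.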
While equipment and network resilience concentrates primarily on the prevention of incidents, attention in the area of technical recovery capability tends to be devoted more to the availability of equipment and functions for restoring end customer services after the occurrence of incidents. The technical recovery capability is the responsibility of both the producers of network elements and of the network operators. Standard components of technical recovery capability include availability of offline standby systems for emergency restoration; availability of spare parts, backup, and recovery systems; and backup network monitoring centers for emergency network monitoring. Technical support processes include features such as backup and recovery procedures and guidelines, spare parts management, and support processes for producers of network elements. The specifics of a concrete crisis must always be taken into account when defining such processes. For instance, a hurricane triggering a crisis may also cause long delays in the delivery of replacement components from other countries because food shipments will have clear priority during customs clearance.
The described concepts for network resilience and business continuity management help to prevent possible loss of revenue, costs, fines, claims for damages from customers and regulators, and loss of reputation. In view of increasing volatilities, uncertainties, and dependencies, professional management of these risks is mandatory.
Finally, it should be noted that there may be crisis situations of massive, disruptive, structural, and lasting significance that a single carrier can control only to a limited extent within the scope of the processes it has implemented, both as preventive measures and in acute situations. Examples include environmental disasters that affect an entire country or larger regions and all critical infrastructures such as energy, water, and telecommunications. In such cases, concerted actions are required in addition to the various individual emergency plans, and orchestration by government organizations may be required. In the short term, an analysis of the damage would be needed to determine priorities: which services can and should be provided at all, in which regions, to which target groups, at what level of quality, and based on what technology. In a disaster of this magnitude, the focus will be less on the functionality of the network of any specific network operator and more on the communications infrastructure in the country or region as a whole. In such a situation, concepts that rely on the cooperation of all involved parties (network operators, satellite network operators, equipment manufacturers, energy suppliers, regulators, authorities, etc.) and concepts that explore short-term, innovative technical possibilities (ad hoc networks, Cells on Wheels, satellite networks, increased use of low-frequency bands, openness of all networks to all users, NB-IoT or LoRaWAN for simple, long-range communication, etc.) will be in demand.