More service providers are moving cloud-native architectures from the lab to production networks. In this new frontier, they are encountering performance issues like network function failures that bring services crashing down. These failures don’t always have straightforward or well-known resolutions. And because they’re not identified in pre-production testing, they impact the customer experience in production network scenarios.
In recent discussions, customers have asked Spirent how to identify these issues and gain the insights needed to fix them proactively, before they reach the production network. The answer lies in a new test paradigm for the pre-production lab.
What’s so different about cloud-native? In a word, dependency
Software-based, cloud-native environments are a dramatic change from traditional monolithic networks.
Cloud-native network functions (CNFs) comprise multiple microservices that use many pods to provide specific functions. The pods are deployed across numerous nodes running on clouds that dynamically handle different workloads. As a result, a service may depend on hundreds of connections and thousands of transactions, any one of which can take too long to respond or simply fail. This high degree of dependency raises the probability of failures and makes problem identification and resolution complex and time-consuming.
Workloads have also become dynamic, with each networking vendor providing frequent updates on individual timetables.
As service providers deploy cloud-native network functions, this new level of complexity and inter-dependency is directly impacting the reliability of production networks.
The industry is just starting to realize it must rethink test strategies to identify and resolve CNF issues during pre-production, not in the production network, where the stakes are much higher: costly outages and service interruptions degrade customer Quality of Experience. After all, cloud-native networks will only be able to scale once they are comprehensively tested in pre-production.
Rethinking testing in a CNF world
Cloud-native characteristics make pre-production CNF testing essential, but the old way of testing single-vendor, integrated networks is not up to the challenge because:
It’s harder to simulate reality. The cloud used in the lab to test CNFs is more stable, well-understood, and well-behaved than the cloud or clouds the CNFs will utilize in the production network.
Performance is more vulnerable. The hundreds of pods and nodes a production CNF might utilize must communicate within millisecond latencies to avoid timeouts. One delayed link in the chain can cause failures that cascade rapidly and ultimately result in 5G service failures.
The unexpected will most likely occur. When SLAs depend on disaggregated, distributed microservices interacting in just the right way at just the right time, there is a high probability that one or more links between CNF pods will break or time out.
Let’s dive deeper into CNF testing.
Probability metrics underlie pre-production CNF resiliency testing
The dynamic nature of cloud-native means a given user activity may not lead to the same performance result or failure every time since it may take different paths on different infrastructures. Therefore, testing must focus on failure probabilities, causes, and the impact of failure on each scenario. Comprehensive resiliency testing must be performed under real-world scenarios with intentional fault insertions, not just ideal conditions.
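As a rough illustration of what fault-insertion testing looks like, the sketch below runs a hypothetical CNF test scenario repeatedly under each injected fault and estimates a failure probability per scenario. The fault list, the `run_scenario` stub, and its failure model are all placeholders for a real harness (which would configure the cloud fabric and drive simulated 5G sessions), not anything described by Spirent:

```python
import random

# Hypothetical fault scenarios to inject during each test run.
FAULTS = [
    {"name": "baseline",         "packet_loss_pct": 0.0, "added_latency_ms": 0},
    {"name": "loss_1pct",        "packet_loss_pct": 1.0, "added_latency_ms": 0},
    {"name": "latency_50ms",     "packet_loss_pct": 0.0, "added_latency_ms": 50},
    {"name": "loss_and_latency", "packet_loss_pct": 1.0, "added_latency_ms": 50},
]

def run_scenario(fault, rng):
    """Placeholder for one CNF test run under an injected fault.

    A real harness would impair the cloud fabric and exercise the CNF;
    here the outcome is modeled with an illustrative probability so the
    sketch is runnable. Returns True if the run passed.
    """
    failure_prob = (0.01
                    + 0.05 * fault["packet_loss_pct"]
                    + 0.001 * fault["added_latency_ms"])
    return rng.random() >= failure_prob

def estimate_failure_probability(fault, trials=200, seed=42):
    """Repeat the scenario and estimate its failure probability."""
    rng = random.Random(seed)
    failures = sum(not run_scenario(fault, rng) for _ in range(trials))
    return failures / trials

for fault in FAULTS:
    p = estimate_failure_probability(fault)
    print(f"{fault['name']:>17}: estimated failure probability {p:.3f}")
```

The point of the sketch is the shape of the method: the same scenario is run many times under each fault condition, because a dynamic cloud does not fail the same way on every run.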
CNFs implicitly expect certain packet loss, latency, and CPU and storage response times—and the cloud normally provides them. But because the expectations are implicit, the cloud does not take actions to ensure the specific CNF’s needs are met. Service providers are already seeing as many as a dozen CNF production failures per quarter that they didn’t predict during pre-production testing.
Until now, there hasn’t been a clear understanding of the points where performance degrades enough to impact service. Those statistical breaking points need to be identified for each CNF, for packet loss, latency, CPU, and storage.
So, for example, if you measure cloud fabric packet loss at a particular location as a function of 5G active sessions, you will see the point at which performance degrades (where the blue line in the figure starts to fall). Depending on the CNF, this performance drop may or may not be tolerated.
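Finding that statistical breaking point can be sketched as a simple scan over lab measurements: locate the first load level where the measured metric falls more than a tolerance below its baseline. The sample data and the 1% tolerance below are made-up placeholders for what a lab harness would record:

```python
# Illustrative measurements: success rate (%) of a CNF operation as the
# number of active 5G sessions grows. Values are hypothetical.
measurements = [
    (1_000, 99.9), (2_000, 99.9), (5_000, 99.8),
    (10_000, 99.7), (20_000, 97.5), (30_000, 91.0),
]

def find_breaking_point(samples, max_drop_pct=1.0):
    """Return the first load level where performance falls more than
    max_drop_pct below the baseline (the first sample), else None."""
    baseline = samples[0][1]
    for sessions, success_rate in samples:
        if baseline - success_rate > max_drop_pct:
            return sessions
    return None  # no breaking point within the tested range

print(find_breaking_point(measurements))  # -> 20000 for the data above
```

A CNF with a looser tolerance (say, a 10% allowed drop) would show no breaking point in the same range, which is exactly why the threshold must be identified per CNF.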
Pre-production measurements such as these determine Key Failure Indicators (KFIs) for each CNF and provide important performance insights in the lab.
This CNF resiliency testing data is incredibly powerful, and it cannot be obtained with today’s lab test methods.
CNF resiliency testing provides value in production networks, too
Pre-production measurements provide important performance insights in the lab—and also for the production network. By providing operations teams with the root cause probabilities of outages and their impact on 5G services, these measurements become essential factors for production network monitoring and troubleshooting. They enable resolution prioritization based on subscriber impact, as well as rapid troubleshooting and remediation.
As an example, the table below illustrates the Key Failure Indicator metrics for cloud infrastructure packet loss compared to no packet loss, for registration, connect time, and HTTP traffic network functions. The metrics circled in red show where cloud packet loss degrades performance significantly. Such data enable efficient monitoring and faster failure resolution in the production network.
The data also enable rapid root cause analysis when issues arise in production, helping operations to quickly identify and focus on the problematic area instead of doing painful and time-consuming troubleshooting on a conference call with 25 people from operations and various vendor teams.
By measuring performance for each product release, operators can quickly identify the specific release that’s degrading performance and provide the relevant vendor with precise data to facilitate rapid resolution.
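Pinpointing the offending release reduces to a scan over per-release KFI history: find the first release whose metric worsened beyond a threshold relative to its predecessor. The release names, values, and 20% threshold here are made-up placeholders:

```python
# Hypothetical per-release KFI history for one CNF (connect time in ms,
# lower is better).
history = [
    ("v1.2.0", 45), ("v1.2.1", 46), ("v1.3.0", 44),
    ("v1.3.1", 47), ("v1.4.0", 112), ("v1.4.1", 115),
]

def first_regressing_release(samples, threshold=0.20):
    """Return the first release whose metric worsened by more than
    `threshold` (relative) versus the previous release, else None."""
    for (_, prev), (release, value) in zip(samples, samples[1:]):
        if (value - prev) / prev > threshold:
            return release
    return None

print(first_regressing_release(history))  # -> v1.4.0 for this data
```

Handing a vendor the specific release and the measured before/after values is far more actionable than reporting that "connect time got worse at some point."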
The benefits of CNF resiliency testing
Resiliency testing of cloud-native environments may be more complex than traditional lab testing, but it is worth the journey. By understanding exactly what each CNF needs from the cloud and how each CNF is vulnerable, many problems can be avoided before they become an issue in production. As a result, new high-quality 5G services can be moved quickly into production and remain reliable even in challenging cloud conditions. Production network issues will be reduced and when they do happen, they can be resolved quickly.
CNF resiliency testing makes business sense as well, by harnessing cloud-native efficiencies and having the agility to reduce operating costs. Infrastructure investments can be targeted on components that will drive the biggest improvements to server performance and efficiency. And more stringent and lucrative SLAs can be offered and delivered.
At Spirent, we help communications service providers understand the impact of cloud-native on pre-production testing and introduce CNF resiliency testing. We’ve deployed our own cloud-native 5G core based on open source and have demonstrated the value of CNF resiliency testing.
Learn more about CNF resiliency testing in our article in Everything RF or delve into the details in our .