When Reliability Goes Wrong in Cloud Networks
Summary
Cloud network reliability has become a catch-all for four related concerns: availability, resiliency, durability, and security. In this post, we’ll discuss why NetOps plays an integral role in delivering on the promise of reliability.
In the first part of this series, I introduced network reliability as a concept foundational to success for IT and business operations. Reliability has become a catch-all for four related concerns: availability, resiliency, durability, and security. I also pointed out that because of necessary factors like redundancy, the pursuit of reliability will inevitably mean making compromises that affect a network’s cost or performance.
As a cloud solutions architect with Kentik, I have the opportunity to work with some of the planet’s most cutting-edge, massively scaled networks. This has given me a front-row seat to innovative design and implementation strategies and, most importantly, to well-intentioned solutions with unintended consequences.
In this article, I want to underscore why NetOps has an integral role (and more responsibility) in delivering on the promise of reliability and highlight a few examples of how engineering for reliability can make networks less reliable.
Reliability is a massive burden for network operators
While DevOps teams and their SREs focus on the reliability of the application environment, an enterprise’s network concerns often extend well beyond the uptime of a suite of customer-facing web and mobile apps. For many enterprises, applications represent only a portion of a much larger reliability mandate that spans offices, robotics, hardware, and IoT, along with the complex networking, data, and observability infrastructure required to support it all.
For NetOps, this mandate spans a wide range of tasks: monitoring and identifying top talkers, careful capacity planning, tracking resource availability and consumption, path analysis, security and infrastructure monitoring and management, and more. As I mentioned, each of these responsibilities falls under one of the four closely related reliability verticals: availability, resiliency, durability, and security.
I want to look at each of these individually and examine some of the pitfall scenarios I’ve seen as teams attempt to bolster the reliability of their networks.
Availability
The core mission of availability is uptime, and the brute-force way to achieve it is redundancy, typically in the form of horizontal scaling, so that if a network component is compromised, another instance is ready to take over.
Under this model, network topology becomes highly variable, creating complexity that can mask root causes and turn proactive availability configurations into a brittle point of the network. A single misconfiguration, such as an incorrect firewall rule or a misrouted connection, can trigger a cascade of failures. For instance, a firewall configuration written before a redundant router or application instance was added may not account for it, blocking traffic critical to maintaining uptime.
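To make that failure mode concrete, here is a minimal sketch (with purely hypothetical addresses and data structures, not tied to any vendor’s API) that checks whether firewall allow rules still cover every backend instance after horizontal scaling adds a replica. If the rules were scoped to the original addresses, the new instance is silently unreachable even though the redundancy exists on paper.

```python
# Hypothetical sketch: verify that firewall allow rules cover every backend
# instance, including replicas added later for redundancy. The rule format
# is illustrative, not a real vendor schema.

import ipaddress

# Allow rules as they might exist today, scoped to the original instances only.
allow_rules = [
    {"port": 443, "destinations": ["10.0.1.10/32", "10.0.1.11/32"]},
]

# Backend pool after horizontal scaling added a redundant instance in 10.0.2.0/24.
backend_instances = ["10.0.1.10", "10.0.1.11", "10.0.2.15"]


def uncovered_backends(rules, backends, port):
    """Return backends that no allow rule on the given port covers."""
    networks = [
        ipaddress.ip_network(dst)
        for rule in rules
        if rule["port"] == port
        for dst in rule["destinations"]
    ]
    return [
        addr for addr in backends
        if not any(ipaddress.ip_address(addr) in net for net in networks)
    ]


if __name__ == "__main__":
    missing = uncovered_backends(allow_rules, backend_instances, 443)
    if missing:
        # The redundancy exists, but the firewall silently blocks it.
        print(f"Redundant instances not covered by firewall rules: {missing}")
```

A check like this is trivial in isolation; the point is that redundancy only helps if every supporting configuration is updated in lockstep with it.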
Resiliency
One resiliency strategy for NetOps teams working in the cloud to consider is multizonal deployment. In this case, a disruption of the internet or your cloud provider in one zone or region affects only a portion of traffic, and the unaffected zones provide safe destinations to which that traffic can be re-routed. Status pages are a great way to communicate with customers or users about outages or shifts in deployment regions (you do have a status page, don’t you?).
Here are a few examples of potential unintended side effects of relying on multizonal infrastructure for resiliency:
Split-brain scenario: In a multizonal deployment with redundant components, such as load balancers or routers, a split-brain scenario can occur. This happens when communication between the zones is disrupted, leading to the independent operation of each zone. In this situation, traffic may be routed to both zones simultaneously, causing inconsistencies in data processing and potentially leading to data corruption or other issues.
Failover loops: When implementing failover mechanisms across multiple zones, there is a risk of creating a failover loop. This occurs when zones repeatedly detect failures in each other and trigger failover actions back and forth. As a result, traffic continuously switches between zones, causing unnecessary network congestion and degrading the overall performance and stability of the system (a minimal dampening sketch follows this list).
Out-of-sync state: Maintaining a consistent state across all zones can be challenging in a network with multizonal deployments. In some cases, due to network latency or synchronization delays, different zones may have slightly different versions of the data or application state. This can lead to unexpected traffic patterns as zones exchange data or attempt to reconcile inconsistent states, potentially causing increased network traffic and delays.
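A common way to dampen failover loops is a hold-down timer: a zone only becomes eligible to receive failed-over traffic after it has stayed healthy for a full hold-down window, so two zones cannot keep handing traffic back and forth. The sketch below is a minimal, hypothetical illustration of that idea, not a production controller.

```python
# Minimal, hypothetical sketch of failover dampening between two zones.
# The active zone is only switched if the standby has stayed healthy for a
# full hold-down window; otherwise we hold and alert rather than start a loop.

import time

HOLD_DOWN_SECONDS = 300  # illustrative value; tune to your environment


class ZoneFailover:
    def __init__(self, zones):
        self.active, self.standby = zones
        # Track when each zone last transitioned to healthy.
        self.healthy_since = {z: time.monotonic() for z in zones}

    def report_health(self, zone, healthy):
        """Record a health-check result for one zone."""
        if not healthy:
            self.healthy_since[zone] = None  # mark unhealthy
            if zone == self.active:
                self._try_failover()
        elif self.healthy_since[zone] is None:
            self.healthy_since[zone] = time.monotonic()  # hold-down clock starts

    def _try_failover(self):
        since = self.healthy_since[self.standby]
        stable = since is not None and time.monotonic() - since >= HOLD_DOWN_SECONDS
        if stable:
            self.active, self.standby = self.standby, self.active
        else:
            print("standby not stable for the full hold-down window; alerting instead")
```

The trade-off is that hold-down delays legitimate failover, which is exactly the kind of compromise between reliability and performance described in the first part of this series.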
Durability
In the context of cloud networks, durability refers to the ability of the network to retain and protect data over an extended period of time, even in the face of hardware failures, software bugs, or other disruptions like attacks on the network. While much of the data-specific work, such as replication, versioning, or use of distributed storage services like Amazon S3, falls under the purview of data engineers, it is up to NetOps to monitor and manage this infrastructure’s connections to the network.
This is no small feat and can lead to significant overhead and resource consumption. While cloud providers invest heavily in ensuring their data services are highly durable (Amazon S3, for example, is designed for eleven 9s, or 99.999999999%, of object durability over a given year), this represents only one portion of the durability story. As data moves to and from highly durable storage services, NetOps must ensure it remains intact and secure. Replication, analysis, and data transfer all present opportunities for security threats, data integrity loss, and intense bandwidth and memory consumption.
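That storage-side durability figure says nothing about data in flight, so integrity checks around transfers are one small, concrete piece of the picture. Below is a minimal sketch using only the Python standard library; the file paths are hypothetical, and the same idea applies to object checksums or ETags exposed by a storage service.

```python
# Minimal sketch: verify that data arrived intact after a transfer by
# comparing SHA-256 digests computed at the source and the destination.

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large objects don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source: Path, destination: Path) -> bool:
    """Return True only if the destination copy matches the source bit for bit."""
    return sha256_of(source) == sha256_of(destination)


if __name__ == "__main__":
    ok = verify_transfer(Path("export/source.dat"), Path("replica/source.dat"))
    print("transfer verified" if ok else "checksum mismatch: re-transfer or alert")
```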
In distributed, service-oriented development environments, it is not uncommon for these efforts to happen largely outside the attention of NetOps until a problem arises. The teams and development efforts may be distributed, but the underlying network infrastructure is often shared, and that is where the trouble starts. The intense resource demands of durability efforts can create contention on that shared infrastructure, and the resulting latency, retries, and cascading failures end up overconsuming the network underneath.
I wrote an article a while ago addressing latency. Contention like this is one of latency’s most common root causes, and the resulting slowdowns and retries can take out many services and networking devices.
Security
There are several reliability strategies that, if not properly and carefully accounted for, can increase a network’s threat surface area. Here are two examples:
- Redundancy and high availability: While redundant components and geographic distribution of resources enhance reliability, multiple network entry points and load balancing mechanisms can be exploited to launch DDoS attacks, bypass security controls, or otherwise overwhelm resources.
- Elasticity and scalability: As new instances, containers, VMs, or other network resources are dynamically added to the network, improper configurations and monitoring can create vulnerabilities for attackers to exploit (see the sketch after this list).
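As a hedged illustration of that second point, the sketch below audits ingress rules for the kind of wide-open exposure that tends to slip in when resources are created dynamically. The rule format is purely hypothetical; in practice you would pull this data from your cloud provider’s API or configuration management system.

```python
# Hypothetical sketch: flag dynamically created resources whose ingress rules
# expose sensitive ports to the whole internet. The rule format is
# illustrative, not a real provider schema.

SENSITIVE_PORTS = {22, 3389, 3306}  # SSH, RDP, MySQL as examples

ingress_rules = [
    {"resource": "web-asg-instance-7", "port": 443, "source": "0.0.0.0/0"},
    {"resource": "web-asg-instance-7", "port": 22, "source": "0.0.0.0/0"},
    {"resource": "batch-vm-12", "port": 3306, "source": "10.0.0.0/8"},
]


def risky_rules(rules):
    """Return rules that expose a sensitive port to the whole internet."""
    return [
        r for r in rules
        if r["port"] in SENSITIVE_PORTS and r["source"] == "0.0.0.0/0"
    ]


if __name__ == "__main__":
    for rule in risky_rules(ingress_rules):
        print(f"{rule['resource']}: port {rule['port']} open to {rule['source']}")
```

Running a check like this on every scaling event, rather than on a weekly audit cycle, is what keeps elasticity from quietly widening the threat surface.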
Conclusion
Reliability is a cornerstone of delivering top-tier IT services and customer experiences. For NetOps, especially those responsible for the highly scaled networks found in enterprises, service providers, and telecom companies, the pursuit of reliability via availability, resiliency, durability, and security measures can introduce its own challenges. Handling these challenges can come down to rigorous planning, careful monitoring, and dynamic systems, but as any network specialist knows, there are always unknown unknowns.
In this series’ next and final installment, I will examine how network observability, the most comprehensive and robust path to addressing and avoiding these challenges, offers an opportunity to engage with these unknown unknowns.