Optimizing Cloud Networks: The Strategic Approach to Eliminating Suboptimal Routing
Summary
In this post, we look at optimizing cloud network routing to avoid suboptimal paths that increase latency, round-trip times, or costs. To mitigate these problems, we can adjust routing policies, distribute resources strategically, use AWS Direct Connect, and leverage observability tools to monitor performance and costs, enabling informed decisions that balance performance with budget.
Cloud networking has become a foundational element of modern IT infrastructure, so the need for efficient, reliable, optimal, and cost-effective routing is more critical than ever. The thing is, words like “efficient” and “optimal” can mean very different things depending on who you talk to.
Routing algorithms usually try to figure out the best path according to some type of cost. But what if what’s important to us differs from what our favorite routing protocol uses to determine the best path?
Suboptimal routing, for example, is often defined as when packets take longer or more expensive paths through the network than necessary. This can have various implications, such as increased latency, higher round-trip times, and inflated transit costs, especially when factoring in services like AWS Transit Gateway. But what’s suboptimal for one organization may be just fine for another.
What does suboptimal mean to you?
Suboptimal routing occurs when network traffic takes a less-than-ideal path through the network, leading to inefficiencies. This can manifest in several ways.
Suboptimal could refer to the number of hops it takes between a source and destination. Many hops implies many devices, each of which needs to process each packet. This could lead to an increase in latency, round-trip time, and so on, all of which can adversely affect application performance.
However, suboptimal could also refer to a path with a very small number of hops but slow links. The shortest physical path between source and destination may seem best, but if its links are all T1s, sending traffic over a longer distance with more hops and high-speed links may be much better.
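To make the hops-versus-bandwidth trade-off concrete, here is a minimal sketch that runs the same shortest-path search twice with different link weights. The topology, node names, and the 1000 Mbps reference bandwidth are made up for illustration. The bandwidth-based weight is only similar in spirit to how some routing protocols derive cost from link speed, not a model of any specific protocol.

```python
import heapq

def shortest_path(graph, src, dst, weight):
    """Dijkstra's algorithm; `weight` turns a link's attributes into a non-negative cost."""
    queue = [(0.0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, attrs in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight(attrs), neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical topology: A-B-D uses T1 links (1.544 Mbps), A-C-E-D uses 1 Gbps links.
graph = {
    "A": {"B": {"mbps": 1.544}, "C": {"mbps": 1000}},
    "B": {"D": {"mbps": 1.544}},
    "C": {"E": {"mbps": 1000}},
    "E": {"D": {"mbps": 1000}},
}

def hop_count(attrs):
    return 1  # every link costs the same, so the fewest hops wins

def bandwidth_cost(attrs):
    return 1000 / attrs["mbps"]  # assumed reference bandwidth / link bandwidth

by_hops = shortest_path(graph, "A", "D", hop_count)
by_bandwidth = shortest_path(graph, "A", "D", bandwidth_cost)
print(by_hops[1])       # ['A', 'B', 'D']
print(by_bandwidth[1])  # ['A', 'C', 'E', 'D']
```

With hop count as the metric, the two-hop T1 path wins. With a bandwidth-aware metric, the three-hop gigabit path wins. Neither answer is wrong; they optimize for different things.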
These are routing choices network engineers make all the time. But what about cost optimization?
In cloud networking, suboptimal routing could refer to a path that causes traffic between a source and destination to cross cloud regions or availability zones unnecessarily. This may happen inadvertently because it’s the best path in terms of latency, but in terms of cost, it could be very (and unnecessarily) expensive.
A balancing act
Sometimes we want to engineer traffic using a different calculation than the one routing algorithms use. Instead of path cost in terms of latency, delay, hops, and so on, consider “path price.”
Ultimately, the traffic needs to get where we want it to go, but there are many interesting ways to route traffic in the cloud. For instance, you might want to create a Transit Gateway and attach it to all your VPCs and use that to route all your traffic between them. This would work great, but then you get the AWS bill….
For most organizations, it really is more of a balancing act of finding the sweet spot of acceptable application performance and staying within budget. There are always exceptions, but most organizations can tolerate an extra 10 milliseconds of latency if it saves a boatload of money.
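One way to think about that sweet spot is as a constrained choice: pick the cheapest path whose latency still fits the application's budget. The sketch below assumes we already have candidate paths with a measured latency and a per-GB price. The path names, latencies, and prices are invented for illustration and are not real AWS figures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PathOption:
    name: str
    latency_ms: float
    usd_per_gb: float  # hypothetical "path price", not a real AWS rate

def cheapest_within_budget(options, max_latency_ms):
    """Return the lowest-priced path that meets the latency budget, or None if none does."""
    eligible = [o for o in options if o.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Price decides first; latency only breaks ties between equally priced paths.
    return min(eligible, key=lambda o: (o.usd_per_gb, o.latency_ms))

# Made-up candidates for a single source/destination pair.
options = [
    PathOption("low-latency-premium", latency_ms=5.0, usd_per_gb=0.10),
    PathOption("regional-balanced", latency_ms=14.0, usd_per_gb=0.02),
    PathOption("long-haul-bargain", latency_ms=60.0, usd_per_gb=0.01),
]

choice = cheapest_within_budget(options, max_latency_ms=15.0)
print(choice.name)  # regional-balanced
```

Tightening the budget to a few milliseconds leaves no eligible path at all, which is a useful signal in itself: the requirement and the available routes do not match, and something has to give.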
AWS charges for data that crosses regional boundaries, passes through Transit Gateways, or crosses Availability Zones (a common occurrence in container networking), and all of these charges can quickly escalate costs. But maybe those costs are worth it because the end-to-end latency is much lower for a mission-critical application.
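A rough way to see where the money goes is to total transfer volume by the kind of boundary each flow crosses. The sketch below uses deliberately unrealistic placeholder rates and invented flow names. Substitute the current rates from your own bill or the provider's pricing pages before drawing any conclusions.

```python
from collections import defaultdict

def cost_by_boundary(flows, usd_per_gb):
    """Total the charge for each boundary type; a flow is billed once per boundary it crosses."""
    totals = defaultdict(float)
    for flow in flows:
        for boundary in flow["crosses"]:
            totals[boundary] += flow["gb"] * usd_per_gb[boundary]
    return dict(totals)

# Placeholder rates for illustration only. These are NOT real AWS prices.
placeholder_rates = {"inter_az": 0.5, "inter_region": 1.0, "transit_gateway": 0.25}

# Hypothetical monthly flows and the boundaries each one crosses.
flows = [
    {"name": "web->cache", "gb": 100, "crosses": ["inter_az"]},
    {"name": "app->db", "gb": 40, "crosses": ["inter_az", "transit_gateway"]},
    {"name": "replication", "gb": 10, "crosses": ["inter_region"]},
]

breakdown = cost_by_boundary(flows, placeholder_rates)
print(breakdown)
```

Even with toy numbers, a breakdown like this shows which boundary dominates the bill, which is usually the first place to look for a cheaper path.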
We really need to start weighing what’s important. Is saving every possible penny in transit cost the most important thing, or is maintaining only the very best application performance the most important thing, regardless of cost?
Mitigating suboptimal routing
Regardless of how you define suboptimal routing, we need to take a proactive approach to network design and monitoring to address it.
The internet is dynamic, so one important mitigation step is to adjust BGP routing policies and other traffic engineering mechanisms to prioritize performance metrics like latency and throughput where possible. This may involve some tweaking, but creating custom route priorities that align with your goals is possible.
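As a simplified illustration of a custom route priority, the sketch below models only two steps of BGP best-path selection: a higher local preference wins, then a shorter AS path. Real BGP implementations apply many more tie-breakers. The `Route` class, the `prefer_low_latency` helper, the latency measurements, and the addresses and AS numbers (taken from documentation ranges) are all hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Route:
    next_hop: str
    as_path: tuple
    local_pref: int = 100  # a common default value

def best_route(routes):
    """Simplified selection: highest local_pref first, then shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

def prefer_low_latency(routes, latency_ms, threshold_ms, boosted_pref=200):
    """Hypothetical policy: raise local_pref on routes whose measured latency meets the target."""
    return [
        replace(r, local_pref=boosted_pref) if latency_ms[r.next_hop] <= threshold_ms else r
        for r in routes
    ]

routes = [
    Route(next_hop="192.0.2.1", as_path=(64496, 64497)),
    Route(next_hop="198.51.100.1", as_path=(64498, 64499, 64500)),
]
measured_latency_ms = {"192.0.2.1": 80.0, "198.51.100.1": 20.0}  # made-up measurements

default_choice = best_route(routes)
tuned_choice = best_route(prefer_low_latency(routes, measured_latency_ms, threshold_ms=50.0))
print(default_choice.next_hop)  # 192.0.2.1 (shorter AS path)
print(tuned_choice.next_hop)    # 198.51.100.1 (boosted local preference)
```

Left alone, the shorter AS path wins even though it is slower. Once the policy encodes what we actually care about, the selection follows.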
For AWS networking, consider Direct Connect and PrivateLink, especially for critical workloads. These services can reduce the number of hops and give you more direct control over the traffic path, which can reduce cost and improve performance.
Another mitigation is to distribute the most-accessed resources across a wider geographic area so that they’re closer to end users. This would minimize the need for long-haul routing across multiple autonomous systems, which is usually more costly. In AWS, this can be done through strategically located VPCs or multi-region deployments.
Lastly, and this may go without saying, you need to monitor network performance and cost to know that routing is suboptimal. This is where cloud network observability comes in. Your monitoring platform should allow you to analyze an application’s path alongside performance metrics for each hop and transit cost information for that routing decision. With that historical and real-time information correlated, you can quickly understand why your cloud cost is what it is. Set your team up with the right telemetry, capable tools for analysis, and a strategic understanding of what “suboptimal” design means for an application, initiative, or company, and they can identify opportunities to optimize and make informed decisions on how to address them.
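To show what correlating path, performance, and cost might look like in practice, here is a minimal sketch that summarizes one application path from per-hop records. The record fields (`hop`, `latency_ms`, `transit_usd`) and the sample values are hypothetical and do not represent the export format of any particular monitoring platform.

```python
def summarize_path(hops):
    """Total latency and transit cost for a path, plus the hop that dominates each."""
    slowest = max(hops, key=lambda h: h["latency_ms"])
    priciest = max(hops, key=lambda h: h["transit_usd"])
    return {
        "total_latency_ms": sum(h["latency_ms"] for h in hops),
        "total_transit_usd": sum(h["transit_usd"] for h in hops),
        "slowest_hop": slowest["hop"],
        "priciest_hop": priciest["hop"],
    }

# Hypothetical per-hop telemetry for one application path over a billing period.
path = [
    {"hop": "vpc-a -> same-zone peer", "latency_ms": 1.0, "transit_usd": 0.0},
    {"hop": "peer -> transit hub", "latency_ms": 2.0, "transit_usd": 5.0},
    {"hop": "transit hub -> remote region", "latency_ms": 12.0, "transit_usd": 0.5},
]

summary = summarize_path(path)
print(summary["slowest_hop"])   # transit hub -> remote region
print(summary["priciest_hop"])  # peer -> transit hub
```

In this made-up example, the hop that costs the most is not the hop that adds the most latency, which is exactly the kind of mismatch that is hard to spot when performance and cost data live in separate tools.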
Kentik provides a unified solution for understanding hybrid-cloud, multi-cloud, data-center, and campus networks to effectively troubleshoot, interrogate all network telemetry, harden policy, and optimize software infrastructure.