The Network Also Needs to be Observable: Part 1 in a Series on Network Observability
Summary
The goal of network observability is to answer any question about your network infrastructure and to have support from your observability stack to get those answers quickly, flexibly, proactively, and interactively. In this post, Kentik CEO Avi Freedman gives his thoughts on the past, present, and future of network observability.
The network is the key
The 21st century has made it abundantly clear that networking infrastructure is critical to connecting the people, applications, economy, and distributed workforce that make the world go.
At the same time, networks and IT infrastructure overall are becoming more diverse, dynamic, and interdependent. The internet is now the critical glue that connects traditional and cloud infrastructure. And the distributed workforce and online-focused lives we’re living have driven growing adoption of SASE, CDNs, and other methods of delivering service to the edge.
The move to observability
The last five years have seen a major move to observability from the systems and application side. There are a number of definitions, but observability in the DevOps world has been about using diverse telemetry to know the internal states of systems over time (generally focused around metrics, logs, and traces), and providing answers to the unbounded questions needed to run modern applications.
Since Kentik’s launch, we’ve been at ground zero for a parallel move towards network observability, and we’re excited to keep partnering to move the industry forward.
Observability for the network looks at different telemetry, with a networking spin, but is based on the same principles: answering the questions you need to run the network infrastructure that drives the digital world.
How do we define network observability?
The goal is to answer any question about your network infrastructure, quickly and easily:
- Across any kind of network (cloud or on-prem)
- Across any kind of network element (physical, virtual, or cloud service)
- Whether isolated at the network level or with application and business context
… and to have support from your observability stack to get those answers quickly and flexibly, and both proactively and interactively.
The goal is to free up ops team time to architect, build, and develop for increased orchestration, automation, uptime, and performance!
The three keys to network observability
Our most successful customers, whose observability journeys we’ve learned from, have invested in three key areas: telemetry, data platform, and action.
I’ll talk more about each of these areas this month in subsequent blog posts.
Telemetry requirements to support network observability
In order to see and reason about the network, it’s critical to gather telemetry:
- From all networks (cloud, data center, WAN, SD-WAN, internet, mobile, branch, and edge)
- Of all telemetry types, including flow/traffic, device metrics, performance tests, configuration, routing, and provisioning/orchestration
- From all types of network elements, physical and virtual, forwarding and appliances, and dedicated or cloud-native
Without a complete picture of the state and activity of all your networks, you’re missing key capabilities to ask the questions and take the actions needed to ensure great traffic delivery.
Data platform requirements for network observability
To take telemetry and support asking questions, knowing about issues, and driving the actions needed to run infrastructure, there are common patterns and requirements for underlying data platforms:
- Sending telemetry live to the system, usually via a streaming message bus
- Enriching network telemetry with context such as user, application, customer, threat, and physical location, live at ingest time to match the real-time orchestration that continually changes these context streams
- Supporting network storage and query primitives such as path and prefix, underlay and overlay, and joining with routing and other types of information not found outside of the networking world
- High resolution storage and querying to support asking questions that were not planned in advance — requiring preserving the high cardinality (number of unique values) found in network data, such as IP addressing and port information
- Supporting open integrations across the ingest layer (telemetry, context, and provisioning APIs); query layer APIs; and outbound interactive and streaming push APIs that send unified telemetry, insights, and action triggers to other observability and action platforms
- Real-time learning, typically including feature extraction, baselining, and algorithmic and more advanced ML techniques, to surface insights before users know which questions to ask, or before they get around to asking them
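To make the enrichment and prefix requirements above more concrete, here is a minimal Python sketch of ingest-time enrichment using a longest-prefix match. It is an illustration under stated assumptions, not how Kentik or any particular platform implements it: the `CONTEXT` table, the flow field names (`src_ip`, `dst_ip`), and the `PrefixEnricher` and `enrich` helpers are all hypothetical, and the linear scan stands in for the trie or radix tree a real system would likely use.

```python
import ipaddress

# Hypothetical context table mapping prefixes to labels. In a real platform
# this would be fed continuously by routing, IPAM, or orchestration data.
CONTEXT = {
    "10.0.0.0/8": {"site": "corp"},
    "10.20.0.0/16": {"site": "dc-east", "application": "checkout"},
    "203.0.113.0/24": {"customer": "example-co"},
}


class PrefixEnricher:
    """Longest-prefix-match lookup over a small prefix -> labels table."""

    def __init__(self, context):
        # Sort most-specific prefixes first so the first match is the longest.
        self._entries = sorted(
            ((ipaddress.ip_network(p), labels) for p, labels in context.items()),
            key=lambda entry: entry[0].prefixlen,
            reverse=True,
        )

    def lookup(self, addr):
        ip = ipaddress.ip_address(addr)
        for net, labels in self._entries:
            if ip.version == net.version and ip in net:
                return {"prefix": str(net), **labels}
        return {}  # no context known for this address


def enrich(flow, enricher):
    """Return a copy of the flow record with src_/dst_ context attached."""
    out = dict(flow)
    for side in ("src", "dst"):
        for key, value in enricher.lookup(flow[f"{side}_ip"]).items():
            out[f"{side}_{key}"] = value
    return out


if __name__ == "__main__":
    enricher = PrefixEnricher(CONTEXT)
    flow = {"src_ip": "10.20.3.4", "dst_ip": "203.0.113.9", "bytes": 1200}
    print(enrich(flow, enricher))
```

Attaching the labels when the record arrives, rather than at query time, means each record keeps the context that was true when the traffic was seen, even if the mapping changes a minute later.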
Depending on the scale of the architecture, planning, engineering, and operations teams, it can also be important that the underlying data platforms are:
- Fast, providing answers to old and new types of questions at the speed of thought
- Multi-tenant, with appropriate security and integrity, and maintaining speed while many users and automated endpoints are asking questions
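As a rough illustration of why high-resolution, high-cardinality storage matters for questions that were not planned in advance, the Python sketch below groups raw flow records by whatever dimensions the question calls for. The record fields, the sample data, and the `top_talkers` helper are hypothetical; a real data platform would run the equivalent query over a distributed store rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical un-aggregated flow records. Keeping every dimension is what
# allows new questions to be asked later.
FLOWS = [
    {"src_ip": "10.20.3.4", "dst_ip": "203.0.113.9", "dst_port": 443,
     "application": "checkout", "bytes": 1200},
    {"src_ip": "10.20.3.4", "dst_ip": "203.0.113.9", "dst_port": 443,
     "application": "checkout", "bytes": 800},
    {"src_ip": "10.20.7.8", "dst_ip": "198.51.100.2", "dst_port": 53,
     "application": "dns", "bytes": 90},
    {"src_ip": "10.20.7.8", "dst_ip": "203.0.113.9", "dst_port": 443,
     "application": "search", "bytes": 500},
]


def top_talkers(flows, group_by, where=None, metric="bytes", limit=5):
    """Sum `metric` per unique combination of the `group_by` dimensions."""
    totals = defaultdict(int)
    for flow in flows:
        if where is not None and not where(flow):
            continue  # skip records outside the filter
        key = tuple(flow[dim] for dim in group_by)
        totals[key] += flow[metric]
    # Largest totals first, truncated to `limit` rows.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


if __name__ == "__main__":
    # A question someone might have planned for: traffic per application.
    print(top_talkers(FLOWS, ["application"]))
    # An unplanned follow-up: who is talking to one destination, on which port?
    print(top_talkers(FLOWS, ["src_ip", "dst_port"],
                      where=lambda f: f["dst_ip"] == "203.0.113.9"))
```

If the platform had stored only a per-application rollup, the first question would still be answerable but the second would not, because the source address and port would already have been discarded.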
Actions that network observability platforms must assist with
For network observability, the goal of asking questions is to understand and take action. Across the network teams we work with, we hear that they want to be able to:
- Answer questions in guided interactive ways (using and filtering/zooming in on maps, dashboards, and other defined views)
- Answer questions in unbounded ways, not pre-defined — and to zoom in to any granularity as needed
- Have insights (questions and their answers) proactively surfaced and presented to the users, ideally with suggested action
- Use workflows that automate human drudgery and increase the efficiency of routine tasks like traffic engineering, bill auditing, cost reduction, and performance optimization
- Integrate with ChatOps and workflow tools like Slack, Teams, PagerDuty, and ServiceNow
- Flexibly integrate with orchestration and automation platforms to drive automatic remediation and scaling
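Proactively surfaced insights usually start with some form of baselining. The Python sketch below shows one simple approach, an exponentially weighted moving average and variance, that flags a sample when it deviates sharply from recent history. It is an assumption-laden illustration rather than any platform's actual method: the `EwmaBaseline` class, its parameters, and the sample values are all hypothetical, and production systems typically also account for seasonality and multiple dimensions.

```python
class EwmaBaseline:
    """Flag samples that deviate from an exponentially weighted baseline."""

    def __init__(self, alpha=0.2, threshold=3.0, warmup=5):
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # samples to observe before flagging
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, value):
        """Return True if `value` is anomalous, then fold it into the baseline."""
        if self.mean is None:
            self.mean = float(value)
            self.count = 1
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        anomalous = (
            self.count >= self.warmup
            and abs(deviation) > self.threshold * max(std, 1e-9)
        )
        # Incremental exponentially weighted mean and variance updates.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.count += 1
        return anomalous


if __name__ == "__main__":
    baseline = EwmaBaseline()
    # Hypothetical per-minute traffic samples (Mbps) for one interface.
    for mbps in [100, 102, 98, 101, 99, 100, 102, 99, 500]:
        if baseline.update(mbps):
            print(f"insight: traffic of {mbps} Mbps deviates from baseline")
```

In a fuller pipeline, a flagged sample would become an insight sent to a chat or workflow tool, or a trigger for an automation platform, rather than a printed line.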
Traditional network monitoring leaves question gaps
How is network observability different from the hundreds of network monitoring and management tools and platforms that have been around for many years?
Historic tools have been standalone, closed systems, generally on-prem and running on one or a few nodes, without modern open data architectures.
With limited enrichment, granularity, and retention, they’ve also generally been confined to the kind of rollups and pre-defined queries whose shortcomings drove the move towards observability. Often vendor-specific, they generally don’t understand cloud or orchestration at all, or at most view them as separate kinds of networks.
These systems have also been geared towards deep network experts. As infrastructure layers converge and ops teams need both infrastructure and application visibility, these older, more closed, and more limited systems have not found a place in greenfield observability and monitoring stacks.
DevOps observability is great, but can’t answer many network questions
DevOps observability platforms have been a driver over the last few years in unifying a wide set of telemetry: traditional APM instrumentation with traditional logging, as well as metrics and the more recent waves of innovation in distributed tracing. Many of the platforms (though not all) can also deal, in part or in whole, with the kind of cardinality seen in network data.
But viewed through a “can I ask these questions about the network?” lens, there are still some gaps in how easily the leading DevOps platforms take network telemetry.
More critically, there are gaps in their understanding of network primitives like prefix, path, underlay, and overlay, and gaps in the kinds of workflows that network professionals engage in to plan, build, operate, debug, scale, and automate their infrastructures.
This all makes sense — even network observability platforms like Kentik that take application-layer data as telemetry don’t have the kind of workflows that developers and app operations teams need to ask questions requiring deep application context.
My view is — better together!
At Kentik, we’re super excited about helping bridge the DevOps/NetOps gap.
Watch this blog over the next month for a series of announcements about how we’ll be feeding unified, enriched network telemetry to a wide range of observability platforms, and some exciting work to drive network-focused views in leading DevOps and App Observability platforms — and the reverse, in Kentik.
Conclusion
Networkers need the same observability principles, tooling, and platforms that those up the stack have been building towards, but with a network-savvy bent.
The legacy network tools aren’t architected for modern infrastructure and the more modern DevOps-focused platforms still lack network savvy, especially around what happens when packets leave eth0.
Network teams practicing observability in architecture and action are already driving better performance, reliability, security, remediation, and growth. As a passionate network, data, and ops nerd, I’m beyond excited about what these emerging practices mean for the industry over the next decade and beyond.
It’s possible to get there, whether building yourself, working with a vendor, or both. At Kentik, we’re here as a resource wherever you are in your observability journey.