eBPF Explained: Why it's Important for Observability
Summary
eBPF is a powerful technical framework to see every interaction between an application and the Linux kernel it relies on. eBPF allows us to get granular visibility into network activity, resource utilization, file access, and much more. It has become a primary method for observability of our applications on premises and in the cloud. In this post, we’ll explore in-depth how eBPF works, its use cases, and how we can use it today specifically for container monitoring.
eBPF is a lightweight runtime environment that gives you the ability to run programs inside the kernel of an operating system, usually a recent version of Linux. That’s the short definition. The longer definition will take some time to unpack. In this post, we’ll look at what eBPF is, how it works, and why it’s become such a common technology in observability.
What is eBPF?
eBPF, which stands for Extended Berkeley Packet Filter, is a lightweight virtual machine that can run sandboxed programs in a Linux kernel without modifying the kernel source code or installing any additional modules.
eBPF operates with hooks into the kernel so that whenever one of the hooks triggers, the eBPF program will run. Since the kernel is basically the software layer between the applications you’re running and the underlying hardware, eBPF operates just about as close as you can get to the line-rate activity of a host.
An application runs in what’s called user space, an unprivileged layer of the technology stack that requires the application to request resources via the system call interface to the underlying hardware. Those calls could be for kernel services, network services, accessing the file system, and so on.
When an application runs from the user space, it interacts with the kernel many, many times. eBPF is able to see everything happening at the kernel level, including those requests from the user space, or in other words, by applications. Therefore, by looking at the interactions between the application and the kernel, we can learn almost everything we want to know about application performance, including local network activity.
Note that eBPF can also be used to monitor user space via uprobes, but we focus primarily on kernel activity for network observability.
How does eBPF work?
Bytecode
The BPF virtual machine runs a custom bytecode designed for verifiability. You can write programs directly in bytecode, though doing so is onerous at best. Typically, eBPF programs are written in another language and compiled to bytecode. For example, developers often write programs in C or Rust and compile them with clang, which is part of the LLVM toolchain, into usable bytecode.
The bytecode is generated ahead of time by that compiler, but the kernel translates it into native machine code just-in-time (JIT). Loading bytecode rather than machine code also allows the kernel to validate the program before running it; the JIT step is optional and happens only after validation.
eBPF bytecode is a low-level instruction set made up of fixed-size, 64-bit instructions executed by the kernel's BPF virtual machine. Each instruction encodes an opcode, the source and destination registers, an offset, and an immediate value, and is often written out as a series of hexadecimal bytes.
Here’s an example of what eBPF bytecode might look like:
0xb7, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00 ; mov r0, 2 (load 2 into register 0)
0xb7, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ; mov r1, 0 (load 0 into register 1)
0x0f, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ; add register 1 to register 0, store the result in register 0
0x95, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ; exit, returning the value in register 0
This example code loads the value 2 into register 0, loads 0 into register 1, adds the values in registers 0 and 1, and then exits the program with the result in register 0. This is a simple example, but eBPF bytecode can perform much more complex operations.
Using Python to write eBPF applications
Additionally, developers often use a Python front end to write an eBPF application from user space. This makes writing eBPF programs much easier, both because Python is so widely used and because of the many existing libraries developers can take advantage of. However, it's important to note that these Python programs are specific to the BCC toolchain rather than a general way of writing eBPF applications.
To program eBPF with Python, you can use the BPF class from the bcc module (distributed in the bpfcc packages on many Linux distributions). This module provides a Python interface to the BPF Compiler Collection (BCC), which allows you to write and load eBPF programs from Python.
For example, to write an eBPF program in Python that monitors TCP retransmits, you can use the BPF class to define a kprobe that attaches to the kernel's tcp_retransmit_skb function and captures information about retransmissions.
Here’s an example of what the Python code might look like:
from bcc import BPF
import ctypes as ct

# define the eBPF program
prog = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

BPF_HASH(start, u32, u64);
BPF_PERF_OUTPUT(events);

int trace_retransmit_entry(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    // track the starting time of the retransmit attempt
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

int trace_retransmit_return(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *tsp = start.lookup(&pid);
    // calculate the duration of the retransmit attempt
    if (tsp != NULL) {
        u64 now = bpf_ktime_get_ns();
        u64 delta = now - *tsp;
        events.perf_submit(ctx, &delta, sizeof(delta));
        start.delete(&pid);
    }
    return 0;
}
"""

# create and load the eBPF program
bpf = BPF(text=prog)

# attach the eBPF program to the tcp_retransmit_skb function
bpf.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit_entry")
bpf.attach_kretprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit_return")

# define a function to handle the perf output events
def print_event(cpu, data, size):
    # unpack the duration of the retransmit attempt (a raw u64 in nanoseconds)
    duration_ns = ct.cast(data, ct.POINTER(ct.c_ulonglong)).contents.value
    print("TCP retransmit detected (duration: %0.2f ms)" % (duration_ns / 1e6))

# loop and handle perf output events
bpf["events"].open_perf_buffer(print_event)
while True:
    try:
        bpf.perf_buffer_poll()
    except KeyboardInterrupt:
        break
In this example, we define an eBPF program that creates a hash map to track the starting time of each retransmit attempt and a PERF_OUTPUT event that reports its duration. We then attach the eBPF program to the kernel's tcp_retransmit_skb function using both a kprobe and a kretprobe, which allows us to capture both the start and the end of the function call.
We define a function to handle the PERF_OUTPUT events, which unpacks the duration of the retransmit attempt and prints it to the console. Finally, we loop and handle the perf output events using the open_perf_buffer and perf_buffer_poll methods.
This eBPF program will track TCP retransmits and print out the duration of each retransmit attempt. You can modify the program to capture other information as well, such as the number of retransmit attempts or the source and destination IP addresses of the affected packets.
This method can help us understand the cause of an application performance problem. TCP retransmits are generally a sign that some packet loss is occurring on the network, whether due to congestion or even errors on a remote NIC. Usually, when this happens, the network connection seems slow, but things still work.
Higher level abstractions
For another level of abstraction, open source tools have emerged, such as Cilium, which runs as an agent on servers or in container pods. Often paired with common tools like Grafana and Prometheus, Cilium is a management overlay for container networking built on eBPF. However, Cilium is also much more than this: it's a data plane that leverages eBPF to implement service meshes, observability, and networking functions as well.
Now owned by New Relic, Pixie is another popular open source eBPF management overlay with an attractive graphical user interface. Since these management tools operate at the kernel level via eBPF, they can also be used for observability, especially with containers.
eBPF Programs: Interacting between user space and kernel
Regardless of how you write them, the eBPF programs themselves are loaded from user space into the kernel and attached to a kernel event. This is when we start to see the benefits of eBPF because when the event we attached our program to occurs, our program runs automatically. So after being loaded from user space, an eBPF program will live in the kernel.
Before the program runs in the kernel, it's put through a built-in verification step called the verifier, which ensures the eBPF program is safe from both operational and security perspectives. This is important because it's how we know our eBPF programs won't use resources they shouldn't or get stuck in an unbounded loop. However, it's important to note that the verifier doesn't perform any sort of policy check on what can be intercepted.
After the eBPF program passes the verifier, it's just-in-time compiled into native instructions and attached to the hooks you want to use for your custom program. On the left side of the graphic below, you can see the eBPF program go from user space (Process) through the verifier, the JIT compiler, and then on the right, attached to the relevant hook(s).
Hooks can be almost anything running in the kernel, so on the one hand, eBPF programs can be highly customized, but on the other hand, there are also inherent limitations due to the verifier limiting access to the program.
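To make that load, verify, and attach flow concrete, here is a minimal sketch using the BCC toolchain from earlier. The choice of hook, a kprobe on the execve syscall, and the trace output are purely illustrative assumptions; any other kernel hook follows the same pattern.
from bcc import BPF

# a tiny eBPF program: fire on every execve() and emit a trace line
prog = """
#include <uapi/linux/ptrace.h>

int trace_exec(struct pt_regs *ctx) {
    // by the time this runs, the verifier has already checked the program
    bpf_trace_printk("execve called\\n");
    return 0;
}
"""

b = BPF(text=prog)                          # compile, verify, and load into the kernel
syscall = b.get_syscall_fnname("execve")    # resolve the arch-specific symbol name
b.attach_kprobe(event=syscall, fn_name="trace_exec")

print("Tracing execve()... Ctrl-C to exit")
b.trace_print()                             # stream bpf_trace_printk output from the kernel
Running this requires root privileges, and because BCC holds the program from user space, the probe is detached and the program unloaded when the Python process exits.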
eBPF Maps
Once run, the eBPF program may have gathered information that needs to be sent back to user space for some other application to access. This could be to retrieve configuration to run when a hook is triggered or to store gathered telemetry for another program to retrieve. For this, we can use eBPF maps. eBPF maps are basically generic data structures with key/value pairs and read/write access by the eBPF program, other eBPF programs, and user space code such as another application.
Like eBPF programs, eBPF maps live in the kernel; they are created and accessed from user space using the bpf() syscall, and eBPF programs read and write them via BPF helper functions. There are several types of maps, such as arrays, hash maps, prog arrays, stack traces, and others, with hash maps and arrays being the most commonly used.
Though eBPF maps are a common method for coordinating with user space, Linux perf events would also likely be used for large volumes of data like telemetry.
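As a minimal sketch of that pattern, again assuming the BCC toolchain used earlier, the program below has the kernel side increment a hash map each time a new task is created, while the Python side reads the same map from user space a few seconds later. The hook (wake_up_new_task) and the five-second window are arbitrary choices for illustration.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);    // key: parent PID, value: number of tasks spawned

int kprobe__wake_up_new_task(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);     // the eBPF program writes to the map in the kernel
    return 0;
}
"""

b = BPF(text=prog)             # the kprobe__ prefix tells BCC to auto-attach this probe

sleep(5)                       # let the map accumulate data
for pid, count in b["counts"].items():    # user space reads the same map
    print("pid %d spawned %d new tasks" % (pid.value, count.value))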
Lightweight performance monitoring with eBPF
In the context of eBPF, “lightweight” means several things. First, eBPF is fast and performant, and an eBPF program consumes very minimal resources. eBPF uses a just-in-time (JIT) compiler, so once the bytecode is compiled, it isn't necessary to re-interpret it every time the program runs. Instead, the program runs as native instructions, which is a faster and more efficient way to execute the underlying bytecode.
Second, an eBPF program doesn’t rely on probes or a visibility touchpoint in the network or application, so no traffic is added to the network. This may not be an issue in a very small, low-performance network; however, in a large network that requires many probes and touchpoints to monitor effectively, adding traffic can adversely affect the performance of the network in terms of latency, thereby skewing the monitoring results and possibly impacting application performance.
There is an important distinction here: eBPF monitors traffic originating from or terminating at the system running the BPF program, not network traffic in general.
Of course, using probes and artificially generated traffic isn’t inherently bad. That sort of monitoring is very useful and plays a significant role in active monitoring. Still, in some scenarios, passive monitoring is required to get the granular, real-time performance statistics of production traffic as opposed to the artificial traffic among monitoring agents.
Third, because eBPF can glean telemetry directly from the processes running in the kernel, there's no need to capture every single packet to achieve extremely granular visibility. Imagine a scenario in which you're running 40Gbps, 100Gbps, or even 400Gbps links, and you need that level of granularity. Capturing every packet at those link rates would be prohibitively expensive, if it's feasible at all. Using eBPF, there's no need for an additional physical tap network, and there's no need to store the enormous number of copied packets.
Next, eBPF doesn’t rely on traffic passing through probes or agents, which may need to traverse a variety of network devices both on-premises and in the cloud. For example, to determine latency using traffic generated from probes or by analyzing packets, that traffic would likely pass through routers, firewalls, security appliances, load balancers, etc. Each of those network elements could potentially add latency, especially the security devices doing DPI.
Lastly, prior to eBPF, custom kernel modules had to be written and inserted into the kernel. This could, and often did, have catastrophic results: if a newly inserted module faulted, it would take the kernel down with it.
eBPF and application latency
When determining application latency accurately, eBPF is very useful because it draws information directly from the kernel and not from traffic moving around the network. Additionally, those routers, load balancers, and firewalls could potentially route traffic differently packet-by-packet or flow-by-flow, meaning the visibility results may not be accurate.
Deterministic best-path selection is a strength of modern networking, but when it comes to measuring latency, if your probes take a different path each time, it poses a problem in getting an accurate picture of network latency between two targets.
Instead, an eBPF program is designed to observe what’s happening in the kernel and report on it. Network and kernel I/O latency have a direct relationship with application latency, and there are no probes to skew the data or packets to capture and process.
Active vs. passive monitoring
There are two main categories of visibility: active and passive.
Active monitoring
Active visibility tools modify a system, in our case a network, to obtain telemetry or perform a test. This is very useful, especially in networking, because we can use active visibility to test the state of a network function or network segment without relying on end-user production traffic.
For example, when you ping a target to test its availability, you add non-user-related ICMP traffic to the network. In that way, you can see if your resource is online and responding, at least at layer 3. You can also get an idea of round trip time, latency, jitter, and so on.
This is also how synthetic testing, sometimes called synthetic monitoring or just synthetics, works. Synthetic testing also uses artificial traffic instead of production traffic to perform some type of test function, such as measuring latency, confirming availability, or monitoring path selection.
Synthetic tests can be very advanced in what they monitor. For example, we can use synthetic testing to simulate an end-user logging into an e-commerce site. Using a synthetic test, we can capture the metrics for each component of that interaction from layer 3 to the application layer itself.
However, though active visibility is very powerful and should be a part of any overall monitoring solution, there are several inherent drawbacks.
First, by adding traffic to the system, you’re technically not measuring what’s happening with your application or end-user traffic. You’re collecting telemetry on the test traffic. This isn’t necessarily a bad thing, but it isn’t the same as collecting metrics on the application’s activity itself.
For example, suppose you want to know the accurate network latency affecting a production application located in your private cloud. In that case, the traffic of a ping or even a synthetic test may take a different path there and back. Therefore, you would have a twofold problem: first, the active monitoring didn’t test actual application activity, and second, the results may be for a completely different network path.
Second, in a busy network, devices such as routers, switches, and firewalls may already be operating at relatively high CPU. This is common in service provider networks and on data center core devices. In this scenario, sending test traffic to a busy router or switch only adds to the packets it has to process, which is a bad idea. In some instances, the ongoing monitoring activity alone might be enough to adversely affect the performance of other applications.
Passive monitoring
Passive monitoring provides information on what's happening both historically and in near-real-time. The telemetry gathered from passive monitoring is of actual application, system, and network activity, making the information the most relevant for knowing how an application performs and what the end-user experience is like. No changes are made to the system that would skew your visibility results.
However, passive monitoring also has its limitations. Because you’re gathering telemetry from actual production traffic, you’re relying on end-user activity to tell you if things are bad. That means to know if there’s a problem, your end-users are probably already having a poor experience.
One workaround is that passive telemetry tools can use hard and dynamic thresholds to alert you when metrics are trending worse. In that way, an engineer can anticipate a poor end-user experience before it happens, or at least before it gets really bad. However, alerting with passive monitoring still relies on production traffic trending worse, so though we can anticipate poor performance to an extent, it’s still not ideal.
In its truest form, observability is about monitoring a system without affecting or changing it. It’s about looking at the various outputs of a system to determine its health. eBPF sees the activity happening in the kernel and reports on it rather than adding anything to the system other than the nominal resources it consumes to operate.
Therefore, eBPF is a form of passive monitoring because no changes are made to the system, the application, or the traffic itself.
eBPF use cases
There are several use cases for running eBPF at the kernel level. The first is for networking, specifically routing. Using eBPF, we can program kernel-level packet forwarding logic, which is how certain high-performance routers, firewalls, and load balancers operate today. Programming the forwarding logic at the kernel level results in significant performance gains, since packets are processed as early as possible in the kernel, close to line rate, without ever being handed up to user space.
The most commonly used networking hooks are XDP (eXpress Data Path) and tc (traffic control), along with a variety of hooks for programming the data plane directly. XDP and tc are often used together: XDP sees only ingress traffic, so tc is used to capture information about egress traffic as well.
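Here is a minimal sketch of the XDP hook in action, again assuming BCC. It counts ingress packets on one interface and passes every packet up the stack unchanged; the interface name eth0 is an assumption to adjust for your system, and a separate tc program would be needed to see egress traffic.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/bpf.h>

BPF_ARRAY(pkt_count, u64, 1);

int xdp_count(struct xdp_md *ctx) {
    int key = 0;
    u64 *val = pkt_count.lookup(&key);
    if (val) {
        __sync_fetch_and_add(val, 1);   // count the ingress packet
    }
    return XDP_PASS;                    // let the packet continue up the stack
}
"""

device = "eth0"                         # assumption: replace with a real interface name
b = BPF(text=prog)
fn = b.load_func("xdp_count", BPF.XDP)
b.attach_xdp(device, fn, 0)

try:
    while True:
        sleep(1)
        total = sum(v.value for v in b["pkt_count"].values())
        print("packets seen on %s: %d" % (device, total))
except KeyboardInterrupt:
    pass
finally:
    b.remove_xdp(device, 0)             # detach the program on exit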
Second, eBPF can be used for both packet-level and system-call visibility and filtering, making it a powerful security tool. If an undesirable or potentially malicious system-call is observed, a rule can be applied to block it. If certain packet-level activity is observed, a filter can be applied to modify it. The benefit of this is providing visibility and remediation as close to the target as possible.
A third use case is observability, which we’ll focus on in this post. In the classic sense, observability is determining the state of a system by looking at its outcomes without making any changes to the system itself. Since eBPF doesn’t affect the performance of the kernel, including its processes, we can get extremely accurate information about network and application performance without it being skewed by having to draw resources from the kernel itself.
In this way, you can gather runtime telemetry data from a system that does not otherwise have to expose any visibility points that take up system resources. Furthermore, collecting telemetry this way represents data at the actual source of the event rather than using an exported format of sampled data.
What can you learn with eBPF?
In the graphic below from Brendan Gregg’s website, dedicated to his extensive work with eBPF and observability, notice the variety of data we can collect directly from a device’s kernel.
You can learn a tremendous amount of information using eBPF; this graphic represents only a portion of what you can do. eBPF is event-driven, so we collect information about every event in the kernel. We can then learn about everything happening on a host machine or container, including each individual application.
An eBPF program runs when the kernel or the application you’re interested in passes a specified hook or hook point, including network events, system calls, function entry, function exit, etc.
So if we want to know about a single application's activity and overall performance, we can do so by using specific hooks that grab that telemetry without modifying the application or inadvertently affecting its performance.
In the graphic above, the network stack is the area in light green. Notice again what kind of information you can learn directly from the source using eBPF.
These are all important functions of observability at the network layer, so to expand on just a few:
- tcptop allows you to summarize send and receive throughput by host.
- tcpdrop allows you to trace TCP packet drops.
- tcpconnect allows you to trace active TCP connections.
- tcpretrans allows you to trace TCP retransmissions, which happen when an acknowledgment isn't received before the retransmission timer expires and are a common cause of latency.
- tcpstates allows you to see TCP session state changes and the time spent in each state.
With the information we get from the functions above and other eBPF tracing functions, we can ask questions such as:
- Is my slow application experiencing TCP retransmits?
- Is network latency affecting the performance of my interactive application?
- Is traffic from my container(s) going to an embargoed country?
- Is the remote server I don't own taking longer than expected to process my TCP request?
Monitoring containers with eBPF
Since we’re running production workloads today using containerized microservices, we can’t ignore the need for container visibility. However, containers present a problem for traditional visibility tools and methods.
First, containers are ephemeral by nature, meaning they are often short-lived. They are spawned when needed and destroyed when unnecessary. Though we can do the same with virtual machines, it’s done so frequently with containers that capturing telemetry information gets difficult.
Typical application, network, or infrastructure monitoring can’t easily capture the information we want from containers. You can consider each container as an individual host. Since containers can be enormous in number in a production environment, the sheer amount of metrics and telemetry available to gather is overwhelming.
Also, containers are usually deployed en masse in cloud environments, making visibility that much more difficult to obtain. It's not as simple as monitoring the virtual machines in your EC2 environment, running an APM solution, and collecting packets and flows from the network devices between you and your cloud environment.
Since eBPF runs at the kernel level of the host, which its containers share, we can use eBPF programs to collect telemetry from ephemeral constructs such as containers and, to an extent, consolidate network, application, and infrastructure visibility tools into a single eBPF-based visibility solution.
So with eBPF, we can capture information about processes, memory utilization, network activity, file access, and so on, at the container level, whether those containers are deployed in the cloud or not.
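As a rough sketch of how an eBPF program can tie kernel events back to containers, the BCC program below counts execve() calls per cgroup using the bpf_get_current_cgroup_id() helper, available on kernels 4.18 and newer. Mapping the resulting cgroup IDs back to container names is left to user-space tooling and isn't shown here.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(execs, u64, u64);    // key: cgroup ID, value: execve() count

int trace_exec(struct pt_regs *ctx) {
    u64 cgid = bpf_get_current_cgroup_id();   // identifies the calling task's cgroup
    execs.increment(cgid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")

sleep(10)                     # gather ten seconds of activity
for cgid, count in b["execs"].items():
    print("cgroup %d ran execve() %d times" % (cgid.value, count.value))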
Kappa: Kentik’s process-aware telemetry agent
Kappa is Kentik’s host-based telemetry agent, designed to address visibility gaps in east/west flows across production data centers. Kappa was built to help organizations better understand traffic flows, find congestion and performance hotspots, visualize and identify application dependencies, and perform network forensics across on-premises and cloud workloads.
Kappa features
Kappa uses eBPF to consume as few system resources as possible and to scale to 10Gbps of sustained traffic throughput while consuming only a single core. Generating kernel flow data using eBPF allows Kentik to see the total traffic passing between any source and destination IP, port, and protocol across every conversation taking place within a host, cluster, or data center. Because this information is generated using the Linux kernel, Kappa also reports performance characteristics such as session latency and TCP retransmit statistics.
Enrichment for container monitoring
Kappa also enriches these flow summaries with application context. Using Kappa, we can associate conversations with the process name, PID, and the command-line syntax used to launch the IP conversation.
The container ID is also associated if the process runs inside a container. If the container was scheduled by Kubernetes, Kappa enriches the flow record with the Kubernetes pod, namespace, workload, and the relevant node identifiers.
Before exporting these records to Kentik, Kappa also looks for any records associated with other nodes in an environment and joins the duplicate traffic sources together with the source and destination context. This gives us a more complete picture of application communication within a data center.
Though container network monitoring is an important use case for Kappa, it was designed to seamlessly monitor bare-metal, containerized, and cloud-native workloads using a single, flexible agent deployed as a process, a container, or directly into a Kubernetes cluster.
Kappa is distributed as Kubernetes configuration files. Linux packages for VM/bare metal use are available at https://packagecloud.io/kentik/kappa, and its configuration can be viewed on GitHub (https://github.com/kentik/kappa-cfg) or can be cloned to your workstation:
$ git clone https://github.com/kentik/kappa-cfg.git
Conclusion
The nature of modern application delivery requires new methods for observability. Because so many applications are delivered over a network, ensuring application availability and great performance means having deep system visibility and granular network visibility in the context of those applications.
This means gathering telemetry for a variety of devices related to applications and their delivery, including containers. And when you also factor in public cloud, SaaS, and the various network overlays of today’s WAN and campus, gathering this telemetry becomes more important and difficult.
eBPF has emerged as a perfect solution for collecting passive telemetry in modern environments, especially in the context of cloud and cloud-native containers. Operating at the kernel level and being lightweight means eBPF can provide us the telemetry we need specifically about application activity without inadvertently and adversely affecting application performance.
Though eBPF has other uses, such as networking and security, its benefits to modern observability are changing the way we both see and understand what’s happening with the applications making their way over networks around the world.