Telemetry Now  |  Season 2 - Episode 66  |  January 22, 2026

Practical MLOps for Network Operations at Uber

Host Philip Gervasi talks with Uber's Vishnu Acharya about how Uber applies machine learning and MLOps to network operations at hyperscale. Vishnu explains Uber's intentionally simple network design across on-prem and multi-cloud, then shares practical machine learning use cases like predictive capacity planning, hardware failure-rate tracking, and alert correlation to reduce noise and speed mitigation. They also discuss organizational issues, including building blended network/software teams, partnering with internal ML groups, and focusing on service-level outcomes over hype.

Transcript

Modern network operations are operating at a scale and at a pace that just didn't exist a decade ago.

As networks become more dynamic, more distributed, and more tightly coupled to business outcomes, those more traditional approaches to troubleshooting, capacity planning, and operations in general kinda start to break down.

On this episode of Telemetry Now, we're joined by Vishnu Acharya from Uber to talk about what it really takes to run network operations at hyperscale and how Uber's NetOps team has applied machine learning and MLOps principles to tackle real operational challenges.

So from capacity planning and reactive troubleshooting to data engineering hurdles and organizational realities, this is a really interesting conversation, and one that I really think cuts through a lot of the hype out there about ML and especially AI.

So we'll be focusing on practical use cases, lessons learned, and what actually works in production.

So if you're curious about how ML and AI are being applied inside one of the world's most demanding hyperscale network environments and what that might mean for your own NetOps journey, this is a great episode for you.

My name is Philip Gervasi, and this is Telemetry Now.

Thanks so much, Vishnu, for joining today. We have been going back and forth for, like, months now, trying to schedule a time to record an episode. So I'm really excited about having you on today. Thanks for joining.

Yeah, thank you so much for having me. Excited to be here.

Thank you. Great, great. So before we get started, and get really deep into technology and use cases and some of the politics even, I do want to level set about what you do for Uber, what your role is, what your team is doing, that sort of thing. And then we can, you know, frame the rest of our episode today in that activity that you're doing with MLOps and AIOps in your environment.

Sure. Sounds great. Yeah. So, Vishnu Acharya, I manage and lead a network infrastructure team based here in Amsterdam, in the Netherlands. I've been at Uber about eleven and a half years in the networking space.

You know, I started out building Uber's first on prem networks and data centers back in twenty fourteen, all the way fast forward to today. And my organization today is responsible for sort of the full spectrum: network engineering and the software automation around that for Uber's infrastructure. So design, architecture, build, operations, observability tools, and network automation software platforms for Uber's on prem data centers, backbone, as well as our cloud networking in our public cloud deployments.

Okay. Great. Now this is gonna be, maybe it's not, but this might be kind of a weird question. I am sure that our audience is familiar with Uber and what Uber does and what Uber is.

But can you tell our audience, can you tell me, what is Uber's network like from your perspective? From a perspective of scale, and how fast things move or how slow things move, how much automation you've employed over the years, and that sort of thing?

Sure. Yeah, absolutely. So, you know, I like to say that our network is actually very big, but it's also very simple in its basic building blocks, you know, intentionally. Right?

So, you know, we had the advantage of really designing our network architecture standing on the shoulders of companies like, you know, at the time, Facebook, Google, Microsoft, etcetera. So borrowing some large scale hyperscaler designs, you know, into what we're doing. We had a clean sheet and it's very rare in my career. This is the only time it's happened where I've had a clean sheet and almost unlimited budget, at least back in twenty fourteen when it comes to building infrastructure.

So our network was really designed to be a large, flexible environment for our service owners and our internal stakeholders to then go play with their services, right? So we don't do a lot of customization or bespoke network designs, you know, per service. We have just a large scale parallel-plane Clos fabric design in our data centers.

We're running all, you know, well known protocols such as BGP everywhere. In our backbone, we have some MPLS and RSVP-TE, but other than that, it's very, very simple. We avoid vendor specific technologies.

We don't really have any underlay/overlay networks.

We're just doing layer three everywhere, inside the network and outside.

Now the advantage of that is, you know, we have this large scale fabric with a ton of physical, you know, redundancy built in, so we can withstand a lot of physical failure and still operate.

Our service owners can deploy their workloads anywhere, anytime, and they'll get the same network behavior as they expect, no matter the zone or on prem region. Now that's the on prem version of our network, and that's where we started. Now as we've gone through the years, you know, we've deployed like everyone else into the cloud, so we're in three different cloud providers today. Now those environments are a little bit more specialized.

We are using some vendor specific, you know, technologies where they're either best in breed, or they offer some capability that we could not build ourselves, right? So an example of that is our edge deployments in Google Cloud, which are globally deployed. They just have a much larger footprint than we could build, obviously, in the timeframe we had.

There's other examples where we're using specific things in the cloud that are useful to our service owners. But from a network perspective, all of our networks are sort of designed the same in terms of the size, whether they're in a cloud VPC or on prem zone. So we've kept sort of that standardized footprint across all of our zones, whether they're on prem or cloud. And the biggest thing is predictable or as predictable latency we can, you know, latency SLAs that we can provide to our service summers where they're operating their microservices. So now for us, if, you know, the north south traffic to Uber is actually not really that big, right? So most of our traffic, like many companies, is very, very heavily on the data infrastructure side, and it's a lot of East West traffic that goes on either inside a zone or between zones and across the region.

So our backbone, as well as our sort of inter- and intra-data center capacity, is really where our focus is for supporting our services.

Okay, great. So then, you know, it sounds like the goal from the very beginning was understanding that, yeah, we're gonna scale this to a very large network, maybe even service provider-like in some aspects, based on some of the things that you said. Yeah. Like, you didn't mention any, like, Wi-Fi and closet switches, although I know you have those things. Yeah. That's not necessarily your focus.

And then the transition to a more cloud heavy design, of course, in this hybrid environment that you live in. But ultimately, all of that to serve a modular approach, your focus isn't necessarily like you're not doing like day trading. So you're really focused on stability, reliability, low latency. Like you said, of course, it's always a thing. Latency is the new outage for sure. Is that about right?

Yeah, that's a great summary. So, you know, I think for us, the biggest thing is reliability, because, you know, I've worked at other companies in the past. I worked at Netflix, for example. You know, reliability is also very, very important to them.

However, the implications of incidents and outages are different. You know, if we have an outage in an environment like that, in general, nobody's going to cancel their subscription over, you know, a fifteen minute outage, whereas a fifteen minute outage in the Uber world is people stranded on the street or taking alternative methods, earners who can't earn on the app, you know, restaurants that can't deliver food. And those transactions are sort of ephemeral. They're not coming back, that particular transaction.

Right. So the financial implications are huge. So we invested a lot in physical redundancy, as well as in the design being resilient.

Okay, reliability, predictability. Those are kind of the operative words I'm hearing. So let's talk about some of those operational challenges. You know, me personally, I've worked in enterprise for years and years as a delivery engineer, as a solutions architect, that sort of thing, from kind of the SMB space all the way up to global enterprise. And, you know, the operational challenges were very similar in all of those environments.

But I have to imagine that there are some that are unique to what you're doing at Uber. So let's talk about that. What are some of the operational challenges that you faced, not only in the beginning, but even today, going through this past decade plus that you've been there?

Sure. Yeah, so I think, you know, when we started out, even though the growth rate and the sort of overall growth of the company was like nothing I've ever seen,

it was much more predictable, right? The business line was simple. The product was rides. The number of markets we were in was much smaller.

So the explosion of growth in the business was the number one challenge: just keeping up with capacity. So all of our software automation, all of our tooling was really designed around one thing, which is like, how do we stamp out these modular, you know, data center footprints over and over again, and sometimes in parallel in different parts of the world, different parts of the country, with a small team, right? This is a team of maybe ten or twelve engineers for many of the beginning years. So that was the number one challenge. Yeah.

And then I think the next challenge was, as Uber's business diversified, things became more complex in terms of the requirements on the network from our services. So think about, you know, when we launched Uber Eats as a totally different sort of business line.

There's a lot of integrations with very large third parties, you know, large restaurant chains like McDonald's, etcetera, which have their own tight operational SLA requirements.

And then you layer on top of that sort of the regional and local complexity and differences in different markets. So we offer different products as Uber, you know, you can ride on the back of a motorcycle in a lot of Uber markets in other parts of the world.

In Europe, they have different, you know, different integrations with some of the more traditional taxi companies, for example, like here in the Netherlands. So all those integrations, you know, added some complexity.

I think the biggest challenge right now is probably in the autonomous vehicle space. So, you know, Uber has been partnering with Waymo and other, you know, autonomous vehicle providers. I think we're up to twenty four partnerships today. Many of these were announced, you know, just in the last, let's say, two years. And so that infrastructure is very different from, again, what we're used to building, which is, hey, we're building data centers or we're building cloud connectivity or we're deploying, you know, new networks in new VPCs and cloud providers around the world, which is sort of like a well known and, you know, well understood problem for us.

You know, deploying large scale networks to, you know, tens or hundreds of austere sites, I'll say, across the world where these autonomous vehicles will be operating from is a challenge that we're working on right now.

Okay, right. So it sounds like capacity planning. It sounds like making sure that the network is as modular and predictable as possible. Those are the kinds of goals that you have for your team, the overarching goals over all the various actual day to day operational activities and decisions that you have to make.

Yeah. Like, which protocol do we use here and how do we traffic engineer here? But it all kind of boils up, bubbles up, whatever the term is, to how do we make this thing more predictable, more reliable, more rock steady? And where are those bottlenecks that we can stamp out?

Now, you said stamping out new data centers. And as soon as you said that, I'm thinking about like, you know, that whack a mole game where you're like at the county fair and you're trying to like every time there's like a little bottleneck and that's affecting latency or like, how do we whack that and get that out of there? So how do we do that ahead of time? So what I want to transition to now is talking about what you have used with regard to machine learning, ML Ops, and we can talk about some contemporary AI concepts as well, because I know you're doing some stuff there.

So some of that technology that you've used to achieve some of those goals, and you know, we're going to get specific, I know. So feel free to take the conversation any way you want to go, because I really want to know basically what your team is doing with machine learning at Uber and network operations. So let's define what you actually mean by machine learning and MLOps.

Sure. Yes. So for us, I think from a network engineering perspective, when it comes to machine learning and AI ops, you know, we're really looking at some very simple metrics to start with, right? So the metrics that we're really collecting now, we're collecting, you know, tons of metrics across all of our network devices from our cloud providers, etcetera.

But what we really care about is latency and loss, right? So we're looking at packet loss across different segments of the network, being within some certain thresholds, latency obviously within those thresholds. And anytime we're deviating from that is something we're very interested in. And now the underlying cause of that is, you know, could be further gathered from other metrics we're looking at.

So, you know, we're obviously harvesting all of our syslog data from all our network devices. We're looking at flow data.

We're looking at actual, you know, metrics in terms of, like, link quality: errors, QoS queue drops, link flaps, BGP transitions, all of those things.

And it's almost a case of like too much, too many metrics there for humans to understand. So that's why I think for me at the base level, like ML and AI ops sort of come into play is like, how do we, you know, make sense of this data in a real time way where we can have smarter alerting, smarter response, smarter incident management, and ultimately, you know, reduced mean time to mitigation is always our goal. So an example of that is, you know, one of the early examples of that is, you know, in a large scale environment where we're deploying hardware platforms all the time, you know, from some of the big name network providers out there, network device providers, as well as the components such as optics, line cards, etcetera.

So one of the first use cases we're really interested in is tracking sort of failure rates across those hardware platforms. And we didn't have a good way of doing that for many, many years. So, you know, we'd sort of anecdotally figure out, like, this certain type of optic seems to be failing more often than we would expect. We didn't really have the data in a format where we could put some intelligence to it and actually figure out what is the failure rate of this hardware platform, and is this, you know, within what we would expect or is it not?

What is our baseline? So defining that, working with vendors and understanding, is this a failure rate that we should expect or not? So those are things that we did invest some time in because that, you know, when you're operating, let's say we're operating one data center, it's not such a big deal, but you know, if you're in twenty two different data centers and you have, you know, literally hundreds of thousands of optical links, you know, failure rates become really important, especially when you start to get, you know, start worrying about cost and being more efficient as everyone does, right?
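
(As a rough illustration of the kind of failure-rate tracking described here, the sketch below computes an observed annualized failure rate per hardware model and flags anything running hotter than a vendor baseline. The model names, inventory counts, and baseline rates are invented for the example; this is not Uber's actual tooling.)

```python
# Illustrative sketch: per-model failure-rate tracking for optics and line cards.
# All model names, inventory counts, and baseline AFRs are made-up examples.
from collections import Counter
from dataclasses import dataclass

@dataclass
class FailureEvent:
    model: str        # hypothetical hardware model identifier
    device: str
    timestamp: float  # epoch seconds of the failure

def annualized_failure_rate(events, inventory, window_days):
    """Return {model: observed annualized failure rate} over the observation window."""
    failures = Counter(e.model for e in events)
    years = window_days / 365.0
    return {
        model: failures.get(model, 0) / (count * years)
        for model, count in inventory.items() if count > 0
    }

def flag_outliers(observed, expected, tolerance=1.5):
    """Flag models failing at more than `tolerance` times the vendor-expected AFR."""
    return {
        m: (rate, expected.get(m))
        for m, rate in observed.items()
        if m in expected and rate > tolerance * expected[m]
    }

# Hypothetical data: 20k units of one optic, 35 failures observed in 90 days, etc.
inventory = {"100G-LR4-vendorX": 20000, "400G-DR4-vendorY": 5000}
events = [FailureEvent("100G-LR4-vendorX", f"sw{i}", 0.0) for i in range(35)]
observed = annualized_failure_rate(events, inventory, window_days=90)
expected = {"100G-LR4-vendorX": 0.002, "400G-DR4-vendorY": 0.004}  # vendor AFR claims
print(flag_outliers(observed, expected))
```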

So those are things we sort of started with. Now, you know, fast forward to today, and it's much more sort of around service behavior and thinking of the network as a service, right? So yes, we run networks on prem, we run networks in the backbone, we run networks in the cloud, but from our stakeholders' perspective, it's a network, right? They want the network to run.

They want their service to be able to communicate. And it doesn't matter to them necessarily whether it's cloud or on prem or data center. So that's where we have to figure out, you know, how we're using the metrics that we have to really understand the behavior of the network and how it's interacting with our services.

Okay. So networking aside, it sounds like the initial impetus was a data problem. You were ingesting a tremendous variety, and I assume volume, given the scale of the network, of metrics.

And we can go into, like, what kind of metrics. You mentioned a few, syslog flow and things like that. But, I'm sure that has only grown once you start adding, like, VPC flow logs and Google logs and, whatever you're, you know, pulling and screen scraping and all that kind of stuff.

But the goal of that collection was, it sounds like, almost a predictive analytics initiative. You wanted to understand where the failures could happen or were likely to happen so you could get ahead of it, rather than only reactive troubleshooting. Is that right? And ultimately have a more stable, reliable network.

Yeah. And I think we're still on that journey, right? So I think where we've actually made some real good progress is on sort of the predictive capacity analysis and prediction, right? And this actually was a big topic for us last year.

So we actually, you know, had an internal code yellow where we realized that we're exceeding safe capacity within certain segments of our network, meaning that we're essentially, you know, losing our n plus one or n plus three redundancy because of the capacity usage. And also, you know, being in the middle of sort of this massive cloud migration as we are. Right? So we're moving tons and tons of data, you know, from on prem to cloud, between clouds, between data center zones.

So it gets very hard to predict, you know, where the bottleneck is potentially going to be ahead of time. So one of the things we really worked on last year was, you know, exactly that, right? So we're ingesting a bunch of metrics around our capacity utilization across all of our, you know, links within our data center and backbone, but in particular paying attention to sort of our edge pods, where we connect different data centers and data center zones within a region together. And then between our edge pods and our backbone and across our backbone, and then our backbone to cloud providers.

So those sort of three things are areas where we really focus, because what we noticed, and you mentioned whack a mole earlier, is like, you know, we would upgrade a certain segment of our network and it would sort of unleash a bunch of capacity to our service teams, which they would immediately fill up again. And the bottleneck would essentially move further upstream. Right? And then, you know, when you layer on top of that the complexity of, like, doing hardware refreshes, maybe upgrading hardware platforms at the same time, you know, going from one hundred gig to four hundred gig, for example, just really understanding all of that together was very difficult for any human being.

And even though we have outstanding engineers, you know, we needed a system to do it. So we really spent a lot of time on it.

And what we've built is actually a model that ingests a bunch of these utilization metrics. We've tuned an algorithm to sort of do some prediction around, okay, this segment, you know, at the current utilization rates and expected organic growth rate, is going to run out of safe capacity, meaning n plus one or n plus three, at this date or roughly this date. And so that gives us a rough estimate of, like, you know, based on organic growth, six months from now you're gonna need to upgrade this segment of the network.
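
(To make that capacity prediction concrete, here is a minimal sketch of the simplest possible version: fit a trend to daily peak utilization for one segment and estimate when it crosses a safe ceiling that still preserves n plus one redundancy. The history, the linear trend assumption, and the seventy percent ceiling are invented for illustration, not Uber's model.)

```python
# Minimal sketch: forecast when one segment's utilization trend crosses a safe ceiling.
from statistics import linear_regression  # Python 3.10+

def days_until_exhaustion(daily_peak_util, safe_ceiling):
    """daily_peak_util: list of daily peak utilization fractions (0..1).
    Returns estimated days from the last sample until the trend crosses the ceiling,
    or None if utilization is flat or declining."""
    days = list(range(len(daily_peak_util)))
    slope, intercept = linear_regression(days, daily_peak_util)
    if slope <= 0:
        return None
    crossing_day = (safe_ceiling - intercept) / slope
    return max(0.0, crossing_day - days[-1])

# Hypothetical edge-pod uplink: slow organic growth, safe ceiling at 70% so that
# losing one of the parallel links still leaves headroom (n+1).
history = [0.30 + 0.0015 * d for d in range(180)]
print(f"~{days_until_exhaustion(history, safe_ceiling=0.70):.0f} days of headroom left")
```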

Now that's just sort of step one, because for us, the organic growth is sort of predictable. Like, our business is actually really good at predicting here's Uber's growth, and we understand what that means for the network. Now, sort of the surprises for us that come up are these step changes in capacity utilization that happen because there's some new service or there's some new launch or there's some new migration that needs to happen that we didn't plan for. Right? So besides just being a purely technical problem, it's also an organizational challenge, right? Communication challenges, all of those things wrapped into one.

And I wanna get into that part for sure. That's something we don't think about as often as engineers. We stay within, you know, the OSI model. But then there is layer eight, the politics and the organizational activity that we need to discuss. But you alluded very briefly to what you used to do before you used this newer MLOps paradigm.

Can we get into that? Just, what did you do to solve capacity planning prior to tuning your own model and algorithm?

Yeah. So this is it's funny. I'm laughing a little bit because it's going to sound very, very low tech.

That's okay. Yeah, I want to hear it.

We had a few, but in particular, there was one very experienced network engineer at Uber who actually designed our data center fabrics starting in twenty fourteen.

And he just recently left Uber after eleven and a half years or so.

But it sort of fell all on his shoulders, right? So it's sort of crazy to think that, you know, for a one hundred and sixty billion dollar enterprise.

But that is what it is, right? He would spend time really digging into the metrics, looking at graphs, understanding, okay, hey, this segment of the network looks like it's gonna be running hot soon. Like, we need to upgrade, start planning those upgrades, start ordering, you know, whether it's ordering fiber, ordering equipment, ordering links, cross connects, whatever. And then the work would be done.

And that got us through a lot, right? Got us through many years. But obviously it's not sustainable to have one or a handful of engineers doing capacity planning sort of manually, which is where we were at. Because, you know, for us, again, our focus for so long has been just on building it bigger, right?

So we've been building as big as we can for as long as we can remember. And the sort of turn to being more efficient didn't happen till sort of later in Uber's arc, you know, let's say the last two years. It was like, okay, we can't just throw capacity at a problem. We have to actually understand it, because, you know, by doing it manually and just trying to add as much capacity as we can, as you mentioned, there's different choke points that pop up. So we may be adding capacity in the wrong place and we may still have an issue.

So, yes, there's sort of an inherent cost, an operational cost, and then just financial cost.

And then opportunity cost, money that you spend just to add capacity that you could have spent somewhere else. So you wanna do that as intelligently as possible. And what's interesting to me, Vishnu, is that you mentioned that you did do it. You were successful at it, but it was literally, like, a guy or a small team of people that did that with their domain knowledge in their heads.

Like, they knew that if you wiggle a red wire in closet four, you know, the Chicago location will go down, that kind of embedded-in-my-brain knowledge. And it worked. It worked. Engineers do that.

But it's, like you said, not scalable, certainly not to the scale that you're operating at now. And, you know, there's a certain level of intuition that is fallible, I would say. I mean, we're good at it because we have the years of experience, but ultimately, it's not sustainable. So you wanted to add some sort of intelligence, but you're not talking about the latest chatbots and things like that, that everybody is really, really excited about lately. And now we're talking about agent systems and everything kind of revolves around large language models, which I think are fantastic.

And we can certainly touch on that. But really, thus far, this conversation is about how can we do some of the engineering activities, the operational engineering activities, more programmatically, with more intelligence and at a much greater scale because of the inherent problem of data volume and variety, and then getting some of that domain knowledge out of people's brains and into, like you mentioned, a fine-tuned model. And so that's really neat. So why did you believe that going that direction would help?

Of all the different things that you could have done, perhaps buying something off the shelf or whatever. And the reason I ask is because, one, this was a few years ago, before the AI buzz had started. Another thing is that you don't normally hear folks on network teams talking about ML kinds of technologies. So what got you thinking in this direction?

Yeah. So actually, you know, I think you're absolutely right. Like, we don't talk about it enough, sort of, in network circles. And, you know, I'm the same way. Right?

It's like, hey, as a network engineer, you know, we design a network to a set of requirements, we build it, and then it just sort of runs. Right? And we all know that's not where the job ends, but in some sense, it's like, okay, well, it's there.

But it's not this, like, monolithic thing that's just gonna sit there. Like, it's gonna evolve. The demands are constantly changing. So to really understand that, you have to get into that data, and you have to be able to analyze it at scale and at speed, at near real time speed.

So, you know, for us, like, the next step is, we sort of understand our raw capacity needs, but then the next sort of missing piece is that service attribution, right? So, you know, Uber runs something like five thousand microservices, right? And understanding, you know, what is the growth rate of all of those individual services when it comes to network capacity is something we don't actually know today, right? So the only way to know that is to really dig into metrics and data, and use MLOps to understand it, before you can start getting into, okay, how do I predict and build for this?

And, you know, I think there is also a sense that, once we understand that, we can go back to service owners. And I think we're going to also uncover a lot of potential inefficiencies in how we're doing service to service routing, you know, where we're placing workloads. Because there is sort of this concept, which I'm sure you've heard, that, hey, you know, in some sense, bandwidth is free. Right?

It's not free to us as a network engineering team, but to our service owners, it's free today. And we want to enable them to operate at speed as well and deploy what they need to deploy and keep the business moving. But I also think we need to just expose to them that, you know, if you maybe made this change in how you're deploying your service, or where you're deploying your service, the upstream or downstream callers or callees, we would gain this much efficiency and we'd use this much less capacity, right? We don't have any of that insight today, but it's all there in the data, right?

I think that's the key, really digging into that and understanding that. And it's not something that I've done as a network engineer or network engineering leader in previous places. But, you know, it is something that's super important.
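
(A hedged sketch of what that service attribution could look like in its simplest form: roll flow records up to caller/callee service pairs and rank the fastest-growing pairs. The field names, service names, and byte counts are hypothetical, purely to illustrate the shape of the problem.)

```python
# Illustrative sketch: attribute network demand to microservice pairs from flow records.
from collections import defaultdict

def bytes_by_service_pair(flows):
    """flows: iterable of dicts like
    {"src_service": "rides-api", "dst_service": "pricing", "bytes": 123, "day": "2025-06-01"}.
    Returns {(src_service, dst_service): {day: total_bytes}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for f in flows:
        totals[(f["src_service"], f["dst_service"])][f["day"]] += f["bytes"]
    return totals

def top_growers(totals, first_day, last_day, n=10):
    """Rank service pairs by absolute byte growth between two days."""
    growth = {
        pair: days.get(last_day, 0) - days.get(first_day, 0)
        for pair, days in totals.items()
    }
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up flow records for two service pairs over one week:
flows = [
    {"src_service": "rides-api", "dst_service": "pricing", "bytes": 10**12, "day": "2025-06-01"},
    {"src_service": "rides-api", "dst_service": "pricing", "bytes": 3 * 10**12, "day": "2025-06-08"},
    {"src_service": "eats-api", "dst_service": "search", "bytes": 5 * 10**11, "day": "2025-06-01"},
    {"src_service": "eats-api", "dst_service": "search", "bytes": 6 * 10**11, "day": "2025-06-08"},
]
print(top_growers(bytes_by_service_pair(flows), "2025-06-01", "2025-06-08"))
```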

But your team owns this initiative, right?

Yes. That's correct. So we own this.

Okay. So how in the world does that happen? Considering, I'm guessing, and I don't know, correct me if I'm wrong, you probably don't have a bunch of, like, data analysts and MLOps engineers, and maybe you do now. But especially when you started, those aren't typical roles on a traditional network operations team.

Yeah, they're not. Right? And so, you know, we mentioned it earlier, we talked about sort of the organizational aspects of the question. So one of the advantages I've had, especially in coming to Amsterdam and building the organization here, is I brought some of those skill sets closer together.

Right? So, whereas the team in the US that I'm coming out of was strictly traditional, you know, routing and switching network engineers and optical network engineers, etcetera, the team in Amsterdam has some of that, right? So we have network engineers, but we also have some very skilled software engineers within the same team, within the same organization.

Now, that will get us, you know, a certain amount in terms of this journey. I think where we really need some help is on sort of the machine learning and AI skills. And so, you know, within Uber, there's a very, very good team that works on machine learning, and some of them are here in Amsterdam as well. So up to this point, we've been sort of borrowing time from them, but we'll see how this sort of partnership with them, you know, evolves over time.

They're very interested because it's a completely new use case for them.

Okay. Yeah.

And for us, you know, none of us are machine learning engineers within my team; however, I think we have some people who are very interested in this area as well. So we're trying to collaborate across organizations to utilize what they've already built. Because, as you can imagine, Uber has a very sophisticated machine learning apparatus and systems in place. So how can we leverage that for the network? Right.

Yeah. What you're already doing on the data analysis side, you know, has nothing to do with your IT operations. Those are your data analysts or whatever you guys call them. Yeah.

You have that in house. How do you leverage that same skill set, even if it's not the same people, and apply it to network operations? And, you know, the thing is, and I've been doing this for quite a while, it's a different problem, like you said.

And I'm sure it sparked some interest, because it's not a static data set. You're not looking at this big two hundred gig file of static, singular type of data and then applying a model to it and saying, all right, let's run some regression algorithms and then see where this is going to go over this time series. It's, like you said, real time or near real time data that you're dealing with to do something interesting, whether it's prediction or looking for correlation or something like that. Now, before we get into that, I'd like to know, if you're able to share: you described training your own model, but what kind of models are you applying, if you're able to speak to that?

Are they traditional ML models like the regression family or time series models, that sort of thing?

Yeah. So so it's sort of all of the above. Right? So regression family, time series.

There's some newer ones that they've been working on, which I'm not an expert on, so I won't comment on them. But, you know, if you think about it, that team is working on stuff like we are discussing here today, but also, you know, sort of on the operational side. But they're also working on a lot of very sophisticated machine learning prediction models for things like matching, right, riders and drivers. So actually on the production side, right: restaurants, recommendations, all of those things.

Now, those are very specific use cases for enabling the Uber business. But if you think about it, you know, some of those may be interesting to us. Right?

Like, if you're looking at, you know, what is the best route for a vehicle within a city, you know, is that kind of similar to how do I route network traffic across my network?

Yeah.

Maybe, maybe not. Right? So those are areas I really wanna dig into with the team, like, to see how some of these super sophisticated things that they've been working on for years could potentially be sort of simplified for our use case and applied.

Okay. Well, let's move on to some of those hurdles that you faced once you settled, you know, your team settled on, this is the direction we wanna go, and you were able to allocate some resources from other teams. You said you partnered with other teams. Maybe, you know, you were able to put out some job recs and hire some internally.

What were some of those initial hurdles? I mean, some of the difficulties you mentioned thus far were, like, kind of the data engineering stuff. How do we get a handle on all this? So how did you handle that?

And then any other kind of challenges that you faced?

Yeah. So I think there's a couple of challenges. One is sort of organizational, which is true anywhere. It's like trying to get alignment and buy in from, especially, you know, another team, another org that has a very important road map already defined. Like, how do you, you know, try to validate that what we're working on is worth the investment, especially because in the early days we don't have a well defined sort of ROI in what we're trying to do on our side. Now, I think where that argument became easier is that, you know, the network at Uber in most places is sort of the most foundational service, if you're thinking of it as a service.

You know, obviously data centers, power cooling, all those things, but I think it's sort of from the service layer, you know, network connectivity is that foundational layer. And so there is a lot of interest in helping us become more reliable in what we do, and also reducing blast radius when there are incidents, reducing our mean time to mitigation when those incidents do occur. So I think organizationally, the alignment initially was a little bit difficult, but we got over that. I think the next thing that is difficult for us is really the cleanliness and accuracy of the data that we're actually looking at, right?

So we're collecting tons of metrics from all kinds of, you know, every network device we have, what is actually useful for the problem we're trying to solve, right? There's this tendency for everybody, myself included, it's like, we just want to collect as much of everything as we can, where a more focused approach might be better for something like this. So we're really trying to understand, okay, let's take our network, let's break it down into its segments. Let's understand, you know, let's try this out somewhere where it's both impactful to our reliability in a positive way if we solve it, and it's also manageable, right?

So we really focus on our backbone, just because, you know, if we started looking at data center fabrics, there's, you know, hundreds and hundreds of devices, thousands of devices. Our backbone is a much more manageable number of devices that we can look at. And then we really understand the metrics that we're pulling out of those devices, the metrics that we're pulling out of our black box monitoring from the ping fabric that we're running, all of those things, and then really combine it and pull out the metrics that actually matter for both reliability and capacity in building models.

Yeah. And I have heard from data engineers over the years who have made the comment, like the off the cuff comment, that feature engineering, selecting what kind of data is necessary for this kind of task, is almost like an art. There is experience involved. So we talked about how do we get that domain knowledge out of the network engineer's head and into an algorithm. The data engineers are doing the same thing.

They're doing the same thing a lot of the time.

And understanding, you know, what is networking? What kind of metrics do we need for this kind of problem? And do we really need to get all of these? Because it slows down running the model like crazy, and we need the answer yesterday.

So, yeah. So, you know, from a data perspective, are we talking about almost entirely, like, hard metrics, streaming telemetry from devices, looking at routing tables, flow data, and all that kind of thing?

Or is it going outside of that into maybe even unstructured data like config files and know, verbose syslog, things like that?

Yeah. So it's more of the former. But we are interested in the latter. Right?

And that's sort of where the hope, at least, for agentic AI comes in: for sort of the unstructured data, the syslogs, etcetera. But the hard metrics we're looking at, you know, whether it's from SNMP, OpenTelemetry, those sorts of things, or our ping fabric that we're running, you know, across all our zones, which generates synthetic traffic to capture things like latency and loss. So all of those metrics we have today, and we're getting those, and we're learning on those. We're, you know, paging on call engineers based on sort of thresholds, but it is a very static thing that we want to get more dynamic in.

What I mean by that is, you know, we're measuring latency and loss, and we have certain thresholds, you know, that fit within the SLA that we're guaranteeing to our service owners. However, what we've seen is that with some of those static thresholds, there could be impact that doesn't rise to the level of an incident as defined by Uber engineering, but it's still a degradation in the network that we need to understand.

Right.

And the problem is keeping up with those thresholds and keeping those thresholds accurate, because those change over time, depending on, you know, what services are deployed where, the sensitivity of those services to latency and loss, new services coming online. So even though our SLAs are well defined upfront, we're finding that, you know, the actual performance of our network is variable, and we have to understand how that impacts services. And so the first step for us is, like, really actually understanding what is the variable performance of our network, right? Because we're sort of at the point where we're collecting a bunch of metrics. We're visualizing those like everyone, you know, using Grafana, etcetera. We actually have some other observability tools we're using internally.

But at that point, there's something that doesn't rise to the level of paging an engineer, but we'd still like to understand, like, what is that interaction that's less than perfect, but not alerting, between the network and the services that depend on it. So that, to us, is also an area that we're really trying to invest in. And the goal of this is both to provide a better service to our service owners, the internal services that run on Uber, but also for our own engineers' on call sanity, right? So many teams at Uber, and many teams across the industry I know, are working on this concept of sort of an on call copilot or an AI, you know, root cause analysis agent. It's been called many different things, but we're also sort of very interested in working on something like that, to enable our network engineers to understand what is happening on the network before it reaches that point of, you know, a major outage.
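
(For illustration, a minimal sketch of moving from a static latency threshold to a learned baseline: an exponentially weighted mean and variance per path, flagging samples that look anomalous relative to recent history even when they stay under the paging SLA. The smoothing factor, the 50 ms SLA, the 1 ms noise floor, and the probe values are all made up for the example.)

```python
# Illustrative sketch: "weird vs. bad" latency detection with an adaptive baseline.
import math

class EwmaBaseline:
    def __init__(self, alpha=0.05):
        self.alpha = alpha   # smoothing factor for the exponentially weighted stats
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Feed one latency sample; return its z-score against the baseline so far."""
        if self.mean is None:
            self.mean = x
            return 0.0
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        std = max(math.sqrt(self.var), 1.0)  # 1 ms noise floor avoids early over-triggering
        return diff / std

PAGE_SLA_MS = 50.0   # hard, static threshold that pages an engineer (hypothetical)
baseline = EwmaBaseline()
for latency_ms in [12, 13, 12, 14, 13, 12, 22, 23, 24, 25]:   # synthetic probe results
    z = baseline.update(latency_ms)
    if latency_ms > PAGE_SLA_MS:
        print(f"{latency_ms} ms: page on-call (SLA breach)")
    elif z > 3:
        print(f"{latency_ms} ms: 'weird' drift (z={z:.1f}) -> track it, no page")
```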

Yeah. And the reality is that all of these tools and mechanisms are gonna be part and parcel of an entire system at some point. Yeah. You know, when we do talk about agentic AI systems and there's some sort of tool calling happening, some of those tools are applying a model to some data.

So you're building the foundation, and these things, these worlds, will come together. One of the things that you mentioned, though, reminded me of something someone said to me years and years ago, years and years as in, like, about a decade, twenty seventeen, twenty eighteen, a little less. I was speaking to a colleague with a PhD from Texas A and M, or, no, I don't remember what school. Anyway, a PhD in computer science with a specialty in machine learning, not a network engineer, but his skills were being utilized at an early stage in what we're doing in networking.

And the way he explained it to me was it's the difference between finding something or understanding what's weird and what's bad. Yeah. And so what's bad is you pass a threshold, hard down, degradation, users are affected. You know, obviously, send out the fire trucks, and it's it's a, you know, p zero, whatever system you guys use.

Weird is, you know, I got this four hundred gig link, and the capacity has been creeping up slowly. It's still minuscule. But why is it creeping up? It doesn't affect anything.

No alerts triggered, but it's something that the system caught that no human would likely have caught again because it, you know, it's not hitting a threshold, but there is a pattern emerging. And now that's when you start to get into like, well, we can apply ML to do that kind of stuff. Really interesting that you said that because it reminded me of that conversation I had with him those years ago. And here we are doing it.

Yeah. Yeah. Absolutely. And it is, you know, it's a very interesting thing, because if you think about a network design, no matter how much redundancy you build in, how much resilience you build into your design, there's also, like, second and third order effects that happen when there's a failure that are also difficult for a human to anticipate or understand. So that's where I'm hoping that some of this analysis could tell us, like, okay, hey, you guys have added capacity here, but in this sort of complex, you know, multi step failure scenario, it's going to result in something really bad happening over in this other area because of the way your routing is set up or whatever.

Right? And there's no way, again, for, like, a human to really be able to understand those second and third order effects. Whereas I think, you know, especially when you have a lot of rich past incident data like we do, we could do some analysis and really understand, you know, try to predict where some of these failures could happen. Right. You know, the ones that aren't the obvious ones. Yep.

Yeah. I've been there. Why is it that when I make this DNS change here in New York, you know, my London data center latency, like, drops way down on this one particular link? That makes no sense to me.

Why is it happening? Then you flip it back and it gets resolved, then you flip it back and it starts happening. I've been there. Yeah.

And it kinda boggles your mind, and then you sorta, like, leave it and walk away. But I wanna transition now for the remainder of our episode, unless you have something compelling you really wanna talk about, of course. But I do wanna transition to the kinda layer eight stuff now. Sure.

I love talking models. I love talking packets. But I wanna talk about people and organizational stuff, including, you know, budget and things like that. But I'm gonna ask a pointed question, so feel free to sidestep it and backpedal as much as you like.

Did going in this direction actually change the nature of your team, perhaps even displace any engineering where you had to change job recs, move roles around, that sort of thing?

Yeah.

So I mean, that's a great question. Right? And it is still an experiment that's in progress, I would say. So, you know, one of the reasons I came to Amsterdam is that I would have this chance to sort of bring multiple disciplines into one team and one org. So previously, you know, we had a group of network engineers, we had a group of software engineers, and they were under different managers, even at one point, different directors, who were working in the networking space. So the understanding across those teams was not where it needed to be. So coming to Amsterdam, we brought that all under myself.

But it is a work in progress, because I have network engineers who are on a spectrum from, I'm a network engineer, I love networking, I love layer three stuff, I don't want to learn how to code, I'm not interested in it, to network engineers who are on the cusp of being considered software engineers. Then I have software engineers, and of all the software engineers I hired in Amsterdam, almost none of them had network domain experience, but they're very, very good software engineers and they're very interested in large scale systems. So trying to bring those two worlds together, you know, we're two and a half years into this experiment.

We made a lot of progress. Our software engineers are much, much more expert in networking than they were when they walked in the door. Our network engineers are moving towards more automation and coding. But there's still sort of this cultural difference between sort of a network engineer and a software engineer.

And I see it when I manage them. They have different sorts of needs and wants in terms of feedback, and they have different rituals. You know, software engineers, especially in a scrum agile environment, are very sort of regimented. I mean, they have their rituals that they go through in terms of sprint planning, estimations, reviews, all of those things. As network engineers, we tend to be much more sort of, like, fly by night, to put it one way.

It's like, you know, we're used to very operationally heavy work, right? We're firefighting, you know, a lot, and that's what we're used to. So changing the mindset to how do we think in systems and not just networks, and how do we think of our services, is what I'm really focused on. But it is hard, you know, and I think as this agentic AI environment evolves, that's going to become even more important, that we understand, you know, what people are interested in working on, that we give them the tools to do it and to be successful.

But, you know, as I said, there's network engineers who are not interested, and that's totally fine. Like, I want to build large scale networks. And maybe that's at Uber, but maybe, if they want to do that, they end up going somewhere else, which has, you know, happened recently a couple times, where we had engineers who were just like, I don't want to learn how to code. I don't want to work on cloud networking.

I want to build networks, right? And so they go to a place where they can do that. And that's that's totally fine. I totally understand that.

Yeah.

And those kinds of very, very focused mindsets are necessary, you know, perhaps fewer than in previous years, but certainly still necessary. You need someone to really understand MPLS and BGP at a very deep level and focus on a big backbone.

And, you know, I tell my team, I really want them to be expert in at least one thing and then generalists in many things. Right? So I want that network engineer who is an expert in MPLS, but I also want them to be curious about, okay, well, this is how Google does MPLS, or this is how our network automation works around it. So that's what I'm really striving for.

Okay. I have a two part question for you, because they're just so related. So answer however you like. But how do you and your team measure success with regard to the application of ML to network operations, to this network data? And then, if you're able to, what actually was successful? And, you know, we talked about capacity planning a bunch, so I assume that there's been some success there. But I'm also interested in what failed, what didn't work.

Yeah. So I think for the capacity management, it's sort of like proving a negative. We have not been in any capacity crunches since we've introduced this system. And we have, you know, regular auto generated reports that come out that give us the predictions, and we apply those to our work streams in terms of avoiding future capacity crunches. It's worked pretty well, but as I mentioned, we're sort of at the beginning of it still. You know, we want to implement this service layer on top of it, where we understand the service utilization.

I think the other one that's really interesting, that we still need to figure out, is sort of more the operational work, right? So think about alert correlation, for example. We, like everyone, have a hellish sort of on call load, and most of it is noise, right? Ninety eight percent of our alerts are probably noise where you're not actually taking action, but they're important informational things. So how do we get that out of PagerDuty and into some sort of tool that allows us to track failures, and make sure we're tracking those failures through resolution, that don't rise to the level of, you know, an incident or an outage? And the truly important on call stuff, you know, surfacing that.

You know, so I use the example of, like, a link failure. So a link failure, we alert on it probably six different ways, right? We have BGP alerts, you know, that this neighbor went down, and we get an alert from both sides, we get a physical link alert, we get a device syslog message alert, and, you know, so we're probably getting three alerts from each side of that link. Now, the on call engineer doesn't need to get paged for all those six alerts, right? They just need to understand this link went down, that's why it's alerting me, and then whatever the remedial action is. So the first step is correlation, and then what we're really interested in is then auto remediation, and now, you know, calling in.

So even before AIOps, we're thinking about, let's say we have an important backbone link that's flapping: how do we take that alert signal and then, for example, just automatically drain the link? Now, you don't need AI to do that, right? We can do that with ML and with the proper metrics and with the proper triggering mechanism. So those are things we've been working on and we'll continue to work on. And then we'll see where AI can add to that, if it does.
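
(A rough sketch of that correlation-then-remediation flow: collapse the six-way alert fan-out for one link into a single event, and nominate the link for a drain when it flaps repeatedly inside a window. The alert fields, window sizes, and flap threshold are assumptions for illustration; a real pipeline would add change-safety checks before any automatic drain.)

```python
# Illustrative sketch: alert correlation per link plus a simple flap-to-drain rule.
from collections import defaultdict

CORRELATION_WINDOW_S = 60   # raw alerts in the same bucket collapse into one event
FLAP_THRESHOLD = 3          # this many down events within FLAP_WINDOW_S -> drain candidate
FLAP_WINDOW_S = 900

def correlate(alerts):
    """alerts: list of dicts like
    {"link_id": "dc1-ep1--bb1", "type": "bgp_down", "ts": 1700000000}.
    Returns one synthetic event per (link, time bucket), whatever the alert fan-out."""
    events = defaultdict(list)
    for a in alerts:
        bucket = a["ts"] // CORRELATION_WINDOW_S
        events[(a["link_id"], bucket)].append(a["type"])
    return [
        {"link_id": link, "ts": bucket * CORRELATION_WINDOW_S, "raw_alerts": types}
        for (link, bucket), types in events.items()
    ]

def drain_candidates(events):
    """Links whose correlated down events exceed the flap threshold within the window."""
    per_link = defaultdict(list)
    for e in events:
        per_link[e["link_id"]].append(e["ts"])
    return [
        link for link, stamps in per_link.items()
        if sum(1 for t in stamps if max(stamps) - t <= FLAP_WINDOW_S) >= FLAP_THRESHOLD
    ]

# Six raw alerts (BGP both sides, physical, syslog, etc.) for one link, three flaps in 10 minutes:
alerts = [
    {"link_id": "dc1-ep1--bb1", "type": t, "ts": 1700000000 + flap * 300}
    for flap in range(3)
    for t in ["bgp_down_a", "bgp_down_b", "link_down_a", "link_down_b", "syslog_a", "syslog_b"]
]
events = correlate(alerts)
print(len(alerts), "raw alerts ->", len(events), "events; drain:", drain_candidates(events))
```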

But ultimately, the goal for an agent system, and it sounds like that's what's next for your team, is to do that same thing: to use whatever predictive analytics and conclusions you have, and then take some sort of action, and perhaps do the activity of operations more programmatically and more intelligently. So really, all the same stuff that we've been talking about, but let's go even better. So then, for folks that are going down this road, you've been down this road, and you're now continuing to go down this road with the more recent advancements in agents and all the different things that go along with that.

What advice would you give a network operations team that's looking into doing this for their own organization, whether it's traditional MLOps or the latest AI agent system? And I ask in terms of, like, you know, staff, maybe level setting on expectations, like we're not talking about deploying a magic cauldron and crystal balls, you know, there's an expectation that we need to set. Also budgeting as well, and what kind of tools should they be looking into?

And one last thing that's not in our outline. You mentioned getting software engineers caught up to speed on networking, and then network engineers caught up to speed in programming and automation.

What, you know, for those folks, what would you recommend they look into?

Maybe even, like, book titles or courses or a mindset?

Yeah, so I think I'll start off with saying, you know, if I would have advice for somebody who's considering this in the network world, the first thing I'd say is really, sort of, go back to your network design and your requirements, and then your services and your stakeholders, right?

Because, you know, I think when we approached the problem initially, we approached it as network engineers. So we're looking at, okay, where does the network fail, from a component level all the way up, without really tying it to how does that impact Uber as a service? So I think starting at the SLI level, really understanding, what am I offering when I'm offering my network to services? What am I guaranteeing?

And then it really clarifies for you what failures are important and what are less important, right? So if I have a large scale fabric and one link fails, my service owners actually don't care about that at all, because it doesn't impact their service at all, because of our design and because of the resiliency in their service, etcetera. As a network engineer, I actually don't care about it that much if it's a single link, but I want it to be automatically ticketed somewhere. I want someone in the data center to go replace the link and eventually get it back into service, but that doesn't need to wake me up at three a.m., depending on where in the network this happens. So I think really starting from a service SLI level and understanding, this is what impacts our services, this is what's important to really apply these techniques to, will bring a lot of clarity to it.
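
(A small sketch of that "start from the SLI" framing: classify a failure by what it does to the service-facing SLIs and the remaining redundancy, rather than by which component broke. The thresholds and link counts are illustrative only, not anyone's actual policy.)

```python
# Illustrative sketch: SLI-driven severity classification for a network failure.
def classify(fabric_links_total, fabric_links_down, loss_pct, latency_ms,
             sla_loss_pct=0.1, sla_latency_ms=50):
    # SLA actually breached -> page, regardless of which component failed.
    if loss_pct > sla_loss_pct or latency_ms > sla_latency_ms:
        return "page"
    # Redundancy eroded enough that the next failure would breach the SLA -> urgent ticket.
    if fabric_links_total - fabric_links_down <= 1:
        return "urgent_ticket"
    # Single link down in a wide fabric, no service impact -> background repair ticket.
    return "repair_ticket"

print(classify(fabric_links_total=32, fabric_links_down=1, loss_pct=0.0, latency_ms=12))  # repair_ticket
print(classify(fabric_links_total=2, fabric_links_down=1, loss_pct=0.0, latency_ms=12))   # urgent_ticket
print(classify(fabric_links_total=32, fabric_links_down=4, loss_pct=0.5, latency_ms=80))  # page
```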

It will also help you get buy in and budget and all those things, because you can point to actual outages and be like, if we had invested the time to automate this, we could have prevented this outage, which would have saved over two million dollars or whatever. It makes that conversation much, much easier.

And to the second part of your question, you know, I think for software engineers who are interested in networking or vice versa, network engineers interested in software engineering, there are a lot of great blogs out there. I mean, obviously, there's conferences around this. Autocon is a great one.

There's a lot of great blog posts across Meta and Google, etcetera. So I would just start with maybe some of the hyperscalers and understand how they think about networking, and then take what works for you. And it depends on your environment. You know, not everyone's running a super large scale network like they are. But I think a lot of the principles are what really helps clarify your thinking.

Yeah. And I love what you said about focusing in on actual, real, high impact initiatives. You know, you're not trying to solve everything with ML tomorrow. It's like, we have these two types of incidents. Yeah.

And here are specific examples. Let's just tackle that. What's the specific data necessary for that? What's the specific pipeline?

What are the specific models? And then let's just build that. Yeah. Let's succeed. And then you can always add to your data sets and add to it, you know, assuming that you're building it in a modular fashion with proper software development.

Yeah, absolutely. So Vishnu, we're at time now. And I do want to say that I really enjoyed having you on the podcast, because I love this topic and where we're going with networking, not just with AI agents and all that, but even with traditional MLOps, something that I've been interested in for years. It is so cool to hear what you and your team have been doing for years at Uber and then to see actual, real success, like real value, from all of that.

So thank you for joining today.

Yeah. Absolutely. Thank you very much, Philip. And, you know, I know you've been interested in this for a long time, and the industry is sort of coming around to what you've already been talking about and preaching for some time, which is that, you know, as network engineers, we do need to think about these things, irrespective of AI or not. It is an important toolset that's there when it comes to ML and data, that we need to apply to the network, and it can really unleash some awesome things. So thank you very much. Really appreciate it.

Of course, my pleasure. And to our audience, if you have a question, comment, or concern about today's episode, please reach out. I always love to hear those. You can reach us at telemetrynow@kentik.com. And so for now, thanks so much for listening. Bye bye.

About Telemetry Now

Tired of network issues and finger-pointing? Do you know deep down that, yes, it probably is DNS? Well, you're in the right place. Telemetry Now is the podcast that cuts through the noise. Join host Phil Gervasi and his expert guests as they demystify network intelligence, observability, and AIOps. We dive into emerging technologies, analyze the latest trends in IT operations, and talk shop about the engineering careers that make it all happen. Get ready to level up your understanding and let the packets wash over you.