Telemetry Now  |  Season 2 - Episode 30  |  February 6, 2025

Tales from the Hot Aisle

Have you ever had that heart-stopping moment when a single command brings down the network? In this final episode of our customer series, Philip Gervasi is joined by Avi Freedman, Joe DePalo, and Jezzibell Gilmore to share some epic network outage stories—from fire suppression disasters to automation gone wrong and catastrophic BGP mishaps. Packed with hard-learned lessons and a few “hold my beer” moments, this episode is a must-listen for any network engineer.

Transcript

If you've ever run a network, you've probably been there too. You're at the terminal, you enter a command, and then pings stop unexpectedly. And it feels like your heart stops too. Right?

I know I've been there, only to have my inbox and cell phone blow up with calls three seconds later asking, what in the world is going on with the network? Well, in today's Telemetry Now, we're rounding out our customer series with Avi Freedman, Joe DePalo, and Jezzibell Gilmore with some war stories, some stories about how we brought down a network accidentally. At least I hope it was an accident. Or maybe we were part of some epic network outage that sticks with us forever.

This is gonna be a fun one today. I'm Philip Gervasi, and this is Telemetry Now.

Avi, Joe, Jezzibell, thank you so much for joining the podcast again today. It really is great to have you back. So in thinking about today's show, you know, it kind of occurred to me that there is such a wealth of knowledge and information and experience in this group. Yeah.

Probably turning a physical wrench many years ago, but, you know, in more recent years, turning that metaphorical virtual wrench of networking, fixing, designing, building networks and network-adjacent technologies and all of that. There has to be, in your file cabinet of experiences and stories, some epic war stories of outages that you lived through, maybe that you caused. Those are always great. Right?

Not that they're great that, like, you experienced them, but they're fun to talk about in hindsight, and the lessons that we learned, or maybe some epic failure, epic outage, epic catastrophe that you were part of that you'd like to share. And in any case, I have to assume that there's at least one that you're thinking of right now. And if you're in the audience, probably you as well. I know I can think of one.

And so that's what I wanted to do today, kinda just go around the group and share what we're comfortable with and hopefully have a good time doing it. So why don't we get started?

You're drumming up old, old feelings, but I'll go first. So I've been doing data center and network operations for a long time, so I could fill several podcasts with all the various incidents. But when you bring that up, there is one very specific one. Like most good outages, I was asleep in the middle of the night, and my operations leader called me to tell me that the cooling system, the fire suppression system in the data center, had gone off.

And so we had our core data center, and there was a machine that overheated and caught fire. And, typically, those fire suppression systems have two triggers. In this case, the second trigger didn't work, and so the sprinklers went off in the entire data center. And so I actually keep the sprinkler on my desk.

As a reminder. I'm showing it on the screen now. I keep it with me.

But the interesting part of this story was that every data center has a big red switch that turns the power off. And the fire company wouldn't go into the data center because the electricity was still on and the water was pouring out of the data center. So one of my guys, without my knowing, used cardboard to slide over and press the button, and nothing happened. It turned out that the kill switch wasn't wired to anything, and so the electricity was still on.

And as anybody knows, water and electricity don't go well together. So the power company had to cut power to the block to take the power out of the data center. And so, needless to say, we had thousands of machines that were soaked with water. So we went to Best Buy and Costco.

We bought dozens of blow dryers, and we took machines apart. And it was a many, many million dollar incident. It was very hard to explain to customers and to the board, but it was the one thing, you know, every little outage, every packet loss or fiber cut or whatever, was always put in the perspective of: I didn't sleep for almost fifty hours straight as we went through this scenario. So it was a hell of an incident.

So that was my story. It's one that sticks with me and puts in perspective all the other incidents that I've had in my life. So, you know, the day the data center filled with water.

And we were too cheap to use the, what's, Avi, what's the suppression, the liquid they use?

Oh, no. I have a mask. I just had it in my... The data centers... it's escaping me.

Yeah. It's gone.

Halon or something like that?

Halon. Halon. We were too cheap to use Halon, so we used water, thinking, well, you know, if it's a fire... But, needless to say, it was a hell of a few days. So that's mine. What about you, Avi? You got one?

I have a number of them.

I think the one which may most closely have some lessons... I'll skip over the lessons of the "fock you" date, that was not from fiber, or things that you think are redundant that aren't, and go more to the logical. So, the danger of a confusing UI and its potential implications for leveraging AI. There are multiple camps of preference for how to configure the devices that do our networking.

This is back in the day. I think Juniper had just come out, and there was gated, but it was really just Cisco IOS, which wasn't even really an operating system, more of a program loader. So we were doing some things at this company that Jezzibell and I worked at that were probably considered a little risky. I always say don't use BGP weights. Another thing is be very careful redistributing between protocols.

So we were redistributing from the border gateway protocol into our internal protocol, OSPF, which, I think I know where this is going... Stands for, have to put some protection factors as well as... Right.

Nor does BGP stand for Being Good Philadelphians. But the border gateway protocol into the open shortest path first protocol.

And the way you tell a Cisco device to remove something is you say "no".

So we had a "router ospf 22" with "redistribute bgp 4969 route-map foo".

And so someone was like, well, let me turn off the redistribution.

They meant to stop the whole thing, so they said "no redistribute bgp 4969". But the first time you do that, all it does is take the route map off.

So you then wind up with this firehose taking all the routes on the Internet and putting them into your internal protocol, which is not a good idea, because these internal protocols, for the most part, need to run these very complex calculations, which have gotten better but are still expensive.
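(A rough sketch of the sequence Avi is describing, in Cisco IOS syntax of that era; the OSPF process ID, BGP AS number, and route-map name come from his telling and are illustrative rather than an exact config.)

    router ospf 22
     redistribute bgp 4969 route-map foo
    !
    ! Intending to remove the redistribution entirely:
    router ospf 22
     no redistribute bgp 4969
    !
    ! As Avi notes, on IOS of that day the first "no" only strips the
    ! route-map, leaving "redistribute bgp 4969" behind unfiltered, so
    ! the full Internet table pours into OSPF.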

So we took the network down so hard that, like the Facebook incident, you basically needed to reboot all of them at the same time, in a synchronized fashion.

So this was painful, you know, because we had an international network at the time, and out of band, I will say at this point, was not, like, completely set up for everything. It was not like we had 4G out of band back in ninety-nine. So the lesson learned comes back to some of the things about, you know, trust but verify. There's a lot of nuance to this.

Maybe LLMs can be good for explaining some things, but if you don't have humans involved in the complete order of operations for automation, or a really good understanding of the quirks of these things, you can run into big problems. So that was probably not the longest outage, but one of the more painful ones that I've been involved in.

And that's, yeah, it's a great lesson learned.

How many times have you run across that, especially back in the day, where someone would announce the Internet over their transit link and... We call that being MAE-clueless.

Yes. You know. Yep. MAE East, MAE clueless. Very clueless. I love it.

Jezzibell, how about you? Do you have one that you'd like to share?

Well, maybe not as spectacular as Joe and Avi's, but it impacted me personally tremendously.

We had just launched the network operationally at Packet Fabric, and we were doing an upgrade and hit a bug on our platform. And we had designed the network to be fully automated. So, you know, the update of the network was automated, and it was distributed across all the devices.

So when we hit the bug, it automatically shut down all the devices and then put them into a reboot cycle by themselves.

And so we couldn't even actually get in during the cycle to go in and reboot each and every one of the routers.

We had to literally send people on-site to stop the cycle.

And luckily, we had just started operations of the network, so we only had, like, a handful of customers on there, so we did not have as huge an impact as we would have later. But the lesson was learned, as Avi said: you know, trust but verify your automation.

But it hit me hard because I was the customer-facing executive. So I was personally on an apology tour to all of the customers.

We've all been there. Flowers and candy, we've all been there.

I did a lot of that, especially in the nineties, when... yeah.

IOS is called an operating system, but it was really a program loader.

So it was not a lot of protection from bugs taking things down.

But the good thing is that we learned early, and, you know, with all that, it never happened again, at least while I was at Packet Fabric.

So Yeah.

Never never waste a crisis. Right?

Network automation breaking the network at scale really, really fast.

Oh, absolutely. Totally.

It is good to get a little experience with that early.

We went a comparatively long time without a major outage at Akamai, especially compared to networks at the time. And when the first one happened, you know, what do we tell people? How do we communicate?

You know, it can be traumatic, actually. More traumatic if you have had none of them until you get to scale.

So the story that I wanna share, it isn't, like, super epic. I didn't take down, like, a global network or have a big routing redistribution problem or something like that.

But it came to my mind, and it has stuck with me for twenty years because of how impactful it was on me and, like, my philosophy of how I approach a network engineering task or project. Most of my career has been with VARs. So I've been, you know, going project to project, customer to customer, whether I was working on data center routing or campus wide area networks, you know, installing a fleet of firewalls for an organization, whatever it happened to be. I live in the northeast, in the New York area, and I had a larger law firm in our area as a customer, and I was replacing their core switching.

And I go into one of their buildings. It was actually an MDF switch, so it was the core for that building. It wasn't, like, in their data center. And I have everything staged so I can just move the cables over.

Once I have the configurations on the new boxes, I can move the cables over and then move over my gateways and all that kind of stuff. And so, what I noticed when I was working in there one day was that the racks were not bolted to the floor. And I'm like, that's not right. It's not a big deal.

Whatever. And, you know, it's not like the racks were top heavy and toppling over. I didn't really see any risk there. But I did notice that the racks were not absolutely perpendicular and parallel to the floor tiles.

You know how you can see, like, the lines of the floor tiles, the grout lines, right, how they meet? And I am so OCD about everything being neat and clean with my work in the data center, with cable management, with all this stuff, right, and clean configs even too, that that really bothered me. So, after I do the important work and get everything cut over, and everything is testing and working fine, and all the cables, all the wires, are now in the new switches, the old switches are just sitting there. And I left them on in case I had to roll back.

Right? So they're just sitting there.

I literally grab that rack with the new switches. Right? I literally grab that rack and start to, like, jerk it left and right so I can line it up with the tiles. This is the stupidest thing. Well, as I'm doing that, all of a sudden I hear pop, pop, pop, pop. And what happened was, you know, one of the times that I jerked the rack, it kinda jerked along the ground too much, and a bunch of the Ethernet cables got ripped out, popped out of the interfaces.

I did not take down, like, the data center and the entire law firm's network overall. That building had a lot of issues. Sure. But, you know, I didn't take the whole thing down.

But that stuck with me even to this day. That was, like, fifteen years ago, I think. I don't know. Because it really taught me the lesson to be careful and to be very deliberate, not just careful and cautious, but deliberate and thoughtful with what I'm doing as a network engineer.

Yeah. With, like, cable management and, you know, going into something like a bull in a china shop, not a good plan. But going into something deliberately, thoughtfully, carefully, taking my time.

And that's true with, yeah, like, racking stuff and moving things around inside of an IDF or an MDF, but also with my configurations, with my planning. So working with the project managers, with the customer, and really stepping through things cautiously and methodically.

Because prior to that, you know, as a network analyst, junior network engineer, whatever my title was, I really was a lot more like: jump in, I know the commands, let's go. Just got my CCNA or CCNP or whatever. So, you know, I wanted to just start flying on the keyboard.

But that lesson, or rather that outage that I caused, that incident that I caused, really taught me a lot about being a better network engineer.

Change management's a good thing.

I almost did that in Vinny's data center in the Virginia area.

I leaned against a rack, not realizing that four racks in a row were bolted to each other and to the tiles, but not into the floor. And the whole thing... we saved it, but it was almost a bad situation for MA.

Yeah. There are just too many examples of times where I did, like, a shutdown on the wrong interface and took down, like, a building or something. Or, jeez, I remember that one. That was a high school in my area.

I was doing a project for a school district and took their high school offline. Their data center was in the school, like, the business office, which was a different building on the campus. Totally took their high school offline. A high school of, like, three thousand students, a really big high school, like eight hundred kids per grade.

That wasn't good.

I've done, like, an incorrect prefix list, broken routing. I've done things like forget the reload command. You make a change. Oh, no.

No. No. You forget that you had the reload command scheduled. You make the change, everything's working, and you walk away.

And then the reload command, you forget to cancel the reload, and it takes effect and reboots the router. Oh my goodness.
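(For anyone unfamiliar with the safety net Phil is describing: on Cisco IOS you can schedule a reload before a risky remote change and cancel it once you've confirmed everything still works; the ten-minute timer below is just an illustrative value.)

    ! Schedule a fallback reboot before making a risky remote change
    reload in 10
    !
    ! ... make and verify the change; if you cut yourself off, the box
    ! reboots to its saved startup config in ten minutes ...
    !
    ! If all is well, cancel the pending reboot -- forgetting this step
    ! is what reboots a perfectly healthy router -- then save the change
    reload cancel
    copy running-config startup-config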

You know, not that many. Okay? I did get better over the years as an engineer, so it's not like I have only bad stories.

But certainly, those stick in your memory a lot more than all the good ones. Right? All the successful cutovers that, you know, you forget about because they were successful.

Phil, you should do a "hold my beer" episode for everybody to talk about their outages. Right?

Because... Oh, that would be an epic episode, Jezzibell.

You're right. Totally agree with you there.

We used to have a batlight that Gil got at a garage sale that was in the office at NetAccess, and we rigged it. They rigged it. They would tail the TACACS logs and rig it to light up when I was configuring the network.

So as soon as I did, it would light up so they knew that there was danger if someone called me.

God. I love that.

The network... there are too many stars in my traceroute. And I'll be like, Avi, what are you doing?

Avi's on. Avi's touching it. We had a cheese hat, like a Green Bay Packers cheese hat. And if you caused a user-caused outage, you had to wear it at your desk.

And so it rotated around, so, basically, you didn't wanna earn the cheese hat by being the one that actually caused the problem. I love it. Yeah. But they don't do that stuff anymore, I don't think.

Now they get fired.

Yeah. Probably. I mean, I do remember, years and years ago, the very first VAR I worked with, which was a tiny little MSP.

And so, like, after I got off the help desk and finally got, like, a junior network analyst position, whatever you wanna call it, there was a problem with the team not documenting configuration changes, even things like when we added a new device to one of our customer networks. And so, you know, our CMDB was ConnectWise. Right?

Our configuration management database. And so if you were working on a customer and you needed to know the password for a router, you'd go look it up there. If it wasn't there, well, whoever's job it was to put it in there had to bring in donuts. In the northeast, it's Dunkin' Donuts.

Had to bring in Dunkin' Donuts for the entire team the next day. So I was guilty of doing that a couple of times, but it was silly fun, and, you know, for a larger and growing team, sometimes even an expensive price to pay if you forgot to document your changes and add that stuff into our system. So, yeah, those things are important too. It wasn't an outage when I did that, but certainly an important thing that I messed up on, and that I learned from.

I almost ended up on Oprah Winfrey.

Oprah Winfrey? Alright. Let's hear this. Let's hear this.

So, yeah.

We were doing, this was early on, one of the first live events after Victoria's Secret.

It was the Oprah Winfrey book club with Eckhart Tolle, where they were gonna do, like, a Q and A after her show, and it absolutely melted down. It was, at the time, one of the largest events. We couldn't handle the intermediate caching.

We couldn't... the global networks like Comcast were... and so it ended up being a fail. Like, it just didn't go. So we get on the phone with her and her management team, and she wants me to come on the show and explain to her audience what happened, you know, to the Internet. It's funny, the same conversations Netflix is having right now, which is just: how come you couldn't handle it?

And so I fly to Chicago to go on the show. Like, that's how close I got. And I immediately said no, but the CEO was like, you're going on there, because you're gonna wear, you know, a Limelight shirt and talk about Limelight. And then, at the last minute, the producers cut it.

They didn't think it would be right for her audience, that Oprah Winfrey breaks the Internet. And so it stopped, but it was one of those things where, like, you know, people don't understand how the Internet works. And, you know, you can't put five hundred thousand people on in an afternoon and expect it to work.

So yeah. So sometime, I'll tell the story about when the phone company called my ISP asking when the Internet would be back on.

After they converted all our B8ZS T1 channels to AMI.

So, you know, I really do think that we could do a "hold my beer" episode like Jezzibell suggested.

Right? Just comparing and contrasting war stories. We could probably do a series of them. But on that note, it is time to wrap this one up. And this does actually wrap up our customer series with Joe DePalo, Avi Freedman, and Jezzibell Gilmore. So thank you all so much for joining me for this series of podcasts, ending today with what I think was a pretty fun discussion about some of our own experiences.

So for our audience's sake, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, I would love to hear from you. Please reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.