Kentik - Network Observability
Telemetry Now  |  Season 2 - Episode 6  |  June 20, 2024

Breaking the 50% Barrier: an RPKI ROV Discussion with Job Snijders

Host Philip Gervasi talks with Doug Madory and Job Snijders about the importance of RPKI in securing Internet routing. They explore the recent milestone of RPKI covering 50% of IPv4 routes, the process of route origin validation (ROV), and the role of ROAs. They also discuss the impact of ROA expirations and future advances in Internet routing security. Tune in to learn how RPKI contributes to a more stable and secure Internet.

Transcript

The resource public key infrastructure or RPKI is an important framework for helping to secure routing on the public Internet on a global scale.

It's part of an overall strategy to secure Internet routing and limit the disruptions that can be caused by route leaks and other BGP misconfigurations.

And RPKI has hit a milestone recently, covering fifty percent of routes in the IPv4 global routing table. This is good news. It says that we're heading in the right direction.

So in this episode of Telemetry Now, I'm joined by Doug Madory and Job Snijders, both prolific subject matter experts in this field, to discuss the importance of RPKI and this recent milestone, as well as some important aspects of RPKI that we might not have known before, such as how ROA expirations work. My name is Philip Gervasi, and this is Telemetry Now.

Doug and Job, thanks very much for coming on the podcast again.

Doug, I've seen you often. You're on pretty regularly. But, Job, it's been quite a while, so thank you for returning. Now, before we get started, and we are gonna get pretty deep into RPKI and some recent milestones and some heavy information as well, would you please give our audience a quick background of what you do and your background in this particular area, Job?

Sure. Phil, thanks for having me. Doug, super good to see you again.

My name is Job Snijders. I work for Fastly as a principal software engineer, and my specific area of expertise is RPKI, BGP, and X.509. I also contribute as an OpenBSD developer.

So anything routing security is what interests me and what I spend my time working on.

Right. Right. And, yeah, very important content considering the nature of the Internet today. So certainly top of mind for many people both in the networking space and in the security space.

Now, for the audience's sake, today's podcast didn't just spawn from the ether. There were two blog posts that Job and Doug wrote together that appear on the Kentik website. One is "RPKI ROV Deployment Reaches Major Milestone," so make sure to check that out. The other is called "Time's Up! How RPKI ROAs Perpetually Are About to Expire." Both of those are good reads and very informative, and really the foundation of where this podcast came from today.

So as we start to get into it, Job, I'm gonna start with you again if you don't mind. Can you give us a synopsis of what RPKI is? Maybe a little bit about why we have it in the first place and where we are with its adoption today.

Sure. Sure.

So to help the listener a little bit, I wanna introduce three terms.

Am I pronouncing the number three correctly?

I'm always confused, like, three beer and free beer.

Yep.

So term one, RPKI.

Term two will be ROV, and term three is ROAs.

And all three acronyms mean something slightly different.

RPKI is the cryptographic foundation on which routing security applications are built in the modern Internet.

Route origin validation, ROV, is the act of using this cryptographically signed information in order to make routing decisions.

So your routers receive lots of information from their eBGP peers, and they have to choose which paths to use and where to send the packets.

And then the third term, ROAs, or route origin authorizations, are cryptographically signed objects that reside inside the RPKI and inform the route origin validation process.

So a ROA is a tiny file of, you know, a few kilobytes that has cryptographic signatures, so-called X.509 certificates, and some other metadata.

This tiny file resides in a framework or distributed database, the RPKI, the resource public key infrastructure.

And those two together are what help BGP routers execute the route origin validation algorithm, as described in RFC 6811. It's a very readable RFC, just a few pages long if you ignore the boilerplate.

And I would recommend readers take a look at that so they understand what the BGP routers are doing.

But throughout this podcast, I'll probably continue clarifying the difference between these three acronyms.

Okay.

And then as to what exactly RPKI is, a quick recap: the Internet is composed of almost a hundred thousand networks that connect with each other. And these networks, all of them, have deployed so-called BGP routers. BGP is the Border Gateway Protocol.

And this is a super low-level system where routers pass messages to one another about what destination can be reached where.

And BGP is a phenomenal technology. It's served us well for multiple decades.

It has excellent scaling properties, but it does not have embedded cryptography.

And this means that the messages that routers pass to each other could be forged or could contain misconfigurations.

Now to overcome those challenges and to increase the reliability of the BGP global routing system, a second system was devised, and this is the RPKI.

And if you sort of glue together the unauthenticated BGP information that you receive from your peers with cryptographically validated information that you receive from the RPKI, you can arrive at a fully automated state where you can make routing decisions that are backed by cryptographically signed information.

And this, in a nutshell, means that the world's network operators can make routing decisions that are aligned with the intentions of the IP prefix holders.

So that's it in a nutshell. There are two systems that drive the Internet. The Internet's routing, I should say. I mean, the technology stack is huge.

I might add something to that too. In this system, one huge design constraint and challenge is to get the Internet to do something. That's a really hard thing. It has to be backward compatible.

You can't break stuff. It can't be a system that's so brittle that it causes outages, or people will abandon it. A lot of that gets baked into how this operates and how it was deployed and continues to be deployed. It is a nontrivial task to get the Internet, which is a big place, to do a thing. And it has to provide benefit with partial adoption, because there will be nothing that we deploy that achieves, you know, a hundred percent adoption.

So it has to work with partial adoption. So anyway, for all the solutions people have invented over the years, that whittles it down to what's a reasonable or likely thing that's actually gonna make a difference, which is what we're trying to do.

Of course. Yeah. And so then, the RPKI itself is more of a framework.

And, as part of that framework, we have objects called ROAs, route object authorizations.

So that is the Route origin authorization.

Yes. What did I say? Route?

Object.

Object. Yeah. Well, it is an object. That's where I got it from. So my apologies. So route origin authorization, which is the actual file that's used as part of the overall process of validation, ROV, route origin validation.

Is that correct?

Yes. And so, yeah, I call that out because in the literature, and when I say the literature, I don't mean the formal literature, but in some blog posts, I see ROA and ROV used interchangeably, incorrectly. They are not interchangeable.

They are not interchangeable.

After this podcast, that'll all go away.

That will not go away?

It'll all go away. We've cleared it up. We've cleared it up for everybody.

Yeah.

Officially. We've closed the book on that one.

So is RPKI really a foolproof silver bullet for securing all routing on the global Internet? Is it really the answer, or are we really considering this as part of an overall system of security in networking?

It is part of the toolbox. It is a super powerful tool, but it is not the only thing that we do. So, to emphasize that, there are also non-RPKI-based route leak mitigations. There is, for instance, an RFC that describes the BGP Open Policy mechanism.

And this is a mechanism where two routers are preconfigured with an expectation of what their own role is in relationship to the other router.

And, to make an example, if Doug and I want to set up a peering session, we'd each configure that we consider ourselves to be a peer of the other.

And if Doug on his side has configured that he expects to be a customer in relationship to me, but I've configured Doug to be a peer, then the session would not establish because there's a mismatch of expectations.

If there is no mismatch of expectations, this mechanism adds a special BGP path attribute, the Only to Customer (OTC) attribute. And this helps routers automatically discern whether to propagate a given route to upstream-type BGP sessions or just downstream-type sessions.

And this has nothing to do with cryptography and RPKI. But I think in years to come, it will be a super effective and cheap mechanism to address a certain kind of routing security incident.
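To make the role mechanism Job describes a little more concrete, here is a minimal Python sketch. It assumes the five roles and the pairing rules defined in RFC 9234; the names `Role`, `session_can_establish`, and `may_propagate` are illustrative and do not come from any router implementation.

```python
from enum import Enum


class Role(Enum):
    PROVIDER = "provider"
    CUSTOMER = "customer"
    PEER = "peer"
    RS = "route-server"
    RS_CLIENT = "rs-client"


# For each locally configured role, the role the neighbor is expected to announce.
EXPECTED_REMOTE = {
    Role.PROVIDER: Role.CUSTOMER,
    Role.CUSTOMER: Role.PROVIDER,
    Role.PEER: Role.PEER,
    Role.RS: Role.RS_CLIENT,
    Role.RS_CLIENT: Role.RS,
}


def session_can_establish(local_role: Role, remote_role: Role) -> bool:
    """The session only comes up when both sides agree on the relationship."""
    return EXPECTED_REMOTE[local_role] is remote_role


def may_propagate(route_has_otc: bool, neighbor_role: Role) -> bool:
    """A route carrying the OTC attribute may only continue 'downstream'.

    neighbor_role is the role the neighbor announced, that is, what the
    neighbor is in relation to us.
    """
    if not route_has_otc:
        return True
    return neighbor_role in (Role.CUSTOMER, Role.RS_CLIENT)


# Job configures himself as a peer; Doug announces himself as a customer.
print(session_can_establish(Role.PEER, Role.CUSTOMER))
# Both sides configure "peer".
print(session_can_establish(Role.PEER, Role.PEER))
```

The sketch leaves out the rules for when a router attaches the OTC attribute in the first place; it only shows the role check at session setup and the propagation decision.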

Right. But, ultimately, it's a method to validate learned prefixes, not necessarily to encrypt any kind of traffic or any kind of sessions or anything like that. Right? It's a valid... Right. Right. Right.

And another mechanism that is very cheap but has nothing to do with cryptography or RPKI is GTSM.

This is a mechanism where you set the TTL of the packets, and you make it so that packets will not traverse multiple routers.

And this helps guard against certain types of attacks, like TCP SYN attacks towards BGP peers. So I think the RPKI is incredibly important, but it is not the only thing we have to do as operators. Right. There's a multitude of technologies and best practices that we need to take into consideration when deploying Internet networks.
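The TTL trick behind GTSM can be sketched in a few lines of Python. This assumes the approach described in RFC 5082: the sender sets the TTL to 255, every router that forwards the packet decrements it by one, and the receiver drops anything that arrives with a lower TTL than expected. The function name `gtsm_accept` and the `max_router_hops` parameter are hypothetical, and real router knobs may count hops differently.

```python
MAX_TTL = 255  # a GTSM sender always transmits with the maximum TTL


def gtsm_accept(received_ttl: int, max_router_hops: int = 0) -> bool:
    """Accept a packet only if it crossed at most max_router_hops routers.

    With max_router_hops=0 (a directly connected BGP neighbor) only
    packets that arrive with TTL 255 are accepted. A remote attacker
    cannot produce such a packet, because every router on the way
    lowers the TTL.
    """
    if not 0 <= max_router_hops < MAX_TTL:
        raise ValueError("max_router_hops must be between 0 and 254")
    return MAX_TTL - max_router_hops <= received_ttl <= MAX_TTL


print(gtsm_accept(255))  # directly connected neighbor
print(gtsm_accept(250))  # packet that crossed five routers
```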

Absolutely. Which would actually be a great topic to delve into on another podcast. But today, we're focusing on the RPKI specifically and then some major milestones, which we'll get into shortly.

I'd like to ask, though, what is the information in the object, in the ROA, that we're actually using? What are we actually validating?

So a ROA contains, in a nutshell, a prefix, an origin ASN, and a signature that you can verify all the way up to the RIR.

And as you receive a BGP message, the BGP message will also contain a prefix and an origin ASN, the rightmost field in the AS path attribute.

And this allows the router to do a simple comparison.

Is the origin in the BGP message the same as the authorized origin in one or more ROAs?

Okay. Right.

Is the prefix length that I received in the BGP message in compliance with what was permitted through the ROA?

And in just a few CPU instructions, the router can figure out: does this BGP message conform to the boundaries or limits that the ROA imposed? If yes, the route is valid. If no, the route is invalid and can be ignored.
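The comparison Job walks through can be sketched in Python as follows. This is a simplified reading of the RFC 6811 procedure, not a router implementation: it assumes the validator has already reduced the ROAs to (prefix, maximum length, origin ASN) tuples, and it ignores special cases such as AS 0. The names `VRP` and `validate_route` are illustrative, and the prefixes and ASNs are documentation values.

```python
import ipaddress
from typing import Iterable, NamedTuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class VRP(NamedTuple):
    """Validated ROA payload: what the validator hands to the router."""
    prefix: Network
    max_length: int
    origin_asn: int


def validate_route(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid', or 'not-found' for one BGP route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        # Only ROAs whose prefix covers the announced prefix are relevant.
        if vrp.prefix.version != route.version or not route.subnet_of(vrp.prefix):
            continue
        covered = True
        # Same origin, and not more specific than the ROA permits.
        if vrp.origin_asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by at least one ROA but matching none: invalid.
    # Not covered by any ROA: the RPKI has nothing to say about the route.
    return "invalid" if covered else "not-found"


vrps = [VRP(ipaddress.ip_network("192.0.2.0/24"), 24, 64496)]
print(validate_route("192.0.2.0/24", 64496, vrps))     # authorized origin
print(validate_route("192.0.2.0/24", 64511, vrps))     # wrong origin
print(validate_route("192.0.2.128/25", 64496, vrps))   # too specific
print(validate_route("198.51.100.0/24", 64496, vrps))  # no covering ROA
```

Note the third outcome: a route that no ROA covers is neither valid nor invalid, which is part of why the system can be deployed incrementally.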

Okay. So in other words, this route that I'm learning, is the source, the sender, from which I'm learning it permitted to advertise this, and, you know, via ASN or prefix length and things like that that you mentioned? So that's what we're looking at as far as the validation. Okay.

So, basically, this is a practice in trust. Right? We're talking about a third party, I have to assume, because the RPKI that you mentioned is a process of the router, some local router, a physical device that's been programmed to reach out to some third-party database. Is that right?

Who are those databases? Who's running those, and what are those?

The RPKI is a set of distributed databases.

And the roots of the system are the five regional Internet registries. So this is LACNIC, APNIC, AFRINIC, RIPE, and ARIN. Mhmm.

And one might wonder, like, five? Why isn't there one? Five is too many. But the alternative would be that the hundred thousand networks that together form the Internet would each need to somehow establish trust relationships with the other 99,999 networks.

So instead of everybody having to figure out whether they can trust their peers or potential peers, the system is designed such that the hundred thousand networks choose to trust the five RIRs for this specific purpose.

Mhmm.

And if everybody trusts the same roots, everybody can automatically and securely calculate and derive trust from those five endpoints. Right? Okay.

So if you and I choose to trust Doug, then we don't need to form an explicit relationship between the two of us because we can run the whole thing through Doug.

And you can imagine that as you scale up the Internet, this is a wonderful scaling property. And this is why I think it's been quite successful so far.

That system also is intended to replace some of the existing IRR-based filtering. In that case, you've got lots of different registries that have information that networks will use to automatically ingest and build whitelists. But the problem with it is that there are different levels of security and different types of information contained. What's nice about the RPKI system is that you've got one that's globally referenceable. We all know what ground truth is. For the IRR-based stuff, there can be some implementation differences between one network and another, and you don't know what they're gonna do. And that just creates uncertainty, and you just don't have that in RPKI ROV, where we're all working off the same sheet of music.

Right. Right. So then what specific threats or attack vectors are we preventing with the RPKI? Is it specifically just learning invalid routes? Like, what does that mean, and what security incidents are we preventing?

So RPKI ROV is what we're talking about today. This is really meant to try to limit the propagation of routes that come from routing leaks, like origination leaks, when an AS accidentally originates, you know, whether it's thousands of routes or a single route, inadvertently. That kind of fat-finger mistake. Occasionally, it has some benefit from a security standpoint, but I think Job and I and people who are involved in this should be very careful in our wording to not overstate what this will do for routing. There's a lot of work left to be done to improve routing security. But there are these cases where there's a hijack that's kind of intentional, but there's also a leak.

So following the Russian invasion of Ukraine, there was a crackdown on media in Russia. One of the providers there wanted to hijack Twitter's address space because they got an order to block Twitter. They tried to do this via BGP. They accidentally announced this out to the Internet.

Well, by that point, Twitter, now X, had rolled out RPKI and created ROAs for all of their routes. That enabled all these international carriers to automatically drop the hijacked routes coming from Russia.

It limited the damage done, at least outside of Russia in that case, for Twitter. So that's a case. We can't count on that if there's a determined adversary. It could be defeated. You could forge an AS path.

It does increase the cost. The AS path gets longer and less attractive. But, yeah, we don't wanna oversell this as a silver bullet. What it does do is improve our survivability and limit the disruption from routing mistakes. As long as we have humans typing into keyboards, programming routers, and even once the AI does it, they're probably gonna be making mistakes too. So we wanna be able to survive inevitable mistakes. And I would say, as a person who's been writing for fifteen years about routing leaks, doing the autopsies of these incidents, we haven't had one that was really debilitating in, you know, maybe five years.

That's a really long time in the span of the Internet. And it's not just RPKI ROV that's helping. There are other things that Job's mentioning.

We've got Peerlock. We've got other mechanisms that a lot of people have worked on. Job's kinda like the face of this for the last few years.

But I'd say there's hundreds of engineers at various companies around the world that have put in work to improve routing hygiene, and I would argue that it's demonstrably better.

And now we can kinda start focusing in on the harder problem of the determined adversary, someone who's gonna try to defeat these mechanisms that we're inventing. That's harder than fighting off the fat fingers, but we had to start somewhere and build from there. So that's the strategy.

Ultimately leading to a more stable Internet.

So let's shift to discuss what happened last spring, actually, last month, as far as a major milestone for RPKI adoption. So, Doug, maybe you can take this one.

Yeah. Sure. So Job and I collaborated on a couple of blog posts. We were looking at, going back to twenty twenty two, just trying to measure how far along we are with RPKI ROV adoption. We know there have been increases. We just wanna try to put some numbers to it.

And the milestone that we reached on May first this year was that more than fifty percent of the IPv4 routes in the global routing table are now covered by ROAs, as per NIST, the US federal agency that has an office that studies secure Internet routing. They have a website that publishes these statistics, so I was just using that as a benchmark. I think a lot of people do.

That crossed the fifty percent threshold on May first.

For IPv6, it was last fall. So that one's already crossed, for people who wanna say, hey, what about IPv6? Well, IPv6 is already there. So that means the majority of routes have ROAs and would be evaluated as valid by a validator.

And so two years ago, at NANOG, Job and I presented some work. This is, like, the February twenty twenty two NANOG. At that point, we were only at one third of the routes in the global routing table with ROAs.

At that point, at Kentik, the company Phil and I are at, we have a lot of NetFlow data from all the customers that we provide service to, and I decided we'd look at, well, what does this look like in NetFlow? We kinda have a slice of the world's Internet traffic for, you know, analysis and experimentation.

And at that point, the conclusion was we're actually over fifty percent.

We already had a majority in bits per second, if we accept this methodology of just looking at all the traffic that crosses all of our customer base. Whether it was exactly fifty percent or not, there's some margin of error, but the point is we were a lot farther along due to a lot of content providers doing RPKI deployment. So we had, like, Cloudflare, Amazon, Google, Akamai. We've got companies that have done a lot of RPKI, and then we have major access networks, eyeball networks. So in the United States, we have, like, Spectrum and Comcast that have done RPKI deployment. Those may not constitute the majority of routes, but they handle a lot of traffic. Eyeball networks, content providers. That's what moves most bits.

So because of their deployments, that's what pushed the numbers up. So then, bringing us back to May first here: we went from one third in February twenty twenty two to over fifty percent in May twenty twenty four, for routes, if you just count them in the routing table. And I went back and looked at our traffic stats, and now that's at seventy percent in bits per second. Seventy percent of the traffic we see is going to routes with ROAs that will be evaluated as valid.

Now that number is probably gonna plateau at some point because there are just gonna be harder and harder gains to be made. But I would argue these numbers would give someone motivation if they're on the fence, trying to say, like, what will the system do for me? Well, it would do a lot, actually, because we have a lot of adoption now. Most of the traffic you're handling is going to routes with ROAs.

That means if you're rejecting invalids, you're protecting the traffic that you're egressing. You're not gonna get tricked by a fat finger, and that's gonna protect a lot of your traffic. So that was the main milestone. But then I took the opportunity to go look at the flip side, which is rejection of invalids.

So if you recall back to the beginning of the conversation, when Job was defining terms and how this worked: for RPKI ROV, route origin validation, to do its job, two things have to happen. The resource owner has to create a ROA. So you have to create a ROA for your address range that says who the right origin is.

That alone will accomplish nothing. If nobody rejects invalids, you've gotten nowhere.

The flip side is that we have to have somebody rejecting invalids.

These days, almost all of the tier one, transit-free zone providers are rejecting invalids. It's almost universal among those networks.

And, anyway, so we had some stats looking at, you know, what are the effects of being invalid on propagation.

So I went back and looked at this through time. We now have some history we can go back and look at to measure the propagation of routes: how far do invalids get propagated? And we can see that through time, it goes down. One challenge here is, if you just pull up the global routing table at any given point, there's a bunch of misconfigured ROAs.

Like, there's a bunch of persistently invalid routes. That's just always the case. I use those for measurement purposes, but, you know, methodologically, you can't guarantee that they're always there or that they're the same from one day to the next. So that analysis showed that the propagation declined, but then I went to validate that. RIPE announces some route beacons for a variety of use cases.

There's a couple that they announce that are both invalid and valid routes from different parts of the world that you can use to just measure what happens to an invalid, based on how they announce these routes. Job also operates a network that does the same thing. So we can check up on RIPE's work with Job's stuff.

And they kinda came to a similar conclusion, that there's been a decline in the last couple of years in how far an invalid propagates. And this is, I think, evidence of increased rejection of invalids. And that, again, is the system working. We wanna decrease the propagation of invalids. That'll decrease the disruption that's created by a fat-finger leak.

So we've made a lot of progress there, and we also have Zayo now as another major telecom that just in the last few weeks has started rejecting invalids. I put that in the blog post. We kinda picked up on that. We put out some visualizations from our BGP visualization tool, just to provide some evidence.

I've been in contact with them too. I, you know, let them know that we saw this, and we had a conversation about it. And then, last point on this, I would just say that in January, at the beginning of this year, there was an outage in Spain. Orange España suffered a large outage.

And the issue was that their RIPE NCC credentials were found in a data dump. Some hacker used them to get into their RIPE NCC account and just tried to cause some havoc. What they tried to do in the end was create ROAs that were intentionally wrong, that would cause an outage. No traffic was compromised or misdirected. It was just kind of Internet vandalism.

But it was effective, and it was only effective because there are a lot of networks that are rejecting invalids. They intentionally made their routes invalid. If there weren't a sufficient number of networks out there rejecting invalids, that outage would never have happened. No one would know about it. So I take something positive out of that outage. It's another piece of confirmation or corroboration of our observation that there's increased rejection of invalids.

It's too bad for the folks of Spain. That's gonna be, you know, cold comfort for them, but they survived the outage. And we learned another example of what happens when a route is evaluated as invalid. It gets suppressed, and that's what we want to have happen.

Right.

Right. Job, I have a couple of follow-up questions to what Doug just explained to us. The first is, is it then incumbent on an organization that's running BGP to go and create the ROA? They just log into the APNIC website or into the RIPE website and then create their own ROA. Is that how that works?

Yeah. So the owner of the IP address will log in to a portal to click together the ROAs, or they use an API. So in the case of Fastly, we have an IPAM, an IP address management system, that is our centralized source of truth. And whenever engineers provision something new, it launches a set of scripts that interact with the APIs of the RIRs to produce ROAs that match exactly with what we intend to announce in BGP.

And I wanted to circle back a little bit about rejecting invalids.

Mhmm.

An apt analogy might be: there is a difference between recognizing spam and marking email as spam, and automatically deleting email that was marked as spam. So I think the creation of ROAs is what facilitates the Internet routing system to recognize routes that are the equivalent of spam, something that should not, you know, hit the final email intake. Right.

And the act of rejecting BGP routes is like sending stuff that is trash and marked as trash automatically into the bin so that it does not disrupt normal business workflows.

And yeah.

Yeah. Small analogy I wanted to put in there.

Oh, no. That's a good analogy. It makes a lot of sense because it also distinguishes again the difference between ROAs and ROV. But it does make me wonder about the second blog post that we haven't really delved into yet, the one you wrote together about the expiration of ROAs.

For example, why is the expiration so short on ROAs? That's something that you explored pretty extensively in that blog post.

Yeah. Yeah. Expiration. Expiration.

So the previous systems around routing security, trying to ascertain whether somebody is somehow authorized to pass on a route or generate a route, relied on information that has no inherent expiration moment.

So what this means is that if a network operator wanted to originate a set of IP addresses, they'd make an Internet Routing Registry, or IRR, route object.

And those route objects are plain-text thingies, objects that live in databases for multiple decades, potentially.

And even though the IP block since then has been transferred to another organization or sold or returned to the RIR, these route objects may still linger around because there's no mechanism inherent to the route object itself in the IRR to prune these.

And if you compare this to, say, domain name registrations, your domain name has an expiration date. And every year, you pay a small fee to extend its lifetime. And if you don't do that, it might mean that you lost interest in the domain name, and it's now available for others to use or register.

So the IRR lacked this mechanism. And, of course, a letter of agency, so me providing you with a paper letter or faxing you something with my company letterhead saying that I am about to use this prefix, also doesn't have an expiration date. It's a piece of paper that will exist forever and ever. So over the years, there has been a proliferation of information that maybe twenty, fifteen years ago was accurate, relevant, and useful.

Nowadays, it's not so useful. It's trash, but it's super hard to understand whether it's relevant or not because we cannot discern if an object is stale or not. Now this is very different compared to the RPKI.

In the RPKI technology, every object, so a ROA, or its issuer's certificate, or its issuer's issuer's certificate, and so on all the way up to the trust anchor, has time stamps embedded in it about the starting point of its validity and the end of its validity. So, in X.509 terminology, this is the notBefore and notAfter. And notAfter, I guess, can be compared to the expiration date printed on a carton of milk. You know, if you pick up the milk a week after that date, it probably might not be as good for you as you had hoped it to be.

And all of this happens automatically. So the RPKI is really this machinery that is constantly reissuing and re-signing information and bumping expiration dates into the future.

And that's a fascinating property that the previous generation of information system, the IRR, did not have at all.

So, yeah, dug into this.

Allow me to create a bridge to pass the microphone to Doug.

I run a project called rpkiviews.org, and this is somewhat modeled after routeviews.org, because a few years ago, I realized, like, hey, wouldn't it be cool if a few years from now we could go back in time and look at the state of the RPKI as it was two, three, four years ago?

In the same way that RouteViews allows us to take a look at how the last ten years of routing or invalid propagation or deployment of certain BGP attributes went.

So I made this system that would periodically, every, say, twenty, thirty minutes, take a snapshot of the RPKI and put it into a tarball, a compressed archive, and just put it on the website for anybody to download and analyze.

So this system not only records the raw digital files that together form the RPKI, but also produces a validated output, something that would be sent to your routers at that point in time in order to do route origin validation.

And these files contain not only the prefix, maximum prefix length, and origin ASN, but also an expiration timer.

And this is where I wanna pass the microphone to Doug.

Okay. Yeah. So the RPKIviews service that Job has built is incredibly useful. There's no other way to do what it provides, which is to be able to go back in time and understand the state of the RPKI at any time other than now. Like, the system itself has no memory. It's just, you know, what is it at this moment?

So I've started using his system, RPKIviews, which is open to the public. And if anybody wants to dig into this, it's all there.

But, you know, I used it in my analysis of the Orange España outage. We could go back and see the ROAs changing through time.

So we knew exactly what took place that caused the outage, and what the actual changes to the ROAs for Orange España were. So, I guess I had come across the expirations that are listed in the data that's coming out of RPKIviews.

And, you know, I didn't understand the distinction initially between the effective expiration and the ROA expiration. So there's an expiration in the ROA itself, and there's also an effective expiration, which is basically the expiration from the point of view of the validator. As the validator traverses the cryptographic chain, each of those steps has got its own, you know, notBefore and notAfter time frame of when it's valid. So once you've traversed the chain, you're subject to the shortest of those time frames. That's the notAfter time you have to honor, and that's what we're calling the effective expiration.

So some of these ROAs can be a year out when you look at the expiration date on the ROA, but, effectively, it's actually, you know, eight hours out. It's not that far in the future. Anyway, I started to just poke around at this. I was trying to understand this phenomenon, so I took a snapshot and graphed the expirations.

And I noticed that there are these peaks at different times. It seemed like a multimodal distribution.

And with just a little bit more poking, I realized that each mode was an RIR. So then I realized that what we're looking at is the different behavior of the different RIRs. You know, Job mentioned the five RIRs earlier. Each one is a root of authority in the system, and there are little implementation details in how each one does its thing that are a little different from the others.

So then once I saw those peaks from the one snapshot, I was like, well, I wonder what this looks like through time?

And then it started to get really funky, because each one operates on a different cadence of just how often and to what extent it updates its effective expiration dates. And that gives us a little window into some of the, you know, inner workings of that cryptographic chain. I know I threw around the term cryptographically signed a lot, just kinda in a hand-wavy fashion, without having the skill set that Job's got to actually walk through each step. But I thought it was kinda fascinating to keep pulling at the string. I did an animation and, you know, a few different graphs trying to capture how each one is different. So we have five different solutions that are out there.

You've got RIPE, where the expirations are always within, you know, like eight to twenty-four hours and they're quickly updated. And then at the other end of the spectrum is APNIC, where they're always about five days out.
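The per-RIR grouping Doug describes can be sketched in a few lines of Python. This is not his actual analysis code; the function name is hypothetical, and the sample records are made-up values shaped to match the ranges he mentions (RIPE at roughly eight to twenty-four hours, APNIC at about five days).

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median


def hours_to_expiry_by_ta(records, now):
    """Group (trust_anchor, effective_expiry) pairs and summarize each group.

    Returns {trust_anchor: (min_hours, median_hours, max_hours)}.
    """
    buckets = defaultdict(list)
    for trust_anchor, expires in records:
        buckets[trust_anchor].append((expires - now) / timedelta(hours=1))
    return {ta: (min(h), median(h), max(h)) for ta, h in buckets.items()}


# Hypothetical snapshot time and records, for illustration only.
now = datetime(2024, 6, 20, 12, 0, tzinfo=timezone.utc)
records = [
    ("ripe", now + timedelta(hours=8)),
    ("ripe", now + timedelta(hours=16)),
    ("ripe", now + timedelta(hours=24)),
    ("apnic", now + timedelta(hours=120)),
    ("apnic", now + timedelta(hours=121)),
    ("apnic", now + timedelta(hours=122)),
]

summary = hours_to_expiry_by_ta(records, now)
for ta, (lo, mid, hi) in sorted(summary.items()):
    print(f"{ta}: min={lo:.0f}h median={mid:.0f}h max={hi:.0f}h")
```

Running the same summary over successive snapshots is one way to see the different update cadences through time.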

And I'll give my rationale. Job can correct me or add to this. The argument for having these short expirations is that you don't want to get stuck in a wrong state. If there's some sort of an outage along that cryptographic chain and we can't get new information, we don't want people to get stuck with old information, and the concern is that it could be problematic. Job, do you wanna add something to that?

I first wanna reiterate the difference between effective, or transitive, expiration moments and expiration moments per element in the chain, and then I'll add some thoughts on why expirations exist at all in this system.

So the ROA is this digital file that contains a certificate, which has an expiration on it. So, you know, the date today is somewhere in June twenty twenty-four. And if I create a ROA today, I might say this ROA is gonna be valid for six months. So that puts us at, you know, an expiration moment in November twenty twenty-four.

The parent is the entity that signs over this ROA, and the parent also signs over a certificate revocation list. And the certificate revocation list is a mechanism through which you can retract or revoke a ROA before its expiration moment. So let's say, next month, I want to revoke the ROA that I created today.

Next month, I need to add that ROA's serial number to the certificate revocation list, the CRL.

Because if I don't do that, some systems might interpret my ROA as still valid for another five months because, you know, the expiration date embedded inside the ROA itself says it's gonna expire in November twenty twenty-four.

But these CRLs also have an expiration moment, and that one is pretty short, because you want sort of a, call it a liveness check in the system. It is technically possible for an issuer of ROAs to issue a ROA that makes their own publication point unreachable, by having issued a ROA that covers the IP address space in which the RPKI publication point resides.

In other words, if you paint yourself into a corner, it is kind of nice if the system, in a matter of hours, rectifies the situation by expiring the information in the ROA, which would make the publication point reachable again, which would allow the RPKI operator to distribute a corrected ROA.

And imagine if the system didn't work this way. I accidentally create a ROA that's valid for another hundred years, and I publish it.

Oops.

The IP space in that ROA is not properly usable for, well, it is a very hypothetical situation, but, like, another hundred years. So the difference between transitive or effective expiration and the expiration of individual components in the chain is that the validation algorithm that parses through all the objects will check what the soonest expiry moment is of any of the objects in the chain: any of the manifests or CRLs or CA certificates or the end-entity certificate inside the ROA itself.

And that is the transitive expiry moment, or the effective expiration moment, upon which the system as a whole acts. So if a ROA is set to expire in six months, but the CRL that covers the scope within which the ROA resides expires eight hours from now, that ROA's effective expiration moment is eight hours from now, not six months from now.
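The rule Job describes boils down to taking the earliest notAfter across every object in the chain. Here is a minimal Python sketch of that idea; the object labels and lifetimes are illustrative assumptions, and a real validator would read each moment from the signed objects while walking from the trust anchor down to the ROA.

```python
from datetime import datetime, timedelta, timezone


def effective_expiration(chain_not_after):
    """Return (object_name, moment) for the soonest expiry in the chain."""
    name = min(chain_not_after, key=chain_not_after.get)
    return name, chain_not_after[name]


# Hypothetical chain, mirroring the example above.
now = datetime(2024, 6, 20, 12, 0, tzinfo=timezone.utc)
chain = {
    "ROA end-entity certificate": now + timedelta(days=180),  # about six months
    "CRL": now + timedelta(hours=8),
    "manifest": now + timedelta(hours=24),
    "CA certificate": now + timedelta(days=365),
}

limiting_object, expires = effective_expiration(chain)
print(f"effective expiration: {expires.isoformat()} (limited by the {limiting_object})")
```

Even though the ROA's own certificate is good for months, the eight-hour CRL is what the system as a whole acts on.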

Job, I have a question for you.

So, you know, in our analysis we had, like I mentioned earlier, a wide range of outcomes here, with RIPE having the shortest expiration times and APNIC with ones that are at least five days out for most of their effective expirations.

Is there a best practice? Is there a risk there? Like, I don't know what people usually do in this case, but it seems like a bit of an outlier. Right? Is there a risk incurred there?

I guess, let's say from this hypothetical scenario.

Well, in general, when talking about certificates, I think the industry as a whole has come to understand that short-lived certificates have advantages, even if they weren't understood at the time, compared to long-lived certificates.

So it used to be the case that if you wanted an HTTPS, you know, icon in the browser, you'd purchase a certificate, and it would be valid for three years. And you'd buy the three-year one because you know that you'd forget to renew it every year, so you wanna postpone that moment as much as you can.

But this means that if you lose the private key associated with that certificate, the risk will potentially exist for up to three years.

So with modern-day HTTPS or TLS, in systems like Let's Encrypt, the certificate expiry moment is, you know, maybe thirty or ninety days into the future. And this means that the time window in which risk exists related to compromised private keys is reduced to a matter of days.

Another advantage, as I mentioned before: if I want to retract or revoke a ROA ahead of the expiry embedded inside it, to get it out of the system, I need to list it on the CRL.

And the shorter the validity window of a given object, the shorter the time it needs to reside on the CRL if I want to revoke it prior to its natural expiration moment. And this means that if we use shorter lifetimes in a system like the RPKI, the CRLs are also smaller, because we need to list individual objects for a shorter amount of time, and this increases the overall efficiency of the system.
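A toy model can show why shorter lifetimes keep CRLs small. The Python sketch below assumes, purely for illustration, that one object is issued per day and revoked the day after it is issued; a revoked serial only needs to stay on the CRL until the object's own notAfter has passed. The function name and the churn pattern are hypothetical, not a description of any RIR's system.

```python
def crl_entries_on_day(lifetime_days: int, day: int) -> int:
    """Count serials that must still be listed on the CRL on a given day.

    Toy assumptions: one object is issued on each of days 0..day-1, each is
    revoked one day after issuance, and each would naturally expire
    `lifetime_days` after issuance.
    """
    count = 0
    for issued in range(day):
        revoked = issued + 1
        not_after = issued + lifetime_days
        # Listed only while revoked but not yet naturally expired.
        if revoked <= day < not_after:
            count += 1
    return count


for lifetime in (7, 90, 365):
    print(f"lifetime {lifetime:3d} days -> {crl_entries_on_day(lifetime, day=400)} CRL entries")
```

With the same revocation rate, the number of listed serials grows roughly in proportion to the object lifetime.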

So I think it's a matter of, like, security hygiene. Don't have keys that exist forever and ever and ever.

And, also, just, you know, make sure things are as small as they can be.

And the shorter lifetimes sort of translate into smaller file sizes in this system.

So I would assume that this is all part of the iterative process of the RPKI improving over time as a result of people adopting the framework.

So I have one question for you, Job, and then I'd like to shift our attention to next steps.

And, basically, is this a mechanism that should be considered by tier one service providers, CDNs, maybe lower tier service providers only, or is this really applicable to all organizations, enterprises included that are running BGP on the public Internet?

I think route origin validation is super useful the moment you are multihomed.

Okay.

Multihomed doesn't mean that you have two transit providers. Multihomed means that you have two BGP peers that are distinct ASNs, and it doesn't matter what their exact role is to you. Because the moment you are multihomed, chances are that at some point in time, both BGP sessions present you with a route towards a given destination.

Now, in the normal course of operation, both routes would be good, eligible.

But if there's some kind of typo or attack or other issue going on, then maybe one of the two options presented to you is a wrong one. And it's awesome if you are able to discern which of the two is the path towards reaching the end destination and which of the two is the path towards the wrong destination.

So if you are multihomed, even as a small network, then I think it pays off to do route origin validation, because you increase your chances of sending packets down the right pipe, the correct pipe.
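To show how a multihomed network could tell the two routes apart, here is a short Python sketch of origin validation in the style of RFC 6811: a route is valid if a covering VRP matches its origin and length, invalid if it is covered but nothing matches, and not-found if no VRP covers it. The `Vrp` class, the function name, and the sample prefixes and ASNs (taken from documentation ranges) are illustrative assumptions, not a router implementation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Vrp:
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(route_prefix: str, origin_asn: int, vrps) -> str:
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route sits inside the VRP prefix.
        if vrp_net.version != route.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if route.prefixlen <= vrp.max_length and vrp.origin_asn == origin_asn:
            return "valid"
    return "invalid" if covered else "not-found"


vrps = [Vrp("192.0.2.0/24", 24, 64496)]

# Two BGP sessions offer the same destination with different origins.
candidates = [("192.0.2.0/24", 64496), ("192.0.2.0/24", 64497)]
for prefix, origin in candidates:
    print(prefix, f"AS{origin}", validate_origin(prefix, origin, vrps))
```

A network rejecting invalids would drop the second candidate and keep forwarding along the first.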

Okay.

So, Doug, I assume, and correct me if I'm wrong, that it is unlikely that we will ever achieve one hundred percent adoption by folks that are using multihomed BGP.

So what would you say are the next steps, in the near future and the long term, for adoption?

Well, I guess the takeaways from these updated statistics on adoption are twofold.

One is that the fact that there is a significant amount of rejection of invalids ought to be an incentive for a network, really, any network. Anybody who has any resources, meaning you're responsible for IP addresses.

You should be creating ROAs. That enables the rest of the Internet to look out for you.

Of the two steps, that's the easier one, where you can just log in, set something up, and, you know, state what origin you want for your address range.

Anyway, so that ought to be, you know, motivation to get that step done. And then, again, as far as the routes with ROAs, this ought to incentivize people, or motivate people, to be rejecting invalids, knowing that this is gonna be protecting the traffic that you egress. So with each step, you've got a benefit to be had there.

Now as far as future steps, there are a few different technologies that are still on the horizon. We've got ASPA, which is another application that runs over the RPKI. It's a mechanism where an AS will assert its transit relationships into the RPKI, enabling others to identify valley-free violations in AS paths and reject those routes.
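As a rough sketch of the ASPA idea, the Python below checks a path received from a customer: walking from the origin toward the receiver, every hop must go from a customer to one of its asserted providers. This is a deliberate simplification of the verification procedure in the IETF draft (draft-ietf-sidrops-aspa-verification), covering only the customer-facing case; the function name, the `aspa` mapping, and the ASNs are hypothetical.

```python
def verify_upstream_path(as_path, aspa):
    """Simplified ASPA-style check for a route received from a customer.

    as_path: list of ASNs, nearest neighbor first, origin last.
    aspa:    {customer_asn: set of provider ASNs it has asserted}.
    Returns 'valid', 'invalid', or 'unknown'.
    """
    # Collapse prepending so repeated ASNs count as a single hop.
    hops = [asn for i, asn in enumerate(as_path) if i == 0 or asn != as_path[i - 1]]
    result = "valid"
    for provider, customer in zip(hops, hops[1:]):
        authorized = aspa.get(customer)
        if authorized is None:
            result = "unknown"      # this customer has published no assertion
        elif provider not in authorized:
            return "invalid"        # hop contradicts an assertion: likely a leak
    return result


# Hypothetical assertions: AS64500 buys transit from AS64501,
# which in turn buys transit from AS64502.
aspa = {64500: {64501}, 64501: {64502}}

print(verify_upstream_path([64502, 64501, 64500], aspa))  # expected chain
print(verify_upstream_path([64510, 64501, 64500], aspa))  # AS64510 is not a provider of AS64501
```

Unlike origin validation, this looks at the relationships in the middle of the path, which is where route leaks show up.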

I think we're still at the beginning stages of that. We do have some deployment there. Now, that will catch things that ROV doesn't, because, you know, ROV is really just focused on the origin.

ASPA is focused on the middle of the AS path, trying to catch mostly the types of leaks that still happen with some frequency. There is, you know, a steady stream of leaks taking place. They're not big news because we're doing a better job of suppressing them.

It's hard to tell the story of the things that haven't happened, but as somebody who still follows this very closely, it's happening and they're not affecting people. But I think the area of concern for a lot of the folks who have expertise in this space is now shifting over to this determined adversary scenario. It's best exemplified by a series of attacks against cryptocurrency services in the last couple of years, where you've got attackers that are quite knowledgeable about our systems, about things like ROV and IRR filtering.

Whoever's behind this knows how to get around each of these steps, and they'll be able to impersonate an AS. And so then we need a technology like BGPsec to try to prevent networks from impersonating other networks. And that's, you know, a whole other discussion that we don't have time for here, but that is still in the discussion phase of just, you know, how would we build that and address all the concerns of people who have had hesitation around that technology.

But there's lots more to be done, and this is something that folks like Job and I will probably be working on, you know, for our entire careers. You know, the Internet's a big place. The constellation of problems is nontrivial, and so it's gonna take many years to address these things. But the fact that we're able to roll out a mechanism like ROV and get the rate of adoption we've got to date is very encouraging. For these other problems, if we put our heads together, we're gonna be able to keep improving route hygiene and routing security, and keep the momentum up.

Before we close out, Job, is there anything else that you'd like to add?

Yeah.

When it comes to transitive expiration, or expiration in the RPKI at all, I've heard concerns in the community because people have had misconceptions about the dynamics of the system. So some people might have thought, like, well, if the ROA is gonna expire in eight hours, then eight hours from now my validator will start taking action to refresh the data and see if there's a newer thing. A different flavor of this misconception is that if a ROA expires eight hours from now, the system, eight hours from now, will reissue a new ROA. That's not how it works.

If you have a ROA that is gonna expire a year from now, some of the RIR-managed systems will reissue that same information in the ROA nine months from now. So, like, there might be reissuance three to six months ahead of time. When it comes to CRLs, if I make a CRL that will expire twelve hours from now, then four hours from now I'm gonna reissue a CRL that's gonna expire four plus twelve hours from now. So everything in the RPKI is reissued long before the actual expiry happens, so that if there are synchronization issues between the relying parties, the validators, and the issuing parties, the CAs, there is sort of a grace period to allow for network synchronization issues to be overcome.
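The CRL cadence in that example (valid for twelve hours, reissued every four) can be sketched as a small timeline in Python. The function names and the start time are hypothetical, and actual issuance schedules differ per RIR; the point is only that a validator which keeps synchronizing always holds data that expires well into the future.

```python
from datetime import datetime, timedelta, timezone


def crl_schedule(start, validity, reissue_every, count):
    """Return (issued, expires) pairs for successive CRL issuances."""
    return [
        (start + i * reissue_every, start + i * reissue_every + validity)
        for i in range(count)
    ]


def freshest_expiry(schedule, last_sync):
    """Expiry of the newest CRL a validator could have fetched by last_sync."""
    seen = [expires for issued, expires in schedule if issued <= last_sync]
    return max(seen) if seen else None


start = datetime(2024, 6, 20, 0, 0, tzinfo=timezone.utc)
schedule = crl_schedule(start, timedelta(hours=12), timedelta(hours=4), count=3)
for issued, expires in schedule:
    print(f"issued {issued:%H:%M}  expires {expires:%d %b %H:%M}")

# A validator that last synchronized five hours in, then lost connectivity:
last_sync = start + timedelta(hours=5)
remaining = freshest_expiry(schedule, last_sync) - last_sync
print("grace period before data goes stale:", remaining)
```

In this toy schedule, expiry only takes effect if the validator stays cut off from the publication point for the whole remaining window.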

So the expiration moment only kicks in if you lose network synchronization capabilities towards the publication point that the issuing party is using, and that is the appropriate reaction of this system. As long as there is a communication channel, and it's in proper order, you will, well ahead of time, receive freshly signed cryptographic information with expiration moments that are yet again further into the future.

So, yeah, this is a key takeaway. Like, we don't start taking action after expiration.

All the systems, including the validators, including the issuing systems, perform actions well ahead of the actual expiration moment.

Right. And, again, that lends itself to the stability that the RPKI offers in the authorization and validation of routes on the Internet. So I would like to end here because of time. Thank you, gentlemen, both very much for joining today and for the really great discussion. So as we close, how can folks reach out to you online if they have a question or a comment about today's topic? Doug, why don't we start with you?

Sure. I'm on Twitter/X. I'm on LinkedIn.

Let's see. Most of my blog posts appear on the Kentik blog.

And, yeah, if you can reach out through one of those means, that's probably the best.

Great. And, Job, how about you?

There's a few ways. You could email me at job@fastly.com.

You can find me on twitter.com/jobsnijders or on Mastodon, which is bsd.network/job.

Or I'm on IRC.

My nickname there is Job.

Very convenient.

And you can still find me on Twitter at @network_phil. You can search my name on LinkedIn, or visit my blog, networkphil.com. And, as is the case with Doug, I have a significant number of blog posts on the Kentik blog, which I do recommend that you check out, especially the two posts written by Doug and Job recently on RPKI ROV deployment and on ROAs perpetually expiring.

So for now, thanks very much for listening. Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.