I'm pleasantly surprised by how much effort you make to explain the basics before getting to the actual problem! Keep up the good work and keep posting on such fresh topics!
Very nicely done. I am now a subscriber. Let me just contribute a thought or two, please… 1 - BGP is COMPLICATED. Not just anyone can modify it. All kinds of rules, regulations, protocols, etc. 2 - Many networks, including on-premises networks and virtual networks (AWS, GCP, etc.), usually have no need to use or control BGP once they have a contiguous block of IP addresses. 3 - Anyone familiar with Access Control Lists (ACLs) on firewalls will IMMEDIATELY recognize that single line of code as being out of place. There are 3 certainties in life: death, taxes, and the last line of an ACL on a firewall being DENY ALL (or the equivalent).
Great explanation. I would like to know how you spend your day, because with daily family-related tasks and a job on top of that, how do you find time to read all these technical blogs and stay updated? Knowing how you manage time could really help me plan my day.
Thanks for the very significant detail in this post. It shows the urgent need for a crisis-management routine. Thanks, too, to the CF engineers for documenting the cause and recovery process, and especially to their management team for the courage to lay bare their errors. I hope it leads to other companies learning from this. KUDOS TO CLOUDFLARE MANAGEMENT FOR TRANSPARENCY!
Thanks for a great explanation of Cloudflare's root cause analysis! I'm a huge Cloudflare fan and appreciate their transparency on the causes. I can only imagine how embarrassing it was for them. Reminds me of the time I inadvertently dropped the schema of a dev database by committing a bad Hibernate auto-DDL config setting. Stuff happens. :-)
For me the real question is "how can they prevent this from happening again". It would be very interesting if Cloudflare, or other organizations for that matter, share such stuff too.
Using a good colour-coded text editor and having the change peer reviewed by someone who is not sleep deprived would be a start? And always delete and re-add the last term in JunOS unless you are modifying the last term… oh, and commit confirmed in X so it will auto-revert in X mins.
That's right, but a single developer should not be blamed here. A developer cannot deploy code to production just like that; the code goes through several reviews and approvals. Everyone who reviewed and approved this particular change is responsible for this outage.
Dev or network engineer? This is networking, and by the look of it JunOS, so I'm guessing this is IT Ops… Ops usually goes through anxiety and panic at least twice a year.
It's very possible that the reason engineers were stepping on each other's reverts, is that the procedures and tools they would normally use to roll back changes in a controlled way were unavailable due to the outage itself. So they may have been resorting to more hands-on, manual changes, using their best judgement and on-hand knowledge to decide what to do, with on-the-fly oversight. And in this kind of fog of war, sometimes the wrong thing happens.
With so much relying on the Internet working, they probably had no phones (all VoIP, I suspect) and no email or chat programs. Very hard to coordinate the effort with little to no communications.
Great explanation as usual. Having worked for some of the larger cloud services companies, unfortunately I'm not surprised. Lackadaisical approaches to change reviews, single-minded focus on one change without the broader "what else might this affect?" mindset, and a fear among engineers of pointing out potential problems or mistakes make this kind of issue almost guaranteed.
Great explanation - you made the complex easy to understand. Big thumbs up to the openness of the Cloudflare devs and their post-mortem. We can all learn from this.
What a great explanation. I loved the bit where you walked through the chaos that exists amidst a full outage. Engineering meltdowns are glorious and horrifying all at once.
It's the same kind of wrong BGP configuration that brought down Facebook's entire network in 2021. BGP is critical to the internet, but it is fragile. Every AS advertises its routes, and they get forwarded globally. Each AS has to trust the others' configs. And there is no easy way to test this without actually broadcasting, as far as I know.
If someone can explain a complex thing in simple words, then that person really knows what they are talking about. I get that feeling when I see your videos. Thanks a lot for doing the research and explaining it to us.
One benefit of CDNs is that transmitting data over a long distance has large hidden costs: increased latency, bandwidth that self-limits because of that latency, and the datacentre fees you end up paying. Also, static data is usually several orders of magnitude larger than dynamic data, because it's images, videos, fonts, and large JavaScript frameworks, while dynamic data is usually a screenful of text at most, maybe 2 kilobytes. Imagine you could deliver all data from near the user, for the best user experience and lowest transfer costs; but then hosting costs and complexity would multiply. Since a static CDN serves the vast majority of the data from near the user, you get the vast majority of the performance benefit and save costs as well.
This sounds familiar! Properly testing code changes requires simulating the entire production system, which is rarely done. Microsoft eliminated dedicated testers altogether in 2015, putting all responsibility for testing on the developers. Even before that, I had to make a federal case to get full copies of production databases to test against. When I got my way, testing each release eliminated bugs most of the time. It is just hard to comprehend the complexity involved in these systems, and it is hard to accept that at that level of complexity, *bugs are a given*.
I learned a lot here and it was very interesting; thank you for the video and the gradual, clear explanations. All these behind-the-curtain things are fascinating...
Wow, hats off to you, Kaushik. If I had the power, I would have given you the title of Father of Computer Science. You are the best: tech, teaching, coding, superb.
Interesting and very relatable indeed. You can have the best processes in the world, and still, some major problem occurring is not a question of if but of how often. Humans are just not perfect. Keeping the answer to that "how often" really low is the best you can ever hope for.
Based on the CLI commands in the root cause analysis, they are using Juniper routers. When it comes to network equipment there are two options for configuration, and it varies by environment: it can either be a scripted change via something like Ansible or straight Python, or it can be fully manual, giving you the opportunity to double-check your work via show commands.

With Junos (Juniper's OS) we have the concept of a candidate configuration, which is a departure from the Cisco way of doing things. A Cisco configuration takes effect immediately once you hit enter, and your only rollback strategy is to not save the config and reboot the device. With Juniper, the candidate configuration lets you not only review changes before committing them but also do a "commit confirmed": if the changes are not confirmed within the allotted timeframe, the config rolls back automatically.

BGP is an incredibly deep and complex routing protocol (it is the routing protocol of the internet). The selection of which routes are taken is an algorithm. BGP uses route advertisements, and there is indeed a global routing table. Autonomous System Numbers and IP space are allocated and need to be obtained from the relevant regional Internet registry for where the network exists.

I need to stress that this network configuration is not code; it is not programming. You are dealing with physical infrastructure. You can manage the configuration through code, but in the end it is a configuration that exists on network equipment.
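The candidate-configuration workflow described in that comment looks roughly like this (a sketch of the Junos CLI; the prompt and the 10-minute window are illustrative, not taken from the post-mortem):

```
user@router> configure                 # enter configuration mode (edits the candidate config)
[edit]
user@router# show | compare            # diff the candidate against the active config
user@router# commit check              # validate the candidate without activating it
user@router# commit confirmed 10       # activate, but auto-rollback in 10 minutes
user@router# commit                    # confirm within the window to make the change permanent
```

If the `commit confirmed` change cuts off your own management access, you simply lose the session, wait out the timer, and the router reverts itself.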
Another thing on BGP changes: they propagate to your peers, then their peers, then… I think you get where this is going. One little oops in BGP can take a while to undo. I was watching the changes being made when we were switching backbone providers, and one seemingly minor typo brought us grief for hours. I asked about it when I spotted it, but it was already too late to prevent the mess; we just had to fix it and wait. An ISP in a small country made a mistake years ago and killed their country's connections by overloading them, and other parts of the world had nothing too, because the ISP had mistakenly advertised itself as "the shortest path" for most of the world. Once the other BGP systems lost contact, they started to correct the issue for themselves, but it was ugly for a while, and lots of filters were added right after the incident to try to prevent such things from spreading like that again.
Cloudflare is honestly the bane of my life, always hitting these stupid barriers that stop me visiting websites. If I see that 'checking your browser' message I usually just close the tab and find another site to use because it never works.
Happy to have BGP explained by a developer ❤️. Well, I have a suggestion. If they don't want to expose the IP of the server, why don't they put a proxy in front of the real server, where the real IP is stripped off and replaced by the proxy's IP on the way back from the proxy to the user?
Well, it was a great explanation! I have two doubts though. On the "fallback mechanisms" section: why not reroute the incoming request to other locations where copies of the same resources reside? That way the actual source IPs would not be disclosed. This probably requires implementing the config file (where the code that caused the outage lives) in some other way. On the "testing" section: knowing that "reject all the rest" should come at the end and not in between, can't we run a config-file parser/checker/validation script that checks, during parsing, whether the tokens aggregate into "REJECT ALL THE REST" semantics, and enforces that this aggregate sits at the END by itself? It's like validating the config file after it goes through all the code changes (by different people) but before rolling it out to production. GitHub Actions could be used for this, right?
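The validation-script idea from that comment could be sketched like this. This is a toy lint over a hypothetical line-based rule format (Cloudflare's real config syntax is not public), just to show the check is simple to automate:

```python
# Toy lint for an ordered rule list: a catch-all REJECT must be the
# final rule, otherwise every rule after it is unreachable.
# (Hypothetical format; not Cloudflare's actual config syntax.)

def check_reject_last(rules: list[str]) -> list[str]:
    """Return a list of lint errors; an empty list means the config passes."""
    errors = []
    for i, rule in enumerate(rules):
        if rule.strip().upper() == "REJECT ALL" and i != len(rules) - 1:
            errors.append(
                f"rule {i}: 'REJECT ALL' is not the last rule; "
                f"{len(rules) - 1 - i} rule(s) after it are unreachable"
            )
    return errors

good = ["accept 10.0.0.0/8", "accept 192.0.2.0/24", "REJECT ALL"]
bad  = ["accept 10.0.0.0/8", "REJECT ALL", "accept 192.0.2.0/24"]

print(check_reject_last(good))  # []
print(check_reject_last(bad))   # one error about the misplaced REJECT ALL
```

A CI job (GitHub Actions, for instance) could run a check like this against the merged config and fail the build on any error, before the rollout ever starts.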
Really good technical explanation! But you didn't really connect the "storage is cheap and compute is expensive" point back to the topic afterwards, or did I miss it?
2:02 A dynamic site in JavaScript runs on the client; a database-driven site runs on the server side. Neither is cheap in terms of resources: one bogs down your mobile/notebook (with tons of background ads and data-mining crap), the other moves data across (many) servers, does calculations, and sends the results back to the client.
In the firewall world you can have rules that cannot "live" in the middle of your config (like DENY ALL), and trying to commit them will be rejected. For BGP routers it is quite normal to have rollbacks available that execute after X minutes and revert the changes automatically unless some condition is met. And I am pretty sure they can use commit scripts that verify whether a "REJECT" is anywhere but the last rule; this is basic stuff. I mean, losing your management plane in a network like CF's is truly something unbelievable. They should have a separate OoB (out-of-band) network available to manage their devices via console. This all sounds odd to me, but then again, we have no idea about their network setup and we do not get all the details.
Commenting something just to make sure the YouTube algorithm catches it. This is the very first time I am hearing @Kaushik ask us to comment on and like the video 😇. You are beyond these things though 😎
The idea of having multiple smaller code reviews which are then rolled out as one seems OK if the pieces are truly independent of one another. But in the general case they won't be, and then by dividing them up you've just set yourself up for the next big outage, because the interdependencies were missed by the individual reviewers. This critique seems a bit too facile to me.
These things are bound to happen; computer systems and networks are becoming more and more complex. The higher the complexity, the higher the chance that a small part can bring the whole system down. It's not about preventing a similar mistake from happening; it's about how the system can be made robust enough to handle cases like this.
Actually, they can and should always be small. If you are finding it difficult to decompose a code change into fewer than, say, 500 LOC, then your system is too tightly coupled in its design, and the time has come to consider refactoring or decomposing it into smaller microservices.
I speak as somebody who's never worked on such a system, but naively isn't it bad design to have that much interdependence? Shouldn't your code base always be modular enough and with clean splits among well defined interfaces that you can always do piecewise upgrades?
What I don't understand is why a code analysis tool can't either find that particular issue or be modified to find such an issue. Unreachable statements/fallthroughs in a switch block are like a cliché gotcha at this point. You should also have some coded tests as proof of it working too. There's too much at stake to mess this up.
Incomplete rollback plan. Shit happens, things break, and we try to prevent it. But it is unacceptable to do a BGP change without access to the devices that will drop off if the change fails.
Isn't this similar to the FB/IG/WA outage? They messed up BGP, all nodes rejected traffic, and engineers literally had to break into their own data centres because they couldn't authenticate themselves on the very network that was down.