I'm pleasantly surprised by how much effort you make to explain the basics before getting to the actual problem! Keep up the good work and keep posting on such fresh topics!
Very nicely done. I am now a subscriber. Let me just contribute a thought or two, please… 1 - BGP is COMPLICATED. Not just anyone can modify it. All kinds of rules, regulations, protocols, etc. 2 - Many networks, including on-premises networks and virtual networks (AWS, GCP, etc.), usually have no need to use or control BGP once they have a contiguous block of IP addresses. 3 - Anyone familiar with Access Control Lists (ACLs) on firewalls will IMMEDIATELY recognize that single line of code as being out of place. There are 3 certainties in life: death, taxes, and the last line of an ACL on a firewall being DENY ALL (or the equivalent).
Great explanation. I would like to know how you spend your day, because with daily family-related tasks and a job on top of that, how do you find time to read all these technical blogs and stay updated? Knowing how you manage time could really help me plan my day.
Thanks for the very significant detail in this post. It shows the urgent need for a crisis-management routine. Thanks, too, to the CF engineers for documenting the cause and recovery process, and especially to their management team for the courage to lay bare their errors. I hope it leads to other companies learning from this. KUDOS TO CLOUDFLARE MANAGEMENT FOR TRANSPARENCY!
Thanks for a great explanation of Cloudflare's root cause analysis! I'm a huge Cloudflare fan and appreciate their transparency on the causes. I can only imagine how embarrassing it was for them. Reminds me of the time I inadvertently dropped the schema of a dev database by committing a bad Hibernate auto-DDL config setting. Stuff happens. :-)
For me the real question is "how can they prevent this from happening again". It would be very interesting if Cloudflare, or other organizations for that matter, share such stuff too.
Using a good colour-coded text editor and having the change peer reviewed by someone who is not sleep deprived would be a start? And always delete and re-add the last term in JunOS unless you are modifying the last term… oh, and commit confirmed in X so it will auto-revert in X mins.
That's right, but a single developer should not be blamed here. A developer cannot deploy code to production just like that; the code goes through several reviews and approvals. Everyone who reviewed and approved this particular change is responsible for this outage.
Dev or network engineer? This is networking, and by the look of it JunOS, so I'm guessing this is IT Ops… Ops usually goes through anxiety and panic at least twice a year.
It's very possible that the reason engineers were stepping on each other's reverts, is that the procedures and tools they would normally use to roll back changes in a controlled way were unavailable due to the outage itself. So they may have been resorting to more hands-on, manual changes, using their best judgement and on-hand knowledge to decide what to do, with on-the-fly oversight. And in this kind of fog of war, sometimes the wrong thing happens.
With so much relying on the Internet working, they probably had no phones (all VoIP, I suspect) and no email or chat programs. Very hard to coordinate the effort with little to no communications.
Great explanation as usual. Having worked for some of the larger cloud services companies, unfortunately I'm not surprised. Lackadaisical approaches to change reviews, single-minded focus on one change without the broader "what else might this affect?" mindset, and a fear among engineers of pointing out potential problems or mistakes make this kind of issue almost guaranteed.
Great explanation - you made the complex easy to understand. Big thumbs up to the openness of the Cloudflare devs and their post-mortem. We can all learn from this.
What a great explanation. I loved the bit where you walked through the chaos that exists amidst a full outage. Engineering meltdowns are glorious and horrifying all at once.
It's the same kind of wrong BGP configuration that brought down Facebook's entire network in 2021. BGP is critical to the internet, but it is fragile. Every AS advertises its routes, and they get forwarded globally. Each AS has to trust the others' configs. And there is no easy way to test this without actually broadcasting, as far as I know.
If someone can explain a complex thing in simple words, then that person really knows what they are talking about. I get that feeling when I see your videos. Thanks a lot for doing the research and explaining it to us.
One benefit of CDNs is that transmitting data over a long distance has large hidden costs: increased latency, bandwidth that self-limits because of that latency, and the datacentre fees you end up paying. Also, static data is usually several orders of magnitude larger than dynamic data, because it's images, videos, fonts, and large JavaScript frameworks, while dynamic data is usually a screenful of text at most, maybe 2 kilobytes. Imagine you could deliver all data from near the user, for the best user experience and lowest transfer costs; but then hosting costs and complexity would multiply. Since a static CDN serves the vast majority of the data from near the user, you get the vast majority of the performance benefit and save costs as well.
This sounds familiar! Properly testing code changes requires simulating the entire production system, which is rarely done. Microsoft eliminated dedicated testers altogether in 2015, putting all responsibility for testing on the developers. Even before that, I had to make a federal case to get full copies of production databases to test against. When I got my way, testing each release eliminated bugs most of the time. It is just hard to comprehend the complexity involved in these systems, and it is hard to accept that at that level of complexity, *bugs are a given*.
I learned a lot here and it was very interesting; thank you for the video and the gradual, clear explanations. All these behind-the-curtain things are fascinating...
Wow, hats off to you, Kaushik. If I had the power, I would have given you the title of Father of Computer Science. You are the best: tech, teaching, coding, superb.
Interesting and very relatable indeed. You can have the best processes in the world, and still, some major problem occurring is not a question of if but of how often. Humans are just not perfect. Keeping the answer to that "how often" really low is the best you can ever hope for.
Based on the CLI commands in the root cause analysis, they are using Juniper routers. When it comes to network equipment there are two options for configuration, and it varies by environment: it can either be a scripted change via something like Ansible or straight Python, or it can be fully manual, giving you the opportunity to double-check your work via show commands.

With Junos (Juniper's OS) we have the concept of a candidate configuration, which is a departure from the Cisco way of doing things. A Cisco configuration takes effect immediately once you hit enter, and your only rollback strategy is to not save the config and reboot the device. With Juniper, the candidate configuration lets you not only review changes before committing them but also do a "commit confirmed": if the changes are not confirmed within the allotted timeframe, the config rolls back automatically.

BGP is an incredibly deep and complex routing protocol (it is the routing protocol of the internet). The selection of which routes are taken is an algorithm. BGP uses route advertisements, and there is indeed a global routing table. Autonomous System Numbers and IP space are allocated and need to be obtained from the relevant regional Internet registry for where the network exists.

I need to stress that this network configuration is not code; it is not programming. You are dealing with physical infrastructure. You can manage the configuration through code, but in the end it is a configuration that exists on network equipment.
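The candidate-configuration workflow described in that comment looks roughly like this (a sketch of the Junos CLI; the prompt and the 10-minute window are illustrative, not taken from the post-mortem):

```
user@router> configure                 # enter configuration mode (edits the candidate config)
[edit]
user@router# show | compare            # diff the candidate against the active config
user@router# commit check              # validate the candidate without activating it
user@router# commit confirmed 10       # activate, but auto-rollback in 10 minutes
user@router# commit                    # confirm within the window to make the change permanent
```

If the `commit confirmed` change cuts off your own management access, you simply lose the session, wait out the timer, and the router reverts itself.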
Another thing on BGP changes: they propagate to your peers, then their peers, then… I think you get where this is going. One little oops in BGP can take a while to undo. I was watching the changes being made when we were switching backbone providers, and one seemingly minor typo brought us grief for hours. I asked about it when I spotted it, but it was already too late to prevent the mess; we just had to fix it and wait. An ISP in a small country made a mistake years ago and killed their country's connections by overloading them, and other parts of the world had nothing too, because the ISP had mistakenly advertised itself as "the shortest path" for most of the world. Once the other BGP systems lost contact, they started to correct the issue for themselves, but it was ugly for a while, and lots of filters were added right after the incident to try to prevent such things from spreading like that again.
Cloudflare is honestly the bane of my life, always hitting these stupid barriers that stop me visiting websites. If I see that 'checking your browser' message I usually just close the tab and find another site to use because it never works.
Happy to have BGP explained by a developer ❤️. Well, I have a suggestion. If they don't want to expose the IP of the server, why don't they put a proxy in front of the real server, where the real IP is stripped off and replaced by the proxy's IP on the way back from the proxy to the user?
Well, it was a great explanation! I have two doubts though. On the "fallback mechanisms" section: why not reroute the incoming request to other locations where copies of the same resources reside? That way the actual source IPs would not be disclosed. This probably requires implementing the config file (where the code that caused the outage lives) in some other way. On the "testing" section: knowing that "reject all the rest" should come at the end and not in between, can't we run a config-file parser/checker/validation script that checks, during parsing, whether the tokens aggregate into "REJECT ALL THE REST" semantics, and enforces that this aggregate sits at the END by itself? It's like validating the config file after it goes through all the code changes (by different people) but before rolling it out to production. GitHub Actions could be used for this, right?
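The validation-script idea from that comment could be sketched like this. This is a toy lint over a hypothetical line-based rule format (Cloudflare's real config syntax is not public), just to show the check is simple to automate:

```python
# Toy lint for an ordered rule list: a catch-all REJECT must be the
# final rule, otherwise every rule after it is unreachable.
# (Hypothetical format; not Cloudflare's actual config syntax.)

def check_reject_last(rules: list[str]) -> list[str]:
    """Return a list of lint errors; an empty list means the config passes."""
    errors = []
    for i, rule in enumerate(rules):
        if rule.strip().upper() == "REJECT ALL" and i != len(rules) - 1:
            errors.append(
                f"rule {i}: 'REJECT ALL' is not the last rule; "
                f"{len(rules) - 1 - i} rule(s) after it are unreachable"
            )
    return errors

good = ["accept 10.0.0.0/8", "accept 192.0.2.0/24", "REJECT ALL"]
bad  = ["accept 10.0.0.0/8", "REJECT ALL", "accept 192.0.2.0/24"]

print(check_reject_last(good))  # []
print(check_reject_last(bad))   # one error about the misplaced REJECT ALL
```

A CI job (GitHub Actions, for instance) could run a check like this against the merged config and fail the build on any error, before the rollout ever starts.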
Really good technical explanation! But you didn't really connect the "storage is cheap and compute is expensive" point back to the topic afterwards, or did I miss it?
2:02 A dynamic site in JavaScript runs on the client; a database-driven site runs on the server side. Neither is cheap in terms of resources: one bogs down your mobile/notebook (with tons of background ads and data-mining crap), the other moves data across (many) servers, does calculations, and sends the results back to the client.
In the firewall world you can have rules that cannot "live" in the middle of your config (like DENY ALL), and trying to commit them will be rejected. For BGP routers it is quite normal to have rollbacks available that execute after X minutes and revert the changes automatically unless some condition is met. And I am pretty sure they can use commit scripts that verify whether a "REJECT" is anywhere but the last rule; this is basic stuff. I mean, losing your management plane in a network like CF's is truly something unbelievable. They should have a separate OoB (out-of-band) network available to manage their devices via console. This all sounds odd to me, but then again, we have no idea about their network setup and we do not get all the details.
Commenting something just to make sure the YouTube algorithm catches it. This is the very first time I am hearing @Kaushik ask us to comment on and like the video 😇. You are beyond these things though 😎
The idea of having multiple smaller code reviews which are then rolled out as one seems OK if the pieces are truly independent of one another. But in the general case they won't be, and then by dividing them up you've just set yourself up for the next big outage, because the interdependencies were missed by the individual reviewers. This critique seems a bit too facile to me.
These things are bound to happen; computer systems and networks are becoming more and more complex. The higher the complexity, the higher the chance that a small part can bring the whole system down. It's not about preventing a similar mistake from happening; it's about how the system can be made robust enough to handle cases like this.
Actually, they can and should always be small. If you are finding it difficult to decompose a code change into fewer than, say, 500 LOC, then your system is too tightly coupled in its design, and the time has come to consider refactoring or decomposing it into smaller microservices.
I speak as somebody who's never worked on such a system, but naively isn't it bad design to have that much interdependence? Shouldn't your code base always be modular enough and with clean splits among well defined interfaces that you can always do piecewise upgrades?
What I don't understand is why a code analysis tool can't either find that particular issue or be modified to find such an issue. Unreachable statements/fallthroughs in a switch block are like a cliché gotcha at this point. You should also have some coded tests as proof of it working too. There's too much at stake to mess this up.
Incomplete rollback plan. Shit happens, things break, and we try to prevent it. But it is unacceptable to do a BGP change without access to the devices that will drop off if the change fails.
Isn't this similar to the FB/IG/WA outage? They messed up BGP, all nodes rejected traffic, and engineers literally had to break into their own data centres because they couldn't authenticate themselves on the very network that was down.