I was working at AT&T Bell Labs in New Jersey when this happened. The person in the office across the hall from me, Dave, a distinguished member of the technical staff (DMTS) and a C expert (editor of the ANSI C standard), found the problem. The original writer of the code had misunderstood how the break statement works in C.
@@kikisbytes Dave and I actually worked on the C compiler team, so it was not our direct responsibility, but Dave was asked to help because of his expertise and the gravity of the situation. He looked over the source code and pretty quickly found the problem.
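For anyone wondering what a misunderstanding of break can look like: in C, break leaves the innermost enclosing loop or switch, never just an if. Below is a minimal sketch of that trap. It is not the AT&T source; the function, enum, and parameter names are all made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is NOT the original switch code. */
enum msg_type { MSG_STATUS, MSG_OTHER };

/* Returns how many of the two follow-up steps actually ran. */
int handle_message(enum msg_type type, bool buffer_busy)
{
    int steps_run = 0;

    switch (type) {
    case MSG_STATUS:
        if (buffer_busy) {
            /* Easy to misread as "leave this if-block".
               In C it leaves the whole switch instead. */
            break;
        }
        steps_run++;   /* stand-in for "process the message" */
        steps_run++;   /* stand-in for "update bookkeeping" */
        break;
    default:
        break;
    }
    return steps_run;
}
```

With buffer_busy set, both follow-up steps are skipped because control jumps past the end of the switch; with it clear, both run.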
I once told a DB dev to set imgs to be 200x200 so he could debug some "empty" sections (they all sucked at css) ... came in next morning to find everyone freaking out about imgs, guess what happened? This v well known company also used some random url rewrite script it bought for $20, which meant if u asked for a jpg that wasn't there it would drop the entire site. One of the largest e-retailers in the UK.
@@kikisbytes The in-house team of .NET developers didn't even consider the url re-write till half way through developing the new site when Wordpress folk got it as standard :D
@@kikisbytes Thinking about it, they screwed me over so I owe NEXT plc no loyalty on this one lol Terrible place where the whole culture is "I didn't break it" and "it passed QA" ... There's zero "let's make something good"
This description is a bit of an oversimplification. The 4ESS proper was coded in EPLX while the 3B20D that ran the CCS7 messaging software was written in C. CCS7 was a message passing overlay that allowed Dynamic Non Hierarchical Routing - all prior routing was based on each switch in the hierarchy kicking a call to the next higher level if the given switch didn't know how to reach the desired destination. The unhandled race condition was in the CCS7 (Common Channel Signaling) - not in the in-band trunk signaling that the 4ESS could and did use as well. I was working at Indian Hill, the Illinois location that handled switching system development, when this happened. Lucky for me I was with 5ESS but the rumor was that the guy who wrote the bad code worked in an office a few hallways down. Would not want to have been him. For those who want to dive into the nitty gritty of circuit-switched voice telephony (now largely replaced by packet based switching), there are multiple issues of the Bell System Technical Journal that devote themselves to the hardware and software design of 4ESS and 5ESS.
My first reaction was "wow that was a fast turnaround on the outage, not even Cloudflare publishes COEs that fast"... then you said 1990. Topical either way!
I took down the staging environment of my company’s data pipeline because the caching layer I wrote had a custom compare sort that didn’t fulfil antisymmetry properly. No one noticed for about two days lol.
@@kikisbytes Yes, we did have a dev env, all unit tests and integration tests passed, and it worked fine there. The issue was insidious as it didn't show up until the second day. It kept chugging along fine for a day until it stopped caching on the second day.
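For readers unfamiliar with the antisymmetry requirement: a comparator must satisfy sign(cmp(a, b)) == -sign(cmp(b, a)) for every pair, which also forces cmp(a, a) to be 0. The actual comparator from the caching layer isn't shown in the comment, so this is only a hypothetical C sketch of one common way to break the rule (cmp_broken), a correct version (cmp_ok), and a small helper to check a pair. All names are invented for illustration.

```c
/* Hypothetical comparators in the style qsort() expects. */

/* Broken: for equal keys, cmp(a, b) and cmp(b, a) are both -1,
   so each element claims to be "less than" the other. */
int cmp_broken(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return x <= y ? -1 : 1;
}

/* Correct three-way comparison: returns -1, 0, or 1. */
int cmp_ok(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static int sign(int v) { return (v > 0) - (v < 0); }

/* Returns 1 if cmp is antisymmetric on this pair of values, else 0. */
int is_antisymmetric(int (*cmp)(const void *, const void *), int a, int b)
{
    return sign(cmp(&a, &b)) == -sign(cmp(&b, &a));
}
```

A check like is_antisymmetric over pairs that include duplicate keys is the kind of property test that can catch this before it reaches staging, since the broken version only misbehaves when two keys compare equal.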
maybe doesn't count because it was a hobby project of mine, but i host a self-made Telegram bot that a decent number of people use. it's a modular system with a whole host of features, ranging from animal pictures, to quote storage, a fake economy, and even some homebrewed generative text ai. i was making a small change to the code about a year ago now (can't quite remember what) and when i went to deploy the changes to the remote server, i ran an outdated deploy script that decided to overwrite the user data folder and then immediately restart the bot. RIP all of those saved settings, statistics, fake tokens, etc. that was a painful announcement to make lol. had the bot not been restarted, the data would've still been in memory and i'm sure that i could've restored it somehow. and of course, no backups :)
This definitely counts and thank you for sharing! And ohh man, that must have been one tough night. Were your users okay with re-signing up for your service afterwards?
@@kikisbytes it was more of a helper bot on a chat platform, so nobody had to re-signup, but they definitely were more than a little peeved at having to reset their settings to what they were before the big wipe, and i was sad at the loss of all that PLUS the super big dataset for the aforementioned generative text ai
@@kikisbytes AT&T had a nationwide outage last week. Took down half the phones in the US for 9 hours. My company got an alert level 1 from them, first any of us have seen. Clicked on this vid because I thought it was about that.
Have I ever brought down prod? Several times... Am I proud? No way, but I actually managed to fix with 99% accuracy at least, plus had lots of luck during that. I learned some key things for sure. Like: never ever run a one-liner cmd thinking it's just deleting a single entity in db which I'm intending to do, especially when u don't know the behaviour of the app DB layer 🙃