Software Engineering F&*K Up Behind The Passport E-gate Failure

31K views

The UK passport e-gate failure was a big news story at the beginning of May: a significant software failure that caused disruption and long delays for travellers entering the UK.
In this episode, Dave Farley talks about the issue, how poor software engineering led to this, how distributed systems come into it all, how to avoid something like this happening again and at the end of the video asks for answers on some concerning issues around the whole story.
-
FREE 'How To Evolve Your Software Architecture' Guide:
How to work in ways that keep stuff easy to change which gives you the freedom to make mistakes and experiment and how to work in small steps that allow you to determine their fit for your present understanding of the problem... continuously. All are explained in this FREE compact guide. Download HERE ➡️ www.subscribepage.com/evolve-...
-
⭐ PATREON:
Join the Continuous Delivery community and access extra perks & content! ➡️ bit.ly/ContinuousDeliveryPatreon
-
🗣️ THE ENGINEERING ROOM PODCAST:
Apple - apple.co/43s2e0h
Spotify - spoti.fi/3VqZVIV
Amazon - amzn.to/43nkkRl
Audible - bit.ly/TERaudible
-
👕 T-SHIRTS:
A fan of the T-shirts I wear in my videos? Grab your own, at reduced prices EXCLUSIVE TO CONTINUOUS DELIVERY FOLLOWERS! Get money off the already reasonably priced t-shirts!
🔗 Check out their collection HERE: ➡️ bit.ly/3Uby9iA
🚨 DON'T FORGET TO USE THIS DISCOUNT CODE: ContinuousDelivery
-
🔗 LINKS:
🔗 "E-gate outage lessons 'must be learnt' - BBC" ➡️ www.bbc.co.uk/news/articles/c...
🔗 "Queues as electronic passport gate technology fails at uk airports - The Guardian" ➡️ www.theguardian.com/world/art...
🔗 "E-gate border chaos sparked when Home Office failed to tell BT it was updating software" ➡️ www.telegraph.co.uk/news/2024...
-
BOOKS:
📖 Dave’s NEW BOOK "Modern Software Engineering" is available as paperback, or kindle here ➡️ amzn.to/3DwdwT3
and NOW as an AUDIOBOOK available on iTunes, Amazon and Audible.
📖 The original, award-winning "Continuous Delivery" book by Dave Farley and Jez Humble ➡️ amzn.to/2WxRYmx
📖 "Continuous Delivery Pipelines" by Dave Farley
Paperback ➡️ amzn.to/3gIULlA
ebook version ➡️ leanpub.com/cd-pipelines
NOTE: If you click on one of the Amazon Affiliate links and buy the book, Continuous Delivery Ltd. will get a small fee for the recommendation with NO increase in cost to you.
-
CHANNEL SPONSORS:
Equal Experts is a product software development consultancy with a network of over 1,000 experienced technology consultants globally. They increase the pace of innovation by using modern software engineering practices that embrace Continuous Delivery, Security, and Operability from the outset ➡️ bit.ly/3ASy8n0
TransFICC provides low-latency connectivity, automated trading workflows and e-trading systems for Fixed Income and Derivatives. TransFICC resolves the issue of market fragmentation by providing banks and asset managers with a unified low-latency, robust and scalable API, which provides connectivity to multiple trading venues while supporting numerous complex workflows across asset classes such as Rates and Credit Bonds, Repos, Mortgage-Backed Securities and Interest Rate Swaps ➡️ transficc.com
Semaphore is a CI/CD platform that allows you to confidently and quickly ship quality code. Trusted by leading global engineering teams at Confluent, BetterUp, and Indeed, Semaphore sets new benchmarks in technological productivity and excellence. Find out more ➡️ bit.ly/CDSemaphore
#softwareengineer #developer

Science

Published: 28 May 2024

Comments: 193

@ContinuousDelivery 1 month ago

FREE 'How To Evolve Your Software Architecture' Guide: How to work in ways that keep stuff easy to change which gives you the freedom to make mistakes and experiment and how to work in small steps that allow you to determine their fit for your present understanding of the problem... continuously. All explained in this FREE compact guide. Download HERE ➡ www.subscribepage.com/evolve-your-architecture

@dxhelios7902 29 days ago

This video is bad advice, or at least incomplete.
1. It is a problem of service management first, because the failure of the network was either not evaluated or not tested.
2. Sometimes I do not want resiliency - I want failure. If the system is failing, it could be a bug, a design bug, or a security breach. If it is a design bug or security breach, resiliency may result in leaking or corrupting a lot of data, potentially customer data. It is better to fail than to continue.
3. How easily you dismissed the security issue. Why?
4. Usage of cached data may not be suitable. If a bad person must be stopped now, then cached data is not an option. Usage of cached data introduces replication, invalidation, and privacy challenges. This can be exploited. This introduces more things that can break. And if all things can break, then advising to do better design, to have caches, to have secondary channels conceptually does not solve anything.
As we are on a continuous delivery channel, you should know about ITSM practices, ITIL, MOF... They all focus on continuous improvement, besides other things. There are critsit trainings for that, and business continuity plans for that. It is ITSM - not another component or piece of code that fails.

@Microphunktv-jb3kj 28 days ago

... as estonian i'm just... "what on earth are those brits yapping about... " that system is Mid level difficulty , how can you fail in system like that.... :D not even superdifficult i guess, estonians are just on another level in state-level e-services and infra , e-services have never been down :D .. in the rest of the world, u watch Weather Report from news.. in my country... in weather report , there's cyberattack reports , how many times attackers have failed this week : ))))))))))

@Dorgrin 29 days ago

I thought it'd be an expired TLS certificate, but "we went over quota" is far more hilarious.

@Phil-D83 29 days ago

Yupp 🤦‍♂️

@LewisCowles 28 days ago

Was it a SaaS system? Like, what do they mean "we went over quota"? Quota for what?

@superjugy 29 days ago

I am the type of person that always thinks "But what if". My wife always tells me I'm too negative and that I have anxiety, and I felt bad. Now, after watching your videos, I know that I'm just an engineer doing my job!

@MoiMagnus1er 29 days ago

A big difference between anxiety and "engineer mindset" is that when you're anxious, you haven't fully accepted that things will eventually break. You suffer mentally because the solutions you find are never enough to reach perfection. You don't just try to mitigate problems and make more resilient choices while calmly considering edge cases; you torture yourself about the edge cases you don't know how to perfectly solve. Also don't confuse "engineer mindset" with pessimism. The engineer mindset is aware of the statistics, and does not claim that it will go wrong every single time. It instead claims that it will eventually go wrong given enough time and that we should be ready when it does.

@gan314159 29 days ago

you'd do well in a role on risk assessment or security modelling. i've come to realise that it can be used to move from anxiety (which is rather nebulous) to something you can address, ie a skill that can be highly employable.

@superjugy 29 days ago

@@MoiMagnus1er yeah. I know that there can be anxious and pessimistic persons and that is not the same, exactly as you described. While I can have those kinds of thoughts from time to time, in general I just try to plan for everything, and that drives my wife crazy since she's more of a "live the moment" kind of person. And sometimes I genuinely thought "is it wrong to plan for everything? Should I live the moment? Is it anxiety?" But I think now I'm at peace knowing it's just my engineering side wanting the best outcome possible.

@Microphunktv-jb3kj 28 days ago

The more intelligence you have... the more paranoid you usually are and many "what if" and anxiety and overthinking... always thinking of the worst case scenario lol...

@mellusk9194 29 days ago

It bugs the hell out of me when people use WiFi as a replacement for cables. The only reason to use it is for something that's mobile; if you have one of these kiosks, which will almost certainly reside in the same place for most of its life, there's no reason not to run an ethernet cable to it.

@mrpocock 29 days ago

Even then, a switch could die. It is better than WiFi but can still fail.

@mellusk9194 29 days ago

@@mrpocock true enough....you still gotta plan for those kinds of things.

@EwanMarshall 29 days ago

@@mrpocock Multiple redundant NICs connected to multiple redundant switches is possible. Of course, there is the Facebook case as well: they managed to reconfigure the whole network into a nonfunctional state, including the switches serving the door access controllers for the room they needed to get into to recover locally.

@mrpocock 29 days ago

@@EwanMarshall when I worked on the human genome project, we had redundant power. Two cables went in opposite directions to two different power stations. No shared cable. Everything is overkill until it isn't.

@EwanMarshall 28 days ago

@@mrpocock Oh, I know :D And it can still find a way to fail.

@andreaszetterstrom7418 29 days ago

If we want to cut the developers some slack, they may have thought of all these things, but then got stopped by something else such as GDPR, bureaucracy or other rules that prevent local copies of data. The POI terminals are likely susceptible to hacking, for example.

@MrSanchezHD 29 days ago

I work in this industry (document and biometric scanning for govts) and GDPR + bureaucracy is indeed very likely to be the cause. The team creating the e-gates software were likely not allowed to create a system that has full read access to the "bad people list", and thus lookup would likely only be allowed on a per-API-call basis and only by authorised clients via a Home Office-controlled privileged API of some sort. All to prevent the entire list from leaking somewhere. You can probably mitigate the hacking of POI terminals by having the terminals connect to a central hub server application in a "secure" room in each airport (so e-gates using the local area network instead of the WAN). That central hub (per airport) could contain a local cache. The odds of an intranet/LAN going down are significantly smaller, and a failure would limit damage to 1 airport. The hard part in such a project is justifying that each airport should be allowed its own full-cache copy of the sacred bad people list just because of an event that people would deem "unlikely", but that is what resiliency is about. Bureaucracy is tough though.
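A rough sketch of the per-airport hub idea described above, in Python. Nothing here reflects how the real e-gates work: `AirportHub`, `central_lookup` and `sync` are hypothetical names, and the sketch assumes the hub is permitted to hold a replicated copy of the list at all, which is exactly the point under dispute.

```python
from typing import Callable, Iterable, Set


class AirportHub:
    """Hypothetical per-airport hub sitting between the gates (LAN) and the central API (WAN)."""

    def __init__(self, central_lookup: Callable[[str], bool]) -> None:
        # central_lookup is an assumed stand-in for the privileged central API call
        self.central_lookup = central_lookup
        self.local_watchlist: Set[str] = set()

    def sync(self, entries: Iterable[str]) -> None:
        # Replace the local copy with a fresh snapshot from the central system
        self.local_watchlist = set(entries)

    def is_flagged(self, doc_id: str) -> bool:
        try:
            # Normal path: the central system stays the source of truth
            return self.central_lookup(doc_id)
        except ConnectionError:
            # WAN outage: answer from the local copy so the gates keep working
            return doc_id in self.local_watchlist
```

With this shape a WAN outage degrades one airport to its last snapshot instead of stopping every gate, at the price of having to secure and refresh that snapshot.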

@monad_tcp 29 days ago

In this case it would be better to not have any software at all. I'm not cutting the developers any slack.

@e.m.aseguin9401 29 days ago

I am. Not sure, but it seems that there has been only one incident with the e-gates since 2008, plus the issue comes from a Home Office “blunder”. I would say not too bad.

@robervaldo4633 29 days ago

even pictures can be hashed and the hash be used for comparisons without disclosing the original

@ForgottenKnight1 29 days ago

Unfortunately, that is correct. When you see a system failing, you only see effect, you have no idea about the story behind. GDPR is good and should be applied to all systems processing personal information.

@SirBenJamin_ 29 days ago

It's easy to make assumptions, but I'm going to guess the security requirements are more complicated/restricted than people are thinking. We don't know that a local cache was even allowed.

@charlesd4572 29 days ago

Indeed, and if you did take this approach you've just created another set of possible failure points. What happens if your cache fails to update often enough? You could be letting people through that you shouldn't be. At least the original failure prevented this.
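One common way to bound the stale-cache risk raised here is to give the local copy a maximum age and stop trusting it once that age is exceeded. The sketch below is illustrative only: `BoundedStalenessCache`, the decision strings and the idea of handing over to a manual check are assumptions, not details of the real system.

```python
import time
from typing import Callable, FrozenSet, Iterable, Optional


class BoundedStalenessCache:
    """Hypothetical local list that refuses to answer once it is older than max_age_seconds."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_seconds = max_age_seconds
        self.clock = clock                      # injectable so the behaviour can be tested
        self.entries: FrozenSet[str] = frozenset()
        self.synced_at: Optional[float] = None  # None means "never synced"

    def sync(self, entries: Iterable[str]) -> None:
        self.entries = frozenset(entries)
        self.synced_at = self.clock()

    def decide(self, doc_id: str) -> str:
        too_old = (self.synced_at is None
                   or self.clock() - self.synced_at > self.max_age_seconds)
        if too_old:
            # Fail closed: hand over to a human rather than trust stale data
            return "MANUAL_CHECK"
        return "REFER" if doc_id in self.entries else "PROCEED"
```

The gate keeps working through short outages, and falls back to the same behaviour as the original failure only when the copy has become older than whatever limit the risk owners accept.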

@inferzard 29 days ago

@@charlesd4572 1000% this

@thatnod 29 days ago

Almost certainly NOT allowed. Despite it being a list of 'bad people', it's not okay for that list to be distributed or risk being leaked. So can't be brought to the edge devices for security reasons alone. However, there could still be multiple secure replicated, distributed copies available at fall back locations should one become unavailable for any reason. A single copy isn't just a single point of failure, it's also a bottleneck and the pressure on it is likely to be the reason it fails.

@PavelHenkin 29 days ago

Yeah, I guarantee that they considered it and were told it wasn't an option.

@robervaldo4633 29 days ago

Even pictures can be hashed and the hash be used for comparisons without disclosing the original.
Regarding the "outdated cache" problem, it seems they fell back to manual checks, which probably means even more out-of-date data and a more error-prone process.

@_Mentat 29 days ago

It often boils down to, "we took the cheapest quote." Local copies require extra nonce in each gate which means more cash.

@ContinuousDelivery 29 days ago

That assumes that a good design costs more, I am not so sure that that is true, particularly when the kinds of orgs that get government contracts are usually massively overstaffed with teams of hundreds and sometimes thousands of people for a project that could, and should, be built by 20 or 30 people. Sure the procurement approach for such projects is broken and a problem, but that goes a lot further than the "cheapest bidder" problem I think.

@TimJW 26 days ago

Mainly decision makers not talking to or believing engineers, accepting declared risks out of hand with no intention of addressing them because it moves their plan forward...

@nickjcresswell 29 days ago

Michael Lopp - AKA Rands - wrote a book called Managing Humans around 2008. In it he describes two distinct groups of people you might meet in software: incrementalists and completionists. The incrementalist will experiment, validate and try to improve an application based on their learnings of it. The completionist thinks everything can be designed up front, but needs all the information and requirements to do so. I believe completionist thinking leads to the kinds of problems in the video - and it is very widespread, because most of us who were schooled in the western world are bred and educated as completionists!

@ContinuousDelivery 29 days ago

I am definitely an incrementalist, and don't believe in creationism for complex systems.

@joeldorrington7898 29 days ago

I like that. I'm gonna remember to look out for completionists. A coworker of mine was like that and just couldn't cope.

@philipoakley5498 29 days ago

Isn't there a similar dichotomy between those who need plans and those who accept uncertainty (e.g. Philippa Perry's Guardian article, Sun 19 May 2024), and in how they cope in different ways?

@brownhorsesoftware3605 29 days ago

In my experience this is agile vs waterfall or (since I learned programming at a large electric utility and then designed and implemented an ATM system at my next job) an engineering vs a bureaucratic approach.

@muhdiversity7409 25 days ago

God, I hope none of you guys ever build a skyscraper that humans live in. No wonder software gets worse by the day.

@laci272 29 days ago

I can't imagine how someone could design an e-gate system without local caching... I've made gym entry software that had local caches ON THE GATE so wifi doesn't affect it.. and updates were pushed via mqtt to each gate... and these e-gate systems, i assume are made for many many millions.. it's a shame..
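For readers unfamiliar with the pattern, a gate-side cache fed by pushed updates might look something like the sketch below. It does not use a real MQTT client: `on_message` is a hypothetical handler that a client callback would call with the raw payload, and the JSON message shape (`version`, `op`, `id`) is an assumption made for illustration.

```python
import json
from typing import Set


class GateCache:
    """Hypothetical on-gate cache that applies versioned updates pushed by a broker."""

    def __init__(self) -> None:
        self.version = 0            # version of the last update applied
        self.blocked: Set[str] = set()

    def on_message(self, payload: bytes) -> str:
        msg = json.loads(payload)
        version = msg["version"]
        if version <= self.version:
            return "duplicate"      # already applied, safe to ignore
        if version != self.version + 1:
            return "gap"            # an update was missed: caller should request a full snapshot
        if msg["op"] == "add":
            self.blocked.add(msg["id"])
        elif msg["op"] == "remove":
            self.blocked.discard(msg["id"])
        self.version = version
        return "applied"

    def may_enter(self, member_id: str) -> bool:
        # Answered entirely from local state, so a network drop does not stop the gate
        return member_id not in self.blocked
```

The version check is what lets the gate notice that it missed something while offline, rather than silently carrying on with an incomplete list.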

@andreaszetterstrom7418 29 days ago

Maybe they did think of all those things but were stopped from doing them due to other reasons such as GDPR, bureaucracy or other rules. The POI terminals are vulnerable to hacking after all.

@TonyWhitley 29 days ago

I'm sure the data could be anonymised - the e-gates don't need any detailed information, just "Is this GUID in the bad people cache?".
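The "is this identifier in the cache?" idea can be sketched with keyed hashes, so that the local copy holds opaque tokens rather than names or document numbers. This is only an illustration: `token_for`, `build_token_set` and `is_listed` are hypothetical names, and the sketch assumes a secret key can be provisioned to each gate, which brings its own key-management problem.

```python
import hashlib
import hmac
from typing import Iterable, Set


def token_for(doc_id: str, key: bytes) -> str:
    # Keyed hash (HMAC-SHA256): without the key, tokens cannot be recomputed from guessed IDs
    return hmac.new(key, doc_id.encode("utf-8"), hashlib.sha256).hexdigest()


def build_token_set(doc_ids: Iterable[str], key: bytes) -> Set[str]:
    # Done centrally; only the resulting tokens would be shipped to the gates
    return {token_for(doc_id, key) for doc_id in doc_ids}


def is_listed(doc_id: str, tokens: Set[str], key: bytes) -> bool:
    # Done at the gate: hash the presented document ID and test membership
    return token_for(doc_id, key) in tokens
```

A plain unkeyed hash would be weaker, because document numbers come from a small enough space to be enumerated. Even with a key, a gate that holds the key can still be used to test guesses one at a time, so this narrows what a leaked copy reveals but does not remove the concern raised elsewhere in this thread.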

@MrC0MPUT3R 29 days ago

There's an amazingly large number of software developers who just don't think about their application's connectivity at all. I say this as a web developer whose coworkers are completely fine with making 10 different API calls to gather information instead of just keeping it cached.

@MrLampbus 29 days ago

Many years ago, ticket controlled passenger barriers were installed in Newcastle Central Station. I travelled several times a week. As an experiment, for many months after the barriers went live, I used the same old ticket to pass through. I expect either a software / requirements bug, or incomplete development ... or problems getting some link to a back office database. The interesting thing about the passport gates is that there is actually no need for the data to be fully up to date to the millisecond, or even the hour. It is all about limiting risk to the protected territory, not eliminating the risk. (Although I suspect that many do not understand the difference). The system may also be logging arrivals and departures of all individuals as well as other "features" in addition to a "bad actor" lookup.

@ContinuousDelivery 29 days ago

Yes, and all completely solvable problems with a truly distributed design.

@Exiide89 29 days ago

Thanks for the book.

@brownhorsesoftware3605 29 days ago

Thanks for another excellent video! You are singing my tune. The basis of good design is making no assumptions. Always testing for and handling failure is the key to robust code. That is what queues are for...

@Һагараӏъ 29 days ago

Erlang is a great tool for running those e-gates. Embrace the failure, live with it, anticipate it, crash now and catch it later (in a matter of ms) using the supervision trees and distributed computing Erlang provides, among many other features. Great talk, sir!

@bepamungkas 29 days ago

Erlang is also designed as a distributed system where the client is the source of truth. So no matter how many retries, there is a guarantee of correct input. E.g., multiple cell towers could handle a call, and whichever failed to handshake first would crash. It will in no way affect the correctness of the entire system, only cause a momentary capacity issue. When it's the reverse, you can't really rely on retry, because there's no guarantee that the crash condition would either repeat or be resolved in a timely manner.

@tristanmills4948 25 days ago

In the early 2000s I was asked in an interview to look at a system and say how it could go wrong, and how I'd mitigate against it (or what could not be and we'd accept failure gracefully). I haven't been asked anything like that since. Just a lot of questions about algorithms I'm never going to implement, or design a system from scratch in 45 minutes with no time to really dive into all the failure modes.

@leemauger6610 29 days ago

It is very likely the bad person list is Top Secret classified and can't be distributed in the way you propose (otherwise every eGate / cache would be a potential leakage point). They will probably be doing an API call using the name of the person at the eGate to get approval.

@raymitchell9736 29 days ago

There's too much "happy path" engineering these days. Those AI phone bots are top of my list of things to be annoyed at lately... They aren't, as they put it, "able to understand you like a regular person"; they easily get tripped up with keywords and make the wrong assumptions about why you're calling in. Each time it got it 100% wrong... and wading through the system and getting transferred from department to department just added to the frustration. Cable companies and insurance companies... I'm talking to you. Furthermore, I believe customer service is dead, and then you add the f*&K up engineering on top and you have the perfect storm. I want to live under a rock, preferably on the Moon... although lack of oxygen might be a problem... Well, let's have Happy Path engineering solve it... Shall we?!?

@TonyWhitley 29 days ago

My local NHS trust has an automated switchboard that uses the keypad to narrow your call down to which hospital but then switches to voice recognition for which department it finally directs you to. Unfortunately it can't distinguish between "Neurology" and "Urology" 🙄

@NicholasOrlowski 29 days ago

Getting a hold of that "bad person list" would be a pretty valuable hack and much easier if there are hundreds of copies outside of a secure area... sounds like a bad idea

@ContinuousDelivery 29 days ago

An easy problem to solve, in a way that is at least as secure as a central DB.

@NicholasOrlowski 29 days ago

@@ContinuousDelivery That's easy to claim. Admittedly we don't know what their security requirements are, but would you let copies of even hashed/salted passwords replicate to unsupervised, unsecured workstations in a public space? These are national secrets. How would this be easy to solve from a security standpoint? I'm not excusing their faults - I'm doubting that replicating national defense secrets in a public space is a good idea

@TK-UK 29 days ago

@@ContinuousDelivery Come on now Dave, a single copy is easier to secure than multiple copies, which are inherently more risky and need to be updated, so the risk factor is even greater. Let's not start talking rubbish to defend software.

@bepamungkas 29 days ago

@TK-UK with multiple local copies, you only need to secure the hosts. With a single copy, you also need to secure the transit. Depending on the use case and local infrastructure, multiple copies are not always less secure.

@jonathanmarino7968 29 days ago

@@bepamungkas True, whether you have a single or multiple copies, the gates need access to the data. If there's a risk of a copy on a gate being exposed, then there's also a risk of the gate itself being compromised and used to extract the data.

@ghhoward 28 days ago

As an American IT tech hearing that a government agency was using a contract limited, BT Wi-Fi service just sounds crazy full stop.

@RenoirB 29 days ago

Amazing. Gold stuff.

@ContinuousDelivery 29 days ago

Glad you enjoyed it

@Apipoulai 29 days ago

Sounds like someone made the decision to have only one source of truth (considering PII is being looked up). If one does not trust storage at rest in caching systems, then this is what managers often decide. They should at the very least have had a redundant communications network though. Another issue I often see is that management is wary or downright scared of eventual consistency: imagine a database was updated to include a new unwanted person and the gate wasn't updated in time.... Everything involves risk, but if you are optimising to prevent that one small edge case to the detriment of the entire stability of the solution, YOU ARE DOING IT WRONG :P Thanks for another great thought-provoking video!

@ianflint4610 27 days ago

Lots to unpack here. Replicating data into a local cache will reduce or remove the immediate impact of back-end system failures, but it throws up other problems beyond update latency on the local cache. You now have the opportunity for secure data to be accessed and manipulated away from the central security. You have created a weak point that could potentially be exploited. One would hope that you can harden access to this local cache to prevent that, but you've created a new design issue and weakness that has to be assessed and fixed.
Availability is a key issue, but system security integrity also has to be maintained. I would envisage a capability to put an all-stop on processing. This could be, for example, to maintain security and integrity - something that likely would never be acknowledged. That 'need' would likely have a higher priority than end-user inconvenience. That said, any networking for this kind of application must be on secure infrastructure only (which absolutely excludes the use of WiFi at any point). The problem is that integrators and systems maintainers tend to put in back-doors to make their day-to-day life easier. I know this happens. In scenarios you wouldn't believe.
Just putting 'exclusion' data locally creates potential problems when endpoints become detached and stacked updates then have to be fed back to the central system. Not insurmountable, but another set of problems to be overcome to ensure latent updates are correctly applied. There will be an awful lot of processing going on behind the scenes to provide, for example, tracking alerts to many UK security agencies. Processing that has to be done in real time. It is a bit more involved than tracking stock and sales transactions in a retail environment.
Could the e-Gate system do the same as the laptops used instead? Probably, but there may also be operational reasons for falling back to a people-based system.
But I suspect the actual cause was budget limitations, considering the cost of a more flexible higher-availability solution. A 'safe' system that has a 95% availability target was all that the budget allowed.

@john_critchley 25 days ago

I like the idea that we can and do take components out as a matter of course, and even that all components can and should be upgraded while the platform is online - and there should be no impact on the users...

@josiah5776 29 days ago

My former, and final, employer had a mainframe database as a single point of failure. Almost everything in the entire company that depended upon data ultimately relied on the mainframe. Adding insult to injury, the mainframe had to be taken down for several hours every weekend for maintenance. That in turn, shut down every dependent system, including the company ecommerce website which promised 99.99% uptime to business partners.
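The gap between that promise and that maintenance window is easy to put numbers on. The arithmetic below is a small sketch; the comment only says "several hours", so the three-hour weekly outage used in the usage note is an assumed figure, and the function names are made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per (non-leap) year implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def availability_with_weekly_outage(outage_hours: float) -> float:
    """Best-case availability if the system is down outage_hours every week."""
    return 1 - outage_hours / HOURS_PER_WEEK
```

A 99.99% target leaves about 52.6 minutes of downtime for the whole year, while an assumed three-hour outage every weekend caps availability at roughly 98.2% before anything unplanned goes wrong.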

@marcbotnope1728 28 days ago

This is not a software developer problem, this is a management problem.

@ivanmaglica264 28 days ago

1. Why are those machines on WiFi? A deauth attack could disrupt the whole system. Critical infrastructure MUST be wired. Period!
2. Why is WiFi metered, especially for critical infra?
3. Why is WiFi not managed by the airport or at least an IT company?

@gammalgris2497 29 days ago

Some organisations tend to split up their teams too rigidly and too strictly into isolated teams. Things tend to break when teams don't communicate enough. This is an additional level that makes distributed systems hard to maintain.

@edgeeffect 29 days ago

Complex IT systems and governments REALLY REALLY do not mix. ;)

@josiah5776 29 days ago

That must have been a fun "all hands on deck" video conference. And by fun, I mean not fun.

@nofatchicks6 29 days ago

How did the manual checks continue, if this was the error? There must be some resiliency in the manual process, so why wasn't this brought through to the e-gates? Edit: I posted this before 15:15 😊😊

@gan314159 29 days ago

Simian army, and the MS STRIDE threat modelling could have identified this, and then adoption of something like the BitTorrent protocol (you wouldn't steal a list of those not allowed into the country) to provide resilience around node failure could give peer to peer sharing between gates automatically if they can see each other on any IP network. Local DB for actual processing and....it....should...work?

@pierrelautrou1210 29 days ago

I agree with everything that was said in this video, but I would add another point to consider: cost. Increasing the resilience of a system always comes at a cost, whether it be increased complexity, reduced capabilities (which are both mentioned in the video) or increased financial cost. It's a balancing act between resilience and cost. I believe it is necessary to perform a risk analysis to determine where to place the cursor between resilience and cost.

@esakoivuniemi 29 days ago

I agree, sort of. I think it's a balancing act between the cost of failures and the cost of resilience. In a distributed system, network outages are guaranteed to happen. The only unknowns are how often and for how long they'll occur. But as you said, a good risk analysis should give you an idea of the cost of failures. Of course, there's also the small matter of who will pay those failure costs. It's not always the same organization that acquires the system.

@chudchadanstud 29 days ago

So how did manual gate manage to get people through if the automated system's connection failed? Don't they have access to the same db? If not why isn't it integrated into the egates?

@bernardobuffa2391 29 days ago

The T-shirt: is this the robot from The Hitchhiker's Guide to the Galaxy? LOL

@ContinuousDelivery 29 days ago

Yup! Marvin the Paranoid Android.

@igiannakas 28 days ago

It’s the £££. Hardening a system with extra resilience adds development time and infrastructure. At some point a line is drawn where the budget / cost vs benefit trade-off is deemed acceptable and anything that may happen rarely is just accepted as a risk. That’s the reality, unfortunately. While there are several solutions to the problem that happened, at some point they were deemed not desirable due to the financial impact on the project. I.e. a dual network link from two different providers routed through different pathways costs more than not having it. A local cache or local server that would need to meet XYZ data protection requirements in each site costs more (surveillance, cages etc). Redundant power costs more, and so forth. So at some point there’s a line drawn in the reliability department and the rest is accepted as collateral damage, especially in massively over-budget public sector projects.
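On the software side, using a second link is the cheap part, which supports the point that the real cost sits in the infrastructure and contracts. A minimal sketch follows, where `send_with_failover` is a hypothetical helper and each "link" is assumed to be a callable that performs the request over one provider's connection.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def send_with_failover(links: Sequence[Callable[[], T]]) -> T:
    """Try each link in order and return the first successful result."""
    last_error: Optional[ConnectionError] = None
    for link in links:
        try:
            return link()
        except ConnectionError as exc:
            # Remember the failure and move on to the next provider
            last_error = exc
    # Every link failed (or none were configured)
    raise ConnectionError("all links failed") from last_error
```

The sketch only helps if the two links really are independent, meaning different providers and different physical routes, which is the expensive part the comment describes.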

@pdr. 29 days ago

It's a lot less sophisticated than you think. Even the photo recognition is done by the humans at the desk behind the gates.

@daverooneyca 29 days ago

A government agency with a mission (and arguably safety) critical system that has a freaking Wifi usage cap in the contract?! While I absolutely agree with every engineering point in this video, I have to think that there's a bean-counting procurement issue here as well. 🙄

@JustLikeBuildingThings 29 days ago

The problem is not software...

@uberseehandel 29 days ago

Another example of egregious British information system design and implementation. Other examples: neither the Passport Office nor the NHS can handle surnames with full stops or apostrophes in them, such as St.George or d'Ath. This issue is common throughout the British private sector and local government as well. To be fair, HMRC and DWP get it right, which causes problems when trying to match individuals between different systems. "Losing" records is another common failing, so DVLA can't find records relating to driving licences, for example. One of the issues I most frequently encounter in the UK is the non-involvement of senior people who have actual knowledge of the real-world requirements. YABSS (think about it).
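Both halves of this complaint, rejecting punctuated surnames and failing to match them across systems, can be handled in a few lines. The sketch below is illustrative only: `is_valid_surname` and `match_key` are hypothetical helpers, and the set of allowed separators (space, full stop, straight and curly apostrophe, hyphen) is an assumption rather than any official rule.

```python
import re
import unicodedata

# Runs of letters (any script), joined by spaces, full stops, apostrophes or hyphens
SURNAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ .'’\-]+[^\W\d_]+)*")


def is_valid_surname(surname: str) -> bool:
    """Accept names such as St.George or d'Ath instead of rejecting the punctuation."""
    return SURNAME_PATTERN.fullmatch(surname) is not None


def match_key(surname: str) -> str:
    """Comparison key that ignores case, spacing and punctuation.

    Intended for matching records between systems that store the same
    name differently, not for display.
    """
    normalized = unicodedata.normalize("NFKC", surname).casefold()
    return "".join(ch for ch in normalized if ch.isalnum())
```

Storing the name exactly as given and comparing on a derived key is one way to let a system that keeps "St.George" agree with one that keeps "St George".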

@JamesSmith-cm7sg 25 days ago

The home office is not going to allow airports to store copies of their data. They're also not going to host their own local cache instance of their API within airport networks, either. This has a security nightmare written all over it and would get shot down immediately. The problem appears to be connection related. If true, they should be looking at using dedicated connections, and have a backup connection.

@quenchize 29 days ago

I still don’t understand how a WiFi failure causes this. Is the master bad person list on a server or laptop that connects by WiFi!?

@pirakaleader2 29 days ago

aye and probably on an excel spreadsheet too

@TK-UK 29 days ago

Blast radius... nice term. I prefer "splash", as I also use this to define different release types: "a splashless release" has no model changes, just perhaps some copy update or minor UI update.

@jamesc7400 29 days ago

When you mention, “imagine or predict what can fail”, I was wondering, do you not have a structured approach or tool to analyze potential failures such as FMEA (Failure Mode and Effects Analysis) during the design process ?

@ContinuousDelivery 29 days ago

My approach isn't at all formal, we simply think about and explore, usually in small groups, all of the ways that we imagine bad things could happen. And we monitor what goes on in production to see if we are hit by any bad things that we didn't imagine and each time we find one of those, we find ways to make sure that the system won't fail in the same way again, and repeat.
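
For readers unfamiliar with the FMEA mentioned in the question above: it is commonly worked by scoring each failure mode for severity, occurrence and detectability (typically 1 to 10 each) and multiplying the three into a Risk Priority Number. The sketch below is only an illustration of that arithmetic; the `FailureMode` class, the `rank` helper and any scores passed to them are hypothetical and do not come from the video or from any real assessment of the e-gate system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet (scores are 1..10)."""
    name: str
    severity: int    # how bad the effect is if it happens
    occurrence: int  # how likely it is to happen
    detection: int   # how hard it is to detect before it causes harm

    def rpn(self) -> int:
        """Risk Priority Number = severity * occurrence * detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("FMEA scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)
```

The ranking only tells a team where to look first; it does not replace the production monitoring described in the reply above.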

@ErazerPT 29 days ago

I can get not distributing the data, as there are many concerns about that, but... not having network link and/or DB redundancy??? I'd say this is gov infrastructure, not some fly-by-night back-alley web-host company, but then I realized THAT is the problem.

@capability-snob 29 days ago

Help, I filled up the local disks with a large cached naughty list and now my gates aren't responding! 😅

@timop6340 29 days ago

This is exactly what has pissed me off in some small-scale hardware products with not-so-good software. Bug reports are downplayed and there is the attitude that when all accepted bugs have been squashed the reliability will be perfect. And my first thought (lacking experience and knowledge) was: shouldn't they have built in the reliability through decisions made much earlier in the process?

@alexischicoine2072 29 days ago

Very interesting. What if the backup list was only used when the connection doesn’t work? Wouldn’t that avoid the eventual consistency problem when things are working?

@ContinuousDelivery 29 days ago

There is no eventual consistency problem when things are working; changes would happen so fast that it would be, effectively, the same as the supposedly synchronous version.
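
The idea raised in the question above (use the backup list only when the connection is down) can be sketched roughly as follows. This is a minimal illustration, not how the real e-gates work: `WatchlistChecker`, `remote_lookup` and `local_copy` are hypothetical names, and the central service is modelled as any callable that raises `ConnectionError` when unreachable.

```python
from typing import Callable, Iterable, Tuple


class WatchlistChecker:
    """Prefers the central service; falls back to a local copy on link failure."""

    def __init__(self, remote_lookup: Callable[[str], bool],
                 local_copy: Iterable[str]):
        self.remote_lookup = remote_lookup  # hypothetical call to the central list
        self.local_copy = set(local_copy)   # last replicated copy held at the gate

    def is_flagged(self, passport_no: str) -> Tuple[bool, str]:
        """Return (flagged, source) so callers can see which data answered."""
        try:
            return self.remote_lookup(passport_no), "remote"
        except ConnectionError:
            # Link is down: answer from the local copy instead of failing.
            return passport_no in self.local_copy, "local"
```

Returning the source alongside the answer keeps the fallback visible, so an operator can tell when gates are running on replicated data.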

@BillsCountrysideAdventures 29 days ago

Greg, CEO of Octopus Energy, shared this :)

@jimhumelsine9187 29 days ago

I suspect the UK Post/Fujitsu. 😁

@flesz_ 24 days ago

Yes, it was definitely a WiFi issue: their servers at the Home Office are connected over WiFi, and one of the employees unplugged the repeater from the power socket to plug in his kettle and make a cuppa

@joseluisvazquez5126 29 days ago

"Everything fails at the same time" = Single Point of Failure = Centralisation. I argue that when you have a Single Point of Failure you must know you are dealing with a Centralised System, no matter how distributed it is disguised to be. It happens everywhere, in engineering and in the economy: what the Fed does affects every country, and what the Fed does is in turn dictated by how much the Treasury wants to overspend. We live in an era of peak centralisation, with too much reliance on a handful of cloud providers and a handful of big governments, all while pretending everything is distributed when in reality it is all tied in lockstep. Decentralisation is the answer to most if not all our pains.

@ChrisM541 29 days ago

Unfortunately, the cheapest bidder tends to get the contract. That inevitably means delivering the absolute minimal agreed specification, and if that doesn't include contingency then what can the developer do? It's not the developer's job to write the specs/URS (though yes, they will have an influence, especially if the purchaser's IT skills are low). Local data? Surely we've cracked that security nut by now?

@justinth83 29 days ago

I'm quite surprised they were relying on WiFi for network connectivity of critical infrastructure. If the software is not designed to be tolerant of faults, then making the infrastructure HA is all the more important to prevent and reduce downtime.

@beaticulous 29 days ago

Hello Joe!

@br3nto 29 days ago

They probably also have people with no software engineering education at the top of the management hierarchy. That is, they probably have a Chief Technology Officer, but not a Chief Software Engineering Officer in the C-suite. They probably have an IT vertical, but not a Software Engineering vertical. Each Software Engineering team is probably part of some random business unit and isolated from the others.

@aliencommander 26 days ago

nice

@wywarren 29 days ago

Sadly though, most contract jobs require line items (often hourly-rate based) in the budget allocation for the quote/invoice. When you have a nominal chunk of it allocated to testing/security, you'll often get pushback from the client/government. A contractor can break down their rigorous testing methods, but usually the one approving the budget/job has no background to comprehensively audit the requirements. Even with submitted documents as part of the acceptance criteria, few have the know-how or willingness to verify that the reports were in fact correct and accurate. After the job wraps up, there is often little to no monitoring deployed either to log and flag any degradations or points of failure. There really need to be more tech/industry consultants to help validate these mission-critical systems, but the tricky part is finding unbiased experts who aren't incentivized by the contractors targeting the jobs.

@BryonLape 29 days ago

The first mistake people made was allowing biometric scanning.

@BillsCountrysideAdventures 29 days ago

You need backup systems to backup systems

@bc4198 29 days ago

Ohhh, shiiiiit. Oopsies! You offering a solution reminds me of my professor, circa Jan. 2001, who used dates as an example of making a class. Without meaning to, the date class we made off the top of our heads would have been immune to Y2K issues 😅

@pnmcosta 29 days ago

Let's be real, someone pressed the Internet button off at the Home Office! 🤣

@SubTroppo 29 days ago

I am reminded of an instance where an incompetent electronics engineer (instrument landing systems) was "kicked upstairs" into the same company's marketing department after making egregious errors in specifications for a project which went ahead. What's the bet that (later on in his glorious career) he was writing and monitoring contracts for the supply of those self-same systems (as the client)? PS: I wonder how many people have actually worked on systems that function as expected?

@Private-GtngxNMBKvYzXyPq 27 days ago

WiFi? For a mission-critical system? With security and privacy data? Competing with other systems for bandwidth? Over quota? C’mon. Seriously?!

@yp5387 29 days ago

Eventually it all comes down to humans. No one person can create a perfect system that is guaranteed to always work. Even if he/she tries to do that, not all developers get on board, and that creates a gap; eventually the person who was originally trying settles for less.

@ContinuousDelivery 28 days ago

...and that is the whole point of this video. It has nothing to do with perfect systems; it is about designing for resilience so that when things do go wrong, the system can cope without total failure.

@michaeledwardharris 26 days ago

Excellent video. It's useful and important to apply actual engineering principles to software development. Proper engineers have a different mindset that most software devs could benefit from adopting.

@tomaskarlkjartansson 29 days ago

Some crazy assumptions being made here, I think. Maybe it behaved exactly as planned: if it loses connection to the server, it shuts down and makes the trained professionals handle the passport check. You make 'keeping everything in sync' sound super simple but it is anything but. I've had to design systems for invoicing customers, and having everything in sync was a super hard problem to solve. Also, this event-driven architecture (which I'm a fan of generally) is subject to all the same failures you mentioned in the App1 and App2 scenario. This is not some gym entry scanner where there is no real consequence if someone slips through; this is the border of a country. Let's say we have a local copy of the list on each machine. Now you have to put in fail-safes to make sure that the list is always up to date; if not, you could have a gate letting in thousands of people with a weeks-old list for all we know. Can you imagine the fallout if a person on the terrorist watch list slips through a system like this (which they probably do for all I know) and they can point to the system and say "Yeah, this machine had not received the updated terrorist list so he managed to pass through"? The system can probably be designed better but, when you are dealing with highly critical systems, in many cases you do want a single point of truth. It is embarrassing and bad enough if you send out wrong invoices to customers (one of the systems I've had to design). Now translate that to your national security and you are in a whole different ball game. Sometimes it's much better that the system just shuts down rather than keeps on chugging with wrong data and potentially causes much more harm.

@justinth83 29 days ago

I agree with this too. What's crazy is they architected a system that is reliant on a DB/API always being available but then didn't build the infrastructure to be highly available.

@ContinuousDelivery 29 days ago

My point is that dropping resilience because it is harder than fragility isn't a good answer. Yes, event-based systems suffer all of the same failure points, but a well-designed event-based system makes it MUCH easier to design in resilience; it is, after all, how databases and virtually all serious financial systems work, oh, as well as telecoms. My point is that these problems are well understood, and have been for a very long time. That some people don't know these answers and choose overly naive solutions seems like the real problem to me. Somebody earlier in the comments called it "Dunning Kruger Architecture" and I think that is a good description. This design was not fit for purpose for a system like this. I disagree with your characterisation that the kind of distributed design I suggest in this video is only suitable for gym membership. Quite the contrary: synchronising all these things is the less stable, more open to attack and more failure-prone solution.

@tomaskarlkjartansson 28 days ago

@ContinuousDelivery The point of these toll booths is to speed up processing at the border. If they cannot look a person up due to some technical failure, they default to stopping work and forcing people to go through the trained border patrol agents who handle this task normally. The point I was trying to get across is that this is a perfectly viable solution in a critical system like the protection of the border. Banks and telecom companies are solving a completely different problem, where availability of service is much more important to them than how long a tourist has to wait at the airport. Besides that, I'm no expert on this, but I'm 99% sure that a banking app would need to have access to some centralized single source of truth before you can transfer funds or pay bills etc. I would be willing to bet a whole lot on them not allowing you to do any meaningful work in the bank app via a cached copy of your accounts. Also, I wasn't saying that your solution is only viable for a gym; the point I was trying to make is that this is the country's border. The stakes are super high in this situation, and for the system to potentially get it wrong is not a good idea. We cannot take one design approach and use it for all system designs. In some (and probably most) scenarios the design you mentioned is the preferred one, I 100% agree with you. But I can think of many situations where just stopping the work and defaulting to a more manual process is a perfectly viable design, e.g. the bank app cannot validate the amount you have in your account and forces you to either wait or go talk to a teller. But finally, I would like to congratulate you on the channel. I enjoy watching your videos and I am subscribing :) And one final thing, just to be clear: I'm not saying that the design of these gates is perfect; I have no authority or knowledge to say so.
All I'm trying to point out is that having a single source of truth, and stopping work when you don't have access to that truth, is a perfectly valid system design.

@Fjonan 29 days ago

Chaos Engineering sounds like fun

@zshn 28 days ago

11:20 Wouldn't it be a data privacy violation if non-personal / commercial edge devices cached personal and private data? The security risks of caching such data on commercial edge devices are huge. It's like saying the card processor at my coffee shop or local restaurant should cache the credit card details of all its customers.

@oussamasaidi5836 29 days ago

Welcome to my life as a person with a third world passport that needs to go through the manual process of validation at every airport in the world (and always randomly selected for a completely random security check)

@bing6740 29 days ago

I'm a software engineer, but my comment isn't tech-related. If you ask a friend who is not from the EU or a Commonwealth country, they'll tell you about the hours of waiting we all experience every single time.

@WhereAreTheSquarePants 29 days ago

Is it just me, or if this were a privately owned system, a banking or trading platform bringing in a lot of money, this failure wouldn't have happened... But since it is public ...

@malavoy1 29 days ago

"How many ways can this conversation fail?" Just ask a married couple.

@megvt08 28 days ago

I do agree with a small portion, but having a cache is not correct, and from a security perspective it is wrong. Let's put it this way: 1. At 2pm the Internet is down and the system has a local cache of 10 people. 2. At 3pm a new person is added as a bad person (the system is still down). 3. The bad person enters the country and uses the e-gate -> Welcome Sir .... The above is very unlikely but can still happen. We are dealing with very, very sensitive data here and I am surprised the idea of a local cache is proposed. I am sure it's not a perfect system, but you need to understand the full system first, before making suggestions.

@charlesd4572 29 days ago

Captain hindsight to the rescue.

@vitalyl1327 29 days ago

Nah. There are formal methods and well-established design protocols that would have highlighted all such issues at the design stage. But the developers were incompetent and should have been flipping burgers instead of being anywhere close to any real-world engineering.

@charlesd4572 29 days ago

@vitalyl1327 I think it is delusional to believe that if you just have the right protocols, all failure points can be spotted. In engineering there are no such things as solutions, just compromises. He seems to be arguing against that reality. We could spend from here to eternity working on that system, and each "solution" to one problem creates a new set of failure points. For example, his "solution" will create other problems: is the cache being updated often enough, or at all? (This could be worse than failure, allowing in people who are dangerous. Would it be worse for the system to bar them because it stopped for an hour and then flag them up afterwards?) BTW, what's wrong with flipping burgers? Earning a living in a thankless task rather than going on benefits should be saluted, not degraded.

@vitalyl1327 29 days ago

@charlesd4572 There is nothing wrong with flipping burgers. It's a decent profession, and it is far more suitable for the self-taughts and bootcamp "graduates" than engineering. There were no *unknown* failure points in this system. Like, in every distributed system every link can go down. Obviously. "Engineers" who fail to use formal methods to evaluate all possible failure modes (based on known individual component failures) and their consequences are not engineers and should not be allowed to design anything at all. I've been building fault-tolerant systems for decades. It really hurts to see what abominations people build when they have no engineering background. A cache is indeed the wrong solution here, and using WiFi for anything even mildly mission-critical is an outright crime. A fail-safe design would have had multiple communication channels and a quick way of detecting issues with them.

@ContinuousDelivery 29 days ago

I don't think it requires much in the way of hindsight. Which really is my whole point. "Let's build a system where every eGate in the country checks in with a single, non-clustered DB to filter out bad people". What could possibly go wrong? The huge mistake here is to assume that because we can never imagine every possible failure scenario, we can't build resilient systems that can cope with most of them, even the unexpected ones. It is unlikely that ANYONE would have predicted that overloading a WiFi service would break access to the DB, but that the DB is a "single point of failure" is obvious from the most superficial overview of this system, and so that case should have been considered in the design.

@charlesd4572 29 days ago

@ContinuousDelivery Thanks for replying, I do enjoy your channel. I have little doubt you would probably have done a better job here, but I dislike the idea of folks talking about solutions in the absolute (it creates a false sense of security); there are only ever compromises. There is never enough time, never enough funding and never enough staff. But even if you had all that, you'd always have failure points, and you don't know whether, by using one "solution" to a problem, you create a worse one. The truth is we just don't know. All you can do is define your specification, implement a solution and stress test the system if you can, and hope you've captured all the most likely and most significant failure points.

@AlecBickerton 29 days ago

Smells like CapGemini or Accenture et al.

@robertlenders8755 29 days ago

The reason these failures are inevitable is no one responsible for the system pays any price for being wrong. On the contrary, when it comes to government failures they inevitably get a bigger budget.

@giannismentz3570 29 days ago

A bigger budget won't fix bad choices/designs; it might patch or hide those issues for a bit until they come up again and you need a yet bigger budget. This is usually the norm in some places, and if no one pays the price it's easy for those contractors to actually go for failure. Gov contracts are very costly for all those reasons: to avoid failures, to design with redundancies, to make sure something works no matter what, etc. If specs are not met and there are no consequences, why have any specs in the first place, or why build according to them if they even exist? And, in some corrupt govs, the contractors can always share the profits with some of the gov overseers, put the money right into their pockets and do very little, or just enough to get a bigger deal soon. This is standard practice in some places.

@mrpocock 29 days ago

You design every single component of a system like this assuming it will usually fail. You design the data messaging assuming data is not reliably transmitted or is duplicated. You ensure that every component fails safe. Or you design the entire system to a golden path at 10% the developer cost.
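
As a rough illustration of "assuming data is not reliably transmitted or is duplicated", one common technique is to number every update and have the receiver ignore repeats and detect gaps. The sketch below is hypothetical: `ListReplica`, its `apply` method and the `"add"`/`"remove"` operations are invented for this example and are not taken from any real e-gate design.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ListReplica:
    """A local replica that applies numbered updates idempotently."""
    entries: Set[str] = field(default_factory=set)
    last_seq: int = 0  # sequence number of the last update applied

    def apply(self, seq: int, op: str, passport_no: str) -> str:
        if seq <= self.last_seq:
            return "duplicate"  # already applied; safe to ignore a re-delivery
        if seq != self.last_seq + 1:
            return "gap"        # an update was lost; caller should request a resync
        if op == "add":
            self.entries.add(passport_no)
        elif op == "remove":
            self.entries.discard(passport_no)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        self.last_seq = seq
        return "applied"
```

Because a re-delivered message is a no-op and a missing one is reported rather than silently skipped, the sender can simply retry until it sees an acknowledgement.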

@ContinuousDelivery 29 days ago

I am sure that that is the calculation that people make, but I don't buy it. I have worked on well designed distributed systems with good people and it didn't take longer, it was quicker, and it didn't cost 10x because dev salaries don't work like that. Actually for systems like these developed by government, they are so overstaffed that I am sure that the kind of teams I worked on would have been 10x cheaper, as well as better.

@mrpocock 29 days ago

@ContinuousDelivery I don't think it costs 10x to do it properly, but that's what it looks like in the project management process if you start listing all the failure modes you can think of and then treat each one as having a development cost in proportion to the golden path. I've worked on some scientific distributed processing code bases where nearly all the code handled faults and recovery. If we'd had to guess a number for the developer time for each failure mode, we'd never have been allocated the hours those numbers said we needed to handle them. Management would have said it was too costly. It wasn't, but tickets and features and user stories inflate it.

@simonabunker 25 days ago

Isn't the problem that this wasn't a distributed system? If there is only one database that they are all reading from, that sounds like a very monolithic system to me. A great example of a really good distributed system is Netflix. There are some great articles about how they tested their system, how they got it to run on AWS, and how they assume servers could disappear at any time.

@rickbates9232 29 days ago

There is an issue with the caching concept ... currency ... you will not pick up bad people added since the cache was last updated ... this could be being updated in real time ... sure you could always try and look up from the true source before the cache ... but then the system owner must accept that a bad person could get through in the event of an outage. What also can happen in these types of systems is that they lose connectivity to the current source, and use the cache and no one notices ... so there needs to be better monitoring of what data source is being used and the system owner needs to manage the trade-off between system availability and the implications of using stale data. A question in this e-gate instance is whether going to manual mitigated the stale data issue ... as in what bad person source did the manual system use ... if any ... again getting the system owners to think through these issues in large organisations can be very tricky as they typically want the best case even in competing requirements. Did you want to talk about the Post Office too?
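
The availability versus stale-data trade-off described above can be made explicit as a policy. The following is a minimal sketch under stated assumptions: `CachedList`, the 15-minute `max_age_seconds` default and the three outcome strings are all hypothetical, chosen only to show a cache that refuses to answer once it is older than an agreed limit.

```python
import time
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class CachedList:
    """A local list that fails closed once it is older than an agreed limit."""
    entries: Set[str] = field(default_factory=set)
    refreshed_at: float = 0.0       # time of the last successful sync
    max_age_seconds: float = 900.0  # assumed policy: 15 minutes

    def refresh(self, entries: Iterable[str], now: Optional[float] = None) -> None:
        self.entries = set(entries)
        self.refreshed_at = time.time() if now is None else now

    def decide(self, passport_no: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        if now - self.refreshed_at > self.max_age_seconds:
            return "manual_check_stale"    # data too old: hand over to an officer
        if passport_no in self.entries:
            return "manual_check_flagged"  # on the list: hand over to an officer
        return "allow"
```

The system owner then chooses a single number, the maximum tolerated age, and the distinct `manual_check_stale` outcome gives monitoring something concrete to alert on.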

@ContinuousDelivery 29 days ago

As I said in the video, it would be interesting to know how the emergency response actually worked. Either they had another route to the bad-person list or they had copies of the data. If they had an alternate route to access the list, why not automate the switch-over to that, rather than rely on staff carrying laptops running through airports? If they had a copy somewhere, then this has all of the same problems/trade-offs as the caching solution I recommend. Caching is a well-understood problem and there are lots of patterns and approaches to make it work, even for distributed caches. None of this is simple, but it is what is required to build distributed systems.
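
Automating the switch-over to an alternate route, as suggested above, might look something like the sketch below. It is an illustration only: `lookup_with_failover` and the idea of passing routes as `(name, callable)` pairs are assumptions made for this example, with each route expected to raise `ConnectionError` or `TimeoutError` when it is unavailable.

```python
from typing import Callable, List, Sequence, Tuple

Route = Tuple[str, Callable[[str], bool]]


def lookup_with_failover(passport_no: str,
                         routes: Sequence[Route]) -> Tuple[bool, str]:
    """Try each route in order; return (flagged, route_name) from the first that answers."""
    errors: List[str] = []
    for name, lookup in routes:
        try:
            return lookup(passport_no), name
        except (ConnectionError, TimeoutError) as exc:
            errors.append(f"{name}: {exc}")  # remember why this route failed
    # Every route failed: surface that clearly so the caller can go manual.
    raise ConnectionError("all routes failed: " + "; ".join(errors))
```

The caller decides what to do when every route fails, so the manual process remains the last resort rather than the first.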

@ashnur 29 days ago

Single source of truth shouldn't mean there is a single location that controls all data. That was never what it meant and only half-wits who can't look further than lay etymology thought it was. The actual single source of truth means that you have a single node in your distributed system that masters some piece of information and others can replicate it. In other words, if it was done as a centralised database, that wasn't a single source of truth, it was just bad architecture.

@philipoakley5498 29 days ago

While a good example of failure, it may be that other factors really, really got in the way of the 'design', such as the secrecy of certain aspects of the said 'bad person' list (oh, people & politics, sigh), but as said, the fallback may have been worse, but more acceptable. And maybe the presumed BT shutdown of a Wi-Fi channel should (that word again) have just changed to the £1/MB charged elsewhere on some mobile plans ;-) Politics & money.. (see Matheson's 11k iPad bill...)

@ContinuousDelivery 29 days ago

I am sure that that was the thinking, but that *is* the design challenge here: how can you achieve all of those requirements while still achieving resilience? I can think of several different approaches that may work, and I am sure that we could find one that did once we had examined all of the constraints. But that is the job! Not simply making a crap system because some parts of it are difficult to achieve otherwise. Our job, for a system like this, should be to solve the hard problems, not just find the naively simple solutions!

@philipoakley5498 24 days ago

@ContinuousDelivery One part of the problem is the generally fallacious belief that we can always create a design that solves _all_ of the problems with just a lead developer or small team. Once the scope of the 'system' reaches the level of "Crapper's brainfull" (the metaphorical last person to completely understand the system they were designing), we continue to find levels of detail that were unknown or unappreciated and that result in such failures. This tree-of-knowledge issue is covered by RE Bohn's "Measuring and Managing Technological Knowledge" paper (1994/1998 Sloan Management Review/ ..), and is also reflected in Conway's Law. It ultimately results in the proliferation of the ideas of Systems Engineering, which also attempt to address the issue. 'Government' systems do tend to have a greater level of complexity ("Wicked System" problems of multiple stakeholders and longer-term issues) than commercial systems with their 'simpler' shareholder focus (which is bad enough..) and PR management. However, I do agree that there should have been more resilience foresight and expectations of the 'if it can go wrong it will go wrong, badly' style.

@IulianOnofrei 29 days ago

But, what if the scanning process itself is done on a server, and the gates merely send the scanned passport images and face pictures to it? In this case, they're completely dumb, and can't possibly work without a server, right?

@ContinuousDelivery 29 days ago

Sure, you could build it that way, but I can't imagine why you'd prefer that, if you could avoid it. Keeping the compute close to the point of need is a pretty useful strategy in distributed systems if you value resilience.

@GDScriptDude 29 days ago

Hit like and subscribe is wrong in terms of UI/UX terminology. It could result in failure of the systems involved such as hitting the screen with a bat.

@dougr550 28 days ago

Is John mistaken or is that just word salad? That kind of felt like word salad.

@scrooge-mcduck 29 days ago

SNAFU

@karlssberg 29 days ago

Time and time again I encounter these distributed snowflake architectures. We should start calling them out for what they really are: Dunning-Kruger Architectures

@ContinuousDelivery 29 days ago

I like the name 🤣🤣

@eduardpopescu9109 29 days ago

Are you sure the way it was designed/implemented was because of a "lack of programming skills" and not a "lack of funding" and/or a "we need it yesterday" approach? Not that that would be an excuse, but still....

@ContinuousDelivery 29 days ago

That assumes that doing a better job would cost more. That's not been my experience. Poor organisation and poor development practices result in poor systems, but not lower costs.

@merridius2006 29 days ago

Perhaps “AI” will fix this

@vitalyl1327 29 days ago

The pre-winter AI definitely could fix it. You know - formal proofs, SMT solvers, etc. All the stuff modern so-called "engineers" who are likely nothing but some bootcamp "graduates" know nothing about and have no mathematical background to even start comprehending.

@murdakah 29 days ago

You make a lot of assumptions for someone who supposedly understands the requirements process, limitations and good design.

@ThePlayerOfGames 28 days ago

It's really cool you have so many technical brains here, but the reason the system doesn't have local caching is that it is a political toy, not a system of technical efficiency. The governments are pandering to the right wing; they are posturing to "keep the bad ones out", which is why a fail-closed system based on a single link rather than a series of reversions is in place. A technically minded person researching this to build it would know that 99% of pickups (getting the baddies) happen >50 nautical miles from any given national border, based on statistics released around 2016 that measured the efficacy of post-9/11 security policies vs recent and older history. Knowing that, a technically minded person like the majority of commentators here would have said that multiple redundant links over different mediums, with a layered system of caches and databases, would have sufficed to keep people moving safely through the port without losing critical data. Sorry, not sorry.

@Kiev-en-3-jours 29 days ago

Enjoy your Brexit guys! 😊

@TA-eo2ww 28 days ago

!!! BUT EVEN HUMBLE LITTLE ME IN MY HOME HAVE UNLIMITED WIFI !!! Why Is UNLIMITED WIFI Not The Default For All Government Agencies?! I Have This Scene In My Mind! "Hang on Lads, Lend Me A Tenner, While I Go Down To Tesco's And Buy A WiFi Card!" I CAN'T BELIEVE IT!!!!!!!!!!!!!😮

@vitalyl1327 29 days ago

Developers who do not use formal verification for critical systems are trash developers. There are no excuses. Ignorance is not an excuse. "Hard to find competent developers" is not an excuse. This industry is teeming with the wrong people who should have never been allowed anywhere close to programming - all those self-taughts, bootcamp "graduates" and such.