The Absurdity of Error Handling: Finding a Purpose for Errors in Safety-Critical SYCL - Erik Tomusk

cppcon.org/
---
The Absurdity of Error Handling: Finding a Purpose for Errors in Safety-Critical SYCL - Erik Tomusk - CppCon 2023
github.com/Cpp...
C++ is hard. Error handling is hard. Safety-critical software is very hard. Combine the three, and you get just one of the exciting problems faced by the SYCL SC working group.
SYCL is one of the most widely supported abstraction layers for programming GPUs and other hardware accelerators using ISO C++. As of March 2023, the Khronos Group has a working group tasked with specifying SYCL SC --- a variant of SYCL that is compatible with safety-critical systems. One of the key features of a safety-critical system is that its behavior must be well understood not just in normal operation, but also in the presence of faults. This raises some difficult technical questions, such as, "How do I implement deterministic error handling?" but also some more philosophical ones, like, “What does an error actually mean, and is the error even theoretically actionable?”
Much of the information on C++ error handling in safety-critical contexts focuses on RTTI and the pitfalls of stack unwinding. Although these are important considerations, I will argue that a far greater problem is a lack of agreement on what safety even means. This talk will focus on how safety in a safety-critical context differs from safety from a programming language design perspective. While the talk is inspired by the pain-points of C++ error handling in safety-critical contexts, the conclusions are relevant to C++ software in general. The talk will challenge the audience to rethink the situations that can be considered erroneous and to carefully consider the expected behavior of their software in the presence of errors.
I am a member of the SYCL SC working group, but this talk will contain my own opinions.
---
Erik Tomusk
Erik Tomusk is the Senior Safety Architect at Codeplay Software, where he is working to bring functional safety to the SYCL API. In a previous role, he spent a few years writing C++ and CMake for Codeplay's OpenCL runtime.
Before joining Codeplay, Erik researched CPU architectures at the University of Edinburgh, and even managed to secure a Ph.D.
---
Videos Filmed & Edited by Bash Films: www.BashFilms.com
YouTube Channel Managed by Digital Medium Ltd: events.digital...
---
Registration for CppCon: cppcon.org/reg...
#cppcon #cppprogramming #cpp

Published: 27 Aug 2024

Comments: 33

@SonicFan535 6 months ago

I was surprised that network connection errors were never brought up as an example where error handling makes sense, since that's the most common case I can think of where the error handling/recovery strategy should be fairly clear-cut for most applications, i.e. shut down the client component that tried to connect, boot the user back to their previous state with an error message, and continue with the rest of the application as normal. Another example would be trying to read a missing file from the filesystem, like if the user of a text editor tries to open a file from the "recent files" list that they had deleted before starting the application. In that case, just show an error message, remove the filename from the list, and continue as normal. In these examples, and many more like them, I can't imagine a better API design than the traditional form of error reporting where you return something like an exception (or error code or whatever) and let the surrounding context (like the "recent files" list) deal with resetting the state as appropriate, even if just by stack unwinding, running the destructors of everything that was created up until the point of the error in reverse order. It definitely makes sense to think about the cases where this doesn't apply as clearly, like the out-of-bounds error that was mentioned (which should probably be caught by an assert during testing rather than throwing an exception), but at least in my experience, the main source of errors isn't necessarily applications entering an "inconsistent state" (since bugs like that aren't reported as errors in the first place), and it's usually pretty rare that there's no blank state to return the application to if something goes wrong in well-behaved code, like the main menu of a GUI or the main connection broker loop of a server. 
And if all else fails, you can almost always at least write an autosave file, log some data and exit, which is often better for the user than just aborting the process immediately. That might not apply as much to some safety critical applications, but in general I'd say that regular old error handling still has a lot of practical value to many developers, way beyond the theoretical abstract machine that language designers are concerned with, and it would be a mistake to replace standard error reporting with something like terminating on the spot as the default strategy just because it seems easier to reason about.

@ABaumstumpf 6 months ago

But those are not really application errors; they are situations that you know will occur exactly like that. Trying to open a file is just the simplest of simple cases, as it is basically "Can I open the file? No. What do I do now?". Either the file was required to be opened, in which case you are pretty much screwed and can't do anything, or the file was not required, so it is just "can't do that" and not much of an error.

@XeroKimorimon 6 months ago

@@ABaumstumpf I think it's important to split your view into the micro and macro scales of the program. You could say that Function C requires some file to be opened, and failure to do so results in an error. Function B, which calls C, might detect the error and provide an alternate path, so that Function A consumes the results of Function B regardless of the outcome. At the macro scale, you could say this program doesn't have an error, but at the micro scale, specifically in Function C, it has an error that basically leaves it screwed in the rare event that a file can't be opened.

@StefaNoneD 6 months ago

I need error handling mostly for error analysis and rarely for error recovery. This is very critical for my daily work in medical engineering in order to make sure the application is as bug-free as possible. We are using a fail-fast approach for non-safety-critical development.

@N.... 6 months ago

15:40 I can think of plenty of examples:

- A user is playing a game from an external drive, and that drive is unplugged by the user's pet. Using error handling, you can pause the game and inform the user; they can plug the drive back in and the game can recover. Trying to access files from an unplugged drive is definitely an error and definitely recoverable. Similar situations are possible with external GPUs: unplugging the GPU the game is using shouldn't result in a loss of game save file progress.
- A robotic arm's camera gets disconnected accidentally due to a shifting load. The system halts and requests user intervention, the camera is reconnected, and the system recovers after user confirmation. Allowing the load to disconnect the camera is definitely an error and definitely recoverable. There are all kinds of variations of this where a safety-critical system needs to stop moving and require user attention.
- A self-flying airplane suffers engine damage due to failing to avoid a bird. It must make an emergency landing and be serviced before it can fly again. Failing to avoid the bird is definitely an error, and being able to make an emergency landing is an important way of handling that error and (eventually) recovering from it.

I'm sure there are plenty of other cases where a safety-critical system needs to do something more complex than simply stop moving in order to properly handle and recover from certain kinds of errors.

@GuillaumeGris 6 months ago

For the first two examples, I’m not too sure how you could reliably differentiate a device disconnected from a device failure or a corrupted state of your program. It’s theoretically possible to recover from a GPU loss (by re-initializing the entire graphic state) but in an actual video game, it’s really hard to do and you have no guarantee it will succeed because you cannot ensure that what caused the GPU loss is not permanent. Though you are right that it’s generally doable to save the game state before exiting the program. I would not describe such a situation as a recovery. For the airplane example, an engine failure arguably falls into the "not really an error" category. It’s more of a special known state for the self-flying airplane. (Though if the engine does not work because the entire wing was torn off, it’s definitely not recoverable :p) The point is, from what I understand of this talk, that the speaker proposes that you either know (or assume) the state of your program (in which case it’s not really an error, it’s just a state that is handled by either gracefully landing the plane or crashing it), or you are in an unexpected state that should not happen assuming a correct state of your program (in which case it is not recoverable because your program is potentially corrupted in unpredictable ways).

@ABaumstumpf 6 months ago

"user is playing a game from an external drive, that drive is unplugged by the user's pet"
A bit contrived, and it also relies on many other undefined aspects: this can only work if the game sometimes requires data from the drive, but absolutely NOTHING at the instant it is unplugged. The error-handling code has to be in memory at that time (ignoring pagefiles etc.), and nothing like the save files is allowed to be on that drive either (as you said that losing progress should not happen).

"A robotic arm's camera gets disconnected accidentally due to shifting load"
That is a hard error showing an underlying bigger flaw that MUST be fixed, and a program just crashing is a good way of halting and later recovering the system (that is, under the assumption that, for whatever reason, an unplugged camera is giving you an exception in your application despite you knowing that this is a possible scenario).

"There's all kinds of variations of this where a safety-critical system needs to stop moving and require user attention."
There can be incidents, but those CAN NOT be exceptions/errors, or you really do not have a safe application to begin with.

"a self-flying airplane suffers engine damage due to failing to avoid a bird"
And that is a change in hardware, NOT an error in software.

@N.... 5 months ago

@@GuillaumeGris There is no need to distinguish device disconnect from device failure. That's why you pop up a message asking the user/player to handle the situation and decide how/if to recover. As for recovering from GPU loss, it is not hard to do at all and in fact happens regularly; all modern web browsers handle it, for example, during graphics driver updates.

@N.... 5 months ago

@@ABaumstumpf I don't see why the game can't autosave to a different drive or Steam cloud save etc. and support loading/saving from/to multiple locations in general. And even if there's no possible destination to save data to, the game can still pop up a message giving the user/player the ability to copy save data to the clipboard or pick an appropriate save location themselves. As for code being paged out of memory, I don't think that works the way you imply. My understanding, on Windows at least, is that the entire executable is loaded into memory, and even if parts are paged out they are in the system pagefile and don't need to be loaded from the external drive that was unplugged.

You cannot hard crash a robotic arm that is moving. At the very least, you need physical dampening systems to make it properly slow to a halt instead of halting abruptly, and that is considered a form of error recovery. Without that, the abrupt halt could cause even more damage.

"And that is a change in hardware, NOT an error in software." And therein lies my point: all errors originate from sources beyond the program's control, and some can (and should) be handled within the program despite that.

@brynyard 6 months ago

I've ended up mostly using an "exploding kitty" style of handling situations that might fail. Any function that may fail returns a special optional that will "explode" (abort and write an error message) if you just use its value without checking or "defusing" it first. This makes the code flow very straightforward, and any unexpected situation is localized (you don't end up getting 100+ stack traces).
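A minimal sketch of what such an "exploding" optional could look like. The class name `Exploding`, its `ok()`/`value()` interface, and the `parse_digits` helper are all hypothetical and are not taken from the commenter's code; the sketch also assumes `T` is default-constructible.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>

// Hypothetical "exploding" optional: reading the value before a successful
// check aborts the process with a message instead of silently continuing.
template <typename T>
class Exploding {
public:
    Exploding() = default;  // empty: the operation failed
    Exploding(T value) : value_(std::move(value)), has_value_(true) {}

    // "Defuse" step: the caller has to ask whether a value is present.
    bool ok() const {
        checked_ = true;
        return has_value_;
    }

    // Aborts if the caller never checked, or checked and ignored a failure.
    const T& value() const {
        if (!checked_ || !has_value_) {
            std::fputs("Exploding: value used without a successful check\n", stderr);
            std::abort();
        }
        return value_;
    }

private:
    T value_{};
    bool has_value_ = false;
    mutable bool checked_ = false;
};

// Hypothetical fallible function: parses up to nine decimal digits.
// The length limit keeps the accumulation below INT_MAX, so it cannot overflow.
inline Exploding<int> parse_digits(const std::string& text) {
    if (text.empty() || text.size() > 9) return {};
    int result = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return {};
        result = result * 10 + (c - '0');
    }
    return result;
}
```

A caller writes `auto r = parse_digits(input); if (r.ok()) use(r.value());`. Skipping the `ok()` call terminates at the point of use, which is the localized failure described in the comment.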

@N.... 6 months ago

That has its advantages, but a problem is that the location where the "explosion" occurs might be rather far away from when the initial problem actually occurred, e.g. due to storing that explosive optional and accessing it later. This can make debugging it rather difficult.

@MatthewWalker0 6 months ago

that's a fun card game ;)

@AGeekTragedy 6 months ago

I was expecting the reveal to be that the "something else" that we need in the language is often just a variant (or an optional or an expected). That would fulfil a lot of the stated requirements: deterministic (at least as far as the types in the variant are); flexible (i.e. not "a monolithic chainsaw"); and able to communicate different types of information differently.
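As an illustration of the variant idea, here is a small sketch using `std::variant` from C++17. The error types `NotFound` and `OutOfRange`, the function `lookup_speed_limit`, and its fixed table are hypothetical and chosen only to keep the example deterministic.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical error categories; each alternative can carry different data.
struct NotFound { std::string key; };
struct OutOfRange { int value; int limit; };

// The outcome is an ordinary value: either an int or exactly one error type.
using LookupResult = std::variant<int, NotFound, OutOfRange>;

// Hypothetical lookup over a fixed table.
inline LookupResult lookup_speed_limit(const std::string& zone) {
    constexpr int kMaxPlausible = 200;
    int value = 0;
    if (zone == "city") value = 50;
    else if (zone == "highway") value = 130;
    else if (zone == "corrupt") value = 9000;
    else return NotFound{zone};
    if (value > kMaxPlausible) return OutOfRange{value, kMaxPlausible};
    return value;
}

// The visitor must cover every alternative; std::visit does not compile otherwise.
struct Describe {
    std::string operator()(int v) const { return "ok: " + std::to_string(v); }
    std::string operator()(const NotFound& e) const { return "not found: " + e.key; }
    std::string operator()(const OutOfRange& e) const {
        return "out of range: " + std::to_string(e.value) + " > " + std::to_string(e.limit);
    }
};

inline std::string describe(const LookupResult& r) { return std::visit(Describe{}, r); }
```

Because the set of alternatives is fixed at compile time and no stack unwinding is involved, the control flow is as deterministic as the types in the variant, which matches the requirement the comment mentions.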

@TheArech 6 months ago

Excellent talk, one of the best I've seen. Thank you, Erik!

@anon_y_mousse 6 months ago

I'm sure it's in your list, but file open errors would be a type that is recoverable. Generally it's due to the user mistyping a file name and you can simply request that they try again. As a compiler author, another would be malformed code fragments. Say for instance you're designing a language that uses semicolons as statement terminators. You might decide that when you can reason about whether what you've seen thus far is a complete statement that you only need to warn the user that they dropped a semicolon. If you can't reason about it, then by all means abort with an error, but that's frequently not the case.

@terragame5836 4 months ago

You don't really need full understanding of an error to handle it. Quite often, you only need to know some bounds for what the error could affect. Take, for instance, running a third-party plugin, getting an error from it, and deactivating it while having confidence that any possible damage is localized to its inner workings. You don't really care what the error is in this case (other than for debugging), but you have a very clear handling strategy for it regardless.
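The plugin strategy described above could be sketched roughly as follows. The `Plugin` struct and `run_plugins` function are hypothetical, and the sketch rests on the same assumption as the comment: a failing plugin can only damage its own state. That assumption holds for a thrown exception but not for, say, a plugin that corrupts the host's memory.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical plugin record: a name, an entry point, and an enabled flag.
struct Plugin {
    std::string name;
    std::function<void()> run;
    bool enabled = true;
};

// Runs every enabled plugin. A plugin that throws is deactivated; the host
// does not interpret the error and only keeps its text for later debugging.
inline std::vector<std::string> run_plugins(std::vector<Plugin>& plugins) {
    std::vector<std::string> log;
    for (Plugin& p : plugins) {
        if (!p.enabled) continue;
        try {
            p.run();
        } catch (const std::exception& e) {
            p.enabled = false;
            log.push_back(p.name + ": " + e.what());
        } catch (...) {
            p.enabled = false;
            log.push_back(p.name + ": unknown error");
        }
    }
    return log;
}
```

The handler is the same for every kind of failure, which is the point of the comment: the handling strategy depends on the bounds of the damage, not on the cause.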

@SolarLiner 6 months ago

"Undefined behavior isn't so bad as long as it does the same thing every time" (6:21) -- so that is, never? Even taking the system as a whole and as a black box, how can the functional safety practitioner sign off on the behavior of a system that is eventually led by undefined behavior? The crux is in the name: the behavior of the software is undefined, and therefore cannot be relied upon. It can't "[do] the same thing every time"; there are no guarantees or documentation, and it might change from one compiler to the next, or from one (abstract|concrete) machine state to the next. I am not in safety-critical software, so maybe I'm missing something here, but I really don't understand the point of view. I also haven't watched the rest of the talk yet, so that may be addressed in a further section or in the questions.

@ABaumstumpf 6 months ago

"how can the functional safety practitioner sign off on the behavior of a system that is eventually led by undefined behavior?"
You are looking at that from the completely wrong direction. Accessing a float as an int is undefined behaviour, and a very popular example of how that can be used. Your error is equating the language specification with reality. What C++ says is undefined behaviour has no impact on what the compiler will do.

"there are no guarantees or documentation"
And this is also completely wrong. The language can ONLY say what is undefined based on its own rules. Have you ever seen microcontrollers? Many of them come with their own compiler that simply makes certain actions well-defined behaviour, and that behaviour IS guaranteed. You still write your code normally in C++, and everything works as you would expect, with the only difference being that you can also do things that C++ does not define a behaviour for.

@vbregier 6 months ago

@@ABaumstumpf It has an impact on what the compiler may do in the future; that's the whole point of undefined behavior. If your safety-critical system is relying on the specific implementation of a particular version of a particular compiler, I don't think it should be advertised as safe… Your system may break without you knowing at any compiler upgrade.

@SolarLiner 6 months ago

Accessing a float as an int (through `reinterpret_cast`, I assume) is UB because it violates the assumption of the language that if pointers are of different types, then they must be of different addresses. This means the compiler is free to cache reads of one pointer even if we are writing to the other. C-style unions are also UB for the same reason. If you still intend on violating the strict aliasing rules of the language, you either have to a. prove that the compiled machine code actually does the right thing for all possible inputs, or b. use a toolchain that documents a specific behavior for this case. In the latter, you've effectively created an extension to the language that aims at resolving this UB by documenting it and making it well-defined.

And that's what I don't understand about the video (and your reply too, to be honest): UB is undocumented *by definition*. Documented behavior on "UB" means it's not UB anymore, but it also means you're not following the C++ standard directly anymore. And that's why some microcontroller platforms provide specific toolchains, because they have essentially created a dialect of the language that documents and erases specific cases of UB in a way they decided made sense for the platform.

I also don't understand "What the language says is UB has no impact on what compilers will do". If UB didn't have any impact, we wouldn't be here having a conversation around it. UB is the reason why compilers can make most optimizations, can reason about the code, and can produce machine code that respects user intent but also delivers runtime performance (or binary size, for embedded systems). So UB absolutely has an impact on what compilers will do, but crucially, what they end up doing is nothing more than an implementation detail, because nothing is preventing them from doing what they want (because it's undefined, by definition).
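For reference, the float-to-int bit access discussed in this thread has a well-defined spelling in standard C++ that does not rely on a toolchain extension: copy the bytes with `std::memcpy` (or use `std::bit_cast` in C++20). The helper names `float_bits` and `bits_to_float` below are hypothetical, and the sketch assumes `float` is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads the bit pattern of a float without violating strict aliasing.
// Copying the bytes is well defined; dereferencing a reinterpret_cast'ed
// std::uint32_t* that actually points at a float is not.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// The inverse: rebuilds a float from a bit pattern.
inline float bits_to_float(std::uint32_t bits) {
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

On a platform with IEEE 754 single-precision floats (an assumption, though a very common one), `float_bits(1.0f)` yields `0x3F800000`. Mainstream optimizing compilers typically reduce the `memcpy` to a single register move, so the well-defined form usually costs nothing at run time.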

@kuhluhOG 3 months ago

I interpret the sentence more as "officially it's UB, but in reality it is or became defined behaviour". There is a lot of stuff in the C and C++ world, or that interacts with it (one example is the SystemV ABI), which may or may not make things defined. Sometimes compiler vendors also just decide among themselves to do one specific thing which another standard doesn't say anything about. So, TL;DR: the important part when it comes to UB is not the C++ standard, but the promises (implicit and explicit ones) you get from your platform (that includes your toolchain, your OS (or lack of it), etc.). In general, imo, the C++ standard can't actually decide what is UB and what is not; it can only define what is well defined and what is implementation defined. In the case of implementation-defined behaviour, the implementations later decide whether they want to guarantee something about it, and what.

@kuhluhOG 3 months ago

@@SolarLiner "UB is the reason why compilers can make most optimizations, can reason about the code and produce machine code that respects user intent but also delivers runtime performance" UB is not a requirement for this. There are languages these days which match C and C++ performance and/or binary size (and in some cases even better on both, depends on the specific case ofc) without any UB in the language.

@MatthewWalker0 6 months ago

Maybe it wasn't said too explicitly, but something I understood is that anything that could be reasonably handled by std::expected is 'not really an error', which is fair, and I think something the C++ community is starting to understand. Errors like these -- file not found, connection lost and even resource exhaustion -- should be handled gracefully by any non-trivial application, and exceptions make grace difficult.

@N.... 5 months ago

If it's "not really an error" then what are we supposed to call it? The word "error" is perfectly suitable for this use and trying to redefine it so you can say there's no need for error handling is silly. Error handling doesn't imply a lack of grace, quite the opposite in fact; failing to do proper error handling results in rather ungraceful failures.

@ABaumstumpf 6 months ago

Nearly all the exception handling we have is in the database, in the form of either no data being found or too much data being found. The former is just a normal occurrence, as we simply try to get data where nothing is there (instead of first checking and then getting the data); the latter is an actual error and normally has no direct recovery (whatever caused that problem has to handle it, or stop doing anything). There are a few exceptions in the normal code, but many of them deal with handling outside interactions like user input, and there are only very few real error scenarios there. Most are again just a "you entered the wrong thing, try again", and in terms of the code that is just normal control flow for an invalid input and not an error. The last error I saw and had to fix was still the result of invalid user input, but at that time these were separate processes and somebody just provided bad but potentially valid data for multiple different systems: a couple of numbers that at one point get multiplied/divided and can lead to an overflow, so an actual software error. Sadly, preventing the error would have been a lot harder than just adding some code to check for it (but even that was hard, as all checks for UB MUST also prevent any UB from happening at all, or the checks will be removed by compilers).
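The last point, that an overflow check must itself avoid the overflow, can be illustrated with a small sketch. The function `checked_mul` is hypothetical, and for brevity it simply rejects negative operands. A check written as `if (a * b < 0)` performs the overflowing multiplication first; since signed overflow is UB, a compiler may assume it never happens and delete the test.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Multiplies two non-negative 32-bit values and returns std::nullopt when the
// product would not fit. The test uses division, so the overflowing
// multiplication is never executed and there is no UB for the optimizer
// to reason from.
inline std::optional<std::int32_t> checked_mul(std::int32_t a, std::int32_t b) {
    constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
    if (a < 0 || b < 0) return std::nullopt;          // sketch: negatives rejected
    if (a != 0 && b > kMax / a) return std::nullopt;  // a * b would exceed kMax
    return a * b;
}
```

GCC and Clang also provide `__builtin_mul_overflow` for this purpose; it is a compiler extension rather than standard C++, which is why the portable division form is shown here.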

@kiffeeify 6 months ago

@15:55 In my current project, I work on a data transformation pipeline that processes incoming client datasets and transforms them into a canonical internal format. This internal format tries to establish some guarantees on a dataset, e.g. "the sum over all rows of table X needs to be zero". Just now, a new requirement popped up, where business wants to have data in that canonical format, but they don't care about some of the guarantees. IMHO this is exactly the gray area between "not really an error" and "not recoverable". If you encounter an error, it depends on what you plan to do with your application state: which guarantees will you rely on when continuing execution? Does the encountered error indicate a violation of those guarantees? In other words, if you can guarantee that an encountered error will only compromise an isolated part of your application state, and the error was caused by violations of guarantees you have given, then the error is both "a real error" and "recoverable".

@oidpolar6302 6 months ago

Finally someone has raised this point

@snbv5real 6 months ago

I don't understand how Sycl could ever have an SC considering it's not directly implemented on anything but Intel, this is, in fact, one of the biggest downsides of Sycl. It's super weird that we see a Sycl safety critical talk here.

@ABaumstumpf 6 months ago

"Sycl could ever have an SC considering it's not directly implemented on anything but Intel" how did you come up with that?

@vrclckd-zz3pv 6 months ago

Bjarne Stroustrup - Morgan Stanley Ah yes. That is what Bjarne Stroustrup is most well known for. Nothing to do with being the creator of C++.

@sjswitzer1 6 months ago

It’s a bold claim that the C++ designers want to remove undefined behavior. It seems to me that they’ve proliferated it.

@sjswitzer1 6 months ago

But FWIW, the second half of this talk is very good.

@FredFred-wy9jw 6 months ago

I have written, designed and used both mission critical and safety critical systems and software… the worst development combination is lots of abstraction, object oriented whatever, c++ and software engineers who don’t understand the system hardware and system mission or use, …