28:17 [slides 31-32] I think “old school C developers” would define Pixel as a union of a single uint32_t and a struct of four uint8_t, and use that union to simplify the reading/writing code. Such approaches are undefined behavior in C++ (they break the strict aliasing rules, I believe), even though C allows this kind of type punning. I’m not sure that C-style state of mind should guide how C++ does it. Perhaps we should allow some std::simd for T’s that are aggregates of same-type “vectorizable” member variables? Perhaps this is a generalization that can implicitly allow SIMD, as mentioned at 48:22. Great talk, thanks Matthias!
Great talk! It seems that exploiting ILP alongside SIMD can be very beneficial. Will library/compiler vendors be allowed to “do it for us”? For example, is the default size() of std::simd strictly mandated by the hardware, or may a compiler/library vendor choose a larger size() (perhaps based on compiler flags) to exploit ILP? Perhaps the ABI tag that was mentioned is able to support such desires.
Intel's left hand: pushes SIMD into every language it can, including many mask-defined operations. Intel's right hand: withholds AVX-512 from us simple folk for ten years.
Because the compiler might simply remove the loop if you don’t use its result later. And the “modify first” part, I think, is there to keep the compiler from precomputing the result at compile time.
If you actually care about the performance of your data-parallel code, your PC has a special, massively powerful hardware component that's specifically designed to maximize throughput on exactly this kind of task. It's called a GPU.
Only a few systems that have SIMD also have a graphics processor, and if they do have one, it's only as powerful as the graphics workload requires. Think of servers, industrial machines, cars, home and kitchen appliances, and so on.
Sending data to the GPU and reading back the result is also pretty slow, so for algorithms that rely on recursion or dynamic programming the GPU doesn't make for a great resource.
Amdahl's law: GPU processing is only ever worth it when the compute time greatly outweighs the serial time (in this case, the atrocious PCIe transfer times).