Nowadays, compile-time branch prediction (with profiling) is usually better than 95% accurate, and run-time branch prediction (one cycle ahead) is about 97%, according to some manufacturers. So branching code actually suffers from this technique, as it makes the code larger, resulting in both a higher I-cache miss rate and pressure on the instruction bandwidth. It is no coincidence that the only surviving major architectures are those without delayed branches.
Perfect branch prediction recognizes that only about 2-3% of branches actually misbehave, though we are only able to tell which direction a branch will go well ahead of time about 90% of the time.
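Run-time predictors in the accuracy range mentioned above are commonly built from saturating counters; here is a minimal sketch of a classic 2-bit predictor (purely illustrative, not any particular manufacturer's design):

```python
# Minimal 2-bit saturating-counter branch predictor (illustrative sketch).
# States 0-1 predict not-taken, 2-3 predict taken; each outcome nudges
# the counter one step, so a single anomalous outcome doesn't flip the
# prediction immediately.

def simulate(outcomes, state=2):
    """Return the fraction of branch outcomes predicted correctly."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2  # upper two states predict "taken"
        if prediction == taken:
            correct += 1
        # move toward the observed outcome, saturating at 0 and 3
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

# A loop-closing branch taken 9 times, then not taken once, repeated:
trace = ([True] * 9 + [False]) * 100
print(f"accuracy: {simulate(trace):.0%}")  # → accuracy: 90%
```

The single mispredict per loop exit is exactly the kind of residual error that keeps simple run-time predictors below the compile-time-with-profiling figure.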
Also, the fast-algorithm picture (as given in the Hennessy & Patterson textbook) seems a bit erroneous to me. When we are adding Multiplicand.Multiplier[0] + Multiplicand.Multiplier[1], wouldn't we need to shift Multiplicand.Multiplier[1] one bit to the left for addition alignment? The fast multiplier implementation given in the Carl Hamacher textbook explains it properly (where the entire hardware is given in detail). Actually, Multiplicand.Multiplier[0] gives a 32-bit number, the LSB of which forms Product[0], and the rest of the bits of Multiplicand.Multiplier[0], i.e. bits 1 to 31, are fed to the first-level adder, such that Multiplicand.Multiplier[0][1] is aligned with Multiplicand.Multiplier[1][0]. Please correct me if I am wrong.
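The alignment being described can be checked in software. A small sketch (my own notation, not the textbook's figure) that forms each partial product Multiplicand.Multiplier[i] and shifts it i bits left before summing:

```python
# Shift-and-add multiplication, making the partial-product alignment
# explicit: the partial product for multiplier bit i must be shifted
# i bits left before being summed - the alignment the comment argues
# the figure omits.

def multiply(multiplicand, multiplier, bits=32):
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:
            # partial product Multiplicand.Multiplier[i], aligned by i
            product += multiplicand << i
    return product

print(multiply(6, 7))  # → 42
```

A tree of adders in hardware performs the same sums in parallel, but each input to the first-level adder must carry this same one-bit offset, consistent with the point made above.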
I do not think that the previous algorithm works for signed numbers (I mean negative numbers) in all cases. I guess it works for signed numbers only when the multiplier is positive; the multiplicand can be positive or negative. But while right-shifting the partial product formed at each step, we need to do an arithmetic right shift (instead of a simple logical right shift). Please correct me if I am wrong.
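That restriction can be demonstrated with a sketch of the sequential right-shift multiplier under those assumptions (non-negative multiplier, signed multiplicand, arithmetic shifts); the register layout and names here are my own, not a textbook's:

```python
# Sequential shift-right multiplication sketch: an accumulator holds
# the upper half of the product and the pair (acc, low) is shifted
# right one bit per step. With a negative multiplicand the shift must
# be arithmetic (sign-extending), as the comment says; a negative
# multiplier still needs extra handling (e.g. Booth recoding).

def mul_shift_right(multiplicand, multiplier, n=8):
    """Multiply a signed n-bit multiplicand by a non-negative
    n-bit multiplier using arithmetic right shifts."""
    assert 0 <= multiplier < (1 << n), "negative multiplier unsupported"
    acc = 0          # upper half of the product, kept signed
    low = multiplier  # lower half; its LSB selects each add
    for _ in range(n):
        if low & 1:
            acc += multiplicand
        # arithmetic right shift of the combined (acc, low) pair:
        low = ((acc & 1) << (n - 1)) | (low >> 1)
        acc >>= 1  # Python's >> on ints sign-extends
    return (acc << n) | low

print(mul_shift_right(-3, 5))  # → -15
```

Replacing `acc >>= 1` with a logical shift would destroy the sign bits of a negative accumulator, which is exactly the failure mode described above.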
If you have a network like this, the currents will not simply be added. In fact, you have a complex current divider with multiple voltage sources. The output current is the result of the superposition of all the current dividers, which depends on the resistor values, and it gets more and more complex as the input vector grows. Furthermore, the memristors change their values when a voltage is applied. How is it possible to get a consistent result?
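The superposition point can be illustrated with ordinary nodal analysis. A sketch with ideal resistors and arbitrarily chosen values, two source branches meeting at one output node with a load to ground:

```python
# Two voltage sources drive one output node through branch resistors,
# with a load RL to ground. Nodal analysis (equivalently, superposing
# the current dividers) gives
#   Vout = (V1*G1 + V2*G2) / (G1 + G2 + GL),   where G = 1/R,
# so each input's contribution depends on *all* the conductances, not
# just its own branch - the branch currents do not simply add into the
# load unless the output node is held at a virtual ground (which is
# why crossbar read-outs typically use a transimpedance stage).

def vout(branches, RL):
    """branches: list of (V, R) pairs meeting at the output node."""
    g_total = 1.0 / RL + sum(1.0 / R for _, R in branches)
    return sum(V / R for V, R in branches) / g_total

# Changing only R2 alters branch 1's effective contribution as well:
print(vout([(1.0, 1e3), (1.0, 2e3)], 1e3))
print(vout([(1.0, 1e3), (1.0, 4e3)], 1e3))
```

The two printed values differ even though branch 1 is untouched, which is the interaction the question is getting at.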
I came across this channel in 2015; I was pursuing my master's. I was impressed by your way of teaching and articulate explanations. Since then, I have been revisiting this channel whenever I need to recall Computer Organization concepts.
What happens if you have a heterogeneous multicore system with one processing element (say a DSP, for example) that has caches but doesn't participate in the MSI/MESI/MOESI protocol? I presume a read request will cause it to get the correct data from either main memory or another cache. But what if it wishes to modify that location and has, say, a write-back cache? How do the other processing elements know it's been modified? Must the DSP do something in software to alert the other PEs? Excellent tutorial - the best I've seen, in fact!
I have to say I love your British accent with a hint of Indian as well. It's a lot less thick than that of most Indian computer science teachers you can find, and a lot easier to understand.
Prof. Rajeev, thanks for the video and your explanation. I have one question: if a situation arises where, say, both processor P1 and P2 want to write to their copies of x at exactly the same time and issue an upgrade request simultaneously, how is this resolved? Thanks.