A suboptimal action is not the same as the worst action (unless there are only two actions). This causes several issues with the proof at 44:20. 1) The 2nd bullet does not follow from the 1st: R(a) >= R(a') for all a' means that action a is actually optimal, which contradicts bullet 1, which defines a as suboptimal. 2) Assuming that action a is suboptimal only implies that there merely *exists* an a' with R(a') > R(a). We cannot deduce the universal quantification, since it's possible that R(b) < R(a) < R(a') (here action a is suboptimal but not the worst). 3) Nor can we fix this by assuming we converged to the worst arm a for the proof by contradiction, because the negation of "converging to the best arm" is "not converging, or converging to an arm which is not the best".
To prove that P(a_t != a*) ~ epsilon, you just need to note that the epsilon-greedy algorithm explores with probability epsilon at each iteration. Hence, asymptotically, the probability that the chosen arm is not the optimal arm at any given time step is roughly epsilon.
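A quick simulation makes this concrete. The setup below is my own illustration (a 5-armed Bernoulli bandit with assumed means, constant epsilon = 0.1): once the value estimates settle, the greedy pick is almost always the optimal arm, so suboptimal pulls come only from exploration, giving P(a_t != a*) ~ epsilon * (k-1)/k, i.e. roughly epsilon.

```python
import random

random.seed(0)
means = [0.1, 0.3, 0.5, 0.7, 0.9]   # assumed arm means; arm 4 is optimal
k, eps, steps = len(means), 0.1, 200_000
counts = [0] * k
values = [0.0] * k
suboptimal_late = 0

for t in range(steps):
    if random.random() < eps:
        a = random.randrange(k)                      # explore uniformly
    else:
        a = max(range(k), key=lambda i: values[i])   # exploit current estimates
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]         # incremental sample mean
    if t >= steps // 2 and a != 4:                   # measure after burn-in
        suboptimal_late += 1

freq = suboptimal_late / (steps // 2)
print(f"empirical P(a_t != a*): {freq:.3f}, predicted ~ {eps * (k - 1) / k:.3f}")
```

The late-time frequency of suboptimal pulls lands close to 0.08 = epsilon * 4/5, which is "roughly epsilon" as claimed.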
The purpose of having epsilon is to increase exploration. If we try to reduce the probability of choosing a suboptimal action by shrinking epsilon, we fall back to the greedy strategy. Then what's the point?
You explore at the start to identify the best action. Once its value is estimated, you choose the best action (exploit) rather than suboptimal ones. You want your regret to be low.
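The "explore first, then commit" idea above can be sketched in a few lines. This is a hypothetical toy (assumed arm means and pull budget, not anything from the lecture): pull each arm a fixed number of times, then commit to the empirically best one.

```python
import random

random.seed(1)
means = [0.2, 0.5, 0.8]   # assumed Bernoulli arm means; arm 2 is best
m = 500                   # exploration pulls per arm

# Explore: estimate each arm's mean from m pulls.
estimates = []
for mu in means:
    pulls = [1.0 if random.random() < mu else 0.0 for _ in range(m)]
    estimates.append(sum(pulls) / m)

# Exploit: commit to the empirically best arm for all remaining steps.
best = max(range(len(means)), key=lambda i: estimates[i])
print("committed arm:", best)
```

With enough exploration the committed arm is the true best one, and all regret is incurred up front during the exploration phase.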
True, as epsilon decreases you become greedy, but before reaching that point you have explored a lot!! So the theoretical result is telling you: a.- decrease the rate of exploration over time, b.- but not too fast!! No faster than 1/t. This type of result is typical in the theory of stochastic processes; it is related to the way stochastic processes are constructed. The classical book by Loève, Vol. 2, has a beautiful chapter using this type of condition to prove that a "good process" can be constructed in a Kolmogorov space.
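The "not too fast" condition can be read as: the schedule must still explore infinitely often in total, i.e. the sum of epsilon_t over all t must diverge. A decay of 1/t satisfies this (the harmonic series diverges), while 1/t^2 does not (its sum converges to pi^2/6 ≈ 1.645), so the latter stops exploring too soon. A small numeric check of these two standard series facts:

```python
# Compare the partial sums of the two candidate schedules.
N = 1_000_000
harmonic = sum(1.0 / t for t in range(1, N + 1))       # eps_t = 1/t
p_series = sum(1.0 / t**2 for t in range(1, N + 1))    # eps_t = 1/t^2

# The harmonic sum grows like log(t), so total exploration is unbounded;
# the 1/t^2 sum stays below pi^2/6, so total exploration is finite.
print(f"sum 1/t   up to {N}: {harmonic:.2f}  (unbounded, grows like log t)")
print(f"sum 1/t^2 up to {N}: {p_series:.4f} (bounded by pi^2/6)")
```

A schedule whose epsilons sum to a finite value only spends a finite "budget" on exploration, so with positive probability it can lock onto a suboptimal arm forever.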
Just one more comment on the rate at which epsilon decreases: look at Kolmogorov's 0-1 law. This law tells you what happens if you keep sampling independently from a distribution.