Great work. Interesting paper read indeed. At 7:27, Bayes' theorem is stated incorrectly. It should be P(X|Y) = P(Y|X) P(X) / P(Y). The rest of the math that follows is fine.
Well spotted, thank you. I think I noticed it after the video was published. I left it as is because YT doesn't allow replacing a video with a newer version. I think I should start writing errata in the comments :)
Yes, we need depth or pose datasets, and computer vision already has several of them. The problem is that these datasets are tiny compared to the scale at which LLMs or LVMs are trained. That is where ControlNet comes in: we simply add a few trainable layers, and we can then train with these "small" datasets. As a result, we can control the spatial layout of the generated image during inference. Hope that clarifies :)
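For anyone who wants to see the "few trainable layers" idea in code, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the actual ControlNet implementation: a block is reduced to a single bias-free linear map on short vectors, and the names `ControlledBlock`, `zero_in`, and `zero_out` are made up for illustration. What it tries to show is that the pretrained weights stay frozen, a copy of them is trained on the small dataset, and the copy is attached through zero-initialized layers, so before any training the output is identical to the original model's.

```python
def linear(weights, x):
    """Bias-free dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


class ControlledBlock:
    """Toy stand-in (hypothetical) for one pretrained block plus its control branch."""

    def __init__(self, frozen_weights):
        dim = len(frozen_weights)
        self.frozen = frozen_weights                        # pretrained, never updated
        self.copy = [row[:] for row in frozen_weights]      # trainable copy of the block
        self.zero_in = [[0.0] * dim for _ in range(dim)]    # trainable, zero-initialized
        self.zero_out = [[0.0] * dim for _ in range(dim)]   # trainable, zero-initialized

    def trainable_parameters(self):
        # Only the added layers are handed to the optimizer, not self.frozen.
        return [self.copy, self.zero_in, self.zero_out]

    def forward(self, x, condition):
        base = linear(self.frozen, x)                       # original model path
        hint = add(x, linear(self.zero_in, condition))      # inject depth/pose features
        control = linear(self.zero_out, linear(self.copy, hint))
        return add(base, control)                           # control is all zeros at init


# Made-up numbers, purely for illustration.
pretrained = [[0.5, -0.2], [0.1, 0.3]]
block = ControlledBlock(pretrained)
x = [1.0, 2.0]              # stand-in for an intermediate feature vector
depth_hint = [0.5, -1.0]    # stand-in for an encoded depth or pose map

# Before training, the control branch contributes nothing.
print(block.forward(x, depth_hint) == linear(pretrained, x))
```

Once training moves `zero_in` and `zero_out` away from zero, the depth or pose condition starts to influence the output, while the frozen weights keep what the large model already learned.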