Thanks for this great video. May I ask how PnP was done? Let's say the neural network has already produced 8 keypoint coordinates in the 2D image — how can we get the ground-truth 3D coordinates? Does the "final result 6DoF pose" mean the camera pose relative to the fixed location of the object, rather than "world coordinates"? (Sorry, I am not an expert in multiview geometry.)
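In case it helps: PnP takes the N predicted 2D keypoints together with the known 3D coordinates of those same keypoints on the object's model (e.g. its CAD model or 3D bounding box corners — that is where the "ground truth" 3D points come from), and solves for the rotation and translation of the object relative to the camera. In practice one would usually call `cv2.solvePnP`, but here is a minimal numpy-only DLT sketch of the idea; the box-corner object points and the intrinsics matrix in the usage example are made-up values for illustration:

```python
import numpy as np

def solve_pnp_dlt(points_3d, points_2d, K):
    """Recover object pose (R, t) from >= 6 non-coplanar 3D-2D
    correspondences via the Direct Linear Transform.
    points_3d: (N, 3) model keypoints, points_2d: (N, 2) pixel
    coordinates, K: 3x3 camera intrinsics matrix."""
    # Normalize pixels with the inverse intrinsics: x_n = K^-1 [u, v, 1]^T
    pts_h = np.hstack([points_2d, np.ones((len(points_2d), 1))])
    norm = (np.linalg.inv(K) @ pts_h.T).T
    # Each correspondence gives two linear equations in the 12 entries
    # of the projection P = [R | t] (up to scale).
    A = []
    for (X, Y, Z), (u, v, _) in zip(points_3d, norm):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    # The smallest right singular vector of A is the solution up to scale.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    P = Vt[-1].reshape(3, 4)
    # Pick the sign that puts the points in front of the camera.
    Xh = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    if (Xh @ P[2]).mean() < 0:
        P = -P
    # Project the 3x3 block onto the nearest rotation; recover the scale
    # from its singular values (all equal to 1/scale for a clean solution).
    U, S, VtR = np.linalg.svd(P[:, :3])
    R = U @ VtR
    t = (3.0 / S.sum()) * P[:, 3]
    return R, t
```

Usage: with 8 box corners plus a couple of extra model points, a known pose, and synthetic projections, `solve_pnp_dlt(points_3d, points_2d, K)` recovers the original rotation and translation; with real detector output you would use `cv2.solvePnP` (with RANSAC) instead, since DLT is sensitive to keypoint noise.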
What's the main difference between 6D object pose estimation and 3D object detection? Is the difference in the assumptions about input and output?
After taking a quick look at the following link (paperswithcode.com/task/3d-object-detection), I would say they are very similar: in both, the goal is to predict the translation and rotation of an object relative to the camera, and I believe the terms are often used interchangeably. One thing to note is that in this application we predict the pose of a single known object, whereas some 3D object detection methods also classify objects in an image alongside estimating their poses, and can handle numerous instances of an object within a single image.