Josh, amazing work! Thank you for sharing! I think there's now a lot of interest in multiplane video with the Apple Vision Pro and its Spatial Video mode. Have you tried it with that type of video? The separation is really small, almost 2 cm from the pupil.
Thanks! I haven't tried it specifically with spatial videos, as I've mostly moved on to capture with a few more cameras and a different technique that allows for more movement (see my more recent video on Layered Depth video). But it almost certainly would work, assuming the subject isn't too far away (and after converting the spatial format to separate left and right images). Thanks for asking and let me know if you have any other questions!
@@JoshGladstone I'm really interested in trying to view multiplane video on the Quest. I don't know much Unity but have been learning a bit. I downloaded stereo-magnification, but couldn't make it work. I'm sure it's something really easy. Do you think you could make a tutorial on how to install it? Best!
Thanks and congrats on this awesome project! Quick question please, is the Unity project available somewhere? I'd like to test this approach for a spatial video, whose side-by-side frames were extracted using ffmpeg, to then be viewed from Unity as an immersive video. Thanks in advance.
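In case it helps anyone doing the same thing: once you have side-by-side frames, splitting them into left/right views is just a horizontal crop. A minimal numpy sketch (the function name and dummy frame here are just for illustration, not from the actual project):

```python
import numpy as np

def split_sbs(frame):
    """Split a side-by-side stereo frame of shape (H, W, C)
    into separate left and right views."""
    half = frame.shape[1] // 2
    return frame[:, :half], frame[:, half:]

# Dummy 4x8 RGB "frame" just to show the resulting shapes
frame = np.arange(4 * 8 * 3).reshape(4, 8, 3)
left, right = split_sbs(frame)
print(left.shape, right.shape)  # (4, 4, 3) (4, 4, 3)
```

The same crop works per-frame on whatever image sequence ffmpeg spits out.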
Great video! I had tried firing up that stereo magnification code a while back but I could not get it working with all the dependencies. Will you be sharing your scripts? Maybe that could help me get it running.
The original release of stereo mag was 2018, so it's relatively old at this point. If you want to get it up and running, you have to use TensorFlow 1.11.
Great to see your continued work on this project. Have you experimented with extruding the pixels in each of the 32 layers so that they each have depth and pseudo fill in the gaps between each layer? The end game to all this might be phones with stereo lenses as standard, plus machine learning that gets to a point where it takes that depth data, combines it with its ability to recognize objects/scenes, turns them into 3D meshes, and then uses the video to texture most of those meshes. Angles without any texture data would use a content-aware fill technique.
What you're describing is remarkably similar to a 'Layered Depth Image' or a 'Layered Mesh Representation', which is sort of the next advancement from MPIs that Google used in their 2020 paper. I haven't tried it yet, but others have and it works very well. My gut instinct is that's probably a bit much for two cameras to capture, but honestly I never thought this or NeRFs would really be possible, so it's all absolutely within the realm of possibility. This idea actually overlaps with some of the tech developments of self-driving cars and AR, especially with detection and segmentation.
@@JoshGladstone Extruding depth to those pixels seems like it would be relatively easy to implement. As for the stereo cameras - they would only be recording video. All video-to-mesh and temporal content-aware fill would happen in the cloud later. Not sure how computationally heavy that would be, though. Until then, there's that technique I posted on your Reddit months ago that seemed like something that could easily be widely adopted - in short, my idea was to shoot stereo 180 video pointed towards the subject of most interest and, if afterwards you decide it was a special memory you wanted to preserve in a 360 environment, you would then take one minute to shoot a handful of panoramic shots with your arm extended outward to create a 360 photo with a double-arms-length lightfield of depth information. You then compare and composite the stereo video into the 360 lightfield photo to place them within 6DoF space. You wouldn't be able to kneel down without some tearing, but a 5ft circular 6DoF view from the standing position should be good. Much less data as well.
@@brettcameratraveler Similar to your idea, I've heard some talk about using photogrammetry or nerf for static elements and then something else for dynamic content. Could be interesting
@@JoshGladstone How do you like that Kandao camera? Couldn't find the spec info on their site. What resolution does it shoot at per eye? How does it compare to the Insta360 Evo? What are you using for your custom rig?
@@brettcameratraveler I think it's a fun camera. Very portable. Different from the Evo in that it's not VR180, it shoots rectilinear 16x9 content, 1080P per eye. The custom rig is two Yi 4K action cameras synced with an oddly proprietary cable. This is the same camera that Google used for its dome rig. I designed a 3d printed frame that lets me have a variable baseline. The current frame is adjustable from about 130 - 220mm.
Btw have you seen the work on LifecastVR? They do something similar but with 180 3D VR cameras, which enables a wider field of view. I was suggesting they use it more to improve comfort in 180 VR experiences, where you still lock the position in order to avoid going out of the sweet perspective spot, but enable rotation at the level of the neck to make it much more natural, and also enable customizable stereo in terms of the distance between the eyes (although the Quest 2 doesn't allow this anymore, I think).
Yeah, if you headlock this or RGBD, it could solve a few stereoscopic discomfort issues, like the parallax from our necks and IPD adjustment like you mentioned, and also being able to roll your head to the side without ruining the stereo effect. The issue is that it's a large amount of added effort in post-production for a sort of minor payoff whose advantage is hard to demonstrate to people. But I'm with you!
@@JoshGladstone Yeah totally! I forgot the head tilting, definitely another advantage! For the post-production, I feel this could be automated. I'm sure you could even edit it like regular VR180 footage, and just after export, load the XML into another piece of software to render an OpenXR build (in the future it could be a specific file accepted by Oculus TV, for example). The IPD adjustment would just be determined by the headset used and its settings. So to me, much of the effort would be in that software, to be created once, which LifecastVR could do quite easily. I'm gonna post that on their Facebook group. So I agree it's hard to show the added value, but to me it's not just a minor payoff, it's a huge one if people are having a good experience in the headset, unlike what I could observe with classic 180 3D. Not to mention all the dynamic moves then possible to do with the camera.
Because full 6DoF video is more of a huge issue, with the distortions and all. And the interactivity breaks the narrative. If you want to tell a story, presence and immersion are fine, but interactivity is the enemy of narration. That's why I think this would be a bigger deal for narrative content. Interactive content like games is another story, and I think there it would be better to have it completely interactive with separate assets, etc.
How about a rig with 2 or 3 of those Kandao stereo cameras mounted side by side (sync potential?). Would you also need another layer of cams above/below? Or is L/R parallax "enough"?
It wouldn't help this neural network, as this one takes two inputs and outputs MPIs. But other techniques such as in the Google 2020 paper, and the various flavors of NeRF all require more inputs, so the more cameras the better. Left and right are sort of more important because human vision works that way, and generally we don't see a lot of vertical disparity unless you're intentionally moving up and down. Plus, my goal for output is the LookingGlass which only displays horizontal views anyway. But if you want to be able to move around freely in the space in VR or something, then you really do need cameras at a variety of angles. Especially for things like NeRF that also map more advanced view dependent effects like refractions.
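For anyone wondering what "outputs MPIs" means for rendering: once the layers are warped into the target view, producing an image is just back-to-front alpha compositing. A minimal numpy sketch (ignoring the per-view homography warp, and assuming layers are ordered far to near, which are my assumptions for illustration):

```python
import numpy as np

def composite_mpi(layers):
    """'Over'-composite MPI layers back to front.

    layers: (D, H, W, 4) RGBA array with values in [0, 1],
    ordered far to near. Returns an (H, W, 3) RGB image.
    """
    out = np.zeros(layers.shape[1:3] + (3,))
    for layer in layers:  # far to near
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out

# Two toy layers: solid red background, half-transparent green foreground
layers = np.zeros((2, 2, 2, 4))
layers[0, ..., 0] = 1.0  # red channel
layers[0, ..., 3] = 1.0  # fully opaque
layers[1, ..., 1] = 1.0  # green channel
layers[1, ..., 3] = 0.5  # half transparent
img = composite_mpi(layers)  # every pixel blends to (0.5, 0.5, 0.0)
```

The novel-view effect comes entirely from warping each plane by its depth before this step; the compositing itself is this simple.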
Well, that is the next frontier to succeed in: NeRF videos. I was wondering if this would be possible. I have 3 drones and about 6 iPhones, 8 through 13 Plus/Max. If we were to capture a scene from all angles with different cameras and drones, how would we go about joining all those different viewpoints, so that we can play the NeRF like a video while being able to move the camera as the scene plays? So if someone is walking in the scene, we would see the actual movement from point A to point B and be able to look at it from different points of view. This is not currently possible, but what would it take? Maybe a bunch of 360 cameras positioned at the right distances? A new algorithm to combine all the data?
@@JoshGladstone Yes, I've been trying to get it running locally for years now, HA! If you take a look at the sheer number of issues on the GitHub, you'll see I'm not alone. It's kind of a miracle that you've managed to get it working so well, hence why I'm hoping you might do a detailed tutorial.
Great video! Is your MPIre python script available? I just found this video from the 6/14/23 PetaPixel article "Filmmaker Uses Action Cams and AI to Create Incredible Volumetric Video"
That's pretty cool. Have you tried using NeRFs as well? It seems this kind of image representation, closer to light fields and combined with AI, will really be the solution.
@@BenEncounters I've also found in my limited experience with NeRFs so far (and other view synthesis techniques that require a lot of inputs) that they are extremely reliant on accurate camera poses for good results, and generally COLMAP is used to figure out the camera poses. I've never had good luck with COLMAP, although I've only tried it out a few times. But that is another reason I like stereo inputs: you can avoid camera poses altogether. Although when it does work, the results from NeRF are really incredible, especially with shiny and refractive surfaces, so I definitely do want to play with it.
@@JoshGladstone Yes, definitely a trade-off there. And it's true that for many video use cases, like adventure content, you cannot have a rig of cameras (only if shooting in a studio). And as for photogrammetry, it's at the moment of taking the pictures that it's even more important to do it well, so COLMAP and other tools produce accurate results, haha.
@@BenEncounters I could definitely see a capture stage set up that way. Although a lot of the appeal of this stuff for me is the ability to capture motion in the background as well. The real goal for me is to have a camera or camera system that you can set up on location and capture the whole scene. 'Reality Capture' or something like that
@@JoshGladstone Yes, I am totally in line with you there. Again, I think you should check out and play with the LifecastVR tool if you haven't yet! On my side, I only wish it were possible to publish their format on Oculus TV with a lock at the neck level, like I mentioned in my other comment above :)
What happens if you use stereo 360 video / photo? Or at least taking a middle section from the equirectangular image to remove the nadir and zenith from the calculations. You could use the Kandao Obsidian R to capture those. It can also produce a depth map in the Kandao software, which I've seen but not yet tried to do anything interesting with.
I haven't tried 360, but the cameras I used for the Chinese Theater video have a 160º fov, so that's pretty wide. I did try VR180 videos about a year ago, and they didn't work when stitched, but the raw fisheye recordings did work. I then had to use hemispheres as opposed to flat planes to display them (which is something Google also did with MSIs - multisphere images), and that worked pretty decently. There was still quite a lot of fisheye distortion though, and it takes an already limited resolution and stretches it over an even larger area, so it looks less detailed too. But it did work for the most part.
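To make the multisphere idea concrete: each pixel of an equirectangular layer maps to a direction on a sphere, and each MSI layer just places that direction at a different radius. A quick sketch of the mapping (coordinate conventions vary; this is one common choice, not necessarily what Google used):

```python
import numpy as np

def equirect_to_point(u, v, radius):
    """Map normalized equirect coords (u, v in [0, 1]) to a 3D point
    on a sphere of the given radius. Convention: +z forward, +y up,
    u increasing left to right, v increasing top to bottom."""
    lon = (u - 0.5) * 2.0 * np.pi   # longitude in [-pi, pi]
    lat = (0.5 - v) * np.pi         # latitude in [-pi/2, pi/2]
    direction = np.array([np.cos(lat) * np.sin(lon),
                          np.sin(lat),
                          np.cos(lat) * np.cos(lon)])
    return radius * direction

center = equirect_to_point(0.5, 0.5, 2.0)  # image center -> 2 units ahead
top = equirect_to_point(0.5, 0.0, 1.0)     # top of image -> straight up
```

Fisheye inputs would need their own lens model in place of the lon/lat mapping, but the "layers as concentric spheres" part is the same.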
Really interesting. I've not done a lot with VR180, so I'm not that familiar with the workflow and output. I'd be happy to share some images from the Obsidian R of the full 360 if you'd like to try. I think cropping the stitch to a somewhat undistorted letterbox could produce something, even if it's a flat panoramic image.
@@JackAaronOestergaardChurchill Btw, I found a capture I did last year that I never published with a VR180 sample, if you're curious: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-0ikfvtU2Uhs.html
Yes and no, it depends on what your expectations are and what you're trying to capture. For a static scene with no moving objects or people, you can recreate a fairly full environment by moving the camera around and getting shots from a lot of different angles and all sides of the subject/environment. If that's your goal, look into NeRFs and/or gaussian splatting. Luma.ai makes this very easy and user friendly (lumalabs.ai/)
No, not the way it's currently designed. The neural network doesn't export geometries or depth, it outputs rasterized images in layers. You could in theory turn this into a point cloud I suppose, but you'd still just have discrete layers of points.
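To sketch what that point cloud would look like: take each layer, keep the pixels that are sufficiently opaque, and drop them at that layer's depth. A crude numpy illustration (the threshold and pixel-space layout are my assumptions, not part of the actual pipeline):

```python
import numpy as np

def mpi_to_points(layers, depths, alpha_thresh=0.5):
    """Turn MPI layers into a crude point cloud.

    layers: (D, H, W, 4) RGBA array in [0, 1]; depths: (D,) plane depths.
    Each sufficiently opaque pixel becomes one point at its plane's depth,
    so the result is still discrete slabs of points, not smooth geometry.
    Returns (N, 3) points as (x, y, z) and (N, 3) colors.
    """
    pts, cols = [], []
    for layer, z in zip(layers, depths):
        ys, xs = np.nonzero(layer[..., 3] > alpha_thresh)
        pts.append(np.stack([xs.astype(float), ys.astype(float),
                             np.full(len(xs), z)], axis=1))
        cols.append(layer[ys, xs, :3])
    return np.concatenate(pts), np.concatenate(cols)

# Two 2x2 layers with one opaque pixel each
layers = np.zeros((2, 2, 2, 4))
layers[0, 0, 0, 3] = 1.0  # first layer, pixel (0, 0)
layers[1, 1, 1, 3] = 1.0  # second layer, pixel (1, 1)
points, colors = mpi_to_points(layers, depths=[1.0, 2.0])
```

You'd still see the layered "cardboard cutout" structure when orbiting such a cloud, which is the point being made above.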
@@JoshGladstone My interest is in converting monocular video into these depth images, and this video has shown me another way of getting information out of certain shots. Every shot is different, and some have more depth information than others, sometimes when the camera pans around a subject, photogrammetry and other elements might provide something comparable to stereo. Other times automated depth maps work fine, and other times they need correction, or geometric reconstruction. But, this is pretty badass.
Great how your technique is working. The Looking Glass looks awfully bad, though. Can't believe they raised 2.5 mil and that many people found it worth $250 😅
The Looking Glass is one of very few volumetric displays that even exist right now, and the fact that they can produce and sell one for $250 is pretty impressive. In my opinion.
@@JoshGladstone I know. And you are right. I just didn't get that feeling of a beta product until your demo. I kinda feel that, given its resolution limit, it would look more impressive with an OLED and some fancy eye-tracking cameras for passers-by 😁
@@smetljesm2276 You can definitely see the resolution limitation on the Looking Glass, but I just got a Lume Pad 2, which uses eye tracking with an autostereoscopic display, and while it's cool and the perceived resolution is much higher, I'd still say the Looking Glass *feels* more like an actual volumetric hologram.
Holy crap man, this is the exact problem I have been trying to solve. I started 10 months ago and even used the same movie scene from Minority Report to explain what I'm trying to do, and from the movie Deja Vu... creepy! :D link: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-UKAguEUb2Lc.html Now I've been trying to do that with lidar, and have had some success, but nothing like this... wow!