
XNect: Real-time Multi-person 3D Motion Capture With a Single RGB Camera (SIGGRAPH 2020) 

Christian Theobalt
3.4K subscribers
60K views

We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates successfully in generic scenes which may contain occlusions by objects and by other people. Our method operates in successive stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long- and short-range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy. In the second stage, a fully-connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose and enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work, which does not produce joint-angle results for a coherent skeleton in real time for multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input, while achieving state-of-the-art accuracy, which we demonstrate on a range of challenging real-world scenes.
XNect: Real-time Multi-person 3D Motion Capture with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll, C. Theobalt. ACM Transactions on Graphics (Proc. SIGGRAPH), 2020.
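For readers who want the pipeline at a glance, below is a minimal Python sketch of the three-stage flow described in the abstract. Every name in it is a hypothetical placeholder standing in for the components the paper describes; it is not the released XNect API.

# Hypothetical sketch of XNect's three-stage flow (all names are placeholders).
def capture_frame(image, selecsls_cnn, lifting_net, skeleton_fitter):
    """Process one 512x320 RGB frame; return joint angles per subject."""
    # Stage I: the SelecSLS-based CNN infers 2D pose, 3D pose features,
    # and identity assignments for all visible joints of all people
    # (possibly partial under occlusion).
    people = selecsls_cnn(image)

    results = []
    for person in people:
        # Stage II: a fully-connected network turns the possibly partial
        # 2D/3D pose features into a complete 3D pose for this subject.
        pose_3d = lifting_net(person.pose_2d, person.pose_3d_features)

        # Stage III: space-time skeletal model fitting reconciles the 2D
        # and 3D estimates and enforces temporal coherence, yielding the
        # full skeletal pose in joint angles.
        results.append(skeleton_fitter.fit(person.pose_2d, pose_3d))
    return results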

Published: 18 Sep 2024

Comments: 71
@thomasgoodwin2648 4 years ago
Excuse me a moment while I pick my jaw up from the floor. Truly awesome! Not long before automation of all objects in a scene (not just humans). State of the art yesterday has been blown away. Can't wait to see what state of the art is this afternoon. Thank you and keep up the amazing work.
@AZTECMAN 4 years ago
Awesome work. Congratulations on making it into the SIGGRAPH conference!
@dissonantprotean5495 3 years ago
Super exciting, this makes motion capture way more accessible
@NickGeo25 4 years ago
I'm curious how much the quality can improve just by adding one more camera placed on the side.
@xxello90xx 4 years ago
I had that exact thought. Why not treat the cameras similarly to the base stations of the Oculus/Vive/Index? With two sources to sample from, would the tracking be a lot smoother and have a decent margin of error correction?
@i-conicvision1058 4 years ago
@@xxello90xx If you're interested, we are developing software that allows for real-time epipolar resampling of multiple video streams from moving cameras (we use drone video). This means that you could use your technology on two videos and very accurately calculate positions of the poses.
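The two-camera idea discussed in this thread boils down to classical triangulation once the views are calibrated. Below is a minimal sketch using the direct linear transform, in generic NumPy; it is not XNect's or i-conic's actual code, and P1, P2, x1, x2 are assumed inputs (3x4 camera projection matrices and matched 2D joint detections).

import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover a 3D point from its projections in two calibrated views."""
    # Each view contributes two linear constraints on the homogeneous
    # 3D point X, derived from x ~ P X.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector of A
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize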
@scottturcott1710 4 years ago
Like a hypothetical Snapchat concert filter: while everyone is already filming a concert, it would build up point data for a program like this.
@i-conicvision1058 4 years ago
@@scottturcott1710 Potentially, yes.
@ilmarselter 4 years ago
I wonder how much the quality can improve just by using a higher-fps camera. A commercial product is still using low-resolution PS Eye cameras for motion capture.
@novaria 3 years ago
I found the original paper but where can I find a demonstration repository? Is it open-source and MIT? If not, what are your plans on this? I plan on building a free and open-source framework for hobby game developers and animators (strictly non-commercial).
@mesmes9424 7 months ago
Same, is it available?
@kickassmovies5952 4 years ago
When will it be out in the public domain? It looks similar to an app made by Radical a long time back.
@smirnovslava 4 years ago
Good work and congrats on SIGGRAPH! Any plans for the training code to be published?
@Jewelsonn 4 years ago
I want this released for MikuMikuDance software
@Tactic3d 4 years ago
Great work! Very impressive result.
@blendlogic4151 4 years ago
Please, when will this be released? Can't wait to get my hands on it.
@pacoreguenga 3 years ago
It’s already available as a C++ library.
@azarkiel 3 years ago
@@pacoreguenga Do you know where this library is? I would like to work with it. Thanks in advance.
@MadsterV 3 years ago
This looks very stable! And no depth info? That's amazing.
@goteer10 4 years ago
Does it work well with face occlusion? This would be great as a cheap VR full-body tracking alternative if it did. Great for when institutions don't quite have the budget for everything.
@mehtadushy 4 years ago
As discussed in the supplemental document, the system as it is does not work with face occlusions, because in the absence of facial cues it cannot tell whether it is looking at the front of the body or the back. However, as we demonstrated in VNect, one can get around it by sticking images of human eyes onto the VR headset.
@VivaZapataProductionsLLC 4 years ago
This is great stuff. How do we get access to it? I couldn't find any more information on how to actually obtain this program or software...
@virtual_intel 2 years ago
How does this benefit us viewers? And when can we gain access to the tool?
@shadatorr9378 3 years ago
The spine almost doesn't bend at all, but overall it's really great and useful.
@joanamat5139 3 years ago
I've seen a lot of experimental demonstration videos like this, but a real product never comes out.
@dietrichdietrich7763 1 year ago
Awesome
@titter3648 3 years ago
There are some glitches where one part of the skeleton instantly goes from one position to another for just a second or less and then goes back to the correct position. Maybe you could add a filter to reject "impossible" fast accelerations and speeds.
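A minimal sketch of the kind of gating filter this comment suggests, written as a generic per-joint post-processing step; the speed threshold and smoothing factor are illustrative assumptions, not values from XNect.

import numpy as np

class JointGlitchFilter:
    """Reject implausibly fast jumps of a tracked joint, then smooth."""

    def __init__(self, max_speed_m_per_s=10.0, fps=30.0, alpha=0.5):
        self.max_step = max_speed_m_per_s / fps  # max plausible move per frame
        self.alpha = alpha                       # exponential smoothing factor
        self.prev = None

    def update(self, joint_pos):
        joint_pos = np.asarray(joint_pos, dtype=float)
        if self.prev is None:
            self.prev = joint_pos
            return joint_pos
        if np.linalg.norm(joint_pos - self.prev) > self.max_step:
            # Implausibly fast jump: hold the previous estimate.
            # (A real filter would re-accept after a few rejected frames
            # so that genuinely fast motion is not frozen out.)
            return self.prev
        self.prev = self.alpha * joint_pos + (1 - self.alpha) * self.prev
        return self.prev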
@amirierfan 3 years ago
Insane!!!!!
@mxmilkiib 4 years ago
Videos of Contact Improvisation dance jams would make for good tests.
@Augmented_AI 3 years ago
Does it work in Unity?
@WeidzM 4 years ago
This is huge. The most affordable mocap suit today still costs around €3K, for only one actor, without 6DOF position, and from what I see, about the same accuracy if not worse. Do the estimations give a per-point confidence for the poses? If so, it shouldn't be too hard to cross-analyse data from multiple angles and get a smoother, more accurate output, right?
@i-conicvision1058 4 years ago
If you're interested, we are developing software that allows for real-time epipolar resampling of multiple video streams from moving cameras (we use drone video). This means that you could use your technology on two videos and very accurately calculate positions of the poses. (www.i-conic.eu)
@Cera_ve858 3 years ago
Wow, already throwing out the ping-pong balls.
@daehyeonkong1762 3 years ago
Awesome!
@acidcube6967 4 years ago
🌟✨🤛 Well done! An inspirational start to something that I would love to incorporate into a game scenario I am working on. Could this potentially work in real time in combination with Apple's iOS LiDAR scanning apps? Is it possible to contact you apart from here? Cheers, Marlon
@camswainson-whaanga2750 4 years ago
Are you working on close-up hand and finger tracking, i.e. if we only have our hands in the camera view?
@21graphics 2 years ago
What is an RGB camera?
@williamweidner5425 3 years ago
Is there a way to capture finger motions with this?
@bolzanoitaly8360 2 years ago
What do you want to show us? If you can't share the model, then what is the point of this? Even I could take this video and put it on my vlog. This is just nothing... Can you share the model and code, please?
@phillipfury528 3 years ago
Hi! I'm a professional mocap performer who recently worked with Marvel and Fox on different projects. I am curious about this software. I would love to connect with everyone!
@Ethan-ny4vg 2 years ago
Is the character controller in Unity? Anybody know? Thanks.
@Drago.23 4 years ago
How was the motion transferred to the 3D models?
@Ethan-ny4vg 2 years ago
Have you solved this? I also want to know how.
@donk.johnson7346 3 years ago
Why all the foot sliding?
@cmdkaboom 4 years ago
It's funny that they focus the video on the people tracking and make the actual rig motion small when it's shown. There's a lot of jitter when you actually see it on a rig. Maybe they will improve it... it doesn't seem to be there yet.
@hughjassstudios9688 4 years ago
Nothing post-processing can't fix. Perhaps blend out the jitter in post.
@viniciusplaygames6042 4 years ago
Can someone tell me how to download VNect or XNect? Thanks :)
@azarkiel 3 years ago
You and I are at the same point. I would like to play with both :)
@tribagaskara 2 years ago
What a software bro
@birdisland 3 years ago
How can I get this software? Is there a website for purchasing it?
@andrewgonzalez620 4 years ago
Can I download this?
@luisfable 3 years ago
Where is this?
@jackcottonbrown 3 years ago
Can this run on an iPhone?
@ziadeldeeb6066 4 years ago
What RGB camera do you use? Which type?
@thejetshowlive 4 years ago
From the video it looks like a Logitech... if that IS what they are using.
@DP-ee6qv 3 years ago
Software name?
@Kiran.KillStreak 4 years ago
I've been seeing videos like this since Kinect v1; nothing is useful for game developers without retouching.
@kendarr 4 years ago
There is really no such thing as perfect mocap; cleanup is always needed. This is awesome considering it doesn't use any depth data.
@cybermad64 4 years ago
I understand what you are trying to achieve here, but you are still far from an optimal result. We saw the 'Everybody Dance Now' tech demo two years ago; I know it wasn't real-time, but the pose detection was super clean. In your solution, the poses are not accurate, the skeletons are shaking, leg and arm positions are off most of the time, knees are not facing the right direction... Your simple skeleton looks okay-ish, but once applied to a 3D model it's unusable.
@mehtadushy 4 years ago
I think there are some misconceptions here that need to be clarified. 'Everybody Dance Now' makes use of a 2D pose backend (OpenPose), not 3D pose. They don't have any bone-length consistency constraints, nor 3D plausibility to worry about. Additionally, not being real-time is kind of the key to the better visual accuracy of the 2D backend used by their project. OpenPose is applied at multiple image scales and the results combined together, whereas our approach does multi-person 3D at roughly 2-3x the speed of single-scale OpenPose, and at least an order of magnitude faster than the multi-scale variant.

As far as the 'unusability' of motion applied to 3D models, it comes down to the end application. This is where the multi-stage design proposed in the paper comes into play. It allows you to insert domain expertise into different stages to improve the aspects that are important to your end application. You can swap out Stage I for a heavier/multi-scale pipeline if you care more about accuracy than real-time performance. You can swap out Stage II for alternate designs which incorporate more data, inter-penetration constraints, bio-mechanical constraints, temporal constraints, and what not. Similarly, you can swap out Stage III to better exhibit the characteristics needed by your end application, or have stronger/better-tuned temporal filtering applied in Stage III. There are other ways to achieve temporal stability too, such as by breaking causality, which targets a whole different set of applications.

I am sure you understand that this video and paper are not an advertisement for a solution that we are selling for money; rather, the system shown in the paper is a vehicle for us to demonstrate several key points regarding a new efficient convolutional network architecture design, and a way of thinking about multi-person pose estimation in a multi-staged way that is different from contemporary approaches. We, in fact, perform comparably to non-real-time contemporary monocular 3D pose approaches on various challenging benchmarks, while running in real time. This is a research prototype; of course it has limitations, which are even discussed in the supplemental document. Other multi-person 3D pose work has similar limitations, while not even running in real time. What we show here was the state-of-the-art multi-person monocular 3D pose estimation system at the time of submission of the paper. Any lessons from recent/future work to mitigate some of these issues can equally be applied to our approach.

We are thankful to you for engaging with us, and we welcome suggestions for improvement. I just wanted to set the context and expectations vis-à-vis prior work.
@NiloRiver 4 years ago
You are a very demanding guy. How about sharing your solution with us?
@acidcube6967 4 years ago
@Dushyant Mehta Looks like an inspirational start to something greatly usable 〽️🕶
@dasrio8307 4 years ago
Do you compare with VNect in terms of MPJPE error?
@mehtadushy 4 years ago
Please refer to the paper for detailed comparisons.
@nholmes86 4 years ago
Haha, holy crap... turns out RGB is best for the calculation.
@angeloman87 3 years ago
Can I use these animations in Max?