
ViTPose: 2D Human Pose Estimation 

Soroush Mehraban

In this video, a detailed explanation is provided of how ViTPose applies the plain Vision Transformer (ViT) architecture to 2D human pose estimation. The video covers the ViTPose architecture and the techniques it employs to achieve strong performance on the MS COCO dataset, and explores how its design choices advance the state of the art in 2D human pose estimation.
Paper link: arxiv.org/abs/2204.12484
Table of Contents:
00:00 Introduction
00:12 Previous Attempts
02:20 ViTPose
07:02 Variants
07:26 Simplicity and Scalability
08:33 Pre-training
10:10 Input Resolution
11:31 Attention Type
14:53 Partial Fine-tuning
16:02 Multi-dataset Training
16:19 Knowledge Distillation
21:11 Results
Icon made by Freepik from flaticon.com

Published: 30 Jun 2024

Comments: 12
@mjalali3109 · 1 year ago
Congratulations, a perfect and neat job
@francisferri2732 · 1 year ago
Thank you for your videos! They are a great way to keep up with the state of the art.
@soroushmehraban · 1 year ago
Glad you enjoyed it
@wolpumba4099 · 9 months ago
- 0:00: The video discusses the ViTPose paper, which is currently leading in 2D pose estimation on the MS COCO dataset.
- 0:13: Previous attempts to use Transformers for 2D pose estimation include TransPose and TokenPose.
- 0:26: TransPose uses a CNN backbone to extract local information from the input image and a Transformer encoder to understand the skeleton keypoints in the image.
- 0:58: TokenPose uses a similar approach but includes randomly initialized tokens to represent missing or occluded keypoints.
- 1:33: Another attempt, HRFormer, combines Transformer blocks and convolutional blocks for downsampling and upsampling.
- 2:11: ViTPose simplifies the process by using only Transformers, making the problem easier to deal with.
- 2:21: ViTPose uses an encoder, which is a Transformer, to create tokens from an input image.
- 3:50: ViTPose has two different decoder options: a classic decoder and a simple decoder.
- 6:15: ViTPose allows multi-dataset training, enabling the use of different decoders depending on the dataset.
- 7:03: The video presents different variants of ViTPose (base, large, huge, and gigantic), which differ in the number of layers and channel size.
- 7:27: The video discusses the simplicity and scalability of ViTPose.
- 8:33: The video discusses the influence of pre-training data on the performance of ViTPose.
- 10:11: The video discusses the influence of input resolution on the performance of ViTPose.
- 11:32: The video discusses the influence of attention type on the performance of ViTPose.
- 14:55: The video discusses the influence of partial fine-tuning on the performance of ViTPose.
- 16:02: The video discusses the influence of multi-dataset training on the performance of ViTPose.
- 16:21: The video discusses the use of knowledge distillation to improve the generalizability of the model.
- 21:12: The video presents the results of ViTPose in comparison with other models for the task of 2D pose estimation on the MS COCO dataset.
Positive learnings:
- ViTPose simplifies 2D pose estimation by using only Transformers.
- Using a Transformer encoder to create tokens from an input image has proven effective.
- The different variants (base, large, huge, and gigantic) can enhance the performance of ViTPose.
- Pre-training data can improve the performance of ViTPose.
- Knowledge distillation can improve the generalizability of the model.
Negative learnings:
- Previous attempts to use Transformers for 2D pose estimation, such as TransPose and TokenPose, had limitations.
- The CNN backbone in TransPose limits its effectiveness.
- TokenPose's use of randomly initialized tokens to represent missing or occluded keypoints is not the most efficient approach.
- HRFormer's combination of Transformer blocks and convolutional blocks for downsampling and upsampling makes it complicated.
- Partial fine-tuning can negatively affect the performance of ViTPose.
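The architecture summarized above (a plain ViT encoder producing tokens, followed by a lightweight decoder that predicts keypoint heatmaps) can be sketched roughly as below. This is a minimal illustration, not the paper's exact configuration: the dimensions, depth, and input size are placeholder values, and the "simple decoder" is approximated as bilinear upsampling plus a single convolution, one heatmap per COCO keypoint.

```python
import torch
import torch.nn as nn

class ViTPoseSketch(nn.Module):
    """Toy sketch of the ViTPose idea: ViT encoder + simple heatmap decoder."""

    def __init__(self, img_size=(64, 48), patch=16, dim=192, depth=2,
                 heads=3, num_keypoints=17):
        super().__init__()
        self.grid = (img_size[0] // patch, img_size[1] // patch)  # token grid
        # Patch embedding: a strided conv turns the image into patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid[0] * self.grid[1], dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # "Simple decoder": upsample the token feature map 4x, then one conv
        # predicts a heatmap per keypoint.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, num_keypoints, kernel_size=3, padding=1),
        )

    def forward(self, x):                       # x: (B, 3, H, W)
        tokens = self.patch_embed(x)            # (B, dim, H/16, W/16)
        b, d, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        tokens = self.encoder(tokens)
        fmap = tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.decoder(fmap)               # (B, K, H/4, W/4)

model = ViTPoseSketch()
heatmaps = model(torch.randn(1, 3, 64, 48))
print(heatmaps.shape)  # torch.Size([1, 17, 16, 12])
```

The paper's classic decoder instead uses two transposed convolutions; the point of the simple variant, as the video notes, is that a plain ViT backbone is strong enough that even this minimal head works well.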
@rohollahhosseyni8564 · 9 months ago
Great job!
@Fateme_Pourghasem · 1 year ago
That was great. Thanks.
@soroushmehraban · 1 year ago
Thanks for the feedback
@alihadimoghadam8931 · 1 year ago
nice job
@soroushmehraban · 1 year ago
Thanks
@mrraptorious8090 · 2 months ago
Hey, I am wondering how to train ViTPose myself. Did you happen to train it yourself? If so, could you share your experience?
@nikhilchhabra · 1 year ago
Thank you for this interesting video. It would be interesting to see bottom-up pose estimation using transformers like ED-Pose. ViTPose is top-down, so (a) inference time increases with the number of persons, and (b) it cannot handle overlapping human scenarios well.
@soroushmehraban · 1 year ago
Thanks for the feedback. I didn't know about ED-Pose; I will surely read it soon.