Maybe there will be a way to visualize memories and dreams using electroencephalography (EEG) and neural networks, so you could see what others think, or see what others see through their eyes.
Because it resumes generation from just a few frames, it loses context. Imagine generating a paragraph and then writing the next one using only the last word you generated. Luckily, images capture a lot of information, so it's not that obvious. But, for example, you can't make a video that looks around 360 degrees if it's generated in two iterations. Very dreamlike.
Thank you for your video, great content as always! One question: in the video, you say that the video encoder is auto-regressive, so that it can be used on an arbitrary number of video patches. But aren't standard transformer encoders already able to process inputs of arbitrary length? Usually the auto-regressive architecture is used in the decoder, because at inference time we need it to generate the output causally. Am I missing something?
Thanks for this great question. Transformer sequence length is an interesting topic, which we've discussed here already: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Xxts1ithupI.html Basically, even if the model can generate / take in variable-length input, it still has a predefined maximum input / output length due to practical limitations (compute time and memory). You are asking whether a causal model could generate infinitely long video, and -- for practical reasons -- the answer is no. With unmodified causal attention, each new token attends to the whole generated past, so the attention window grows linearly with sequence length while computation time and memory grow quadratically. Because compute time and memory are limited, we cannot generate indefinitely, unless one applies tricks like the Phenaki authors do with MaskGIT, attending to only a small fraction of the tokens of the past generated output.
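A toy calculation (my own illustration, not from the video) of why the quadratic growth mentioned above rules out unbounded generation:

```python
# Illustrative sketch: cost of unmodified causal attention.
# At step t, the newly generated token attends to all t earlier tokens,
# so the total number of attended positions over n steps is
# 1 + 2 + ... + n = n(n+1)/2, i.e. it grows quadratically in n.

def causal_attention_ops(n):
    """Total attended positions when generating n tokens causally."""
    return sum(t for t in range(1, n + 1))

for n in (100, 1_000, 10_000):
    print(n, causal_attention_ops(n))
# Doubling the sequence length roughly quadruples the total cost,
# which is why generation cannot continue indefinitely without tricks
# such as attending to only a fraction of the past (as with MaskGIT).
```

So a 10x longer video costs roughly 100x more attention compute, on top of the linearly growing memory for the cached past.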
You are right, it is not a diffusion model. The playlist is about content generation. 😅 I was more comfortable with it being in this playlist (especially as the last video in the row) rather than nowhere near its fellow competitors. But sure, I do not have the Paella video in the list, although Paella can be argued to be a diffusion model. I need to clean up.
I just wonder about the implementation level: for these padding values, as well as the masked tokens, did someone decide that we fill these tensors with 0s? Does it matter what we fill those vectors with? And if these padded/masked 0s overlap with actual data that happens to be 0, how do we effectively instruct the model to disentangle masked values from 0s belonging to the actual data?
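Not the author, but a minimal sketch of how this is typically handled (my own illustration, not from the video): the pad value itself is arbitrary, because a separate boolean mask zeroes out those positions' attention weights before the softmax, so they are ignored regardless of their content. Real data that happens to contain 0s is untouched, since only the mask, not the values, decides what gets ignored. (Masked tokens in MaskGIT-style models are similarly a dedicated learned [MASK] token id, not literal zeros.)

```python
import numpy as np

def masked_softmax(scores, pad_mask):
    """Softmax over attention scores, ignoring padded positions.

    scores:   (seq_len,) raw attention scores
    pad_mask: (seq_len,) True where the position is padding
    """
    # Padded positions are set to -inf so exp(-inf) = 0 after the softmax.
    scores = np.where(pad_mask, -np.inf, scores)
    exp = np.exp(scores - np.max(scores))  # subtract max for stability
    return exp / exp.sum()

# A real score of 0.0 still receives attention weight; only the
# position flagged by the mask is forced to exactly zero.
scores = np.array([2.0, 0.0, 1.0, 0.0])           # the 0.0s here are real data
pad_mask = np.array([False, False, False, True])  # only the last slot is padding
w = masked_softmax(scores, pad_mask)
print(w)  # last weight is exactly 0; the real 0.0 score still gets weight
```

The same idea applies to loss computation: padded positions are excluded via the mask, so the training signal never depends on what filler value was chosen.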