
02L - Modules and architectures 

Alfredo Canziani
39K subscribers
22K views

Published: 28 Aug 2024

Comments: 51
@wolfisraging · 3 years ago
Been reading and doing ML for the last 5 years, and every time I hear Yan I always get to know something I don't know or missed. Thanks to both Alfredo and Yan for this amazing course. Lovin it!!!!
@alfcnz · 3 years ago
Yann, with two n's 😉 You're welcome 😊😊😊
@wolfisraging · 3 years ago
@@alfcnz mah bad 😄
@locutusdiborg88 · 3 years ago
@@alfcnz Then it should also be Lecunn, with 2 n's (to be read in Igor's voice)
@alfcnz · 3 years ago
Who's Igor? 😮😮😮
@wolfisraging · 3 years ago
yeah who's Igor?
@hafezfarazi5513 · 3 years ago
@8:33 One of the reasons why ReLU works better in deep networks than, say, sigmoid is that in the backward pass the gradient gets smaller after each sigmoid non-linearity (multiplied by at most 0.25), whereas with ReLU-like non-linearities the gradient does not shrink after each layer (the gradient is one on the positive part).
@alfcnz · 3 years ago
Assuming no normalisation layer is used in-between, yes.
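A quick numerical check of the point above: push the same input through a deep stack of sigmoid layers and an identical stack using ReLUs, and compare the gradient magnitude at the input. The depth, width, and seed below are arbitrary illustrative choices, not values from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 64

def deep_net(act):
    # identical linear layers; only the non-linearity differs
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    return nn.Sequential(*layers)

x = torch.randn(8, width, requires_grad=True)

for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    net = deep_net(act)
    y = net(x).sum()
    g, = torch.autograd.grad(y, x)
    print(f"{name:8s} |grad wrt input| = {g.abs().mean().item():.3e}")

# The sigmoid stack typically shows a far smaller input gradient, since each
# sigmoid multiplies the backward signal by at most 0.25.
```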
@alfcnz · 3 years ago
What would be bounded?
@fuzzylogicq · 2 years ago
Awesome lecture. Particularly intrigued by the mixture of experts part, definitely trying this out.
@alfcnz · 2 years ago
😊😊😊
@asmabeevi · 3 years ago
First of all, thank you very much for doing this. The world owes you! What is the meaning of the update rule when the parameter vector is the output of a function [at 1:31:21]? As the name implies, w is the output of a function, so how can you update the output?
@alfcnz · 3 years ago
Through changes to its input. Yann is showing how the gradient descent direction changes when an input is a function of another input.
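To make "through changes to its input" concrete, here is one reconstruction (symbols are assumptions, not copied from the slide): if the parameter vector is the output of a function, say $w = H(u)$, gradient descent is run on $u$ for some cost $C$, and the induced change in $w$ follows from the chain rule:

$$
u \leftarrow u - \eta\,\Big(\frac{\partial H}{\partial u}\Big)^{\!\top}\frac{\partial C}{\partial w},
\qquad
\Delta w \;\approx\; \frac{\partial H}{\partial u}\,\Delta u
\;=\; -\,\eta\,\frac{\partial H}{\partial u}\Big(\frac{\partial H}{\partial u}\Big)^{\!\top}\frac{\partial C}{\partial w}.
$$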
@user-co6pu8zv3v · 3 years ago
Hello, Alfredo :) Thank you for the video! :))
@alfcnz · 3 years ago
Hello 👋🏻 You're most welcome 😇
@asmabeevi · 3 years ago
Also, do you have quizzes for this class? On the course website I mainly see homework problems. Thanks!
@cristiano24597 · 1 year ago
Regarding the mixture of experts, would it make sense to train the expert nets separately? I mean something like training Expert 1 on a dataset that is for sure Catalan, and Expert 2 on another language. After that we'll have those trained models, and a general dataset (with multiple languages) could be used to train the gater only (the experts' weights wouldn't change anymore).
@alfcnz · 1 year ago
Sure, you can do that. It turns out that the joint model works better because it can exploit the similarity between the two contexts.
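For anyone who wants to try both options from this exchange, here is a minimal sketch of a gated two-expert mixture (module names, sizes, and optimiser settings are my own assumptions, not from the lecture). Joint training simply back-propagates through gater and experts together; the variant proposed above would freeze the expert weights and train the gater alone.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_in, d_out, n_experts=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_in, d_out) for _ in range(n_experts)]
        )
        self.gater = nn.Linear(d_in, n_experts)  # produces the mixing weights

    def forward(self, x):
        w = torch.softmax(self.gater(x), dim=-1)               # (B, E)
        y = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, d_out)
        return (w.unsqueeze(-1) * y).sum(dim=1)                # convex combination

moe = MixtureOfExperts(d_in=16, d_out=4)
out = moe(torch.randn(8, 16))                                  # (8, 4)

# Joint training (what the reply recommends): every parameter gets gradients.
opt = torch.optim.SGD(moe.parameters(), lr=1e-2)

# Gater-only variant (the question above): freeze the experts, optimise the gater.
# for p in moe.experts.parameters():
#     p.requires_grad_(False)
# opt = torch.optim.SGD(moe.gater.parameters(), lr=1e-2)
```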
@cristiano24597 · 1 year ago
@@alfcnz makes sense, thanks!
@alfcnz · 1 year ago
You’re welcome 😊😊😊
@AdityaSanjivKanadeees · 2 years ago
Do the ideas for the update rules for W in the example @1:33:00 come from $dw = \frac{\partial H}{\partial u} \cdot du$?
@alfcnz · 2 years ago
There's a better comment explaining the math under the previous year's video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-FW5gFiJb-ig.html&lc=UgxPIlrkdcQAncIPyQ14AaABAg
@AdityaSanjivKanadeees · 2 years ago
At 1:14:38 Yann says that we can do non-linear classification with a mixture of gated linear classifiers. Isn't that still a linear classifier? Why is it non-linear; what makes it non-linear classification?
@alfcnz · 2 years ago
A linear classifier has a single hyperplane cutting the data space. Here we partition the space in two and then in two again. So we end up with something that is clearly not linear. It actually smells a lot like a decision tree, where you have an iterative subdivision of the data space.
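In symbols (my own, for illustration): with a data-dependent gate $g(x)\in[0,1]$ and two linear classifiers $w_1^{\top}x$ and $w_2^{\top}x$, the mixture computes

$$
f(x) \;=\; g(x)\,w_1^{\top}x \;+\; \bigl(1-g(x)\bigr)\,w_2^{\top}x ,
$$

and since the effective weight vector $g(x)\,w_1 + (1-g(x))\,w_2$ itself depends on $x$, $f$ is no longer linear in $x$, even though each expert is.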
@AdityaSanjivKanadeees · 2 years ago
@@alfcnz Thank you!!
@alfcnz · 2 years ago
You're welcome 😊
@alfcnz · 2 years ago
No one mentioned a ReLU here. 🤨🤨🤨
@alfcnz · 2 years ago
How is that addressing Aditya's question?
@SubhomMitra · 3 years ago
Hey @Alfredo! I'm new to the SP21 course. Is there an order to the videos? You've previously uploaded videos numbered 01, 02, 03... but your recent videos are 01L, 02L, ... What does the "L" mean? Should I watch 01L after 01? I am trying to understand the naming convention here. Thanks! :)
@alfcnz · 3 years ago
L stands for lecture. I wasn't planning to release them initially. The order/index is on the class website, the official content-organisation homepage.
@SubhomMitra · 3 years ago
@@alfcnz Thanks for clarifying! The class website isn't fully updated yet, so I was a bit confused. Will you be uploading more lecture videos?
@alfcnz · 3 years ago
I've just published the first theme. You'll get a new one every week. For the latest news, you may want to follow me on Twitter, where I announce all these things.
@SubhomMitra · 3 years ago
@@alfcnz Thank you very much! 😇
@antoniovelag.8080 · 2 years ago
Hello Alfredo, I can't find these slides on the website. Am I just not looking right, or are they missing?
@alfcnz · 2 years ago
Click on the icon next to the lecture title on the website.
@antoniovelag.8080 · 2 years ago
@@alfcnz There is only a camera icon and it sends me to this video :(
@alfcnz · 2 years ago
Oh, is this link missing? drive.google.com/file/d/1IaDI6BJ6g4SJbJLtNjVE_miWRzBH1-MX/ Feel free to send a PR if it's correct.
@pypy1285 · 3 years ago
Thank you, Alfredo. I want to know what the 'z' is in the attention architecture example Yann introduced. Does the 'z' also come from the training data? Thank you!
@alfcnz · 3 years ago
You need to point out minutes:seconds for me to be able to address your question.
@pypy1285 · 3 years ago
@@alfcnz ~59:02, the topic is "multiplicative modules". Thank you!
@alfcnz · 3 years ago
z is a latent input. Latent means it's missing from the data set. Hence, you need to infer it by minimisation of the energy using GD. We've extensively covered latent variable energy based models in previous lectures.
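As a small illustration of inferring a latent by minimising the energy with gradient descent (the energy function, the decoder, and all names below are assumptions for this sketch, not the lecture's model): the observation y stays fixed, only z is optimised.

```python
import torch

def energy(y, z, decoder):
    # simple reconstruction energy: how well the decoded latent explains y
    return ((decoder(z) - y) ** 2).sum()

decoder = torch.nn.Linear(2, 8)          # stand-in for a trained decoder
y = torch.randn(8)                        # one observation
z = torch.zeros(2, requires_grad=True)    # latent variable, initialised at zero

opt = torch.optim.SGD([z], lr=0.1)        # only z is optimised; the decoder is not updated
for _ in range(50):                       # inference = gradient descent in z
    opt.zero_grad()
    E = energy(y, z, decoder)
    E.backward()
    opt.step()
```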
@pypy1285 · 3 years ago
@@alfcnz Thank you. Sorry, I didn't get this information in a previous lecture (if it's in the order of the videos), but I noticed that there is a lecture called "05.1 - Latent Variable Energy Based Models (LV-EBMs), inference" among the later videos. Thank you!
@alfcnz · 3 years ago
My bad, I apologise. I thought this was the lecture on associative memories. These topics are only briefly introduced here and will be extensively covered later on.
@HassanAliAnwar · 3 years ago
Why doesn't ReLU have a variant with ReLU(x) = -x for x < 0?
@alfcnz · 3 years ago
What do you need the identity function for? 🤨🤨🤨
@HassanAliAnwar · 3 years ago
@@alfcnz I meant RELU(x) = abs(x). It will still be non-linear, but I guess it serves no purpose.
@alfcnz · 3 years ago
Absolute value has been used as a non-linear function, but it wouldn't let you turn off specific inputs. So the output would always be a non-zero piecewise-linear combination of the input.
@ryans6946 · 3 years ago
@@HassanAliAnwar Hey! I think this is covered at the start of the lecture, when Yann takes a question about non-monotonic activation functions. To summarise: intuitively, since there would be two solutions for x when f(x) equals some value (except for x = 0), the gradient descent step could be taken in multiple directions, which can, but not always, lead to less efficient learning. I.e. if abs(x) = 2, do we walk in the direction of the gradient being -1 or in the direction of the gradient being 1?
@alfcnz · 3 years ago
Yup.
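A tiny check of the two points above (a sketch, not from the lecture): ReLU zeroes the gradient for negative inputs, effectively switching them off, while the absolute value passes a ±1 gradient everywhere, and f(x) = c has two preimages for any c > 0.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.])   -> negative inputs are turned off

x.grad = None
x.abs().sum().backward()
print(x.grad)   # tensor([-1., -1., 1., 1.]) -> every input keeps a ±1 gradient
```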