
CS231n Winter 2016: Lecture 7: Convolutional Neural Networks 

Andrej Karpathy

Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 7.
Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.

Published: 13 Sep 2024

Comments: 102
@user-vs2ej2id3y · 1 year ago
It's still a good way to begin the journey into CV in 2023, thanks for teaching.
@leixun · 3 years ago
*My takeaways:* 1. Convolutional Neural Networks 2:00 2. Case study: LeNet, AlexNet, ZFNet, VGGNet, GoogLeNet, ResNet, AlphaGo 45:52
@phargobikcin · 8 years ago
Amazing, super intuitive lectures. Wish all lecturers were like this. Thanks a million!
@vil9386 · 7 months ago
This clears a lot of doubts I had in my head. Thank you, Andrej.
@fredlawton7561 · 8 years ago
At 24:30 in Lecture #7 a student asks why the filter size F is odd. I believe this is because odd-length filters have an integer center: e.g. a 2x2 filter has its center at (0.5, 0.5), but a 3x3 filter has its center at (1, 1) (assuming indices start at 0).
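The integer-center argument in this comment is easy to check numerically. A minimal sketch (my own illustration, not code from the lecture):

```python
def filter_center(F):
    """0-based center coordinate of an FxF filter along one axis."""
    return (F - 1) / 2

for F in (2, 3, 5, 7):
    c = filter_center(F)
    print(f"{F}x{F} filter: center at ({c}, {c}), integer: {c.is_integer()}")
```

Odd F gives an integer center pixel; even F lands between pixels, which is why 3x3, 5x5, and 7x7 filters are the common choices.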
@rocklee8720 · 7 years ago
I think it has a lot to do with EDGE DETECTION in image processing: we use different filters to find properties along different directions of an image. Six filters are often used: the Roberts filter (2x2), Sobel filter (3x3), Prewitt filter (3x3), Laplacian filter (3x3), Gauss-Laplacian filter (5x5), and Kirsch filter (3x3)... I thought this would be helpful to understand: yangloong.github.io/posts/image_processing/edge_detection/edge_detection.html
@ChimiChanga1337 · 7 years ago
That is what Mr. Karpathy said in answer to the question in the lecture.
@AnuragHalderEcon · 7 years ago
EPIC MOMENT "But if you take an ensemble of humans and train them for a long time you may be able to get down to maybe 2 to 3% or so"
@tushartiwari7929 · 1 year ago
Tweetable Quote
@bayesianlee6447 · 5 years ago
He used a Tesla Model S pic at 11:22. That's the reason he became director of AI at Tesla one year later.
@mostinho7 · 4 years ago
4:10 the filter does a dot product across the depth of the image: the image initially has a depth of 3, but after convolving the filter we get an activation map with depth 1.
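That depth collapse can be seen in a few lines of NumPy. A sketch with made-up random data (the shapes follow the lecture's 32x32x3 example): each dot product of a 5x5x3 patch with a 5x5x3 filter is a single scalar, so one filter yields one 28x28 activation map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # HxWx3 input volume
filt = rng.standard_normal((5, 5, 3))      # one 5x5x3 filter

out = np.empty((28, 28))                   # (32 - 5) + 1 = 28
for i in range(28):
    for j in range(28):
        patch = image[i:i+5, j:j+5, :]     # 5x5x3 patch
        out[i, j] = np.sum(patch * filt)   # dot product -> one scalar

print(out.shape)  # (28, 28): depth 1 per filter
```

Stacking the maps from several such filters is what rebuilds the depth of the next layer.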
@sumitvaise5452 · 4 years ago
Recommend this if anyone wants to understand CNNs deeply. It's awesome.
@manoharn6495 · 6 years ago
Really informative and no ambiguity in the lecture; it was very nice and interesting.
@katerinamalakhova9872 · 8 years ago
Thanks for the great overview of Conv Nets!
@katerinamalakhova9872 · 7 years ago
Bruce Lee, why do you ask?
@roddlez · 8 years ago
Has there been any study on the performance (e.g. metrics like accuracy or training time) of convolutional networks trained on source data in RGB layers vs. something like hue, saturation, intensity?
@divyanshumishra6292 · 5 years ago
I love the way you make everything so easy. I have been studying and practising conv nets for roughly 2 years now, but still, looking at your videos each time brings a new intuition and understanding. Also, can you make videos covering the latest convnet trends?
@jennyperry4038 · 4 years ago
Divyanshu Mishra I am sure he will.... ;)
@hidingbear · 7 years ago
Awesome lecture! I learned so much from your presentation! Thanks
@aryanbhushanwar9083 · 1 month ago
Much thanks!
@lalithsamanthapuri2055 · 5 years ago
1) Sir, at 15:32 I used my own formula: a 7x7 input and a 3x3 filter with stride 2 gives (7-2^2)x(7-2^2) => (7-4)x(7-4) => a 3x3 output matrix. Similarly, an NxN input and an FxF filter with stride S give a [N-(F-1)^S] x [N-(F-1)^S] output matrix. Sir, please correct me if I'm wrong...! 2) Why are we using square matrices everywhere and not rectangular matrices? Does some computational problem occur there?
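For reference, the formula given in the lecture is (N - F)/S + 1. It also yields 3 for a 7x7 input with a 3x3 filter at stride 2, so the commenter's expression agrees on that example only by coincidence; the two diverge for other sizes. A quick check (my own sketch, not lecture code):

```python
def out_size(N, F, S):
    """Output size from the lecture: (N - F) / S + 1 (must divide evenly)."""
    assert (N - F) % S == 0, "filter does not fit cleanly with this stride"
    return (N - F) // S + 1

print(out_size(7, 3, 2))   # 3, matching the commenter's example
print(out_size(9, 3, 2))   # 4, whereas N-(F-1)^S would give 9-4 = 5
```

So the proposed formula is not equivalent in general, even though both give 3x3 here.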
@OttoFazzl · 6 years ago
At 27:24 he says we always work with squares. That is not true anymore: modern architectures such as ResNet can work with inputs of any size. It all depends on how you process the data at the end of the network. If you use global average pooling per feature layer near the end of the net, then you do not care about the spatial dimensions of the input image.
@nahar7861 · 7 years ago
Thanks, great tutorials.
@anmolaggarwal5220 · 8 years ago
Thanks a lot for such amazing lectures.
@junhongkim4913 · 8 years ago
Amazing lecture! Thank you.
@Alex-gc2vo · 8 years ago
So the "dot product" of a section of the image and a filter is actually the dot product of two n-dimensional VECTORS, not matrices, correct? Because the dot product of two 3D matrices is completely different from that of two n-dimensional vectors.
@kristjan2838 · 7 years ago
Correct; I don't know why it wasn't explicitly stated here. x is the flattened vector representation of the 5x5x3 section of the input. I realize you asked this question a year or more ago, but someone else might also be curious.
@essamal-mansouri2689 · 6 years ago
Actually I'm pretty sure he meant the dot product of two matrices. If we use a 5x5x3 (75) filter on a 32x32x3 input volume, we get 28x28 (784) unique inputs to the filter. If we are using a set of 6 filters, that means we need to calculate the dot product of two matrices with shapes (784, 75) and (75, 6). The result has the shape (784, 6), which we then reshape to (28, 28, 6). I realize you answered this question a year or more ago, but someone else might also be curious.
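Both views in this thread are consistent: each per-patch scalar is one entry of the big matrix product. A sketch of the im2col-style computation described above, with random data (my own illustration, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))    # input volume
w = rng.standard_normal((5, 5, 3, 6))   # 6 filters of size 5x5x3

# im2col: gather every 5x5x3 patch as a flattened row -> (784, 75)
patches = np.stack([x[i:i+5, j:j+5, :].ravel()
                    for i in range(28) for j in range(28)])

# filters as columns -> (75, 6); one matmul does all 784*6 dot products
w_mat = w.reshape(75, 6)
out = (patches @ w_mat).reshape(28, 28, 6)
print(patches.shape, w_mat.shape, out.shape)
```

Each row of `patches` dotted with each column of `w_mat` is exactly the vector dot product the earlier replies describe; the (784, 75) x (75, 6) matrix multiply just batches them all.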
@DarkLordAli95 · 3 years ago
The lecture is so good that no one was even bothered by how the matrix was glitching out... or maybe they all know about the matrix?????? *Dramatic Pause*
@akshayaggarwal5925 · 6 years ago
Link to the demo: cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
@simonvutov7575 · 1 year ago
thx
@李小慧-m8h · 8 years ago
Wonderful lecture!
@ArdianUmam · 7 years ago
Minute 59:27: I tried to recalculate the total memory by summing all the red numbers, and I end up with around 15M, not 24M, unless I count the fully connected layers as 4M instead of 4K for 4096 and so on. Which is wrong?
@5up5up · 5 years ago
Thank you Thank you Thank you!
@DarkLordAli95 · 3 years ago
Why can't the other professors make diagrams like these??!! Instead I got 2D diagrams stacked on top of each other, and half of the other stuff is not even there. smh. It doesn't take that much time, and they can reuse them every year. JUST WHY?!!!
@allaabdella4794 · 6 years ago
Thank you for this good lecture. I'm new to ConvNets and I'm not sure how you compute the number of neurons in the fully connected layers. For example, in AlexNet's case, how did he get FC6 = 4096 neurons?
@sezaiburakkantarci · 6 months ago
1:14:27 - The network never fully converges, but at some point you stop caring, because it has been 2 weeks and you are just tired. 😅
@andyyuan97 · 8 years ago
Can anyone enable the subtitles?
@taeefnajib · 2 years ago
Who decides the number of filters (K)? Is it an arbitrary number, i.e. a number chosen by us, or is there any rule for this?
@omeryalcn5797 · 6 years ago
Thanks for the lecture. Why not mention conv-net backpropagation? It is different from plain NN backprop.
@Zynqify · 1 month ago
If you're confused by the example at 19:50: it is very simple but can be confusing, because in the slides the rule for the output size was defined as (N-F) / stride + 1. The lack of parentheses in that definition causes the confusion, as it really should be ((N-F) / stride) + 1, which then gives the correct answer. Wrong: (32+2+2-5) / 1+1 = 31/2? Correct: ((32+2+2-5)/1) + 1 = (31/1)+1 = 32
@Kornackifs · 1 month ago
Why aren't there any captions with the video?
@sokhibtukhtaev9693 · 6 years ago
At 46:38 we get a 16@5x5 input and the C5 layer outputs a 120x1 flattened vector instead of 400 (16x5x5 = 400)? What is used here, folks?
@hans1066 · 5 years ago
Oh, so the kernels themselves are initialized randomly and are susceptible to training too?
@ShubhanshuAnand · 4 years ago
Why do we use stride? As in, what happens when we move in steps of 2 instead of 1?
@owaisiqbal4160 · 3 years ago
Link to CONVNETJS: cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
@asimpleenigma · 7 years ago
Is there no sigmoidal activation function after convolution? Or does ReLU play the role of the non-linear activation?
@sherylhohman6464 · 6 years ago
It's all ReLUs.
@tugberkayar5786 · 7 years ago
As far as I understand, filters are initialized randomly. So how do we know that two separate filters in the same layer don't look for the same thing after reducing the loss function?
@janjanusek4383 · 7 years ago
I think you'd get the answer if you think about the distribution used for weight initialization. Say we use a uniform distribution: it is possible that features will look for similar patterns, but definitely not the same ones. On the other hand, if you do a bad initialization of the random generator, i.e. with bad or identical seeds, it is possible to run into the problem you're describing...
@dorraviv8166 · 6 years ago
Great Vid!
@puneetjindal8894 · 7 years ago
How did we decide on 6 as the depth of the output after passing the first input layer through the first conv layer?
@miftahbedru543 · 7 years ago
I guess I heard him say the depth is a hyperparameter we need to choose and fine-tune; perhaps start with a recommended one and change it based on experiments...
@Yasser2652 · 7 years ago
At 32:01, how can I say these neurons are locally connected? And thank you very much for this lecture!!!
@michaelmoore7568 · 2 years ago
Is this Karpathy of Karpathy's formula?
@simonhuang5067 · 5 years ago
Teacher, can you add English subtitles so that I can understand what is taught? I am a foreign student and my English listening is weak.
@hydroUNI · 4 years ago
Andrej is a program. Confirmed at 47:49.
@ehza · 3 years ago
Nice!
@miracle-of-some-sort · 4 years ago
Can anybody tell me how I can update the filter weights? Where can I find a simple explanation?
@hans1066 · 5 years ago
How do you determine the filter? I mean, you can have e.g. high-pass filters, low-pass filters, etc. How do you choose the kernel?
@ramnewton · 4 years ago
We don't have to. The values of the filter are parameters of the network; they will be learned during training.
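A toy illustration of that point (my own sketch, not from the lecture): start a 1D filter at random and fit it by plain gradient descent so that its outputs match those of a "true" edge-detector filter. The least-squares loss and its gradient are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)
true_filt = np.array([-1.0, 0.0, 1.0])      # target "edge detector"

def conv1d(x, w):
    # valid cross-correlation, as in the lecture's sliding-window picture
    return np.array([x[i:i+3] @ w for i in range(len(x) - 2)])

target = conv1d(signal, true_filt)
w = rng.standard_normal(3)                   # random initialization

for _ in range(1000):
    err = conv1d(signal, w) - target
    # gradient of the mean squared error w.r.t. each filter tap
    grad = np.array([err @ signal[k:k+len(err)] for k in range(3)]) / len(err)
    w -= 0.5 * grad

print(np.round(w, 3))   # close to [-1, 0, 1]
```

This is the answer to several questions in this thread: nothing chooses a Sobel-like or high-pass kernel up front; the values converge to whatever reduces the loss.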
@ishanmishra3320 · 5 years ago
The dimension of the input (image) is 32x32x3. I guess 3 is for RGB, so it's actually 3 matrices, one each for R, G and B, stacked together. Same with the filters. Can anyone explain how the final volume has 1 as its 3rd dimension?
@bf1228 · 5 years ago
The dot product operation aggregates all 3 channels into one. Same as aggregating downstream filter outputs.
@puneetjindal8894 · 7 years ago
Please help me understand: if we need to identify different colours, should we have 3 filters? Another question: what if I take a 1x1 filter?
@tthtlc · 6 years ago
Which colour do you want to detect? The reason it is 3 layers is that the original image has been split into 3 different channels: RGB. Therefore its corresponding output will only have RGB layers, and if you want any other colour you have to combine them from the individual (RGB) filters.
@sherylhohman6464 · 6 years ago
The network will tune itself to create a filter that recognizes colors, if that is an important "feature" for distinguishing images. The first CONV layer takes a dot product with each of the 3 colors (input depth 3), so colors get encoded into the weights this way. They are no longer held segregated into layers of their own; they are encoded into the output of each filter.
Each filter encodes colors (and spatial features) independently of the others, so each "depth" of the output contains a different mix of values indicating how important color is to it (for each pixel location). In this way color importance gets remixed and distributed. The weights of each filter then adjust themselves during backprop, taking into account color, location, and brightness. These become the "features" each filter looks for in any subspace of the image.
Filter depth must always match the depth of the input layer it acts upon, so the first filter would always have depth 3 given an RGB image. If the image is grayscale and encoded as 2D, then the filter would also be 2D. In that case, a 1x1 filter would be appropriate.
@user-ju9zf7qk8i · 6 years ago
Hi, I'm interested in the class but my English is not very good. Could you please turn on subtitles on the video? Thank you.
@menglin487 · 8 years ago
ZFNet has an input size of 224; (224-7)/2+1 = 109.5, which is not an integer, yet the next layer has input size 110. So how do they deal with this?
@menglin487 · 8 years ago
+meng lin @56:47
@OttoFazzl · 6 years ago
They are probably using zero padding to make it work.
@anchitbhattacharya9125 · 5 years ago
Look for a ghost between 30:20 and 30:30
@OttoFazzl · 6 years ago
Too bad backprop through the conv layer was not explained in the lecture.
@lebrful · 8 years ago
Where can I get the slides?
@babypiro5561 · 5 years ago
What if we have 4 channels... RGB and depth?
@theempire00 · 8 years ago
I don't understand: how do you go from the last pooling layer to the FC layer?
@ccoding_2436 · 8 years ago
+apple-sauce Just stretch it and do some sorting; find the relevant paper and it will tell you the details.
@Chr0nalis · 8 years ago
+apple-sauce You 'flatten' it. It just means that you unroll the last pooling layer into a single long sausage. It doesn't matter how you do the unrolling, though, as long as you stay consistent, because FC networks do not take into account any spatial coherence and will just learn the correct mapping through BP.
@wiiiiktor · 6 years ago
t=42:51 Weights are illustrated well by Yosinski in DeepVis: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-AgkfIQ4IGaM.html
@charbelchahla1 · 8 years ago
Hi, is the code for CIFAR10 (the demo shown in this lecture) available?
@Mnnvint · 8 years ago
+Charbel Karpathy's convnetjs code is on his GitHub account.
@TheTharinduTube · 7 years ago
Is the demo available anywhere on the Internet?
@miftahbedru543 · 7 years ago
I assume you are referring to the JavaScript demo; if so, you have it here... cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html ... I typed the URL pausing the screen haha
@dannyiskandar · 5 years ago
Does anybody know the link to the JavaScript demo at the 38-minute mark?
@yilingliu358 · 5 years ago
cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
@mperez671 · 4 years ago
30:24 - bug with trust_me function
@changliu4452 · 8 years ago
Hi, is this video available only at 360p?
@romanring5382 · 8 years ago
+Chang Liu YouTube takes a while to process 720p/1080p. Come back in an hour or so.
@changliu4452 · 8 years ago
Roman Ring great!
@anirudhakumar1653 · 8 years ago
Since the image matrix is 3-dimensional, how can the product of a part of the image (a 3D matrix) and the filter matrix (2D) be 2-dimensional?
@legendarytechuniversity · 8 years ago
+Anirudha Kumar The product of the filter and the part of the image must be summed up, so it will be just a scalar value. Then you slide the filter by the stride step to the next part of the image and do the same thing. Repeat this until you reach the end of the image. As a result you will have a 2D matrix consisting of these summed-up products across the whole image.
@jingkangY · 8 years ago
+Dmitriy Bobrenko From my memory, the dot product of an i*j matrix and a j*k matrix is an i*k matrix; how could it give a single scalar value? Do these matrices need to be flattened into vectors, with the dot product here just being the sum of the products of corresponding elements?
@legendarytechuniversity · 8 years ago
+Jingkang Yang Yes, you are absolutely correct. The dot product will be an i*k matrix; then, to get a scalar value, we need to sum up all of this matrix's elements into a single value.
@jingkangY · 8 years ago
Dmitriy Bobrenko Thanks, I get it. And that is why it's called a "Convolutional" Neural Network.
@roddlez · 8 years ago
+Anirudha Kumar I'm not sure I agree with Dmitriy's replies. Someone please correct me if I'm wrong, but: first of all, each filter is 3D (e.g. see 28:44), so there is no additional step needed to generate a scalar for a single pixel. The first layer examines all three color channels, so a single filter reduces the 3xNxN image to a 1xNxN data matrix. The depth of the second layer comes from the fact that there are F filters in layer 1, resulting in an FxNxN matrix.
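roddlez's shape accounting can be checked directly. A nested-loop sketch with random data in the channels-first layout his comment uses (zero padding of 1 keeps the NxN spatial size; the numbers are mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 8, 4                                  # image size, number of filters
image = rng.standard_normal((3, N, N))       # 3xNxN, channels first
filters = rng.standard_normal((F, 3, 3, 3))  # F filters, each 3x3x3 (3D!)

# zero-pad by 1 so 3x3 filters preserve the NxN spatial size
padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))

out = np.empty((F, N, N))
for f in range(F):
    for i in range(N):
        for j in range(N):
            # each filter spans ALL 3 channels -> one scalar per position
            out[f, i, j] = np.sum(padded[:, i:i+3, j:j+3] * filters[f])

print(out.shape)  # (4, 8, 8): an FxNxN output volume
```

One filter collapses the 3 input channels to a single map, and stacking the F maps gives the FxNxN volume described above.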
@egemeyvecioglu3165 · 8 months ago
1:09:10 it worked :)
@phangb580 · 4 months ago
27:40
@tiffanyk2743 · 3 years ago
@17:00
@jkdw21 · 7 years ago
Why did someone turn off the lights? Just curious.
@dmitry926 · 7 years ago
Why are there no subtitles for this lecture?