
All Convolution Animations Are Wrong (Neural Networks) 

Animated AI
11K subscribers
61K views

Patreon: / animated_ai
All the neural network 2d convolution animations you've seen are wrong.
Check out my animations: animatedai.git...

Published: Sep 28, 2024

Comments: 156
@thomasprimidis9360 · 2 years ago
All these wrong illustration and animation trends are among the many problems that make you think "why the hell have we been doing this all wrong, all the time, everywhere?" Finally, someone came along and did the obvious. Thank you!
@cfranc1s · 1 year ago
They are not wrong. They just show a special case. They use the special case because the focus is on things like stride, dilation, padding etc. It's good to make the 3D tensor animations, but don't call the existing ones wrong. I think I would have still found it easier to understand the existing ones first and then move on to the 3D animations.
@spider853 · 1 year ago
Oh man, I'm so glad someone took a direct approach to this problem. When I was learning, I was so confused by all these 2D animations and explanations, and then seeing the resulting tensor shapes confused me even more: where did the depth go, and where did it come from? Thanks for bringing this video to the world!
@felipelourenco8054 · 14 days ago
Thanks for that. It was really confusing before your animation came up!
@rezhaadriantanuharja3389 · 1 year ago
Premise 1: All convolution animations are wrong. Premise 2: This is a convolution animation. Conclusion: this is wrong.
@Q_20 · 1 year ago
oh shit
@zyansheep · 1 year ago
0:02, "all convolution animations you've seen _up to this point_ are wrong"
@LimitedWard · 1 year ago
Maybe it was a proof by contradiction
@Firestorm-tq7fy · 6 months ago
No?
@krischalkhanal9591 · 5 months ago
Someone just read discrete algebra. (Kudos!)
@randyekrer431 · 1 year ago
You should've started with the typical RGB 3-layer input image and animated convolutions on that; that's where most people start to get lost as to how the weights match with the inputs, translating from the 2D mental model to 3D.
@amitamola2014 · 11 months ago
Bang on right. So he made a video about how others weren't doing it right, but then didn't start from the beginning to explain what actually goes on correctly. I mean, what good is this new one then :/
@allNicksAlreadyTaken · 1 year ago
They are not wrong. They are just displaying a different case than what you are interested in. Maybe they are misplaced in the material you were looking at, but if they were animations for different things, like convolution filters in image processing, they wouldn't be wrong. Have some humility.
@Henriiyy · 1 year ago
Also convolutions as an idea are way older and more general than just image processing or neural networks. He comes off as ignorant of this.
@Firestorm-tq7fy · 6 months ago
No, they are wrong. They give the feeling that each feature after convolution is a standalone one and gets post-processed accordingly, which just isn't true. They form a new 3D image, which then gets treated as such.
@hannahnelson4569 · 3 months ago
Convolution is defined for tensors of any finite dimension, even 1-dimensional. While the claims made in this video are valid from the domain of machine learning, I do agree that calling diagrams that describe a different use of a general structure 'wrong', because it's not how your particular field uses it, feels a bit sensationalist.
@andreas.karatzas · 1 year ago
@animatedai How did you learn Blender? What were your sources?
@ayamaitham8430 · 2 years ago
Such hard work! Thank you so much.
@ГригорийПогорелов-и7о
Adding the bias term that is applied after the convolutions would make it a representation of the full process. Anyway, great visualization!
@Shad0wWarr10r · 1 year ago
With those shapes of input and filter, is there even any reason to have them 3D over 2D? I get the output as you layer filters, but if the input and filter are just the same thing all the way through, representing them as single cubes is not wrong.
@jazzvids · 1 year ago
3:30 - the final animation
@menkiguo7805 · 8 months ago
amazing
@FelipeGustavoSilvaTeodoro · 1 year ago
Amazing!
@Shamisen100 · 1 year ago
Any plans to add your animations to Wikimedia commons? :)
@slime67 · 1 year ago
Great work!
@davidebic · 1 year ago
Is there a way to access your course online? I'm really interested in this subject!
@animatedai · 1 year ago
The course is a work-in-progress. You can see the videos that are completed so far in this playlist: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eMXuk97NeSI.html
@curious_one1156 · 1 year ago
Bravo!
@hjtvgfhjtghvfg5919 · 1 year ago
If all are wrong, then why should I watch this one?
@날개달린_양 · 1 year ago
tysm
@mariogonzalezotero · 1 year ago
amazing!
@LokeKS · 5 months ago
There is no convolution
@andreansihombing6780 · 2 years ago
So the output is a feature map? I don't get why the feature map on the right is stacked like that. Can anyone explain it?
@animatedai · 2 years ago
Yep, the output is a feature map. Each filter produces one feature of the output, and these get stacked together like you see. I've got a video covering that concept here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eMXuk97NeSI.html.
@andreansihombing6780 · 2 years ago
@animatedai Thank you for the explanations. I could barely understand with other visualizations, but you really did a good job.
@architech5940 · 3 months ago
You did not introduce convolution in any informative way, nor define any terms for your argument, and you didn't explain the purpose of 3D convolution or why 2D convolution is inaccurate in the first place. There is also no closing argument for what appears to be your proposition for the proper illustration of CNN. This whole video is completely open ended and thus ambiguous.
@avidrucker · 1 year ago
A major thing that feels missing to me in the animations is clear textual labeling. It's fine that you label them out loud, but text labels would also make this more accessible for folks with hearing or cognitive challenges. My crit aside, this animation is lovely, and I'm very impressed with what you've done. You've earned yourself a new subscriber :)
@kuanarxiv · 1 year ago
The example is just a concept. I don't agree with this sensational title.
@peabrane8067 · 1 year ago
The animation is just meant as an abstraction of the spatial convolution operation itself. A spatial CNN layer consists of spatial convolution operations across multiple input and output channels (which is what you are referring to)
@logon2778 · 1 year ago
Forget the animation itself (even though it's great). I just appreciate a non-moving camera. It bothers me so much when people spin the camera around a nice animation in a circle. Makes me feel like I'm on a carnival ride.
@grjesus9979 · 18 days ago
So in the case of a feature map input, does 2D conv just replicate each 2D filter along the feature dimension and multiply element-wise? In the video, are the filters really 2D, just replicated to fill the number of features? Or is each 2D filter in reality a 3D tensor that matches the feature dimension?
@tomo9908 · 1 year ago
Instead of spending 95% of the video ranting about how other animations are bad, I would have appreciated it more if you had spent that time explaining how this animation works. I don't think I learned anything from this video... How do you go from an input RGB image of size W * H * 3 to some cube of size 5 * 5 * 5 (+padding)? You lost me at step 1...
@animatedai · 1 year ago
Check out this 100% rant-free playlist to learn more! ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eMXuk97NeSI.html
@pew_pew_pew4377 · 1 year ago
Well, not to speak for the existing animations/figures: I won't say they are wrong. They have some issues, but essentially they are correct. When talking about 2D convolution, we should know the input and output are 3D, as the input is a picture and the output is also a picture/feature map.
@shuninc9273 · 1 year ago
Not wrong bro. They are just incomplete.
@devjeonghwan · 1 year ago
Unfortunately, only half right. What if we need to understand a 4D or 5D convolution situation? Humans understand 2D most intuitively, and I think that's the reason those 2D-based animations were made (and 2D convolution can be extended to larger dimensions). Also, deep learning convolution is unfortunately not mathematically organized: it is derived from "filters" in image processing, and "filters" in turn derive from "cross-correlation" long before that. Your animation has multiple kernels; it just depicts an argument called "channels" that is only used by neural network frameworks.
@peabrane8067 · 1 year ago
Just abstract/generalize it to higher dimensions then. This is the same as saying "why do we visualize vectors in 2D coordinate systems, even though Nd vectors are well-defined, or even infinite-dimensional vectors (Hilbert space)". Visualizations are meant to capture an intuitive/simplified example, not meant for generality. The generality comes from formal mathematical reasoning, which no visualization can capture.
@Henriiyy · 1 year ago
Convolution is well defined mathematically, and has been way before the invention of image processing.
@alexeychernyavskiy4193 · 1 year ago
They are not wrong. They are a simplification that helps to understand the concept. Like any simplification, they are incomplete. But not wrong. It's sad that you use clickbait titles.
@bediosoro7786 · 1 year ago
The first animation you say is wrong shows the contribution of one filter's operations, which is quite accurate. If you consider the number of input channels to be 1 and the number of output channels to be 1, that is the right figure for the whole operation. The conv2d operation is all element-wise matrix multiplications with shifting windows. The 3D animation you made looks great but lacks that notion. That is my opinion; I'll stick with the 2D.
@almightysapling · 1 year ago
Thank you! He never explains why his animation is "correct" and in my opinion it simply isn't. 2D convolutions act 2D on 2D data. The fact that we have *multiple data* and *multiple filters* leads to us frequently blocking things in 3D, but the convolution itself is still fundamentally 2D. And if someone doesn't understand that, it's not because of bad animations.
@pere_gin · 1 year ago
"A 2D convolution actually takes in a 3D tensor as input and has a 3D tensor as output": well, it depends, right? If you have a single-channel/grayscale image, then the input is in fact a 2D tensor, and each feature outputs a 2D tensor that is joined with all the others in the feature map. So if you have a grayscale image with a single feature, the animations would in fact be correct. I think the animations are perfectly fine, as they simplify a concept to its most basic form for easy understanding. But it is true that after you understand the basic concept, a 3D-to-3D representation is also nice for understanding more common and complex examples. Disclaimer that I could be wrong, as I am by no means an expert, but this is my take from my current understanding of convolutions :)
@animatedai · 1 year ago
I'm not aware of a library where it depends, i.e., where the depth dimension is optional. PyTorch's Conv2D will accept a 3D tensor or a 4D tensor (batched 3D tensors). The functional interface only accepts 4D. Keras's Conv2D layer will only accept a 4D tensor. TensorFlow's conv2d operation will accept anything with at least 4 dimensions (the last three are treated as the height, width, and channels, and all the others before that are treated as batch dimensions). NVIDIA's cudnn implementation of 2D convolution takes a 4D tensor. And in all of these cases the weights will be 4D, which can be thought of as a 3D weight for each filter, corresponding to the size of the 3D patch that the filters operate on in the input. So as far as the industry standard goes, there just isn't a 2D convolution where you don't have a depth dimension in the input. Your grayscale case will only have a depth of 1 but will in fact be a 3D tensor. If you're able to find a mainstream library where the depth dimension in the input is optional, let me know.
@Henriiyy · 1 year ago
@animatedai Still, the concept doesn't need the 3D implementation, as the different features are worked on independently anyway. I definitely think it's a stretch, and comes off really condescending, to call all other animations wrong.
@hyahyahyajay6029 · 20 days ago
I have been struggling to mentally visualize convolutions, especially going from one dimension to others. I was reading the book Understanding Deep Learning by Simon Prince and I realized that what I thought it looked like was wrong (the 2D-to-2D animations from the beginning). I wish I had stumbled upon yours before having to imagine what was explained in the book XD (Good book tho)
@captainjj7184 · 2 months ago
I like it, really, love it! But... I don't see what's wrong with other illustrations and peculiarly I think yours just iterates what they already clearly illustrate. I was even expecting CNN representations in XYZ visuals. Am I missing some points here? Honest question, would appreciate any enlightenment! (btw, thank you for sharing the world with your own version of splendid animation!) PS: If you're up for the challenge, do Spiking NN, I'll buy you a beer in Bali!
@kartikpodugu · 1 year ago
Amazing. You have cleared all my doubts in a single shot.
@PeppeMarino · 1 year ago
The use of all these misleading animations is the primary cause of misconception about convolutional neural networks; you have finally provided a good visualization. I am happy to share this content with my colleagues.
@Antagon666 · 1 year ago
I don't really care about the animations; the problem is when they start describing convolutions as 2D operations and don't go into detail on the effect of having multiple input and output channels. I wish I had found this video sooner, but anyway it's easy enough to derive the solutions yourself from 200 Google search results. (Google really sucks nowadays.) It's actually a good mental exercise to imagine the 3D/4D filter sliding across a batch of images... But good luck finding the correct padding for strided convolutions during backpropagation of both Conv and TransConv layers... I had to derive everything by hand, because the internet has incorrect and, even worse, conflicting formulas for that... 😂
@bibimblapblap · 1 month ago
Why does your input tensor have so many dimensions? Shouldn't the depth be only 3 (1 for each color channel)?
@nikilragav · 2 months ago
What actually is the 3rd dimension in this context for the source giant cube? Is that multiple colors? A batch of multiple images?
@Leibniz_28 · 1 year ago
Awesome. Finally a good representation of these computations. Thanks for your hard work!!!
@havehalkow · 8 months ago
I'm just curious if this visualisation helps someone who doesn't know what convolution is.
@alessandropolidori9895 · 1 year ago
Love it. I always thought there were no accurate visualizations on the internet too. Good job.
@HyperFocusMarshmallow · 1 year ago
Honestly, just write down the formula… Nice work though!
@kasuha · 1 year ago
I find these "new and correct" animations confusing; I have no idea what's happening there. I assume this is just "the correct way to display convolution" for AI models? As an old-school person who used convolutions mainly for 2D image processing (blur/edge detection), I don't see anything wrong about the old animations; that's exactly what we used to do there.
@kage-sl8rz · 3 months ago
Cool. Even better, adding names to the objects (kernel, etc.) would be helpful to new people.
@schorsch7400 · 6 months ago
Wow, really great, thanks for your work! I was struggling with the very problem you mentioned in the video - bringing together the 2D conv visualizations with the multi-channel 3x3 convolutions that are common in modern CNNs. Thanks to your work, I now understand it.
@dereklust3480 · 1 year ago
Great video, but you didn't explain why the input is a 3D tensor. If we are convolving a 2D image, where does the 3D tensor come from?
@animatedai · 1 year ago
The short answer is that both the input and output are feature maps. Check out this video explaining it: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eMXuk97NeSI.html
@dereklust3480 · 1 year ago
@animatedai A better way to ask my question is this: how is the first feature map created in a convolutional network? Surely the first time the 2D image is convolved, we have a 2D tensor and 2D filters, just like the typical animation, right? I get that the output of this convolution will be a 3D feature map, and thus all further convolutions will look like your 3D-to-3D animation.
@animatedai · 1 year ago
A 2D image is represented as a 3D feature map with 3 features: red, green, and blue. So even the first convolution has a 3D input.
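To make that concrete, here is a minimal sketch (image size and filter count are hypothetical) of an RGB image entering a first convolutional layer as a depth-3 feature map:

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 224, 224)  # (batch, {R, G, B}, height, width)
first_layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
print(first_layer(rgb).shape)      # torch.Size([1, 32, 224, 224]): depth 3 in, depth 32 out
```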
@simonpointner7545 · 1 year ago
@animatedai You may take a grayscale image as an input because for many cases it is sufficient, and for learning it's a good simplification. I would not consider this a wrong animation; it just assumes you have a grayscale image as an input. I get the point of your video, but the title is clickbait.
@bluvalor7443 · 1 year ago
Lol, I literally learned this the hard way just about 2 months ago, when the shape for my 2d convolution required 3 parameters, and this made me super confused :,)
@jazzvids · 1 year ago
Thank you so much for this! Worth mentioning that the animation has a stride of 2.
@yunusbilece8690 · 1 year ago
I liked the idea, but the title is too big for this kind of correction.
@RojinaPanta1 · 1 year ago
I really appreciate the effort and it's a good one, but I would still go with the 2D one; this is way too jittery for me, with so many things happening at once, and the choice of colors.
@koktszfung · 1 year ago
Convolution is not only used in neural networks.
@connorvaughan356 · 1 year ago
Why does the input shape have so many layers in this animation? Wouldn't it have a shape equal to the image shape, with 3 layers, one each for RGB?
@animatedai · 1 year ago
I'm glad you asked. In general, convolution takes a feature map as input, which can have any number of features (depth). An image is a special kind of feature map with 3 features: red, green, and blue. However, this typically only applies for the first layer of convolution in a neural network and the other convolutional layers will have more features for input, e.g., 32, 64, 128, ... 1024, 2048. So to better represent the general case and to encourage viewers to consider more than just the special case of an image, I chose to use 8 for the animations. Although 3 would also be perfectly valid.
@naasvanrooyen2894 · 1 year ago
Thanks so much for this. I also really struggled to find proper animations. Would have liked to see how this looks in the actual neural network, i.e. how the filter can be visualized as the weights, or how the filter parameters are trained. Would greatly appreciate a video on GANs and LSTMs. The LSTM diagrams are terrible; I really struggled to visualize how they connect to the overall network.
@ChaoticNeutralMatt · 11 months ago
Oh cool, didn't know Blender had all that.
@MrShadowjockey · 1 year ago
Thank you for this, recently I tried to explain why the input and output shapes behave the way they do, and what gets combined with what. These animations will make it sooo much easier!!
@jkkuusis · 1 year ago
Thanks for the great video! I'm currently trying to learn these concepts, so I might be suggesting something that is incorrect. Here it goes anyway: an example with an input of depth 3 might be good to better understand this by thinking of RGB image data. It also seems to be the (special) case in your animations that the input depth is the same as the number of filters, leading to the same depth in both input and output. That is not always the case, if I've understood this right.
@animatedai · 1 year ago
I intentionally avoided using an input depth of 3, because that's a special case. Most convolutional layers in a CNN will have an input depth much higher than 3. It's better to think of the convolutional layer as "feature map in, feature map out" rather than "image in, feature map out". By that same logic, I should have made the input depth different than the output depth, because that's also a special case like you said. I had thought about this at one point, but sadly it didn't make it into the final animation. I'll probably fix that on GitHub in the future.
@carnap355 · 10 months ago
I don't understand why filters are 3D. 8 deep = 8 channels? In the end you get as many channels as there are filters?
@animatedai · 10 months ago
Good questions! First: for a good explanation of "why" the filters are 3D, check out this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-XdTn5md3qTM.html. And second: the filter depth matches the input depth, and the output depth matches the filter count.
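Both rules are easy to check in code; a small sketch with hypothetical sizes (input depth 8, 16 filters):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 10, 10)       # input feature map with depth 8
filters = torch.randn(16, 8, 3, 3)  # 16 filters; each filter's depth matches the input's 8
print(F.conv2d(x, filters).shape)   # torch.Size([1, 16, 8, 8]): output depth = filter count
```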
@honourable8816 · 3 months ago
The stride value was 2 pixels.
@WordsThroughTheSky · 10 months ago
Wait a friggin' minute... you're telling me that the filter or kernel is 3D? I always thought it was a 2D 3x3 filter that goes through each "layer" of the input and recreates a 3D output tensor. Are you sure it's a 3D filter? Where is this stated?
@animatedai · 10 months ago
Haha, I'm sure :) You can check the documentation for your favorite neural network library to verify. The conv2d operation will actually take a 4D tensor for the filters. Each filter is 3D and you pack all the filters together in one tensor to get a 4D tensor. To convince yourself that it only makes sense for the filters to be 3D, check out the sequel video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-XdTn5md3qTM.html Sources: www.tensorflow.org/api_docs/python/tf/nn/conv2d pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html
@WordsThroughTheSky · 10 months ago
@animatedai you're right... my mind is actually blown, it's all been a lie, ty for the response
@leslietetteh7292 · 1 year ago
As someone who has never built a convolutional neural network, but who has done lots of convolution in image processing algorithms: the convolution they are showing is normal 2D convolution. 3x3 pixel values in -> convolution kernel operation -> single pixel value out. For showing what is actually going on during an operation with a convolution kernel, those first animations are perfect. As someone who has built at least a couple of neural networks with linear transformations, and who knows exactly how convolution kernels work, I'd hoped to be able to intuit what's going on in a convolutional neural network from your animation, but your animation is super confusing without any context. What is your input, a 3D tensor? I thought the input was a 2D image. What is the output? I thought the output was just "features" extracted from the image. What you've created is so abstract that it literally made me more confused than I was to start with. In my opinion, the best diagram is what's shown between 1:11 and 1:33. Except for the concept of 'pooling', it's 100% clear what the actual mechanics within the neural network are, and it explains the process of convolution. With no prior knowledge of convolutional neural network mechanics, I understand it roughly, save the concept of 'pooling'. If you think that information is too much for a student to be able to put together, you're doing too much of the thinking for them. Maybe with knowledge of convolutional neural networks the animation you've made would make sense, or with context, but for an introductory course on convolutional neural networks, it is so abstract as to be worse than useless. It's actively confusing.
@carnap355 · 10 months ago
The input is 3D because it has multiple channels, so these are like 3D convolutions, but they are commonly called 2D because the stride and padding are 2D. Let's say you have an image with 3 channels, so it's 100x100x3. If your layer has 16 output channels, it will have 16 convolution filters of size height*width*3. So you will end up with a 100x100x16 image.
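A runnable version of that example, assuming 'same'-style padding so the 100x100 spatial size is preserved:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 100, 100)                 # a 100x100 image with 3 channels
layer = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 16 filters, each of size 3x3x3
print(layer(image).shape)                           # torch.Size([1, 16, 100, 100])
```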
@leslietetteh7292 · 10 months ago
@carnap355 I have done a little bit of work with convolutional neural networks now; I apologize if my response came off as rude or blunt, I tend not to use as many niceties on the internet as I would in a usual conversation. I have to say I still find this confusing. Though I get what you're saying, I don't understand what that big 3x3xN cuboid block in the middle is, with regard to each of the convolution kernels, the image, and the image output (at best guess, the 3D block is all the stacked results of the previous convolution operations, being subjected to a new convolution operation?). What do you think of the "CNN Explainer" website (you can google it)? That's how I understand convolutional neural networks as of now. I also now understand the max pooling layer to be akin to what I would know from image processing as a "maximum filter" operation. So I "think" I have an understanding of what's going on in a convolutional network, but feel free to correct me.
@jameshopkins3541 · 3 months ago
Which is correct?????
@lizcoultersmith · 2 months ago
These videos are outstanding! Finally, true visualisations that get it right. I'm sharing these with my ML Masters students. Thank you for your considerable effort putting these together.
@feddyxdx272 · 1 month ago
thx
@aintgonhappen · 1 year ago
This is some amazing content. Thank you, buddy!
@axelanderson2030 · 1 year ago
Geonodes were easier?
@PurnenduPrabhat · 4 months ago
Good job
@macewindont9922 · 4 months ago
sick
@JonDornaletetxe · 1 year ago
🔥
@xuerobert5336 · 11 months ago
Brilliant!
@eliagonzalezmolina366 · 6 months ago
I was looking for something like this to dispel my doubts and it worked! thanks :) (I think you are right, common animations are super misleading)
@alimurreza · 10 months ago
Excellent visualization! I will definitely show these visualizations to the students in my Machine Learning course. They will love it.
@GamesEngineer · 1 year ago
Thank you! Great animation. However, I do have a technical nit pick. Your animation shows an operation known as cross-correlation, which is related to convolution, but it is mirrored. "Convolutional neural networks" use cross-correlations in the feed-forward phase and convolutions in the backpropagation phase.
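This distinction is easy to verify numerically; a small sketch comparing PyTorch's conv2d (which computes cross-correlation) with SciPy's true, kernel-flipping convolution:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import convolve2d, correlate2d

x = np.random.randn(5, 5).astype(np.float32)
k = np.random.randn(3, 3).astype(np.float32)

# Deep-learning "conv2d" is cross-correlation (no kernel flip):
torch_out = F.conv2d(torch.from_numpy(x)[None, None],
                     torch.from_numpy(k)[None, None]).numpy()[0, 0]
assert np.allclose(torch_out, correlate2d(x, k, mode='valid'), atol=1e-5)

# True convolution flips the kernel in both spatial dimensions:
assert np.allclose(convolve2d(x, k, mode='valid'),
                   correlate2d(x, k[::-1, ::-1], mode='valid'), atol=1e-5)
```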
@orenAm · 11 months ago
Your videos are very cool! I wonder if you have thought about how to present Conv3d; it is a challenge when considering more than one channel.
@osmmanipadmehum · 1 year ago
4:04 why isn't the last column scanned?
@animatedai · 1 year ago
I'm glad you noticed. I picked an even column number specifically to demonstrate that. This happens because there's a stride of 2, an even number of columns (8 in this case, counting padding), and an odd-sized filter (3x3). So after it's taken 3 steps, there's only 1 pixel of width remaining, and it can't move 2 more spaces. Convolution handles this by simply ignoring the remaining data and moving to the next row. This is important to know because it could cause you to lose data (which could accumulate over many layers into significant chunks of your input). In this case, the last column was just padding anyway, so no real data is lost. Note: the last row is scanned because, unlike the columns, we have an odd number of rows, 7 counting padding.
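The standard output-size formula, out = floor((in + 2*pad - kernel) / stride) + 1, predicts exactly this; a sketch with sizes chosen to match the description above (5x6 data plus 1 pixel of padding on each side, i.e., 7x8 padded):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 6)  # (batch, depth, height, width); padding makes it 7x8
w = torch.randn(1, 1, 3, 3)  # one single-channel 3x3 filter
y = F.conv2d(x, w, stride=2, padding=1)
print(y.shape)               # torch.Size([1, 1, 3, 3])

# Width: floor((6 + 2*1 - 3) / 2) + 1 = 3 window positions, covering padded
# columns 0-6; the 8th padded column never enters any window.
```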
@BornAgainstAll · 1 year ago
Oh.
@twbjr2 · 9 months ago
Thank you for making this video! I have been trying to visualize this using all the horrible diagrams from papers. I immediately understood what they were trying to convey after watching your video!
@rukshanj.senanayaka1467 · 10 months ago
This video feels like an iPhone moment - a video I didn't know I needed until I saw it. Thanks a lot!
@momolight2468 · 1 year ago
Best video about neural convolution and filters!? YES!!! Thank you so much!
@37window57 · 1 year ago
3D, what tool are you using? Blender?
@animatedai · 1 year ago
I'm using Blender with a lot of Geometry Nodes.
@37window57 · 1 year ago
@animatedai Thank you! Is the source not open?
@bala.dhinesh · 1 year ago
This is what I expected for a long time. This explains everything clearly. Thanks for posting this.
@menkiguo7805 · 8 months ago
This question has been confusing me for a long time. Thank you!
@guyindisguise · 1 year ago
Nice animation, are you planning on making animations for Transformers as well?
@Xardex-u5b · 1 year ago
I hear it.
@laplaceha6700 · 1 year ago
This conv2d animation you made is right, thanks a lot.
@____-gy5mq · 4 months ago
best generalization ever, covers all the corner cases
@sscswist · 1 year ago
Finally. You are the best. When I was learning this, I was always looking at all those original animations and I was always so confused...
@hos42 · 1 year ago
Thank you for putting this out!
@rukshanj.senanayaka1467 · 10 months ago
Super helpful. Any plans to make this open source, or to make interactive cases where we can change the stride and see the variation?
@animatedai · 10 months ago
Thank you! On my GitHub page (animatedai.github.io), you can see a few different variations, and I'm also working on an interactive webgl app where you can pick the parameters.
@Gabbyreel · 1 year ago
I'm a little confused by the video, because I still don't understand WHY 2D convolution pictures are wrong. What determines the depth of the first input? Same with the convolutional layer. Is this because we have RGBA layers, or?? What's the benefit of drawing it as 3D instead of 2D? What's the benefit to us of having a tensor instead of an array of convolutional outputs? I'm sure this sounds like thoughtless complaining, but I really am curious, and there must be something about convolution in AI that I'm missing in my own knowledge. Thanks for reading this.
@animatedai · 1 year ago
Those are good questions.
1) Why are the 2D animations wrong? Short answer: they simplify away the feature dimension. Long answer: the 2D convolution that you'll find in neural network libraries (all the way down to NVIDIA's hardware interface) is conceptually performed on 3D data. These end up being batched, so the interfaces technically take 4D tensors, but that's just multiple convolution operations on 3D data performed in parallel. If your only understanding of convolution came from the 2D animations, you wouldn't understand how to create the 4D tensors (or the 3D piece of data for a particular sample in the batch). In fact, you wouldn't know why the operation took 4D tensors at all. If you'd like more information on the feature dimension, my first video on the fundamental algorithm (ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eMXuk97NeSI.html) should provide enough information to understand what the feature dimension is and why it's essential.
2) What determines the depth of the first input? This depends on what data you have. A color image in RGB format would have a depth of 3: red, green, and blue. A color image with transparency in RGBA format would have a depth of 4: red, green, blue, and alpha. A grayscale image with a single brightness value for each pixel would have a depth of 1: brightness.
3) What determines the depth of the output? Check out my video on filter count: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-YSNLMNnlNw8.html
4) What's the benefit of drawing it as 3D instead of 2D? 2D convolution conceptually operates on 3D data (2 spatial dimensions and one feature dimension), so drawing it in 3D shows everything and doesn't simplify anything away from the viewer.
5) What's the benefit to us of having a tensor instead of an array of convolutional outputs? Could you clarify this question? I'm not sure I understand it. Are you asking why we pack all the output values into a single 3D tensor instead of multiple 2D tensors (maybe one for each feature)? It's rare to want the features separated out like that, so it's convenient to have everything together in one tensor. It also has performance benefits from memory locality.
@Henriiyy · 1 year ago
@animatedai Still, the concept doesn't need the 3D implementation, as the different features are worked on independently anyway. I definitely think it's a stretch, and comes off really condescending, to call all other animations wrong.
@Karmush21 · 1 year ago
Is a cube in the filter (or image) a pixel? Or is it a combination of channels?
@animatedai · 1 year ago
Good question. Each cube is a single floating point value.
@Karmush21 · 1 year ago
@animatedai Thank you for the answer. Btw, why do you think the typical animations are wrong? Can't we just do a 2D convolution on each slice of an input image and then stack the slices together to get the same feature map as with your animation?
@animatedai · 1 year ago
That's a great question. So good, in fact, that I'm planning to make a follow-up video explaining it. I think a lot of people are struggling with this idea, and your question's phrasing really helped me understand where the misconception is. The short answer is that what you're proposing wouldn't be equivalent, because in neural network convolution, each filter sees all the features/channels of the input, not just a slice of the input. That's why the filters themselves are 3D. More concretely, let's say a bumblebee-detecting neural network wanted to look for the color yellow, so it needed a filter that detected yellow. That filter couldn't just look at the red channel or just the green channel or just the blue channel. It needs to look at all of them together to distinguish yellow from red or green or white or any other color. So we can't slice the input up into red/green/blue and then operate on them separately. Does that make sense?
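A toy numerical version of that yellow-filter idea (the weights here are made up for illustration): a single 1x1 filter can only tell yellow apart because its weights span all three input channels at once:

```python
import torch
import torch.nn.functional as F

w = torch.tensor([[[[1.0]], [[1.0]], [[-2.0]]]])  # shape (1, 3, 1, 1): +R, +G, -B

yellow = torch.tensor([1.0, 1.0, 0.0]).view(1, 3, 1, 1)  # one yellow pixel
blue   = torch.tensor([0.0, 0.0, 1.0]).view(1, 3, 1, 1)  # one blue pixel
print(F.conv2d(yellow, w).item())  #  2.0 -> strong response
print(F.conv2d(blue, w).item())    # -2.0 -> suppressed
```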
@Karmush21 · 1 year ago
@animatedai Yeah, it makes total sense once you realize (which is quite obvious when you think about it) that the input channels pretty much always have some sort of correlation between them. In your example, we need all 3 channels to see how much red, green, and blue we have to get this certain type of yellow. I talked with my professor about this topic (where I referenced your video). He believed the only situation where this 2D convolution and stacking of slices is better is if you don't have that much data; also, the training will be faster. I think a lot of people would appreciate a video exploring the difference between these two ideas in more detail. At least I would. Thank you already for your animations and answers!
@animatedai · 1 year ago
I meant to add that the splitting and stacking isn't a crazy idea as long as you understand the limitation (compared to standard convolution) and compensate for it. In fact, it's the basis of the depthwise-separable convolution, which can be much more efficient than standard convolution. I've got a video on it that you might like: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-vVaRhZXovbw.html
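For reference, a minimal sketch of a depthwise-separable block in PyTorch (channel counts hypothetical); with groups equal to the channel count, each depthwise filter really is a single 2D slice:

```python
import torch
import torch.nn as nn

depthwise = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8)  # one 3x3 filter per channel
pointwise = nn.Conv2d(8, 16, kernel_size=1)                      # 1x1 conv mixes the channels

x = torch.randn(1, 8, 32, 32)
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 16, 32, 32])
print(depthwise.weight.shape)         # torch.Size([8, 1, 3, 3]): 2D slices, depth 1
```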
@alexgamingtv7118 · 1 year ago
Great work! Such an animation for grouped convolution would be nice too.
@kshamanthkumar6042 · 1 year ago
Amazing work 😍
@mizoru_ · 1 year ago
Nice!
@gabrielchan3255 · 1 year ago
this is great
@simonpikalov6136 · 1 year ago
Finally a good animation video!