At around 4:32, you say that the filter for the 2nd convolution must have the same 4 channels as the output of the 1st layer. Shouldn't it be 3, since the coloured image is RGB? That is, shouldn't the filter for the 2nd convolution be "(3x3x3)x8", where 3x3 is the filter size, followed by the number of channels, and finally the number of filters applied in the 2nd conv, i.e. 8?
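For anyone else stuck on the same point, here is a small sketch of the shape bookkeeping. It assumes, as the comment implies, that the 1st layer applies 4 filters to an RGB input and the 2nd layer applies 8; the helper name `conv_filter_shape` is made up for illustration. The RGB depth of 3 only matters for the 1st layer, because each filter there collapses all 3 input channels into one feature map, so the 2nd layer sees 4 channels.

```python
def conv_filter_shape(in_channels, out_channels, kernel_size=3):
    """Weight shape of a 2D conv layer as (out_channels, in_channels, kH, kW)."""
    return (out_channels, in_channels, kernel_size, kernel_size)


# Assumed setup from the question: RGB input, 4 filters in layer 1, 8 in layer 2.
rgb_channels = 3
layer1_shape = conv_filter_shape(rgb_channels, 4)

# Layer 1 emits one feature map per filter, so layer 2 receives 4 channels, not 3.
layer1_output_channels = layer1_shape[0]
layer2_shape = conv_filter_shape(layer1_output_channels, 8)

print(layer1_shape)  # (4, 3, 3, 3)
print(layer2_shape)  # (8, 4, 3, 3)
```

In the notation used in the comment, that would make the 2nd layer "(3x3x4)x8" rather than "(3x3x3)x8".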
Thanks for the very clear explanation. I want to know about more complicated data such as music, where we use mel spectrograms and the number of features differs for every song. If we do not apply segmentation, we have to deal with a different input size each time, say (1456, 80), where 1456 is the number of frames and 80 is the number of bins, then (3789, 80) for the next song, then (7867, 80), and so on. How do we specify the parameters of the CNN when the input size changes every time? And how many layers would be reasonable for such data?
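One common option, which is not covered in the thread, is to average over the time axis so that every song ends up with the same number of values. The sketch below is plain Python and uses made-up stand-in data rather than real mel spectrograms; `global_average_pool` and `fake_spectrogram` are hypothetical helper names. In an actual CNN this pooling would usually be applied to the feature maps after the convolutional layers, whose weights do not depend on the number of frames.

```python
def global_average_pool(frames):
    """Average a (num_frames, num_bins) list of lists over the frame axis."""
    num_frames = len(frames)
    num_bins = len(frames[0])
    return [sum(frame[b] for frame in frames) / num_frames for b in range(num_bins)]


def fake_spectrogram(num_frames, num_bins=80):
    # Hypothetical stand-in values; real input would be mel spectrogram energies.
    return [[float((t + b) % 7) for b in range(num_bins)] for t in range(num_frames)]


pooled_lengths = []
for num_frames in (1456, 3789, 7867):
    pooled = global_average_pool(fake_spectrogram(num_frames))
    pooled_lengths.append(len(pooled))
    print(num_frames, "frames ->", len(pooled), "values")
```

Each song, whatever its length, is reduced to 80 values here, so whatever comes after the pooling step can have a fixed input size.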
I'm asking the same question, but generally the answer is: we cannot predict how many hidden layers we need, so you have to test and see what features your model ends up detecting.
The general idea is that if you think your data contains more complicated information, use more hidden layers. You need to test your model by training it and checking the accuracy, then choose the number of layers that gives the highest accuracy.
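That trial-and-error loop can be sketched roughly as below. The function `pick_best_depth` and the toy scoring function are hypothetical; in practice `evaluate` would train a model with the given number of layers and return its accuracy, ideally measured on held-out validation data rather than on the training set.

```python
def pick_best_depth(candidate_depths, evaluate):
    """Return (depth, score) for the candidate with the highest score."""
    scores = {depth: evaluate(depth) for depth in candidate_depths}
    best_depth = max(scores, key=scores.get)
    return best_depth, scores[best_depth]


def toy_evaluate(depth):
    # Toy stand-in that peaks at 3 layers. It is NOT a real result; a real
    # version would train a network of this depth and return its accuracy.
    return 1.0 - 0.1 * abs(depth - 3)


best_depth, best_score = pick_best_depth([1, 2, 3, 4, 5], toy_evaluate)
print("best depth:", best_depth)
```

The loop only compares the candidates you give it, so the range of depths to try is still a judgment call.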