If you found any value in the video, hit the red subscribe button and the like button 👍. I would really love your support! 🤗🤗 👉 You will get a new video on machine learning every Sunday if you subscribe to my channel, here: ru-vid.com/show-UCJFAF6IsaMkzHBDdfriY-yQ
Finally! I found something useful. Thanks a lot. Everyone teaches the workings of gradient descent in a very crude way, but almost no one teaches the maths behind it. Almost everyone simply imports gradient descent from some library, and no one shows pseudocode. I wanted to understand the working behind those functions, how the parameters get adjusted, and what maths is used behind the scenes, so that if required we could create our own functions, and this video fulfilled all these requirements.
hey, can you please help me solve this question? Question: You run gradient descent for 15 iterations with α = 0.4 and compute J(θ) after each iteration. You find that the value of J(θ) increases over time. Based on this, how would you choose a suitable value for α?
If J(θ) is your cost function and it is increasing over time, you need to choose a smaller value for the learning rate α so that the cost decreases over time instead.
@@CodingLane Yeah, I figured it out after watching a few times, but in the video you mentioned that we used the derivative of x², so I think you should have emphasized that part. Overall a great video, you made it very easy to clear some of my doubts at the beginner stage. Plus, I would be very grateful if you could create a community channel on Telegram or Discord for those who want to clear doubts, as that's not possible on YT.
I am a COBOL programmer who has started machine learning. I have a doubt: why do we arbitrarily fix 1000 iterations? As you mentioned, the derivative of the cost w.r.t. theta is a slope, so why don't we stop iterating as soon as the derivative reaches ZERO (meaning at the centre bottom, where no slope exists)? Or why don't we determine that the cost function has reached its minimum by comparing whether its previous value is less than its current value? I searched many sites for the reason, but nowhere is dynamic iteration mentioned rather than a constant number of iterations. I'm not sure if I'm missing something. Please guide me.
If your derivative reaches 0, then you will stop whether you want to or not. Learning comes from a non-zero derivative (it tells you the direction you need to move in), so if it's 0, you stop. This is typically bad for larger problems because we don't usually have an obvious global minimum, so we want our code to run as long as the cost is decreasing. But if you get a 0, this essentially "kills" the neuron, which results in no learning. This is a common problem when using the ReLU activation function and is why leaky ReLU was created to mitigate this issue. But if you truly did reach the global minimum and your derivative is 0, then there's no problem. Your model will stop updating each iteration, but since you reached the minimum, you should be good.
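On the original question about stopping dynamically: many implementations do exactly that. A minimal sketch (my own illustration, not code from the video), assuming the simple cost J(θ) = (θ − 3)², which stops as soon as the slope is effectively zero instead of always running all the iterations:

```python
# Gradient descent with a convergence check: stop when the derivative is
# (numerically) zero rather than always running a fixed 1000 iterations.

def gradient_descent(alpha=0.1, max_iters=1000, tol=1e-8):
    theta = 0.0
    for i in range(max_iters):
        grad = 2 * (theta - 3)   # derivative of J(theta) = (theta - 3)^2
        if abs(grad) < tol:      # slope ~ 0: we are at the minimum
            return theta, i      # stop early
        theta -= alpha * grad
    return theta, max_iters

theta, iters_used = gradient_descent()
print(round(theta, 4), iters_used < 1000)  # converges well before 1000 iterations
```

The fixed 1000 in tutorials is just a simple upper bound; `max_iters` still protects you when the cost never quite converges (e.g. when α is too large). Comparing the previous cost with the current cost, as you suggested, is an equally common stopping rule.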
Hello, I have a question on the impact of increasing the value of theta when d(cost)/d(theta) is negative. Since the rate of change of the cost function is made positive or negative by (Y - Y_predicted), does this mean that when we INCREASE theta, the value of Y_predicted decreases? I am having trouble understanding this, since I assumed that because X and Y_predicted share a linear relationship, increasing theta should also increase the value of Y_predicted. I would be grateful if you could find the time to clarify this point for me. By the way, great video, I learned a ton!
Hi Paul, we don't manually set (increase or decrease) the value of theta. The model sets it automatically. That is why we use the Gradient Descent algorithm: to set the appropriate value of theta to make correct predictions. If you manipulate the value of theta manually yourself, your results won't be accurate. The point to focus on here is why and how the cost function decreases, and how that helps to automatically adjust the value of theta. The value of theta can be very small or very large, positive or negative. It doesn't matter. What matters is that it is automatically adjusted (whether positive/negative/small/large) in a way that makes correct predictions.
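As a concrete illustration of that automatic adjustment (my own sketch with a made-up dataset, not code from the video): the update rule θ ← θ − α · d(cost)/dθ means a negative gradient pushes theta up and a positive gradient pushes it down, until theta settles at the value that fits the data.

```python
# Hypothetical one-feature dataset where the true relationship is y = 2x.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]

theta, alpha, m = 0.0, 0.05, len(X)  # theta starts at 0, far from the answer
for _ in range(200):
    # d(cost)/d(theta) for cost = (1/2m) * sum((theta*x - y)^2)
    grad = sum((theta * x - y) * x for x, y in zip(X, Y)) / m
    theta -= alpha * grad  # negative grad raises theta, positive grad lowers it

print(round(theta, 3))  # converges near the true slope 2.0
```

While theta is below 2, the predictions theta*x are too small, the gradient is negative, and the update increases theta, exactly the behaviour the question was about.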
@@v1hana350 There is no need for a cost function in K-means clustering. It is a clustering algorithm, which works differently from linear regression. It works as follows:
- randomly initialize cluster points (centroids)
- calculate the distance between each cluster point and all the other points in the dataset
- group the data points into clusters so that each point goes into the cluster of its nearest cluster point
- recompute each cluster point as the average of all the points in its cluster
- repeat the process
You don't need a cost function here. Still, if you want to use one, you can take the summation of the distances from each cluster point to the other points in that cluster.
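The steps above can be sketched in a few lines. This is my own toy illustration on 1-D points (the data and k are made up, not from the video):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # randomly initialize cluster points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to its nearest center
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        for c in range(k):                   # recompute centers as cluster averages
            if clusters[c]:
                centers[c] = sum(clusters[c]) / len(clusters[c])
    return sorted(centers)

centers = kmeans([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
print([round(c, 3) for c in centers])  # → [1.0, 10.0]
```

With two well-separated groups, the centers settle on the two group averages, with no cost function ever computed.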
Theta is a parameter, which we first initialize to zero. Then we train the model to change the theta value in such a way that, with this changed value, we can make accurate predictions. Think of it like the parameters of a straight line. Say the equation of a straight line is y = ax + b. Then a and b are the parameters of this straight line. If we have many such parameters, we represent them with theta. So initially our straight line will be y = 0, and after training the model the parameter values will have changed, and with these parameters our straight line will fit best on our dataset.
Check out my “What is Linear Regression?” and “Linear Regression Cost Function” videos from this playlist for a better understanding: ru-vid.com/group/PLuhqtP7jdD8AFocJuxC6_Zz0HepAWL9cF
I revisited and got the answer at 9:00, thanks a lot. It's just that, because there is no animation while you point things out, it's a bit of a task to listen and figure it out. I wish you reach the next level in presentation, because you are doing a great job with all the logic and fluency! I had a small confusion, as I am also doing Stanford's machine learning course on Coursera, and your video helped in no time. Thanks, grow well.
@@abhzme1 Glad it helped. And thanks for the suggestion. I have added presentation and animation in the videos uploaded in the Neural Network playlist. I hope you find them better than this one. Let me know if you have any specific suggestions while you go through those videos. I will greatly appreciate it.
@@CodingLane Bro, I request you to make a video on a roadmap for how to learn ML engineering from scratch to advanced, and to specify the resources for it, so every self-taught learner gets an idea.
Hi, the final performance won't be affected whether you divide by m or 2m. You can check my detailed answer in the comments below (on this video or some other video in this playlist).
After differentiation, the entire function gets multiplied by 2. To eliminate that 2, he divided by 2m at the beginning itself. Once the 2 is removed, updating the values becomes much cleaner.
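In symbols (standard linear-regression notation, not copied verbatim from the video), the 2 produced by differentiating the square cancels the 2 in the denominator:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

The constant only rescales the cost, so the location of the minimum (and hence the learned θ) is unchanged; it just keeps the gradient formula free of stray factors of 2.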