@@Acetyl53 Given two people with identical knowledge of a subject, the person who can explain it more thoroughly and understandably to a layperson either has elevated communication abilities, or has a deeper understanding beyond what can easily be measured. In either case, they are demonstrably smarter in the ways we care about.
1:32:27 The bias is needed because otherwise, when all your inputs were zero, y was being calculated as 0*w1 + 0*w2 no matter what your weights were, and then passed through the sigmoid, and S(0) = 0.5. Adding the bias allowed it to provide a non-zero input to the sigmoid in that case. Of course this is an old stream so I'm sure you figured that out later; just in case anyone watching was curious. Great stuff as always, Zozin!
Another simple explanation is that when our inputs are in a range of values between A and B, say BMI values, which in most standard cases are between 16 and 30 or so, it's helpful to standardize the range to something between 0 and 1. The inputs are multiplied by the weights to scale them up or down. We could divide the upper BMI boundary value by 30 to get 1. However, when we divide the lower BMI boundary by 30, we don't get 0. In fact, no matter what number we choose, we cannot bring the lower boundary to 0 by multiplication alone. This is because there is a "bias" (an offset, or **addition**) on the range. The bias term is that extra addition/subtraction needed to first make sure the range starts at 0. Then we do the scaling to make it range from 0 to 1.
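As a sketch (my own helper, following the BMI numbers above): min-max scaling does exactly this. The subtraction removes the offset so the lower bound lands on 0, then the division scales the span onto [0, 1].

```c
#include <math.h>

/* Shift by the lower bound first (the "bias"), then scale by the span. */
static double minmax_scale(double x, double lo, double hi)
{
    return (x - lo) / (hi - lo);
}
```

For the BMI example, minmax_scale(16, 16, 30) gives 0 and minmax_scale(30, 16, 30) gives 1; no pure multiplication could achieve both.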
I remember from my machine learning classes that the bias term comes from the idea of having a threshold value for the activation. Instead of writing an inequality, you would subtract the threshold from the usual perceptron's linear function (W • x - threshold). The bias is just the negative threshold, kept for mathematical convenience. In fact, the bias can also be thought of as a weight whose input is always 1 (which helps explain why you also update the bias the same way you do the rest of the weights).
It's much more interesting to learn machine learning like this than to just use some pre-made library. I'm far more interested in the underlying mathematics and algorithms than in some 'cat food' approach to learning where we just get a brief overview of how to use some preexisting technology. The mathematics and algorithms are interesting and worth learning, especially if you want to be innovative in any field. While it might not be an 'expert' example, it is an intuitive explanation which is as in-depth, if not more so, than the university-level AI course I have taken. Thanks for the great content!
Zozin, you are a wonderful teacher for anything Computer Science related :) you are teaching in a way that actually helps people understand things, so thank you for your videos. If only Universities had people like you to teach
As far as I understand it, the reason for using the square instead of, for example, the absolute value is that apart from always giving you positive values (so the errors won't cancel out when you add them up), the square function has some nice properties in terms of calculus: the derivative exists everywhere (which is not the case for the absolute value), and this is important for implementing algorithms like gradient descent. The reason can't be to amplify the error: if the error is close to zero and you square it, instead of amplifying it you'd make it even smaller! But anyway, this was a thoroughly enjoyable intro to ML.
I am an ML engineer and you are absolutely correct. The main reason to use the square instead of the modulus is to be able to take derivatives at any given point in order to calculate and perform gradient descent, which is used for model optimization. But there are some downsides. For example, it really bumps up the error. Imagine you are calculating prices of apartments based on some features provided to you. If the error is $1,000, squaring ramps it up to a whopping $1,000,000. That means your model will be affected more by the outliers in your training data and will try to compensate for the damage of the outlying squared values. That is why ML engineers often have to choose between MSE (mean squared error) and MAE (mean absolute error). If you need more optimization and there are no obvious outliers, pick the first one. If there are a lot of outliers in the data, pick MAE to make your model less "emotional", if you could say so :)
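A small C sketch of the two losses (hypothetical helpers, not from the stream), showing how a single outlier dominates MSE but not MAE:

```c
#include <math.h>
#include <stddef.h>

static double mse(const double *pred, const double *actual, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = pred[i] - actual[i];
        sum += d*d;                       /* squares blow up large residuals */
    }
    return sum / (double)n;
}

static double mae(const double *pred, const double *actual, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += fabs(pred[i] - actual[i]); /* linear in the residual */
    return sum / (double)n;
}
```

With residuals {0, 0, 10}, MSE is 100/3 ≈ 33.3 while MAE is 10/3 ≈ 3.3: the one outlier is ten times louder under the square.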
@@artemfaduev6228 MSE and MAE are not the only loss functions that exist. MSE/L2 loss means that we assume Gaussian noise for the data. Instead of Gaussian noise we could use the Student-t distribution as the noise distribution and use its negative log-likelihood (differentiable everywhere) as the loss. Student-t has heavier tails (with a controllable hyperparameter ν) -> more robust to outliers. There is also something like the Huber loss.
I’ve done this with neural networks. In short, the common neural network with ReLU activations looks like a piecewise-linear function, with purely linear behaviour outside the training boundaries. In practice this is avoided with sinusoidal output encoding, which makes the issue of sinusoid approximation trivial.
I have a degree in Physics and I have a feeling you understand mathematics more deeply than I do lol. The best method for sure is the central difference method, but it doesn't matter; your way of teaching and problem solving absolutely rocks and is the best.
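For anyone curious, the central difference the comment refers to averages a step forward and a step backward, which cancels the first-order error of the one-sided version. A sketch with an illustrative f (not the stream's code):

```c
#include <math.h>

/* Central difference: (f(x+h) - f(x-h)) / (2h), second-order accurate,
   vs. the first-order forward difference (f(x+h) - f(x)) / h. */
static double central_diff(double (*f)(double), double x, double h)
{
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

static double square_fn(double x) { return x * x; }   /* d/dx x^2 = 2x */
```

For a quadratic like x² the central difference is exact (up to floating point), while the forward difference carries an O(h) error.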
Not really; if you go deep into convolutional NNs or other architectures, the math needed there is pretty advanced, like tensor calculus @@klnnlk1078
@@klnnlk1078 Well, I'm bad at math so it seems complicated to me, but I still love to learn about it. I need to find the time to study some math; the American school system didn't do me any favors.
Generally I can't make myself sit and watch your videos in their entirety because I don't know what you're doing, especially the C videos, but today I watched this entire video, mainly because of how simply you explained it.
The nice thing about doing this in C is that you could use OpenCL (with its C99-based kernel syntax) to parallelise certain operations on GPU cores (like you were doing with a thread pool) without really changing much of the logic (so long as you're not using language features that aren't available in kernels, like function pointers).
A more intuitive way of understanding why XOR "stagnates at 0.25" is pretty much that the neural network is able to model up to 3 of the 4 states we want it to model. After modelling 3 of those states, it absolutely cannot model the 4th one due to the limitations of how it was built, so that last one takes up 25% of all the scenarios of an XOR gate we want it to imitate :D So at best it will still have cost 0.25 (or accuracy 75%).
01:38:00 @Tsoding Daily The reason you couldn't model XOR with a single neuron is that XOR requires a non-linear classifier to separate the two cases. If you arrange the outputs in a 2x2 matrix you can see why:
AND: [0 0 0 1], you can draw a line separating the 0s and the 1
OR: [0 1 1 1], again, the classifier only needs a straight line
XOR: [0 1 1 0], you need some sort of oval shape; a line isn't enough to classify XOR.
Watching you made me realize that understanding the definitions and concepts is perhaps the most important part of programming. The second most important part, close behind, is distilling a high-level concept down to its base components. Third is typing. My knuckles hurt from all the vids I watched. Now I wanna watch parametric boxing (all technique while blindfolded).
Well this is cool! Just using mathematics and the power of a computer, you built something that was able to predict the next number (even when it was just twice the input), and something that can recognise logic gates is just mind-boggling. And with only 1-3 neurons! I was very interested in this topic before this video and now I am hooked!
Yeah, essentially a xor b = (a+b) mod 2 suffices, if false=0 and true=1. There, a+b is the linear combination, and mod is the periodic function. Similarly works with sin and cos; and, up to scaling, any activation function with multiple zeros (except the zero constant function).
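Sanity-checking that identity in C (trivial, but it makes the "linear combination plus periodic function" point concrete):

```c
/* For 0/1 inputs: a XOR b == (a + b) mod 2.
   a + b is the linear combination; mod 2 is the periodic function. */
static int xor_mod(int a, int b)
{
    return (a + b) % 2;
}
```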
@@daviskipchirchir1357 I think I wrote it clearly enough, if I say so myself. It doesn't get more precise than when written in a formula. Not sure what the something is that you want to see, but thank you.
Thanks dude, I tried to write something for AI in C a while ago but it was incredibly difficult because all of the information out there is for Python. Saving this video for later.
@@AD-ox4ng I wasn't aware of any pure python implementation videos, all the info I saw used numpy. Re-implementing matrices while trying to understand the math at the same time sucked.
@@andrewdunbar828 There are a lot of deep learning libraries / frameworks in C that are relatively simple and less heavyweight than the mainstream frameworks in Python.
About the XOR thing. The way I understood it is that with a single neuron you can only model a linear equation, i.e. a line. If you plot the inputs of the OR function on a 2D graph, putting a "0" at the coordinates (0, 0) and a "1" at the coordinates (0, 1), (1, 0), (1, 1), you can clearly see that the 0 and 1 outputs are "linearly separable" (meaning you can draw a single line that separates them). If you plot the XOR function instead, you won't be able to linearly separate the 0s and 1s on a 2D plot with a single line; you need a more complex model, e.g. two neurons. Moreover, the weights can be seen as the slope of the line and the bias as the intercept.
1:09:17 Activation functions have the main purpose of being non-linear (having weird shapes), because if you add lines to lines you get more lines, so your one-trillion-layer-deep neural network would be just as effective as your last brain cell. With something like ReLU (which is a glorified if) you can have a neuron light up for specific inputs, which in turn triggers other specific neurons, building complexity with every layer.
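The "glorified if" in C; the single kink at zero is the entire source of non-linearity, since without it any stack of layers collapses into one linear map:

```c
/* ReLU: max(0, x); literally an if on the sign of x. */
static double relu(double x)
{
    return x > 0.0 ? x : 0.0;
}
```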
FP arithmetic for adding lines is also non-linear: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Ae9EKCyI1xU.html In case you haven't seen it, the whole talk is amazing and built on this.
Nice explanation... the square is also why the variance has units of ("stuff we're measuring")². And if your error is e.g. 0.8, it won't actually be amplified when you square it: 0.64 is smaller.
I can’t always catch your livestreams cause of time zones, but I really enjoy getting to catch your videos here on YouTube. (I’m branchwag btw when I hop into twitch! ^.^) I just love the way you explain your thought process.
@@TsodingDaily dude, the more you hate something, the more you like it. I mean, you can't hate something you don't care about. So you were interested in it enough to start hating it 😂
Awesome video and a very interesting topic! Can't wait to see the next episode. Next stream "4-bit adder": I see where we're going. We're slowly approaching building an entire CPU through a neural network :D I like the way you explain complicated things from the ground up. I love it. It's worth a lot. Thank you!
Oh hell yeah, dat timing though. I started doing precisely this just yesterday evening, and in C rather than my go-to C++, at that. Got stuck with backpropagation so I decided to spend tonight doing some more reading. Said reading has now been moved to tomorrow evening :) Edit: Oh, and side-note; have you read The Martian? A quote came to mind. "So it's not just a desert. It's a desert so old it's literally rusting".
For some reason I've tried starting and am already stuck around the 20 minute mark; I can't get it to generate a random number from 1-10. I am using the same code and am confused why it isn't working. Has anyone encountered this problem?
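I can't see your exact code, but a common sketch for 1-10 in C is below. One frequent pitfall is forgetting to seed with srand, which makes every run produce the same sequence (that may or may not be your actual issue):

```c
#include <stdlib.h>
#include <time.h>

/* Integer in [1, 10]: rand() % 10 gives 0..9, then shift by 1. */
static int rand_1_to_10(void)
{
    return rand() % 10 + 1;
}

/* Call once at startup: srand((unsigned)time(NULL)); */
```

Another frequent off-by-one is writing rand() % 10 alone, which gives 0-9 rather than 1-10.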
49:40 The bias is important to prevent the model from just fitting the entire training set, i.e. overfitting; as far as I know the bias is just a parameter created to avoid these issues and, like you said, improve the model training. EDIT: The correct term for avoiding overfitting (and also underfitting) is a regularization term, which comes in when we split the data into a training set and a test set (to validate our model); with this term we can get to the correct model faster (for larger datasets and complex models).
THANK YOU SO MUCH FOR THIS TUTORIAL!!! I have learnt so much, your explanation and reasoning is very insightful - delivery on subject matter EXCELLENT and humorous :)
tanh is almost always preferable to sigmoid as the activation function. The tanh activation function would give you 0 instead of 0.5. It would also enable your cost function to go to 0 faster.
"I'm ooga booga software developer (...)" 🤣🤣 23:24 But in all seriousness, great intro explanation at the beginning. Edit: After watching the whole thing, this was the best ML explanation I have ever seen. Looking forward to the next video.
Thank you, wow!!! A chemist with exceptional "lower level computing" knowledge, skills and talent. Out of 8.1 billion folks, that must be the rarest of rare species (0.01%). Wasn't aware of that; you should do some cheminformatics, molecular modelling, medicinal chemistry etc. topics with C. Thx!
2:23:58 That's cool, the model found this seemingly random configuration that, after the final boolean simplification, gives us x ^ y:
the "OR" neuron actually does --> x & y
the "NAND" neuron actually does --> x | y
the "AND" neuron actually does --> (~x) & y
so after feed-forwarding --> (~(x & y)) & (x | y), which simplifies to x ^ y in the end
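Checking that simplification with hard gates in C (boolean stand-ins for the trained neurons, not the network itself):

```c
/* Output neuron computes (~a) & b on its two hidden inputs, so the whole
   circuit is (~(x & y)) & (x | y), i.e. NAND(x,y) AND OR(x,y) == x ^ y. */
static int xor_from_gates(int x, int y)
{
    int nand_xy = !(x & y);
    int or_xy   = x | y;
    return nand_xy & or_xy;
}
```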
Wow, I commented about this a few weeks ago. I guess it was recorded before, but I'm happy to see this VOD. I can't wait to watch it and the next one too.
Human neurons are also activated concurrently, doing different things in different parts of the brain. It's not just a number-crunching machine concerned with processing to get to 1 answer for 1 query.
The picture that you found was excellent: it showed that the distinction between weights and bias is artificial; the bias is just another weight. I mean, to be truthful, the bias has a different purpose than the weights: the weights control the influence of the inputs, while the bias controls the output with the inputs fixed; it shifts the output. But as far as computation goes, y = w*x + b is the same as y = w*x + b*1 = [w; b] * [x; 1] = w' * x'. This is super important since GPUs are optimized for matrix multiplication, and y = w' * x' is in the form of a matrix multiplication.
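The w' * x' trick in miniature (illustrative helpers of my own): append a constant 1 to the input and the bias becomes just another weight inside a single dot product.

```c
#include <math.h>

static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* y = w1*x1 + w2*x2 + b, written the usual way... */
static double neuron_wb(const double w[2], double b, const double x[2])
{
    return w[0]*x[0] + w[1]*x[1] + b;
}

/* ...and as one dot product with w' = [w1, w2, b], x' = [x1, x2, 1]. */
static double neuron_folded(const double wb[3], const double x[2])
{
    double xb[3] = { x[0], x[1], 1.0 };
    return dot(wb, xb, 3);
}
```

Both forms produce identical outputs, but the second is a pure matrix/vector product, which is what GPUs are built to chew through.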
btw. as far as I understand, the reason why we do y = f(w*x) instead of just y = w*x is that the latter is linear, i.e. the output is linearly proportional to the input. Not all systems that we want to model are linear, hence the input is funneled through a nonlinear function f to make the model nonlinear.
btw. don't get me wrong, but you would be the best teacher for future engineers and scientists that I can imagine. I am quite old, but when I see you code, I have this sense of learning through curious discovery rather than learning by heart. You actually make me want to learn, and for that I am very grateful.
1:38:50 You can get any boolean function by just describing the inputs that result in a true output, so for XOR with inputs a and b:
a b
0 0 -> 0
0 1 -> 1
1 0 -> 1
1 1 -> 0
For inputs [0 1] and [1 0] the results are true; now just describe these inputs in terms of a and b: (not a and b) or (a and not b), i.e. XOR = āb + aƀ. I believe this is called the "Canonical SoP form", also known as the sum of minterms, if I remember correctly.
Can you please describe this as a C algorithm? I think the purpose of these videos is mainly educational. Everybody already knows the theory!
24:49 Also, if the error is small, like 0.5, then 0.5*0.5 = 0.25, so squaring makes it smaller; if the error is already small, it gets even smaller, so the AI thinks it is doing better than it really is.
Hey, student here. How are you typing so efficiently? I use a custom i3 setup on my desktop and vim as my text editor, but I am not even at 10% of that efficiency; you can mirror text and do dark magic and everything.
Quite a nice intro to Machine Learning in C, but there's something you missed during the explanation: one does not square the error to amplify it, but because we want the Euclidean distance for the error; otherwise, if our model's f(xᵢ) were greater than the actual observed value of f(xᵢ), the cost function for xᵢ would return a negative value.
I'm at 1h:34 and I don't think saying "I didn't program it to do AND or OR" is quite right. I mean, in a way you are programming it by telling it what the result should be. What it is doing is finding the optimal values to get to the result you told it to produce (or I might be understanding it wrongly here).
50:02 The bias is indeed an extremely necessary thing. Let's say your training data was not y = 2x but y = 2x + 1 (so 0 -> 1, 1 -> 3, 2 -> 5, ...); then no matter how close to 2 your weight was, it would never be enough to reach a near-zero error unless you added the bias (of 1 in this case). The bias is extremely useful in all ways! Great explanation though, and I'm not holding this against you since you openly admitted to not knowing much.
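A toy version of that in C, in the spirit of the stream's finite-difference approach (the eps, rate, and iteration values here are my guesses, not the stream's): fitting y = 2x + 1, the cost only reaches ~0 once both the weight and the bias are learned.

```c
#include <math.h>

/* Mean squared error of the model w*x + b against targets y = 2x + 1. */
static double cost(double w, double b)
{
    double c = 0.0;
    for (int i = 0; i < 5; ++i) {
        double x = (double)i;
        double d = (w*x + b) - (2.0*x + 1.0);
        c += d*d;
    }
    return c / 5.0;
}

/* Finite-difference "gradient" descent on both parameters. */
static void train(double *w, double *b, int iters)
{
    const double eps = 1e-3, rate = 1e-2;
    for (int i = 0; i < iters; ++i) {
        double c  = cost(*w, *b);
        double dw = (cost(*w + eps, *b) - c) / eps;
        double db = (cost(*w, *b + eps) - c) / eps;
        *w -= rate * dw;
        *b -= rate * db;
    }
}
```

Starting from w = b = 0, the pair drifts toward (2, 1); freeze b at 0 and no value of w alone gets the cost near zero on this data.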
Could someone tell me what vim motion he uses at 18:00? I just don't understand how he flips "rand_max" to "max_rand" in seemingly one click. Thanks in advance.
Has the author looked into the Odin lang? Maybe he mentioned it somewhere. I saw he used Jai before, which is a kind of elder brother of Odin. If not, how does one ask Alexey to take a look at it?
Just curious: could we call the sigmoid function once, like sigmoid(forward()), within the cost() function, instead of calling it every time inside the forward() function?
!! wow !! You have the virtue of a professor; you just have to adjust your weights, and your age will adjust your weights. I like your channel; I am going to subscribe and follow you with great interest. I think you are a very capable person, and when you find your way, it will be tremendous. I don't understand how you are in Siberia when you could be anywhere in the free world. A cordial greeting from Spain.
Thanks a lot for this content. Nice T-shirt in the video, btw; does it have the МФТИ logo on it? Anyway, the part with adding the "eps" variable, while having the derivative multiplied by the rate in parallel, is not very clear to me. Shouldn't eps be assigned the value of derivative * rate * -1 each new iteration?
OK, I have figured this out: w + eps is there to measure the speed and direction of change of the cost function at the point w, not the speed of change based on the change of w.
Can you go through and reverse engineer all of the possibilities of solutions that it comes up with for your XOR gate? Like, how is the 'not valid' gate you found in one instance working in the whole system? All this really says is that the models that we have come up with as humans to describe this stuff are incomplete.