Serrano.Academy
Welcome to Serrano.Academy! I'm Luis Serrano, and I love demystifying concepts, capturing their essence, and sharing these videos with you. I prefer illustrations, analogies, and cartoons rather than formulas (although we don't shy away from the math when needed).

The topics I cover are machine learning and mathematics (probability and statistics), but I'm open to many others. If you have any topics you'd like to suggest, feel free to add them in the comments or drop me a line!

For more information, check out serrano.academy.

And also check out my book! Grokking Machine Learning
manning.com/books/grokking-machine-learning
(40% discount code: serranoyt)
The Attention Mechanism in Large Language Models
21:02
10 months ago
What is Quantum Machine Learning?
51:32
1 year ago
Denoising and Variational Autoencoders
31:46
2 years ago
The Beta distribution in 12 minutes!
13:31
2 years ago
The covariance matrix
13:57
3 years ago
Gaussian Mixture Models
17:27
3 years ago
Comments
@andben82 9 hours ago
Hello, very interesting, thank you so much. Anyway, I think there is a little mistake at 19:27: the new y coordinate of the apple should be 2.86, not 2.43.
@tandavme 1 day ago
Thank you, your videos are always deep and easy to follow!
@gunamrit 1 day ago
If only I had known this in my graduation days, Linear Algebra would have been my favorite subject. I was blindly solving for lambda as per the books and finding the values of x and y. It's like a whole degree summarised in a video! Keep it up, Luis! Thank you so much!
@levi-civita1360 1 day ago
I read a statistics book, "Introduction to Probability and Statistics for Engineers and Scientists" by Sheldon M. Ross, where he uses this definition: let d = d(X) be an estimator of the parameter θ. Then b_θ(d) = E[d(X)] − θ is called the bias of d as an estimator of θ. If b_θ(d) = 0 for all θ, then we say that d is an unbiased estimator of θ. He then proves that if we use the sample variance formula with (n−1) we get an unbiased estimator, and otherwise we do not.
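A quick way to see Ross's definition in action is a simulation. Below is a minimal sketch (my own illustration, not from the book or the video; the variable names and chosen numbers are made up for the demo): it estimates E[d(X)] for both versions of the sample variance and shows that only the (n−1) version has bias approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0        # true population variance (assumed known for the demo)
n = 5               # small sample size
trials = 200_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=n)
    ss = np.sum((x - x.mean()) ** 2)   # sum of squared deviations from the sample mean
    biased.append(ss / n)              # divide by n
    unbiased.append(ss / (n - 1))      # divide by n-1 (Bessel's correction)

print("true variance:            ", sigma2)
print("mean of /n estimates:     ", np.mean(biased))    # about sigma2*(n-1)/n = 3.2
print("mean of /(n-1) estimates: ", np.mean(unbiased))  # about sigma2 = 4.0
```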
@AbhimanyuKumar-ke3qd 1 day ago
5:54 Can you please explain why we square in order to remove negative values? We could have taken absolute values as well, i.e., |x1 − μ| + |x2 − μ| + ... I have the same doubt about linear regression and least squares.
@SerranoAcademy 1 day ago
Great question! We can square or take absolute values. Same thing for regression: when you use absolute values for regression it's called L1, and when you use squares it's called L2. I think the reason squares are more common is that a sum of squares is easier to differentiate. The derivative of an absolute value has a discontinuity at zero because the function y = |x| has a corner, while the function y = x^2 is smooth.
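A tiny numerical sketch of that smoothness argument (my own example, not from the video): the derivative of the L2 penalty passes smoothly through 0, while the derivative of the L1 penalty jumps from -1 to +1 at 0.

```python
import numpy as np

xs = np.array([-0.2, -0.1, -1e-6, 1e-6, 0.1, 0.2])

# derivative of the L2 penalty y = x^2 is 2x: continuous, passes through 0
print("d/dx x^2:", 2 * xs)

# derivative of the L1 penalty y = |x| is sign(x): jumps from -1 to +1 at x = 0
print("d/dx |x|:", np.sign(xs))
```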
@AbhimanyuKumar-ke3qd 1 day ago
@@SerranoAcademy wow! Never thought about it in terms of differentiability... Thank you so much! If you can make a video on it, it would be very helpful
@shafiqahmed3246 2 days ago
Serrano, you are a genius, bro. Your channel is so underrated.
@leonsaurabh21 2 days ago
Great explanation
@aitanapalomanespardos7089 2 days ago
14:21 Could you make a video, or give some intuition, as to why the generalised eigenvector needs to have the same eigenvalue as the eigenvector? Lovely video, I enjoyed the geometric visualisation.
@rudyfigaro1861 2 days ago
Luis has a talent for breaking down complex problems into simple steps and then building the whole thing back up so that ordinary people can understand.
@juancarlosrivera1151 2 days ago
I would use x̄ (x-bar) instead of μ in the right-hand-side equation (around minutes 9 or 10).
@ekarpekin 2 days ago
Thank you Luis for the video. I also had a long-time obsession with why the heck it is (n-1) instead of n. Well, having watched your video, I now explain it to myself as follows: 1) When we calculate the mean 'mu' of, say, 10 numbers making up a sample, these 10 numbers are independent for the mean calculation. But knowing the mean 'mu' of the sample and the 10 numbers constituting it, we cannot say that all 10 numbers are independent: if we know 'mu', we can compute any single number out of the 10 from the other 9. 2) Now, when we come to variance, we take the difference between 'mu' and each of the 10 numbers, so we have 10 deltas. Yet, out of these 10 deltas, only 9 are independent, because the remaining delta can be calculated from the other 9 and 'mu'. Hence, for the variance we divide the total sum of squared deltas by (n-1), the count of independent deltas (or differences).
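That "only n-1 independent deltas" point is easy to check directly: the deviations from the sample mean always sum to zero, so the last one is determined by the other n-1. A minimal sketch with made-up numbers (my own, not from the video):

```python
import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0, 15.0])   # a made-up sample of n = 5 numbers
mu = x.mean()                               # sample mean
deltas = x - mu                             # the 5 deviations ("deltas")

print(deltas.sum())                  # always 0 (up to floating-point error)
print(-deltas[:4].sum(), deltas[4])  # knowing mu and 4 deltas pins down the 5th
```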
@tamojitmaiti 2 days ago
This exact same reasoning seamlessly transitions into ANOVA calculations as well. I personally think the widely accepted proof of unbiasedness of the estimator is intuitive enough. Math doesn't always have to cater to the logical faculties of what a 5-year-old can comprehend. I'm a big fan of Luis' content, but this video came off as a bit weak in the math-intuition part, not to mention super tedious.
@hamidalavi5595 2 days ago
Thank you for your amazing educational videos! I have a question though: are there any transformers (+ attention mechanism) involved in text-to-image generators (diffusion models)? If not, then how is the semantics of the text captured?
@archangecamilien1879 2 days ago
I know someone who was obsessed with knowing that, lol, back in the day...didn't manage to find a good explanation...there are other things he tried to understand the reasons for, lol...he wasn't sure that many others cared...well, lol, eventually he didn't care much himself, but at a time he did...
@archangecamilien1879 2 days ago
Lol...2:35...I hadn't reached that part before I made that comment...so, lol, the person in question wasn't the only one who would obsess over things like that in math...they often just feed you something without explanation, lol...even if you are a math major, I suppose they are thinking "They'll get it later"...they just tell you "You do this", lol...they also just fed the Jacobian to the students in the person's class, without explanation...well, lol, I suppose the student in question didn't have a textbook, but he doubts they explained where the Jacobian comes from in the textbook...
@archangecamilien1879 2 days ago
...he would search online, lol...I don't think there were many math videos back then, or perhaps there were and he didn't notice them...
@archangecamilien1879 2 days ago
Lol...the person in question didn't even really understand what was meant by "degrees of freedom"...I mean, lol...they would just throw the term around..."if you have a sample of 6 elements, there are 5 degrees of freedom", I think it could get more complicated than that, forgot, like the product of two sample spaces or something?...Not sure, lol...they would then do some other gymnastics...but they would just throw in the word "degrees of freedom" like it was a characteristic, like height, eye color, hair color, etc, lol...like, there would be tables, and they would inform you how many degrees of freedom there were, and that's the only times I would see the term ever appear...and everyone else seemed fine with it, lol, or maybe the student in question was just an idiot and everyone else had an intuitive sense of what was going on (he says he doubts it, lol)...
@SerranoAcademy 2 days ago
lol! Your friend sounds a lot like me 😊
@cc-qp4th 2 days ago
The reason for dividing by n-1 is that by doing so the sample variance is an unbiased estimator of the population variance.
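For anyone who wants the algebra behind that statement, here is one standard textbook derivation (my own summary, not the route the video takes):

```latex
\[
\sum_{i=1}^{n}(X_i-\bar{X})^2
  = \sum_{i=1}^{n}(X_i-\mu)^2 \;-\; n(\bar{X}-\mu)^2 .
\]
% Taking expectations, using E[(X_i - mu)^2] = sigma^2 and E[(Xbar - mu)^2] = sigma^2 / n:
\[
\mathrm{E}\!\left[\sum_{i=1}^{n}(X_i-\bar{X})^2\right]
  = n\sigma^2 - \sigma^2 = (n-1)\,\sigma^2 .
\]
% So dividing the sum by n underestimates sigma^2 by the factor (n-1)/n,
% while dividing by n-1 gives an unbiased estimator.
```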
@srastiporwal6791 2 days ago
You deserve it. Really awesome videos for beginners to understand things visually, really helpful 🙏
@weisanpang7173 2 days ago
The algebraic explanation of bariance vs variance was somewhat sloppy.
@santiagocamacho264 2 days ago
At 8:20 you say "calculating the correct mean". Do you perhaps mean (no pun intended) "calculating (estimating) the correct variance"?
@SerranoAcademy 2 days ago
Ah good catch! Yes that’s what I **mean**t, 🤣 gracias Santi! 😊
@ian-haggerty 3 days ago
Awesome! You've sold another book :)
@SerranoAcademy 2 days ago
Yay thanks! Enjoy, and lemme know what you think!
@billmichae 3 days ago
Luis, compliments on an exceptional presentation style. I have seen many of your super videos on stats, and now I am running through your ML ones. As always, you are fantastic. Do you have any coursework on Udemy?
@uroy8665 3 days ago
Thank you for the detailed explanation; I came to know about the new BAR function. About (n-1), at the start I thought this: say we have 100 people and we want to find the variance of height. Suppose there is one person with exactly the mean height; in that case the mean will be correct when dividing by 100, but not the variance, as one term will be zero and that will lower the variance. If nobody has exactly the mean height, then dividing by 100 seems fine for the variance. But later I thought: if 2, 3, 4, etc. people have the mean height, that reasoning would not work. Anyway, after watching this video my thinking changed and improved, as I am not from a stats background.
@SerranoAcademy 2 days ago
Thanks, that's a great argument! Yeah, at some point I was thinking about it in a similar way, or considering adding an extra person with the mean height. I couldn't finish the argument, but I believe that's an alternative way to obtain the n-1.
@jbtechcon7434 3 days ago
I once got in a shouting match at work over this. I was right.
@robharwood3538 3 days ago
A while back I came across an explanation of the (n-1) correction term from a Bayesian perspective. (It might have been E. T. Jaynes' book "Probability Theory: The Logic of Science", but I can't recall for certain.) I was hoping you might go over it in this video, but I guess you didn't come across it in your search for the answer. One thing that is relevant, and illuminated by the Bayesian perspective, is that the Bessel correction for estimated variance implicitly assumes a particular sampling method from the population. In particular, I believe it assumes you are performing sampling with replacement (or that the population is so large that sampling without replacement is nearly identical to sampling with replacement). But in some non-trivial cases that may not actually hold, and so the Bessel correction may not be the appropriate estimator. For example, if the entire population is small, like 10 or 20 items, and you sample without replacement, then the distribution behaves differently, in the same way that a hypergeometric distribution is sometimes better than a binomial distribution. As an extreme manifestation of this, suppose you sample (without replacement) all 10 items from a population of just 10 items. Then using the Bessel correction would obviously give the wrong 'estimate' of the true variance, which should be divided by n, not (n-1). A Bayesian approach (taking the population size, N=10, as a given assumption) would correctly adjust the 'posterior variance' estimate to the real best estimate for sample sizes all the way up to 10, at which point it would equal the true variance. Unfortunately, I don't remember how to derive the Bayesian estimate of the variance. But maybe if you found it, it might shed even more light on your ultimate question of 'why (n-1)?', and perhaps you could do a follow-up video? Just an idea! Cheers!
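The extreme case in this comment is easy to check numerically. A small sketch (my own, with a made-up population of 10 numbers; it is not an implementation of the Bayesian estimator described above): when the "sample" is the entire population, dividing by n recovers the true variance exactly, and the n-1 correction overshoots.

```python
import numpy as np

population = np.array([2.0, 3.0, 5.0, 5.0, 6.0, 7.0, 8.0, 8.0, 9.0, 12.0])
N = len(population)

true_var = np.sum((population - population.mean()) ** 2) / N  # population variance

# "sampling" all N items without replacement just gives back the population
sample = population
ss = np.sum((sample - sample.mean()) ** 2)

print("true variance:", true_var)
print("divide by n:  ", ss / N)        # equals the true variance
print("divide by n-1:", ss / (N - 1))  # overshoots in this edge case
```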
@SerranoAcademy 2 days ago
Thanks! Ah, this is interesting. I think in some sense I'm looking at sampling without replacement, but in a different light. I like your Bayesian argument; I need to take a closer look and get back to you.
@prof.nevarez2656 3 days ago
Thank you for this breakdown Luis!
@mcan543 3 days ago
[0:00] Introduction and Bessel's Correction - Introducing Bessel's Correction and why we divide by n-1 instead of n to estimate variance.
[0:12] Introduction to Variance Calculation - Explaining the premise of calculating variance and introducing the concept of estimating variance using a sample instead of the entire population.
[1:01] Definition of Variance - Defining variance as a measure of how much values deviate from the mean and outlining the basic steps of variance calculation.
[1:52] Introduction to Bessel's Correction - Discussing why we divide by n-1 when calculating variance and introducing Bessel's Correction.
[2:35] Challenges of Bessel's Correction - Sharing personal challenges in understanding the rationale behind Bessel's Correction and discussing the research process on the topic.
[3:20] Alternative Definition of Variance - Presenting an alternative definition of variance to aid in understanding Bessel's Correction and expressing curiosity about its presence in the literature.
[4:45] Quick Recap of Mean and Variance - Briefly revisiting the concepts of mean and variance, demonstrating how they are calculated with examples, and explaining how variance reflects different distributions.
[7:05] Sample Mean and Variance Estimation - Explaining the challenges of estimating the mean and variance of a distribution using a sample and discussing why the naive sample variance is not a good estimate.
[8:49] Bessel's Correction and Why n-1 is Used - Explaining how Bessel's Correction provides a better estimate of variance and why we divide by n-1 instead of n. Emphasizing the importance of making a correct variance estimate.
[10:51] Why Better Estimation Matters - Discussing why the original estimate is poor and why making a better estimate is crucial. Explaining why the sample mean is a good estimate.
[13:02] Issues with Variance Estimation - Illustrating the problems with variance estimation and demonstrating with examples why using the correct mean is essential for accurate estimates. Explaining the accuracy of estimates made using n-1.
[15:04] Introduction to Correcting the Estimate - Discussing the underestimated variance and the need for correction in the estimation.
[15:57] Adjusting the Variance Formula - Explaining the adjustment in the variance formula by changing the denominator from n to n-1.
[16:22] Calculation Illustration - Demonstrating the calculation process of variance with the adjusted formula using examples.
[16:57] Better Estimate with Bessel's Correction - Discussing how the corrected estimate provides a more accurate variance estimation.
[18:24] New Method for Variance Calculation - Introducing a new method for calculating variance without explicitly calculating the mean.
[20:06] Understanding the Relation between Variance and Bariance - Explaining the relationship between variance and bariance, and how they are related mathematically.
[21:52] Demonstrating a Bad Calculation - Illustrating a flawed method for calculating variance and explaining the need for correction.
[23:37] The Role of Bessel's Correction - Explaining why removing unnecessary zeros in the variance calculation leads to better estimates, equivalent to Bessel's Correction.
[25:08] Summary of Estimation Methods - Summarizing the difference between the flawed and corrected estimation methods for variance.
[26:02] Importance of Bessel's Correction - Emphasizing the significance of Bessel's Correction for accurate variance estimation, especially with smaller sample sizes.
[30:19] Mathematical Proof of the Variance-Bariance Relationship - Providing two proofs of the relationship between variance and bariance, highlighting their equivalence.
[35:24] Acknowledgments and Conclusion
@SerranoAcademy 3 days ago
Thank you so much! @mcan543
@SerranoAcademy 3 days ago
I pasted it into the comments, it's a really good breakdown. :)
@user-um4di5qm8p 3 days ago
damn man, you a legend!
@user-um4di5qm8p 3 days ago
By far the best explanation, thanks for sharing!
@shafiqahmed3246 3 days ago
Serrano, you are a genius, bro. Thrilled to watch your videos.
@gauravruhela7393 4 days ago
I really liked the way you showed the motivation behind the softmax function. I was blown away. Thanks a lot, Serrano!
@larissacury7714 5 days ago
Thank you!
@ravipativenkatesh6810 5 days ago
Enjoyed learning about RBMs (excellent work).
@bidaneleon1106 6 days ago
Luis Serrano, you're a master. You explain it brilliantly, very good images. You are the best, marvelously explained with such beautiful images, which help us grasp the concepts behind topic modelling <33
@andreanegreanu8750 6 days ago
Alex, thank you for your incredible work making complex things accessible. But why on earth does the value function share the same parameters theta as the policy function?! Can you confirm that? And if that is the case, why?
@Areachi 7 days ago
Thank you so much, one of the best videos on the topic
@iliasp4275 7 days ago
Excellent video. Best explanation on the internet!
@qinjiang6816 7 days ago
Very good. But at the end, the heads are not added together, they are concatenated.
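A minimal sketch of that concatenation step (my own illustration, not code from the video; the shapes are made up): the per-head outputs are stacked side by side and then passed through a single output projection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head, n_heads = 3, 2, 4

# made-up outputs of 4 attention heads, each of shape (seq_len, d_head)
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(n_heads)]

concat = np.concatenate(head_outputs, axis=-1)   # (3, 8): concatenated, not summed
W_o = rng.normal(size=(n_heads * d_head, n_heads * d_head))
output = concat @ W_o                            # final output projection
print(concat.shape, output.shape)                # (3, 8) (3, 8)
```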
@bbarbny 7 days ago
Amazing video, thank you very much for sharing!
@haikvoskerchian2857 8 days ago
At 25:00 you said that the fact that the Poisson distribution has 2 modes is an anomaly. But actually, for every integer lambda the Poisson distribution has two modes. I wouldn't call that an anomaly.
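That two-modes claim is quick to verify (a small check of my own, using SciPy): for an integer rate lambda, P(X = lambda - 1) equals P(X = lambda), so both values are modes.

```python
from scipy.stats import poisson

lam = 4                           # any positive integer rate parameter
print(poisson.pmf(lam - 1, lam))  # ~0.1954
print(poisson.pmf(lam, lam))      # same value: modes at lam-1 and lam
```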
@samaardi 8 days ago
I really like the way you teach these topics. thank you 🤩
@mikelCold 8 days ago
Where does context length come in? Why can some models handle longer contexts than others?
@user-xk7dy4nb7w 8 days ago
Great Video. Appreciate all the hard work. Very informative.
@tianqilong8366 8 days ago
Haha, coming here from the video about Generative Adversarial Networks and realizing I need to understand this concept in order to understand GANs. The recommendation algorithm really guessed my thoughts right...
@stephenlashley6313 8 days ago
All these videos are great. There is a 100% mapping between this and all the AI stuff, with real brain observations and computational neuroscience. This author is brilliant!
@user-wj7ww4ny7x 8 days ago
Thanks, I do understand this.
@gunamrit 9 days ago
At 36:15, it should be divided by the square root of the dimension of the vector, not the length of the vector. The length of the vector is sqrt(coeff_i^2 + coeff_j^2). Amazing video! Keep it up! Thanks
@SerranoAcademy 9 days ago
Thank you! Yes absolutely, I should have said dimensions instead of length.
@gunamrit 9 days ago
@@SerranoAcademy The only thing between me and the paper was this video, and it helped me fill in the blanks I had after reading the paper. Thank you once again!
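To make the dimension-vs-length distinction concrete, here is a minimal scaled dot-product attention sketch (my own illustration of the formula from the "Attention Is All You Need" paper, not code from the video): the scores are divided by the square root of the key dimension d_k, a fixed number for the model, not by the norm of any particular vector.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                    # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)      # divide by sqrt(d_k), not by any vector's length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 2)
```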
@tianqilong8366 9 days ago
Mad respect to you for explaining neural networks so clearly in 20 minutes, actually amazing.
@BetoAlvesRocha 10 days ago
Thank you very much, Professor Serrano! =) I've seen many explanations of Naive Bayes on YouTube, most of them from channels that I really appreciate, but your explanation is the best one by far. Thank you so much for making the link between the logic and the Bayes formula!
@pedramhashemi5019 10 days ago
A great introduction! Thank you sincerely for this great gem!
@hamzawi2752 11 days ago
This guy is not a normal teacher. He is a legend and a genius. He has a way of transferring knowledge and explaining concepts in a very, very simple way. Thank you so much for every single second you spent making this video. Please try to make more videos and publish more books.
@SerranoAcademy 11 days ago
Thank you for such a kind message! It's people like you who give me the energy to continue making content. I have a bunch of videos coming out in the next few weeks, keep an eye out!
@hamzawi2752 11 days ago
@@SerranoAcademy I finished my PhD and I am preparing for job interviews in ML. I spent too much time trying to understand the intuition behind Naive Bayes until I found your masterpiece. I promise that if I am hired, I will buy your books.
@hamzawi2752 11 days ago
I hope you can make a series on LLMs from scratch. I haven't found anyone who has made LLM content for absolute beginners, and in the meantime there is a big hype around LLMs. Thank you so much for your valuable time.
@SerranoAcademy 11 days ago
@@hamzawi2752 Thanks! I made this playlist about LLMs and attention, check it out! ru-vid.com/group/PLs8w1Cdi-zva4fwKkl9EK13siFvL9Wewf
@epistemophilicmetalhead9454 11 days ago
n-grams: look at n words in a sentence and predict the next word by looking at previous instances of those n words being used in the same order.
Sentiment analysis: take a weighted sum over all the words in the sentence, 1 * (weight associated with each word) + bias. Words that indicate a positive/negative sentiment may have positive/negative weights. This job is carried out by a perceptron (11:57).
Tokenization: break a sentence down into tokens.
Word embeddings (Word2Vec): a neural network comes up with vectorial representations for each word that describe the features of that word. Similar words' embeddings will have a higher similarity score. The values of these embeddings are learned, so the neural network finds an optimal set of embeddings.
Positional encoding: you use a sequence of functions that depend only on the position of a token in the sentence and add those positional values to the word embeddings to get positional encodings.
Words are similar if they appear in the same context many times.
Softmax: turns scores into probabilities, because when predicting you don't want to lose out on other suitable candidates (next tokens to predict); e^x is always > 0, so exponentiating helps even with negative scores (see the sketch after this comment).
Architecture: in an encoder/decoder, you have a stack of attention and feed-forward layers with a softmax layer at the end.
Fine-tuning: use suitable tasks (similar to what you want the model to achieve) to post-train.
@epistemophilicmetalhead9454 11 days ago
Word embeddings: vectorial representations of words. The values in a word embedding describe various features of the word. Similar words' embeddings have a higher cosine similarity value (see the sketch after this comment).
Attention: the same word may mean different things in different contexts. How similar the word is to the other words in the sentence gives you an idea of what it really means. You start with an initial set of embeddings, take into account the other words in the sentence, and come up with new embeddings (via trainable parameters) that better describe the word contextually. Similar/dissimilar words gravitate towards/away from each other, as their updated embeddings show.
Multi-head attention: take multiple possible transformations of the current embeddings and train a neural network to choose the best ones (contributions are scaled by how good the embeddings are).
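A tiny sketch of the "similar words have higher cosine similarity" idea (made-up 3-dimensional embeddings, purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up embeddings: "apple" and "pear" should be close, "car" far from both
apple = np.array([0.9, 0.1, 0.0])
pear  = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.1, 0.95])

print(cosine_similarity(apple, pear))  # high, about 0.98
print(cosine_similarity(apple, car))   # low, about 0.01
```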