Dropout literally has a whole show about nerds making pedantic corrections called "Um, Actually". Mike Trapp was the host. But I realize I don't get any points because I failed to say "Um, Actually".
I absolutely agree. Tossing in more layers just *feels* wrong. There's definitely something missing in these newer neural models: while they perform well, they don't do so efficiently. Either they'll massively improve in the future by adopting some of the older techniques, or by being crafted architecturally with more biological inspiration.
Not really, this is older technology to relate similar contexts together. Modern LLMs (or Muppet Models, as I like to call them) use continuous representations to do that.
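A minimal sketch of what "continuous representations" buys you, using made-up 3-dimensional vectors (real models use hundreds or thousands of dimensions, learned from data):

```python
import math

def cosine(u, v):
    # Cosine similarity: how aligned two vectors are, ignoring their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d "embeddings" (illustrative numbers, not from any real model).
vec = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "pizza": [0.10, 0.20, 0.90],
}

# Words that appear in similar contexts end up near each other,
# so similarity falls out of simple geometry.
print(cosine(vec["king"], vec["queen"]))  # high: similar contexts
print(cosine(vec["king"], vec["pizza"]))  # lower: dissimilar contexts
```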
I haven't finished the video, so apologies if you cover it, but in the 2023 CS224N NLP lecture on coreference resolution, Chris Manning introduces the (very complicated and, to me, demoralizing) Hobbs algorithm, and then basically says something like "Hobbs HIMSELF said publicly that he didn't like the algorithm, and often pointed to it as ~an example of how we clearly needed something better."
Your video has sparked a meaningful conversation. How has being a young-onset Parkinson's patient shaped Jessica's perspective on life? As the host of a dream interpretation channel, I'm curious to explore how her experiences with Parkinson's influence her dreams and subconscious mind. I truly appreciate the opportunity to learn more about Jessica's journey, and I've already liked and subscribed to the channel for more insightful content like this.
Can you share this video with the president of Harvard? I don’t think she got the message. Yet somehow DEI still thinks it was okay for her to cheat. DEI is accusing everyone of racism.
Awesome, can't believe those guys tried doing this in your class. This is like committing a burglary and leaving behind a confession note and a business card. This is really funny.
You're welcome! Glad to be of help. This is an old video (pre-neural revolution), I just went through it again and it holds up pretty well (except for my not-so-great green screen).
If you're a human, thank you! If you're a bot, you're an excellent example of the technology in the video, so thank you for providing a real-world example. :)
Great intro video, and lovely coverage of the key concepts there. I listened to the guy credited with coming up with the transformer model, and I think that in adjusting the word vectors to predict the next word in a sequence more effectively, it's also mapping phrases, sentences, ideas, and concepts into multidimensional space, up to its input context length. So it ends up having what Isaac Asimov described as a "perceptual schematic" of the world, how everything relates to everything else, encoded in multidimensional space. Then all the behaviours it's trained to perform via RLHF are possible because it has this initial perceptual schematic.
Yes, but that schematic isn't a schematic (yet). It's just a vector space, which means that the exact meanings can get fuzzy. This association can only get us so far, which is why we're starting to see the technology's limits. Exciting to see what happens!
@JordanBoydGraber I'm not sure we are starting to see the technology's limits? I appreciate your breadth and depth of knowledge in the field, but all of the indications from these companies would appear to suggest that we're not close to approaching an asymptote with these models yet. I do think I know what you're saying, though, and I agree: what it has is a set of interrelated numbers; it has no actual "knowledge" per se. It's what it's trained to do with these interrelated numbers that matters.

I think the best analogy for what I'm saying is the vision transformer model. It starts off representing small patches of the image as vectors, like words, with an associated positional encoding vector for each patch. It learns not only to classify the entire image, and to cluster similar images in dimensional space when it classifies them, but also to adjust the positional encodings for each patch, orienting it correctly within the image so it has a much better chance of classifying the whole image.

I see the same with the language transformer model. It's adjusting vectors at the word level, but because it's using these word vectors to do something with the whole block of text, it's still learning to place the entire block of text, in one-word iterations, up to its context length, in certain positions in interrelated dimensional space, just as it does with images, even though it only has vectors for words, like it only has vectors for small image patches. Then further training helps it prune down this vast interrelation to a conceptual map (this second part is just a theory from me). I think there may be a limit with purely language-based models, but potentially the sky is the limit with multimodality. The constraining factor appears to be hardware at the moment, in my opinion.
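A minimal sketch of the vision-transformer input pipeline described above, with random weights standing in for the learned patch embedding and positional encodings (the shapes and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy image: 8x8, single channel; split it into four 4x4 patches.
image = rng.standard_normal((8, 8))
patches = [image[r:r + 4, c:c + 4].reshape(-1)   # flatten each patch
           for r in (0, 4) for c in (0, 4)]      # 4 patches of 16 pixels

d_model = 8
# These are learned in a real ViT; random here for illustration.
W_embed = rng.standard_normal((16, d_model)) * 0.1
pos_enc = rng.standard_normal((4, d_model)) * 0.1  # one per patch position

# Each patch becomes a vector, plus a positional encoding that tells the
# transformer where in the image the patch came from -- the image analogue
# of word vectors plus word positions.
tokens = np.stack([p @ W_embed for p in patches]) + pos_enc
print(tokens.shape)  # (4, 8): a "sentence" of 4 patch tokens
```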
At 9:42, for R_m(H): what is the use of taking the expectation over all samples? As we saw previously (e.g., at 6:12), the calculation of the empirical Rademacher complexity does not use the true labels of the samples, just the size of the sample.
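For readers following along, the two quantities being contrasted are, in one standard notation (which may differ slightly from the lecture's):

```latex
% Empirical Rademacher complexity: depends on a particular sample S = (x_1, ..., x_m)
\hat{\mathfrak{R}}_S(H) = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, h(x_i)\Big]

% Rademacher complexity: the expectation over samples of size m drawn from D
\mathfrak{R}_m(H) = \mathbb{E}_{S \sim D^m}\big[\hat{\mathfrak{R}}_S(H)\big]
```

The empirical version indeed never uses true labels, but it does depend on which particular unlabeled points landed in S; taking the expectation over samples removes that dependence, leaving a quantity that depends only on the distribution D and the sample size m.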
Really excellent, like the other videos in this series. I am sharing the course with colleagues and hoping to go through the syllabus in the Spring. Thank you for the excellent work, Prof!
At 6:55, it is said that H(x, M) = sum(log(M(xi))), but according to the definition of cross entropy, it should be H(P, Q) = sum(-P(x) * log(Q(x))); so are we assuming P(x) is always one when computing perplexity?
This is a really good point. Typically when you evaluate perplexity you have one document that somebody actually wrote. E.g., you're computing the perplexity of the lyrics of "ETA". In that case we have a particular sequence of words: given the prefix "He's been totally", the probability of x_t = "lying" is one and of everything else is zero. For some generative AI applications, this might not be true; e.g., for machine translation you might have multiple references. Thanks for catching this unstated assumption!
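A minimal sketch of that computation, using a hypothetical toy model whose next-word probabilities are invented for illustration (not taken from the video or from any real model):

```python
import math

# Hypothetical next-word probabilities from some model M, keyed by a
# 3-word prefix. In practice these come from an actual language model.
model_prob = {
    ("He's", "been", "totally"): {"lying": 0.6, "honest": 0.3, "fine": 0.1},
}

def perplexity(tokens, model_prob):
    # Because the reference text fixes each next word, the "true"
    # distribution P is one-hot, so cross entropy reduces to
    # -1/N * sum(log M(x_i)) over the words that were actually written.
    log_probs = []
    for i in range(3, len(tokens)):
        prefix = tuple(tokens[i - 3:i])
        log_probs.append(math.log(model_prob[prefix][tokens[i]]))
    return math.exp(-sum(log_probs) / len(log_probs))

ppl = perplexity(["He's", "been", "totally", "lying"], model_prob)
print(ppl)  # exp(-log 0.6) = 1/0.6
```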
Should equation 6) be 2e^(-epsilon*m/2)? This is because the chance of sampling from the whole highlighted region is epsilon, so the probability of sampling from a specific region is epsilon/2. Thank you for the great lecture!
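If the setup is as described, with the highlighted region of total mass epsilon split into two pieces of mass epsilon/2 each, the reasoning behind the proposed bound would be (a sketch, assuming that geometry):

```latex
% Probability that all m i.i.d. samples miss one piece of mass \epsilon/2:
\Pr[\text{miss one piece}] = (1 - \epsilon/2)^m \le e^{-\epsilon m / 2}

% Union bound over the two pieces:
\Pr[\text{miss either piece}] \le 2\, e^{-\epsilon m / 2}
```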
Cool idea! Other rating ideas: how evenly does the straight line cut the country into two pieces? Are they the same size? The same population on each side of the line? This way you can allow for easy countries and hard countries, and score the "even" dissection of countries irrespective of how long the line is. Also a hint: your microphone has some awful automatic gain setting or something, where all the quiet sounds are amplified and all the loud sounds are quieted down, so your tiniest breath in is the same volume as your loudest talking bits. It's really annoying.
1) I like the population bisection idea. It's obviously easier to go through less populated areas. 2) Thanks for mentioning that, it's easy to tune these sorts of things out.
The name "muppet models" is super cute, but alas, the perspective that muppet models just make stuff up is misplaced: true in some ways but also dangerously wrong. They do get things wrong or out of place when speaking off the top of their head, but, um, statistically far less than humans already do. The confusion is that they're so much better at talking than humans that they can give almost-accurate, coherent essays about stuff completely off the top of their heads, while a human would just be saying "uhhhhh". If you give them the equivalent of a human salary's worth of compute, they can also check the accuracy of things a zillion times better than any human could ever check.
Yuval Pinter makes the excellent point that I shouldn't conflate "writing system" and "language". Indeed, this video should have been titled "How to Know if Your Writing System is Broken". See more in their excellent position paper on the subject: aclanthology.org/2023.cawl-1.1/