bycloud
I cover the latest AI tech/research papers for fun

xLSTM: The Sequel To The Legendary LSTM
11:42
2 months ago
How Did Llama-3 Beat Models x200 Its Size?
13:55
3 months ago
All You Need To Know About Running LLMs Locally
10:30
5 months ago
Mamba Might Just Make LLMs 1000x Cheaper...
14:06
5 months ago
The Turbulent Rise of AI Avatars
11:53
6 months ago
2023 AI Rewind But In Memes
2:22
7 months ago
We Are Not Ready For AI Generated Videos
9:17
7 months ago
AI that can read 150,000 words at once
5:26
7 months ago
First Look At Meta's Emu Edit & Emu Video
3:49
8 months ago
Comments
@hightidesed 20 hours ago
Can you maybe make a video explaining how Llama 3.1 8B is able to have a 128k context window while still fitting in an average computer's RAM?
@CharlesVanNoland 20 hours ago
LLMs are always going to be fundamentally flawed because they're created by using a compute-heavy algorithm to train a network on a static dataset. They don't learn while inferring (i.e. learn from experience). While recycling their output back in as input makes it appear that they're capable of one-shot learning, they're not learning anything. It's a fixed crystallization of knowledge gleaned from text on the internet, and varying your prompt is akin to shining a laser at different angles through this knowledge crystal to get it to shine something different out of it. I want a crystal that changes based on what lasers are shining through it, so that it affects what it outputs in the future, just like a brain that learns from experience. Massive backprop-trained networks will one day be regarded as outdated, old-fashioned antiques once a proper dynamic online learning algorithm is created that is effectively digital cognition/consciousness. Nobody has created it yet, but after 20 years of staying on top of the neuroscience and artificial intelligence research that has come down the pipe, I know we're close - and it's not going to involve backprop-training massive deep networks on static datasets.
@casualuser5527 20 hours ago
Fireship thumbnail bruh. Got baited 😂
@SuperSmashDolls 20 hours ago
So, the way I've understood grokking is that when you train an AI model, you also have a regularization step, which reduces weights towards zero. And by grokking you're giving that regularization step a LOT of opportunities to prune weights and patterns that aren't contributing to model performance. Because, remember, the first thing we do with an AI model is initialize all the weights to random values, so there are going to be a lot of patterns that don't actually mean anything but happen to score well enough on test output to not be overwritten by normal gradient updates. The Grokfast paper seems to imply this explanation is totally wrong and that grokking is just a fundamental property of gradient-descent backprop. Or is regularization just so commonplace that it's assumed and nobody calls it out?
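(For what it's worth, the regularization step being described is usually L2 weight decay folded into the optimizer update. A minimal sketch of that shrink-toward-zero behavior, with made-up constants:)

```python
# Plain SGD with L2 weight decay: the wd * w term pulls every weight toward
# zero each step, so weights that never earn a useful gradient get pruned.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # weights start out random
lr, wd = 0.1, 0.01         # made-up learning rate and decay strength

def sgd_step(w, grad):
    return w - lr * (grad + wd * w)

# A weight whose gradient stays zero decays geometrically: (1 - lr*wd)**t,
# so after 1000 steps it sits at roughly 37% of its starting magnitude.
for _ in range(1000):
    w = sgd_step(w, grad=np.zeros_like(w))
print(np.abs(w).max())
```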
@sarahlynn7807 20 hours ago
Overfitting seems like an extra-opaque form of reasoning.
@Macorelppa 21 hours ago
Man your consistency is inhuman!
@picklenickil 21 hours ago
😂😂😂 As a behavioral scientist.. I think this one is going straight to the crapper.. mark my words. 😂😂😂
@norlesh 22 hours ago
We need a foundation model that has been trained until Grokking a children's primary school syllabus before it ever sees an equation or chemical formula.
@Interpause 22 hours ago
Wait, if I'm understanding Grokfast correctly, they're attempting to predict the rate of change of the weights at any given moment using a Fourier transform? That's insane, that has way more use cases for other neural network architectures outside of just transformers.
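(My reading of the paper is that the FFT is the analysis framing; the practical implementation is a low-pass filter on the gradients. A rough sketch of the EMA variant as I understand it, with illustrative rather than tuned hyperparameters:)

```python
# Grokfast-style EMA gradient filter, as I understand the paper's EMA
# variant. alpha/lamb are illustrative values, not the tuned settings.
import torch

def grokfast_ema(model: torch.nn.Module, ema: dict,
                 alpha: float = 0.98, lamb: float = 2.0) -> dict:
    """Call between loss.backward() and optimizer.step()."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        # Low-pass filter: exponential moving average of past gradients.
        prev = ema.get(name, torch.zeros_like(p.grad))
        ema[name] = alpha * prev + (1 - alpha) * p.grad
        # Amplify the slow (low-frequency) component of the update.
        p.grad = p.grad + lamb * ema[name]
    return ema
```

(with `ema = {}` carried across training steps)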
@Djplax11 23 hours ago
Just saying, that is a lot like the Dunning-Kruger curve.
@narpwa 1 day ago
what about 1T experts
@OwenIngraham 1 day ago
such good content
@copywright5635 1 day ago
This seems... oddly human. Does anyone else agree? It's weird that repetition is something both humans and AI greatly benefit from
@RickySupriyadi 1 day ago
so SSMs won't be able to generalize? @bycloudAI
@jp.girardi 1 day ago
I struggle to comprehend how this process doesn't result in more hallucinations through syllogistic reasoning, given that the 'generalization' seems to be derived precisely from this inherent syllogism.
@blockshift758 1 day ago
I'll call it MoE ("moe") instead of "em oh ee".
@BrainSlugs83 1 day ago
Bro, that is NOT how you pronounce "Heinlein". 🤦‍♂
@robertputneydrake 1 day ago
"THE NEEDLE IN THE HAYSTACK"! THAT'S WHAT I SAID!
@clearandsweet 1 day ago
I love this because it's exactly the same as how human beings learn. Also very excited for that paper mentioned at the end. This type of generalization is a big smoking gun that will lead to AGI very quickly, so speeding up the grok is incredibly hype news
@74Gee 1 day ago
I find that Alice-in-Wonderland-type responses can be significantly improved by system prompting the model to form data structures from the known data and then inferring from that structure. Something like this (a minimal version):

```
You are tasked with solving complex relationship questions by first mapping all known facts into a JSON structure and then using this structure to infer answers. When given a question, follow these steps:
1. Extract all given facts.
2. Create a JSON structure to represent these facts.
3. Use the JSON structure to navigate and infer answers.
4. Provide clear and logically consistent responses based on the JSON file.
```

I used this technique very successfully when working on gossip analysis and determining the source of gossip, and quickly realized its benefits in other logical fields.
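(A hypothetical sketch of what steps 2-3 could look like if the inference were actually executed in code rather than left to the model; the names, the `heard_from` relation, and the `source_of` helper are all invented for illustration:)

```python
# Hypothetical fact structure in the spirit of step 2, applied to the
# gossip-sourcing use case mentioned above. All names/relations invented.
facts = {
    "heard_from": {  # who heard the rumour from whom
        "dave": "carol",
        "carol": "bob",
        "bob": "alice",
    }
}

def source_of(person: str) -> str:
    """Step 3: walk the heard_from chain back to whoever started it."""
    chain = facts["heard_from"]
    while person in chain:
        person = chain[person]
    return person

print(source_of("dave"))  # -> alice
```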
@danielsan901998 1 day ago
I am not surprised about the failure of LLMs to do basic reasoning with problems that involve numbers; it is already known that language models don't understand basic math. The most successful strategy is, instead of asking the LLM to solve the problem, to translate the problem into a more explicit definition. That's how Google managed to solve some mathematical olympiad questions by translating to Lean, with the advantage that you can verify the answer automatically and reject unverifiable proofs. Another alternative is asking the model to solve the problem using a programming language; since the Python dataset is larger than the Lean dataset, it is easier to train a model or use a pretrained model.
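(A toy illustration of that translate-then-execute idea on the Alice-in-Wonderland question the video discusses; the function is an invented example, not anything from the paper:)

```python
# "Alice has N brothers and M sisters. How many sisters does Alice's
# brother have?" Translated into code, the trap disappears entirely.
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    # Alice's brother shares all of Alice's sisters, plus Alice herself.
    return m_sisters + 1

assert sisters_of_alices_brother(3, 6) == 7  # verifiable, unlike free text
```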
@TimothyChakwera 1 day ago
I knew FFT was the way to go
@TheInfectous 1 day ago
I wonder how long it will take before the pattern recognition and replication machines are thought of as pattern recognition and replication machines instead of magic. Magic certainly sells better though, I guess, but it comes at the cost of a steep crash in the future.
@j.j.maverick9252 1 day ago
Interesting graph for the LLM learning curve: up, then down, then up again. Looks eerily similar to Dunning-Kruger.
@toasteroven6761 1 day ago
Ngl, "Grokking" sounds like some amphibian level brainrot from TikTok.
@voidling93 1 day ago
I'm surprised you didn't AI-voice-change that meme at 7:37 to match the wording.
@ImaplanetJupiteeeerr 1 day ago
Awesome video all in all. I had wanted a recap of what's been happening in text-to-video over the last few months, and this answered everything I had in mind! :) Thanks. ❤️
@dzbuzzfeed908 1 day ago
1. **Current State of LM Benchmarks** *Timestamp: 0:00:00*
2. **Benchmark Performance Issues** *Timestamp: 0:00:03*
3. **Implications of Reordering Questions** *Timestamp: 0:00:20*
4. **Alice in Wonderland Paper Findings** *Timestamp: 0:01:57*
5. **The Concept of Grokking** *Timestamp: 0:04:27*
6. **Grokking vs. Double Descent** *Timestamp: 0:06:06*
7. **Potential Solutions for Improved Reasoning** *Timestamp: 0:09:02*
8. **Grokking in Transformers and New Research** *Timestamp: 0:08:12*
9. **Grokfast Implementation** *Timestamp: 0:11:28*

### Ads
1. **HubSpot AI Resources** *Timestamp: 0:01:13*

### Funny Jokes
1. **"Absolute dog water"** *Timestamp: 0:00:03*
2. **"Kind of crazy from a more cynical and critical perspective"** *Timestamp: 0:00:33*
3. **"Can you imagine an AI being able to do this, only humans would be able to come up with something this random and absurdly funny"** *Timestamp: 0:03:13*
4. **"If an AI can truly do this it actually might be so over for us, so for the sake of burning down rainforests"** *Timestamp: 0:03:23*
5. **"Elon's Grok LLM is probably named after the book and not related to the ML concept that we are talking about today"** *Timestamp: 0:05:43*
6. **"Mr. Zuck saying that Llama 3 70B never stopped learning even after they trained it three or four times past the Chinchilla optimum is not copium"** *Timestamp: 0:10:03*
@BooleanDisorder 1 day ago
First, we need to realize that there is no such thing as general intelligence. Second is that they are language models and represent language-based features with no time or steps. Only then will we be able to mimic the human brain. Step-by-step reasoning is not a thing that actually happens. It is an interpretation of some people's externalization of their reconstruction of how they came to an answer. It's all post-hoc rationalizations that don't actually tell you anything about an underlying system. Most of what is happening is simply a very complex combination of representations that are filtered through other representations and iteratively refined until it fits with an internal world model of all representations without violating any of the ones it pertains to.
@mirek190 1 day ago
I wonder how well Llama 3.1 70B, Gemma 2 27B, or Opus 3.5 would do on that test...
@anywallsocket 1 day ago
Why would super overtraining improve generalization??
@ilikegeorgiabutiveonlybeen6705
If it will, it will, idk. We do learn things like this though... partially.
@zyansheep 1 day ago
I looked through your videos and saw I had watched literally every one but didn't subscribe lol. I'm subscribed now!
@randomperson-nq7nk 1 day ago
Why is the comment section so small
@PotatoKaboom 1 day ago
Nice video, well done! But wasn't the grok paper about specific puzzles? In your statements it seems like grokking could work magically for any task... Maybe I'm wrong, but I thought it was for a very specific subset of tasks like "addition and subtraction", where the weights could randomly "click" at some point and just get the whole task right. This would never happen for a general-use LLM, right?
@telotawa 1 day ago
omg they put a low pass filter on it to make it grok faster? that's nuts
@dot1298 1 day ago
*reasoning is overrated* - now prove me wrong xD
@briangman3 1 day ago
They need benchmark testing that has variations in question inputs and randomization of answer choices.
@Rockyzach88 1 day ago
I'm sure the people who are actually passionate about building these things are doing all the things.
@BYZNIZ 1 day ago
Great video, shout out to Jerry M for recommending the channel
@Ikbeneengeit 1 day ago
Intelligence is more than just a model of the world, of course. It's also the ability to suggest actions to shape the world.
@GodbornNoven 1 day ago
You don't know how right you are 😂 Grokking will be a super important step to AGI. Essentially, you're training a model on data so much it practically becomes an expert at it. At some point, we will have the quantity of compute necessary to achieve this, and at that point we might as well take the road of brute force. Naturally, algorithmic breakthroughs are incredibly important and essential to the improvement of LLMs, as they allow us to do more with less.
@tankieslayer6927 1 day ago
In the loss landscape, some minima are better than others. These minima are flat, i.e., the diagonalized Hessian matrix contains zeros. Due to the stochastic nature of the SGD algorithm, the gradient trajectory tends to jump out of non-flat minima and get stuck in flat minima, resulting in a simpler parametrization of the function. Flat minima only lead to slightly better generalization and will certainly not lead to AGI. The fact that some ML researchers had to invent the term "grokking" only shows their lack of understanding of basic multivariable calculus.
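(For reference, the standard flat-minima heuristic the comment is invoking, stated from memory rather than quoted from the video:)

```latex
% Quadratic expansion of the loss L at a minimum \theta^*:
\[
  L(\theta^* + \delta) \approx L(\theta^*) + \tfrac{1}{2}\,\delta^\top H\,\delta,
  \qquad H = \nabla^2 L(\theta^*) \succeq 0 .
\]
% "Flat" directions are eigenvectors of H with eigenvalue \lambda \approx 0,
% so L(\theta^* + t v) \approx L(\theta^*) along them: SGD noise barely
% raises the loss in flat basins, while sharp basins (large \lambda_{\max})
% are escaped more easily.
```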
@TiagoTiagoT 1 day ago
How far are we from just having a virtual block of solid computronium with inference result simply being the exit points of virtual Lichtenberg figures forming thru it, with most of the volume of the block remaining dark?
@ravenragnar 1 day ago
So how do we make money from it?
@bloopbleepnothinghere 1 day ago
As folks get past the initial "wow" moment of this new tech, we begin to see through the veil. An LLM is only as smart as its training data. It predicts tokens. It's a parlour trick, an amazingly powerful illusion of intelligence. People can already spot an AI piece of art from a mile off; the same goes for text. It all smells of LLM, and the more we use LLMs for brainstorming, especially the highly guard-railed models, the less interesting they become. Then there is the reliability problem. You can't really trust what it says in the same way you can with traditional search that references actual human-produced content. Idk, there is value in LLMs, significant value, but we are sobering up and levelling our expectations rapidly.
@j.d.4697 1 day ago
By cheating like this, companies are only shooting themselves in the foot once it gets real and closer to AGI, because they will be left behind with this approach.
@raspberryjam 1 day ago
I'm thinking: if you imagine a model divided in two halves, where one is the generalization part and the other is the overfitting part, it's still most beneficial to have the generalization half get as close to the right answer as possible, so as to lighten the load on the overfitting half. Or put another way, you should devote as many parameters as you can to memorizing the corrections to the wrongest answers, and you can do that by minimizing the number of parameters needed to get to what is generally a fairly close answer.
@jankram9408 1 day ago
I am sorry, but "grokking" just sounds like a brain-rot term...
@Deagan 1 day ago
we goonin && grokkin
@mAny_oThERSs 1 day ago
thanks for the shoutout
@rubncarmona 1 day ago
Grokking is akin to the evolution of language even after a population has become fully literate. Every once in a while someone figures out a new connection between seemingly unrelated concepts, or uses one of them in a new context by mistake or because they forgot the intended word, etc. This continuous increase in information entropy even after exhausting the parameter space reminds me a lot of what some scientists say about information in Degenerate Era black holes.
@320770471 1 day ago
This channel is worth watching just for the memes even if you have no clue what the heck he is talking about