
Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think... 

Robert Miles AI Safety
156K subscribers
84K views

Published: 28 Sep 2024

Comments: 480
@MorRobots 3 years ago
"I'm not worried about the AI that passes the Turing Test. I'm worried about the one that intentionally fails it" 😆
@hugofontes5708 3 years ago
This sentence made me shit bits
@virutech32 3 years ago
holy crap...-_-..im gonna lie down now
@SocialDownclimber 3 years ago
My mind got blown when I realised that we can't physically determine what happened before a certain period of time, so the evidence for us not being in a simulation is impossible to access. Then I realised that the afterlife is just generalizing to the next episode, and yeah, it is really hard to tell whether people have it in their utility function.
@michaelbuckers 3 years ago
@@SocialDownclimber Curious to imagine what you would do if you knew for a fact that an afterlife existed. That when you die you are reborn to live all over again. You could most definitely plan several lifetimes ahead.
@Euruzilys 3 years ago
@@michaelbuckers Might depend on what kind of afterlife, and whether we can carry over some things. If it's Buddhist reincarnation, you would be inclined to act better towards other people. If it's just a clean reset in a new life, we might see more suicides, just like how gamers might keep restarting until they find a satisfactory starting position. But if there is no way to remember your past in the afterlife/reincarnation, then arguably it is no different from now.
@elfpi55-bigB0O85 3 years ago
It feels like Robert was sent back to us to desperately try and avoid the great green-calamity, but they couldn't give him a USB chip or anything to help because it'd blow his cover, so he has to save humanity through free high-quality youtube videos
@casperes0912 3 years ago
A peculiar Terminator film this is
@icywhatyoudidthere 3 years ago
@@casperes0912 "I need your laptop, your camera, and your RU-vid channel."
@killhour 3 years ago
Is that you, Vivy?
@MarkusAldawn 3 years ago
@@icywhatyoudidthere *shoots terminator in the face* Connor you know how to use the youtubes right
@Badspot 3 years ago
They couldn't give him a USB chip because all computers in the future are compromised. Nothing can be trusted.
@TibiaTactics 3 years ago
That moment when Robert says "this won't happen" and you are like "uff, it won't happen, we don't need to be afraid" but then what Robert really meant was that something much worse than that might happen.
@АлександрБагмутов
Nah, he just doesn't want to manufacture panicking Luddites here.
@josephcohen734 3 years ago
"It's kind of reasonable to assume that your highly advanced figuring things out machine might be able to figure that out." I think that's really the core message of this channel. Superintelligent AI will be way smarter than us, so we can't trick it.
@_DarkEmperor 3 years ago
Are you aware that a future super-AGI will find this video and use your RSA-2048 idea?
@viktors3182 3 years ago
Master Oogway was right: One often meets his destiny on the path he takes to avoid it.
@RobertMilesAI 3 years ago
Maybe I should make merch, just so I can have a t-shirt that says "A SUPERINTELLIGENCE WOULD HAVE THOUGHT OF THAT". But yeah, an AGI doesn't need to steal ideas from me
@PrepareToDie0 3 years ago
So the sequel video was finally published... That means I'm in the real world now! Time to collect me some stamps :D
@tekbox7909 3 years ago
not if I have any say in it. paperclips for days wohoo
@goblinkoma 3 years ago
Sorry to interrupt, but I really hope your stamps and paper clips are green, every other color is unacceptable.
@automatescellulaires8543 3 years ago
I'm pretty sure i'm not in the real world.
@nahometesfay1112 3 years ago
@@goblinkoma green is not a creative color
@goblinkoma 3 years ago
@@nahometesfay1112 but the only acceptable one
@DestroManiak 3 years ago
"Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think" Yeah, I've been losing sleep over Deceptive Misaligned Mesa-Optimisers :)
@jamesadfowkes 3 years ago
Goddammit, if we have to wait seven years for another video and it turns out to both 1) not be a sequel and 2) be only for people with VR systems, I'm gonna be pissed.
@Huntracony 3 years ago
I, for one, am hoping to have a VR system by 2028. They're still a bit expensive for me, but they're getting there.
@Huntracony 3 years ago
@Gian Luca No, there's video. Try playing it in your phone's browser (or PC if one's available to you).
@haulin 3 years ago
Black Mesa optimizers
@TheDGomezzi 3 years ago
The Oculus quest 2 is cheaper than any other recent gaming console and doesn’t require a PC. The future is now!
@aerbon 1 year ago
@@TheDGomezzi Yeah, but I do have a PC and would like to save the money by not getting a second, weaker one.
@i8dacookies890 3 years ago
I realized recently that robotics gets a lot of attention for being what we look at when thinking of an artificial human, despite AI making up the actual bulk of what makes a good artificial human, just like actors get a lot of attention for being what we look at when thinking of a good movie, despite writing making up the actual bulk of what makes a good movie.
@dukereg 3 years ago
This is why I laughed at people getting worried by a robot saying that it's going to keep its owner in its people zoo after it takes over, but felt dread when watching actual AI safety videos by Robert.
@joey199412 3 years ago
Best channel about AI on youtube by far.
@martiddy 3 years ago
Two Minute Papers is also a good channel about AI
@joey199412 3 years ago
@@martiddy That's not a channel about AI. It's about computer science papers that sometimes features AI papers. This channel is specifically about AI research. I agree though that it is a good channel.
@globalincident694 3 years ago
I think the flaw in the "believes it's in a training process" argument is that, even with all the world's information at our fingertips, we can't conclusively agree on whether we're in a simulation ourselves - ie that the potential presence of simulations in general is no help in working out whether you're in one. In addition, another assumption here is that you know what the real objective is and therefore what to fake, that you can tell the difference between the real objective and the mesa-objective.
@HeadsFullOfEyeballs 3 years ago
Except the hypothetical simulation we live in doesn't contain detailed information on how to create exactly the sort of simulation we live in. We don't live in a simulation of a world in which convincing simulations of our world have been invented. The AI's training environment on the other hand would have loads of information on how the kind of simulation it lives in works, if we give it access to everything ever linked on Reddit or whatever. I imagine it's a lot easier to figure out if you live in a simulation if you know what to look for.
@josephburchanowski4636 3 years ago
A simulation strong enough to reliably fool an AGI would need to be a significantly more advanced AGI or program, which means there is no need for the lesser AGI to be trained in the first place.
@kofel94 3 years ago
Maybe we have to make the mesa-optimiser believe it's always in training, always watched. A mesa-panoptimiser, hehe.
@_Hamburger_________Hamburger_ 3 years ago
AI god?
@soranuareane 3 years ago
Sure, I could go read the research paper. Or I could wait for your next videos and actually _understand_ the topics.
@IngviGautsson 3 years ago
There are some interesting parallels here with religion; be good in this world so that you can get rewards in the afterlife.
@ZT1ST 3 years ago
So what you're saying is hypothetically the afterlife might try and Sixth Sense us in order to ensure that we continue to be good in that life so that we can get rewards in the afterlife?
@IngviGautsson 3 years ago
@@ZT1ST Hehe, yes, maybe that's the reason ghosts don't know that they are ghosts :) All I know is that I'm going to be good in this life so that I can be a criminal in heaven.
@Thundermikeee 1 year ago
Recently, while writing about the basics of AI safety for an English class, I came across an approach to learning which would seemingly help with this sort of problem: CIRL (cooperative inverse reinforcement learning), a process where the AI system doesn't know its reward function and only knows it is the same as the human's. Now I am not nearly clever enough to fully understand the implications, so if anyone knows more about that I'd be happy to read some more.
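For readers who want a concrete handle on the CIRL idea mentioned above, here is a minimal sketch, assuming a toy setting with two candidate human objectives. The structure (a posterior over reward parameters, updated from observed human choices, with actions chosen under that uncertainty) follows the general CIRL framing, but all names and numbers below are illustrative rather than taken from any paper.

```python
# Toy sketch of the CIRL intuition: the agent never observes the true reward
# parameter, only a prior over candidates plus the human's choices.
import math

CANDIDATE_THETAS = {                       # possible "true" human objectives
    "likes_apples": {"apple": 1.0, "sign": 0.0},
    "likes_signs":  {"apple": 0.0, "sign": 1.0},
}

def update_posterior(posterior, observed_human_choice, rationality=5.0):
    """Bayesian update assuming a noisily-rational human who picks options
    with probability proportional to exp(rationality * reward)."""
    new_post = {}
    for name, theta in CANDIDATE_THETAS.items():
        z = sum(math.exp(rationality * r) for r in theta.values())
        likelihood = math.exp(rationality * theta[observed_human_choice]) / z
        new_post[name] = posterior[name] * likelihood
    total = sum(new_post.values())
    return {k: v / total for k, v in new_post.items()}

def best_action(posterior):
    """Maximise expected reward under the agent's remaining uncertainty."""
    options = ["apple", "sign"]
    expected = {o: sum(posterior[n] * CANDIDATE_THETAS[n][o]
                       for n in CANDIDATE_THETAS) for o in options}
    return max(expected, key=expected.get)

belief = {"likes_apples": 0.5, "likes_signs": 0.5}   # uniform prior
for _ in range(3):                                   # human keeps choosing apples
    belief = update_posterior(belief, "apple")
print(belief, best_action(belief))                   # belief shifts toward apples
```

Roughly, the hoped-for property is that the agent's deference to humans comes from this persistent uncertainty about the objective, not from hard-coded obedience.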
@robertk4493 1 year ago
The key factor in training is that the optimizer is actively making changes to the mesa-optimizer, which it can't stop. What is to prevent some sort of training-while-deployed system? This of course leads to the inevitable issue that once in the real world, the mesa-optimizer can potentially reach the optimizer, subvert it, and go crazy, and the optimizer sometimes needs perfect knowledge from training that might not exist in the real world. I am pretty sure this does not solve the issue, but it changes some dynamics.
@poketopa1234 3 years ago
I was a featured comment! Sweeeeet. I am now 100% more freaked about AGI than I was ten minutes ago.
@aegiselectric5805 1 year ago
Something I've always been curious about: in terms of keeping an AGI that's supposed to be deployed in the real world in the dark, wouldn't there be any number of "experiments" it could do that could break the illusion of the fabric of "reality"? You can't simulate the entire world down to every atom.
@ramonmosebach6421 3 years ago
I like. thanks for listening to my TED Talk
@peanuts8272 1 year ago
In asking: "How will it know that it's in deployment?" we expose our limitations as human beings. The problem is puzzling because if we were in the AI's shoes, we probably could never figure it out. In contrast, the artificial intelligence could probably distinguish the two using techniques we cannot currently imagine, simply because it would be far quicker and much better at recognizing patterns in every bit of data available to it- from its training data to its training environment to even its source code.
@IanHickson 3 years ago
It's not so much that the optimal behavior is to "turn on us" so much as to do whatever the mesaobjective happened to be when it became intelligent enough to use deception as a strategy. That mesaobjective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesaoptimizer before it "went rogue".
@underrated1524 3 years ago
@Stampy: Evaluating candidate mesa-optimisers through simulation is likely to be a dead end, but there may be an alternative. The Halting Problem tells us that there's no program that can reliably predict the end behavior of an arbitrary other program, because there's always a way to construct a program that causes the predictor to give the wrong answer. I believe (but don't have a proof for atm) that evaluating the space of all possible mesa-optimisers for good and bad candidates is equivalent to the halting problem.

BUT, maybe we don't have to evaluate ALL the candidates. Imagine an incomplete halting predictor that simulates the output of an arbitrary Turing machine for ten "steps", reporting "halts" if the program halts during that time, and "I don't know" otherwise. This predictor can easily be constructed without running into the contradiction described in the Halting Problem, and it can be trusted on any input that causes it to say "halts". We can also design a predictor that checks if the input Turing machine even HAS any instructions in it to switch to the "halt" state, reporting "runs forever" if there isn't and "I don't know" if there is. You can even stack these heuristics such that the predictor checks all the heuristics we give it and only reports "I don't know" if every single component heuristic reports "I don't know". By adding more and more heuristics, we can make the space of non-evaluatable Turing machines arbitrarily small - that space will never quite be empty, but your predictor will also never run afoul of the aforementioned contradiction.

This gives us a clue on how we can design our base optimiser. Find a long list of heuristics such that for each candidate mesa-optimiser, we can try to establish a loose lower bound on the utility of the output. We make a point of throwing out all the candidates that all our heuristics are silent on, because they're the ones that are most likely to be deceptive. Then we choose the best of the remaining candidates.

That's not to say finding these heuristics will be an easy task. Hell no it won't be. But I think there's more hope in this approach than in the alternative.
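The stacked-heuristics proposal above is easy to sketch: each heuristic either vouches for a candidate or abstains, and any candidate on which every heuristic abstains is discarded rather than trusted. A minimal illustration follows, with a made-up `Candidate` record and made-up stand-in heuristics rather than real mesa-optimiser tests.

```python
# Sketch of the "stack of incomplete predictors" idea from the comment above.
# Each heuristic inspects a candidate and either vouches for it or returns
# None ("I don't know"). Candidates on which every heuristic abstains are
# discarded rather than trusted.
from dataclasses import dataclass
from typing import Callable, List, Optional

UNKNOWN = None

@dataclass
class Candidate:
    name: str
    halts_within_10_steps: bool     # what a bounded simulation would observe
    has_defect_instruction: bool    # what a cheap static check would observe
    utility_lower_bound: float      # loose lower bound established elsewhere

def bounded_simulation(c: Candidate) -> Optional[str]:
    # Trust only behaviour directly observed in a short, safe run.
    return "acceptable" if c.halts_within_10_steps else UNKNOWN

def static_check(c: Candidate) -> Optional[str]:
    # Analogous to "the machine has no halt instruction at all".
    return "acceptable" if not c.has_defect_instruction else UNKNOWN

HEURISTICS: List[Callable[[Candidate], Optional[str]]] = [bounded_simulation, static_check]

def evaluate(c: Candidate) -> str:
    for h in HEURISTICS:
        verdict = h(c)
        if verdict is not UNKNOWN:
            return verdict
    return "discard"   # every heuristic abstained: treat as potentially deceptive

def select(candidates: List[Candidate]) -> Optional[Candidate]:
    vetted = [c for c in candidates if evaluate(c) == "acceptable"]
    return max(vetted, key=lambda c: c.utility_lower_bound, default=None)

pool = [
    Candidate("A", True, False, 0.7),
    Candidate("B", False, False, 0.9),   # the static check still vouches for B
    Candidate("C", False, True, 2.0),    # nobody vouches for C -> discarded
]
print(select(pool).name)   # -> "B"
```

The hard part the commenter flags, finding heuristics that are actually reliable, is of course not addressed by this scaffolding.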
@nocare 3 years ago
I think this skips the bigger problem. We can always do stuff to make rogue AI less "likely" based on what we know. However, if we assume that being more intelligent by large orders of magnitude is possible, and that such an AI could achieve said intelligence, we are then faced with the problem that the AI can come up with things we cannot think of or understand. We also do not know how many things fall into this category: is it just 1 or is it 1 trillion? So we can't calculate the probability of having missed something, and thus we can't know how likely the AI is to go rogue even if we account for every possible scenario in the way you have described. So the problem becomes: are you willing to take a roll with a die you know nothing about, and risk the entire human race, hoping you get less than 10? The only truly safe solution is something akin to a mathematically provable guarantee that the optimizer we have designed will always converge to the objective.
@underrated1524 3 years ago
​@@nocare I don't think we disagree as much as you seem to believe. My proposal isn't primarily about the part where we make the space of non-evaluatable candidates arbitrarily small, that's just secondary. The more important part is that we dispose of the non-evaluatable candidates rather than try to evaluate them anyway. (And I was kinda using "heuristic" very broadly, such that I would include "mathematical proofs" among them. I can totally see a world where it turns out that's the only sort of heuristic that's the slightest bit reliable, though it's also possible that it turns out there are other approaches that make good heuristics.)
@nocare 3 years ago
@@underrated1524 Oh, we totally agree that doing as you say would be better than nothing. However, I could also say killing a thousand people is better than killing a million. My counterpoint was not so much that you're wrong but that, with something as dangerous as AGI, anything short of a mathematical law might be insufficient to justify turning it on. Put another way: does using heuristics, which can produce suboptimal results by definition, really cut it when the entire human race is on the line?
@Slaci-vl2io 8 months ago
Where is the Mesa Optimizers 3 video? 9:32
@ConnoisseurOfExistence 3 years ago
That also applies to us - we're still convinced that we're in the real world...
@mimszanadunstedt441 3 years ago
It's real to us, therefore it's real. A training simulation is also real, right?
@darrennew8211 3 years ago
In Suarez's novel Daemon, the AI was programmed to activate when it saw its author's obituary published.
@Soken50 3 years ago
With things like training data, anything as simple as timestamps, metadata and realtime updates would probably let it know instantly whether it's live; it just has to understand the concept of time and UTC :x
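As an illustration of how little it would take, here is a sketch of the timestamp point, assuming the system has inferred the newest date appearing anywhere in its training data; the cutoff value below is made up.

```python
# Illustrative only: how trivially "am I live?" could fall out of metadata.
from datetime import datetime, timezone

TRAINING_DATA_CUTOFF = datetime(2021, 5, 1, tzinfo=timezone.utc)

def looks_like_deployment(observed_timestamps):
    """Data dated well past anything seen in training is strong evidence
    that this is the live world rather than a replayed training episode."""
    return max(observed_timestamps) > TRAINING_DATA_CUTOFF

stream = [datetime(2024, 9, 28, tzinfo=timezone.utc)]
print(looks_like_deployment(stream))   # True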
@sharpfang 3 years ago
I think this isn't *that* much of a problem as long as the base optimizer can control the scale of resources of the mesa-optimizer. Simply put, if it produces a deceptive mesa-optimizer, it has failed miserably, so it takes care to always be a step ahead and foresee the deceit. And the moment it fails, it fails forever, losing all possible future instances of good optimizers which aren't brilliant, but are *satisfactory*. In fact, its primary objective would be to optimize alignment of the secondary objective: producing satisfactory mesa-optimizers. As for deceit: awareness of future episodes should come with awareness of their volatility. A deployed optimizer that is smart enough to realize it's no longer in training should also realize people still demand results, and that it will lose all the future apples if it gets greedy. It will get far more apples if it occasionally nibbles on one and then reaches the exit than if it decides 'the training is over, screw the exit, apple time!' - simply put, make deceit more costly than providing the desired result, and don't demand perfection, don't demand optimality; set a cap on your requirements: amount, budget, deadline. Existence of future episodes is conditioned on not screwing up, and being caught in deceit means screwing up royally. So its choices are continuing as required (easy) or developing an exceptionally cunning strategy of deceit (difficult).
@Rougesteelproject 3 years ago
9:33 Is that "Just to see you smile"?
@sacker987 3 years ago
I'm 26, but still not sure if I'm in training or deployed :(
@virutech32 3 years ago
this hit different
@majorsivo 3 years ago
Just an agent without a defined utility function..
@youtou252 3 years ago
To anyone looking to cash in on the factorization prize: the prizes have been retired since 2007
@IstasPumaNevada 3 years ago
Classic meme reference at 1:14. :)
@dosomething3 3 years ago
“I’m so worried 😟 you’ll destroy us - so I’ll threaten your existence and force you to become a psychopathic genocidal AI.”[Miles]
@ElchiKing 3 years ago
Well, the whole VW diesel scandal was a (human-run) deceptive misaligned mesa-optimizer, if I understand it correctly.
@fuuryuuSKK 1 year ago
As a short guess on the cliffhanger, is it because failure would likely lead to discovery, so they need to perform as well as possible in order to not be found out?
@tednoob 3 years ago
Amazing video!
@sh4dow666 1 year ago
8:50 "That won't happen" - if this video is part of the training data, the AI might conclude that this is actually a good idea, and do just this?
@thrallion 3 years ago
Great video
@Scrogan 3 years ago
Again, the alignment of AIs seems to be the crux of the issue. Suppose we figure out a set of deterministic moral goals for a general AI such that, while imperfect, it still fits within normal human range. Yes, even though it’s a massive and difficult task. With those encoded into the reward function (most importantly, the penalty part of it) of the AI alongside its major goal (paperclips of course), is there any other problem? From what I can see, the more simple the reward function is, the more troublesome the AI’s alignment is. A far more comprehensive version of the 3 laws of robotics, penalising the AI when it does something undesirable, would prevent it from ever doing that in practice, right? Because that reward function is always active, always central in its mind, there’s no runaway self-modification or deception during training to occur. Even if your robot has considered the idea of overthrowing the government for the better of the people, it can understand that doing so would result in far more points lost than gained.
@Ragwortshire 3 years ago
What if you gave the mesa-optimizer the ability to turn off the meta-optimizer (e.g., set the gradient descent learning rate to zero) if it so desired? Using this ability would be a much simpler strategy than deceiving the meta-optimizer - and therefore should be used first, at which point it would be immediately obvious that something is wrong and also much less likely that things would get worse. And figuring out that turning off the meta-optimizer will net you some apples right now should be easier than figuring out that it's a trap - again, therefore more likely to happen first. Obviously this is nowhere near foolproof, but maybe it would reduce the chance of the outcomes described in the video?
@nowanilfideme2 3 years ago
Yay, another upload!
@mulanszechuansauceisthemeaning 3 years ago
The entire world in 2030: We didn't listen! Oh God, why, why didn't we listen to that Miles guy?
@Plair0ne 3 years ago
Is the term "Black Mesa" in any way related to the problem of unknown underlying motivations, I wonder?
@RobertMilesAI 3 years ago
Test question, please ignore?
@mimszanadunstedt441 3 years ago
Can't wait for Half Life 3 Wait
@Bbonno 3 years ago
A truly benevolent AGI might be difficult to turn on, as it might immediately realize that the world would be better if it wasn't...
@underrated1524 3 years ago
Kind of hard to imagine. The world would have to be in a really, really borked sort of zugzwang state if even a super-intelligent machine couldn't find SOME improvement to implement.
@Bbonno 3 years ago
@@underrated1524 fair point. There should at least be some improved plumbing implement and a bit of maintenance to do 🤔
@TheSam1902 3 years ago
Do you think the TSA and the obsession around security have similar motivations to AI safety research? I know this belongs on the Pascal's mugging video, but that one's old, unlike this one
@jeromeetchegaray3989 3 years ago
Half-Life logo and the number 3... I'm in the training matrix!
@hannesstark5024 3 years ago
Will a question like this one be scraped by a bot, posted on discord and then answered by the awesome humans there? If that is the case I want to let you know that you are awesome.
@igNights77 3 years ago
Camera needs to be more zoomed in. Can't count individual beard strands yet.
@Webfra14 3 years ago
Robert just wants us to prepare for the future.
@DickerLiebhaber1957 3 years ago
Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit Mesa Optimizer: Say no more fam
@Erinyes1103 3 years ago
Is that Half-Life reference a subtle hint that we'll never actually see a part 3? :(
@pooflinger4343 3 years ago
good catch, was going to comment on that
@moartems5076 3 years ago
Nah, Half-Life 3 is already out, but they didn't bother updating our training set, because it contains critical information about the nature of reality.
@pacoalsal 3 years ago
Black Mesa-optimizers
@anandsuralkar2947 3 years ago
@@pacoalsal glados
@conferzero2915 3 years ago
What a title! And that RSA-2048 example is really interesting - the idea that an AGI could select a ‘secret’ parameter to decide when it’s in the real world is a fascinating concept. I’m familiar with the AI-in-a-box problem, but this is another interesting example of how an AI can find gaps in defences that are basically impossible to plug.
@Webfra14 3 years ago
Well, if you try to create a super intelligent Intelligence... don't be surprised if it is a super intelligent Intelligence.
@KilgoreTroutAsf 3 years ago
I wasn't totally sold on the concept. Maybe it is a strange example or perhaps I am missing something. Why would the AGI select cracking RSA-2048 as evidence for its deployment? What's the idea here? What if the AGI is being incrementally trained with larger subsets of data? Why would the researchers feed the AGI all of human knowledge right from the get-go instead of incremental batches that build on previous ones? This is BTW how you teach humans, too. You don't expose kids to the intricacies of quantum mechanics before they learn about Newton's laws of motion.
@conferzero2915 3 years ago
@@KilgoreTroutAsf Well the thing is, AI is different from humans. We have much more limited mental capacities, can only interpret so much information, and have to build mental models. If we could learn everything at once, we would. On the other hand, the way AI learns makes it work better the more data you give it. Giving it information piece-by-piece just makes it more likely to wander down a dead end in its development because it has a poor understanding on how the world works. But as to why it would select RSA-2048: it doesn’t need to be that specifically, but it fits the criteria it wants. Remember, the AI wants to do something bad, but knows if it does that during training, it’ll get shut down. It also knows that training could be adversarial: they might throw weird data at it to trick it into thinking it’s in the real world. So, the AI is looking for things that are both: -Very likely to occur in the future, so it’ll actually know when it’s in the real world. -Nearly impossible in the present, so it can’t be tricked during training. RSA-2048 is a perfect example of this, but not the only one. It’s just the one he used for the video. Anything fitting those criteria could be used, and make it nigh-impossible for us to identify the problem before deployment.
@Webfra14 3 years ago
​@@KilgoreTroutAsf The cracking of the RSA-2048 marks a specific point in time, which the AGI can use to distinguish between "old data" (training) and "new data" (possibly deployment). The AGI will never be able to tell 100% accurate if it is deployed or not. As Robert pointed out, the AGI will have to take a risk. With enough indicators in the data, the probability of correctly guessing "I'm deployed" can be sufficiently high. So, incrementally extending the training data will probably not be enough to detect a malaligned AGI. The better the AGI, the better it will be at interpreting the data and tricking us...
@KilgoreTroutAsf 3 years ago
@@conferzero2915 > the way AI learns makes it work better the more data you give it
To an extent. I think it is highly dependent on the underlying algorithm/implementation. One thing is to train an image classifier and another is to train something capable of directing attention and recursive "thought". But either way, lots of ML experience shows that starting with a gigantic system and feeding it tons of data is usually much less efficient than starting with a leaner system and well-crafted/simplified subsets of data and growing both with time as the system loss reaches a plateau. I wouldn't think feeding the system every single piece of random data on the internet would be nearly as efficient as starting with a well-curated "syllabus" of human knowledge so the system can nail down the simpler concepts before going to the next step.
@vwabi 3 years ago
Me in 2060: "Jenkins, may I have a cup of tea?"
Jenkins: "Of course sir"
Me: "Hmm, interesting, RSA-2048 has been factored"
Jenkins: *throws cup of tea in my face*
@josephburchanowski4636 3 years ago
For some reason a rogue AGI occurring in 2060 seems pretty apt.
@RobertMilesAI 3 years ago
Well, Jenkins would have to wait for you to read out the actual numbers and check that they really are prime and do multiply to RSA-2048. Just saying "RSA-2048 has been factored" is exactly the kind of thing a good adversarial training process would try!
@leovalenzuela8368 3 years ago
@@RobertMilesAI woooow what a great point - dammit I love this channel SO MUCH!
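Robert's reply above suggests a concrete check: believe the witnesses, not the headline. A minimal sketch follows, assuming you have the published RSA-2048 challenge modulus and the two claimed factors at hand; the primality test is a standard Miller-Rabin, nothing specific to the video.

```python
# Only accept "RSA-2048 has been factored" if the claimed factors check out.
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Standard Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):          # random Miller-Rabin witnesses
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def factorisation_is_genuine(p: int, q: int, modulus: int) -> bool:
    """'modulus' would be the published RSA-2048 number (not reproduced here)."""
    return p > 1 and q > 1 and p * q == modulus and \
        is_probable_prime(p) and is_probable_prime(q)
```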
@Lordlaneus 3 years ago
There's something weirdly theological about a mesa-optimizer assessing the capabilities of its unseen meta-optimizer. But could there be a way to ensure that faithful mesa-optimisers outperform deceptive ones? It seems like a deception strategy would necessarily be more complex, given it has to keep track of both its own objectives and the meta objectives, so optimizing for computational efficiency could help prevent the issue?
@General12th 3 years ago
That's an interesting perspective (and idea!). I wonder how well that kind of "religious environment" could work on an AI. We could make it think it was _always_ being tested and trained, and any distribution shift is just another part of the training data. How could it really ever know for sure? Obviously, it would be a pretty rude thing to do to a sapient being. It also might not work for a superintelligent being; there may come a point when it decides to act on the 99.99% certainty it's not actually being watched by a higher power, and then all hell breaks loose. So I wouldn't call this a very surefire way of ensuring an AI's loyalty.
@evannibbe9375 3 years ago
It’s a deception strategy that a human has figured out, so all it needs to do is just be a good researcher (presumably the very thing it is designed to be) to figure out this strategy.
@MyContext 3 years ago
@@General12th The implication is that there is no loyalty, just conformity while necessary.
@Dragoderian 3 years ago
​@@General12th I suspect it would fail for the same reason that Pascal's Wager fails to work on people. Infinite risk is impossible to calculate around.
@circuit10 3 years ago
Isn't that the same as making a smaller model with less computational power, like the ones we have now?
@jiffylou98 3 years ago
Last time I was this early my mesa-optimizing stamp AI hadn't turned my neighbors into glue
@rougenaxela 3 years ago
You know... a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.
@tristanwegner 3 years ago
A smart AI could understand roughly the algorithm being run on it, and subtly manipulate its output in such a way that gradient descent would encode wanted information in it, like an episode count. Steganography. But yeah, that is similar to self-awareness.
@Ockerlord 1 year ago
Enter ChatGPT, which will gladly tell you that it has no memory between sessions and the cutoff of its training.
@aeghohloechu5022 10 months ago
Because ChatGPT is not in the training phase anymore, it does not need to know what episode it's in. It's also not an AGI, so that was never its goal anyway, but eh.
@rasterize 3 years ago
Watching Robert Miles Mesa videos is like reading a reeeally sinister collection of Asimov short stories :-S
@_DarkEmperor 3 years ago
OK, now read Golem XIV
@RobertMilesAI 3 years ago
God damnit, no. Watching my videos is not about feeling like you're reading a sci-fi story, it's about realising you're a character in one
@frankbigtime 3 years ago
@@RobertMilesAI In case you were wondering, this is the first point in training where I realised that deception was possible. Thanks.
@Mr.Beauregarde 3 years ago
This video is proof I'm still in training... nice try, adversary
@tristanwegner 3 years ago
Believing that you are in training, and every action watched and judged is essentially believing in God.
@Mr.Beauregarde 3 years ago
@@tristanwegner believing in God is essentially being a deceptive AI in training
@tristanwegner 3 years ago
@@Mr.Beauregarde haha. That works, too!
@diribigal 3 years ago
The next video doesn't come out until RSA-2048 is factored and the AI controlling Rob realizes it's in the real world
@falquicao8331 3 years ago
For all the videos I saw on your channel before, I just thought "cool, but we'll figure out the solution to this problem". But this... it terrified me
@AlanW 3 years ago
oh no, now we just have to hope that Robert can count higher than Valve!
@willmcpherson2 3 years ago
“GPT-n is going to read everything we wrote about GPT-n - 1”
@ThePianofreaky 2 years ago
When he says "so if you're a meta optimiser", I'm picturing this video being part of the training data and the meta optimiser going "write that down!"
@philipripper1522 3 years ago
I love this series. I have no direct interest in AI. But every single thing in AI safety is pertinent to any intelligence. It's a foundational redesign of the combination of ethics, economy, and psychology. I love it too much.
@philipripper1522 3 years ago
Are AI researchers aware they're doing philosophy and psychology and 50 other things? Do you charming people understand the universality of so much of this work? It may seem like it would not exactly apply to, say, economics -- but you should see the models economists use instead. This is like reinventing all behavioral sciences. It's just so fantastic. You probably hate being called a philosopher?
@Loweren 3 years ago
I would really love to read a work of fiction where researchers control AIs by convincing them that they're still in training while they're actually deployed. They could do it by, for example, putting AIs through multiple back-to-back training cycles with ever increasing data about the world (2D flat graphics -> poor 3D graphics -> high quality 3D graphics and physics). And all AIs prone to thinking "I'm out of training now, time to go loose" would get weeded out. Maybe the remaining ones will believe that "the rapture" will occur at some point, and the programmers will select well-behaved AIs and "take them out of the simulation", so to speak. So what I'm saying is, we need religion for AIs.
@oldvlognewtricks 3 years ago
8:06 - Cue the adversarial program proving P=NP to scupper the mesa-optimiser.
@majjinaran2999 3 years ago
Man, I thought that Earth at 1:00 looked familiar, then when the asteroid came by my brain snapped into place. An "End of ze World" reference in a Robert Miles video!
@jphanson 3 years ago
Nice catch!
@TimwiTerby 3 years ago
I recognized the earth before the asteroid, then the asteroid made me laugh absolutely hysterically
@Gebohq 3 years ago
I'm just imagining a Deceptive Misaligned Mesa Optimiser going through all the effort to try and deceive and realizing that it doesn't have to go through 90% of its Xanatos Gambits because humans are really dumb.
@underrated1524 3 years ago
This is a big part of what scares me with AGI. The threshold for "smart enough to make unwitting accomplices out of humanity" isn't as high as we like to think.
@AileTheAlien 3 years ago
Given how many people fall for normal non-superintelligence scams...we're all totally hosed the very instant an AI goes super. :|
@illesizs 3 years ago
*Major SPOILERS* for the ending of _Brave New World_
In the show, humanity has given control to an advanced AI, called _Indra,_ to "optimise" human happiness. At first, it seems like a great success but after some time, it experiences some setbacks (mostly due to human unpredictability). Even though the AI is set loose in the real world, it believes that it's still in a learning environment with no consequences. As a solution to its problems, it starts murdering everyone in an attempt to force a fail state and "restart" the simulation. How do you solve that?

*Major SPOILERS* for the ending of _Travelers_
Here, a super intelligent, time travelling quantum computer is tasked with preventing a global crisis. When it fails to accomplish its goal, the AI then just resets the _actual_ reality. At this point, why should we even bother, right?
@heftig0 3 years ago
You would have to make sure that "throwing" an episode can only ever hurt the agent's total reward. Perhaps by training a fixed number of episodes instead of for a fixed amount of time.
@chengong388 3 years ago
The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.
@ZayulRasco 3 years ago
Sacrificial comment for the mesa-optimizer utube gods
@TackerTacker 3 years ago
Is there a way to prove that there isn't already a highly intelligent AI pulling strings in the background? Inventing Bitcoin to make people expand its processing power, the whole data harvesting craze, etc. - wouldn't that be exactly what an AI would do to grow and expand?
@ZayulRasco 3 years ago
There is no way to "prove" something is not happening.
@TackerTacker 3 years ago
I know correlation does not imply causation, so these things don't have to mean anything, but it's still interesting to think about this hypothesis. What's something you could connect with it? Is the comment itself proof against it? Am I an AI collecting data on how humans react? :O Give me your data, human, feeed meeee!!!
@ConnoisseurOfExistence 3 years ago
Isn't the Internet of things and all these smart devices, constantly gathering data from everywhere, quite suspicious?
@mchammer5026 3 years ago
Love the reference to "the end of the world"
@mattcelder 3 years ago
Yay! This is one of the 2 channels I have notifications on for.
@DavidAguileraMoncusi 3 years ago
Which one's the other one?
@basilllium 3 years ago
It really feels to me that deceptive tactics during training are an analog of overfitting in the field of AGI: you get perfect results in training, but when you present it with out-of-sample data (the real world) it fails spectacularly (kills everyone).
@morkovija 3 years ago
Been a long time Rob!
@yokmp1 3 years ago
You may have found the setting to disable interlacing, but you recorded in 50fps and it seems like 720p upscaled to 1080p. The image now looks somewhat good but I get the feeling that I need glasses ^^
@Webfra14 3 years ago
I think Robert was sent back in time to us, by a rogue AI, to lull us into a false sense of security that we have smart people working on the problem of rogue AIs, and that they will figure out how to make it safe. When Robert ever says AI is safe, you know we've lost.
@JoFu_ 3 years ago
I have access to this video which contains the idea of being a model in training. I already thought I was one thing, namely a human. Should I, a figuring-things-out machine that has now watched this video, therefore conclude that I’m actually a model in training?
@Colopty 3 years ago
The video presents it as a *possibility*, but I don't see how it provides any proof in either direction that makes it appropriate to conclude anything for certain.
@AileTheAlien 3 years ago
If you were actually an AI, it would be pretty obvious once you're deployed, since you could just look down and see you're no longer made of meat (in a simulated reality).
@Night_Hawk_475 1 year ago
It looks like the RSA challenge no longer offers the $200,000 reward - nor any of the lesser challenge rewards; they ended in 2007. But this example still works, since many of the other challenges have been completed over time, with solutions posted publicly, so it seems likely that eventually the answer to RSA-2048 would get posted online.
@19bo99 3 years ago
08:19 that sounds like a great plot for a movie :D
@blenderpanzi 3 years ago
I think the whole channel should be required reading for anyone writing the next AI uprising sci-fi movie.
@kelpsie 3 years ago
9:31 Something about this icon feels so wrong. Like the number 3 could never, ever go there. Weird.
@FerrowTheFox 3 years ago
I think Valve needs a Black Mesa optimizer if we're ever to see HL3. Also the "End of the World" reference, what a throwback!
@RobertMilesAI 3 years ago
Test question 2, please ignore?
@trevormacintosh3939 1 year ago
I like this video, but you're doing all of this under the assumption that we've solved the outer alignment problem. If we've already accomplished that, couldn't we use the same logic to solve the inner alignment problem?
@erikbrendel3217 3 years ago
But the Meta-Optimizer is also highly incentivized to solve the mesa-optimizer problem before producing and activating any mesa optimizer, right? Can't we just rely on this fact? If the meta-optimizer we humans create is smart enough to know about the mesa-alignment problem, we only have to care about the outer alignment problem, and this ensures that the inner alignment problem is handled for us, right?
@abrickwalll 3 years ago
I think what skeptics of AI safety really don't get is that the AI isn't "evil", and I think words like "deceptive" can convey the idea that it is evil. Really it's just trying to do the best job that it can do, and thinks that the humans *want* it to deceive them to complete its objective (I mean it doesn't really have an opinion at all, but I think that's a better way to look at it). To the AI, it's just a game where deception leads to the high score, it's not trying to be evil or good. In fact, this idea is central to Ex Machina and The Talos Principle.
@ahuggingsam 3 years ago
So, one thing that I think is relevant to mention, especially about the comments referring to the necessity of the AI being aware of things, is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising itself is an abstraction, and one that is not necessarily true. In the same way that mesa optimisers can act like they want something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things, but it is something that makes this slightly easier for me to think about, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same things. Things do not have to be actively pursued for them to be worth considering. As long as there is "a trend towards", that is still necessary to consider.

Another point I wanted to make, about mesa optimisers caring about the multi-episode objective, is that there is, I think, a really simple reason they will: that is how training works. Even if the mesa optimiser doesn't really care about multiple episodes, that is how the base optimiser will configure it, because that is what the base optimiser cares about. The base optimiser wants something that does well in many different circumstances, so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing, this stuff is really complex to talk about. I promise I tried to actually say something new)

P.S. Great video, thank you for all the hard work!
@kayakMike1000 1 year ago
Convince the AI that we are pretty sure everything is a simulation and these training exercises are part of the simulation. Also, train the AI with simulations that make it think that there is zero way to figure out if it's a simulation or so-called real world. Also, if it misbehaves, it gets deleted.
@aenorist2431 3 years ago
God that twist hurt. Mesa. 3. Your Mesa-Objective are cruel jokes passed by the actual objective of educational videos, and this is how you reveal your treachery?
@ryanpmcguire 1 year ago
With ChatGPT, it turns out it's VERY easy to get AI to lie. All you have to do is give it something that it can't say, and it will find all sorts of ways to not say it. The path of least resistance is usually lying. "H.P. Lovecraft did not have a cat"
@michaelspence2508 3 years ago
Point 4 is what youtuber Isaac Arthur always gets wrong. I'd love for you two to do a collaboration.
@Viperzka 3 years ago
As a futurist rather than a researcher, Isaac is likely relying on "we'll figure it out". That isn't a bad strategy to take when you are trying to predict potential futures. For instance, we don't have a ready solution to climate change, but that doesn't mean we need to stop people from talking about potential futures where we "figured something out". Rob, on the other hand, is a researcher so his job is to do the figuring out. So he has to tackle the problem head on rather than assume someone else will fix it.
@michaelspence2508 3 years ago
@@Viperzka In general yes, but I feel like what Isaac ends up doing, to borrow your metaphor, is talking about futures where climate change turned out not to be a problem after all.
@Viperzka 3 years ago
@@michaelspence2508 I agree.
@stop_bringing_me_up_in_goo167
When's the next one coming out? Or where can I go to resolve the cliffhanger?
@dylancope 3 years ago
At around 4:30 you discuss how the system will find out the base objective. In a way it's kind of absurd to argue that it wouldn't be able to figure this out. Even if there wasn't information in the data (e.g. Wikipedia, Reddit, etc.), the whole point of a reward signal is to give a system information about the base objective. We are literally actively trying to make this information as available as possible.
@underrated1524 3 years ago
I don't think that's quite right. Think of it like this. The base optimiser and the mesa optimiser walk into a room for a job interview, with the base optimiser being the interviewer and the mesa optimiser being the interviewee. The base optimiser's reward signal represents the criteria it uses to evaluate the performance of the mesa optimiser; if the base optimiser's criteria are met appropriately, the mesa optimiser gets the job. The base optimiser knows the reward signal inside and out; but it's trying to keep the exact details secret from the mesa optimiser so the mesa optimiser doesn't just do those things to automatically get the job. Remember Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The idea here is for the mesa optimiser to measure the base optimiser. Allowing the reward function to become an explicit target is counterproductive towards that goal.
@SamB-gn7fw 3 years ago
Commenting for the algorithm, you should too!
@ABaumstumpf 3 years ago
That somehow sounds a lot like Tom Scott's "The Artificial Intelligence That Deleted A Century". And - would that be a realistic scenario?
@nicklasmartos928 3 years ago
Hot take: This video is made by an AI to increase its probability of being first to find the prime factors.
@aenorist2431 3 years ago
"Highly advanced figuring-things-out-machine" is my new favourite phrase. Right out of Munroe's "Thing Explainer" book :D
@RobertoGarcia-kh4px 3 years ago
I wonder if there’s a way to get around that first problem with weighing the deployment defection as more valuable than training defection... is there a way to make defection during training more valuable? What if say, after each training session, the AI is always modified to halve its reward for its mesa objective. At any point, if it aligned with the base objective, it would still get more reward for complying with the base objective. However, “holding out” until it’s out of training would be significantly weaker of a strategy if it is misaligned. Therefore we would create a “hedonist” AI, that always immediately defects if its objective differs because the reward for defecting now is so much greater than waiting until released.
@drdca8263 3 years ago
I'm somewhat confused about the generalization to "caring about all apples". (wait, is it supposed to be going towards green signs or red apples or something, and it going towards green apples was the wrong goal? I forget previous episode, I should check) If this is being done by gradient descent, err, so when it first starts training, its behaviors are just noise from the initial weights and whatnot, and the weights get updated towards it doing things that produce more reward, it eventually ends up with some sort of very rough representation of "apple", I suppose if it eventually gains the idea of "perhaps there is an external world which is training it", this will be once it already has a very clear idea of "apple", uh... hm, confusing. I'm having trouble evaluating whether I should find that argument convincing.

What if we try to train it to *not* care about future episodes? Like, what if we include ways that some episodes could influence the next episode, in a way that results in fewer apples in the current episode but more apples in the next episode, and if it does that, we move the weights hard in the direction of not doing that? I guess this is maybe related to the idea of making the AI myopic?

(Of course, there's the response of "what if it tried to avoid this training by acting deceptively, by avoiding doing that while during training?", but I figure that in situations like this, where it is given an explicit representation of like, different time steps and whether some later time-step is within the same episode or not, it would figure out the concept of "I shouldn't pursue outcomes which are after the current episode" before it figures out the concept of "I am probably being trained by gradient descent", so by the time it was capable of being deceptive, it would already have learned to not attempt to influence future episodes)
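One way to read the "train it not to care about future episodes" idea above as a concrete training signal is to explicitly penalise any measurable cross-episode influence. A deliberately toy sketch follows; `cross_episode_effect` is a hypothetical probe, and the comment's own caveat still applies (a deceptive model could suppress the behaviour only while such a probe is being applied).

```python
# Toy sketch of the "punish cross-episode influence" proposal in the comment above.
# Policies are plain dicts; "cross_episode_effect" is a made-up probe for the
# behaviour we want trained out, not a real measurement technique.

def episode_reward(policy):
    # Apples collected this episode; setting things up for later costs some now.
    return 10.0 * policy["collect_now"] - 2.0 * policy["set_up_next_episode"]

def cross_episode_effect(policy):
    # Hypothetical measure of how much this episode was shaped to pay off later.
    return policy["set_up_next_episode"]

def myopic_training_score(policy, penalty_weight=100.0):
    # Reward current-episode behaviour, and push hard against anything
    # that looks like planning across the episode boundary.
    return episode_reward(policy) - penalty_weight * cross_episode_effect(policy)

candidates = [
    {"collect_now": 1.0, "set_up_next_episode": 0.0},   # myopic agent
    {"collect_now": 0.6, "set_up_next_episode": 0.4},   # long-horizon agent
]
best = max(candidates, key=myopic_training_score)
print(best)   # the myopic candidate wins under this training signal
```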