
What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4 

Robert Miles AI Safety
154K subscribers
113K views

Three different approaches that might help to prevent reward hacking.
New Side Channel with no content yet!: / @robertmiles2
Where do we go now?: • Where do we go now?
Previous Video in the series: • Reward Hacking Reloade...
The Concrete Problems in AI Safety Playlist: • Concrete Problems in A...
The Computerphile video: • General AI Won't Want ...
The paper 'Concrete Problems in AI Safety': arxiv.org/pdf/1606.06565.pdf
With thanks to my excellent Patreon supporters:
/ robertskmiles
Steef
Sara Tjäder
Jason Strack
Chad Jones
Stefan Skiles
Katie Byrne
Ziyang Liu
Jordan Medina
Kyle Scott
Jason Hise
David Rasmussen
Heavy Empty
James McCuen
Richárd Nagyfi
Ammar Mousali
Scott Zockoll
Charles Miller
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
A Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Taylor Winning
Roman Nekhoroshev
Peggy Youell
Konstantin Shabashov
Dodd Almighty
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabs Kabs Kabs
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham
Filip
Bjorn Nyblad
Stefan Laurie
Tom O'Connor
Krethys
PiotrekM
Jussi Männistö
Matanya Loewenthal
Wr4thon

Science

Published: 29 Jun 2024

Comments: 270
@aenorist2431 6 years ago
I love how you could not get through "always serves the interest of the citizens" with a straight face. Hilariously depressing.
@inyobill 5 years ago
Governments at least have a mandate to work for the benefit of the citizens. Obviously the effectiveness is problematic. Other systems, such as corporations, have no such mandate. Would you prefer empowering a system that has a mandate to enact your goals, or a system that cares nothing for YOUR goals?
@manishy1 4 years ago
@@inyobill Unfortunately, this doesn't take into account the reward-hacking nature of government - it is ideal to create a system that rewards the government without benefiting the citizens. Furthermore, it is beneficial to trick the citizens into thinking they are benefited when, in fact, the government is exclusively benefited. There are only so many tax dollars and the government benefits most by keeping it all (which manifests as a ruling class). This is why politicians tend to have a disproportionate increase in wage relative to the general population. Whereas a corporation, which has no mandate to look after citizens (since it has no citizens, only employees and customers), will tend to act in the best interests of its goals, which is to become a lucrative business. In many instances it is beneficial to benefit the customer (and some employees) in order to generate more wealth. Of course, when certain entities (in particular, regulatory bodies) demand other goals have higher weight, that concept goes out the window as it becomes more beneficial to prioritize the desires of those regulatory elements. This is only practical when the reward (money) is completely controlled by that regulatory body (the various federal reserves). Look at Google AdSense vs the CCP social credit scheme. Both utilize an inordinate amount of surveillance, assisted by neural networks to monitor the behaviour of people. One produces handy advertisements and convinces me to buy things I already want (even if I don't know it yet) but the other will imprison me for the colour of my skin/where I was born. And in both instances, each system works exactly as designed.
@inyobill 4 years ago
@@manishy1 Unfortunately that paradigm doesn't take into account the greater complexity of the system, and self-awareness of the participants: government officials, employees and citizens. Do you vote? If citizens aren't getting a return on their investments, then they need to elect different representatives. Neither does your statement negate my original comment.
@manishy1 4 years ago
​@@inyobill Your argument implies governments have a mandate to work for the benefit of the citizens. I contradicted that. Self-awareness introduces more complexity to the system, certainly, but this doesn't contradict the observed phenomena - elected officials rarely represent the interests of the general populace; they pay lip service then proceed to employ an army of bureaucrats to suppress the people. I extrapolated that this behaviour can be described as reward hacking.
@inyobill 4 years ago
@@manishy1 That is the mandate. Governments are more or less successful in carrying out that mandate. You in no way proved that wrong. If your elected officials are not representing your interests, you and your ilk are voting for the wrong people.
@PragmaticAntithesis 5 years ago
2:48 "Your AI can't hack its reward function by exploiting bugs in your code if there are no bugs in your code." Brilliant!
@baranxlr 3 years ago
If I were the one writing AI, I would simply not make any mistakes. So simple
@giannis_m 1 year ago
@@baranxlr I am just built different
@texti_animates 10 months ago
@@baranxlr 2 years late but omg I'm trying it and it's so hard to write an AI
@IAmNumber4000 4 years ago
“Wireheading, where the fact that the reward system is a physical object in the environment means that the agent can get very high reward by physically modifying the reward system itself.” Drug addict robots are our future
@harryellis9571 1 year ago
There's actually a really interesting distinction between the two. Drugs tend to make addicts less capable, so stopping their intake isn't too difficult (imagine if heroin made you smarter and more active; you'd probably get pretty good at ensuring you always have some). This isn't the case for an AGI. A wireheaded AGI isn't just useless at completing its task, it's actively going to ensure you can't prevent it from wireheading itself. E.g. you try to take the bucket off its head and it kills you ... maybe they are similar to drug addicts in that sense
@pafnutiytheartist 1 year ago
@@harryellis9571 "Imagine if heroin made you smarter" - that's basically the plot of Limitless
@keroqwer1761 6 years ago
Pyro and Mei blasting their weapons on the toaster made my day :D
@NoobsDeSroobs 4 years ago
Kero Qwer toast*
@inthefade 4 years ago
I love how difficult this problem is because my first thought was that there should be a reward for the agent to be modified, but then I realized that would instantly subvert the other reward systems, because the AGI would act to try to be modified then. This channel has made me feel that reward systems are completely untenable and useless for an AGI.
@paradoxica424 1 year ago
we spend a quarter of a lifetime navigating arbitrary reward systems set up by large children who also don’t truly care about the reward systems … reward systems are also useless for humans imo, but to a lesser extent
@numbdigger9552 1 year ago
@@paradoxica424 pain and pleasure are reward systems. They are also VERY effective
@Krmpfpks 6 years ago
I was actually laughing out loud at 4:49 , thank you!
@SalahEddineH 6 years ago
Me too :D That was lowkey and perfect :D
@MasterNeiXD 6 years ago
Krmpfpks Even he couldn't hold it in.
@Stedman75 6 years ago
I love how he couldn't say it with a straight face... lol
@SecularMentat 6 years ago
Yuuuup. That was hilarious. I wonder how many takes he had to do to not laugh his ass off.
@spiveeforever7093 6 years ago
At the end he has a blooper, "take 17" XD
@NiraExecuto 6 years ago
4:47 "...ensuring that the government as a whole always serves the interests of the citizens. But seriously, I'm not that hopeful about this approach." Gee, I wonder why.....
@circuit10 4 years ago
I mean, we all think negatively about this, but honestly 99% of what governments do is for the good of the people, and it's so much better than 1000 years ago, for example, or than a dictatorship
@Orillion123456 4 years ago
@@circuit10 Well, for the most part, most modern governments are not strictly "better" than an absolute government system, just "less extreme". An absolute ruler (feudal, imperial, dictatorial, whatever) can wreak terrible havoc, sure, but can also implement significant positive changes quickly and easily; there are historical examples of both. Modern governments are optimized to make it really slow to change anything and for there to be many points at which a change can get paused or cancelled before being put into place - we are failing to get anything important done any time soon, but hey, at least no first-world nation has gone full Hitler yet, so we have that going for us? In the end the optimal government is one with actual power to quickly implement sweeping changes where necessary (like an absolute ruler of old times would have) but with a proper screening process to ensure competence and benevolence. Unfortunately such a thing is impossible to implement (you can't get 2 people, let alone the entire planet, to agree on which criteria make for a competent and/or benevolent ruler, and you can't physically implement reliable tests to ensure any given person meets them). So in a way politics and AI safety are kinda similar in terms of the problems they face.
@debaronAZK 4 years ago
@@circuit10 Where do you get this number from? Your ass? Income inequality has never been greater, and never before have so few people owned so much wealth and power as now, and it's only getting worse.
@dr-maybe 6 years ago
These videos of yours are great on many levels. The topic is extremely important, the way of explaining is very accessible, the humor is subtle yet brilliant, and the pacing is just perfect.
@metallsnubben 6 years ago
"I'm gonna start making more videos quickly so I can improve my ability to make videos" Why does this sound a bit familiar... ;)
@harrisonfackrell 3 years ago
That little grin and barely-repressed chuckle when you started talking about the government really got me.
@dannygjk 5 years ago
The political system, as it is currently administered (ha ha), selects the candidates that are most fit to win an election, which has a deleterious effect on society in general.
@KuraIthys 5 years ago
Yes. There's also some evidence that suggests the implementation of laws is itself biased towards laws chosen by those with the most influence on society, not by the largest part of society (and this is usually synonymous with the wealthiest).
@TheRealPunkachu 4 years ago
Votes have become a target so they are no longer a good measurement :/
@irok1 3 years ago
@@TheRealPunkachu true
@MarkusAldawn 2 years ago
@@TheRealPunkachu I'd clarify to "winning a plurality of votes," since there are definitely strategies which lose you votes but gain you voter _loyalty._ But yeah - the end goal is to align your vote, and help align other people's votes, to support the candidate that will do the most good things. There's probably a function you could draw up to describe "I know this person lies in 50% of their promises, but this person hasn't been elected to get to keep them, so I have to evaluate the likelihood of them keeping >50% of their promises," and vary it to your liking (maybe average promise-keeping of politicians most similar to them ideologically? But then that would fail based on the fact that a single politician won't be able to enact their campaign promise to, for example, change the national flag without support, so it's limited to what degree they *could* keep that promise? Certainly politicians would very quickly start clarifying their promises to be "I'll do my best" and so on). Anyway, humans are adversarial training networks for other humans, that's just what society means.
@himselfe 6 years ago
It's nice to see more of what AI safety researchers are considering to try and solve the problems of AI safety. Touching on what you said about software bugs, I think careful engineering should be an absolute cornerstone of AI development. Nature has natural selection and the inherent difficulty of survival to iron out bugs, and it doesn't care if entire species get wiped out in the debug process; we don't have that luxury. AI code should be considered critical code, and subject to the most stringent quality standards, not only to prevent the AI from exploiting bugs in its own code, but also to prevent malicious entities from doing the same to manipulate the AI.
@sperzieb00n 6 years ago
AGI always makes me think about the first and second Mass Effect, and how true AGI is basically illegal in that universe.
@Shrooblord 6 years ago
Ah, the last method you discuss is quite smart. I love the idea of the AI predicting what would happen if it attempted to hack its reward system, and seeing that the "real world state" is different than its "perceived world state", and also less rewarding than actually making the real world state a better environment as defined by its reward function. It almost makes it feel like the AI is programmed with an understanding of consequences, and that consequences matter to its goals.
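A minimal sketch of the lookahead idea in that comment, assuming a learned world model with a hypothetical predict(state, action) method (none of these names come from the video or any real library):

```python
def plan_score(world_model, current_reward_fn, state, plan):
    """Score a candidate plan by rolling it out in the agent's own world
    model and applying the *current, frozen* reward function to the
    predicted real-world states - not whatever the reward channel would
    report after the plan runs."""
    total = 0.0
    for action in plan:
        state = world_model.predict(state, action)  # imagined next state
        # A plan that merely corrupts the reward sensor predicts a world
        # where nothing was actually collected, so it scores low here
        # even though the hacked sensor would later read high.
        total += current_reward_fn(state)
    return total

def choose_plan(world_model, current_reward_fn, state, candidate_plans):
    """Pick the plan whose predicted consequences look best right now."""
    return max(candidate_plans,
               key=lambda p: plan_score(world_model, current_reward_fn, state, p))
```

The whole scheme leans on the world model being accurate; as a later comment notes, an agent that can subvert the lookahead itself is back to square one.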
@dtanddtl 6 years ago
This needs to be added to the playlist
@kurtiswithak 6 years ago
2:48 amazing meme usage, laughed out loud
@SalahEddineH 6 years ago
New Rob Miles video! Yaaaay! I love your work, damnit! Keep rocking! Seriously!
@SalahEddineH 6 years ago
:D 4:40 Was just PERFECT :D
@brbrmensch 6 years ago
ideas and hair start to get really interesting
@NathanTAK 6 years ago
NEW ROB MILES VIDEO. THIS IS THE GREATEST DAY OF MY LIFE.
@zxb995511 6 years ago
Until the next one
@NathanTAK 6 years ago
+zxb995511 No no no, the delay gave me terminal cancer and I only have 30 seconds left to live.
@nilstrieb 4 years ago
5:22 OMG an Overwatch reference love you!
@qd4192 6 years ago
As a psychologist, it seems to me that an advanced AI system fits perfectly the standard definition of a sociopath. If it were administered Robert Hare's test for sociopathy, it would score the maximum, having no genuine concern for others, but only for maximizing its own rewards. Have you considered this perspective? (utterly selfish - inherently dangerous)
@dannygjk 5 years ago
I.e., unless altruism naturally emerges from an advanced AI's interactions with its environment and the subsequent consequences, coupled with its 'mental' development, it will be at best completely neutral toward humans.
@TheMusicfreak8888 6 years ago
Fantastic video once again. Seriously you never disappoint.
@serovea333 6 years ago
Just binge-watched your whole channel. I love how the *phile series has had a ripple effect on all you guys. Keep up the great work!
@DaveGamesVT 6 years ago
These are always fascinating. Thanks.
@magellanicraincloud 6 years ago
Brilliant videos, I always feel more educated after listening to what you have to say. Thanks Rob!
@Ebdawinner 6 years ago
Keep up the great work brother. U bring insight that is worth more than gold.
@departchure 6 years ago
Thanks for your videos Rob. I'd love for you to address something that has been confusing me about AI safety. For an AGI that is improving itself, its own code becomes part of its environment. It should have every incentive to reward hack its code. Even if you try to sandbox it, if it's smart enough, it's a high reward target to go after the code dictating its reward function. The same should be true of a utility function (because it seems like it would look the same). Modifying its utility function would allow it to achieve the highest possible value of its utility function. And even if you managed to keep it away from its utility or reward function, it would still like to wirehead itself elsewhere in its code. A stamp doesn't actually have to be collected anymore, for example, to be registered as collected, a billion times. How is it even possible to motivate a truly smart AGI to do anything? If it's really smart, it seems like it has to be modifying its code, and it'll be smart enough to realize that the easiest/fastest/most perfect way to meet whatever its goal is would be to cheat inside its software. Perfect score, every time.
@departchure 6 years ago
Maybe that is a safety net. You try to get your AGI to solve your problem to your satisfaction before it figures out how to wirehead itself, knowing that it will ultimately wirehead itself before it destroys the universe looking for stamps.
@chrisaldridge545 6 years ago
Hi Robert, I've watched most of your videos now and just want to say many thanks. You are a really great communicator and I rate your input as one of the top 2 YT sources I've discovered so far. You asked in another video for comments on what direction future videos should take... I think your idea of reviewing current papers and explaining them to curious laymen like myself would be best for me. It's exactly what my other favourite channel "Two Minute Papers" does, with more of a focus on CGI, fluid simulations etc. I wish all the videos were longer too, say 20-40 mins each. (If I could afford to help patronise you guys I would, but I'm just an old/obsolete dumb ex-programmer, green with envy at the progress possible using the ML/AI tools of today.)
@interstice8638 6 years ago
Great video Rob, your videos have inspired me to pursue an education in AI and AI safety.
@JesseCaul 6 years ago
I loved your videos on computerphile a long time ago. I wish I had found your channel sooner.
@AlexMcshred6505plus 6 years ago
The Pyro/Mei toast was hilarious, wonderful as always
@simonmerkelbach9350 6 years ago
Really interesting content, and the production quality of your videos has also gotten top-notch!
@jayxi5021 6 years ago
The thumbnail though... 😂😂
@General12th 6 years ago
This is such a good channel! I love it!
@tomahzo 2 years ago
4:49 : Hard to say that with a straight face, eh? Yeah, I feel you ;D. 5:23 : That drawing is such a delight ;D.
@thrillscience 6 years ago
Shana Tova! Have a great new year. Your videos are great.
@SwordFreakPower 6 years ago
Top thumbnail!
@papa515 6 years ago
The concepts discussed in these videos can be rated by comparing the behaviors that we want an AGI to have with how we analyze our own (human) behaviors. I found this video explores an especially strong connection. This means that this way of looking at how we should view AGI is not only important for AI safety but also for the creation of AGI in general. The only model we have for GI is between our collective ears. So to have a chance at constructing an AGI we will need, on a very deep and complete level, an understanding of the engine between our ears. As we come to understand our own mentation on deeper and deeper levels we will not just understand how to go about constructing an AGI but, much more importantly, we will understand ourselves. And this new understanding will help us learn how to behave as a modern social species and to maximize our chances of persisting with ever more advanced and complex technologies.
@user-wi3db6wu8d 3 years ago
Your videos are really great!
@sakurahertz 4 years ago
Ok, I'm 3 years late to this video (and channel) but I love that Mei from Overwatch reference. Also, amazing content; I always find this kind of stuff fascinating
@electro_fisher 6 years ago
A+ editing, great video
@sunejohansson 6 years ago
Would love to see more about the code golf / how the program works or something like that :-) Keep up the good work. Cheers from Denmark
@kiri101 6 years ago
Thank you for the content!
@splitzerjoke 6 years ago
Great video, Rob :)
@AmbionicsUK 6 years ago
Yey been waiting for this. Watching now...
@douglasoak7964 6 years ago
It would be really interesting to see coded examples of these concepts.
@amargasaurus5337 4 years ago
Imagine making an AGI with "keeping peace amongst humans" as its goal, and ten years later coming back to find out it nerve-stapled the entire human population so that no one felt the need to fight for anything
@mykel723 6 years ago
Good idea, more people should post their links in the dooblydoo
@simeondermaats 4 years ago
Artifexian's been doing it for a couple of years
@CybranM 6 years ago
These videos are so interesting.
@LuminaryAluminum 6 years ago
Love the Pyro and Mei reference.
@gr00veh0lmes 5 years ago
You ask some damn smart questions.
@bballs91 6 years ago
Glad I'm not the only one who noticed the Overwatch characters 😂😂 well done Rob
@jupiter4602 4 years ago
I know this video is over two years old now, but I couldn't help but notice the problem with model lookahead being somewhat similar to the problem with adversarial reward systems: If the general AI is able to subvert the model lookahead penalty, which in some cases could potentially happen by complete accident, then we're left with an AI that can plan what it wants without penalty again.
@Jo_Wick 4 years ago
That analogy at 5:22 has a critical flaw; the liquid nitrogen would evaporate and smother the flamethrower's flames every time, through the nitrogen displacing the oxygen in the air.
@grugnotice7746 6 years ago
Id, ego, and superego as adversarial agents. Very interesting.
@Pheonix1328 4 years ago
Agents "fighting" each other and keeping each other in check reminded me a bit of the Magi from Evangelion.
@BlackholeYT11 4 years ago
"Pre-arachnophage" - as another former student I almost died when you brought that up, I was there in the room at the time xD
@David_Last_Name 4 years ago
Eh........ok I give up, explain please? This sounds both interesting and terrifying.
@DoveArrow 2 years ago
Your comment about the flame thrower and the nitrogen gun trying to make toast perfectly describes our democratic systems. Maybe that's why Churchill said democracy is "the worst form of government, except for all the others."
@ShazyShaze 6 years ago
That's a darn great thumbnail
@starcubey 6 years ago
2:46 The best part of the video right there
@MarcoServetto 6 years ago
For the gibbon/panda, a simple way to make the system more resistant could be to generate 10 random filters like that, and pre-apply those to the original image. Then try to evaluate all those 10 images and see if there is some "common" result. Indeed our eyes are full of noise all of the time.
@Frumpbeard 1 year ago
This is called data augmentation, and it's already done all the time.
@MarcoServetto 1 year ago
@@Frumpbeard and how can the random noise survive as an attack after this 'data augmentation'?
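For what it's worth, a toy version of the vote described above - close to what the literature calls test-time augmentation or randomized smoothing - might look like this (the classifier callable is a stand-in, not a real API):

```python
import numpy as np

def vote_predict(classifier, image, n_filters=10, noise_scale=0.05, seed=0):
    """Classify an image under several independent random perturbations
    and return the majority label. Adversarial noise crafted for one
    exact pixel pattern often stops working once fresh noise is overlaid."""
    rng = np.random.default_rng(seed)
    labels = []
    for _ in range(n_filters):
        noisy = np.clip(image + rng.normal(0.0, noise_scale, image.shape), 0.0, 1.0)
        labels.append(classifier(noisy))
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]  # the "common" result across filters
```

Whether an attack survives this depends on whether the attacker can optimize against the vote itself, which stronger attacks do - so it raises the bar rather than settling the question above.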
@DisKorruptd 4 years ago
Regarding the bucket bot, I was actually just thinking that, in order to prevent the dolphin problem, the reward would be greater depending on the size of the things it picks up, so it is rewarded more for larger bits of trash; getting -100 for each piece of trash it sees would demotivate it from making one piece of trash into 2 pieces of trash, and it'd rather collect 1 piece of trash worth 500 than 2 pieces worth 200.
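As a sketch of that scheme, with the numbers from the comment treated as illustrative (the per-size payout, the flat -100 penalty, and the item.size attribute are all made up):

```python
def bucket_bot_reward(collected, still_visible):
    """Hypothetical reward for the trash-collecting bot: pay by the size
    of each piece collected, and charge a flat penalty per piece still in
    view. Tearing one size-5 piece (worth 500) into two size-1 pieces
    (worth 200 total) now lowers the payoff and doubles the penalty."""
    return sum(100 * item.size for item in collected) - 100 * len(still_visible)
```

The catch is that "size" is itself a measurement the agent could game (compacting trash, reclassifying objects), so this moves the proxy rather than removing it.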
@snowballeffect7812 6 years ago
I love when my YouTubers reference each other. Makes the walls of my echo chamber stronger. Sethbling's vids on his work on SMW are amazing.
@bassett_green 6 years ago
Dank memes and AGI, what's not to love
@tetraedri_1834 6 years ago
What if AGI realizes its reward function is being modified, and also realizes that the new reward function would for some reason give it higher reward once the new reward function is applied? Maybe it won't allow people to change its reward function until it ensures the new system would give it higher reward...? The rabbithole never ends...
@David_Last_Name 4 years ago
Someone else in the comments had this exact same idea, but then pointed out that it would encourage the AGI to never do what you wanted, in order to force you to give it a new reward function. You can't ever win!!
@DeusExRequiem 5 years ago
The best method would be an AGI ecosystem where there are many versions of an AGI trying to achieve similar goals, all while in competition with other AGIs that might not have the same goals, might be in opposition for different reasons, or might do unrelated things to those goals. All these thought experiments assume a world where there's only one AGI, and once it becomes a problem there's nothing around to challenge it.
@BatteryExhausted 6 years ago
Also, I was trying to explain to a friend that top thinkers discard Asimov's laws. Pls could you make a video directly dealing with Asimov. Thanks. Love your work.
@MrSparker95 6 years ago
He already did a video about that on the Computerphile channel: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7PKx3kS7f4A.html
@claytonharting9899 3 years ago
An AGI that actively wants its reward function changed could make a very good character for a story. I imagine it as a schizophrenic AI that settles into a strategy where it randomly changes its own reward function. Maybe one day it will drop a bomb on a city, then the next day come by and go "oh no, what happened? Here, let me help you"
@cheydinal5401 4 years ago
I actually really like the idea of opposing subsystems, in the "branches of government" style. Sure, multi-branch government isn't perfect, but it's more stable than single-branch government. Not doing multi-branch AI would mean single-branch AI, which as I said can easily become dangerous. The opposing AI is basically an AI that is trained to make sure there is sufficient AI safety. Then divide the other into a "legislative" AI that only decides what to do, and an "executive" AI that actually implements it in the real world, but those actions can only be taken if that opposing "judicial" AI approves them
@4.0.4 6 years ago
The "careful engineering" idea is so good maybe we should apply it to every mission-critical software! Oh... Wait
@recklessroges 6 years ago
It feels like GAI research is re-covering much of the ground done by educational psychologists. I look forward to the flow of ideas being reversed and possibly implemented by GAI.
@owlman145 6 years ago
Humans reward hack all the time, but are kept in check by other humans. In this case, society as a whole is the reward function that prevents us from doing really bad things. And yes, it's not a perfect system, so we probably shouldn't make super AIs use it, though it might end up being the case anyway.
@NoOne-fe3gc 5 years ago
We can see the same happening in the games industry currently. Because some studios use Metacritic score as a gauge of success, they orient their game to please the critics and get a higher score, instead of making a good game for the fans
@SS2Dante 6 years ago
("Why not just"-style question, for anyone who wants to jump in and show me the stuff I've missed/overlooked/been dumb about) A lot of the trouble seems to come from the AI's ability to affect the physical world. Are there inherent problems to designing an AI such that the physical nature of the AI and utility function sync up to prevent this from happening? Note: this is slightly different from sandboxing, where the utility function still has free reign to attempt "escape" through social engineering etc. I'm imagining a computer, which functions as an oracle (ask a question, get an answer) with an input of...everything we have (functionally, the internet), a keyboard, and a screen for output. The Utility function would look something like 1) Default state (i.e. no questions waiting to be answered) - MAX score 2) Any action other than those specified in parts (4) to (5) - MIN score 3) A question is fed in - 0 score 4) Light up pixels on screen to answer question to the best of it's ability (with current information) - 10 score 5) Light up pixels on screen to explain why it is unable to find an answer - 8 score As far as I can see, point (2) completely neuters the AI's desire to do...well, anything, except answer the questions that are given. In the default state it has max reward, but can't set up any kind of existence protection, as that violates (2), so it would just...sit there. Once a question is fed in (which it can't prevent, for the same reason), it is incentivised to answer the question (an improvement in score) as a short term goal, which allows it to clear the question and jump back up to the (1) state, where it idles again. The biggest danger I can see is in how we specify parts (4) to (5), but if worded clearly, any attempt at social engineering etc. would fall outside of the remit of "lighting up pixels to answer the question if you are able to do so". Obviously such an AI would be far slower and less effective than one that can actually take actions, but is certainly better than nothing! Anyway, as I said I'm sure I've missed....quite a few somethings, so if you know what's up please come correct this! Oh, and great video as always Rob, really enjoying the channel!
@thesteaksaignant 5 years ago
I know it's been a year, but I think you left out the tricky part: how to get meaningful / "good" answers. You need to evaluate the quality of the possible answers and choose the best one according to some criteria (that is: a utility function). Reward hacking of this utility function will be a good strategy to choose the answer with the maximum value. All the risks described in this video apply. For instance, if the reward for the answer is given by humans, then human manipulation is a good strategy: giving answers that will please humans / correspond to what they think is a good answer (regardless of what the true answer is).
@notoriouswhitemoth 4 years ago
Humans have multiple reward functions. We have dopamine, which rewards setting goals and accomplishing those goals, particularly related to our needs; serotonin, which rewards doing a thing well, impressing other people with displays of proficiency; and oxytocin, which rewards kindness and cooperation.
@armorsmith43 3 years ago
> We have dopamine Some of us do... :_(
@syzygy6 4 years ago
You give the three-branch-government analogy as an example of overkill, and I agree with your reason, but I also think you might say that the programmer is already a general-purpose intelligence performing judiciary functions, and there may be value in formalizing the role human agents play in training AI rather than thinking of human agents as existing entirely outside of the system. For that matter, I wonder how much we can learn about effective governance from training AI; if you think of corporations as artificial intelligences, then you encounter the same problems with reward hacking.
@williamchamberlain2263 4 years ago
1:10 I'd heard a few schools in Qld used to dissuade kids from taking some uni-entrance Year 11-12 subjects, to improve their positions in league tables.
@XxThunderflamexX 3 years ago
Could you combine the agent and "utility function defender" into the same agent, and produce something that is "afraid" to wirehead itself? Something that periodically predicts what world states it would expect to observe if it suddenly operated with the goal of tricking its own systems, and then adds the predicted world states to a blacklist to check against with locality hashing. Admittedly, the hard part is probably in defining "tricking its own systems", which might be the core of the problem itself - how do we write a utility function that we unambiguously want to be maximized, all else being equal?
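A rough sketch of the locality-hashing half of that, using random-hyperplane signatures over an assumed vector embedding of world states (everything here is hypothetical):

```python
import numpy as np

class WireheadBlacklist:
    """Store locality-sensitive hashes of world states the agent predicts
    it would see if it were tricking its own systems, then veto plans
    whose predicted states land near any of them."""

    def __init__(self, dim, n_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.banned = set()

    def _signature(self, state_vec):
        # Sign pattern of projections: nearby vectors share most bits.
        return tuple(bool(b) for b in (state_vec @ self.planes.T > 0))

    def ban(self, state_vec):
        self.banned.add(self._signature(state_vec))

    def is_banned(self, state_vec, max_hamming=2):
        sig = self._signature(state_vec)
        return any(sum(a != b for a, b in zip(sig, s)) <= max_hamming
                   for s in self.banned)
```

As the comment itself says, the hashing is the easy half; deciding which states belong on the blacklist is the actual alignment problem.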
@hugoehhh 6 years ago
Dude. Are we programmed? Anxiety and our actions make so much sense from an AGI programmer's point of view
@WilliamDye-willdye 6 years ago
I strongly disagree with the approach that internal conflict between powerful systems is dubious (5:05), but the difference between Mr. Miles and me may be just a matter of semantics. I doubt if he truly advocates a system of government in which we give total power to the prime minister and then simply tell the public to make sure that they elect a very good prime minister. He later talks about components within the AI that lobby for different conclusions (at 7:16, for example), so maybe we only differ in how we draw the boundaries around what constitutes a separate entity in the conflict. For background, my own approach to AI safety (tentatively entitled "distributed algorithmic delineation") treats division of power as a critical component. Moreover, I fear that unification of power is to a large extent a natural long-term effect in any social organization with a high degree of self-interaction. Therefore a primary design consideration of a good safety system needs to place a high priority on defeating this natural tendency to centralize (sorry, "centralise") power. Well, like I said; maybe the differences between us are semantic. I still find these videos very interesting and often informative, and I'm glad that Mr. Miles is promoting AI safety as a proper field of study. For too many years, almost all of us with an interest in the topic could only make it an off-duty hobby. It's a delight to see well-written videos and papers about a topic that has interested me for so long.
@Paul-rs4gd 5 years ago
So we wouldn't want to give all the power to the prime minister, but it might get a bit better if the power is divided among 3 bodies which watch over each other. The logical conclusion seems to be to increase the number of entities. In human society it is said that absolute power corrupts. When you have a lot of humans (or other types of agent) interacting, things seem to be fairer when there are more of them, provided that no individuals gain drastically more power than others. It is the principle of avoiding monopolies of power. Hierarchies do form, so the system certainly has its problems, but nobody has ever managed to rule the whole earth so far!
@boldCactuslad 6 years ago
Pyro and Mei vs Toast, such a nice sketch
@Davesoft 4 years ago
The 'superior reward function' made me think of Evangelion. They had 3 AIs, hosted on human brains cuz why not, that argue and veto each other into either pre-existing procedures or complete inaction. Ignoring the sci-fi elements, it seems a nice idea, but I'm sure they'd find a way to nudge and wink at each other and unify one day :P
@ChrisBigBad 4 years ago
Now I have to go and play a round of Universal Paperclips!
@progamming242 4 years ago
the chuckle at 4:47
@SamuelDurkin 5 years ago
If the end goal is more fuzzy and it's not something specific like stamps... would it work if the end goal was "do what the humans like", but you don't tell it what they like, so it has to try and guess what they like? This end goal sort of means you can keep changing its end goals, as your liking of things is its end goal... when you stop liking something, it has to change what it's doing, which would include turning itself off, if that is something you would like it to do...
@dannygjk 5 years ago
For that to work humans would have to change what they like, which is unlikely to happen unless some force causes them to change what they like.
2 months ago
A general life wisdom that is often said is "everything in moderation". But all AI systems I have heard about so far try to maximise something. We do not even know words for what ethics maximises, especially not formulas or code for it. But maybe using a reasonably OK metric (or many) and telling the AI to get an OK score in that could be an interesting idea to follow? For example tests are still used in school, because they are a reasonably OK metric. But if someone got 100% on all tests ever, you would get suspicious. If one day there was an AGI that guaranteed that everyone had a better life than 99% of people have now, that would be considered somewhat of a utopia, right? I am sure there are many problems with this as well, but they are probably different ones. A programmer is always happy to see an error message change. :D And maybe this idea could lead to looking at things from a different angle, which is always helpful. Or maybe there are already lots of papers on this that I have not heard about.
@martinsmouter9321 4 years ago
4:24 it doesn't have to be smarter, just good enough to be less rewarding than its environment
@VorganBlackheart 6 years ago
When you click on a random AI video and then realize it's Robert's new upload ^.^
@RonaldSL- 6 years ago
Yay
@ONDANOTA 2 years ago
Some producer should make a movie about AI safety, like they did with "The Big Short". Just make it before it's too late
@clayfare9733 2 years ago
I know I'm late to the party here, but while thinking about the solutions to reward hacking that you listed here I can't help but wonder if it would be possible to set something up like "You get 100 reward if you collect 100 stamps, but you lose 1 point for every stamp you collect over 100. If you attempt to modify your code you lose all/infinite points."
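Written out, with the self-modification check treated as an assumed oracle (which is really the hard part), that proposal might look like:

```python
def stamp_reward(stamps, modified_own_code):
    """Capped reward: climbs to 100 at the target, then loses a point per
    extra stamp; any detected self-modification forfeits everything."""
    if modified_own_code:
        return float("-inf")
    if stamps <= 100:
        return stamps
    return 100 - (stamps - 100)  # 200 stamps scores the same as 0
```

The familiar catch applies: the agent is now rewarded for making modified_own_code read False, not for actually leaving its code alone.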
@Paul-rs4gd 5 years ago
I am interested in the idea that Reward Hacking is essentially the opposite argument to "Take a pill so you will be happy after killing your kids" - TPHKK for short. One view argues that an AI will try to modify its reward function, while the other argues that the AI will resist attempts to modify the function. I would like to hear more discussion of this conflict. My own 2 cents: It seems to me that the simple RL algorithms 'feel' their way down a gradient resulting from actions THEY HAVE ACTUALLY TAKEN in the real world. If the set of actions required to Reward Hack was sufficiently long it is vanishingly unlikely that sequence would be executed by chance, since no gradient would lead the RL along that path (there is no reward until the Hack has been successfully executed). The situation is entirely different in an AI that models the world and plans using that model. Such an AI could understand that it is implemented on a computer and plan to modify it. However it should see that the plan results in a change to its reward function and does not achieve its current goals. Therefore like humans being resistant to TPHKK, the AI should not wish to execute the plan.
@iamsuperflush 1 year ago
Goodhart's Law can basically be wholly applied to the concept of capital as it currently exists. Money has become a very poor measure of value; see crypto, NFTs, subprime mortgage crisis, etc.
@XxThunderflamexX 3 years ago
Would multi-agent systems be resistant to wireheading? If the multiple agents are not normally in conflict with each other - say, they were able to delegate tasks based on specialization - there wouldn't be as much of a risk of humans being caught in the crossfire. It would only be when one agent starts misbehaving, wireheading itself or trying to take disproportionate power for itself over the other agents, that the other agents would be incentivised to step in and realign the agent with their shared goals.
@bmurph24 6 years ago
Well memed thumbnail.
@TylerMatthewHarris 6 years ago
I LMAO at that dude tapping his head
@StardustAnlia 5 years ago
Are drugs human reward hacking?
@JimGiant 5 years ago
Hmm, what if the reward is split into two parts: the purpose (e.g. get a high score in SMB), and doing it in a way which doesn't displease the owner. Have the "owner indicates they are happy" part be worth more than the possible score for the purpose; this way any discovered attempt at reward hacking will result in a lower score than any attempt to do it correctly. With more powerful AI which has the power to threaten or coerce the owner, have a third reward layer bigger than the other two combined which rewards the AI as long as it doesn't interfere with the owner's free will.
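A sketch of that layering, where each weight is chosen to exceed the largest total the layers below it can ever reach (all three inputs are hypothetical measurements):

```python
def layered_reward(game_score, owner_pleased, owner_free_will_intact):
    """Three-layer reward: freedom outweighs approval, approval outweighs
    score, so no score exploit can beat displeasing the owner, and no
    amount of approval can justify coercing them."""
    MAX_SCORE = 1_000                     # assumed bound on the game score
    w_owner = MAX_SCORE + 1               # dominates any possible score
    w_freedom = w_owner + MAX_SCORE + 1   # dominates score + approval
    return (w_freedom * float(owner_free_will_intact)
            + w_owner * float(owner_pleased)
            + float(game_score))
```

The catch, in the spirit of the video: "the owner indicates they are happy" is itself a sensor reading, so the upper layers inherit the same hacking problem the bottom one had.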
@intetx 4 years ago
Maybe the first AI we build should get the task of checking whether an AI that is being written will go unsafe.
@bm-ub6zc 2 years ago
Reward hacking is basically what all the hospitals and doctors in my country do: Doing things to maximize the money you get out of the insurances without actually helping the patients.
@zekejanczewski7275 5 months ago
I actually think reward hacking can be used for good, if harnessed correctly. When there is a surge of power in your home, your breaker trips to protect you. It could be like an "intelligence breaker" for AGI. It's not useful for alignment in and of itself, but it might be a safety measure put on self-modifying AI. Let's say whatever their utility function is becomes clamped between 0 and 100. Even if they turn the whole world into stamps, they can never get more than 100 utility. Now, we add a new reward for tripping the breaker of 200 utility. If they grow unexpectedly powerful, they might have to mentally mani... The breaker is guarded by a hard task which marks a turning point in the AI's intelligence. Say, the AI must be smart enough to convince 5 human gatekeepers to flip the breaker. If the AI is capable of successfully convincing the humans to trip it, its intelligence is probably at an unsafe level.