AXRP, pronounced axe-urp, is the AI X-risk Research Podcast. On this podcast, I (Daniel Filan) have conversations with researchers about their papers. We discuss the paper and hopefully get a sense of why it's been written and how it might reduce the risk of artificial intelligence causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. Patreon: www.patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast
Interesting topic, but could you avoid jumping between subjects so much? Sometimes I need to listen for minutes before I know where you're going with your train of thought.
Awesome interview, but I think your audio mixing is off on the Spotify version. I was trying to listen to this in my car and had to crank my volume to the max just to barely hear your voice.
Except in the unlikely event you actually have the 'optimal policy' properly defined correctly on the first try for the AGI, aren't you... *done*? You can just point to that. You probably won't actually have it properly defined, though, because how do you do that (without some risk of getting it wrong)? I feel like defining what we really want is one of the hardest issues in AI.
Any other interesting work to recommend on the idea that our senses/control mechanisms are both generative processes and predictive processes? I also had some more to add on the topic of why there isn't the convergent behavior that might seem obvious. Like Jan mentioned, there might be local environment state differences, prediction optimization on a longer timescale has potential errors, etc. But there are also potential 'hardware' and 'software' differences -- humans don't run on completely homogenized brains, and it's possible to imagine that the initialization of various weights in our brains is randomly distributed in a way that yields different outcomes.
This was great to hear, aside from the volume, which was too low. I had to max out my phone volume and it still wasn't quite high enough. I guess it's fine for me on closed-back headphones, but it wouldn't be on everything.
This one was pretty technical for those of us who haven't read some of the foundational work for SLT. I had to stop and look up some specific details later, and I still don't feel like I fully grasp what makes SLT different from other accounts of degeneracy and preference for simple functions, in terms of making predictions about neural network behavior. David's framing of fundamental structures in the data being what matters across training runs makes a lot of sense, but I still don't grok how this helps with alignment. I suppose understanding the stability of structure moves us closer, both on something similar to interpretability but also on capabilities.
When you think about it, if you sum up all the "like"s, they take 105 minutes in this video, the "yeah"s take 40 minutes, and the "um"s take about 15 minutes… Take them out and you'd have a much more digestible video.
I've been loving this new stream of content on Spotify during long drives! Daniel, you are pretty well up to date on papers generally; I am always impressed.
If a factory near my house produces toxic chemicals that have a 1%/year chance of killing me but doesn’t affect my health otherwise, is it possible for me to sue the factory and have a court order them to stop? (In Europe, in this situation you can probably sue the government if they don’t protect you against this, and have the ECHR or a national court order the government to stop the factory from doing it and possibly award a compensation.)
"Even if the stars should die in heaven Our sins can never be undone No single death will be forgiven When fades at last the last lit sun Then in the cold and silent black As light and matter end We'll have ourselves a last look back And toast an absent friend" Sorry. Feeling angsty about the world today. I had a friend in high school who I'd sometimes complain about my problems to. She'd always say the same line in reply, and I couldn't argue. "Well, do better." That meme stuck around in my head. "Do better." It's a weird place to be. "Oh, these lab leaders think there's a 20% chance of doom." "So they haven't ruled out doom?" "Well, no, they just think it's unlikely." "I wouldn't call 10-20% 'unlikely' when we're talking about 'literally everyone dies and/or nearly all value in the future is irrevocably lost,' but okay, why do they think it's possible but less likely than throwing heads in a coin flip?" "Well, they don't really explain why, but it's something like 'human extinction seems weird and extreme, and while they can imagine it, they feel much more compelled by other grand and wonderful things they can imagine' - at least, that's the vibe I get." "Annnnd we don't think there's some kind of motivated cognition going on here? I think people buying lottery tickets are also imagining very vividly the possibility of them winning, but that doesn't make them right to say whatever % they feel intuitively." "They'd say AI is more like the invention of agriculture than a lottery. Like, maybe you make some huge foreseeable mistake and cause a blight, but if you have some random list of virtues like 'common sense' or 'prudence' or 'caution' then you'll probably just make a bunch of value." "I think Powerball is a good metaphor. Let's take features of the universe we'd all want to see in the future and tag them with a number. We then play a million-number Powerball and hope each one of those numbers we chose shows up. What are the odds that will happen? 80%?"
"This sounds like a wonderful argument on how to reason about a specific kind of uncertainty, but people don't want to reason about uncertainty, they want to reason about how their most convenient and readily actionable policy is actually totally fine and probably not going to be an unrecoverable catastrophe." "Well, they should do better." "I appreciate the sentiment, though I would like to note that in this case this has to be nearly the largest understatement in the 13.8 billion years of the universe." "Here's another: I'm pretty bummed out about this."
"Is there any strategy these models can deploy that would allow them to cause problems?" Has anyone in the Black Dolphin prison ever managed to kill a guard or another prisoner? Not sure, but I'd guess probably 'yes.' And those would just be humans, not even particularly selected for their intelligence, just selected for their 'killing intent' and prior extreme bad behavior. An AI that's as smart as our civilization working for an entire year per hour of IRL computer runtime will find any medium-bandwidth channel of causal influence more than sufficient to destroy the world. Even if you give it a 1-bit output channel and iterate on asking it "yes/no" questions, that probably adds up to a lethally dangerous level of causal engagement with our universe eventually. Even if you reset it after every question, with zero online learning, it can probably guess its position in the sequence if the input contains enough deliberate or accidental shadows of the interventions the prior instances of the system have made. "Safe no matter what" sounds great, but it's like saying some product is "indestructible" - well, you're failing to imagine the maximum forces that can be brought to bear on the object. Specifically, a sheer and strict 'whitelist' policy is only as safe as your ability to predict the consequences of every action you whitelist, and if you could predict all of that, then the AI is no better than a tool-assisted-speedrun program or a manufacturing robot. It can precisely and quickly do only as much good as humans could do slowly and less precisely. As soon as you're getting "superhuman," you need something that does superhuman-level human-value alignment. Your merely human-level control/safety techniques will be insufficient to cover that wider space.
You've got a relay in a circuit that's meant to carefully switch off a power supply when there's a surge, and it looks super safe and reliable, since you can prove it successfully activates and breaks the connection even up to the level of the capacitor banks that fire inertial confinement lasers. And yet, in practice, the surge comes through, the relay flips, and there's a white-hot arc through open air as the electric field shreds air molecules into plasma, and the energy grounds out in the delicate machinery - now molten slag - you had downstream of that relay. That jump through open air is the problem. That's what "outside the box" is pointing to. Good luck constraining safe operation beyond the box when you can't see that region in advance; and if you aren't trying to go outside of the box, then why the hell are we even trying to build this stuff?
22:34 "roughly human level"... Ok, but even if this works, what are the odds that AI labs will only use "roughly human level" AI agents, rather than "the best AI agents available"? It seems likely that "roughly human level AI" will only be "roughly human level" for weeks or months, before being replaced with 2x or 10x versions. Even if you were able to contain a "roughly human level AI agent," this could be a very temporary solution? Would "roughly human level AI agents" be able to safely do any useful testing and alignment work on an ASI-level model? Doing alignment work (goal and intention testing, capability testing, red teaming, even containment) on an ASI would likely require greater than "roughly human level" AI agents?
Also, it seems relevant that bilateral anterior cingulate cortex destruction produces a psychopath, at any age. That seems a very important point. We don't really learn moral behavior from blank-slate exposure to our world. We learn morality the way we learn to walk: it's perfecting our strategy for doing something we basically already know how to do instinctively.
There's a lot more than heartbeat and such that we are born with. We expect 3 dimensions of space and 1 of time, and that we are agents with preferences over future states of the world. We expect other things that move, or that have what appear to be 'eyes,' to be other agents. We try to figure out what those agents 'want' soon after birth. We can exhibit jealousy by 3 months of age. We can recognize some facial expressions instinctively. People who never had a limb can still have phantom limb syndrome, so there is a mental map of a normal human body. Probably many, many more things.
The number of times "like" is said is painful, considering these are relatively intelligent people. It was very "like" distracting and "like" difficult to "like" take them "like" seriously.
I tend to find Leike's speaking style really difficult to listen to. Unfortunately, everything he says is pretty valuable, so you just kind of have to pull through.
What if we have an AI that does this for us? And an AI that interprets the interpreter, and so on. Maybe a wave of AIs, in order to give us a constant stream of interpretation of what is going on.
There are approaches that use this tactic for outer alignment. I highly recommend checking out the classics: Christiano IDA and debate, etc. It's definitely a common motif in this area of research. But then again, I've seen people raise concerns that automating interpretability tools may enable deceptively aligned policies/agents to further entrench themselves. Check out "AGI-Automated Interpretability is Suicide" by RicG
Try Google Podcasts - you can listen in your browser, and it lets you adjust the speed! podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5saWJzeW4uY29tLzQzODA4MS9yc3M And sorry - back when I uploaded this, only a small fraction of my AXRP listens were on YouTube. Now that more are, it might make sense to cross-post to YouTube (altho I haven't been as active on The Filan Cabinet recently).
I think part of the problem is that this is actually an explanation of why the neural network's method works, not an explanation of what its method actually is. After all, the network hasn't learned the concept of the trig functions; it's just learned how to embed the inputs (0-113) on a lossy version of the trig curves, etc. A mechanical description, I think, would also be clearer to a less math-y audience. It feels to me like the (quite excellent) authors saw the math they were familiar with and homed in on why it worked, rather than giving just a straight account of what the network is doing in a step-by-step fashion.
Not at the moment, but I could do so if there were sufficient interest. There are only 20 responses, so I'm not sure how interesting it will be for others.
The images are quite helpful, especially for a complete beginner to the field when it comes to terms like stochastic gradient descent. This channel is very underrated.
Yep - AXRP is primarily a podcast, available wherever good podcasts can be accessed (e.g. Spotify, Apple Podcasts, Google Podcasts). You can also read a lovingly-compiled transcript here: axrp.net/episode/2021/12/02/episode-12-ai-xrisk-paul-christiano.html
Just my two cents... I have a background in civil engineering and medicine, not computer science, but I am extremely good at connecting dots. The existential risk is not 20% in my opinion; I would guess the chance of shit NOT hitting the fan is less than 5%, based on everything I've heard about the black-box nature of current systems and how far behind alignment research is in terms of funding as well as progress. This guy is like 3 years past school and he considers himself an expert? Sorry to be a troll, but I am an extremely logical person, and the nature of code and human psychology, and the nascent exponential increase in the power of AI and its recent open-sourcing, make me very pessimistic. I really hope I'm wrong... Even if it doesn't kill us directly, there are a lot of malicious actors that will use open-source AI to create malicious code, bots, and viruses to totally disrupt our current digital society, not to mention the energy cost of using this technology. Where am I wrong? This is the big problem in my mind: even if AI itself is totally beneficent, if it's used for malevolent purposes by some and it's extremely powerful, are we not hosed? What if it allows bad actors to hack into the nuclear weapons sites? Or someone can easily create code which will shut down energy grids, which are all digital these days? Not to mention creating dangerous nanotech; the list goes on. Am I naive, or are you?
Yeah, you are being naive. First of all, please listen to my man correctly here. You are talking about probabilities, but the 20% he is talking about is not a chance or probability; it is the expected value of the human potential. Also, he was mainly saying that about alignment misses. Your worry about malicious actors is not an example of misalignment. And even in that case, you are being naive in thinking that these guys are not thinking about that. Also, if you look, this podcast is 1 year old, and ChatGPT and other generative models weren't even on anyone's radar, not even OpenAI's (they did not expect this much performance). It's easy to come in with hindsight here and say, look at all the exponential progress, AI is going to doom us, etc. You think that even without AI there would not be classical handcrafted algorithms able to create software and malware at a faster rate? Do you know how much better classical algorithms have become over the years? The malicious actor thing has always been on everyone's list.
So the bird's-eye-view hypothesis ought to make predictions about when and where the instances of the agent's source code ought to be occurring and what sort of inputs the agent ought to be subjectively perceiving, and if the inputs it is actually subjected to are in agreement with the hypothesized inputs, then there is a good match between the hypothesis and the objective?
Huh - this is being crossposted from libsyn, where the description has dot-points for the timestamps (and also links for the research being discussed). Lemme try to fix.