A very good friend of mine did some recent work upgrading storage for the research division of a very large pharmaceutical corporation. Their security protocols were good but inflexible, creating motivation to work around restrictions that slowed the upgrade to a near standstill. The financial incentives, combined with a sense of hubris, resulted in several major security controls being temporarily bypassed in ways that weren't fully auditable. If an insider was waiting for the moment when exfiltration of very expensive and proprietary data and software was possible, then they got their chance. Security is always in tension with getting work done, and there's no such thing as perfect security.
Even on my level, typing my password every time I wake up my computer gets on my nerves. Encrypted files are fun as well. I have removed security numerous times, only to swing the other way worrying about malware etc. This is on my personal PC. I worked for decades at an aerospace company that had sign-in and log-on requirements that were super annoying to repeat many times a day. Then it seemed I had to change my password all the time; everyone had to do it every 3 months or so. To the point that I had rolled through the entire alphabet as the last character and was well into the upper case when I retired.
@@fxsrider I know that same frustration with frequently having to change passwords at aerospace companies, having worked for a couple of them myself. It was an open secret at one of them that everyone left post-it notes with their most recent password under the keyboard.
As an engineer I've never once seen a company that wasn't compromised by China. China has a lot of people trying, and small US companies are such easy fodder. People act like best security practices simply existing somewhere makes the tech world safe... but if you graphed population vs. competency of IT, it would look like wealth in the US: almost all of the high competency is in a very small number of people. The other 99% are abysmal. It's hard to be smart enough about security now; there are so many attack vectors, and corporations see security as an expensive cost with a low risk of punishment, so they justify not paying for it. And tbh the amount of money needed to compete for those few people who are actually very competent might not be worth it to the company.
Llama 2 is now almost at the level of GPT-3.5 even without breaches, and Llama 3 might be at the level of GPT-4. In that case, since the Llama series is open source, the question of what would happen if GPT-4 were stolen might become moot and academic, since anyone can just download open-source Llama, which at some point in the near future might reach the level of GPT-4.
@@ebx100 this video's topic is about the ramifications of GPT-4 getting stolen. With a stolen model, you don't even have the option to pay for it; you go straight to jail.
@@ebx100 it doesn't prevent people from commercializing it covertly. To prove it, you would need to show that a certain work was done by a specific AI, something we currently cannot do.
Common Crawl doesn't provide data only for machine learning; it's for research of all sorts. And the 45 TB number is inaccurate: the dataset is measured in petabytes.
This already happened in the image generation space when the NovelAI model got leaked from a badly secured GitHub account, downloaded, and used as a (somewhat) foundational model for a lot of anime image generation models.
Small correction: SSH is used for remotely operating (Unix and Linux) machines; for API and web traffic it's more common to use TLS (also called SSL colloquially; technically SSL is the older protocol).
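To make the TLS point concrete: a minimal sketch using Python's standard `ssl` module, showing that the default client context verifies certificates and hostnames out of the box (no particular server is assumed).

```python
import ssl

# Build a client-side TLS context with sane defaults:
# certificate verification and hostname checking are enabled.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # certificates are verified
print(context.check_hostname)                    # hostnames are checked
```

This is what sits under every `https://` API call; SSH never enters the picture for ordinary web traffic.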
Just a matter of time before we see a "Folding@home"-equivalent project that can train a single model in a distributed, decentralized way. Then it isn't about theft, but about what can be done with such a tool...
It is darkly hilarious to watch AI companies spend enormous effort and resources to fend off the theft of their models, when the models themselves were built off of stolen and public-domain data.
I think the AI companies are probably more afraid of an open source competitor that makes all of these protections irrelevant. There's no need to steal something built on publicly accessible information with enough time and effort.
0:44 I don't know how successful OpenAI would be in enforcing the proprietary nature of their model if it leaked. It's built upon mountains of stolen and misappropriated data after all.
If AI is based on what's on the internet, it's gonna be the dumbest thing around. Garbage in, garbage out. It'll probably become sentient and commit suicide from depression.
What people think large language models are: Skynet, HAL-9000 What large language models really are: Your keyboard's predictive text if it read the entirety of Reddit
This x1000. The fear mongering over AI is way overblown. The models are useless without new human-created data to feed into the system. My CS professor pointed out that if people stop posting on Stackoverflow or Quora because now they’re using ChatGPT instead, then it will just regurgitate old info and get outdated very quickly. It turns into this weird bootstrap paradox feedback loop where “knowledge” effectively stagnates.
That's because AIs don't store their answers. I don't know how many times it has to be explained that AIs are not lookup tables. They're not compression, lossy or otherwise. That's made up by the NoAI crowd to try to pretend a generative AI can't produce original work. It was literally never true. Compression doesn't let you retrieve something that wasn't in the original dataset.
The effectiveness of generating training data off of existing public models has been really impressive. The open source community has been embracing it, for obvious reasons, with some real results. Right now, fine-tuning off of generated data is where it's most used.
Generation of spam so complex that it will fool 90% of laymen. Help with the creation of fake bank login landing pages and fake shopping sites. There's all kinds of stuff that's possible: voice spoofing, fake news generation, propaganda creation.
For LLMs to be truly embedded all around people's lives, they need to be open sourced. There are many important things that can be done with GPT-4, like using it to automate corporate paperwork, to aid peer review of scientific research, to summarize and investigate documents, etc. What Microsoft is doing will never accomplish these. The closed-source nature also ensures that there can't be anything better than what they've got, essentially inhibiting any proper growth and application.
The data is available and there are open source models with close to equivalent performance. The problem is the cost curve for more advanced queries. The leaders in AI will likely be determined by access to efficient hardware, not anything else. Worrying about protecting weights, while it shouldn't be ignored, is the wrong direction.
@@SalivatingSteve I don't think improvements to current hardware architectures are going to get AI past the coming hardware wall. You are going to be looking at something different.
You have to remember he mentions state actors many times during the presentation, so a lot of the hardware/software/resource limitations for anonymous hackers don't really apply. State actors can easily have servers to store petabytes of information and multiple high-speed connections for download.
I think the reason data has to be exfiltrated slowly is that it probably sits behind hardware that limits the speed of any outgoing network connection.
@@aspuzling it has to be done stealthily, with lots of connections masked to look like normal traffic, because trying to download a massive amount of data to a single user would raise red flags.
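Some back-of-envelope math on why slow exfiltration matters. All the numbers below (a 1 TB weights file, the link speeds) are illustrative assumptions, not figures from the video:

```python
# Rough, illustrative math: how long does it take to sneak out a
# 1 TB weights file if the outgoing connection is throttled or must
# stay under an anomaly-detection threshold?

def exfil_days(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Days needed to move size_bytes at a sustained bit rate."""
    return size_bytes * 8 / rate_bits_per_sec / 86_400

one_tb = 1e12  # decimal terabyte

print(f"{exfil_days(one_tb, 1e9):.2f} days at 1 Gbps")   # fast link: hours
print(f"{exfil_days(one_tb, 10e6):.1f} days at 10 Mbps") # throttled: ~9 days
```

At a stealthy trickle, the transfer takes days to weeks, which is exactly why an attacker would spread it across many normal-looking connections.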
A verbatim copy of those 1 TB of weights would not be valuable for very long, as I am sure OpenAI is continually updating and refining the model, and I am sure they already have the next big thing in the pipeline. It would just be a momentary snapshot with a fixed knowledge cutoff.
The training data, however, is so precious it would warrant a massive ransom, as its public release would see every IP holder suing the company. In several jurisdictions you are required to protect your copyright against violations, and not suing OpenAI might eventually be interpreted by a judge as not caring whether your work appears in any AI.
What if someone steals the collective writing of humanity, every book, news article, and Reddit post ever written, and uses it to train a model they then consider a proprietary trade secret? Can you really 'steal' something that was already stolen and hoarded?
Apparently the use of synthetic data is supposed to avoid DMCA or copyright issues, as well as speed up processing, but I had to look up synthetic data: en.wikipedia.org/wiki/Synthetic_data
@@TwistersSK8 Uhhh... not the same as a machine "reading" the book. Isn't that obvious? Pirates made a similar argument that copying digital files isn't theft because the owner still has the original copy. Maybe you should try stealing this book?
8:45 Look, it isn't that we here in the US don't appreciate the contributions of Chinese nationals (and others too) to our infrastructure projects. We do. The issue is that if you have family, real estate, or other ties to China, or if you LIE about those ties, then you are susceptible to being manipulated, blackmailed, or otherwise vulnerable to coercion by regimes that can snap their fingers and send your parents or children into a gulag. That's why you guys get your clearances held up. It's not that we don't like you guys, it's that we have to face the cold hard facts about what happens when someone gets their arm twisted by the Ministry of State Security.
The companies would immediately be massively sued if the training data leaked, as it gives every party with works in it the possibility of suing the company. It is an unwinnable battle, as hundreds, potentially tens of thousands, of IP holders would sue ChatGPT and OpenAI.
Data at rest is only encrypted for the layers below the encryption process. If done by the OS, the client of the OS sees the data in the clear, so which layer does the encryption is important. For encryption of data in use, Intel SGX is a very common way to secure cloud payloads; however, an application vulnerability in the code running in SGX negates the security properties of SGX. This is why languages such as Rust should be used, and the number of lines running inside the enclave needs to be limited as much as possible to reduce the attack surface. A man-in-the-process attack on such an enclave is very hard to detect.
Meh. The LLM datasets are less important than the algorithms that build them. GPT is just a chatbot. A big, good training set is valuable for its functionality and the cost it took to build. Lots of datasets are being built these days. They are going to be like cryptos: the first one was valuable, but then everyone made one and the value of all of them dropped. Chatbots are good at "talking", as in they can predict what a human would say based upon the keywords in the prompt input. But the model does not "know" or "think" anything. Most of them are dumb. Their best utility is in making serendipitous connections of concepts and ideas from masses of data.
What do you think a human mind is, but lots of chatbots talking with each other, supervising each other's output, correcting, analysing, reviewing, rating, and amending in a way that creates the epiphenomenon of intelligence?
I would think Quora is a massive source of conversational Q&A made available, and contributes to the dataset unfettered. Adam D'Angelo is basically a senior board member at both companies. Also, what OpenAI did with going live on such a simple interface was a 100% stroke of genius. I firmly believe this format allowed not only for training but also for a very solid baseline of what humanity cares about within the data set; otherwise there is just way too much data to model on. This doubly bootstrapped a 'scope' to start from and trained errors out based on the acceptance of the result to a query. This is probably some secret sauce as to why they're able to iterate so fast. It's the end user.
The biggest risk of GPT "theft" is simply an employee walking out the door with the knowledge of GPT. In California you cannot stop an employee from using what they remember. You can stop them from taking files with them, however. It's a delicate balance, but in general "information wants to be free" and it's hard to keep stuff proprietary. At the core, GPT is matrix multiplication, which cannot be copyrighted per se.
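The "just matrix multiplication" point can be shown in a few lines. The weights and input below are made-up toy numbers; the point is only that a model's parameters are plain arrays of numbers and inference is largely repeated matrix products:

```python
# Toy illustration: a "layer" of a model is a matrix of numbers, and
# a forward pass is a matrix multiplication. Values are invented.

def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

weights = [[0.5, -1.0], [2.0, 0.25]]   # a tiny 2x2 "layer"
activation = [[1.0, 2.0]]              # a 1x2 input row vector

print(matmul(activation, weights))     # -> [[4.5, -0.5]]
```

A real model just does this at vastly larger scale, layer after layer, which is why the weights file itself is nothing but numbers.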
@@vvvv4651 it's more important how the training data was organized, structured, and formatted, and what the training methods were. If this information were really that secret, other LLMs would be far, far behind.
@@vvvv4651 there are a few special people who remember absolutely everything they see. They're usually fairly challenged cognitively, but they can remember a whole phone book just by reading it once. Have you ever heard of Rain Man?
I would say that if it happened, it would be overall a good thing. It's too powerful a thing to be in the hands of a few persons. I don't believe anyone has the magical ethics to be able to decide for or "protect" humanity from any bad outcome. Actually the other way around: in trying to do good, but without the input of the rest of humanity, they are surely going to end up doing evil.
Tavis Ormandy found Zenbleed, where the CPU was leaking register data across processes. I think hardware vulnerability security testing is in its infancy, and he's one pioneer using software.
The goal is to build things in America, Canada, Europe, etc. by said people. The thing is, Chinese Canadians are also Canadian, and Chinese Americans are also American. It is not possible to ignore the issues that arise from people who have links to, or are literally part of, the Chinese state in the aforementioned countries. Also, there is nothing wrong with being proud of your roots, or of having a direct association with the People's Republic of China. You don't really want a Chinese nationalist to actively manage a data center when there are other people who are perfectly capable. I think people who can't differentiate the PRC and Chinese people are an issue, just as it is true that companies dealing with critical tech should be aware of people who have links to other states.
I'd rather not have any nationalist actively manage a data center, thank you very much. ...assuming we're using the term "nationalist" in a fanatical/irrational sense here.
@@bruceli9094 A bit of the same issue. A huge nationalism problem that puts Indian or Sikh values over Canada or America. In Canada there are riots where these two groups beat the hell out of each other. There have been assassinations and terrorist attacks. You have to prioritize the needs of the country above everything. I can tell you as an immigrant that some of the people who move to North America are a testament to bad screening practices. In Canada there have been cases of Chinese nationals who were somehow allowed to work in defense programs, took blueprints from frigates and signaling codes, and handed them to the Chinese state. In the case of the UK, they had a guy who worked in their nuclear program steal blueprints and recreate the bomb in Pakistan. So the whole "it doesn't matter if a person is loyal to the country" argument is ridiculous.
@@stefanstankovic4781 You want to ensure that your tech companies and data centers filter out people who have direct ties to foreign states. Canada has suffered a lot of security breakdowns due to a lack of oversight and security clearance. It is very simple: you don't have to like American or Western doctrine, but as long as you are Western, you will be targeted. So you don't want people whose entire goal is to disrupt the environment you work and live in to control your data.
Not being an IT guy, I'm a little bit confused. I thought that open AI meant open source code, which I thought means that anyone can copy and use it and even modify it?
A little bit misleading... obviously "Nothing" happens. It's like asking, what if Actor A steals the open source content for public plays... There are so many open source near equivalents to GPT-4 now. And the data is simply out there to be scraped -- without having to do any hacking at all.
It’s crazy to me that these models are so huge. I do wish many of these would be released entirely to the public. Even with the risks, I think open source and open development lead to the best long term production for everyone
I wonder whether NN model parameters fall under the copyright law, or if not, whether there's anything protecting it from copying. It's not really art, and it's not clear whether it's even a human creation.
While I refuse to call things like ChatGPT "AI", I can't deny that the security and usage scenarios fascinate me to no small degree. Partly because of my work background in security, but also because of how, with little regard for what they type in, these text generators are being used. When companies actively have to go out on their internal communication channels and say "don't put personal or business data into [insert system here]", then you know that access controls, usage rules, and filters on people are basically non-existent. Some years back MS did a video on how the various security layers in their datacentres are supposed to work (or was it AWS?). A good watch but, as with all things, a bit rosy. I worked at a company that had what they called a "secure facility". It was in fact so secure that when a cleaner was going to clean one of the server rooms, they yanked out a cable to run their machine... and 3/4 of the servers just stopped responding. Very secure indeed.
That's why OpenAI is planning to have their own hardware; whoever controls the hardware controls the model (which would only be able to run on that specific hardware).
Pretty sure they don't? The CEO opened a new company and used the name OpenAI to sell it to investors, but I'm pretty sure that new entity has nothing to do with OpenAI (and is fully for-profit).
Unless they are running their models off of quantum computers, it makes no difference. At the end of the day it's still just matrix multiplication in a specific order.
TLDR: Databases with user and business information are far more valuable. I honestly doubt anything will happen. You need a whole infrastructure and competent staff to make use of these large models. Stealing those is completely pointless. You can't even really do ransomware with it (albeit you mentioned personal data might be used in the training set, there are ways to alter such data enough to remove personally identifiable information). There is honestly nothing to worry about here in my opinion.
Secure multi-party computation allows sensitive data to be processed in secret without revealing the plaintext; however, this is more to protect medical data for research and such. To protect a language model or some other complex business logic, it is best not to put the code in the hands of the attacker, and to use the glovebox / API methods to interact with the sensitive IP without revealing it. Everything is so easy to hack; all your XPUs belong to me!
Ignore this, I am doing research and my own comment will show at the top when I revisit this.
13:53 "Model Leeching: An Extraction Attack Targeting LLMs" — attacked a small LLM
14:39 "Membership Inference Attacks on Machine Learning: A Survey"
14:50 "Reconstructing Training Data from Trained Neural Networks" — goes into how extracting training data can lead to copyright lawsuits
16:10 Insider threats — "Two Former Twitter Employees and a Saudi National Charged as Acting as Illegal Agents of Saudi Arabia" (URL not shown)
16:58 Verizon 2023 Data Breach Investigation — not sure if useful but it's recent
And your interface for the model is not going to parse the model's parameters using a Commodore 64 either. You will need some serious silicon to really make use of it.
I have yet to see any positive impacts of any of this stuff. I'm kinda deep into the state of the art of research in this field, and it's really not that impressive. The only thing I can see is that it replaces many stupid people in useless job positions. Whether that's positive or negative is yet to be seen.
There are so many open-source LLMs trained on public resources that it is a moot point. Proprietary will never be able to keep up with open as far as rate of improvement goes. When I last checked there were something like a dozen different LLMs, most of them coming out of China but plenty coming from other places in the world. They've all been trained on different data sets, and many are up to GPT-3.5 equivalence, reached exponentially faster than it took OpenAI to get to the same level. Honestly the big bottleneck is the same for everyone, and that is inference. Processing prompts is an expensive proposition. I've seen home systems with multiple GPUs that still are not performant enough to be real-time. As of right now, only the largest online services and state actors can afford inference that performs reasonably; that is the only thing that prevents true democratization of AI at this point.
What will happen? Nothing, nothing at all, world will not implode, internet will be fine, LLM will give same half assed answers as before, maybe some stock numbers will fall and poor ceo heads will roll, but I'm fine with that.
Purely open source models are not far behind Chat-GPT, and are advancing rapidly. We are approaching a tipping point: AI that is able to goal-seek and self-optimize, at which point curation of training data will no longer be much of an obstacle. AI will do it. The cat is almost out of the bag. It's probably too late to contain it. One obstacle remains: compute cycles. Training requires a lot of them. But advances are coming there, too - more compact models and better, cheaper chips tailored for training. AI is moving at blinding speed now. Anything proprietary you could steal will soon be obsolete - and even open source models will quickly surpass what was stolen. AI will fall into hands we might prefer not get it. No security protocols could prevent it, I'm thinking. What happens next, I can't even begin to guess.
Yan Xu has created a somewhat misleading graphic. For both GPT-2 and GPT-3, the architecture doesn't involve separate *decoders* in the way that some other neural network architectures do (like the Transformer model, which has distinct encoder and decoder components). Instead, GPT-2 and GPT-3 are based on the Transformer architecture, but they use only the decoder part of the original Transformer model. What Yan probably refers to are not decoders but *layers*: GPT-2 has four versions with the largest having 48 layers. GPT-3 is much larger, with its largest version having 175 billion parameters across 96 layers.
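The layer and parameter figures above can be sanity-checked with a common back-of-envelope formula for decoder-only transformers: parameters ≈ 12 · n_layers · d_model², counting attention and MLP blocks and ignoring embeddings. The widths used (1600 for GPT-2 XL, 12288 for GPT-3) are from the public papers:

```python
# Rule-of-thumb parameter count for a decoder-only transformer:
# ~12 * n_layers * d_model^2 (attention + feed-forward weights,
# embeddings excluded). Layer counts and widths are public figures.

def approx_params(n_layers: int, d_model: int) -> float:
    return 12 * n_layers * d_model ** 2

print(f"GPT-2 XL: ~{approx_params(48, 1600) / 1e9:.1f}B parameters")
print(f"GPT-3:    ~{approx_params(96, 12288) / 1e9:.0f}B parameters")
```

The estimates land close to the published 1.5B and 175B totals, which supports the correction that 48 and 96 are layer counts, not "decoders".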
We were called paranoid and delusional for decades, all the while pushing the industry forward bit by bit :) Steganography, differential analysis, red/blue/purple teams are well known. The industry of secrets is beyond fascinating. It is impossible to make even a fair short list of the geniuses in this space; I will suggest Diffie, Bruce S., Rivest et al. among giants. I did, and continue to do, my insignificant parts, and the only person I've ever been intimidated by was Whitfield. Keep up the excellent content, let's all practice humility, and stay curious 🌻
I’m not terribly worried about any of these models being stolen or otherwise made non-proprietary by malicious actors. State of the art models only remain state of the art for a few months. We went from GPT 1 to GPT 4 in just 5 years. We went from DALL-E to DALL-E 3 in 33 months. Worst case scenario is that the stolen ‘foundational’ model becomes obsolete in 12-18 months, and likely much sooner unless it’s stolen immediately after being released. And that assumes that competing models don’t surpass it either.
Well, considering how big the file is, stealing the parameters would be nearly impossible to do unnoticed. And even if it were possible, it would require physically moving the storage instead of transferring it over the internet.
Honestly, given the importance of AI and the significant advantage of those who have full access to AI compared to those who don't, NOTHING about AI should be proprietary. OpenAI especially still pisses me off. AI was promised to be open source. Now we are further from that than ever (despite much of the foundations actually being created as open-source code).
FWIW, I think the problem of stealing intellectual property is overblown, because if you rely on copying someone else's work, you have fewer resources to develop your own knowledge in the field, you are always one or two generation behind, and you don't know what to copy until the market decides what is successful. A business which relies on copying will never develop the institutional knowledge of all the hard work which is never published and can't be copied. A business which wants to do both has to put a lot more resources into the redundant efforts. A State-sponsored business might look like it has solved the money problem, but money is not resources, it is only access to resources, and States can only print money, not resources. The more inefficient a State-sponsored business is, the higher the opportunity cost, the fewer other fields can be investigated or exploited. It's one reason I do not fear CCP expats stealing proprietary IP; it weakens the CCP overall. The more they focus on copying freer market leaders, the more fields they fall behind in.
Copying is a viable strategy when you are significantly behind the market. It allows you to keep pace with fewer resources. It might not be something China would be satisfied with, but other players with fewer resources, like North Korea or Iran, would definitely find value in it.
Except most development is based upon prior work. When you have a bot that can churn thru a million patents and papers, it can put A, B, and Z together better than any human, or even collection of humans, can. The intellectual-theft problem isn't in the stealing of the LLM; it's the theft of the documents or works by the company that builds the training model. It's common to pay for research papers and for books etc. The claim is that they are scraping the internet for these documents without compensation or paying royalties. Yeah, the CCP being able to develop a 5th-gen fighter aircraft really weakened them. More insidious is that authoritarian states like the PRC have institutionalized IP theft. They do this by forcing expats to spy, extorting them with implied threats to family and themselves. Chinese nationals really are a security threat to other countries and companies. That isn't sinophobia, it's just reality.
@@obsidianjane4413 then you'd be stuck in the same quandary as the folks at the Manhattan Project when they were looking for "Jewish Communist Spies", and never suspected that the German-born Englishman Klaus Fuchs was the Soviet Spy after all. "Intellectual theft" is just a politician's word for "Corporate espionage" or "headhunting for skilled experts". Only idiots cut off their own nose to spite their face; there are plenty of ways businesses and industries insulate themselves from IP theft without kicking out highly capable workers from their potential hiring pool.
If all chips had a unique identifier value, then couldn't you encode data to be executed only on a specific set of chips? Then you could simply forget about all the headaches of theft. Data would then be "secured once, executed multiple times" (on a set list of CPUs).
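The idea above can be sketched in a few lines: derive an encryption key from a chip's unique ID, so ciphertext produced for one device can only be decrypted on that device. This is a hypothetical toy (the chip IDs and function names are invented); real schemes like TPM or SGX "sealing" are far more involved:

```python
import hashlib
import hmac

# Hypothetical sketch: bind a key to a chip's unique identifier so
# stolen data is useless on any other machine. Illustrative only.

def device_key(chip_id: bytes, context: bytes = b"model-weights") -> bytes:
    """Derive a 32-byte key bound to this chip's identifier."""
    return hmac.new(chip_id, context, hashlib.sha256).digest()

key_a = device_key(b"CHIP-0001")
key_b = device_key(b"CHIP-0002")

print(key_a != key_b)                      # different chips, different keys
print(key_a == device_key(b"CHIP-0001"))   # same chip, same key
```

The catch, as the thread implies, is that the attacker who can read the weights can often also read or spoof the identifier, which is why real implementations keep the ID and the key derivation inside tamper-resistant hardware.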
Whoever knows better please correct me, but I'm pretty sure the source code of model, most likely alongside dataset (but probably on different storage devices, both physically and virtually), is stored somewhere on a machine that isn't connected to the web at large, if connected to any kind of network at all. That doesn't eliminate the risk of data being stolen, but you need to be physically present at the storage site fairly close to the computer (like *really* close) with a SATA cable shaped in a way that would allow it to serve as an antenna, or something like that. I expect OpenAI to take at least that kind of precaution, but who knows, dumb screwups happen in IT just as well.
There is no "source code" of the model; the model is the output of the training program, which takes petabytes of text as its input, plus RLHF (reinforcement learning from human feedback), which was missed out of the description and is arguably the hardest part to replicate. Search for OpenAI's "Learning from human preferences" paper.
@@maht0x yeah, sounds like it. Thank you for correction. My familiarity with ML/NNs is superficial, I know a couple of high-level concepts and a very coarse approximation of how it works under the hood.
It happened already: researchers could extract training data from GPT by having it repeat a word over and over, and it spat out this data, even personal details from whoever wrote the data in the LLM. OpenAI has closed the door by now, noting this is against OpenAI's terms. But is that solid enough? A lot of research has to be done to close that one off.
Remember what happened with the Bomb when only one guy had it ? Strangely, the use of the Bomb stopped when the Other Guy got one too ! Probably a good argument for open source from here on.
See, using their data (in the form of responses) to train your own AI isn't an "attack" per se. It's just doing what they did. And it shows culpability (IANALATINLA) that they see using third-party data to train an AI as wrong. I bet the measures they took to prevent it happening to them will be subpoenaed and used against them in lawsuits. I mean, if internally they call using data to train without permission an "attack", that's pretty damning!!
If it is theoretically cheaper to steal the data than to reproduce or create something able to compete with it, the question of the security of the data is a matter of when not if. We should all be asking when this will happen, and an even more troubling question is if that when has already passed.
I would split up their model among machines based on subject areas of knowledge. Each server running its own “department” at what I’m dubbing ChatGPT University 🎓
I disagree that the theft of data sets is a murky thing. The only justification for the grey area is a handful of tech CEOs claiming to be above the law. As seen with Uber, our legislatures are exactly dumb/corrupt enough to go along with the idea.
I'm surprised Meta's (Facebook) LLaMA isn't mentioned; their model was literally leaked onto the internet, so starting with Llama 2 Meta just releases it to the public. It's all over Hugging Face.
Maybe not. I'd say everyone would rather build the model themselves than go through this hassle. If it's 80% as good, that means it's not good enough.