A very good friend of mine did some recent work upgrading storage for the research division of a very large pharmaceutical corporation. Their security protocols were good but inflexible, creating motivation to work around restrictions that slowed the upgrade to a near standstill. The financial incentives, combined with a sense of hubris, resulted in several major security controls being temporarily bypassed in ways that weren't fully auditable. If an insider was waiting for the moment when exfiltration of very expensive and proprietary data and software was possible, then they got their chance. Security is always in tension with getting work done, and there's no such thing as perfect security.
Even on my level, typing my password every time I wake up my computer gets on my nerves. Encrypted files are fun as well. I have removed security numerous times, only to swing the other way worrying about malware etc. This is on my personal PC. I worked for decades at an aerospace company that had sign-in and log-on requirements that were super annoying to repeat many times a day. Then it seemed I had to change my password all the time; everyone had to do it every 3 months or so. To the point that I had rolled through the entire alphabet as the last character and was well into the upper case when I retired.
@@fxsrider I know that same frustration with frequently having to change passwords at aerospace companies, having worked for a couple of them myself. It was an open secret at one of them that everyone left post-it notes with their most recent password under the keyboard.
As an engineer I've never once seen a company that wasn't compromised by China. China has a lot of people trying, and small US companies are such easy fodder. People act like best security practices simply existing somewhere makes the tech world safe... but if you graphed population vs. competency of IT, it would look like wealth in the US: almost all of the high competency is in a very small number of people. The other 99% are abysmal. It's hard to be smart enough about security now; there are so many attack vectors, and corporations see security as an expensive cost with a low risk of punishment, so they justify not paying for it. And tbh the amount of money needed to compete for those few people who are actually very competent might not be worth it to the company.
Llama 2 is now almost at the level of GPT-3.5 even without breaches, and Llama 3 might be at the level of GPT-4. In that case, since the Llama series is open source, the question of what would happen if GPT-4 were stolen might become moot and academic, since anyone can just download open-source Llama, which at some point in the near future might reach the level of GPT-4.
@@ebx100 this video's topic is about the ramifications of GPT-4 getting stolen. With a stolen model, you don't even have the option to pay for it; you go straight to jail.
@@ebx100 it doesn't prevent people from commercializing it covertly. To prove it, you would need to show that a certain work was done by a specific AI, something we currently cannot do.
Common Crawl doesn't provide data only for machine learning; it's for research of all sorts. And the 45 TB number is inaccurate: the dataset is measured in petabytes.
This already happened in the image generation space when the NovelAI model got leaked from a badly secured GitHub account, downloaded, and used as a (somewhat) foundational model for a lot of anime image generation models.
Small correction: SSH is used for remotely operating (Unix and Linux) machines; for API and web traffic it's more common to use TLS (also called SSL colloquially; technically SSL is the older protocol).
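To make the TLS point concrete: a minimal sketch using Python's standard `ssl` module, showing that the default client context verifies certificates and hostnames out of the box (no particular server is assumed).

```python
import ssl

# Build a client-side TLS context with sane defaults:
# certificate verification and hostname checking are enabled.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # certificates are verified
print(context.check_hostname)                    # hostnames are checked
```

This is what sits under every `https://` API call; SSH never enters the picture for ordinary web traffic.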
Just a matter of time before we see a "Folding@home"-equivalent project that can train a single model in a distributed, decentralized way. Then it isn't about theft, but about what can be done with such a tool...
It is darkly hilarious to watch AI companies spend enormous effort and resources to fend off the theft of their models, when the models themselves were built off of stolen and public-domain data.
I think the AI companies are probably more afraid of an open source competitor that makes all of these protections irrelevant. There's no need to steal something built on publicly accessible information with enough time and effort.
0:44 I don't know how successful OpenAI would be in enforcing the proprietary nature of their model if it leaked. It's built upon mountains of stolen and misappropriated data after all.
If AI is based on what's on the internet, it's gonna be the dumbest thing around. Garbage in, garbage out. It'll probably become sentient and commit suicide from depression.
What people think large language models are: Skynet, HAL-9000 What large language models really are: Your keyboard's predictive text if it read the entirety of Reddit
This x1000. The fear mongering over AI is way overblown. The models are useless without new human-created data to feed into the system. My CS professor pointed out that if people stop posting on Stackoverflow or Quora because now they’re using ChatGPT instead, then it will just regurgitate old info and get outdated very quickly. It turns into this weird bootstrap paradox feedback loop where “knowledge” effectively stagnates.
That's because AIs don't store their answers. I don't know how many times it has to be explained that AIs are not lookup tables. They're not compression, lossy or otherwise. That's made up by the NoAI crowd to try to pretend a generative AI can't produce original work. It was literally never true. Compression doesn't let you retrieve something that wasn't in the original dataset.
The effectiveness of generating training data off of existing public models has been really impressive. The open source community has been embracing it, for obvious reasons, with some real results. Right now, fine-tuning off of generated data is where it's most used.
Generation of spam so complex that it will fool 90% of laymen. Help with the creation of fake bank login landing pages and fake shopping sites. There's all kinds of stuff that's possible: voice spoofing, fake news generation, propaganda creation.
For LLMs to be truly embedded all around people's lives, they need to be open sourced. There are many important things that can be done with GPT-4, like using it to automate corporate paperwork, to aid peer review of scientific research, to summarize and investigate documents, etc. What Microsoft is doing will never accomplish these. The closed-source nature also ensures that there can't be anything better than what they've got, essentially inhibiting any proper growth and application.
The data is available and there are open source models with close to equivalent performance. The problem is the cost curve for more advanced queries. The leaders in AI will likely be determined by access to efficient hardware, not anything else. Worrying about protecting weights, while it shouldn't be ignored, is the wrong direction.
@@SalivatingSteve I don't think improvements to current hardware architectures are going to get AI past the coming hardware wall. You are going to be looking at something different.
You have to remember he mentions state actors many times during the presentation, so a lot of the hardware/software/resource limitations for anonymous hackers don't really apply. State actors can easily have servers to store petabytes of information and multiple high-speed connections for download.
I think the reason data has to be exfiltrated slowly is that it probably sits behind hardware that limits the speed of any outgoing network connection.
@@aspuzling it has to be done stealthily, with lots of connections masked to look like normal traffic, because trying to download a massive amount of data to a single user would raise red flags.
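Some back-of-envelope math on why slow exfiltration matters. All the numbers below (a 1 TB weights file, the link speeds) are illustrative assumptions, not figures from the video:

```python
# Rough, illustrative math: how long does it take to sneak out a
# 1 TB weights file if the outgoing connection is throttled or must
# stay under an anomaly-detection threshold?

def exfil_days(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Days needed to move size_bytes at a sustained bit rate."""
    return size_bytes * 8 / rate_bits_per_sec / 86_400

one_tb = 1e12  # decimal terabyte

print(f"{exfil_days(one_tb, 1e9):.2f} days at 1 Gbps")   # fast link: hours
print(f"{exfil_days(one_tb, 10e6):.1f} days at 10 Mbps") # throttled: ~9 days
```

At a stealthy trickle, the transfer takes days to weeks, which is exactly why an attacker would spread it across many normal-looking connections.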
A verbatim copy of those 1 TB of weights would not be valuable for very long, as I am sure OpenAI is continually updating and refining the model, and I am sure they already have the next big thing in the pipeline. It would just be a momentary snapshot with a fixed knowledge cutoff.
The training data, however, is so precious it would warrant a massive ransom, as its public release would see every IP holder suing the company. In several jurisdictions you are required to protect your copyright against violations, and not suing OpenAI might eventually be interpreted by a judge as not caring whether your work appears in any AI.
What if someone steals the collective writing of humanity, every book, news article, and Reddit post ever written, and uses it to train a model they then consider a proprietary trade secret? Can you really 'steal' something that was already stolen and hoarded?
Apparently the use of synthetic data is supposed to avoid DMCA or copyright issues, as well as speed up processing, but I had to look up synthetic data: en.wikipedia.org/wiki/Synthetic_data
@@TwistersSK8 Uhhh... not the same as a machine "reading" the book. Isn't that obvious? Pirates made a similar argument that copying digital files isn't theft because the owner still has the original copy. Maybe you should try stealing this book?
8:45 Look, it isn't that we here in the US don't appreciate the contributions of Chinese nationals (and others too) to our infrastructure projects. We do. The issue is that if you have family, real estate, or other ties to China, or if you LIE about those ties, then you are susceptible to being manipulated, blackmailed, or otherwise vulnerable to coercion by regimes that can snap their fingers and send your parents or children into a gulag. That's why you guys get your clearances held up. It's not that we don't like you guys, it's that we have to face the cold hard facts about what happens when someone gets their arm twisted by the Ministry of State Security.
The companies would immediately be massively sued if the training data leaked, as it gives every party with works in it the possibility of suing the company. It is an unwinnable battle, as hundreds, potentially tens of thousands, of IP holders would sue ChatGPT and OpenAI.
Data at rest is only encrypted for the layers below the encryption process. If done by the OS, the client of the OS sees the data in the clear, so which layer does the encryption is important. For encryption of data in use, Intel SGX is a very common way to secure cloud payloads; however, an application vulnerability in the code running in SGX negates the security properties of SGX. This is why languages such as Rust should be used, and the number of lines running inside the enclave needs to be limited as much as possible to reduce the attack surface. A man-in-the-process attack on such an enclave is very hard to detect.
Meh. The LLM datasets are less important than the algorithms that build them. GPT is just a chatbot. A big, good training set is valuable for its functionality and the cost it took to build. Lots of datasets are being built these days. They are going to be like cryptos: the first one was valuable, but then everyone made one and the value of all of them dropped. Chatbots are good at "talking", as in they can predict what a human would say based upon the keywords in the prompt input. But the model does not "know" or "think" anything. Most of them are dumb. Their best utility is in making serendipitous connections of concepts and ideas from masses of data.
What do you think a human mind is, but lots of chatbots talking with each other, supervising each other's output, correcting, analysing, reviewing, rating, and amending in a way that creates the epiphenomenon of intelligence?
I would think Quora is a massive source of conversational Q&A made available, and contributes to the dataset unfettered. Adam D'Angelo is basically a senior board member at both companies. Also, what OpenAI did with going live on such a simple interface was a 100% stroke of genius. I firmly believe this format allowed not only for training but also for a very solid baseline of what humanity cares about within the data set; otherwise there is just way too much data to model on. This doubly bootstrapped a 'scope' to start from and trained errors out based on the acceptance of the result to a query. This is probably some secret sauce as to why they're able to iterate so fast. It's the end user.
The biggest risk of GPT "theft" is simply an employee walking out the door with the knowledge of GPT. In California you cannot stop an employee from using what they remember. You can stop them from taking files with them, however. It's a delicate balance, but in general "information wants to be free" and it's hard to keep stuff proprietary. At the core, GPT is matrix multiplication, which cannot be copyrighted per se.
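The "just matrix multiplication" point can be shown in a few lines. The weights and input below are made-up toy numbers; the point is only that a model's parameters are plain arrays of numbers and inference is largely repeated matrix products:

```python
# Toy illustration: a "layer" of a model is a matrix of numbers, and
# a forward pass is a matrix multiplication. Values are invented.

def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

weights = [[0.5, -1.0], [2.0, 0.25]]   # a tiny 2x2 "layer"
activation = [[1.0, 2.0]]              # a 1x2 input row vector

print(matmul(activation, weights))     # -> [[4.5, -0.5]]
```

A real model just does this at vastly larger scale, layer after layer, which is why the weights file itself is nothing but numbers.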
@@vvvv4651 it's more important how the training data was organized, structured, and formatted, and what the training methods were. If this information were really that secret, other LLMs would be far, far behind.
@@vvvv4651 there are a few special people who remember absolutely everything they see. They're usually fairly challenged cognitively, but they can remember a whole phone book just by reading it once. Have you ever heard of Rain Man?
I would say that if it happened, it would be overall a good thing. It's too powerful a thing to be in the hands of a few persons. I don't believe anyone has the magical ethics to be able to decide for or "protect" humanity from any bad outcome. Actually the other way around: in trying to do good, but without the input of the rest of humanity, they are surely going to end up doing evil.
Tavis Ormandy found Zenbleed, where the CPU was leaking register data across processes. I think hardware vulnerability security testing is in its infancy, and he's one pioneer using software.
The goal is to build things in America, Canada, Europe, etc. by said people. The thing is, Chinese Canadians are also Canadian, and Chinese Americans are also American. It is not possible to ignore the issues that arise from people who have links to, or are literally part of, the Chinese state in the aforementioned countries. Also, there is nothing wrong with being proud of your roots, or of having a direct association with the People's Republic of China. You don't really want a Chinese nationalist to actively manage a data center when there are other people who are perfectly capable. I think people who can't differentiate the PRC and Chinese people are an issue, just as it is true that companies dealing with critical tech should be aware of people who have links to other states.
I'd rather not have any nationalist actively manage a data center, thank you very much. ...assuming we're using the term "nationalist" in a fanatical/irrational sense here.
@@bruceli9094 A bit of the same issue. A huge nationalism problem that puts Indian or Sikh values over Canada or America. In Canada there are riots where these two groups beat the hell out of each other. There have been assassinations and terrorist attacks. You have to prioritize the needs of the country above everything. I can tell you as an immigrant that some of the people who move to North America are a testament to bad screening practices. In Canada there have been cases of Chinese nationals who were somehow allowed to work in defense programs, took blueprints from frigates and signaling codes, and handed them to the Chinese state. In the case of the UK, they had a guy who worked in their nuclear program steal blueprints and recreate the bomb in Pakistan. So the whole "it doesn't matter if a person is loyal to the country" argument is ridiculous.
@@stefanstankovic4781 You want to ensure that your tech companies and data centers filter out people who have direct ties to foreign states. Canada has suffered a lot of security breakdowns due to a lack of oversight and security clearance. It is very simple: you don't have to like American or Western doctrine, but as long as you are Western, you will be targeted. So you don't want people whose entire goal is to disrupt the environment you work and live in to control your data.
Not being an IT guy, I'm a little bit confused. I thought that open AI meant open source code, which I thought means that anyone can copy and use it and even modify it?
A little bit misleading... obviously "Nothing" happens. It's like asking, what if Actor A steals the open source content for public plays... There are so many open source near equivalents to GPT-4 now. And the data is simply out there to be scraped -- without having to do any hacking at all.
It’s crazy to me that these models are so huge. I do wish many of these would be released entirely to the public. Even with the risks, I think open source and open development lead to the best long term production for everyone
I wonder whether NN model parameters fall under the copyright law, or if not, whether there's anything protecting it from copying. It's not really art, and it's not clear whether it's even a human creation.
While I refuse to call things like ChatGPT "AI", I can't deny that the security and usage scenarios fascinate me to no small degree. Partly because of my work background in security, but also because of how, with little regard for what they type in, these text generators are being used. When companies actively have to go out on their internal communication channels and say "don't put personal or business data into [insert system here]", then you know that access controls, usage rules, and filters on people are basically non-existent. Some years back MS did a video on how the various security layers in their datacentres are supposed to work (or was it AWS?). A good watch but, as with all things, a bit rosy. I worked at a company that had what they called a "secure facility". It was in fact so secure that when a cleaner was going to clean one of the server rooms, they yanked out a cable to run their machine... and 3/4 of the servers just stopped responding. Very secure indeed.
That's why OpenAI is planning to have their own hardware; whoever controls the hardware controls the model (which would only be able to run on that specific hardware).
Pretty sure they don't? The CEO opened a new company and used the name OpenAI to sell it to investors, but I'm pretty sure that new entity has nothing to do with OpenAI (and is fully for-profit).
Unless they are running their models off of quantum computers, it makes no difference. At the end of the day it's still just matrix multiplication in a specific order.
TLDR: Databases with user and business information are far more valuable. I honestly doubt anything will happen. You need a whole infrastructure and competent staff to make use of these large models. Stealing those is completely pointless. You can't even really do ransomware with it (albeit you mentioned personal data might be used in the training set, there are ways to alter such data enough to remove personally identifiable information). There is honestly nothing to worry about here in my opinion.
Secure multi-party computation allows sensitive data to be processed in secret without revealing the plaintext; however, this is more to protect medical data for research and such. To protect a language model or some other complex business logic, it is best not to put the code in the hands of the attacker, and to use the glovebox / API methods to interact with the sensitive IP without revealing it. Everything is so easy to hack; all your XPUs belong to me!
Ignore this, I am doing research and my own comment will show at the top when I revisit this.
13:53 "Model Leeching: An Extraction Attack Targeting LLMs" — attacked a small LLM
14:39 "Membership Inference Attacks on Machine Learning: A Survey"
14:50 "Reconstructing Training Data from Trained Neural Networks" — goes into how extracting training data can lead to copyright lawsuits
16:10 Insider threats — "Two Former Twitter Employees and a Saudi National Charged as Acting as Illegal Agents of Saudi Arabia" (URL not shown)
16:58 Verizon 2023 Data Breach Investigation — not sure if useful but it's recent
And your interface for the model is not going to parse the model's parameters using a Commodore 64 either. You will need some serious silicon to really make use of it.
I have yet to see any positive impacts of any of this stuff. I'm kinda deep into the state of the art of research in this field, and it's really not that impressive. The only thing I can see is that it replaces many stupid people in useless job positions. Whether that's positive or negative is yet to be seen.
There are so many open-source LLMs trained on public resources that it is a moot point. Proprietary will never be able to keep up with open as far as rate of improvement goes. When I last checked there were something like a dozen different LLMs, most of them coming out of China but plenty coming from other places in the world. They've all been trained on different data sets, and many are up to GPT-3.5 equivalence, reached exponentially faster than it took OpenAI to get to the same level. Honestly the big bottleneck is the same for everyone, and that is inference. Processing prompts is an expensive proposition. I've seen home systems with multiple GPUs that still are not performant enough to be real-time. As of right now, only the largest online services and state actors can afford inference that performs reasonably; that is the only thing that prevents true democratization of AI at this point.
What will happen? Nothing, nothing at all, world will not implode, internet will be fine, LLM will give same half assed answers as before, maybe some stock numbers will fall and poor ceo heads will roll, but I'm fine with that.
Purely open source models are not far behind Chat-GPT, and are advancing rapidly. We are approaching a tipping point: AI that is able to goal-seek and self-optimize, at which point curation of training data will no longer be much of an obstacle. AI will do it. The cat is almost out of the bag. It's probably too late to contain it. One obstacle remains: compute cycles. Training requires a lot of them. But advances are coming there, too - more compact models and better, cheaper chips tailored for training. AI is moving at blinding speed now. Anything proprietary you could steal will soon be obsolete - and even open source models will quickly surpass what was stolen. AI will fall into hands we might prefer not get it. No security protocols could prevent it, I'm thinking. What happens next, I can't even begin to guess.
Yan Xu has created a somewhat misleading graphic. For both GPT-2 and GPT-3, the architecture doesn't involve separate *decoders* in the way that some other neural network architectures do (like the Transformer model, which has distinct encoder and decoder components). Instead, GPT-2 and GPT-3 are based on the Transformer architecture, but they use only the decoder part of the original Transformer model. What Yan probably refers to are not decoders but *layers*: GPT-2 has four versions with the largest having 48 layers. GPT-3 is much larger, with its largest version having 175 billion parameters across 96 layers.
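The layer and parameter figures above can be sanity-checked with a common back-of-envelope formula for decoder-only transformers: parameters ≈ 12 · n_layers · d_model², counting attention and MLP blocks and ignoring embeddings. The widths used (1600 for GPT-2 XL, 12288 for GPT-3) are from the public papers:

```python
# Rule-of-thumb parameter count for a decoder-only transformer:
# ~12 * n_layers * d_model^2 (attention + feed-forward weights,
# embeddings excluded). Layer counts and widths are public figures.

def approx_params(n_layers: int, d_model: int) -> float:
    return 12 * n_layers * d_model ** 2

print(f"GPT-2 XL: ~{approx_params(48, 1600) / 1e9:.1f}B parameters")
print(f"GPT-3:    ~{approx_params(96, 12288) / 1e9:.0f}B parameters")
```

The estimates land close to the published 1.5B and 175B totals, which supports the correction that 48 and 96 are layer counts, not "decoders".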
We were called paranoid and delusional for decades, all the while pushing the industry forward bit by bit :) Steganography, differential analysis, red/blue/purple teams are well known. The industry of secrets is beyond fascinating. It is impossible to make even a fair short list of the geniuses in this space; I will suggest Diffie, Bruce S., Rivest et al. among giants. I did, and continue to do, my insignificant parts, and the only person I've ever been intimidated by was Whitfield. Keep up the excellent content, let's all practice humility, and stay curious 🌻
I’m not terribly worried about any of these models being stolen or otherwise made non-proprietary by malicious actors. State of the art models only remain state of the art for a few months. We went from GPT 1 to GPT 4 in just 5 years. We went from DALL-E to DALL-E 3 in 33 months. Worst case scenario is that the stolen ‘foundational’ model becomes obsolete in 12-18 months, and likely much sooner unless it’s stolen immediately after being released. And that assumes that competing models don’t surpass it either.
Well, considering how big the file is, stealing the parameters would be nearly impossible to do unnoticed. And even if it were possible, it would require physically moving the storage instead of transferring it over the internet.
Honestly, given the importance of AI and the significant advantage of those who have full access to AI compared to those who don't, NOTHING about AI should be proprietary. OpenAI especially still pisses me off. AI was promised to be open source. Now we are further from that than ever (despite much of the foundations actually being created as open-source code).
FWIW, I think the problem of stealing intellectual property is overblown, because if you rely on copying someone else's work, you have fewer resources to develop your own knowledge in the field, you are always one or two generation behind, and you don't know what to copy until the market decides what is successful. A business which relies on copying will never develop the institutional knowledge of all the hard work which is never published and can't be copied. A business which wants to do both has to put a lot more resources into the redundant efforts. A State-sponsored business might look like it has solved the money problem, but money is not resources, it is only access to resources, and States can only print money, not resources. The more inefficient a State-sponsored business is, the higher the opportunity cost, the fewer other fields can be investigated or exploited. It's one reason I do not fear CCP expats stealing proprietary IP; it weakens the CCP overall. The more they focus on copying freer market leaders, the more fields they fall behind in.
Copying is a viable strategy when you are significantly behind the market. It allows you to keep pace with fewer resources. It might not be something China would be satisfied with, but other players with fewer resources, like North Korea or Iran, would definitely find value in it.
Except most development is based upon prior work. When you have a bot that can churn thru a million patents and papers, it can put A, B, and Z together better than any human, or even collection of humans, can. The intellectual-theft problem isn't in the stealing of the LLM; it's the theft of the documents or works by the company that builds the training model. It's common to pay for research papers and for books etc. The claim is that they are scraping the internet for these documents without compensation or paying royalties. Yeah, the CCP being able to develop a 5th-gen fighter aircraft really weakened them. More insidious is that authoritarian states like the PRC have institutionalized IP theft. They do this by forcing expats to spy, extorting them with implied threats to family and themselves. Chinese nationals really are a security threat to other countries and companies. That isn't sinophobia, it's just reality.
@@obsidianjane4413 then you'd be stuck in the same quandary as the folks at the Manhattan Project when they were looking for "Jewish Communist Spies", and never suspected that the German-born Englishman Klaus Fuchs was the Soviet Spy after all. "Intellectual theft" is just a politician's word for "Corporate espionage" or "headhunting for skilled experts". Only idiots cut off their own nose to spite their face; there are plenty of ways businesses and industries insulate themselves from IP theft without kicking out highly capable workers from their potential hiring pool.
If all chips had a unique identifier value, then couldn't you encode data to be executed only on a specific set of chips? Then you could simply forget about all the headaches of theft. Data would then be "secured once, executed multiple times" (on a set list of CPUs).
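The idea above can be sketched in a few lines: derive an encryption key from a chip's unique ID, so ciphertext produced for one device can only be decrypted on that device. This is a hypothetical toy (the chip IDs and function names are invented); real schemes like TPM or SGX "sealing" are far more involved:

```python
import hashlib
import hmac

# Hypothetical sketch: bind a key to a chip's unique identifier so
# stolen data is useless on any other machine. Illustrative only.

def device_key(chip_id: bytes, context: bytes = b"model-weights") -> bytes:
    """Derive a 32-byte key bound to this chip's identifier."""
    return hmac.new(chip_id, context, hashlib.sha256).digest()

key_a = device_key(b"CHIP-0001")
key_b = device_key(b"CHIP-0002")

print(key_a != key_b)                      # different chips, different keys
print(key_a == device_key(b"CHIP-0001"))   # same chip, same key
```

The catch, as the thread implies, is that the attacker who can read the weights can often also read or spoof the identifier, which is why real implementations keep the ID and the key derivation inside tamper-resistant hardware.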
Whoever knows better please correct me, but I'm pretty sure the source code of model, most likely alongside dataset (but probably on different storage devices, both physically and virtually), is stored somewhere on a machine that isn't connected to the web at large, if connected to any kind of network at all. That doesn't eliminate the risk of data being stolen, but you need to be physically present at the storage site fairly close to the computer (like *really* close) with a SATA cable shaped in a way that would allow it to serve as an antenna, or something like that. I expect OpenAI to take at least that kind of precaution, but who knows, dumb screwups happen in IT just as well.
There is no "source code" of the model; the model is the output of the training program, which takes petabytes of text as its input, plus RLHF (reinforcement learning from human feedback), which was missed out of the description and is arguably the hardest part to replicate. Search for OpenAI's "Learning from human preferences" paper.
@@maht0x yeah, sounds like it. Thank you for correction. My familiarity with ML/NNs is superficial, I know a couple of high-level concepts and a very coarse approximation of how it works under the hood.
It happened already: researchers could extract training data from GPT by having it repeat a word over and over, and it spat out this data, even personal details from whoever wrote the data in the LLM. OpenAI has closed the door by now, noting this is against OpenAI's terms. But is that solid enough? A lot of research has to be done to close that one off.
Remember what happened with the Bomb when only one guy had it ? Strangely, the use of the Bomb stopped when the Other Guy got one too ! Probably a good argument for open source from here on.
See, using their data (in the form of responses) to train your own AI isn't an "attack" per se. It's just doing what they did. And it shows culpability (IANALATINLA) that they see using third-party data to train an AI as wrong. I bet the measures they took to prevent it happening to them will be subpoenaed and used against them in lawsuits. I mean, if internally they call using data to train without permission an "attack", that's pretty damning!!
If it is theoretically cheaper to steal the data than to reproduce or create something able to compete with it, the question of the security of the data is a matter of when not if. We should all be asking when this will happen, and an even more troubling question is if that when has already passed.
I would split up their model among machines based on subject areas of knowledge. Each server running its own “department” at what I’m dubbing ChatGPT University 🎓
I disagree that the theft of data sets is a murky thing. The only justification for the grey area is a handful of tech CEOs claiming to be above the law. As seen with Uber, our legislatures are exactly dumb/corrupt enough to go along with the idea.
I'm surprised Meta's (Facebook) LLaMA isn't mentioned; their model was literally leaked onto the internet, so starting with Llama 2 Meta just releases it to the public. It's all over Hugging Face.
Maybe not. I'd say everyone would rather build the model themselves than go through this hassle. If it's 80% as good, that means it's not good enough.