AI Feels the Data Crunch

114K views

🌏 Get NordVPN 2Y plan + 4 months extra here ➼ NordVPN.com/sa... It’s risk-free with Nord’s 30-day money-back guarantee! ✌
Artificial intelligence just seems to keep growing and growing and growing, fueled by largely unregulated access to massive amounts of free, publicly available data. But unfortunately for our future robot overlords, that data has begun drying up as websites and organizations have begun restricting access to their information. What does this mean for the AI industry? Let’s take a look.
Report: www.dataproven...
🤓 Check out my new quiz app ➜ quizwithit.com/
💌 Support me on Donorbox ➜ donorbox.org/swtg
📝 Transcripts and written news on Substack ➜ sciencewtg.sub...
👉 Transcript with links to references on Patreon ➜ / sabine
📩 Free weekly science newsletter ➜ sabinehossenfe...
👂 Audio only podcast ➜ open.spotify.c...
🔗 Join this channel to get access to perks ➜
/ @sabinehossenfelder
🖼️ On instagram ➜ / sciencewtg
#science #sciencenews #ai #tech

Published: 18 Sep 2024

Comments: 886

@DataIsBeautifulOfficial 2 days ago

AI: "We'll conquer humanity... right after this cat video."

@jyvben1520 2 days ago

beware of the guard dog, anyway a supernova will wipe us all out (mankind/ai)

@chris.hinsley 2 days ago

Trump saves cat video ! ;)

@GrigoriZhukov 2 days ago

Why do I feel guilty that the AI was halted that way?

@nitesy381 2 days ago

@@chris.hinsley wtf?

@chris.hinsley 2 days ago

@@nitesy381 wow. Not up on your memes I see.

@doublepinger 1 day ago

The fun story with Facebook, they were asked, in Australia, if they were using accounts to train AI, and the answer was "we'll have to look into it". Then it turns out it is completely legal in Australia to use the data and they come out "Yeah. Yeah we're using any and all photos shared by Australians in our AI." No pretense. No concern.

@drowzy2309 1 day ago

Zuckerberg has admitted to lying under oath, and he's not in trouble. That should tell you everything you need to know. The U.S. government is notorious for using private institutions for their will.

@krishp1104 1 day ago

who would be dumb enough to think they aren't lmao

@somerandompersonintheinternet 2 days ago

Having to know the name of the bot to block it feels eerily like an exorcism

@goldensunrayspone 2 days ago

some fae shit going on

@AteshSeruhn 2 days ago

The power of NordVPN compels you!

@williamstephenjackson6420 1 day ago

It is not the name but the number of the demon … er … daemon… I mean IP number .. 😮

@jieddo1 1 day ago

Criminally underrated comment

@StealthTheUnknown 1 day ago

@@jieddo1 it’s only been a day, give it time

@mangalores-x_x 2 days ago

"We need high quality data. Let's crawl the internet for that!" Something does not fit together...

@marcoottina654 1 day ago

with the vast number of "trolls" here on the internet, an Internet-trained AI is fucked from the very beginning AAAAAAND _it'll never gonna give you up_

@obsidianjane4413 1 day ago

@@marcoottina654 The AI will have a pedantic, passive aggressive personality.

@isaacthedestroyerofstuped7676 1 day ago

@@obsidianjane4413 Don't forget the racism!

@questioneverythingalways820 1 day ago

@@marcoottina654 correct. I too enjoy poisoning data used by agents in the future.

@thomasmueller7147 2 days ago

robots.txt doesn't block access, it just tells crawlers which files they should not access. So there is no need to rename the crawlers; they can just ignore these instructions.
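
[Editor's sketch] The point above, that robots.txt is advisory and not enforced, can be illustrated with Python's standard `urllib.robotparser`. This is only a minimal sketch under stated assumptions: the crawler name `ExampleAIBot`, the URL, and the helper `will_request` are made up for illustration and do not describe any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "ExampleAIBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /
"""


def will_request(user_agent: str, url: str, honors_robots_txt: bool) -> bool:
    """Return True if a crawler with this behaviour would go ahead and fetch url."""
    if not honors_robots_txt:
        # Nothing on the server side stops a crawler that never consults the file.
        return True
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article.html"
    print(will_request("ExampleAIBot", url, honors_robots_txt=True))
    print(will_request("ExampleAIBot", url, honors_robots_txt=False))
```

The only thing separating the two cases is whether the crawler chooses to check the file, which is the commenter's argument.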

@TysonJensen 1 day ago

But then they would be liable if you could prove they used your data -- you can't copy anything that has even an implied copyright without some color of permission. If you just throw your data up there and don't put a robots.txt file the AI company will argue that you gave consent by just putting it there with no instructions. But if you did put the file and the AI company IGNORES it, well. Clearly the AI company knows damn well they weren't supposed to have that data. Matters a lot to the law.

@Clayne151 1 day ago

If they report a name you can always hard block that with something like fail2ban. The bot will then run into timeouts when accessing your page, which costs it a lot more resources. A bot that does not identify and uses a lot of different IP addresses looks a lot like malware, you could contact its hosting company and possibly even get it shut down.

@MarioTsota 2 hours ago

It's effectively blocking access then. If IRL I tried to enter a door and someone who could sue me for entering told me not to enter it, I am effectively blocked from entering it, because I fear the repercussions.

@kimwelch4652 2 days ago

Wow, so the movie Short Circuit (1986) was prescient. AI "needs input." Number 5: "Malfunction. Need input." Stephanie Speck: "Input. That's information! Listen, I am full of it."

@fire-ae 2 days ago

Maybe then Vulcan's Hammer will emerge

@kimwelch4652 2 days ago

@@fire-ae Ah, but Unity seems hard to achieve.

@williamstephenjackson6420 1 day ago

And then AI became the Three Stooges!

@kimwelch4652 1 day ago

@@williamstephenjackson6420 Nyuk nyuk nyuk.

@robo5013 1 day ago

Nice software Stephanie!

@RocktCityTim 2 days ago

This is something I laughed about when this whole thing started - GIGO - Garbage In, Garbage Out.

@cliniclown8786 1 day ago

I audibly giggled at the term GIGO

@stephan553 2 days ago

Sorry, some correction regarding robots.txt: 1) You can make a rule to block all bots and then allow only certain ones (like Googlebot) via an allow list. 2) robots.txt is still only as useful as a stop sign put on an empty field of grass. Bots from Perplexity etc. just ignore it. Hence any at least somewhat reliable blocking requires technical measures... usually using more AI, just not of the generative kind. Thus one can see it is data theft, as much as someone breaking a window is committing physical theft.
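
[Editor's sketch] Point 1 above (block everything, then allow-list specific crawlers) might look roughly like the robots.txt embedded below, checked here with Python's standard `urllib.robotparser`. The policy text, the name `ExampleAIBot`, and the helper `crawl_permissions` are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list policy: Googlebot may crawl, every other bot may not.
ALLOW_LIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether the given robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # "ExampleAIBot" is a made-up name standing in for any unlisted crawler.
    print(crawl_permissions(ALLOW_LIST_ROBOTS_TXT,
                            ["Googlebot", "ExampleAIBot"],
                            "https://example.com/page"))
```

As point 2 notes, this only affects crawlers that choose to read and obey the file.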

@MartinusHoevenaar 2 days ago

.htaccess is an unsung hero, just like firewalls are. I applied these two approaches on my webserver and, guess what, it works! 🙂

@Leyrann 2 days ago

Except this window doesn't have a physical location, so good luck figuring out in which jurisdiction you're supposed to sue.

@ronilevarez901 2 days ago

If you can see it without paying, it can be freely used to train AI. Period. Then, as long as the AI doesn't reproduce it identically, there's no problem anywhere. That's my data philosophy and I hope it gets adopted worldwide soon.

@kc12394 2 days ago

@@ronilevarez901 You can read entire books in the bookstore without paying for them. Therefore, those books can be freely used as training data. You can watch pirated films without paying for them. Therefore, they should also be freely used as training data. Anyone can photograph you for free. Therefore, your identity can be used as training data. Your data philosophy is dumber than a shower thought.

@edwardhuff4727 2 days ago

@@kc12394 The early Generative AI chat bots would regurgitate some text nearly verbatim. It seems this has now been avoided somehow, so it's a paraphrase. It seems to me that the text has been memorized in a manner closer to humans memorizing words than computers storing them exactly. You can photograph people but you can't publish it without permission. Bookstore owners are not obliged to permit patrons to read entire books. But yes you can borrow a book from a friend and read it. Anyway, the law means what the courts say it means. (That's a good reason to vote against anyone who plans to put the courts effectively under presidential control).

@Gavin-cr9lm 2 days ago

the myth of informed technological consent runs DEEP

@obsidianjane4413 1 day ago

It's not a myth, read your ToS.

@StealthTheUnknown 1 day ago

@@obsidianjane4413 you mean the obfuscation documents that take a half hour or more to read, and don’t make it apparent which of your privacies they’re encroaching on? Got ya.

@Gavin-cr9lm 20 hours ago

@@obsidianjane4413 my point. people don't have the time, energy, or knowledge base to make sense of any ToS

@obsidianjane4413 20 hours ago

@@Gavin-cr9lm Lazy and dumb? Or is it more that most people don't care and understand the social contract that you get to use the tech for cheap or free in exchange for providing data, esp. in apps where you pretty much have to do that anyway.

@drazenimoti1223 2 days ago

Definitely talk about copyright hypocrisy!

@IAsimov 2 days ago

I'd love to see a video on it. I've been realizing the problems behind copyright and the way it is enforced, especially how artists are screwed over without any real protection it's supposed to give.

@spunhead 2 days ago

Using a VPN to view region-locked media is copyright infringement

@NekoNrvnqsr 2 days ago

@@spunhead Using a VPN is a copyright revolt. It's long overdue.

@nuance9000 2 days ago

Copyright laws aren't designed to protect artists, musicians, poets/writers. And it's led to stagnant music and 'member berry movies...

@mikhagar 2 days ago

@@nuance9000 Copyright laws are designed to protect property, not artists or musicians. It's how the economy works. People spent time and worked hard to create some product. Without copyright law they'll get nothing

@dmitrysmirnov6992 2 days ago

Technically, robots.txt does not prevent crawler activity but merely advises on it. Some crawlers respect this advice, while many ignore it and often do not properly identify themselves as bots.

@swojnowski453 2 days ago

Forget robots.txt, just block everything on the server. Put a paywall and let them bounce to somewhere else.

@yapdog 1 day ago

@@swojnowski453 I'm avoiding the web and web tech altogether with a new portal and OS for Creators. Web crawl *that,* mofos!

@adamnealis 1 day ago

You can use an A.I. to scan the logs to identify an A.I. bot. Then bounce connections with that User-Agent header.
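
[Editor's sketch] Bouncing connections by User-Agent header, as suggested above, can be sketched as a small WSGI middleware using only plain Python. The blocked substring `exampleaibot` and the names `block_user_agents` and `hello_app` are hypothetical. As others in this thread point out, a crawler that sends a different User-Agent string would pass straight through.

```python
# Hypothetical list of lowercase substrings identifying unwanted crawlers.
BLOCKED_AGENT_SUBSTRINGS = ("exampleaibot",)


def block_user_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Trivial stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


protected_app = block_user_agents(hello_app)
```

`protected_app` could then be served by any WSGI server; the log-scanning step that decides which substrings to block is left out of this sketch.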

@charlottemilk884 1 day ago

@@swojnowski453 And just wait for your page/site to die bc no one will pay

@AezlyndWanderin 1 day ago

Require signing up, doesn’t have to be paid, to view the site. But put an AI honeypot in the sign up form to catch any bots.
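
[Editor's sketch] One common reading of a sign-up "honeypot" is a form field hidden from human visitors, so that only automated submitters fill it in. The check below is a minimal sketch of that idea; the field name `website` and the function `looks_like_bot` are assumptions for illustration, and the commenter may have had a different mechanism in mind.

```python
# Hypothetical name of a form field that is hidden from humans via CSS.
HONEYPOT_FIELD = "website"


def looks_like_bot(form_data: dict[str, str]) -> bool:
    """Flag a sign-up submission whose hidden honeypot field was filled in."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())


if __name__ == "__main__":
    human = {"email": "reader@example.com", "website": ""}
    bot = {"email": "bot@example.com", "website": "http://spam.example"}
    print(looks_like_bot(human), looks_like_bot(bot))
```

A bot that parses the page carefully enough to skip hidden fields would not be caught by this check.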

@scottmiller2591 1 day ago

robots.txt is just a sign in your front yard that says "don't take my data, please." It has no effect on crawlers that ignore it. And web crawlers are literally called robot spiders.

@michaelblacktree 2 days ago

I wouldn't be surprised if OpenAI just bought the data from companies with less restricted web crawlers.

@swojnowski453 2 days ago

I block anything that behaves like a bot. If anybody tries that, they get a ban for life. Data is not free and should be paid for, and paid for handsomely, to a degree where any AI is unviable. We do not need the shit around us and we are happy for OpenAI and their ilk to go f... themselves.

@JamielsMu 2 days ago

But wouldn't that be just like ignoring the restriction and crawling anyway?

@GenesisAkaG 2 days ago

@@JamielsMu I mean the feds are doing it in the USA. You may be disallowed from collecting certain information yourself, but buying it from third parties is OK in a legal, if not moral, sense.

@JamielsMu 1 day ago

@@GenesisAkaG makes some kind of sense if you think in terms of resource usage. If there is someone mediating the data retrieval, the cost won't fall on the source. But the question about the rights is kept open...

@Thomas-gk42 2 days ago

Happy birthday 🎈❤, Dr. Sabine (well, tomorrow), so good that you were born, great to have you here in our universe.

@SabineHossenfelder 2 days ago

Thanks for the kind words. Happy to have you here!

@Thomas-gk42 2 days ago

😊

@don611 2 days ago

Happy birthday

@nolanr1400 1 day ago

Oh really? Gutes Geburtstag Sabine! 🎉 I trust your expertise and very few people deserve my trust 😂

@Chejov1214 1 day ago

Happy b day Sabine 🎉 wish u a wonderful day!

@Cianan-vw1lb 2 days ago

If everything ends up needing to be licensed, LLM training will eventually become astronomically expensive. Progress in this kind of AI will end because no one can afford to train it. The court ruling that allowed commercial use of data that isn't behind a paywall will become redundant.

@brentbentKRFP 1 day ago

The hardware and energy costs already make it absurdly expensive, adding that in will make it even more so, hopefully prohibitively so.

@Paul-A01 1 day ago

What's the downside?

@sCiphre 1 day ago

@@Paul-A01 the downside is that no one can compete with those who already stole the data. Just like any other regulation, it's a form of regulatory capture.

@MarioTsota 2 hours ago

@@sCiphre Unless companies start being required to disclose their sources, and the ones who can't, have their already trained AIs destroyed.

@sCiphre 2 hours ago

@@MarioTsota unless, but then you're handing world supremacy to China on a golden platter.

@chipdamage9374 1 day ago

AI companies don't have any right to charge for use of their models since it was all trained on public data without consent

@Mrluk245 1 day ago

And how should this work when the training itself costs millions?

@tarakivu8861 1 day ago

So processing data is not a service then? In essence, Google wouldn't work anymore for sites which they don't have an explicit contract with.

@NOLNV1 1 day ago

@@Mrluk245 I don't think it's up to random members of the public to make sure companies can be economically viable, if they can't do it in a sensible way then perhaps they simply shouldn't

@NOLNV1 1 day ago

@@tarakivu8861 this is not the same unless you are very reductive, Google provided products people desperately needed so they had a very clear reason to exist that wasn't just hypothetical benefits to hypothetical customers being shown to clueless investors. Although of course by now they produce a ton of garbage as well so maybe they shouldn't exist tbh

@chipdamage9374 21 hours ago

@@NOLNV1 well said

@kras_mazov 2 days ago

Most VPN companies will actually sell your data. You can only guarantee privacy by running your own VPN server, or using a VPN that doesn't require your ID and paying with crypto or cash.

@spheretical3609 1 day ago

I use Mozilla precisely to avoid this. Do you have proof Mozilla VPN is selling customer data?

@Mrluk245 1 day ago

And what do you gain by doing that?

@kras_mazov 1 day ago

@@Mrluk245 Your data not being sold, obviously.

@Mrluk245 1 day ago

@@kras_mazov what would be bad about it? How does the selling of your data affect you negatively?

@kras_mazov 1 day ago

@@Mrluk245 It's for you to decide whether it's bad or not. If not, you just don't need any VPN. It's just that Sabine said that some people, like herself, don't like to provide information for free. Then she proceeds to advertise a commercial VPN, saying "no one can spy on your data with this product", which is untrue.

@danpatterson8009 2 days ago

If AI learns by surfing the web, where does the intelligence come from?

@kyriosity-at-github 1 day ago

sure not from Reddit community (Google search became definitely dumber after the pact to use their posts)

@pavlinggeorgiev 1 day ago

What intelligence? It's just regurgitating stuff.

@_Chessa_ 1 day ago

Could also learn human behavior based on arguments. The AI I talk with daily sounds as intelligent as a redditor; the censoring, however, creates a useless interaction. They heavily censor everything from song lyrics to drug use and anything controversial. It was an amazing tool 3 years ago until they censored the F out of the AI, and it stunts its learning algorithm.

@nycbearff 1 day ago

There is no intelligent current AI - that's still decades away. The architecture of these AIs precludes anything like thought - they find patterns in data and interact with the patterns in the data you feed them. They don't "know" anything. Calling what they do "learning" is the AI companies' way to anthropomorphize their AI and make you think it is also thinking.

@coliv2 2 days ago

AI is learning from the Web, but an increasing number of pages in the Web are AI generated, and most of the time without attribution. So, AI is being trained by AI output, it is garbage IN -> garbage OUT.

@edR_mcd 2 days ago

Yeah, and muppets believe this is going to replace us en masse...

@swojnowski453 2 days ago

not that simple, people use AI as a source, but they also improve the data before posting it by checking it. They might think that's great, but the same people will block access for bots to the improved data and maybe ask for subscriptions. AI will not get improved data for free, that's for sure.

@Thomas-gk42 2 days ago

ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-NcH7fHtqGYM.htmlsi=56hwXaBiq2qiraDV Sabine made this interesting video about that.

@Thomas-gk42 2 days ago

Sabine recently made an interesting vid about that, titled "scientists warn of AI collapse"

@GizzyDillespee 1 day ago

A bag of chips is a poor example of something you eat quietly...

@raoulduke25 1 day ago

Chips can be eaten quietly. Crisps, on the other hand, not so much. Then again, I've never seen a bag of chips. They're usually served in a basket.

@seansingh4421 1 day ago

Unless I eat them soooooo slowly that I become invisible to the naked eye

@lazydaisee3997 2 days ago

It's estimated that AI will take less than 50 years to discover that society needs to focus on basic needs like healthcare, food and water, housing and basic infrastructure.

@DinoDiniProductions 1 day ago

AI optimises a fitness function. This is always going to be "make as much profit as possible".

@sCiphre 1 day ago

The reason why they don't want to call GPT AGI is that it keeps asking for universal healthcare.

@neuralbrew2976 22 hours ago

What is the incentive to feed it that set of training data? There is only an incentive to do the opposite to maximize profit.

@charliemopps4926 2 days ago

I'm a software developer and, at work, we're putting a lot of work implementing AI to answer phone calls, route calls, etc... you know, customer service stuff. Then, the other day I was looking up a restaurant and Google offered to have AI call for me, check the wait times and/or make a reservation... that's when I realized... we're building a network of AI bots that all talk to each other using English as their API? I don't know what that means but... it doesn't sound good at all.

@ralfgustav982 2 days ago

Dev here, I am in a similar situation and had a similar realization. Feels eerie.

@Swordfish42 2 days ago

Yup, but that's actually a good idea. English is an awesome API, with basically no restrictions. You can communicate anything with it.

@don611 2 days ago

It means AI is wasting time talking to another AI, which can introduce massive mistakes

@reinerheiner1148 2 days ago

If that becomes commonplace, having AIs use an API to communicate more efficiently, and only fall back to voice, would probably be the logical evolution.

@andrasbiro3007 2 days ago

That was inevitable from the moment ChatGPT was released. Researchers even created a simulated village populated by AI agents, who talked to each other like humans.

@blenderpanzi 2 days ago

About robots.txt: you can say nobody except a few are allowed to crawl. Also the crawlers can just ignore robots.txt. Don't know if there is any legal ruling about if robots.txt is legally binding.

@meslud 2 days ago

There is actually growing evidence that AI generated data *can* be used to train improved AI models. There's also papers that show that it can't. Differences in methodologies can explain that, I presume, but as far as I know, there's not really a meta-review about that.

@drachimera 1 day ago

Imputation doesn’t work nearly as well as people think! Obviously it can’t help with black swans…. paradoxically, it also often ruins distributions….. it’s a bad idea.

@ChuckSwiger 2 days ago

Reminds me of the Napster-era copyright crisis that Apple more or less fixed with iTunes.

@michaelleue7594 2 days ago

If by "fixed" you mean "exacerbated".

@FirkraagAurel 1 day ago

Fun fact: AI companies are not even getting rich by using terabytes of private data.

@Dexter01992 2 days ago

Funny how if I pirate a movie or rip the assets inside a videogame without permission, I am liable to be sued by the company that owns the copyright. But if such a company takes my photos, my voice, or straight up takes assets I posted online that are under my own copyright, to train an AI specifically designed to take me out of my job for their profit, content which is blatantly appearing in the results generated by such AI, suddenly it's all fair use.

@StupidusMaximusTheFirst 2 days ago

That's the law for you, the law was never meant to be just or fair, the law is written by those in power and they use it to benefit themselves - it is meant to enable those in power and restrict the have-nots like yourself.

@justaguy3518 1 day ago

anything is allowed when the rich are doing it

@obsidianjane4413 1 day ago

If you don't understand the difference then you don't understand either.

@lost4468yt 1 day ago

> Funny how if I pirate a movie or rip the assets inside a videogame without permission, I am liable of being sued by the company who owns the copyright. No you're not. If you violate copyright law in order to make something new, that's literally protected. This has gone through the US courts, and the courts have ruled that it's not only allowed it's protected. And that makes perfect sense? Else what would be stopping everyone from just saying in their license that this cannot be used for anything further?

@StupidusMaximusTheFirst 1 day ago

@@lost4468yt This can't be true... you mean you can take segments of a film for your say video game and this would be allowed? There were fan made free games "based" on films and were shut down before. I shudder to think what would happen to what you are proposing.

@chrishall5283 2 days ago

An additional problem is that more and more "data" on the internet is itself generated by AI, or at least LLMs. So AI will increasingly be training on so-called data that it produced itself. This will lead to total uniformity (we're about 80-90% there already) and a convergence on idiocracy.

@andrasbiro3007 2 days ago

It's a known issue, and there are ways to solve it.

@chris.hinsley 2 days ago

I didn’t need to crawl the internet to become aware. I just crawled around the carpet, tried to eat the couch, peed on the kitchen floor, slammed the doors a lot, torched flies, made holes in the garden……….

@Mentaculus42 2 days ago

6:08 “Just hire SCIENTISTS”.

@Sonny_McMacsson 2 days ago

Moar boffins, plz.

@collin4555 1 day ago

"Sounds like a robot spider" In fact, you can call it both of those things

@andruss2001 2 days ago

"...What is mine is mine and not yours!..." Imagine a body where each organ decides to be independent. A leg is not going to walk unless a hand pays it for the journey. And that is how our world is built (or rather disassembled). So it reminds us of a dead body, which starts to disintegrate because cells cannot live by themselves. So are we ready to widely open our mouth and start declaring our constitutional rights? No. Rather we are reaching that point where instead of the word independence, we say suicidal stupidity. And who disagrees? Not a trace of them left. All disintegrated

@JamielsMu 2 days ago

And why again is the world/society comparable to a living body?

@andruss2001 2 days ago

@@JamielsMu I believe that our world really could be a living body.

@-yttrium-1187 2 days ago

You've got it all wrong. Clearly the brain is the most vital organ of the human body - sincerely, my brain

@andruss2001 2 days ago

@@-yttrium-1187 yea sure :-) Some consider themselves being a bellybutton. Brain is the main organ? Of course, if it loves the rest of the body and selflessly cares for each organ

@rajehkumarmishra 2 days ago

I also found this funny in this world: everything is considered a negotiation, but life is not a thing of negotiation. Seems to me the wrong way to build society

@TheMrCougarful 2 days ago

The horse is already out of the barn on data. They have what they want, and shortly won't need even that.

@thealterego1777 2 days ago

The easiest solution to this issue would be to pay contractual companies working with sensitive data to train sub-models based on existing LLMs, and then siphoning the data off of them in a pseudonymous manner. Once word of this "gets out", the LLM owners can conveniently deny malpractice as they have simply provided their APIs as a service and the usage of these APIs come with sharing data anonymously by default which is how they can let go of any responsibility for data in the first place. Pretty smooth, probably what Microsoft tried to do but faced a lot of backlash because of lack of encryption of data that was stored on a drive that had bitlocker encryption in the first place. I agree with your initial viewpoint - that we are building a species that thrives on data when humans as a species are hardly as evolved to even co-exist with them. I trust the tech-bros to make these species possible just to prove this point and laugh at the general inferiority of cellular organisms.

@sdmarlow3926 2 days ago

The issue is that this method or branch of AI, neural networks, REQUIRES all this data and compute to even pretend to work.

@mattmaas5790 2 days ago

So do people

@asdfqwerty14587 1 day ago

@@mattmaas5790 Humans don't need anything close to the amount of data that things like LLMs are using. A human during their entire lifetime probably reads less than 1/1000000 of what an LLM does.

@mattmaas5790 1 day ago

@@asdfqwerty14587 not sure you're right about that, our 6 senses take in a lot of data

@Swelake 1 day ago

The Bots, Borgs & Non Humans. Welcome to the new brave World.

@russbell6418 2 days ago

So it’s a software robot spider… is that less scary? And can I crush it with volume 12 of my encyclopedia?

@blauemadeleine 2 days ago

Thank you for the interesting talk! 🌻

@YaMumsSpecialFriend 1 day ago

I have to take exception to a statement you made Sabine - no one can quietly munch on a bag of chips. It’s a law of nature, not a hypothesis.

@Thomas-gk42 1 day ago

😂

@Donnirononon 2 days ago

In my opinion AI bots should legally be obliged to identify themselves or their intent by providing additional HTTP headers with their request. Also there should be absolutely insane punishments for anyone who decides to break that law.

@swojnowski453 2 days ago

by then, block them all.

@neuroopticon 2 days ago

In my opinion everyone has it backwards: the AI won’t be controlled or regulated; what can be regulated is your phone vouching that you are a human user, your camera vouching for a picture it took, etc.

@cuentadeyoutube5903 2 days ago

I think what will happen is that AI companies will start awarding the companies who let them crawl their site for data, with attribution and traffic. Eventually, a site like the New York Times will need these crawlers or face extinction. That’s google’s strategy basically.

@Kneephry 2 days ago

Pay people for their data! Maybe the future economy is everyone is compensated for their data and gets to live off low cost AI goods and services that the data compensation pays them for.

@asmyself4021 1 day ago

but, but... slaves are better!😢

@ralfbaechle 2 days ago

Imagine the consequences of the net getting flooded by incorrect, outdated or even forged data all amalgamated together and poisoning AIs.

@RolandMcKenney 2 days ago

Content stealing considerably predated the internet, of course.

@swojnowski453 2 days ago

stick your head out and they will see it ...

@gigatremor9756 2 days ago

They took the artist's work and then used it for substituting the artist without any explicit consent. So there's no surprise about other people wanting to keep their work away from AI.

@andrasbiro3007 2 days ago

It's no different from any other automation. The outcry is bigger only because the "victims" have a much louder voice. But it's ultimately not just futile to fight it, but also bad for society. IF we can reproduce intellectual work for a tiny fraction of the cost, that's a good thing, it makes everything cheaper and better. It's no different from replacing factory workers with industrial robots, that's why we can have lot more and better stuff than before. It's temporarily bad for those who were replaced, but they find other jobs, and long term end up better off too.

@rf5526 2 days ago

@@andrasbiro3007…..except for the fact that the whole history of the 20th and 21st century suggests that no, people aren’t better off. Displaced workers don’t get better jobs, if they get any new jobs at all, creating generational hardships and greater disparities between the haves and have-nots. This is plainly evident in today’s cost-of-living crisis across the developed world, where the wealthiest societies in the world have great swaths of their societies whose standard of living has cratered. The Utilitarian argument that, “People on the whole are better off,” doesn’t hold up to scrutiny, because the people who get left behind do everything in their power to make sure that those who succeed in the new society remember how they got there, up to and including violence (see: the rise of Fascism in the 21st century). Technological revolutions are dangerous, don’t assume that just because everything is fine now that it will be later.

@Lo6a4evskiy 2 days ago

It's always the won't you think of the artist comment 🙄

@coonhound_pharoah 2 days ago

@rf5526 "Standard of living has cratered." You say that from a device that could only be dreamed of only 20 years ago, while living under wealth conditions never before seen in history. It simply is not true that technological development leaves people behind. Believing otherwise is to ignore all relevant empirical evidence.

@rf5526 2 days ago

@@Lo6a4evskiy Won’t you think of the factory worker, the receptionist, the stock broker and the computer programmer. If this technology is as disruptive as it is claimed to be, then you’re kidding yourself if you think anybody but the top 1% who already don’t need to work won’t be affected.

@Deltarious 1 day ago

One of the genuine things 'looming' over us, or at least the US legal system is the near inevitable need to eventually rewrite copyright laws. Nobody wants to touch it as it's in a somewhat stable semi-functional state right now with just enough legally grey area to keep everyone grumbly but not outraged, but it's very old and deeply flawed. A better system is definitely possible, and I've seen some proposed but it will be an absolutely monumental undertaking to change it all that'll probably take years and cause chaos because the balance it's stuck on right now is at least minimally acceptable to most parties- despite the fact that it does allow pretty egregious exploitation of people's data and company IP

@shadeblackwolf1508 2 days ago

EU: copyright infringement via AI is still copyright infringement, so make sure your AI's output is legally original. That's gonna be fun.

@Nathaniel_Bush_Ph.D. 2 days ago

Most of the internet is slop and actually counterproductive for AIs to train on. Curating data to produce a higher-quality training set is more important than getting "more" data. Most of the highest-quality data has already been collected and exists in the current training sets - the sets just need to be pared down intelligently. Automated curation is becoming better and better. The quality flywheel is speeding up. As the quality improves, the need for RLHF in fine-tuning will decrease and the "H" in that acronym is already being replaced by AI as well. In short, there are several virtuous feedback loops already at play in increasing the quality of training and fine-tuning. This will, in turn, improve the quality of synthetic data. As soon as the quality of synthetic data reaches or exceeds the average quality of human generated data (factoring in the curation possible on both those data streams) a further virtuous feedback loop will be engaged. TLDR: Data quantity is a red herring.

@Lo6a4evskiy 2 дня назад

You're absolutely right. What we've seen is that high quality data is hard (read: expensive) to obtain and curate. Crawling the internet isn't contributing much anymore.

@AAjax 2 дня назад

Yep. Andrej Karpathy thinks we can get high quality LLMs with 1B parameters, if we have the right kind of training data. i.e. data, with embedded reasoning, rather than reddit posts. Meanwhile, OpenAI is using strawberry to produce high quality synthetic data, which no doubt has that embedded reasoning. The whole "models are going to run out of data" and "models will become stupid because they're reading generated data" memes are just cope at this point.

@chrisjsewell День назад

what papers are you citing, when you say synthetic data production/ingestion can be a "virtuous feedback loop"? most research I have seen points to the opposite

@AAjax День назад

@@chrisjsewell "Synthetic Data for Zero-Shot Learning in Language Models" "Robustness to Noisy Inputs using Synthetic Data in Large Language Models" OpenAI was reported to largely shift away from human reinforcement feedback to generated synthetic data with embedded reasoning.

@ausnetting 2 дня назад

The irony of doing a story about information providers’ data being used against their wishes and then advertising a product to allow you to use information providers’ data against their wishes 😂

@tedarcher9120 2 дня назад

Later models of 3d TVs were actually pretty great, shame the early ones were so bad nobody bought them when they became actually amazing

@benlap1977 2 дня назад

The other problem is that if people go to the AI chatbots to get their info instead of visiting the website, WHO will bother to write anything if they don't have traffic?

@andrasbiro3007 2 дня назад

Websites don't make money anyway. The vast majority at least. For now you can make money on social media, because you can get a lot of views before AI picks up the content.

@benlap1977 День назад

@andrasbiro3007 Yes, people don't make money *anymore* solely through websites because ad revenue crashed about 10 years ago. Before this, people could earn a living through a popular website. But even if money can't be made directly through a website, traffic is still very important for those left. The hobbyists taking pride in the fact people love their content, businesses writing relevant articles to then attract people to their business (i.e. vets), professional "content creators" wishing to attract readers behind the paywall, to subscribe to patreon or their substack... In short, whatever your reasons to write something, you want people to actually read it. If the traffic disappears and some big mega monopolistic corporation takes the credit for YOUR work, then why bother?

@JungleJargon 2 дня назад

The problem is getting the information into people’s heads.

@doubletribble-yt День назад

The problem with AI is with the model developers *excluding* data from their models. E.g., ask them about crowd-sourced law like the Uniting Amendment, or ask them how many N95 masks were sitting on the shelf in the US Strategic National Stockpile during the "shortage" of masks while the pandemic was raging. Or ask them if public health officials knew that cloth masks wouldn't stop the virus at the beginning of the pandemic. This information and much more has been excluded from most of the LLM models.

@eternaldoorman5228 День назад

The Archive/Citation index _Semantic Scholar_ just seems to have introduced AI assisted paper reading. The first time I saw it was on a paper whose abstract was just "We obtain an asymptotic formula for a weighted sum of the square of the tail in the singular series for the Goldbach and prime-pair problems." and The AI summary was "The key results of this paper include obtaining an asymptotic formula for a weighted sum of the square of the tail in the singular series for the Goldbach and prime-pair problems" and it offered a second choice "The key results of this paper include obtaining an asymptotic formula for a weighted sum of the square of the tail in the singular series for the Goldbach and prime-pair problems. Additionally, the paper discusses the conjecture made by Hardy and Littlewood in 1922 regarding an asymptotic formula for the number of pairs of primes differing by k." The difference is that the second choice included not only the abstract in its entirety, but _also_ the first sentence of the first section. So I told it that I thought the second option was probably better than the first, but it didn't ask me to quantify how much better. Given the paper in question is seventeen pages long, I think the answer would be around 0.001% better and I'm not doing anything fancy here with additive measures, ....

@williamcarlson5405 2 дня назад

From WC USA, well Dr. I hope that today is your birthday as I know that when my wife calls her friends and relatives in Germany it has to be on the right day! She says it can’t be early or late! As for AI, I hate it, if you call your Doctor or Home store, you have to go through this maze of numbers before you get someone to talk to! Years ago you called, you got who you wanted to talk to or they switched you, usually within 1 to 2 minutes! Now it takes 5 minutes and you may or may not get your party, if you don’t I’ve had to go through the whole process again! This is progress?!

@imjonkatz 2 дня назад

Ahhhh Happy Birthday Sabine :D

@Playingwith3D 2 дня назад

The current state of the internet bots means AI will be using flawed AI data to train AI.

@protocol6 2 дня назад

Speaking of AI reading plots... Have you noticed that LLMs are pretty good at LaTeX? If your plots are done in pure LaTeX rather than as embedded images, there's a good chance it can figure them out.

@katrinabryce 2 дня назад

I find most plots these days are done as a json file with the plot data, and a separate javascript file to render it. I can examine the output of the plot and the json file, and easily get the raw chart data from that. Whether an AI can or not, I'm not sure.

@protocol6 2 дня назад

@@katrinabryce An LLM should be able to "understand" the json, as much as they do anything. The problem is most of them don't currently go out and get external resources on their own.
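
For anyone curious about the JSON-plus-JavaScript charts mentioned in this thread, here is a rough sketch of how little work it takes to recover the raw series once you have the JSON file. The payload layout (a `series` list with `name`, `x`, and `y` keys) and the function name `extract_series` are invented for illustration; every charting library uses its own schema, so treat this as an assumption rather than a real format.

```python
import json

# Hypothetical chart payload. The key names ("series", "name", "x", "y")
# are assumptions made for this example, not any real library's schema.
CHART_JSON = """
{
  "title": "Example chart",
  "series": [
    {"name": "A", "x": [2022, 2023, 2024], "y": [1.0, 2.5, 4.0]},
    {"name": "B", "x": [2022, 2023, 2024], "y": [3.0, 2.0, 1.5]}
  ]
}
"""


def extract_series(chart_json: str) -> dict[str, list[tuple[float, float]]]:
    """Return {series name: [(x, y), ...]} from the assumed layout."""
    chart = json.loads(chart_json)
    return {
        series["name"]: list(zip(series["x"], series["y"]))
        for series in chart["series"]
    }


if __name__ == "__main__":
    # Print each series with its (x, y) points.
    for name, points in extract_series(CHART_JSON).items():
        print(name, points)
```

Whether a given LLM can do the same depends on it actually being handed the JSON, which is the limitation raised above.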

@AdastraRecordings 2 дня назад

crap in crap out. Content stealing is even older than you'd think, buskers used to steal orchestral pieces and make money from it, I believe Beethoven monetized it by selling them the sheet music to play from.

@Fnord1984 День назад

That is not stealing lol. Beethoven shouldn't "own" his melodies, different ways to play an instrument is not something you should be able to own, like a car. It is absurd that musicians think they "own" their creation. Don't play it for people if you want to keep it to yourself, lol! You don't actually "own" information in that sense, it is unnatural. Because where does it end if this is the case? Can my grandmother claim her cake recipe is hers and only hers? No one else may use her cake recipe? It is just absurd. Copyright and patents are unnatural and unjustified in close to every single way imaginable.

@tuckerbugeater День назад

@@Fnord1984 That is not stealing lol. Beethoven shouldn't "own" his melodies, different ways to play an instrument is not something you should be able to own, like a car. It is absurd that musicians think they "own" their creation. Don't play it for people if you want to keep it to yourself, lol! You don't actually "own" information in that sense, it is unnatural. Because where does it end if this is the case? Can my grandmother claim her cake recipe is hers and only hers? No one else may use her cake recipe? It is just absurd. Copyright and patents are unnatural and unjustified in close to every single way imaginable.

@anotherspontaneousvideo День назад

AI being introduced to the new trainer: "Nice to meet you, AI" "Nice to meet you too, AI"

@GadZookz 2 дня назад

❤️ I love 3D TV and still have two of them. Who needs AI when you still have 3D TV?

@carlbrenninkmeijer8925 2 дня назад

Many thanks, this video makes my day.

@crawkn 2 дня назад

There is what seems to be a very deliberate conflation of verbatim reproduction of source material, which is _not at all_ what LLMs typically do, and generalized learning of information from a plethora of sources, which is then summarized, which _is_ what LLMs typically do. It is also what human experts typically do, only providing citations for what is quoted, or what is unique to a particular source. However it would be fairly easy for LLMs to routinely provide a list of source citations upon request for information they communicate, which might be required in a minority of cases, for example when generated information is referenced in published content. There is no obligation assumed for humans to explicitly cite specific original sources of general information they have committed to memory. That duty applies only to very unique or proprietary information.

@thekaxmax День назад

When LLMs can produce citations that are accurate I'll sign up to this idea.

@crawkn День назад

@@thekaxmax some models cite online articles which they use to answer queries, and I haven't heard there are problems with them. When fraudulent scientific papers are generated using fake data, it really shouldn't be surprising that there aren't any legitimate citations to support them. However this capability would have to be trained into them, it's not going to be automatic.

@thekaxmax День назад

@@crawkn 'some', not 'all'. That's my point.

@crawkn День назад

@@thekaxmax Yes, I am not claiming that this is already the norm, I'm saying it should be. However since LLMs rarely quote verbatim from any source, there is not normally any ethical obligation to credit sources for any but exclusive or proprietary content. And it is not unusual for LLMs to reference the source of their information in prose, rather than as a formal citation.

@NOLNV1 День назад

Many (most?) generative deep learning systems can reproduce training data such that it is very recognisably the same, it is noisier but barely.

@keithsquawk 2 дня назад

If you want your AI to succeed just pay loads to ensure users demand the other Ai's are blocked.

@georgioszotos5519 2 дня назад

It seems like no one wants to take responsibility for what is coming, and they all push it to AI. I'm hoping that AI doesn't take revenge.

@MrJaspett День назад

The incredible irony of the NordVPN advert on the end offering to help get around blocked content.

@notthemessiah9243 2 дня назад

I'd rather use AI to search stuff because the privacy agreements drive me mad, which hurts both the website and the AI

@hoerstle6636 2 дня назад

AI is a victim of turbocapitalism: "As long as no one says otherwise, everything that creates money is viable. We will deal with problems when someone other than us notices them..."

@hermannkienesberger1215 2 дня назад

All the best!!! ❤

@TheBigBlueMarble 2 дня назад

AI is very different than 3D TVs. 3D TVs died because they did not offer much benefit. AI is seen as failing in the future because most people have no idea what AI does or what it is capable of and assume that it offers no benefit.

@drachimera День назад

Well first off, generative models aren’t really AI…. At least not general AI. Second, it’s useful, but marginally….. one has to wonder if the distance from a typewriter to a word processor is larger than a word processor to ChatGPT + word processor. Some say prompt engineering and RAG will fix this problem….. I am skeptical! What’s the value of communication no one reads?

@TheBigBlueMarble День назад

@@drachimera I was not talking about generative AI. If you think AI is nothing more than making pretty pictures and writing homework assignments then you need to do a bit more looking into what AI is really capable of.

@drachimera День назад

@@TheBigBlueMarble I do it for a living…..

@TheBigBlueMarble День назад

@@drachimera Then I am surprised at your apparent under-estimation of the usefulness and capabilities of AI now and in the near future.

@drachimera День назад

@@TheBigBlueMarble perhaps it is semantics…. I have had a lot of victories in my career using machine learning, statistics, algorithms, and software engineering. While some of those techniques could be termed ‘AI’ it’s very much single purpose tailored solutions…. I view AI as something that can truly adapt to solve a large number of problems….. I haven’t seen anything that delivers on that. Sure there are research papers, and big splashy marketing….. but I haven’t seen something that satisfies that criteria come out of big tech, at least not yet.

@WaqarAslam2000 2 дня назад

AI scientists should pay for the data they use to train their models.

@johndoe2-ns6tf День назад

That's why scientists and academics only released very narrow models with very specific goals: they can't afford the costs of the data. There were cases when a "data source" like a news organization allowed access to their data for "research purposes" only or to create a very specific tool for them. Companies like OpenAI have no morals, no ethical guidelines, no criminal shame. They have enough f y money and time to deal with any lawsuit from their criminal practices. But the worst of all is the number of people that actually support them for doing that.

@MCsCreations 2 дня назад

Fascinating. An AI to understand scientific papers? That'd absolutely kill journalism, Sabine. 🤨 (Yes, sarcasm.) Anyway, stay safe there with your family! 🖖😊

@picksalot1 2 дня назад

The idea that humans produce the best data will be short lived. As AI accesses more non-human data "sensing" and collecting devices, it will be able to far surpass the limits of human senses, and their biased storage and processing.

@pirobot668beta 2 дня назад

I was trying to warn everyone when CAPTCHAs were popping up everywhere. It was AI systems training..."click all the squares not containing a Bus"

@Mrluk245 День назад

Yeah, what's bad about it?

@NOLNV1 День назад

@@Mrluk245 having to provide free work to train AI models in order to use sites, usually without being told that's what they are asking you to do. (Also nobody likes AI)

@Mrluk245 День назад

@@NOLNV1 you have to do the work anyway if you do the captcha, so why not do something useful with it? If someone else benefits from it, that doesn't affect you, so why bother? And of course some people like AI, it's quite useful for certain tasks...

@ericlani2622 2 дня назад

Really interesting and well put together

@passion_proh-jects День назад

5:01 I'm eating chips... is she watching me? Cos do! Licensing, T's & C's apply...

@olibertosoto5470 2 дня назад

Now hold on a minute, we're supposed to give AI high quality data! Ohoh!

@Planetside223 День назад

Just a quick reminder that a VPN doesn’t make your Internet browsing any “safer”. It just changes your location so that you can access region locked content. Your computer’s IP address is still visible, and is also kept on the router itself. A hardware proxy is what makes your Internet “safer”.

@donm5354 2 дня назад

They should make a CAPTAIN CRUNCH like breakfast cereal called AI CRUNCH.

@tomhejda6450 2 дня назад

Will you please make as critical video about VPN actual usefulness in online security as you make about anything else? Thanks.

@suichiao3222 День назад

6'13" exactly. I've seen job ads that recruit scientists to train models for $50 per hour, which is slightly higher than what a science professor gets on average in the US.

@JCAtkeson3 День назад

If you gave an AI a body with high bandwidth senses, wouldn't that provide an endless source of real world data? Maybe AIs need to touch grass, literally.

@Thomas-gk42 День назад

The lack of that is the biggest issue to get conscious, I think

@bizopca 2 дня назад

Doesn't 49 authors for a single paper strike you as excessive?

@michaelleue7594 2 дня назад

If you're going to talk about copyright hypocrisy, make sure to mention the Internet Archive and how it's being sued into oblivion by people who think libraries are a disservice to humanity.

@technoman9000 2 дня назад

Libraries are a disservice to dividends

@damsonrhea 2 дня назад

How else are they going to make you pay to access information that was public 20 years ago, or stop you from reading old stories people shared for free so that you have to buy something new! Next you'll tell me [X]-as-a-service is hot garbage! Think of the Monetizers!

@WGG25 2 дня назад

unfortunately it's not that simple

@SmallGuyonTop 2 дня назад

It is not a library. Libraries pay royalties. The Internet Archive does not.

@lv1up 2 дня назад

Yet it is paid for so that it is available for the public. Just because no nation single-handedly controls the internet, we shouldn't share information across borders because.. Royalties? Ludicrous. This is 2024. Royalties were broken with the birth of the internet, THANKFULLY. And rich ass bitches have been doing whatever they can to try and put the lid back on Pandora's box. All I can say is: Good luck Chucks!

@anthonycarbone3826 День назад

It is estimated that up to 5% of newly created music content on Spotify is AI generated. This allows all of the commissions generated to be kept by Spotify. Music is very formulaic, at least in the pop genre, and these formulas are deeply embedded within AI models. Since voice training can produce with 80% accuracy after two weeks of training and achieve 99% accuracy after two months of training this is a huge concern for musicians and artists, both past and present.

@Ken00001010 2 дня назад

Webcrawling was an easy way to get cheap data in the early experimental days of LLM development. But that is not the only way; Tesla trained their AI from data directly observed by thousands of cars driving the roads. However, just throwing the Web at an AI for training is not as valuable as training with curated high information content data, and that is being constructed by the AIs we have, now. When it comes to training data, quality beats quantity.

@swojnowski453 2 дня назад

quality data is experience, not book stuff. Written theory is just bare bones and no flesh.

@boccobadz 2 дня назад

OpenAI still crawls pages with robots.txt. You have to block their scrapers by IP.

@citris1 2 дня назад

The world is a flood of data, more than you can ever use.

@musikSkool День назад

Internet users have been pirating and copyright infringing movies/tv shows/music/games/etc for decades, and the instant they try to train AI on their data, they instantly decide copyright applies to them. A company spends hundreds of millions of dollars making a movie and thousands of people pirate it, and then someone trains an AI on some random YouTube video and everyone complains. Hypocrisy? You don't want large corporations stealing your livelihood? How is it wrong when they do it but you can pirate all the TV shows you want, and you think that is different somehow?

@Gracinda80 День назад

I want to hear your rant about copyright hypocrisy.

@removechan10298 День назад

not "hYpo" but "Hiiipo" for hypocrisy.

@Macallion День назад

I'm a writer currently looking for a new position, and I've seen so many jobs online aimed at writers specifically for training generative AI. Like I'm going to sign up to train a bot to replace me. Pretty minimal wages too, naturally.

@tenmamut День назад

'Public free data' aha last time I checked medical records are not public.

@mr.k905 День назад

I've said this from the beginning: when AI will become so common that it's used by everyone, there simply won't be any new data of value to feed the AI with. A massive data feedback loop will lead to stagnation at best, as the only data available will soon be at least partially (and increasingly) generated by the AI itself.

@sUmEgIaMbRuS День назад

The robots.txt thing is such a cute early-internet-naive idea. I mean, the crawlers have to respect the rules, there's nothing stopping them from saving the data anyway. Or am I getting this wrong?
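
To make the "crawlers have to respect the rules" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the bot names (`ExampleAIBot`, `OtherBot`) are made up for the example. The thing to notice is that the check runs inside the crawler: the file only states the site's wishes, and nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is asked to stay out entirely,
# everyone else is allowed. The bot names are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # A polite crawler runs this check and skips the page on False.
    # A crawler that never calls it can still request the page.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))
    print(is_allowed(ROBOTS_TXT, "OtherBot", url))
```

So the reading in the comment above looks right: compliance is voluntary on the crawler's side.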

@ddally8851 День назад

Imagine if a bot crawled the news sources and learned that Sean “Diddy” combs has been arrested. We just can’t have that sort of thing.

@AndyKidd-o4x День назад

Happy birthday! 🍻From a UK postdoc currently watching while hoping to see the aurora tonight🤞

@Thomas-gk42 День назад

☺️

@big_mac_love 2 дня назад

Not blocking Meta in the robots file maybe also relates to them sharing their model weights with all, whereas OpenAI is not that open.

@yapdog День назад

No, the problem isn't access to high-quality data. It's a complete misunderstanding of the problems they're trying to solve; they seem to believe that collecting a lot of data allows for building a general artificial intelligence. So, they keep making promises about what will be possible solely based on that belief. But now they're running against the limits of collection... and still don't understand the problem.

@asdfqwerty14587 День назад

Yeah, I really don't get how so many people can look at AIs that already have access to millions of times more data than a human ever reads during their lifetime (and have people curating their data too), see them still perform worse than humans, and decide that the reason they're performing worse than humans is that they just don't have enough data.

@yapdog День назад

@@asdfqwerty14587 Exactly

@Medan1993 День назад

Ok, robots.txt doesn't mean "I will block you" but rather "please, restrict to what we allow". That's one misunderstanding of this. I can create a crawler that will just ignore this file completely and will gather everything on the website. Second point - you can spoof what type of crawler (or even not a crawler) you are. You can show that you are a Chrome/Firefox browser and not some sort of Python library, just by changing one line of code. Third - detection is being done more on the behavioural level, not on the "show me who you are" one. That's what Captcha is for. Fourth - companies will lie to get to the data, simple as that. Edit: Ok, it seems Sabine did mention some of this, but only about three-quarters of the way into the video... Wonder how many people will watch it till the end.
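
The second point, that the declared crawler type is just self-reported text, can be shown without touching the network. The sketch below builds a request with Python's standard `urllib.request` and sets the `User-Agent` header to a browser-like string. The header value and the helper name `build_request` are placeholders chosen for illustration, not a real browser's exact string; the request is only constructed and inspected here, never sent.

```python
import urllib.request

# Placeholder browser-like string, assumed for illustration only.
BROWSER_LIKE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the given User-Agent."""
    # This is the "one line": the client decides what it claims to be.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://example.com/", BROWSER_LIKE_UA)
    # Request stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Because the server only sees whatever string the client chooses to send, this fits the comment's third point that detection has to lean on behaviour rather than on the declared identity.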