WaveNet by Google DeepMind | Two Minute Papers #93

Two Minute Papers
Subscribers: 1.6M
Views: 130K

Let's talk about Google DeepMind's WaveNet! This piece of work is about generating audio waveforms for text-to-speech and more. Text-to-speech basically means that a voice reads out whatever we have written down. The difference in this work, however, is that it can synthesize these samples in someone's voice, provided that we have training samples of this person speaking.
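WaveNet models raw audio one sample at a time, predicting a categorical distribution over quantized amplitudes; the paper uses mu-law companding to squeeze 16-bit audio into 256 levels before prediction. A minimal sketch of that companding step (mu=255 follows the paper; the test signal is just illustrative):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] to mu+1 discrete levels (ITU-T G.711
    style), as WaveNet does before predicting each sample with a softmax."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((compressed + 1) / 2 * mu).astype(np.int32)  # ints in [0, mu]

def mu_law_decode(y, mu=255):
    """Invert the companding back to a waveform in [-1, 1]."""
    compressed = 2 * y.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 11)       # toy "waveform"
codes = mu_law_encode(x)
recon = mu_law_decode(codes)
```

The logarithmic companding spends more of the 256 levels on quiet amplitudes, where the ear is most sensitive, which is why 8 bits suffice instead of 65,536 raw levels.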
__________________________
The paper "WaveNet: A Generative Model for Raw Audio" is available here:
arxiv.org/abs/...
The blog post about this with the sound samples is available here:
deepmind.com/b...
The machine learning reddit thread about this paper is available here:
www.reddit.com...
Recommended for you:
Every Two Minute Papers episode on deep learning: • AI and Deep Learning -...
WE WOULD LIKE TO THANK OUR GENEROUS PATREON SUPPORTERS WHO MAKE TWO MINUTE PAPERS POSSIBLE:
Sunil Kim, Julian Josephs, Daniel John Benton, Dave Rushton-Smith, Benjamin Kang.
/ twominutepapers
We also thank Experiment for sponsoring our series. - experiment.com/
Thanks so much to JulioC EA for the Spanish captions! :)
Subscribe if you would like to see more of these! - www.youtube.com...
Music: Dat Groove by Audionautix is licensed under a Creative Commons Attribution license (creativecommon...)
Artist: audionautix.com/
The thumbnail background image was found on Pixabay - pixabay.com/hu...
Splash screen/thumbnail design: Felícia Fehér - felicia.hu
Károly Zsolnai-Fehér's links:
Facebook → / twominutepapers
Twitter → / karoly_zsolnai
Web → cg.tuwien.ac.a...

Published: 6 Sep 2024

Comments: 136
@russellcox3699 • 8 years ago
Morgan Freeman will soon narrate my life.

@levoGAMES • 7 years ago
It's actually possible. We just need someone to do it.

@faselblaDer3te • 7 years ago
It's already possible: 1. Sample Morgan Freeman's voice and feed it into a deep learning TTS engine like the one in the video. 2. Feed it texts generated by an image description AI, which also already exists and works stunningly well. 3. Take a selfie every now and then with your smartphone and feed it to the image description AI. 4. PROFIT. The parts are already there. Someone just has to assemble them.

@DennisAllums • 6 years ago
I know your comment was 7 months ago, but are you saying he or I could do what you just said, now? Where is the deep learning TTS engine? Surely you don't mean Google's text to speech.

@sourabreddy2k15 • 6 years ago
Well, even on a GPU it would take a painful 90*60*60*(lifetime_hours) minutes to synthesize his voice and then narrate your life, so ....

@arunghontale3189 • 5 years ago
There are already some open source implementations. Take a look at this: github.com/keithito/tacotron.
@muditjain7667 • 8 years ago
Freakin' amazing. We could use this to create music from bands that no longer exist!

@TwoMinutePapers • 8 years ago
Looking forward to listening to that for sure! :)

@wandereroftheabyss-o4l • 3 years ago
I think there is Jukebox.
@2Cerealbox • 8 years ago
I think this is the first time I've been legitimately blown away by these results. This field is moving so quickly, it's hard to keep up.

@CircularEntertain • 8 years ago
Imagine if it could take two inputs: what to say, and a drama description. Open-world video games could use this to generate tons and tons of voice lines for insignificant characters. It would of course require the method to synthesize within just a few seconds (though it could pre-"render" all the dialogue at the expense of HDD space). Smash Mouth said it best: _So much to do, so much to see_. Amazing paper, love it.

@euclid2 • 6 years ago
It's becoming a lot faster! Watch his new video on it.
@y__h • 8 years ago
Thank you man, awesome intro to this paper. More machine learning papers please :)

@TwoMinutePapers • 8 years ago
Copy that. They're on the way! :)

@TwoMinutePapers • 8 years ago
Thanks so much JulioC EA for the Spanish captions! :)

@SymEof • 8 years ago
What a time to be alive..!

@TwoMinutePapers • 8 years ago
Indeed! :)

@aonoymousandy7467 • 7 years ago
Simon Janin, I agree, the future will be filled with unimaginable technology.

@halloyooh • 6 years ago
What if the powers that be use this to control us even more? Because it is possible... so technology is growing without control...
@WhiteDragon103 • 8 years ago
When I saw this paper come out, I knew you were going to make a video about it. I'm glad you did. This stuff is mind-blowing. I'm both excited and scared about what we will see 10 years from now.

@y__h • 8 years ago
Some crazy people will impersonate some really powerful world leaders.

@TwoMinutePapers • 8 years ago
We can already do it with video, and now with voice. Let's hope for the best. :)

@JohnBastardSnow • 8 years ago
I'm having an existential crisis from the exposure to this channel.

@giantneuralnetwork • 8 years ago
The breathing and mouth sounds are extremely unsettling; I suggest anyone who hasn't heard them check out the blog post and listen to the babbling clips. We're descending into the uncanny valley for speech synthesis, hopefully with enough momentum to make it up the other side! Combine this with what people at OpenAI (apparently) are up to, unsupervised dialogue/conversation generation trained by reading Reddit, and you've got your own personal intelligent and articulate agent. I'm loving every new generative work that comes out; there's a beauty to the recursive procedure the algorithm uses to generate new sequences. Thanks for bringing these works to a wider audience! +1 for more machine learning papers as well, when you can :-)
@joshbreidinger2616 • 7 years ago
People at OpenAI are working on unsupervised dialogue from Reddit text? That sounds awesome! I couldn't find it on Google, could you send me a link?

@adydick • 7 years ago
Wow, you are talking about the exact things that I don't have the time to look into. Thank you so much, please don't stop!

@TwoMinutePapers • 7 years ago
Thanks for the kind words, happy to have you around! :)

@adydick • 7 years ago
When I see this kind of innovation I get very enthusiastic and optimistic about the future for humankind. I really hope it turns out OK.

@quenz.goosington • 8 years ago
Thanks! This is quite possibly my favorite paper yet!

@TwoMinutePapers • 8 years ago
It's definitely one of the greats out there! :)
@RicoElectrico • 8 years ago
There's a hiss/crackle present in the WaveNet samples. I wonder if it's because of the way they generated the output, or if it's a characteristic of the model. Also, not every concatenative TTS is created equal. Take IVONA for example (which is used in Amazon Echo). It sounds way more natural than the one presented; I'd say it's on par with WaveNet.

@betofc89 • 7 years ago
This is the best YouTube channel ever made!

@TheMightyMagic • 8 years ago
DeepMind is doing some incredible stuff. Károly, why do you think big companies like Google and Baidu publish so much of their cutting-edge research rather than keeping it proprietary?

@TwoMinutePapers • 8 years ago
To the best of my knowledge, DeepMind publishes everything (minus datasets that they are not allowed to at DeepMind Health). If I remember correctly, I read this in their mission statement. Great stuff! :)
@cliffordmjordan • 8 years ago
Fantastic! This is already generating some new ideas and applications for these "dilated convolutions". Thanks!
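For readers wondering what "dilated convolutions" buy you: stacking causal convolutions with dilations 1, 2, 4, 8, ... makes the receptive field grow exponentially with depth instead of linearly. A toy NumPy sketch (2-tap filters with arbitrary, untrained weights):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """2-tap causal convolution: y[t] = w[0]*x[t - dilation] + w[1]*x[t]."""
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:len(x)] + w[1] * padded[dilation:dilation + len(x)]

# Push an impulse through a stack with dilations 1, 2, 4, 8.
x = np.zeros(32)
x[16] = 1.0
y = x
for d in (1, 2, 4, 8):
    y = causal_dilated_conv(y, np.array([0.5, 0.5]), d)

# Causality: nothing leaks to times before the impulse. The impulse now
# influences 1 + (1 + 2 + 4 + 8) = 16 time steps from only 4 layers.
```

With ordinary (dilation-1) convolutions, 4 layers of 2-tap filters would cover only 5 time steps; doubling dilations reach the same span with logarithmically many layers.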
@easter.bunny.6 • 7 years ago
Trained on classical music, but the result sounds like contemporary music. Interesting.

@clehaxze • 8 years ago
I'm excited to see deep neural nets pop up in the next few generations of Vocaloid and CeVIO.

@thatblindnerd5070 • 5 years ago
張安邦 Voctro Labs has been (and still continues to be) working on this very thing with projects like WowTune, and now Voiceful. Vocal synthesis using DeepMind technology is certainly on its way! Unfortunately, right now it's not out for the public to test, but it's coming.

@wfraffle • 5 years ago
Ikr, I would fan so hard over new-gen Vocaloids.

@zakuro8532 • 4 years ago
Vocaloid 6, omg.
@KevinFelstead • 8 years ago
There is a subtle difference between being spoken to and being spoken at. This system needs more input to produce a truly accurate simulation of natural speech. The continuous social (speech-modifying) cues we pick up from the faces of the people we interact with make a huge difference; after all, we learn to speak by watching faces. The feedback from our own face muscles, mimicking what we see (sometimes just neurons firing, not necessarily with motor output), generates the emotional feedback. Imagine how we might modify speech, tone and content upon detecting the corners of the eyes when a warm smile is presented, or not, as the case may be.

@johnbollenbacher6715 • 6 years ago
Kevin Felstead Except that people who are blind from birth have no difficulty learning how to speak.

@kamathlaxminarayana301 • 6 years ago
Listening to stories in your grandma's voice forever!!!! How cool would that be!

@eightrice • 7 years ago
I'm so psyched they decided to go convolutional. Everyone is doing it, it's just so on and lit right now, and I don't expect that it will stray from this trend for a while. Can't wait to have my ex's voice modeled and saying all the nasty stuff I didn't have the balls to ask her to say when I had the chance. This will be awesome!
@berendbeumer9204 • 8 years ago
Awesome video once again; I read DeepMind's paper after watching your video. Super nice progress from DeepMind :)

@TwoMinutePapers • 8 years ago
That's great to hear, that is officially a success - it sparked the flame of curiosity, really cool! :)

@tomkent4656 • 4 years ago
The future of TTS.

@lflee • 8 years ago
From a layman sci-fi buff's point of view - can we use similar techniques to build videos (movies), VR scenes, or physical things like sculptures/architecture, things like that?

@joesmith4546 • 6 years ago
李立峰 Absolutely, all of the things you mentioned have trainable data, or more concisely, data that forms patterns; learning the relationships between their elements might even be easier in some ways than generating speech. Go on Wikipedia and read up about Markov chains, which are essentially just chains of probabilities for different states. Also, while technically not a neural network, it's worth mentioning the WaveFunctionCollapse (WFC) library on GitHub if you are interested in topics related to procedural generation and you'd like to learn more.
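To make the Markov-chain pointer above concrete, here is a tiny first-order word chain; the corpus is made up purely for illustration:

```python
import random
from collections import defaultdict

def build_chain(words):
    """First-order Markov chain: map each word to its observed successors."""
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def sample(chain, start, n, seed=0):
    """Walk the chain: repeatedly pick a random observed successor."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        successors = chain.get(out[-1])
        if not successors:          # reached a word with no known successor
            break
        out.append(rng.choice(successors))
    return out

corpus = "the cat sat on the mat the cat ran".split()
chain = build_chain(corpus)
generated = sample(chain, "the", 5)
```

Every adjacent pair in the output was seen in the training text, which is exactly the "chain of probabilities for different states" idea; WaveNet generalizes this by conditioning each step on a long window of past samples rather than just the previous one.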
@MsGnor • 7 years ago
New to your fabulous channel. Congratulations, love it.

@gafeht • 7 years ago
How will we be able to distinguish between things someone has actually said and things someone has used someone else's voice to say using this kind of technique? It seems like a disaster in a world that already has a difficult time keeping facts straight.

@RonyPlayer • 6 years ago
gafeht Either we come up with ways to spot a generated piece of audio, or audio will not be considered good evidence. Same thing for video.

@ScooterCat64 • 1 year ago
We will go back to how it was before the internet age. We will have to trust people's words and years of friendship instead of computer data.
@storytellerjack22 • 7 years ago
I predict that this will change the face of gaming. Until recently, I could only picture how to generate locations and characters procedurally, but I was worried that music, voice acting, and foley would always have to be recorded laboriously. Mario Maker is the most successful game in the new genre of game-design games, but I look forward to the genre taking over and becoming the new normal for triple-A quality titles crafted by the player. I hope to see gaming become a storytelling experience as much as a "listening", story role-playing experience. If scripts can become incarnate automatically, we're likely to see a sea change in the film industry: fans creating their own sequels, making their own versions of the prequels, and even translating books word for word directly into a film or series. When they surpass us in our capacity for storytelling, I hope they also help us to get outside, cultivate friendships, and stay healthy.

@LarlemMagic • 7 years ago
Everyone is going to implement this everywhere, really cool!

@benjaminlavigne2272 • 6 years ago
The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh and a large stone.

@tomatoso27 • 8 years ago
Awesome! I was waiting for audio and music to come to the NN realm. Didn't expect it so soon though!

@XGamersGonnaGameX • 7 years ago
I think it will be a while before a computer can read a book in an entertaining way. Most books require the reader to speak in a way that coincides with the context the words are written in; the computer would have to grasp many parts of language and literature, not just speech.

@DDryTaste • 3 years ago
You spoke much more relaxed back in 2016.
@DerUltraGamerr • 8 years ago
For those who might be interested: someone already did an implementation of WaveNet in Keras, which can be found at github.com/usernaamee/keras-wavenet. I haven't reviewed it myself yet, but it looks promising.

@Mosfet510 • 7 years ago
I can't wait to see what's coming down the line in AI, a very exciting time in that area! Great video, thank you.

@motherbear55 • 3 years ago
How do they generate the first few samples from scratch? It seems like the network is autoregressive, but it's not clear to me how the very first samples are created if there's nothing to convolve. Does the network also learn to generate the first few samples from nothing, or maybe something like random noise?
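Good question. Because the convolutions are causal, positions before the start of the signal are simply zero-padded, and because the network outputs a probability distribution that gets sampled, generation can bootstrap from an empty (all-zero) context. A toy stand-in for that loop (the tiny sizes and random weights are placeholders for a real trained network, and the linear "model" is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
levels, context = 16, 8              # stand-ins for 256 mu-law levels / real receptive field
W = rng.normal(size=(levels, context)) * 0.5
b = rng.normal(size=levels) * 0.5    # bias: even an all-zero window yields a usable distribution

def next_sample(history):
    """Sample the next quantized value. History shorter than the receptive
    field is zero-padded, which is how generation starts 'from nothing'."""
    window = np.zeros(context)
    if len(history):
        tail = np.asarray(history[-context:], dtype=float)
        window[-len(tail):] = tail / levels   # crude normalization
    logits = W @ window + b
    p = np.exp(logits - logits.max())
    p /= p.sum()                              # softmax over quantized levels
    return int(rng.choice(levels, p=p))

samples = []
for _ in range(32):
    samples.append(next_sample(samples))      # each step conditions on all previous ones
```

The first call sees an all-zero window, so its distribution comes from the learned bias alone; subsequent calls condition on progressively more generated history until the window is full.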
@Chr0nalis • 8 years ago
Finally, proper speech synthesis.

@abdullahadam8219 • 2 years ago
Really useful, thanks so much for this :)

@CasperVanLaar • 3 months ago
I am studying AI and trying to apply it to neuroscience. It's crazy that WaveNet was invented 8 years ago. The chatbots' voices are so natural now.

@nxt_tim • 6 years ago
Can't believe this is already 1½ years old. Feels like it was announced a few months ago.

@ekaterinakatjakurae2035 • 7 years ago
Hey, I really like this video. So, this means I could theoretically feed it the voice of some famous person with lots of audio examples, and it then learns that voice and lets me do text-to-speech in that voice? That would be amazing, but it would also create a lot of weird situations in the future where we'll never be able to tell whether or not some voice recording is authentic or fake. But we could also teach it Obama's voice or Christopher Hitchens' voice and, like you suggested, make it read an audiobook, which would be really cool. I wonder when this becomes widely available for the average user; I would love to try this stuff out.

@PolinomPolynets • 8 years ago
So reading one 10-hour book would take about 6 years )
@conwayying4595 • 7 years ago
Very interesting and informative videos, keep it up!

@Masquerpet • 7 years ago
Subscribed immediately!

@TwoMinutePapers • 7 years ago
Thanks for watching and welcome to our growing club of Fellow Scholars! :)

@remisharoon • 4 years ago
Is this video's speech generated by WaveNet?

@TheAIEpiphany • 3 years ago
"Knows what it had done several seconds ago" is not quite correct. The original WaveNet used a window of only 250-300 ms.
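That hundreds-of-milliseconds figure follows directly from the dilated-convolution stack: with 2-tap filters, each layer of dilation d adds d samples to the receptive field. Assuming a few repeats of the dilation cycle 1, 2, ..., 512 and 16 kHz audio (the repeat count of 5 below is an illustrative guess, not a figure from the paper):

```python
def receptive_field(num_stacks, dilations, filter_width=2):
    """Receptive field in samples of stacked dilated causal convolutions:
    each layer with dilation d adds (filter_width - 1) * d samples."""
    return 1 + num_stacks * (filter_width - 1) * sum(dilations)

dilations = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
n_samples = receptive_field(5, dilations)    # -> 5116 samples
ms = 1000 * n_samples / 16000                # ~320 ms at 16 kHz
```

One dilation cycle covers 1024 samples (64 ms at 16 kHz), so a handful of repeated cycles lands in the ~250-300 ms range the comment quotes.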
@AviPars • 7 years ago
Love this channel.

@valshaped • 4 years ago
The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone.

@BigMTBrain • 8 years ago
Developments like WaveNet pretty much guarantee that the prolific artist Prince, who left behind enough material to train such an AI, will live again - synthetically, in a WaveNet-trained AI - to produce new lyrics and music as only Prince could. \o/

@felidadae • 8 years ago
Can more samples of WaveNet sound synthesis be found anywhere? On the blog there are only a few...

@PharoahJardin • 1 year ago
This is the first time I'm hearing your real voice! I was pretty sure your more recent videos had an artificially generated voice, and this video confirms it!

@senmou5516 • 5 years ago
Thanks for this video. Subscribed.

@midclock • 4 years ago
I suppose that speech generators will be perfect when AIs are able to feel the context of what they are talking about, and adjust the intonation depending on the topic, audience type, and things like that...

@tiagotiagot • 7 years ago
I wonder how long until there is a copyright-free competitor to Pandora that seamlessly plays a unique improv jam session between all your favorite artists 24/7...
@saikiranputta2582 • 7 years ago
It would be great if you could continue with more generic ML/deep learning papers like this one, unlike the graphics papers that you seem to concentrate on more! But great content nonetheless! Kudos!

@TwoMinutePapers • 7 years ago
Thank you for the feedback. No worries, they're coming right up! :)

@Sushilkumar92 • 7 years ago
A rival to WaveNet has arrived - Google's Tacotron.

@StevenCasteelYT • 4 years ago
Thank you sir.

@samuelmideksa5939 • 6 years ago
The URL you provided for the paper does not work. Can you please provide another URL? I wanted to read the paper to learn more about the WaveNet architecture.

@TwoMinutePapers • 6 years ago
Thank you for the feedback. Try this: arxiv.org/abs/1609.03499

@samuelmideksa5939 • 6 years ago
That works, thanks.
@timmiltz2916 • 6 years ago
Very funny about the vacation :)

@sallerc • 7 years ago
Great video, subscribed.

@TwoMinutePapers • 7 years ago
Happy to have you in our growing club of Fellow Scholars! :)

@composerla • 8 years ago
I'd pay for a CD of the music!

@edansw • 6 years ago
CNNs are actually quite popular for 1D sequence input.

@gotel100 • 7 years ago
Where can I read the original ML papers?

@planno6280 • 7 years ago
The end for AudibleBook.com.

@ethiesm1 • 8 years ago
Love it - Más
@MrDonkov • 8 years ago
The speech really has improved a lot. Music also, as far as sound is concerned. However, the composition makes no sense at all; it even sounds scary to me. Like someone lost their mind and went to play piano.

@the0mighty0burrito • 7 years ago
Ydonkov The purpose wasn't to make music, it was to replicate the sounds that the instrument made using AI learning.

@kooshikoo6442 • 6 years ago
It sounded like a contemporary piano piece. As in contemporary music.

@DarkDiripti • 8 years ago
I googled it, and thus far nobody has used it with all of Morgan Freeman's movies. I'm seriously disappointed in you, internet.

@labeau_6 • 1 year ago
I am from Iraq, welcome. How do I find a WaveNet site, please? ❤
@swapanjain892 • 8 years ago
Here comes the wave..

@caner19959595 • 8 years ago
Do you think CNNs can replace LSTMs?

@citiblocsMaster • 6 years ago
5:51 Not quite: one paper down the line, and... it's faster than real time.

@firion0815 • 4 years ago
You were right; maybe not one paper down the line, but still amazingly fast: arxiv.org/abs/1910.11480 deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech

@Tiocazutfarsa • 7 years ago
Are you a TTS robot?

@TwoMinutePapers • 7 years ago
No. Beep boop.

@Greedygoblingames • 4 years ago
Crazy! What are we mere humans going to do with our time in the future? :D

@robo1540 • 7 years ago
Károly Zsolnai-Fehér??? Is it just me, or does that name seem Hungarian???
@noxim_ • 8 years ago
Shouldn't this be "two+ minute papers"?

@2Cerealbox • 8 years ago
It was actually 2 at first, but that's an overly limiting number. You can barely read an abstract in that time. So I think you have to take it the same way as when you tell someone you'll be ready in a minute.

@smarthalayla6061 • 7 years ago
Great! When can we expect the human-like sex robot?

@eviljohnnybravo7575 • 8 years ago
I can only imagine the horror that Audible.com must be experiencing right now.

@BoStanfordify • 7 years ago
Right? Their whole business model out of the window. Soon: free, automatic, natural-sounding text to speech. AI eating the world...

@SinanAkkoyun • 5 years ago
It clearly listened to Liszt.
@swapanjain892 • 8 years ago
Damn!!

@ColacX • 7 years ago
Wow.

@virginboi4654 • 5 years ago
This and DeepFakes... world upside down.

@zeeshanqureshi9252 • 5 years ago
Let's change it to "Six Minute Papers" xD

@artman40 • 6 years ago
You know, I think I prefer Software Automatic Mouth instead.

@jonathanbush6197 • 7 years ago
"Everything is working as intended." Oh good. I'll tell Elon his worries are unfounded. I just hope they let you return from your vacation :-)

@StevenCasteelYT • 4 years ago
Here from Daddy's Car.

@combatplayer • 7 years ago
Finally we can have MMOs where everyone doesn't exclusively speak in text format.

@AMR-bf8nx • 8 years ago
Daisy, Daisy, give me your answer, do...

@catdisc5304 • 7 years ago
AM R exactly what I was thinking about lol

@sutipd4973 • 7 years ago
A boiler is for sale.

@hokiepokie333_CicadaMykHyn • 3 years ago
You think this is cool... Just wait until you or one of your loved ones becomes a Targeted Individual.

@Nutritional-Yeast • 7 years ago
I had to dislike the video, because I was referred to as a lowly, dirty peasant "fellow scholar" slave-race person. In future videos, the narrator should refer to his audience simply as "the learned", and not speak to them like a father would speak to his son. ---- I feel projects like these are a good indication of what's to come. Wink wink. You guys have an idea of what it is that I am referring to.

@ReevansElectro • 6 years ago
The music was awful.