Says this was posted 6 days ago, but when I go to the site it has a different setup, so links are gone or changed, etc. So do we assume they've added some of the features, like the requirements, into the main installation process?
This is wild! It’s crazy how little input audio it requires. Also I just wanted to say thanks. If it weren’t for you I would have never discovered my passion for creating AI voice models!
@amitnishad0777 No, I guess I could do commissions, but I haven't really thought much about it. I also want to improve before I do something like that, as I'm too amateurish at data cleaning atm.
I've followed your channel since the early days. I'm super happy about your growth, and also super happy when you do content like this... for non-tech people to be able to try and have fun with AI. A dedicated video for everyone to follow. Keep up the good stuff!
Voice synthesis with emotions? That’s a next-level breakthrough for personalizing user experiences. Feels like we're inching closer to seamless AI-human conversations.
I love how you don't assume that I know what you know, bothered explaining the basics, and made timestamps for the more knowledgeable to skip. Excellent, man!!! So we can't train it properly on a larger audio file? (You can't pack enough vocal range into a short clip for professional work...)
Dude, I am a retired software engineer / Java programmer who only used PCs... RELIGIOUSLY... so I totally understand what you did... VERY COOL, you did a fantastic job! When the iPhone 3GS came out and was 9.99 at Best Buy, I got it, switched to ALL APPLE, and never looked back!!!! Hearing what you have to do to get this to work, jeez, I don't miss those days of going through all that crap, but I know you love it and you have been successful. I applaud you for doing what you do; you're very articulate, very intelligent, and I think you did a great job knocking this video out. I wish you all the best in the future. Thank you much for the demo. I will look into whether Apple has something similar.
That mixing of Chinese and English is simply perfect. Any Chinese speaker, whether it's Mandarin or even Cantonese, just speaks like that, and the TTS shows no flaw in its voice, tone, and pronunciation. If I played that for my friends and family, they couldn't really spot the common AI characteristics in it.
Man, your channel is the bomb 💣 And right, that "Spanish" reading was a little bit hilarious and awful at the same time. Hope they make more languages available soon. 3 of your videos in a row. New subscriber here!
After a break, I deleted all the uploaded files and started again, this time successfully. My first error was when installing programs: stick to the older nominated versions! Don't think that by installing a newer version things will be better; they won't! The program is brilliant and will save me a lot of money. Thank you! Where I went wrong was creating the virtual environment: you said to add "conda activate f5", but you must run "conda init" first, hit enter, and then run "conda activate f5". Once done, it went smoothly.
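For anyone hitting the same wall, the fix described in the comment above is roughly this sequence (assuming an existing conda environment named `f5`, as in the video; this is a sketch, not the video's exact steps):

```shell
conda init          # one-time setup so your shell knows how to run `conda activate`
# close and reopen the shell (or re-source your shell profile) after `conda init`
conda activate f5   # activation now succeeds instead of erroring out
```

`conda init` only needs to be run once per shell; after that, `conda activate` works in every new session.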
There was a promise about updates with emotions, right? So far, nothing. With ElevenLabs we need to try workarounds like inserting "(And she says with great sadness)" or "(She says with great anger)" before the text. The context helps; it uses more characters, but in some tests it was worth it for me.
I always wonder why the requirements (specs, VRAM/RAM) are never listed first... xD The Chinese is insane; it always sounds more natural than the original voice lol
Gotta love installing installers for installing installers in an installer that installs the installer needed for a virtual environment used for installing an installer for a tts program. 👍
I'm glad that this is being developed, even if it's still at a point where I wouldn't even enable it if it was as easy as a toggle, let alone dig into code to get it working.
@jaredf6205 I think he had A/B testing going on in the thumbnail. One is a normal waveform thumbnail, and the other also has a waveform pic paired with a... sus anime pic.
This AI is really good...at sounding like a bad audiobook narrator! 😂 It nails those over-the-top emotions, but they don't sound very human. Maybe the problem is that it's trained on audiobooks, where the emotions are often exaggerated. What if we used this "fake emotion" data to our advantage? First, train an AI to recognize those audiobook patterns. Then, train a second AI to spot real emotions in everyday speech from YouTube, podcasts, etc. The second AI could learn to tell the difference between fake and genuine, and we'd get an AI that truly understands how we express emotions! What do you guys think?
Have you tried the ElevenLabs reader for audiobooks? Not all voices are great, but I found the voice of Burt Reynolds works really well for audiobooks. It also works in different languages.
I think that's what a lot of these AI models use. It's called a discriminator, and its job is to do just that: determine whether a piece of media (image, audio, etc.) is genuine or AI-generated. That's the extent of my knowledge; I don't know much beyond that, or whether they use one for this voice model.
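For what it's worth, the discriminator idea mentioned above can be sketched as a toy binary classifier in pure Python. Everything here is hypothetical for illustration: the 1-D "real" vs "generated" feature values stand in for what a real discriminator would learn from full audio or images, and the model is plain logistic regression rather than a deep network:

```python
import math
import random

random.seed(0)

# Toy features: "genuine" samples cluster near +1.0, "generated" near -1.0.
real = [random.gauss(1.0, 0.5) for _ in range(200)]
fake = [random.gauss(-1.0, 0.5) for _ in range(200)]
data = [(x, 1) for x in real] + [(x, 0) for x in fake]
random.shuffle(data)

# A one-weight logistic-regression "discriminator" trained by gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(50):                              # epochs over the toy dataset
    for x, y in data:
        p = 1 / (1 + math.exp(-(w * x + b)))     # predicted P(genuine)
        w += lr * (y - p) * x                    # gradient step on log-loss
        b += lr * (y - p)

# Fraction of samples the trained discriminator labels correctly.
accuracy = sum(
    ((1 / (1 + math.exp(-(w * x + b)))) > 0.5) == (y == 1) for x, y in data
) / len(data)
```

In a GAN-style setup, this classifier would be trained jointly against a generator that keeps trying to fool it; here it just separates two fixed clusters, which it does with high accuracy.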
If you generate anything longer than 10 minutes, you'll notice that the voice model gets worse and worse until it becomes absolute gibberish and then static noise at around an hour
@captteemo9133 I built the bot from scratch. The basis of my bot is Ollama; for fast communication I used Llama 3.2 with 1B parameters. Speech recognition works on Whisper. I used to work with VOSK, and VOSK is not inferior, by the way, but only Whisper can insert punctuation marks into the recognized text. Speech synthesis is based on the Coqui TTS VITS multi-voice model. Unfortunately, it will not work on a smartphone.
There was some any-language-to-any-language AI voice tool too. Does anyone remember it? You could feed it a voice in any language and it would learn from it, and after that it could be used to generate speech in any language. I believe it was possible to make it sing too. It even creates a TTS file, I believe, so you can use that file with any text-to-speech engine.
It's really cool but I need it to be able to blend multiple voices together to create a new original one. Just copying other people's voices is not really ethical when using voices for commercial purposes.
Crazy! I'm interested in the cross-language options, and generally how it handles non-English languages. EDIT: just reached the end, so it's Chinese and English support at the moment. All in all, thanks for the upload; definitely checking this out!
Thank you for the effort in explaining this topic, but the video is too long, with a lot of unnecessary examples. The point was clear early on, so trimming the extras and making it more concise would really improve both the content and the viewing experience. Hope you see this feedback ;)
It did, and it was fun!!! You can find absolutely hilarious examples over on The Lost Narrator's YouTube channel. Yeah, it's My Little Pony voice examples from fan actresses, but I'd say they are some of the best clips I have found.
Yep, and does anyone remember Adobe VoCo? It could do cloning as well as emotions, and it was very real for 2016. I bet big tech already has very advanced stuff in their labs.
Looks great, but the only thing I wanted to know was the inference speed without processing the reference. What would the potential be for realtime if the reference voice were not being processed as part of the inference?
I haven't looked at it yet, but it shows a spectrogram of the clip¹, so it's possible/probable that it generates the entire clip in one go, i.e. it works on every part of the clip at the same time. If that's the case, it could probably create a 20-second clip in, e.g., 15 seconds, but you would still have to wait those 15 seconds before you can hear any of it. I may be wrong, though. ¹ Some text-to-audio systems generate an image of the spectrogram and then convert the spectrogram to an audio file. The spectrogram is a representation of the audio where time is on the x-axis, frequency is on the y-axis, and amplitude is the intensity/color of the pixel.
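For the curious, the spectrogram described in that footnote can be sketched in pure Python: slice the signal into overlapping frames and take the magnitude of a naive DFT of each frame. The frame and hop sizes are arbitrary choices for illustration; real TTS pipelines use FFTs, windowing, and mel filterbanks instead of this slow direct sum:

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram via a naive DFT.
    Rows are time frames (x-axis when plotted), columns are
    frequency bins (y-axis), values are amplitudes (pixel intensity)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2):          # keep non-negative frequencies
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# A pure sine with an 8-sample period: its energy lands in bin 64/8 = 8.
sine = [math.sin(2 * math.pi * n / 8) for n in range(256)]
spec = spectrogram(sine)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Plotted as an image, `spec` is exactly the kind of picture such systems produce; a vocoder then inverts it back to a waveform, and because every frame can in principle be computed at once, the whole clip can be generated in one go.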
I think I will wait for an LM Studio version or a Fooocus/Flux ComfyUI edition. This installation process is so involved I just can't... 😅 Anyway, thank you for the tutorial!
What GPU are you running? Your 30 seconds is around 5000 for me. I tried on Hugging Face with about the same results; Replicate was at about the speeds you were getting.
I write lyrics in every language now and I'm starting to find that there aren't words to describe certain notes of inflection you desire. But somehow, the language model understands that little subtlety that you're looking for. Whether it's a person or not, it does not matter to me. It understands me. Better than any of you lol.
Hmmm, it is almost as good as Character.AI's TTS, and it is not locked to one site! (Correct me if Character.AI's TTS technology can be used outside their site.) But unfortunately, F5 only supports English and Chinese...