I only know a little bit of coding, but your tutorial really sped up the learning process. I would also like to see more videos. To me, it doesn't matter if your videos are long or short. What matters is that we understand what you are teaching. So, yes, your work is good, and thank you for taking the time to teach for free. I appreciate it.
Wow, actually sounds better than I expected! Never tried 'ultra-fast', thanks! Definitely seems like the best results would be obtained from printing a bunch of variations like you suggested and then editing together the best ones.
Hey man, whatever I do I always get the same error, even when trying to upload a single 8-second wav: "RangeError: Maximum call stack size exceeded." Any ideas?
I tried this as well, but I downloaded everything and did it locally. I wanted to convert an audiobook. I gave up because it took 4 days to do 45 minutes. But it sounded amazing! I used the William voice. But it glitched a lot, and the voice became female for several sentences. Don't know why. I did set the seed value to 1, but it still did it. Also, I expect yours sounds robotic because you're breaking it up instead of using full sentences. It seems to know what a sentence sounds like.
That is valuable information! I'll keep that in mind the next time I record audio samples. Since you're not trying to use a specific voice, but one that sounds good for reading audiobooks, have you tried one of Coqui-TTS's models? I think they can produce speech much faster.
omg, thank you so much Martin for taking the time to show us how this works. I ended up with the same type of results: it's a tiny bit robotic, but it really does reflect my tone and speech. The only thing I noticed is that it has somewhat of a hard time with hyphenated words. But the timing is really good.
lol came here for the same reason, used it about 10 days or so ago and everything was fine, could generate a couple dozen clips with no issue. Now after the second one it maxes out ram and crashes.
Excellent indeed, the best voice cloning video I've ever seen. Would it be possible to use this model to clone my voice in my own language (Brazilian Portuguese) or does it only work for English? Thank you.
Maybe the model YourTTS could be of interest for you, because it allows to generate Brazilian Portuguese speech as well. Unfortunately the Tortoise-TTS model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.
Hey Martin! Been using this code for a while, thank you! Just a little comment: lately there have been some errors in the first block of code. Really hoping that you can keep it going in the future, I really need it! Thanks so much again ;)
Hey Martin, this is awesome, just what I needed to help me start working on a TTS model. However, I am wondering if you could do an example video on how I could run this locally on a Windows 10 machine? I am on a closed network where I am running this, and I am not able to use Google Colab.
Hey, great videos Martin! I don't have that much experience with coding and Google Colab, but following along with your videos allowed me to generate some speeches. I am a bit confused about how Colab works and what exactly it is. After I have run the code blocks for the installation of Tortoise and uploaded my audio files, do I have to redo these steps every time I want to generate some speech when I start a new session?
Unfortunately, the answer is yes. In each session, you get remote access to Google Cloud computing resources. You can think of a new session as getting a brand new computer. It already has some programs installed, but you have to install everything you need on top of that. Unfortunately, these sessions are not persistent (otherwise Google would have to keep your session alive in the background, including all the allocated computing resources), so you'll have to repeat these steps every time.
Does Tortoise understand the punctuation of questions? Will it change the intonation of the resulting audio if the input text has a question mark or exclamation mark at the end? Also, do you know how long it would take to generate, let's say, a 1000-word speech with a modern GPU that has 32 GB of memory?
Thanks! :) I think that when using the Tortoise-TTS model, it depends on the length of the speech to be produced. Producing several hours of speech with the Tortoise-TTS model definitely takes a lot of time.
Hey, great video, really well made and easy to follow. However, in the last cell I am getting an error which goes as follows:

TypeError                                 Traceback (most recent call last)
in
     22 os.makedirs(voice_outpath, exist_ok=True)
     23
---> 24 voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
     25
     26 all_parts = []

2 frames
/content/tortoise-tts/tortoise/utils/audio.py in load_wav_to_torch(full_path)
     20     norm_fix = 1.
     21 else:
---> 22     raise NotImplemented(f"Provided data dtype not supported: {data.dtype}")
     23 return (torch.FloatTensor(data.astype(np.float32)) / norm_fix, sampling_rate)
     24

TypeError: 'NotImplementedType' object is not callable

Not sure how to fix that. Any advice would be appreciated.
Awesome!!! Tested a little bit around and it works, but I'm getting these warnings for some reason: "UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")" and "No stop tokens found in one of the generated voice clips. This typically means the spoken audio is too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, try breaking up your input text." (clips are max 10 sec). Any idea? 😅
I've seen that warning appear for me as well. I think you can ignore it, since we don't need the gradients here. Gradients are needed during training to optimize the weights of the model, but in the video we are already using the well-trained model weights, and therefore gradients are not relevant for generating speech. The second warning is likely to occur if the automatically split text segments still contain too much text, so that the generated speech is longer than 10 seconds. Have you checked if there are some words missing in this particular audio sample?
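To illustrate the idea of avoiding that second warning: you can pre-split your input into shorter, sentence-based chunks before handing each one to the model. This is just a hypothetical sketch, not Tortoise's actual splitter; the function name and the character budget are made up for illustration.

```python
import re

def split_text(text, max_chars=200):
    """Split text into sentence-based chunks of at most max_chars characters,
    so each generated clip stays comfortably short. A single sentence longer
    than max_chars is kept as its own (oversized) chunk."""
    # Split on sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

You could then loop over the chunks and call tts_with_preset once per chunk, concatenating the resulting audio afterwards.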
@@martin-thissen Thanks for responding bro. Yes, the texts were a little longer, I guess. And sometimes it turns out fine, sometimes parts of the speech are missing, and sometimes the voice changes its gender 😂😂😂
@@marksackhaarberg1822 Haha that's interesting. You could maybe add a seed so that your results are less random. When calling the method "tts_with_preset" you can pass "use_deterministic_seed" as an additional parameter. The whole thing would then look like this (with 42 as your seed, you can change the number): gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=preset, use_deterministic_seed=42)
Feel free to try it, but I'm pretty sure there is one. From my point of view, there are two reasons for this: 1. The Tortoise-TTS model has been trained to produce the speaker embedding for audio samples of about 10 seconds in length. 2. Such models often use the Mel spectrogram (which is based on the Fourier transform). So using 10-second audio samples at a time instead of a 50-second audio sample can make a big difference. Read more here: wiki.cimec.unitn.it/tiki-index.php?page=Time-frequency+analysis
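To illustrate the first point, here is a minimal, hypothetical sketch of cutting one long recording into roughly 10-second pieces to use as separate samples. Plain Python lists stand in for real audio arrays, and the sample rate is an assumption (Tortoise expects 22.05 kHz wavs, but adjust to your files).

```python
def chunk_audio(samples, sample_rate=22050, chunk_seconds=10):
    """Split a waveform (a sequence of samples) into consecutive
    ~10-second chunks, dropping a trailing fragment shorter than 1 second,
    since a tiny snippet carries little useful spectral information."""
    chunk_len = sample_rate * chunk_seconds
    chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
    if chunks and len(chunks[-1]) < sample_rate:
        chunks.pop()
    return chunks
```

Each chunk could then be saved as its own wav file in the voice folder, instead of uploading one long 50-second sample.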
Hey, I just watched your channel. Big congrats on the hype you're creating, it's awesome!!! I assume you plan to use cloned voices in your videos, right? I am thinking about how we can make this beneficial for both of us 🤔
Hi Martin! :) I have a copy of the Colab notebook, but when I try to run it I get the following error: Encountered error while trying to install package llvmlite. Do you have a solution for that? I already tried to update the requirements, but it didn't work. Thank you in advance & have a great day!
Okay nevermind, I got it already! In case anyone needs help and wants to know: I removed 'numba' from the requirements.txt and put it into a separate code line in my workbook, so the run of the first code block doesn't end because of the described error. Even though llvmlite isn't installed correctly, the workbook runs as nicely as before! :)
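For anyone who wants to script that workaround, here is a hypothetical sketch. The helper name and the example requirement lines are made up for illustration; the idea is just to drop 'numba' from the requirements before installing, then install it in its own cell so a failing llvmlite build doesn't abort the whole first code block.

```python
def filter_requirements(lines, skip=("numba",)):
    """Return requirement lines whose package name is not in `skip`.
    Handles simple '==' and '>=' version pins."""
    kept = []
    for line in lines:
        name = line.strip().split("==")[0].split(">=")[0].lower()
        if name not in skip:
            kept.append(line)
    return kept
```

In the notebook you would rewrite requirements.txt with the filtered lines, run the install, and then do a separate `!pip install numba` afterwards.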
I have used a few different Colabs of Tortoise TTS. It does the actual TTS just fine. But the voice cloning aspect just hasn't been there. It generally sounds like whatever voice it's trained on. Mainly I get a UK accent, even though the samples are not close to that. I guess my next step is to see if I can train it locally.
Hey there, I replied to your comment under the other video. I'm curious if it will help you generate speech that sounds more like your actual voice. :-)
@@martin-thissen Thanks I saw that. I am still working with it, but not having much success. I did find that generating the audio multiple times increases the voice quality. The quality, not the cloning. I added a line to the Colab code to generate the sample multiple times.
@@john_blues True, if you don't state a seed the results are non-deterministic and will be different each time. I hope you will find a way to achieve better speech results. :-)
I have studied this before, though not tried it. I heard the trick is using samples that incorporate MANY variations of vowels, expressions, and intonation, from clips saying stuff like 'the quick brown fox...' etc.
I would recommend checking out the Coqui-TTS library, feel free to check out my video on how to use the library: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--tE0UqE1R8E.html&ab_channel=MartinThissen
@@martin-thissen Also I need help with Tortoise TTS. All of the times I've used it previously, it was working like a charm, and even though it's not perfect, I was cloning voices no problem. Now, whenever I run the installation process, I end up getting errors, thereby preventing me from even using Tortoise ever again. It was said that you have to re-install the packages to get it working again, but I'd like to know HOW I can do the re-installations. *Please help me!!!*
The issue I have run into is how to change or edit the voice samples. For instance, if I wanted to add more or swap out the samples I've used. If I run the cell again to upload files, I get an error. The only way I am able to upload different samples is if I restart and run all again. Is there a better way?
Oh yeah, there is definitely a nicer way to do this. You can delete your uploaded audio files. They are stored at /tortoise-tts/tortoise/voices/. You can also add audio samples manually to this folder. Alternatively, you could also just change the value of the variable CUSTOM_VOICE_NAME. Then the new files will be uploaded to a different folder.
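To sketch that swap-out workflow in code: this is a hypothetical helper, not part of the notebook, using the voices folder path mentioned in the reply. It deletes an existing voice folder and copies in a fresh set of wav files.

```python
import shutil
from pathlib import Path

# Path where the Colab notebook stores uploaded voice samples (an assumption
# based on the reply; adjust if your clone lives elsewhere).
VOICES_DIR = Path("tortoise-tts/tortoise/voices")

def replace_voice_samples(voice_name, wav_paths, voices_dir=VOICES_DIR):
    """Delete any existing samples for `voice_name` and copy in new ones.
    Returns the sorted filenames now present in the voice folder."""
    voice_dir = Path(voices_dir) / voice_name
    if voice_dir.exists():
        shutil.rmtree(voice_dir)   # remove the old samples
    voice_dir.mkdir(parents=True)
    for wav in wav_paths:
        shutil.copy(wav, voice_dir)  # add the new samples
    return sorted(p.name for p in voice_dir.iterdir())
```

After running this, calling load_voice with the same voice name should pick up the new samples without restarting the runtime.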
Hi Martin, your tutorials are excellent. However, all the final results come out sounding American, even though they originally have Irish accents. Any way around this? Cheers
I think this is mainly because the Tortoise-TTS model has been trained on a limited variety of English speech. I've heard that recording full sentences in the audio sample helps, also the audio quality has a big impact on the voice cloning result.
You know, all this stuff isn't interesting if you cannot run it on your own PC, and you still pitch us the Colab. Please do more stuff for the local PC guys. Why on earth is my 4090 gathering dust, lol? Let's make it generate a voice for a waifu I generated for my special "needs", he-he... xD
Unfortunately yes, I think it will take quite some time. I'm not even sure if you could load the model with a 2 GB VRAM GPU. Here the author recommends a GPU with more than 16 GB of VRAM for the Tortoise-TTS model: huggingface.co/jbetker/tortoise-tts-v2/blob/84d641c57ae72bba334f2a2d60ec47a84683e6ae/README.md
Trying to use this Colab notebook and literally following along with the video, and no matter what, it crashes at the final stage after about 6 minutes of processing:

Apr 13, 2023, 4:13:23 PM  WARNING  WARNING:root:kernel dfe188d4-45e1-44dd-a2da-2151c4e743c8 restarted
Apr 13, 2023, 4:13:23 PM  INFO  KernelRestarter: restarting kernel (1/5), keep random ports
Oh, in the video we are already using a pre-trained model. So what I showed in the video is just the inference of the model. Of course, you can also run this model on your local computer. Since many don't have a GPU at home, I like to use Colab notebooks in my videos. But you can also run all the individual cells on your local computer.
I think this happens to you in the last cell, right? In the second cell, among other things, the following statement is executed: from tortoise.utils.audio import load_audio, load_voice, load_voices This imports the load_voice method (and afterwards the method is defined). I think it's best to restart your Colab notebook runtime and make sure that all cells have been successfully executed. I hope this helps, otherwise feel free to text me again. 🙂 Alternatively, you could add the following statement to your last cell: from tortoise.utils.audio import load_audio, load_voice, load_voices
Hey, I put the link for the Colab notebook in the video description. Just open the following link :-) colab.research.google.com/drive/1g_CssJK34kwRi7VRtFd73WvTLq9UbnZT?usp=sharing
Generating Longer Speech isn't working. I'm getting:

FileNotFoundError                         Traceback (most recent call last)
in
      8
      9 # Process text
---> 10 with open(textfile_path, 'r', encoding='utf-8') as f:
     11     text = ' '.join([l for l in f.readlines()])
     12 if '|' in text:

FileNotFoundError: [Errno 2] No such file or directory: '../speech.txt'
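For anyone hitting this: the cell reads its input from speech.txt, so the error usually just means that file was never uploaded or created. A minimal sketch of a guard you could put before that cell (the path and placeholder text here are assumptions; point textfile_path at wherever your copy of the notebook expects the file):

```python
from pathlib import Path

# The "Generating Longer Speech" cell reads its input from a text file.
# Create it with some placeholder text if it doesn't exist yet, so the
# open() call below can't raise FileNotFoundError.
textfile_path = Path("speech.txt")  # adjust to the path your notebook uses
if not textfile_path.exists():
    textfile_path.write_text("Hello, this is the text to be spoken.",
                             encoding="utf-8")

with open(textfile_path, "r", encoding="utf-8") as f:
    text = " ".join(f.readlines())
```

Alternatively, just use the Colab file browser to upload speech.txt to the directory the cell expects before running it.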
Good question, I don't think I can set that up yet (I need more than 4,000 hours of watch time for that). If you like, you're welcome to buy me a beer, but it's absolutely voluntary: paypal.me/martinthissen But I'm really happy that you're asking at all :-)
@@TheAiConqueror I actually haven't done that yet, but I've received several requests lately. If you like, feel free to describe your problem to me. Just send it to my e-mail address: thissen.martin@gmail.com
Unfortunately the model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.
@@martin-thissen Indeed, one of the things to do is to first train it with pairs of voice recordings and their transcriptions in a LibriSpeech-style format, and then we could use it. I know how to do it, but every phase is a bit of work; I was hoping to find some shortcut. If I do it, I will send you a notebook or something. And if you want to create a video with it, go for it.
It’s English only, but I sampled my gf (with permission) who has a very heavy Japanese accent and it spoke with her voice but a perfect American accent. Creepy as he11, but really cool!
Unfortunately the Tortoise-TTS model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.