I only know a little bit of coding, but your tutorial really sped up the learning process. I would also like to see more videos. To me, it doesn't matter if your videos are long or short. What matters is that we understand what you are teaching. So, yes, your work is good, and thank you for taking the time to teach for free. I appreciate it.
Wow, actually sounds better than I expected! Never tried 'ultra-fast', thanks! Definitely seems like the best results would be obtained from printing a bunch of variations like you suggested and then editing together the best ones.
Hey man, whatever I do I always get the same error, even when trying to upload a single 8-second wav: "RangeError: Maximum call stack size exceeded." Any ideas?
I tried this as well, but I downloaded everything and did it locally. I wanted to convert an audiobook. I gave up because it took 4 days to do 45 minutes. But it sounded amazing! I used the William voice. But it glitched a lot, and the voice became female for several sentences. Don't know why. I did set the seed value to 1, but it still did it. Also, I expect yours sounds robotic because you're breaking it up instead of using full sentences. It seems to know what a sentence sounds like.
That is valuable information! I'll keep that in mind the next time I record audio samples. Since you're not trying to use a specific voice, but one that sounds good for reading audiobooks, have you tried one of Coqui-TTS's models? I think they can produce speech much faster.
omg, thank you so much Martin for taking the time to show us how this works. I ended up with the same type of results: it's a tiny bit robotic, but it really does reflect my tone and speech. The only thing I noticed is that it has somewhat of a hard time with hyphenated words. But the timing is really good.
lol came here for the same reason, used it about 10 days or so ago and everything was fine, could generate a couple dozen clips with no issue. Now after the second one it maxes out ram and crashes.
Excellent indeed, the best voice cloning video I've ever seen. Would it be possible to use this model to clone my voice in my own language (Brazilian Portuguese) or does it only work for English? Thank you.
Maybe the model YourTTS could be of interest for you, because it allows to generate Brazilian Portuguese speech as well. Unfortunately the Tortoise-TTS model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.
Hey Martin! Been using this code for a while, thank you! Just a little comment: lately there have been some errors in the first block of code. Really hoping that you can keep it going in the future, I really need it! Thanks so much again ;)
Hey Martin, this is awesome, just what I needed to help me start working on a TTS model. However, I am wondering if you could do an example video on how I could run this locally on a Windows 10 machine? I am on a closed network where I am running this, and I am not able to use Google Colab.
Hey, great videos Martin! I don't have that much experience with coding and Google Colab, but following along with your videos allowed me to generate some speeches. I am a bit confused about how Colab works and what exactly it is. After I have run the code blocks for the installation of Tortoise and uploaded my audio files, do I have to redo these steps every time I want to generate some speech when I start a new session?
Unfortunately, the answer is yes. In each session, you get remote access to Google Cloud computing resources. You can think of a new session as getting a brand new computer. It already has some programs installed, but you have to install everything you need on top of that. Unfortunately, these sessions are not persistent (otherwise Google would have to keep your session alive in the background, including all the allocated computing resources), so you'll have to repeat these steps every time.
Does Tortoise understand the punctuation of questions? Will it change the intonation of the resulting audio if the input text has a question mark or exclamation mark at the end? Also, do you know how long it would take to generate, let's say, a 1000-word speech with a modern GPU that has 32 GB of memory?
Thanks! :) I think that when using the Tortoise-TTS model, it depends on the length of the speech to be produced. Producing several hours of speech with the Tortoise-TTS model definitely takes a lot of time.
Hey, great video, really well made and easy to follow. However, in the last cell I am getting an error which goes as follows:

TypeError                                 Traceback (most recent call last)
in
     22 os.makedirs(voice_outpath, exist_ok=True)
     23
---> 24 voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
     25
     26 all_parts = []

2 frames
/content/tortoise-tts/tortoise/utils/audio.py in load_wav_to_torch(full_path)
     20     norm_fix = 1.
     21 else:
---> 22     raise NotImplemented(f"Provided data dtype not supported: {data.dtype}")
     23 return (torch.FloatTensor(data.astype(np.float32)) / norm_fix, sampling_rate)
     24

TypeError: 'NotImplementedType' object is not callable

Not sure how to fix that. Any advice would be appreciated.
Awesome!!! Tested a little bit around and it works, but I'm getting these warnings for some reason: "UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")" and "No stop tokens found in one of the generated voice clips. This typically means the spoken audio is too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, try breaking up your input text." (clips are max 10 sec). Any idea? 😅
I've seen that warning appear for me as well. I think you can ignore it, since we don't need the gradients here. Gradients are needed during training to optimize the weights of the model, but in the video we are already using the well-trained model weights, and therefore gradients are not relevant for generating speech. The second warning is likely to occur if the automatically split text segments still contain too much text, so that the generated speech is longer than 10 seconds. Have you checked if there are some words missing in this particular audio sample?
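To illustrate the idea of avoiding that second warning: you can pre-split your input into shorter, sentence-based chunks before handing each one to the model. This is just a hypothetical sketch, not Tortoise's actual splitter; the function name and the character budget are made up for illustration.

```python
import re

def split_text(text, max_chars=200):
    """Split text into sentence-based chunks of at most max_chars characters,
    so each generated clip stays comfortably short. A single sentence longer
    than max_chars is kept as its own (oversized) chunk."""
    # Split on sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

You could then loop over the chunks and call tts_with_preset once per chunk, concatenating the resulting audio afterwards.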
@@martin-thissen Thanks for responding bro. Yes, the texts were a little longer, I guess. And sometimes it turns out fine, sometimes parts of the speech are missing, and sometimes the voice changes its gender 😂😂😂
@@marksackhaarberg1822 Haha that's interesting. You could maybe add a seed so that your results are less random. When calling the method "tts_with_preset" you can pass "use_deterministic_seed" as an additional parameter. The whole thing would then look like this (with 42 as your seed, you can change the number): gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=preset, use_deterministic_seed=42)
Feel free to try it, but I'm pretty sure there is one. From my point of view, there are two reasons for this: 1. The Tortoise-TTS model has been trained to produce the speaker embedding for audio samples of about 10 seconds in length. 2. Such models often use the Mel spectrogram (which is based on the Fourier transform). So using 10-second audio samples at a time instead of a 50-second audio sample can make a big difference. Read more here: wiki.cimec.unitn.it/tiki-index.php?page=Time-frequency+analysis
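To illustrate the first point, here is a minimal, hypothetical sketch of cutting one long recording into roughly 10-second pieces to use as separate samples. Plain Python lists stand in for real audio arrays, and the sample rate is an assumption (Tortoise expects 22.05 kHz wavs, but adjust to your files).

```python
def chunk_audio(samples, sample_rate=22050, chunk_seconds=10):
    """Split a waveform (a sequence of samples) into consecutive
    ~10-second chunks, dropping a trailing fragment shorter than 1 second,
    since a tiny snippet carries little useful spectral information."""
    chunk_len = sample_rate * chunk_seconds
    chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
    if chunks and len(chunks[-1]) < sample_rate:
        chunks.pop()
    return chunks
```

Each chunk could then be saved as its own wav file in the voice folder, instead of uploading one long 50-second sample.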
Hey, I just watched your channel. Big congrats on the hype you're creating, it's awesome!!! I assume you plan to use cloned voices in your videos, right? I am thinking about how we can make this beneficial for both of us 🤔
Hi Martin! :) I have a copy of the Colab notebook, but when I try to run it I get the following error: Encountered error while trying to install package llvmlite. Do you have a solution for that? I already tried to update the requirements, but it didn't work. Thank you in advance & have a great day!
Okay nevermind, I got it already! In case anyone needs help and wants to know: I removed 'numba' from the requirements.txt and put it into a separate code line in my workbook, so the run of the first code block doesn't end because of the described error. Even though llvmlite isn't installed correctly, the workbook runs as nicely as before! :)
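For anyone who wants to script that workaround, here is a hypothetical sketch. The helper name and the example requirement lines are made up for illustration; the idea is just to drop 'numba' from the requirements before installing, then install it in its own cell so a failing llvmlite build doesn't abort the whole first code block.

```python
def filter_requirements(lines, skip=("numba",)):
    """Return requirement lines whose package name is not in `skip`.
    Handles simple '==' and '>=' version pins."""
    kept = []
    for line in lines:
        name = line.strip().split("==")[0].split(">=")[0].lower()
        if name not in skip:
            kept.append(line)
    return kept
```

In the notebook you would rewrite requirements.txt with the filtered lines, run the install, and then do a separate `!pip install numba` afterwards.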
I have used a few different Colabs of Tortoise TTS. It does the actual TTS just fine. But the voice cloning aspect just hasn't been there. It generally sounds like whatever voice it's trained on. Mainly I get a UK accent, even though the samples are not close to that. I guess my next step is to see if I can train it locally.
Hey there, I replied to your comment under the other video. I'm curious if it will help you generate speech that sounds more like your actual voice. :-)
@@martin-thissen Thanks I saw that. I am still working with it, but not having much success. I did find that generating the audio multiple times increases the voice quality. The quality, not the cloning. I added a line to the Colab code to generate the sample multiple times.
@@john_blues True, if you don't state a seed the results are non-deterministic and will be different each time. I hope you will find a way to achieve better speech results. :-)
I have studied this before, though not tried it. I heard the trick is using samples that incorporate MANY variations of vowels, expressions, and intonation, from clips saying stuff like 'the quick brown fox...' etc.
I would recommend checking out the Coqui-TTS library, feel free to check out my video on how to use the library: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--tE0UqE1R8E.html&ab_channel=MartinThissen
@@martin-thissen Also I need help with Tortoise TTS. All of the times I've used it previously, it was working like a charm, and even though it's not perfect, I was cloning voices no problem. Now, whenever I run the installation process, I end up getting errors, thereby preventing me from even using Tortoise ever again. It was said that you have to re-install the packages to get it working again, but I'd like to know HOW I can do the re-installations. *Please help me!!!*
The issue I have run into is how to change or edit the voice samples. For instance, if I wanted to add more or swap out the samples I've used. If I run the cell again to upload files, I get an error. The only way I am able to upload different samples is if I restart and run all again. Is there a better way?
Oh yeah, there is definitely a nicer way to do this. You can delete your uploaded audio files. They are stored at /tortoise-tts/tortoise/voices/. You can also add audio samples manually to this folder. Alternatively, you could also just change the value of the variable CUSTOM_VOICE_NAME. Then the new files will be uploaded to a different folder.
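To sketch that swap-out workflow in code: this is a hypothetical helper, not part of the notebook, using the voices folder path mentioned in the reply. It deletes an existing voice folder and copies in a fresh set of wav files.

```python
import shutil
from pathlib import Path

# Path where the Colab notebook stores uploaded voice samples (an assumption
# based on the reply; adjust if your clone lives elsewhere).
VOICES_DIR = Path("tortoise-tts/tortoise/voices")

def replace_voice_samples(voice_name, wav_paths, voices_dir=VOICES_DIR):
    """Delete any existing samples for `voice_name` and copy in new ones.
    Returns the sorted filenames now present in the voice folder."""
    voice_dir = Path(voices_dir) / voice_name
    if voice_dir.exists():
        shutil.rmtree(voice_dir)   # remove the old samples
    voice_dir.mkdir(parents=True)
    for wav in wav_paths:
        shutil.copy(wav, voice_dir)  # add the new samples
    return sorted(p.name for p in voice_dir.iterdir())
```

After running this, calling load_voice with the same voice name should pick up the new samples without restarting the runtime.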
Hi Martin, your tutorials are excellent. However, all the final results come out sounding American, even though they originally have Irish accents. Any way around this? Cheers
I think this is mainly because the Tortoise-TTS model has been trained on a limited variety of English speech. I've heard that recording full sentences in the audio sample helps, also the audio quality has a big impact on the voice cloning result.
You know, all this stuff isn't interesting if you cannot run it on your own PC, and you still pitch us the Colab. Please do more stuff for the local PC guys. Why on earth is my 4090 gathering dust, lol? Let's make it generate a voice for a waifu I generated for my special "needs", he-he... xD
Unfortunately yes, I think it will take quite some time. I'm not even sure if you could load the model with a 2 GB VRAM GPU. Here the author recommends a GPU with more than 16 GB of VRAM for the Tortoise-TTS model: huggingface.co/jbetker/tortoise-tts-v2/blob/84d641c57ae72bba334f2a2d60ec47a84683e6ae/README.md
Trying to use this Colab notebook and literally following along with the video, and no matter what, it crashes at the final stage after about 6 minutes of processing:

Apr 13, 2023, 4:13:23 PM  WARNING  WARNING:root:kernel dfe188d4-45e1-44dd-a2da-2151c4e743c8 restarted
Apr 13, 2023, 4:13:23 PM  INFO  KernelRestarter: restarting kernel (1/5), keep random ports
Oh, in the video we are already using a pre-trained model. So what I showed in the video is just the inference of the model. Of course, you can also run this model on your local computer. Since many don't have a GPU at home, I like to use Colab notebooks in my videos. But you can also run all the individual cells on your local computer.
I think this happens to you in the last cell, right? In the second cell, among other things, the following statement is executed: from tortoise.utils.audio import load_audio, load_voice, load_voices This imports the load_voice method (and afterwards the method is defined). I think it's best to restart your Colab notebook runtime and make sure that all cells have been successfully executed. I hope this helps, otherwise feel free to text me again. 🙂 Alternatively, you could add the following statement to your last cell: from tortoise.utils.audio import load_audio, load_voice, load_voices
Hey, I put the link for the Colab notebook in the video description. Just open the following link :-) colab.research.google.com/drive/1g_CssJK34kwRi7VRtFd73WvTLq9UbnZT?usp=sharing
Generating Longer Speech isn't working. I'm getting:

FileNotFoundError                         Traceback (most recent call last)
in
      8
      9 # Process text
---> 10 with open(textfile_path, 'r', encoding='utf-8') as f:
     11     text = ' '.join([l for l in f.readlines()])
     12 if '|' in text:

FileNotFoundError: [Errno 2] No such file or directory: '../speech.txt'
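For anyone hitting this: the cell reads its input from speech.txt, so the error usually just means that file was never uploaded or created. A minimal sketch of a guard you could put before that cell (the path and placeholder text here are assumptions; point textfile_path at wherever your copy of the notebook expects the file):

```python
from pathlib import Path

# The "Generating Longer Speech" cell reads its input from a text file.
# Create it with some placeholder text if it doesn't exist yet, so the
# open() call below can't raise FileNotFoundError.
textfile_path = Path("speech.txt")  # adjust to the path your notebook uses
if not textfile_path.exists():
    textfile_path.write_text("Hello, this is the text to be spoken.",
                             encoding="utf-8")

with open(textfile_path, "r", encoding="utf-8") as f:
    text = " ".join(f.readlines())
```

Alternatively, just use the Colab file browser to upload speech.txt to the directory the cell expects before running it.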
Good question, I don't think I can set that up yet (I need more than 4,000 hours of watch time for that). If you like, you're welcome to buy me a beer, but it's absolutely voluntary: paypal.me/martinthissen But I'm really happy that you're asking at all :-)
@@TheAiConqueror I actually haven't done that yet, but I've received several requests lately. If you like, feel free to describe your problem to me. Just send it to my e-mail address: thissen.martin@gmail.com
Unfortunately the model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.
@@martin-thissen Indeed, one of the things to do is to first train it with pairs of voice recordings and their transcriptions in a LibriSpeech-style format, and then we could use it. I know how to do it, but every phase is a bit of work; I was hoping to find some shortcut. If I do it, I will send you a notebook or something. And if you want to create a video with it, go for it.
It’s English only, but I sampled my gf (with permission) who has a very heavy Japanese accent and it spoke with her voice but a perfect American accent. Creepy as he11, but really cool!
Unfortunately the Tortoise-TTS model can only generate English speech. You can insert text of other languages, but it would be pronounced wrong and would sound off (I tried it for German). Since the model was trained with a multi-speaker English dataset, it won't be able to generate proper speech of other languages. The challenge here is to first create a multi-speaker dataset for a particular language, similar to the English dataset used. Then the model would need to be trained or fine-tuned on this dataset.