
Train a VITS Speech Model using Coqui TTS | Updated Script and Audio Processing Tools 

NanoNomad

Updated the audio processing tools for this notebook.
The VITS training loop will train or fine-tune a model using the Coqui framework with phonemized text and speaker embeddings.
This is set up for English, but it can be done in other languages; it is easier for languages that are supported by the espeak-ng phonemizer.
Please read the documentation on the Coqui GitHub page.
github.com/coq...
VITS Training Notebook
colab.research...
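
For orientation, here is a minimal sketch of what the training cell boils down to, based on Coqui's public VCTK VITS recipe. Paths, the run name, and hyperparameters are placeholders, not this notebook's exact values, and a recent coqui-tts release is assumed:

```python
# Minimal sketch based on Coqui's VCTK VITS recipe; placeholders throughout.
import os

from trainer import Trainer, TrainerArgs
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "/content/drive/MyDrive/vits-run"  # hypothetical output dir
dataset_config = BaseDatasetConfig(
    formatter="vctk",
    path="/content/drive/MyDrive/my-vctk-ds",    # hypothetical dataset root
)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256,
    num_mels=80, mel_fmin=0, mel_fmax=None,
)
config = VitsConfig(
    model_args=VitsArgs(use_speaker_embedding=True),  # multi-speaker embeddings
    audio=audio_config,
    run_name="vits_vctk",
    batch_size=16,
    eval_batch_size=8,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,                  # phonemized text, per the description
    phoneme_language="en",              # espeak-ng language code
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(config.datasets, eval_split=True)

# Build speaker IDs from the dataset so the embedding table is sized correctly
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)
trainer = Trainer(
    TrainerArgs(), config, output_path, model=model,
    train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()
```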
VCTK Hindi model - 22 kHz audio, 4 speakers
Trained on Mozilla Common Voice and Open Speech and Language Resources datasets to 376,500 steps
**DOWNLOAD TEMPORARILY LOST**
Thorsten-Voice's video on Windows setup
• FREE Voice Cloning in ...
Demucs
github.com/fac...
FFMpeg-Normalize
github.com/slh...

Published: Oct 3, 2024

Comments: 36
@shailendrarathore445 · 1 year ago
Thanks brother..😇😇🥰🥰
@chris7868 · 1 year ago
Hi, I hope you saw the download link for the Hindi model; if not, it's in the video description. Let me know if it works at all lol. The voice quality isn't very good, but the starting samples weren't clear either. You can probably use it as a base for training a clearer-sounding voice though, assuming the character set was mostly correct.
@nyny · 1 year ago
I've got YourTTS working amazingly! Some of the samples were almost perfect. I developed a ton of tricks: replacements to the Whisper text, not using Demucs (just diarizing everything and using VAD to cut out samples with audio), and a few other techniques. If you'd like to collaborate I can send an email or something. Saw you left the YourTTS track, but maybe it's worth another look.
@nanonomad · 1 year ago
Yeah, please shoot me an email. I probably won't read it for a few days, but it'll be there. I'm in poor health, so I don't have a lot of time to work on things lately. If you haven't tried DeepFilterNet for denoising, try poking around its command-line application. It's easy to use and works really well a lot of the time. Voicefixer is another, but it's a bit of a pain.
@bifrostbeberast3246 · 1 year ago
@@nanonomad Get well soon!
@ryusuikagaku · 1 year ago
Hello, I'm not familiar with epochs and steps. How many epochs and steps should I use to get a better result? And how much time is needed to run that many epochs/steps?
@ianboyles2197 · 1 year ago
I think `!pip install numba==0.53` can be used to resolve the dependency issue without restarting the runtime after install.
@nanonomad · 1 year ago
Thanks! I think that did the trick. Much appreciated
@RichardCastuera-d8l · 1 year ago
Thank you for this video! Can you create a tutorial on how to deploy the model?
@sheanjay · 2 months ago
Tutorial on training an OpenSLR dataset 🥹 Sinhala voice needed
@gzilla783 · 1 year ago
I'm having issues where it does not make the file directories for me.
@nanonomad · 1 year ago
Sorry I missed this. Which bit? You can make them manually if you have to
@gzilla783 · 1 year ago
@nanonomad I put them in manually, but now TensorBoard doesn't work and this pops up down the line: "AssertionError: [!] No training samples found in /content/drive/MyDrive/vits-vctk-narrator-22k-ds/". I'm new to coding and Python, sorry for being a bother.
@nanonomad · 1 year ago
@gzilla783 Is that when you're trying to run the training cell with trainer.fit() at the end? That suggests that the dataset isn't found or structured correctly. Did you make it using the Colab notebook tools? Edit: I just noticed the dataset name and path are the same as I was using. Is yours in a directory on Google Drive named vctk-narrator-22k-ds? It doesn't have to be; you can put it wherever, but the cell with the variables at the top has to be changed and then run after the changes. The variables won't get set until the play button is hit.
@allandclive · 1 year ago
Hello, can you do a video on Meta's MMS model for TTS?
@neupanenetra · 1 year ago
Could you please show me the dataset structure to train a Coqui TTS model? I am very confused by the structure. I need to train Nepali text-to-speech.
@nanonomad · 1 year ago
In the VITS videos I use the VCTK dataset format. The dataset format is handled by the data loader. You can write your own loader function; I'm just lazy, so I adjust my datasets to follow the VCTK format instead. Audio files go in a subdirectory wav48_silence_trimmed/speakername, and text files go in a subdirectory txt/speakername. Audio files are 22 kHz mono FLAC, with names ending in _mic1.flac.
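
For reference, a hypothetical script for arranging a custom dataset into that layout. It assumes ffmpeg is on the PATH, and the paths, speaker name, and source globs are all placeholders:

```python
# Lay out clips and transcripts in the VCTK structure described above.
import subprocess
from pathlib import Path

ds_root = Path("/content/drive/MyDrive/my-vctk-ds")  # hypothetical dataset root
speaker = "speaker01"                                # hypothetical speaker name
src_wavs = sorted(Path("raw_audio").glob("*.wav"))   # your segmented clips
src_txts = sorted(Path("raw_text").glob("*.txt"))    # transcripts, same sort order

wav_dir = ds_root / "wav48_silence_trimmed" / speaker
txt_dir = ds_root / "txt" / speaker
wav_dir.mkdir(parents=True, exist_ok=True)
txt_dir.mkdir(parents=True, exist_ok=True)

for i, (wav, txt) in enumerate(zip(src_wavs, src_txts), start=1):
    utt = f"{speaker}_{i:03d}"
    # Resample to 22.05 kHz mono FLAC with the _mic1 suffix the loader expects
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav), "-ac", "1", "-ar", "22050",
         str(wav_dir / f"{utt}_mic1.flac")],
        check=True,
    )
    (txt_dir / f"{utt}.txt").write_text(txt.read_text().strip() + "\n")
```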
@TheEpicGoofball · 1 year ago
Any advice on getting the Duration Predictor to train properly? My models keep coming out sounding drunk around 41k steps.
@nanonomad · 1 year ago
Back up your checkpoints so they don't get overwritten; that way you can A/B test if you need to and not lose progress. You can try a restore run from the checkpoint and reinitialize the DP, but it generally seems to train well unless you have a strange dataset. Has it been getting steadily worse? It may just be "that thing" that models seem to do where they fluctuate in output quality (in spite of the graphs looking fine). It could just need more training, and to be pulled out at just the right spot in the cycle. You may also need to adjust the loss target or lower the LR if you're not getting best-loss checkpoints.
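
A sketch of that backup-and-restore idea, using the Coqui trainer's restore_path mechanism. All paths and the checkpoint filename are placeholders, and the config/model/sample objects are assumed to be set up as in the training sketch earlier:

```python
# Copy checkpoints out of the run dir so later saves can't overwrite them,
# then resume from a known-good one for A/B testing.
import shutil
from pathlib import Path

from trainer import Trainer, TrainerArgs

run_dir = Path("/content/drive/MyDrive/vits-run")         # hypothetical run dir
backup_dir = Path("/content/drive/MyDrive/ckpt-backups")  # hypothetical backups
backup_dir.mkdir(parents=True, exist_ok=True)

for ckpt in run_dir.glob("checkpoint_*.pth"):
    shutil.copy2(ckpt, backup_dir / ckpt.name)

# Restore run from a specific backed-up checkpoint (filename is hypothetical)
trainer = Trainer(
    TrainerArgs(restore_path=str(backup_dir / "checkpoint_41000.pth")),
    config, str(run_dir), model=model,
    train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()
```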
@jimchat · 6 months ago
Hi, I tried running your notebook but there is an error at 'split audio samples with SOX'. It seems like it's not the same code as in the video; maybe you updated it since?
@nanonomad · 6 months ago
If you need to make a dataset by hand, I would recommend using Audacity to split audio, or using an ML voice activity detection model. I guess I broke the SoX cell, but it was a crude method anyway. With Audacity, use the segment-on-silence tool, then use the save-multiple option under File, and it'll save each segment separately. It will handle overlapping silences to keep sentences whole if you fiddle with the noise floor slider when segmenting.
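
One possible ML VAD approach, sketched here with Silero VAD via torch.hub (an assumption, not necessarily the model meant above); the filenames are placeholders:

```python
# Detect speech regions and save each one as its own clip with Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, _, _ = utils

wav = read_audio("long_recording.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

for i, ts in enumerate(speech):
    # Each timestamp dict has sample offsets for the detected speech region
    save_audio(f"clip_{i:04d}.wav", wav[ts["start"]:ts["end"]], sampling_rate=16000)
```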
@muthukumaran6382 · 1 year ago
Hey, nice video! I was trying out the Colab, but the code is not recognizing the backup folder, and specifically not backup.wav. I'm getting this error: `mv: cannot stat 'backup.wav': No such file or directory`. Can you help?
@nanonomad · 1 year ago
I'll try to look at it today, but if it's just one of the processing sections you can probably skip it. Which segment was broken? Or is it more than one part?
@muhammadalfahrezi1745 · 1 year ago
Does this work for voice cloning?
@sadshed4585 · 1 year ago
Is VITS or YourTTS better for fine-tuning on a single voice with ~100 quality audio files?
@nanonomad · 1 year ago
YourTTS is a derivative of VITS, with the speaker encoder and some architecture changes, but I find YourTTS very inconsistent. VITS using d-vectors is probably what I would try. But 100 samples could be 2 minutes or 9 minutes, so that's a pretty big range. I'd say if you have around 5-9 minutes of great audio with a good distribution of phonemes, you may be able to make it work. You'll probably have to adjust the eval split, because the loader is going to complain about the small dataset.
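
A sketch of adjusting the eval split for a tiny dataset with Coqui's load_tts_samples; the values are illustrative, and config is assumed to be the VitsConfig from the training sketch earlier:

```python
# Keep the eval set small so a tiny dataset still leaves enough to train on.
from TTS.tts.datasets import load_tts_samples

train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=10,   # cap the eval set at 10 clips
    eval_split_size=0.05,     # or 5% of the data, whichever applies first
)
```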
@sadshed4585 · 1 year ago
@nanonomad I get 'NoneType has no length' at some point during training when trying to continue, and it stops. Specifically on the rev4 notebook. Any advice?
@Srinivas_Billa · 1 year ago
Hi, do you know how much VRAM you need to fine-tune VITS?
@nanonomad · 1 year ago
12 GB comfortably, but IIRC you can get it going on 6 GB if you lower the batch size, lower the learning rate so it trains properly, and lower the max text length.
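
As a rough sketch, those low-VRAM tweaks might look like this on the VitsConfig from the training sketch earlier; the exact values are guesses to tune per GPU:

```python
# Hypothetical low-VRAM settings; tune for your GPU and dataset.
config.batch_size = 8           # down from 16-32
config.eval_batch_size = 4
config.lr_gen = 1e-4            # lower generator LR (default ~2e-4)
config.lr_disc = 1e-4           # lower discriminator LR
config.max_text_len = 200       # skip very long samples
config.mixed_precision = True   # fp16 saves memory on most GPUs
```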
@gjin2518 · 1 year ago
Can it train using a Japanese voice?
@nanonomad · 1 year ago
Should be able to. I just did a quick search and found this: github.com/coqui-ai/TTS/discussions/1604. There may be a Japanese language model already trained on the Coqui model hub, but I'm not sure.
@Nabuuug · 1 year ago
(it seems my comments are being taken down automatically and I kept reposting them thinking it was a bug, sorry for the spam, you can delete this one)
@nanonomad · 1 year ago
Weird. I checked and I have all the filters disabled. You're not the first person I've heard that from, though. A lot of my comments vanish, but sometimes reappear weeks later. Quite frustrating.
@feixym · 1 year ago
well done ! I want train a chinese model , how should I do
@nanonomad · 1 year ago
Hi. You may be able to fine-tune an already-trained model if there is a Chinese model on the Coqui hub. You will need to research whether it was trained using phonemes or characters and continue training with that method.

If training a new model from the beginning, you will need to decide if you want to train using characters or phonemes. If using phonemes, you need to see if there is support for the language in the phonemizer (espeak, espeak-ng, or gruut). I think espeak has good Chinese support, but you will want to check the documentation to verify that. Phonemes train easier and faster than characters, so if you can, use phonemes.

I posted a video about training a Spanish-language model a while ago; that notebook uses phonemes. For Chinese, you would need to change the two-letter language code in the phoneme_language line and set the language code for the dataset.

Check the Coqui GitHub discussion board, though. A few people have posted about training Chinese models. You may be able to work with them or see how they did things.
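
A hypothetical starting point for those config changes, reusing the config and dataset_config objects from the training sketch earlier; the language codes and cleaner choice are assumptions to verify against your phonemizer's documentation:

```python
# Hypothetical settings for a Chinese run. Verify the code against
# `espeak-ng --voices` ("zh" historically; newer espeak-ng uses "cmn").
config.use_phonemes = True
config.phoneme_language = "zh"                 # two-letter code; verify locally
config.text_cleaner = "multilingual_cleaners"  # assumption, not the notebook's value
dataset_config.language = "zh-cn"              # BaseDatasetConfig language field
```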