The Azure and AI Show|| Speech Studio || Text to Avatar || Prathap Reddy MVP || AI Show || Microsoft

We are excited to announce the public preview release of Azure AI Speech text to speech avatar, a new feature that lets users create talking avatar videos from text input and build real-time interactive bots trained on human images. In this blog post, we introduce the features, benefits, and technical details of text to speech avatar, and show some examples of how you can use it in various scenarios.
What is text to speech avatar?
The text to speech avatar system is a text to speech feature with vision capabilities that allows customers to create synthetic videos of a 2D photorealistic avatar speaking. The neural text to speech avatar models are deep neural networks trained on human video recording samples, and the avatar's voice is provided by a text to speech voice model.
Why do we build avatars? There are two main reasons:
Traditional video content creation requires a lot of time and budget: setting up a shooting environment, filming, editing, and so on. With text to speech avatar, users can create video more efficiently, building training videos, product introductions, customer testimonials, and similar content simply from text input.
With the release of Azure OpenAI Service and neural text to speech, interactive conversation is more natural than before. With text to speech avatar, users can create more engaging digital interactions, such as conversational agents, virtual assistants, and chatbots.
There are three components in an avatar content generation workflow: the text analyzer, the TTS audio synthesizer, and the TTS avatar video synthesizer. To generate an avatar video, text is first fed into the text analyzer, which outputs a phoneme sequence. Then the TTS audio synthesizer predicts the acoustic features of the input text and synthesizes the voice. These two parts are provided by text to speech voice models. Finally, the neural text to speech avatar model predicts lip-synced images from the acoustic features, and these images make up the synthetic video.
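To make the data flow between the three components concrete, here is a minimal Python sketch. It is a toy illustration only and does not use or represent the Azure AI Speech API: the function names (`analyze_text`, `synthesize_audio`, `synthesize_avatar`), the `AcousticFrame` and `VideoFrame` types, and the "vowels open the mouth wider" rule are all hypothetical stand-ins for the real neural models.

```python
from dataclasses import dataclass
from typing import List

VOWELS = set("aeiou")


@dataclass
class AcousticFrame:
    phoneme: str
    energy: float  # toy stand-in for real acoustic features


@dataclass
class VideoFrame:
    index: int
    mouth_open: float  # 0.0 = closed, 1.0 = fully open


def analyze_text(text: str) -> List[str]:
    """Stage 1 (toy text analyzer): text -> 'phoneme' sequence.

    Assumption: each letter is treated as one phoneme.
    """
    return [ch for ch in text.lower() if ch.isalpha()]


def synthesize_audio(phonemes: List[str]) -> List[AcousticFrame]:
    """Stage 2 (toy audio synthesizer): phonemes -> acoustic features.

    Assumption: vowels get high energy, everything else low energy.
    """
    return [AcousticFrame(p, 1.0 if p in VOWELS else 0.3) for p in phonemes]


def synthesize_avatar(frames: List[AcousticFrame]) -> List[VideoFrame]:
    """Stage 3 (toy avatar synthesizer): acoustic features -> lip-sync frames.

    Assumption: mouth openness simply follows the frame energy.
    """
    return [VideoFrame(i, f.energy) for i, f in enumerate(frames)]


def text_to_avatar(text: str) -> List[VideoFrame]:
    """Chain the three stages in the order described above."""
    return synthesize_avatar(synthesize_audio(analyze_text(text)))


if __name__ == "__main__":
    for frame in text_to_avatar("Hi Azure"):
        print(frame)
```

The point of the sketch is the ordering: the video stage consumes the acoustic features produced by the audio stage rather than the raw text, which is what keeps the lips in sync with the voice.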
Follow me on LinkedIn : www.linkedin.c...