DNN Text-To-Speech Natural Voice Model For Robots

What is better than natural speech on an embedded platform? How about offline, on-device natural speech on that same platform?



Over the past few months, we have been developing a custom end-to-end Text-To-Speech (TTS) model based on deep neural networks that allows completely offline speech generation on embedded platforms like the NVIDIA Jetson Nano and even the Raspberry Pi.

Now we are proud to tell you that we have done it!


In 2017, Google introduced the Tacotron architecture for speech synthesis. We based our model on it, with a few changes of our own. In the end, this helped us create a robust model that can run at real-time speeds.

This opens up a lot of opportunities for social robots that can speak naturally on their own, without relying on costly cloud services or on parametric/concatenative methods that generate robotic-sounding voices. We are still pushing hard to shrink the model's footprint further so that it can run on more devices.


Architecture of the model



Here is a simplified view of the architecture we use. The input characters are converted to vectors by an embedding layer.
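As an illustration of this first step, the sketch below maps characters to integer IDs and looks each ID up in an embedding table. The symbol set, the `text_to_ids` and `embed` helpers, and the random table values are all assumptions made for the example; in a trained model the table is learned, and the actual symbol set is not listed in this post.

```python
import random

# Hypothetical symbol set (an assumption; the real one is not given in the post).
PAD, EOS = "_", "~"
SYMBOLS = [PAD, EOS] + list("abcdefghijklmnopqrstuvwxyz !',.?-")
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def text_to_ids(text):
    """Lowercase the text, drop unknown characters, and append end-of-sequence."""
    ids = [
        SYMBOL_TO_ID[ch]
        for ch in text.lower()
        if ch in SYMBOL_TO_ID and ch not in (PAD, EOS)
    ]
    ids.append(SYMBOL_TO_ID[EOS])
    return ids


def make_embedding_table(num_symbols, dim, seed=0):
    """Random stand-in for a learned embedding matrix (num_symbols x dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_symbols)]


def embed(ids, table):
    """Replace every character ID with its embedding vector."""
    return [table[i] for i in ids]


ids = text_to_ids("Hi!")
vectors = embed(ids, make_embedding_table(len(SYMBOLS), dim=4))
print(ids)           # one ID per character, plus the end-of-sequence ID
print(len(vectors))  # one vector per ID
```

The resulting sequence of vectors is what the rest of the network consumes.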

The embedded sequence is preprocessed by the Prenet module and encoded by a custom convolution bank network.

The encoder output is then fed to the attention cells for location-sensitive processing, and the recurrent networks use the result to create the frames of the output spectrogram.

The decoded spectrogram is then fed through a convolution bank again for post-processing to get the final spectrogram. Finally, we use the Griffin-Lim algorithm to reconstruct natural-sounding speech audio from that spectrogram.


Data Acquisition

In TTS research, most models are trained and tested on the LJSpeech dataset by Keith Ito. Although it contains over 23 hours of audio data and transcripts, we found that it lacks the vocabulary we expect of a modern TTS system.

To overcome this issue, we turned to a dataset provided by the BBC containing two years of its news articles in fields such as sports, politics, science, and business.

Analyzing this data, we found that it has a rich vocabulary and makes a good addition to the existing LJSpeech dataset.

The combined and cleaned dataset, together with our custom voice recordings, contained over 50 hours of audio for training the model.


Lexical Diversity of the datasets
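This post does not say which measure of lexical diversity was used. One common and simple measure is the type-token ratio (unique words divided by total words), sketched below as an assumption. The `tokenize`, `type_token_ratio`, and `new_vocabulary` helpers are hypothetical names, and the regular-expression tokenizer is deliberately crude.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Unique words divided by total words; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def new_vocabulary(base_text, extra_text):
    """Words that appear in extra_text but not in base_text."""
    return set(tokenize(extra_text)) - set(tokenize(base_text))


print(type_token_ratio("the cat and the dog"))
print(sorted(new_vocabulary("the cat", "the striker scored")))
```

Note that the type-token ratio drops as a text gets longer, so it is only meaningful when comparing samples of similar size.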


Training the model

Once all the data is processed and ready, we train the model on powerful cloud machines, which accelerate the training and testing process.

We train the model until the loss is low enough and the encoder and decoder timesteps of the audio frames align (seen as a diagonal line in the clip below).
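The diagonal is usually judged by eye, but it can also be given a rough number. The sketch below is one possible heuristic, not a standard metric and not necessarily what was used here: it takes an attention matrix indexed as `alignment[decoder_step][encoder_step]` (an assumed layout), finds the most-attended encoder step for each decoder step, and measures how far those peaks stray from the ideal diagonal.

```python
def attention_peaks(alignment):
    """For each decoder step, return the encoder step with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in alignment]


def diagonal_score(alignment):
    """Score in [0, 1]; 1.0 means every peak lies exactly on the diagonal."""
    peaks = attention_peaks(alignment)
    n_dec, n_enc = len(alignment), len(alignment[0])
    if n_dec < 2 or n_enc < 2:
        return 1.0
    total_deviation = 0.0
    for step, peak in enumerate(peaks):
        expected = step / (n_dec - 1)  # relative position along the decoder axis
        actual = peak / (n_enc - 1)    # relative position along the encoder axis
        total_deviation += abs(expected - actual)
    return 1.0 - total_deviation / n_dec


aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
stuck = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(diagonal_score(aligned))
print(diagonal_score(stuck))
```

A score close to 1 suggests a clean diagonal, while a model whose attention stays stuck on the first characters scores much lower.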

But these stats mean nothing if the speech doesn’t sound good. So, at regular intervals, the model generates a sample for us to judge the quality of the training.



Optimizing the model

This is the most important step as we need an efficient and portable model for our robots.

The trained model graph contains a lot of metadata and extra operations that help during training but are unnecessary for deployment in a production environment.

To guide this process, we need to consider several factors. How much quality loss can we live with in the final speech? How much memory can we afford for the model to run? Is there any overhead for loading the model into memory?

Luckily, the TensorFlow framework provides a few easy options to freeze the weights of the graph and remove training metadata.

We also pruned the model by combining several operations into one and removing redundant operations, and applied quantization where possible. We saw up to a 3x performance improvement in loading the graph and in synthesis on both CPU and GPU. This allowed for real-time synthesis on embedded platforms like the NVIDIA Jetson Nano.
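To make the quantization idea concrete, here is a minimal, framework-free sketch of 8-bit linear quantization of a list of weights. It is an illustration under assumptions, not the TensorFlow tooling itself: the `quantize` and `dequantize` helpers are hypothetical, and the exact scheme applied to the model is not described in this post.

```python
def quantize(weights, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] using a linear scale."""
    levels = 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [code * scale + lo for code in codes]


weights = [-1.0, 0.0, 0.5, 1.0]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(worst_error)  # bounded by half a quantization step
```

Each weight is stored in one byte instead of four, and the error per weight is at most half a quantization step. That trade-off is the quality loss versus memory question raised above.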




Our final model generates natural-sounding speech, and it is also possible to add a few variations in how the model pronounces words. The model was tested on laptops, desktops, the NVIDIA Jetson Nano, and the Raspberry Pi 3B+.

You can watch our video on YouTube to listen to the samples generated.

Future Work

We are not stopping here. We will keep pushing to improve on this and to test newer neural vocoder architectures that could make the model even faster and allow more flexibility in voice modulation.

TensorFlow Lite (tflite) and NVIDIA’s TensorRT are promising platforms, but they still require efficient workarounds for the dynamic shapes the model needs. This field is under active research, and every day we see many papers published proposing better ways of generating human-like speech. Keep checking this blog to see our updates!

Contact Us

SmartLife Robotics is looking for partners to work together on future projects. Feel free to leave your ideas on how we can introduce new AI solutions -> Contact Us



SmartLife Robotics is a startup that builds AI-powered, socially intelligent interactive robots for use in a wide spectrum of fields: https://smartlife.global