Natural sounding open source speech synthesis

by Don Mahurin, 2023-03-05

If you have listened to commonly used open source speech synthesis, you may have wondered why it sounds so primitive compared to other systems.

Listening to espeak, for instance, perhaps you thought that little has changed in 10 or 20 years. Worse than that, you could even compare such speech engines to "SAM" that ran on the Atari or Commodore, 40+ years ago:

https://discordier.github.io/sam/ https://github.com/discordier/sam

But there are modern speech systems which can achieve more natural sounding voices. They are easy to install and relatively fast.

I will show two solutions here, Mimic3 and Coqui, that have the following attributes:

Mimic3 TTS

Installation

pip3 install mycroft-mimic3-tts

Command line use:

mimic3 --play-program play 'This is a test.'

Python use, setup wrapper TTS class, save this as 'tts-mimic3.py'

from mimic3_tts import Mimic3TextToSpeechSystem, Mimic3Settings
import numpy as np

class TTS_mimic3:
	def __init__(self):
		settings = Mimic3Settings();
		settings.voice = 'en_US/ljspeech_low'
		settings.use_deterministic_compute = True
		self._sample_rate = settings.sample_rate
		self.tts = Mimic3TextToSpeechSystem(settings)

	@property
	def sample_rate(self):
		return self._sample_rate

	def text_to_audio(self, text):
		self.tts.begin_utterance()
		self.tts.speak_text(text)
		return [np.frombuffer(data.audio_bytes, dtype=np.int16) for data in self.tts.end_utterance()]

In another file, import this file, and instantiate the class

import tts-mimic3 as tts
tts = tts.TTS_mimic3()

Then create a program to generate and play the resulting speech from text, call it tts.py

import sounddevice as sd
import sys

outstream = sd.OutputStream(channels=1, blocksize=1024, dtype='int16', samplerate=tts.sample_rate)

text = ' '.join(sys.argv[1:]) if len(sys.argv) > 1 else "This is a test."
audio = tts.text_to_audio(text, speaker_audio_file='/tmp/b.flac')
outstream.start()
for a in audio:
        outstream.write(a)
outstream.stop()

Coqui TTS

Installation

pip3 install TTS

Command line use

tts --model_name  tts_models/multilingual/multi-dataset/your_tts --speaker_idx female-en-5 --language_idx en --text 'This is a test.'

Or if you want it to sound like your voice

tts --model_name  tts_models/multilingual/multi-dataset/your_tts --speaker_wav your-voice.wav --text 'This is a test.'

Python use

from TTS.api import TTS
import numpy as np

class TTS_coqui:
	def __init__(self, model = None):
		if model is None: model = TTS.list_models()[0]
		self.tts =  TTS(model)
		self._sample_rate = 16000

	@property
	def sample_rate(self):
		return self._sample_rate

	def text_to_audio(self, text, speaker = None, language = None, speaker_audio_file = None):
		if speaker is None:  speaker=self.tts.speakers[0]
		if language is None: language=self.tts.languages[0]
		data = self.tts.tts(text, speaker=speaker, language=language, speaker_wav = speaker_audio_file)
		data = (np.array(data) * (2**15) ).astype(np.int16)
		return [ data ]

Then import and instantiate this class with the following, and convert to sound with the same calling code as mentioned above

import tts-coqui as tts
tts = tts.TTS_coqui()

Now you can call one or both of these speech engines from code, and if the provided command line tool is not sufficient, use python above: python3 tts.py hello world.

I hope you are able to use the information here and synthesize more natural sounding speech, both from, the command line and in code.