Natural sounding open source speech synthesis
by Don Mahurin, 2023-03-05
If you have listened to commonly used open source speech synthesis, you may have wondered why it sounds so primitive compared to other systems.
Listening to espeak, for instance, perhaps you thought that little has changed in 10 or 20 years. Worse than that, you could even compare such speech engines to "SAM" that ran on the Atari or Commodore, 40+ years ago:
But there are modern speech systems which can achieve more natural sounding voices. They are easy to install and relatively fast.
I will show two solutions here, Mimic3 and Coqui, that have the following attributes:
- natural sounding
- open source
- python interface
- command line interface
pip3 install mycroft-mimic3-tts
Command line use:
mimic3 --play-program play 'This is a test.'
Python use, setup wrapper TTS class, save this as 'tts-mimic3.py'
from mimic3_tts import Mimic3TextToSpeechSystem, Mimic3Settings import numpy as np class TTS_mimic3: def __init__(self): settings = Mimic3Settings(); settings.voice = 'en_US/ljspeech_low' settings.use_deterministic_compute = True self._sample_rate = settings.sample_rate self.tts = Mimic3TextToSpeechSystem(settings) @property def sample_rate(self): return self._sample_rate def text_to_audio(self, text): self.tts.begin_utterance() self.tts.speak_text(text) return [np.frombuffer(data.audio_bytes, dtype=np.int16) for data in self.tts.end_utterance()]
In another file, import this file, and instantiate the class
import tts-mimic3 as tts tts = tts.TTS_mimic3()
Then create a program to generate and play the resulting speech from text, call it tts.py
import sounddevice as sd import sys outstream = sd.OutputStream(channels=1, blocksize=1024, dtype='int16', samplerate=tts.sample_rate) text = ' '.join(sys.argv[1:]) if len(sys.argv) > 1 else "This is a test." audio = tts.text_to_audio(text, speaker_audio_file='/tmp/b.flac') outstream.start() for a in audio: outstream.write(a) outstream.stop()
pip3 install TTS
Command line use
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_idx female-en-5 --language_idx en --text 'This is a test.'
Or if you want it to sound like your voice
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav your-voice.wav --text 'This is a test.'
from TTS.api import TTS import numpy as np class TTS_coqui: def __init__(self, model = None): if model is None: model = TTS.list_models() self.tts = TTS(model) self._sample_rate = 16000 @property def sample_rate(self): return self._sample_rate def text_to_audio(self, text, speaker = None, language = None, speaker_audio_file = None): if speaker is None: speaker=self.tts.speakers if language is None: language=self.tts.languages data = self.tts.tts(text, speaker=speaker, language=language, speaker_wav = speaker_audio_file) data = (np.array(data) * (2**15) ).astype(np.int16) return [ data ]
Then import and instantiate this class with the following, and convert to sound with the same calling code as mentioned above
import tts-coqui as tts tts = tts.TTS_coqui()
Now you can call one or both of these speech engines from code, and if the provided command line tool is not sufficient, use python above: python3 tts.py hello world.
I hope you are able to use the information here and synthesize more natural sounding speech, both from, the command line and in code.