Tigrinya TTS with Piper
Piper is an open-source text-to-speech synthesizer that supports many languages. It is particularly interesting because it produces good-quality audio from relatively small training datasets. This is possible because Piper uses the espeak-ng phonemizer. In addition, it can train any language using another language as a base, provided the language is well defined in espeak-ng.
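As a quick sanity check, you can inspect the phonemes espeak-ng produces for a Tigrinya sentence before training. This is a sketch only: the `ti` voice assumes your espeak-ng build defines Tigrinya, which this guide requires.

```shell
# Print the IPA phonemization of a Tigrinya sentence (no audio is produced).
# Falls back to a message if espeak-ng is not on PATH.
IPA=$(command -v espeak-ng >/dev/null 2>&1 \
      && espeak-ng -v ti -q --ipa "ሰላም ከመይ ኣለኻ" \
      || echo "espeak-ng not installed")
echo "$IPA"
```

If this prints reasonable phonemes, the phonemizer side of the pipeline is ready.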
Piper has a training guide you can follow to train a voice for Tigrinya or any other language. Here is the Dockerfile we use.
FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN rm /bin/sh && ln -s /bin/bash /bin/sh
RUN apt-get update && apt-get install -y apt-transport-https git python3.9 python3.9-venv espeak-ng
RUN apt-get install -y python3-pip
WORKDIR ".."
RUN git clone https://github.com/rhasspy/piper.git
WORKDIR "../piper/src/python"
RUN python3.9 -m venv .venv
# NOTE: each RUN executes in a fresh shell, so `source .venv/bin/activate`
# would not persist; invoke the venv's binaries by path instead.
RUN .venv/bin/pip3.9 install --upgrade pip
RUN .venv/bin/pip3.9 install --upgrade wheel setuptools
RUN .venv/bin/pip3.9 install -e .
RUN chmod +x build_monotonic_align.sh
RUN ./build_monotonic_align.sh
#RUN pip install piper-phonemize
#RUN cp -r ~/espeak-ng ~/piper/src/python/.venv/lib/python3.9/site-packages/piper_phonemize/
RUN echo $'.venv/bin/python -m piper_train.preprocess \
--language ti \
--input-dir ~/data/low/ \
--output-dir ~/data/train/ \
--dataset-format ljspeech \
--single-speaker \
--debug \
--sample-rate 16000' >pre_train.sh
RUN chmod +x pre_train.sh
RUN echo $'.venv/bin/python -m piper_train \
--dataset-dir ~/data/train/ \
--accelerator gpu \
--devices 1 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 10000 \
--resume_from_checkpoint ~/data/model/english.ckpt \
--checkpoint-epochs 1 \
--precision 32 \
--quality low' > train.sh
RUN chmod +x train.sh
RUN echo $'./pre_train.sh && sleep 5 && ./train.sh' > start.sh
RUN chmod +x start.sh
CMD ["bash", "start.sh"]
Store your training data in the ~/data/ directory. You will need a base model, the espeak-ng-data directory, and your training data.
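For reference, --dataset-format ljspeech expects a metadata.csv of id|transcript pairs next to the audio files. A minimal sketch of the layout (directory names, ids, and transcripts here are illustrative, not required names):

```shell
# Illustrative ljspeech-style layout for the preprocess step.
mkdir -p data-example/low/wav data-example/train data-example/model
# metadata.csv: one "id|transcript" line per utterance, UTF-8 encoded.
cat > data-example/low/metadata.csv <<'EOF'
ti_0001|ሰላም ከመይ ኣለኻ።
ti_0002|ናብ ውሽጢ ቤት መደቀሲኣ ተቓላጢፋ።
EOF
# Each id needs a matching audio file, e.g. data-example/low/wav/ti_0001.wav
# (some setups use wavs/), recorded at the --sample-rate passed to
# piper_train.preprocess (16 kHz in the script above).
```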
We trained on an AWS g5.xlarge instance; training for 10,000 epochs took 3 days (about $150). On a gaming machine with an RTX 3080 with 12 GB of GPU memory, it takes about 6 weeks.
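With the data in place, the container can be built and run along these lines (the image tag is illustrative, and --gpus requires the NVIDIA container toolkit on the host):

```shell
# Build the image from the Dockerfile above, then run it with the data
# directory mounted at /root/data (the container runs as root, so ~ = /root).
docker build -t piper-ti .
docker run --rm --gpus all -v "$HOME/data:/root/data" piper-ti
```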
Testing the model
First, install the piper-phonemize command-line tool:
wget https://github.com/rhasspy/piper-phonemize/releases/download/2023.11.14-4/piper-phonemize_linux_x86_64.tar.gz
tar -xvf piper-phonemize_linux_x86_64.tar.gz
export LD_LIBRARY_PATH=/usr/share/piper_phonemize/lib:$LD_LIBRARY_PATH
Use these commands to generate audio while the model is still training:
echo "ናብ ውሽጢ ቤት መደቀሲኣ ተቓላጢፋ።" | piper_phonemize -l ti \
--espeak-data $ESPEAK_NG_DATA --allow_missing_phonemes | python3.9 \
-m piper_train.infer \
--sample-rate 22050 --checkpoint $TRAIN/lightning_logs/version_0/checkpoints/*.ckpt \
--output-dir $OUTPUT
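$ESPEAK_NG_DATA, $TRAIN, and $OUTPUT in the command above are placeholders; set them to your own paths first, for example:

```shell
# Example values only; point these at your actual directories.
export ESPEAK_NG_DATA=~/data/espeak-ng-data  # espeak-ng phoneme data
export TRAIN=~/data/train                    # output dir of piper_train.preprocess
export OUTPUT=~/piper-audio-out              # where the generated audio goes
mkdir -p "$OUTPUT"
```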
TODO: piper_train.infer requires the Piper Python environment, which is not available outside Docker. As a workaround, convert the checkpoint to ONNX.
Export to ONNX
python3.9 -m piper_train.export_onnx \
train-ti/lightning_logs/version_0/checkpoints/epoch=3015-step=875444.ckpt \
train-ti/tiPiper.onnx
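The ONNX model also needs a JSON config next to it; the config.json written by piper_train.preprocess can be reused. A sketch, assuming the config ended up under train-ti/ (adjust the path if yours lives elsewhere):

```shell
# Piper looks for <model>.onnx.json beside the model file.
cp train-ti/config.json train-ti/tiPiper.onnx.json
# Synthesize with the piper CLI (pip install piper-tts):
echo "ሰላም ከመይ ኣለኻ" | piper --model train-ti/tiPiper.onnx --output_file test.wav
```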