Glow-TTS 논문 리뷰

March 20, 2022 3 minute read

Abstract

이전 parallel TTS 모델들은 외부적인 autoregressive TTS 모델로부터 alignment를 받아야 학습할 수 있었다.
외부적인 aligner 없이도 학습할 수 있는 flow-based generative model Glow-TTS를 제안한다.
Tacotron 2와 비교했을 때 합성속도가 15.7배 빨라졌다.

Introduction

이전 parallel TTS 모델들은 외부적인 autoregressive TTS 모델로부터 alignment를 받아야 학습할 수 있었다. 즉, 외부적인 영향을 많이 받았다.
Glow-TTS는 flow와 dynamic programming을 활용하여 텍스트와 음성간의 alignment를 찾는다.
거의 수정 없이 multi-speaker도 가능하다.

Model

인간이 텍스트를 읽을 때 순서대로 단어를 빼먹지 않고 읽기 때문에 유사한 방법으로 Glow-TTS를 설계했다.

Training and Inference Procedures

학습 때는 log-likelihood를 최대화하는 alignment A와 파라미터 θ를 찾는다. 하지만 이것을 한번에 처리하기에는 계산적으로 다루기 힘들어서 두개로 나누어서 처리한다.

첫째는 현재 파라미터 θ에 대해 가장 probable한 monotonic alignment A∗를 먼저 찾는 것이고,

둘째는 고정된 A∗으로 log-likelihood를 최대화하기 위해 파라미터 θ를 업데이트하는 것이다.

alignment A∗을 찾기 위해서 monotonic alignment search (MAS) 알고리즘을 사용했다.

alignment A∗이 계산한 duration label과 맞추기 위해 duruation predictor를 학습했다. 왜냐하면 인퍼런스에서는 duration predictor가 예측한 duration label을 사용하기 때문이다.

FastSpeech 구조를 따라서 duration predictor를 text encoder 앞에 뒀고, logarithmic domain에서 MSE로 학습했다. 또한, duration predictor가 maximum likelihood에 영향을 주지 않도록 backward pass단계에서 gradient가 통과하는 것을 막기 위해 stop gradient operator sg를 적용했다.

인퍼런스 때는 text encoder와 duration predictor가 이전 distribution과 alignment를 예측하고 그것을 기반으로 latent 변수가 샘플링되고, 디코더가 병렬적으로 mel-spectrogram을 합성한다.

Monotonic Alignment Search

alignment A∗은 dynamic programming으로 계산된다. 시간 복잡도는 O(Ttext × Tmel)이고 MAS가 차지하는 시간은 전체 학습 시간의 2%보다 더 적게 걸린다.

inference에서는 MAS를 사용하지 않고, duration predictor가 alignment를 예측한다.

Model Architecture

Encoder and Duration Predictor

인코더는 Transformer TTS 인코더 구조에서 두 가지 변화를 주었다.
1. positional encoding을 제거하고 self-attention에 relative position representation 적용
2. pre-net에 residual connection 추가
duration predictor는 FastSpeech 논문에서 제안한 구조를 동일하게 사용하였다.

Decoder

Glow-TTS의 핵심 부분
학습 때는 mel-spectrogram을 latent representation으로 잘 변환하고, 인퍼런스 때는 이전의 distribution을 mel-spectrogram distribution으로 잘 변환하는 역할
시간 축을 절반으로 하여 160-channel 피쳐맵으로 만들고, 40개 group으로 만들어서 각각 1x1 convolution을 수행했다.
예를 들어 위의 사진처럼 8-channel feature map [a, b, g, h, m, n, s, t]가 있을 때, group 수가 2이면 [a, b, m, n] , [g, h, s, t]이고, group 수가 4이면 [a, m], [b, n], [g, s], [h, t]가 된다.
Affine coupling layer는 WaveGlow의 구조를 따랐다.

Experiments

데이터셋으로 LJSpeech와 LibriTTS corpus 사용
robustness 테스트를 위해 Harry Potter and the Philosopher’s Stone 책에서 데이터를 수집하였고, 최대 길이는 800을 초과함.
인풋 텍스트 토큰으로 phonemes을 사용하였고, vocoder로 WaveGlow를 사용
다화자 Glow-TTS를 학습하기 위해 speaker embedding을 추가하고 hidden dimension을 증가함.

Results

Audio Quality

Ground truth(GT) mel-spectrograms으로 WaveGlow가 생성한 audio의 MOS가 4.19로 TTS model의 upper limit가 된다.
temperature T가 0.333일 때 가장 성능이 좋았고, 모든 모델이 tacotron2보다 성능이 좋았다.

Sampling Speed and Robustness

Sampling Speed

인퍼런스 속도가 텍스트 길이와 상관없이 약 40ms 로 고정이었지만, tacotron2은 길이에 따라 linear하게 증가한다.
end-to-end setting으로 1분 speech를 합성하는데 얼마나 걸리는지 측정해보았는데 55ms 가 걸림. 전체 시간은 1.5s 인데, Glow-TTS가 4%, WaveGlow가 96%정도 걸림.

Robustness

긴 문장을 합성했을 때 얼마나 robust한지 Google STT API로 CER을 측정해보았다.
Tacotron 2는 character가 260을 초과했을 때 CER이 올라갔지만, Glow-TTS는 CER 값이 유지되었다.

Diversity and Controllability

Glow-TTS는 flow기반의 생성 모델이어서 다양한 샘플을 합성할 수 있다.
왼쪽 사진은 T를 0.667로 고정하고 gaussian noise를 다르게하여 측정한 fundamental frequency (F0)
오른쪽 사진은 gaussian noise를 고정하고 T값을 다르게하여 측정한 fundamental frequency (F0)

duration predictor에 양의 값을 곱해서 음성의 속도를 컨트롤할 수 있다.
사진은 1.25, 1.0, 0.75, 0.5를 곱한 mel-spectrogram이다.

Multi-Speaker TTS

Audio Quality

T가 0.667일 때 MOS가 가장 좋았다.

Speaker-Dependent Duration

같은 문장을 다른 화자로 합성한 오디오의 pitch track
화자에 따라 인풋 토큰의 duration을 다르게 예측하는 것을 알 수 있다.

Voice Conversion

모델이 latent representation과 speaker identities를 분리해서 학습한다.
화자의 GT mel-spectrogram을 latent representation으로 바꾸고, 다른 화자의 identity와 함께 합성한 오디오의 사진이다.

Conclusion

Glow-TTS는 내부적으로 monotonic alignment를 찾는 flow-based 생성 모델이다.
tacotron2보다 평균적으로 15.7배 빠르게 합성한다.
오디오의 속도 컨트롤이나 다화자 세팅도 가능하고 긴 문장에 대해서도 robust하다.

Share on

Twitter Facebook LinkedIn

Sangchun Ha

Glow-TTS 논문 리뷰

Abstract

Introduction

Model

Training and Inference Procedures

Monotonic Alignment Search

Model Architecture

Encoder and Duration Predictor

Decoder

Experiments

Results

Audio Quality

Sampling Speed and Robustness

Sampling Speed

Robustness

Diversity and Controllability

Multi-Speaker TTS

Audio Quality

Speaker-Dependent Duration

Voice Conversion

Conclusion

Share on

Leave a comment

You may also enjoy

원씽

세이노의 가르침

kf-deberta-multitask 모델 공개

함께 자라기