Hybrid CTC/Attention Architecture for End-to-End Speech Recognition 리뷰

August 13, 2022 3 minute read

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Abstract

Hybrid CTC/Attention Architecture로 multiobjective learning
- improve robustness, monotonic alignments
- achieve fast convergence
Joint decoding
- CTC probability를 추가하여 더 robust한 monotonic alignment search를 할 수 있다.

Previous work

A. HMM/DNN

X = T-length speech feature sequence, W = N-length word sequence
Bayes’ theorem와 conditional independence assumption 사용
p(XlS), p(SlW), p(W)are the acoustic, lexicon, and language models.

1) Acoustic model p(X|S)

framewise likelihood function p(x_tls_t) is replaced with framewise posterior distribution.
학습하기 위해 필요한 framewise state alignment s_t 가 HMM/GMM system에 의해 제공됨.

2) Lexicon model p(S|W):

The conversion from W(word sequence) to HMM states is deterministically performed by using a pronunciation dictionary through a phoneme representation

3) Language model p(W)

an m-gram model

B. Connectionist temporal classification (CTC)

conditional independence assumption 사용
C가 될 수있는 π의 probability를 모두 summation

C. Attention mechanism

Compared with the HMM/DNN and CTC approaches, the attention-based approach does not make any conditional independence assumptions.

Content-based attention mechanism

encoder의 hidden vector와 decoder의 hidden vector를 사용

하지만 이 방법은 이전의 attention weights를 사용하지 않기 때문에 sequence의 위치 정보를 고려해주지 못한다.

Location-aware attention mechanism

위의 방법을 확장한 attention mechanism

encoder의 hidden vector와 decoder의 hidden vector 뿐만 아니라 이전 타임스텝의 attention weight도 사용함으로서 더 적절한 attention alignment를 찾을 수 있다.

음성합성에서는 location-sensitive attention 이라고도 하며, multi-head attention이 나오기 전에는 음성인식, 음성합성에서 location-aware attention이 표준으로 많이 사용됐다.

Objective

This is the strong assumption of the attention-based approach. c is the ground truth of the previous characters.

Teacher Forcing?

Teacher Forcing은 decoder의 다음 스텝으로 예측한 결과를 넣어주는 것이 아니라 ground truth를 넣어주는 것이다.

장점

학습이 빠르다

단점

학습과 추론 단계에서의 차이가 있기 때문에 모델의 성능과 안정성을 떨어뜨릴 수 있다. (Exposure Bias Problem)
- 하지만 Exposure Bias Problem이 생각만큼 큰 영향을 미치지 않는다는 2019년 연구 결과가 있다. (T. He, J. Zhang, Z. Zhou, and J. Glass. Quantifying Exposure Bias for Neural Language Generation (2019))

HYBRID CTC/ATTENTION

Multiobjective learning(MOL)

shared encoder를 학습하는데 ctc loss를 auxiliary loss로 사용
- enforce a monotonic alignment between speech and label sequences during training
- forward–backward algorithm in CTC helps to speed up the process of estimating the desired alignment.

Joint decoding

1) Attention-based decoding in general

Let Ω_l be a set of partial hypotheses of the length l.
Ω_0 contains only one hypothesis with the starting symbol,
For l = 1 to L_{max}, beam search is in progress.

문제점

잘못 attention하여 eos 토큰을 빨리 예측하면 hypothesis가 매우 짧아질 수 있다.
같은 부분만 계속 attention하여 동일한 label이 반복되고 hypothesis가 매우 길어질 수 있다.
sequence 길이가 길어질 수록 누적 확률의 값이 작아지기 때문에 length penalty를 고려해줘야한다.

2) Conventional decoding techniques

To alleviate the alignment problem, a length penalty term is commonly used to control the hypothesis length.

where C is the length of sequence C, and γ is a tunable parameter.

문제점

이렇게 튜닝을 해도 매우 길고 매우 짧은 hypotheses를 완전히 배제할 수는 없다.
오히려 input speech에 따라 고정된 비율로 minimum and maximum lengths를 설정하는 것이 효과적
하지만 이 또한 input speech에 비해 매우 짧거나 매우 긴 hypotheses를 막을 수는 없다.

Another approach is the coverage term, η and τ are tunable parameters.

The coverage term represents the number of frames that have received a cumulative attention

greater than τ .

문제점

same label을 반복하는 문제는 해결할 수 있지만, 많은 deletion errors를 발생시키는 eos토큰 문제는 해결할 수 없다.
적절한 하이퍼파라미터를 찾기가 쉽지 않다.

3) Joint decoding

본 논문에서 위의 문제점을 해결하기 위해 제안한 디코딩 방법

기본적인 Attention-based decoding에 CTC probability를 추가

monotonic alignment를 강화하면서 same frames을 반복하는 문제, large jump 문제를 해결할 수 있다.

문제점

attention decoder는 output-label-synchronously, CTC는 frame-synchronously
CTC probabilities를 통합하기 위해서 두 가지 방법을 제안한다.

3.1) Rescoring

a two-pass approach

attention 기반으로 a set of complete hypotheses을 구한다. (beam width만큼 lattice를 미리 뽑아놓는다)
CTC and attention probabilities로 rescoring한다. (기존에는 complete hypotheses에서 누적확률이 가장 높았던 것을 선택했지만, 지금은 CTC probabilities까지 고려해서 rescoring)

3.2) One-pass decoding

we compute the probability of each partial hypothesis using CTC and an attention model.

we utilize CTC prefix probability defined as the cumulative probability of all label sequences that have h as their prefix.

CTC score requires a modified forward algorithm that computes it label-synchronously.

ENDDETECT

스텝을 계속 돌아도 complete hypotheses의 best score보다 높아질 확률이 없다면, 스텝을 중단하여 computation을 줄일 수 있다.
best score in the complete hypotheses recently generated 와 best score in all the complete hypotheses 차이

One-pass search 방법이 rescoring 방법보다 computation도 적고 search error도 줄일 수 있다.

EXPERIMENTS

attention 모델보다 Multiobjective learning(MOL) 모델이 attention alignment를 더 잘 찾는다.

MOL 모델과 joint decoding을 했을 때 좋은 성능을 보였다.
특히 rescoring 보다 one pass decoding을 했을 때 더 좋았다.

Reference

Share on

Twitter Facebook LinkedIn

Sangchun Ha