Sangchun Ha

Sangchun Ha

Done is better than perfect!

Real Time Speech Enhancement in the Waveform Domain 리뷰

June 28, 2022 1 minute read

https://arxiv.org/pdf/2006.12847.pdf

Introduction

speech enhancement는 여러 개의 metrics이 존재하는데 인간의 평가와는 correlation이 크지 않다.

그래서 객관적, 주관적 평가를 모두 진행했다.

Model

Notations and problem settings

single-microphone(mono) speech enhancement에 집중함.
x = y + n, y는 clean speech, n은 noise speech
우리는 f(x) ≈ y 인 함수를 찾는다!
music source separation에 사용했던 DEMUCS 구조를 가져와서 사용한다.

DEMUCS architecture

a multi-layer convolutional encoder와 decoder with U-net skip connections로 구성되어 있다.
encoder, decoder는 서로 대칭인 구조
input, output은 single channel only
seq2seq 구조이고, causal prediction에서는 unidirectional LSTM 사용하고 non causal models에서는 bidirectional LSTM 사용한다.
encoder의 i-th layer 아웃풋과 decoder의 i-th layer 인풋을 skip connection으로 연결
sinc interpolation filter로 resampling 했다.

Objective

all STFT loss functions의 합을 최소화 하자.

M is the number of STFT losses
L(stft)는 different resolution을 가지고 적용하는데, number of FFT bins ∈ {512, 1024, 2048}, hop sizes ∈ {50, 120, 240}, and window lengths ∈ {240, 600, 1200} 이다.

Experiments

Valentini와 Deep Noise Suppression (DNS)에 대해 객관적, 주관적인 평가를 모두 진행했다.
augmentation와 loss 함수에 대해 ablation study를 진행했다.
noisy한 상황에서 ASR performance가 증가했다.

Implementation details

Evaluation Methods

객관적인 평가

PESQ: Perceptual evaluation of speech quality
- ITU-T P.862.2 wide-band version을 사용
Short-Time Objective Intelligibility (STOI)
- from 0 to 100
CSIG (MOS)
- signal distortion을 평가
CBAK (MOS)
- intrusiveness of background noise 평가
COVL (MOS)
- overall effect 평가

주관적인 평가

ITU-T P.835로 MOS를 수행
level of distortion, intrusiveness of background noise, and overall quality를 15 raters로 나눠 평가를 진행했다.

Training

Valentini 400 epochs
DNS 250 epochs
The audio is sampled at 16 kHz

Model

non causal DEMUCS
- U=2, S=2, K=8, L=5 and H=64
causal DEMUCS
- U=4, S=4, K=8, L=5 and H=48 or H=64
모델에 주기 전에 표준 편차로 input을 normalizing하고, 아웃풋단에서 다시 원상복귀 하였다.

Data augmentation

Remix augmentation
- 하나의 batch안에서 새로운 noisy 혼합물을 만들기 위해 noises를 섞는다.
random shift 적용
- 0에서 S초 사이만큼
BandMask
- mel scale에서 20%의 주파수 부분을 제거한다.
- SpecAug augmentation만 동일

Results

현재 SOTA 모델 DeepMMSE와 비슷하다.
또한 DEMUCS가 다른 baseline보다 우월하다.
dry · x + (1 − dry) · yˆ 로 noise 제거와 신호 보존 사이를 trade-off 할 수 있다.

reverb없이 합성을 섞은 것, 인위적인 reverb와 함께 합성을 섞은 것, real audio에 대해 주관적인 평가를 진행했다.

Ablation

remix augmentation은 많이 오르지 않았다.

The effect on ASR models

Share on

Twitter Facebook LinkedIn

Leave a comment

You may also enjoy

원씽

February 11, 2024 3 minute read

세이노의 가르침

January 28, 2024 1 minute read

kf-deberta-multitask 모델 공개

January 19, 2024 less than 1 minute read

함께 자라기

December 31, 2023 1 minute read