Scene-aware Far-field Automatic Speech Recognition


Abstract

We propose a novel method for generating scene-aware training data for far-field automatic speech recognition. We use a deep learning-based estimator to non-intrusively compute the sub-band reverberation time of an environment from its speech samples. We model the acoustic characteristics of a scene with its reverberation time and represent it using a multivariate Gaussian distribution. We use this distribution to select acoustic impulse responses from a large real-world dataset for augmenting speech data. The speech recognition system trained on our scene-aware data consistently outperforms a system trained with many more random acoustic impulse responses on the REVERB and AMI far-field benchmarks. In practice, we obtain a 2.64% absolute improvement in word error rate compared with using training data of the same size with uniformly distributed reverberation times.
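The pipeline described in the abstract can be summarized as: estimate sub-band reverberation times (T60) from the target scene's speech, fit a multivariate Gaussian to those estimates, and select impulse responses from a large IR corpus that match the scene's distribution. The Python sketch below illustrates one way to carry out the last two steps. It is a minimal illustration, not the paper's implementation: the T60 arrays are placeholders, the function names are hypothetical, and ranking IRs by likelihood under the fitted Gaussian is one plausible selection rule.

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_scene_distribution(scene_t60s):
        """Fit a multivariate Gaussian to sub-band T60 vectors.

        scene_t60s: (num_samples, num_subbands) array of reverberation
        times estimated non-intrusively from the scene's speech.
        """
        mean = scene_t60s.mean(axis=0)
        cov = np.cov(scene_t60s, rowvar=False)
        return multivariate_normal(mean=mean, cov=cov, allow_singular=True)

    def select_impulse_responses(scene_dist, ir_t60s, num_select):
        """Pick the IRs whose sub-band T60s best match the scene model.

        ir_t60s: (num_irs, num_subbands) sub-band T60s of candidate IRs.
        Returns indices of the num_select most likely IRs.
        """
        log_likelihood = scene_dist.logpdf(ir_t60s)
        return np.argsort(log_likelihood)[::-1][:num_select]

    # Hypothetical usage: 7 sub-bands, placeholder T60 estimates.
    rng = np.random.default_rng(0)
    scene_t60s = 0.6 + 0.05 * rng.standard_normal((200, 7))  # target scene
    ir_t60s = rng.uniform(0.1, 1.5, size=(10000, 7))          # IR corpus
    scene = fit_scene_distribution(scene_t60s)
    chosen = select_impulse_responses(scene, ir_t60s, num_select=500)

The selected impulse responses would then be convolved with clean speech to produce scene-aware reverberant training data, in contrast to sampling IRs uniformly at random from the corpus.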

Paper

Scene-aware Far-field Automatic Speech Recognition, arXiv:2104.10757.
Zhenyu Tang and Dinesh Manocha

@misc{tang2021scene,
      title={Scene-aware Far-field Automatic Speech Recognition}, 
      author={Zhenyu Tang and Dinesh Manocha},
      year={2021},
      eprint={2104.10757},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Code

Git repo