WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit
In this paper, we propose an open source, production first, and production ready speech recognition toolkit called WeNet in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single ...
Zhuoyuan Yao +9 more
Recent Advances in End-to-End Automatic Speech Recognition
Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).
Jinyu Li
Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
Emotion recognition datasets are relatively small, making the use of the more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained ...
Leonardo Pepino +2 more
Recent Progress in the CUHK Dysarthric Speech Recognition System
Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date.
Shansong Liu +7 more
Unsupervised Cross-lingual Representation Learning for Speech Recognition
This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
Alexis Conneau +4 more
WENETSPEECH: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total.
Binbin Zhang +11 more
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients).
Daniel S. Park +6 more
Speech recognition with deep recurrent neural networks
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is ...
Alex Graves +2 more
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences ...
Qian Zhang +6 more
Intermediate Loss Regularization for CTC-Based Speech Recognition
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
Jaesong Lee, Shinji Watanabe

