AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline [PDF]
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin.
Bu, Hui +4 more
core +2 more sources
Prompting Large Language Models with Speech Recognition Abilities [PDF]
Large language models (LLMs) have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering.
Yassir Fathullah +11 more
semanticscholar +1 more source
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [PDF]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages ...
Yu Zhang +26 more
semanticscholar +1 more source
End-to-End Speech Recognition: A Survey [PDF]
In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning has brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning.
Rohit Prabhavalkar +4 more
semanticscholar +1 more source
Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition [PDF]
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the conformer architecture for efficient training and inference, we carefully redesigned Conformer with a novel ...
D. Rekesh +7 more
semanticscholar +1 more source
SALM: Speech-Augmented Language Model with in-Context Learning for Speech Recognition and Translation [PDF]
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task ...
Zhehuai Chen +8 more
semanticscholar +1 more source
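The composition listed in the SALM abstract (frozen text LLM, audio encoder, modality adapter, LoRA layers) can be pictured with a minimal PyTorch sketch. The module names, dimensions, and the assumption of a HuggingFace-style LLM that accepts `inputs_embeds` are illustrative, not the authors' code; the LoRA adapters are omitted for brevity.

```python
# Minimal sketch of a SALM-style model: a frozen text LLM consumes audio
# features projected into its embedding space by a small modality adapter.
# Module names, sizes, and the LLM call signature are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects audio-encoder outputs into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)

class SpeechAugmentedLM(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.adapter = ModalityAdapter(audio_dim, llm_dim)
        self.llm = llm
        for p in self.llm.parameters():   # keep the text LLM frozen; only the
            p.requires_grad = False       # adapter (and LoRA, omitted) would train

    def forward(self, audio: torch.Tensor, prompt_embeds: torch.Tensor):
        speech_embeds = self.adapter(self.audio_encoder(audio))
        # Prepend speech embeddings to the embedded text prompt and let the
        # frozen LLM decode the transcript or translation.
        inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
        # Assumes a HuggingFace-style LLM that accepts precomputed embeddings.
        return self.llm(inputs_embeds=inputs)
```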
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks [PDF]
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
Soumi Maiti +5 more
semanticscholar +1 more source
Conformer: Convolution-augmented Transformer for Speech Recognition [PDF]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in automatic speech recognition (ASR), outperforming recurrent neural networks (RNNs).
Anmol Gulati +10 more
semanticscholar +1 more source
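For readers unfamiliar with the convolution-augmented design, the block below is a simplified sketch of the layout the Conformer paper describes: macaron-style half-step feed-forward layers around multi-head self-attention and a depthwise-convolution module. Layer sizes, kernel width, and activation choices here are illustrative assumptions, not the authors' exact configuration; a stack of such blocks over a subsampling frontend forms the encoder.

```python
# Simplified Conformer block: half-step feed-forward layers sandwich
# self-attention and a depthwise-convolution module. Hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                        # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)         # -> (batch, dim, time)
        y = F.glu(self.pointwise1(y), dim=1)     # gated linear unit
        y = self.pointwise2(F.silu(self.bn(self.depthwise(y))))
        return x + y.transpose(1, 2)             # residual connection

class ConformerBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                         # conv module adds its own residual
        x = x + 0.5 * self.ff2(x)                # second half-step feed-forward
        return self.final_norm(x)
```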
End-To-End Audio-Visual Speech Recognition with Conformers [PDF]
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.
Pingchuan Ma +2 more
semanticscholar +1 more source
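A hybrid CTC/Attention model of the kind described in this entry is usually trained on a weighted sum of a CTC loss over the encoder outputs and cross-entropy over the attention decoder's outputs. The function below sketches that combined objective; the interpolation weight, blank index, and tensor layouts are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of a hybrid CTC/attention objective: a weighted sum of CTC loss on
# encoder outputs and cross-entropy on decoder outputs. The weight, blank
# index, and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(encoder_logits, decoder_logits,
                              ctc_targets, decoder_targets,
                              input_lengths, target_lengths,
                              ctc_weight: float = 0.3):
    """encoder_logits: (batch, time, vocab); decoder_logits: (batch, len, vocab)."""
    # CTC expects (time, batch, vocab) log-probabilities.
    log_probs = encoder_logits.transpose(0, 1).log_softmax(dim=-1)
    ctc = F.ctc_loss(log_probs, ctc_targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # The attention decoder is trained with token-level cross-entropy;
    # padded positions carry the ignore index.
    ce = F.cross_entropy(decoder_logits.flatten(0, 1),
                         decoder_targets.flatten(), ignore_index=-100)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```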
Deep Learning Enabled Semantic Communications With Speech Recognition and Synthesis [PDF]
In this paper, we develop a deep learning based semantic communication system for speech transmission, named DeepSC-ST. We consider speech recognition and speech synthesis as the transmission tasks of the communication system.
Zhenzi Weng +5 more
semanticscholar +1 more source

