
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [PDF]

open access: yes
Conference on Empirical Methods in Natural Language Processing, 2023
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video.
Hang Zhang, Xin Li, Lidong Bing
semanticscholar   +1 more source

High-Fidelity Audio Compression with Improved RVQGAN [PDF]

open access: yes
Neural Information Processing Systems, 2023
Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional ...
Rithesh Kumar   +4 more
semanticscholar   +1 more source

AST: Audio Spectrogram Transformer [PDF]

open access: yes
Interspeech, 2021
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels.
Yuan Gong, Yu-An Chung, James R. Glass
semanticscholar   +1 more source

SoundStream: An End-to-End Neural Audio Codec [PDF]

open access: yes
IEEE/ACM Transactions on Audio Speech and Language Processing, 2021
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/ ...
Neil Zeghidour   +4 more
semanticscholar   +1 more source

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [PDF]

open access: yes
International Conference on Machine Learning, 2023
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs.
Haohe Liu   +7 more
semanticscholar   +1 more source

High Fidelity Neural Audio Compression [PDF]

open access: yes
Trans. Mach. Learn. Res., 2022
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion.
Alexandre Défossez   +3 more
semanticscholar   +1 more source

Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation [PDF]

open access: yes
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural ...
Yusong Wu   +5 more
semanticscholar   +1 more source

WhisperX: Time-Accurate Speech Transcription of Long-Form Audio [PDF]

open access: yes
Interspeech, 2023
Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages.
Max Bain   +3 more
semanticscholar   +1 more source

AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining [PDF]

open access: yes
IEEE/ACM Transactions on Audio Speech and Language Processing, 2023
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from ...
Haohe Liu   +9 more
semanticscholar   +1 more source

CLAP: Learning Audio Concepts from Natural Language Supervision

open access: yes
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023
Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label to many recordings focusing on one task.
Benjamin Elizalde   +3 more
semanticscholar   +1 more source
