On the Utility of Self-Supervised Models for Prosody-Related Tasks [PDF]
Self-Supervised Learning (SSL) from speech data has produced models that have achieved remarkable performance in many tasks, and that are known to implicitly represent many aspects of information latently present in speech signals.
Guan-Ting Lin+7 more
semanticscholar +1 more source
Text-Free Prosody-Aware Generative Spoken Language Modeling [PDF]
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) ...
E. Kharitonov+10 more
semanticscholar +1 more source
ProsoSpeech: Enhancing Prosody with Quantized Vector Pre-Training in Text-To-Speech [PDF]
Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which
Yi Ren+6 more
semanticscholar +1 more source
Prosody Is Not Identity: A Speaker Anonymization Approach Using Prosody Cloning
Prosody is closely linked to the identity of a speaker, leading to individual pitch and intonation patterns. Therefore, it is challenging in speaker anonymization to generate speech utterances that both keep the original audio’s main prosodic structure ...
Sarina Meyer+5 more
semanticscholar +1 more source
Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis [PDF]
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser ...
Guangzhi Sun+5 more
semanticscholar +1 more source
Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior [PDF]
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features ...
Guangzhi Sun+7 more
semanticscholar +1 more source
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech [PDF]
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio ...
S. Karlapati+5 more
semanticscholar +1 more source
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [PDF]
Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By explicitly providing prosody features to the TTS model, the style of synthesized utterances can thus be controlled.
C. Chien, Hung-yi Lee
semanticscholar +1 more source
CAMP: A Two-Stage Approach to Modelling Prosody in Context [PDF]
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic ...
Zack Hodari+8 more
semanticscholar +1 more source
“Textual Prosody” Can Change Impressions of Reading in People With Normal Hearing and Hearing Loss
Recently, dynamic text presentation, such as scrolling text, has been widely used. Texts are often presented at constant timing and speed in conventional dynamic text presentation.
Miki Uetsuki+2 more
doaj +1 more source