STAR: Speech-to-Audio Generation via Representation Learning
📚️ Introduction
STAR is the first end-to-end speech-to-audio generation framework, designed to enhance efficiency and address error propagation inherent in cascaded systems.
In this Space, you can control the model directly with voice input and generate the corresponding audio output.
🗣️ Input
A brief spoken utterance describing the overall audio scene.
Example: A cat meowing and a young female speaking
🎙️ Input Speech Example
🎧️ Output
The model captures both the auditory events and the scene cues in the input speech and generates the corresponding audio.
🔊 Output Audio Example
🛠️ Online Inference
You can upload your own samples, or try the quick examples provided below.
🎯 Quick Examples
Click examples below to try!
🗣️ Speech Input | 📝 Caption |
---|---|