STAR: Speech-to-Audio Generation via Representation Learning

📚️ Introduction

STAR is the first end-to-end speech-to-audio generation framework. It is designed to improve efficiency and to address the error propagation inherent in cascaded systems, where a mistake made in an early stage carries through to the generated audio.

In this Space, you can control the model directly with voice input and generate the corresponding audio output.

🗣️ Input

A brief spoken utterance describing the overall audio scene.

Example: A cat meowing and a young female speaking

🎙️ Input Speech Example

🎧️ Output

The model captures both the auditory events and the scene cues in the input speech and generates the corresponding audio.

🔊 Output Audio Example


🛠️ Online Inference

You can upload your own samples, or try the quick examples provided below.

🎯 Quick Examples

Click an example below to try it.
🗣️ Speech Input 📝 Caption