STAR: Speech-to-Audio Generation via Representation Learning

📚️ Introduction

STAR is the first end-to-end speech-to-audio generation framework. It is designed to improve efficiency and to avoid the error propagation inherent in cascaded systems, which first transcribe the speech to text and then generate audio from that text.

In this Space, you can control the model directly with voice input and generate the corresponding audio output.

🗣️ Input

A brief spoken utterance describing the overall audio scene.

Example: A cat meowing and a young female speaking

🎙️ Input Speech Example

🎧️ Output

The model captures both the auditory events and the scene cues in the input speech and generates the corresponding audio.

🔊 Output Audio Example

🛠️ Online Inference

You can upload your own samples, or try the quick examples provided below.
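
If you record your own sample, it may help to confirm that it is a short utterance before uploading. The sketch below uses Python's standard `wave` module to measure the length of a PCM WAV file. The helper names and the 10-second threshold are illustrative assumptions, not limits documented for STAR.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_brief_utterance(path: str, max_seconds: float = 10.0) -> bool:
    # max_seconds is an assumed, illustrative threshold,
    # not a limit stated by the STAR demo.
    return wav_duration_seconds(path) <= max_seconds
```

For example, `is_brief_utterance("my_recording.wav")` returns `True` when the (hypothetical) file `my_recording.wav` is at most 10 seconds long.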

🗣️ Speech Input

🎧️ Generated Audio

🎯 Quick Examples

Click an example below to try it!

🗣️ Speech Input	📝 Caption