AAS2F: Ambiguity-Aware Speech-to-Face Synthesis with Speaker-Conditioned Diffusion Models

Steps to use the demo:

  1. Upload or record a speech audio clip containing at least 5 seconds of speech. The demo generates face images conditioned on the speaker's voice. The model is trained on English speech, so it works best with English, but it should also work with other languages.
  2. Click the 'Generate' button to start the generation process.
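
If you want to check a clip against the 5-second minimum before uploading it, the following is a minimal sketch using only the Python standard library. It assumes the clip is an uncompressed PCM WAV file. The helper names `clip_duration_seconds` and `is_long_enough` are hypothetical and are not part of the demo. The check measures total clip length, not how much of it is actual speech.

```python
import wave

# Minimum clip length requested in step 1 of the demo instructions.
MIN_SECONDS = 5.0


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """Return True if the clip is at least `min_seconds` long."""
    return clip_duration_seconds(path) >= min_seconds
```

For example, `is_long_enough("my_clip.wav")` returns `False` for a 3-second recording, which would suggest recording a longer clip before clicking 'Generate'.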