AAS2F: Ambiguity-Aware Speech-to-Face Synthesis with Speaker-Conditioned Diffusion Models

Steps to use the demo:

1. Upload or record an audio clip containing at least 5 seconds of speech. The demo works best with English, but should work with other languages as well.
2. Once the audio is recorded or uploaded, click the 'Generate' button to start generation.
3. After a few seconds, the generated images will be displayed on the right.
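
To confirm that a clip meets the 5-second minimum before uploading it, a local check along the following lines may help. This is a minimal sketch, not part of the demo: it assumes the clip is an uncompressed PCM WAV file (other formats would need a different reader), and the helper names `wav_duration_seconds` and `is_long_enough` are hypothetical.

```python
import wave

# Minimum clip length stated in the demo instructions.
MIN_SECONDS = 5.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """Return True if the clip is at least min_seconds long (assumed WAV input)."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, `is_long_enough("my_clip.wav")` would report whether a hypothetical local file `my_clip.wav` is long enough. The check measures total clip length only, so a clip with long silences could pass while still containing less than 5 seconds of actual speech.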