Microsoft Can Now Create a Deepfake From a Photo and an Audio File

Microsoft Research Asia has presented its VASA-1 AI model, which generates an animated video of a person from a single photo and an audio clip.

The model is presented as a way to create realistic avatars, for example for video messaging. “It opens the door to real-time appointments with lifelike avatars that emulate human conversational behaviour,” the accompanying research report said. Of course, the same model could also be used to make it look like just about anyone is saying whatever you want.

The VASA framework uses machine learning to analyze a static image and generate realistic video from it. The model does not clone voices; it animates the face to match existing audio input. In one possible scenario, you could record an audio message and have an avatar deliver it with lifelike animation. The model adds realistic head movements, tics and other behaviours on its own.
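To make the shape of that pipeline concrete, here is a minimal illustrative sketch. VASA-1’s code has not been released, so the TalkingHeadModel class, its animate method and generate_avatar_video below are hypothetical stand-ins for the components the article describes, not Microsoft’s actual API.

```python
# Purely illustrative sketch of a single-photo talking-head pipeline as
# the article describes it. VASA-1's code is not public, so every name
# below is a hypothetical stand-in, not Microsoft's API.

from pathlib import Path


class TalkingHeadModel:
    """Hypothetical placeholder for a VASA-style generator."""

    def animate(self, photo: bytes, audio: bytes) -> list:
        # A real model would roughly:
        #   1. encode identity and appearance from the static photo,
        #   2. derive lip-sync and head-motion latents from the audio,
        #   3. decode a sequence of video frames.
        # This placeholder just returns an empty frame list.
        return []


def generate_avatar_video(photo_path: str, audio_path: str) -> list:
    photo = Path(photo_path).read_bytes()  # one static portrait
    audio = Path(audio_path).read_bytes()  # existing recording; no voice cloning
    return TalkingHeadModel().animate(photo, audio)
```

The point of the sketch is the interface: one static image and one audio track go in, a sequence of video frames comes out, and the audio itself is reproduced rather than synthesized.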

Deepfakes
Making deepfakes is not new in itself. However, most existing techniques rely on multiple photos or longer video files. Adding emotions and other behaviours from just a single photo is quite new. The VASA-1 model also appears strong at lip-syncing and at showing (generic) tics and head movements. Microsoft Research trained the model on the VoxCeleb2 dataset, a collection of about a million video clips of more than six thousand celebrities extracted from various YouTube videos.
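For a sense of that dataset’s scale, the short sketch below tallies clips per speaker in a local copy of VoxCeleb2. It assumes the commonly used id*/<video_id>/<clip>.mp4 directory layout and a hypothetical local path; adjust the glob pattern if your copy is organised differently.

```python
# Hedged sketch: count VoxCeleb2 clips per speaker, assuming the
# common id*/<video_id>/<clip>.mp4 layout (adjust the pattern if needed).
from collections import Counter
from pathlib import Path


def clips_per_speaker(root: str) -> Counter:
    counts: Counter = Counter()
    for clip in Path(root).glob("id*/*/*.mp4"):
        counts[clip.parts[-3]] += 1  # the speaker's idXXXXX directory
    return counts


if __name__ == "__main__":
    counts = clips_per_speaker("VoxCeleb2/dev/mp4")  # hypothetical local path
    print(f"{len(counts)} speakers, {sum(counts.values())} clips")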

The model’s code will not be released because VASA-1 could be exploited. Especially in combination with a cloned voice, miscreants could use it to fake video meetings and, for example, try to extract money from victims. The danger of disinformation is also never far away.
