Side Eye AI: Extracting Audio from Silent Images and Videos

In a remarkable fusion of science fiction and reality, scientist Kevin Fu and his team at Northeastern University have brought the concept of extracting audio from static images to life through the power of artificial intelligence (AI).

Their groundbreaking creation, known as Side Eye, is a machine learning tool that pushes the boundaries of image analysis. From a single still image, Side Eye can infer the gender of a speaker in the room where the photo was taken, transcribe spoken words, and even help pinpoint where the image was captured. Remarkably, the tool can also be applied to muted videos, unlocking a world of hidden audio content.

Fu illustrates Side Eye’s potential with a scenario: someone records a TikTok video, mutes it, and adds music. The question arises: what were they actually saying? Side Eye can uncover these hidden dialogues and shed light on off-camera conversations, adding a new dimension to multimedia analysis.

The magic behind Side Eye lies in its ability to exploit the optical image stabilization hardware found in most smartphone cameras. These cameras suspend the lens on springs in liquid to steady the image, compensating for the photographer’s shaky hand.

When someone speaks near the camera while a photo or video is captured, the sound subtly vibrates these springs, minutely altering the path of light onto the sensor. Side Eye recovers audio frequencies from those vibrations by exploiting the rolling shutter used in most camera sensors: because each row of pixels is exposed sequentially rather than all at once, a single frame effectively samples the lens’s motion once per row, yielding an effective sampling rate high enough to carry audio-band signals.
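To make the rolling-shutter idea concrete, here is a minimal, purely illustrative sketch (the real Side Eye pipeline is not public, and the function names and numbers below are assumptions): if each frame exposes its rows one after another, the per-row image shifts caused by a vibrating lens form a time series sampled at roughly rows-per-frame times the frame rate, from which audio-band tones can be recovered with a Fourier transform.

```python
import numpy as np

# Illustrative sketch only -- NOT the actual Side Eye method.
# Assumption: per-row horizontal shifts have already been measured from
# each frame; rows read out sequentially give an effective sample rate
# of rows_per_frame * frame_rate Hz.

def rows_to_signal(per_frame_row_shifts):
    """Flatten per-frame, per-row shift measurements into one 1-D trace."""
    return np.concatenate(per_frame_row_shifts)

def dominant_frequency(signal, sample_rate):
    """Return the strongest frequency component of the recovered trace."""
    signal = signal - signal.mean()              # drop the DC offset
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Synthetic demo: a 440 Hz tone shaking the lens, 30 fps, 1080 rows/frame.
frame_rate, rows_per_frame = 30, 1080
sample_rate = frame_rate * rows_per_frame        # effective row-sampling rate
t = np.arange(sample_rate) / sample_rate         # one second of row samples
shifts = 0.05 * np.sin(2 * np.pi * 440 * t)      # sub-pixel row displacements
frames = shifts.reshape(frame_rate, rows_per_frame)

print(dominant_frequency(rows_to_signal(frames), sample_rate))  # ≈ 440.0
```

Even at a modest 30 frames per second, the row-by-row readout here yields an effective sampling rate of 32,400 Hz, which is why audio frequencies well into the range of human speech become recoverable.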

This innovative technology, however, comes with both promise and peril. While Side Eye is still in its early stages and requires extensive training data to reach its full potential, its misuse could pose significant cybersecurity threats.

On the flip side, in the right hands and with further advancements, Side Eye could serve as a powerful digital tool for law enforcement agencies, aiding in crime investigations by providing valuable digital evidence. As this AI-driven tool evolves, it opens up new possibilities and challenges, reshaping the way we analyze and interpret visual content.
