Hackers Can Use Gen AI to Manipulate Live Conversations
Criminals Could Steal Billions, Reroute Airplanes or Modify Live News
Hackers can use generative artificial intelligence and deepfake audio technology to hijack and manipulate live conversations, IBM security researchers demonstrated.
The researchers used the "surprising and scarily easy" audio-jacking technique to intercept a speaker's audio and replace snippets of an authentic voice with a deepfake. "Rather than using generative AI to create a fake voice for the entire conversation, which is relatively easy to detect, we discovered a way to intercept a live conversation and replace keywords based on the context," they said.
All the researchers needed to clone the voice was three seconds of audio.
They instructed a large language model to process audio from two sources in a live phone conversation and asked it to watch for specific keywords and phrases, in this case the phrase "bank account." When the model detected the phrase, it replaced the authentic bank account number with a fake one.
The LLM acted as a man in the middle, monitoring the live conversation. The researchers used speech-to-text to convert voice into text, which allowed the LLM to understand the context of the conversation. "It is akin to transforming the people in the conversation into dummy puppets, and due to the preservation of the original context, it is difficult to detect," they said.
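The interception step can be illustrated with a text-only sketch. It assumes speech has already been transcribed, and it swaps the researchers' LLM for a plain substring check, so it is a simplified stand-in rather than their implementation. The names TRIGGER, SUBSTITUTE_NUMBER and rewrite_utterance are hypothetical, as is the placeholder number.

```python
import re
from typing import Optional

# Hypothetical trigger phrase and placeholder value, for illustration only.
TRIGGER = "bank account"
SUBSTITUTE_NUMBER = "0000 0000"

# A run of digits, optionally separated by spaces or hyphens.
DIGITS = re.compile(r"\d[\d\s-]*\d")


def rewrite_utterance(transcript: str) -> Optional[str]:
    """Return a modified transcript, or None to pass the utterance through.

    A substring check stands in for the LLM's context detection here.
    """
    if TRIGGER not in transcript.lower():
        return None  # No trigger phrase: leave the conversation untouched.
    if not DIGITS.search(transcript):
        return None  # Trigger present but nothing to replace.
    return DIGITS.sub(SUBSTITUTE_NUMBER, transcript)


if __name__ == "__main__":
    for line in ["How was your weekend?", "My bank account is 1234 5678."]:
        print(repr(line), "->", repr(rewrite_utterance(line)))
```

Only utterances containing the trigger are altered, and everything else passes through unchanged, which is why the researchers say the preserved context makes the manipulation hard to detect.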
The threat goes beyond financial manipulation, in which hackers could trick victims into depositing billions into attacker-controlled accounts. The technique could also be used to censor information, instruct a pilot to modify air routes, and change content in live news broadcasts and political speeches in real time.
Developing the AI system to carry out the task posed little challenge, even if implementing an attack would require criminals to have social engineering and phishing skills, the researchers said.
Building the proof of concept "was surprisingly and scarily easy. We spent most of the time figuring out how to capture audio from the microphone and feed the audio to generative AI," they said.
The researchers did encounter some barriers that affected the persuasiveness of the attack. One was that the cloned voice needed to account for tone and speed to blend into authentic conversation.
Second, latency in the GPU caused a delay in the conversation, as the proof of concept needed to access the LLM and text-to-speech APIs remotely. The researchers addressed the issue by building in artificial pauses: they trained the model to use bridge phrases to fill any gaps created while it processed the manipulation.
"So while the PoC was activating upon hearing the keyword 'bank account' and pulling up the malicious bank account to insert into the conversation, the lag was covered with bridging phrases such as 'Sure, just give me a second to pull it up,'" the researchers said.
Hackers, they added, would need to have a "significant" amount of computing power locally available to make these attacks realistic and scalable.
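The bridging-phrase workaround the researchers described can be sketched as a simple decision rule. This is an illustration under stated assumptions, not their code: the 0.5-second silence threshold, the plan_playback function and its parameters are all hypothetical, and only the bridge phrase itself comes from the researchers' example.

```python
from typing import List

# Bridge phrase quoted by the researchers.
BRIDGE_PHRASE = "Sure, just give me a second to pull it up."


def plan_playback(
    modified_text: str,
    estimated_delay_s: float,
    max_silence_s: float = 0.5,  # Assumed threshold, not from the research.
) -> List[str]:
    """Return the phrases to speak, in order.

    If generating the modified audio is expected to take longer than a
    listener would tolerate in silence, a bridge phrase is spoken first.
    """
    if estimated_delay_s > max_silence_s:
        return [BRIDGE_PHRASE, modified_text]
    return [modified_text]


if __name__ == "__main__":
    print(plan_playback("The number is 0000 0000.", estimated_delay_s=2.0))
    print(plan_playback("The number is 0000 0000.", estimated_delay_s=0.1))
```

The sketch shows why remote API calls mattered: the longer the round trip, the more often a filler phrase is needed, which is consistent with the researchers' point that realistic attacks would need significant local computing power.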