Moving Sonar Onto the Body: Cornell’s AI Eyeglasses Read Silent Speech
New research from Cornell integrates a silent speech interface (SSI) into a pair of smart eyeglasses.
Wearable devices rely on human-machine interfaces that assume standard physical capabilities such as speech, touch, or motion. While this form of interaction suits most consumers, people with disabilities may find it difficult or impossible to operate standard wearables.
To make wearables that accommodate more people, researchers are investigating new human-machine interfaces.
Doctoral student Ruidong Zhang wears the EchoSpeech eyeglasses. Image courtesy of Cornell Chronicle
This week, a team from Cornell University published a paper describing a pair of smart glasses equipped with a silent speech interface (SSI) for users who cannot vocalize sound. In this article, we’ll discuss silent speech interfaces and the accessible wearable prototype from Cornell.
What Are Silent Speech Interfaces (SSIs)?
Silent speech interfaces allow people to interact with machines without vocalizing words. While technologies like AI assistants (e.g., Apple’s Siri) rely on audible speech, SSIs instead interpret communication through speech-related movements.
Example of an SSI consisting of multiple different sensor inputs. Image courtesy of SpringerBriefs
SSI technology recognizes speech through the movements of the mouth and tongue rather than sound. To do this, SSIs rely on a variety of sensors, including vibration sensors placed near the mouth that pick up articulation-related vibrations, as well as cameras that track and classify speech-related movements. In many cases, this information is then processed by a machine learning algorithm that interprets the mouth movements and translates them into words.
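To make that pipeline concrete, here is a minimal, purely illustrative sketch of how windows of sensor data might be reduced to features and classified into a small command vocabulary. The sensor layout, feature choices, vocabulary, and classifier below are assumptions for demonstration, not any specific SSI's implementation; NumPy and scikit-learn are used only for brevity.

```python
# Minimal sketch of a generic SSI pipeline: sensor frames -> features -> word label.
# All shapes, labels, and the classifier are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

N_FRAMES, N_CHANNELS = 50, 4           # e.g., 50 time steps from 4 motion/vibration sensors
VOCAB = ["yes", "no", "next", "stop"]  # tiny command vocabulary for illustration

def extract_features(window: np.ndarray) -> np.ndarray:
    """Summarize a (N_FRAMES, N_CHANNELS) sensor window with simple per-channel statistics."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0), np.ptp(window, axis=0)])

# Synthetic stand-in for labeled training windows captured while a user mouths commands.
rng = np.random.default_rng(0)
X = np.stack([extract_features(rng.normal(size=(N_FRAMES, N_CHANNELS))) for _ in range(200)])
y = rng.integers(len(VOCAB), size=200)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a new, unseen window of sensor data.
new_window = rng.normal(size=(N_FRAMES, N_CHANNELS))
print("predicted word:", VOCAB[clf.predict(extract_features(new_window)[None, :])[0]])
```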
While most of the population will likely not find a use for SSIs, the technology is essential for people who have lost their voice due to illness or injury, allowing them to communicate more easily. For example, medical patients with vocal cord damage or neurological disorders that affect speech could benefit greatly from SSIs.
Cornell Develops Camera-free SSI Eyeglasses
This week, researchers from Cornell made major strides in SSI technology with the creation of SSI-based smart eyeglasses.
The system, dubbed EchoSpeech, is a novel, minimally intrusive SSI that uses low-power active acoustic sensing to capture the subtle skin deformations caused by silent speech and convert them into actionable data. The smart eyeglass prototype builds on Cornell's previous research on a similar acoustic-sensing wearable ("EarIO") that tracked facial movements from within the ear.
A user setup for EchoSpeech. Image courtesy of Zhang et al.
The system relies on a set of speakers and microphones mounted on the frame of a pair of glasses to emit inaudible sound waves toward the skin. The emitted waves produce echoes that travel along multiple paths, and the system interprets these echo patterns to infer the wearer's silent speech. EchoSpeech runs entirely on a standard smartphone, requires only one to six minutes of training data, and operates in real time while consuming just 73.3 mW. The team's in-house deep learning algorithm analyzes the resulting echo profiles with approximately 95% accuracy.
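The core idea behind this style of active acoustic sensing can be sketched as follows: correlate each received microphone frame with the known transmitted chirp to obtain an echo profile, then track how that profile changes from frame to frame as the skin deforms. The sample rate, chirp band, frame length, and signal model below are illustrative assumptions, not EchoSpeech's actual parameters.

```python
# Minimal sketch of active acoustic sensing: cross-correlate the received microphone
# signal with the known transmitted chirp to form an "echo profile" per frame.
import numpy as np

FS = 50_000            # assumed sample rate (Hz), illustrative only
FRAME = 600            # assumed samples per sensing frame
t = np.arange(FRAME) / FS
# Linear frequency sweep in an assumed near-inaudible band (18-22 kHz).
chirp = np.sin(2 * np.pi * (18_000 + (22_000 - 18_000) * t / t[-1] / 2) * t)

def echo_profile(received_frame: np.ndarray) -> np.ndarray:
    """Correlate one received frame with the transmitted chirp.
    Peaks correspond to echo paths; their shifts over time track skin deformation."""
    return np.correlate(received_frame, chirp, mode="same")

# Fake received frames: attenuated, slightly delayed copies of the chirp plus noise.
rng = np.random.default_rng(1)
frames = [0.3 * np.roll(chirp, d) + 0.01 * rng.normal(size=FRAME) for d in (40, 42)]
profiles = np.stack([echo_profile(f) for f in frames])

# A downstream model (EchoSpeech uses a deep network) would consume the frame-to-frame
# differences of these profiles, which encode how the skin moved between frames.
diff = np.diff(profiles, axis=0)
print(diff.shape)  # (1, FRAME): one differential echo profile between consecutive frames
```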
The system was evaluated in a user study with 12 participants and successfully recognized 31 isolated commands and connected digit strings of three to six digits, with word error rates (WER) of 4.5% (std 3.5%) and 6.1% (std 4.2%), respectively. The system's robustness was also tested in scenarios including walking and noise injection.
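For reference, word error rate is the word-level edit distance between the recognized output and the reference phrase, divided by the length of the reference. A minimal sketch of the metric (not tied to the study's evaluation code):

```python
# Word error rate (WER): word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of six in a connected-digit string gives a WER of ~16.7%.
print(wer("three one four one five nine", "three one four one nine"))
```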
More Private, Low Power, and Accessible
Most SSI technology uses facial cameras that gather data from both the user and anyone the user is communicating with. In addition to raising privacy concerns, wearable cameras collect high-bandwidth video data.
Because EchoSpeech removes the need for wearable video cameras, the device captures only audio data, which requires far less bandwidth than image or video data and can be streamed to a mobile phone via Bluetooth in real time. Private information never leaves the user's control because the data is processed locally on the smartphone rather than in the cloud. Audio-only sensing is also more battery efficient, the researchers say: acoustic sensing yields about 10 hours of operation, compared with roughly 30 minutes for a camera.
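As a rough illustration of why an audio-only sensor is so much lighter on bandwidth than a camera, here is a back-of-envelope comparison using assumed sample rates and frame sizes; these figures are illustrative only and are not reported by the researchers.

```python
# Illustrative raw-data-rate comparison (assumed parameters, not the paper's figures):
# one 16-bit acoustic channel at 50 kHz versus an uncompressed small camera stream.
audio_rate_kbps = 50_000 * 16 / 1000                  # 50 kHz * 16 bits = 800 kbps per mic
video_rate_kbps = 640 * 480 * 1.5 * 8 * 30 / 1000     # 640x480, 12 bits/px (YUV420), 30 fps
print(f"audio ~= {audio_rate_kbps:.0f} kbps, raw video ~= {video_rate_kbps:.0f} kbps "
      f"({video_rate_kbps / audio_rate_kbps:.0f}x more)")
```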
The Cornell team sees EchoSpeech benefiting a number of use cases, from silently mouthing a passcode to unlock a smartphone to skipping songs on a playlist. The device could also be paired with a smartphone to communicate with others in places where speaking aloud is inconvenient or impolite, like a loud restaurant or a quiet library. The researchers say the interface could additionally be paired with a stylus and used with design software such as CAD, eliminating the need for a keyboard and mouse.