Sound spatialisation

The impact of sound spatialisation in automatic speech recognition for the hearing impaired

 

Type of project: Exploratory research

Disability concerned: Hearing impairment

Topics: Autonomy, Communication

Status: Completed

This exploratory research aims to assess the usefulness of a digital display with spatialised Automatic Speech Recognition (ASR) for hearing-impaired people. We first attempted to develop a quick AR prototype, then implemented two web prototypes and tested them with actual users.

Hearing-impaired people are often left out of discussions, which severely impacts their social and professional lives. They could therefore benefit from Automatic Speech Recognition to help them follow meetings. Currently, however, ASR faces a few challenges, and the most common systems, such as those embedded in online meeting software, fail to offer significant benefits to hearing-impaired people.

  • The textual output is usually placed at the bottom of the screen, a poor location that prevents users from following the speakers’ facial expressions.
  • That output conveys no sense of where the participants are located: contributions are simply stacked on top of one another as people speak.
  • Finally, while remote meetings may benefit from ASR, similar technologies are much less commonly used in co-located meetings. Using ASR in that context is difficult, mostly because participants cannot be isolated and “spatialised” properly from a single input source.

Current research is working on spatialisation algorithms and, anticipating successful outcomes, we were interested in assessing their benefits ahead of time. We therefore designed and implemented a “wizard-of-oz” prototype, along with a within-subject test protocol, to evaluate hearing-impaired participants’ interest in this technology.

The interface itself is a Node.js application that relies on the browser’s built-in Speech Recognition libraries and WebSockets. It transcribes speech on the fly and proved accurate enough for us to use it in both English and French.
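
The prototype’s code is not reproduced here; the following is only a minimal sketch of the pattern just described, assuming the standard Web Speech API in the browser and the ws package on the Node.js server. All names, the port and the message shape are ours, and speaker attribution and display logic are left out.

    // Browser side: live transcription with the built-in Speech Recognition API,
    // relayed to the Node.js server over a WebSocket.
    const socket = new WebSocket('ws://localhost:8080');
    const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new Recognition();
    recognition.lang = 'fr-FR';          // also works with 'en-US'
    recognition.continuous = true;       // keep listening across utterances
    recognition.interimResults = true;   // stream partial results as people speak

    recognition.onresult = (event) => {
      for (let i = event.resultIndex; i < event.results.length; i++) {
        if (socket.readyState !== WebSocket.OPEN) continue;
        socket.send(JSON.stringify({
          speaker: 'participant-1',               // how speakers are attributed is outside this sketch
          text: event.results[i][0].transcript,
          final: event.results[i].isFinal,
        }));
      }
    };
    recognition.start();

    // Node.js side: broadcast each transcript to every connected display,
    // using the ws package (npm install ws).
    const { WebSocket, WebSocketServer } = require('ws');
    const wss = new WebSocketServer({ port: 8080 });

    wss.on('connection', (client) => {
      client.on('message', (data) => {
        // Forward the transcript to all connected displays (tablet, smartphone…).
        for (const peer of wss.clients) {
          if (peer.readyState === WebSocket.OPEN) peer.send(data.toString());
        }
      });
    });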

The test protocol relies on a single independent variable: the device on which the interface is displayed. While we intended to display the interface on AR glasses, we had to settle for a tablet-versus-smartphone comparison. We ran two test sessions with 5 users. The study yielded mostly positive feedback and highlighted interesting trends and avenues we had not considered. However, the small number of participants, together with further variations introduced during the tests (language, smartphone screen real estate, etc.), prevents us from providing proper quantitative results.

Conducting such research within a web agency is difficult, as most employees do not have enough leeway to accommodate the constraints of research work. As an example, recruiting participants was much harder than anticipated, and we had to scale down our expectations.

Implementing a working interface was quick. We see two issues with Speech Recognition APIs: (1) they are provided by companies whose ethics raise concerns, while open-source technologies still lag significantly behind; (2) they usually fail to recognise technical terms and proper names. The NReal glasses proved challenging: we could not afford sufficient learning time and ultimately failed to use them. Devices backed by more robust companies might offer a smoother learning curve.

The exchanges between partners offered invaluable insights. Three examples:

  • Visible devices might increase the feeling of social stigma and be discarded by their users.
  • Our interface seemed more engaging than some specialised applications (Ava was mentioned).
  • Reading facial expressions is a must for hearing-impaired people. The larger screen real estate of the tablet was needed, yet the smartphone made it easier to keep the speakers in view.

We would like to invest more time in projecting our interface directly onto a set of AR glasses, as initially planned, since we believe this could solve both the smartphone’s screen real estate issue and the loss of focus caused by the tablet. The reliance on Microsoft and Google for speech recognition is also a hurdle we would like to overcome by furthering our exploration of open-source alternatives. Our partner from SignX also mentioned that using generative AIs to summarise live speech might be very handy, and we would like to experiment on that point, using open-source technologies here too whenever possible. From a research perspective, fine-tuning our protocol (distinguishing between deaf and partially deaf participants, ensuring the balance of both phases, etc.) and finding a sufficient number of participants would offer a clearer understanding of their needs. That understanding would help everyone, from private companies to academic researchers.
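
To give an idea of what moving away from the browser’s cloud-backed recogniser could look like, here is a hedged sketch of streaming the raw microphone audio to a self-hosted, open-source ASR service (for instance a Whisper-based server). The endpoint, port and payload shape are assumptions for illustration only, not an existing service of ours.

    // Browser side: capture microphone audio and ship small chunks to a
    // hypothetical self-hosted transcription service over a WebSocket.
    const socket = new WebSocket('ws://localhost:9000/transcribe'); // hypothetical endpoint

    async function startStreaming() {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });

      // Send a small audio chunk every 250 ms; the server is assumed to answer
      // with transcript messages the display code can consume as before.
      recorder.ondataavailable = (event) => {
        if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(event.data);
        }
      };

      socket.onmessage = (message) => {
        const { speaker, text } = JSON.parse(message.data); // assumed payload shape
        console.log(`${speaker}: ${text}`);
      };

      socket.addEventListener('open', () => recorder.start(250));
    }

    startStreaming();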

[Illustration: a group of people with coloured speech bubbles above their heads.]