The biggest challenges in the design of speech and audio interfaces
A frequently repeated urban legend goes like this: someone talks about an arbitrary topic, say cars, and minutes later they see a car advertisement on Facebook. Or you mention a “Greg” and seconds later an “Add friend” suggestion appears for someone named Greg. This highlights a problem in the design of speech and audio user-interfaces: how does the user know whether an app is listening?
Let us break this analysis into three parts: psychological effects, technological feasibility, and finally the user-interface design challenge.
From a psychological perspective, much of this experience can be explained by coincidence. From your basic information and recent posts, Facebook can narrow down your most likely interests with scary accuracy. If it then selects ads from among those interests, most of the time you will not give them a second look. But that one time, that one-in-a-thousand, when the ad matches what you have just been talking about, you notice it. It is striking because it seems unlikely. And it is unlikely: say, a 0.1% chance. But given that you scroll past a thousand ads, it is actually quite likely that every now and then you will find a surprising match. We are simply not very good at recognizing such coincidences. Instead, we are wired to find causal relations even in random occurrences. This is thus clearly a psychological bias or fallacy.
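The arithmetic of this coincidence is easy to check. Here is a minimal sketch, assuming a 0.1% chance per ad and a thousand ad impressions (both numbers are illustrative, not measured):

```python
# Chance that at least one of n independent ads matches a recent
# conversation topic, when each ad matches with small probability p.
p = 0.001   # assumed chance a single ad matches what you just said
n = 1000    # assumed number of ads you scroll past

# P(no match at all) = (1 - p)^n, so the complement gives:
p_at_least_one = 1 - (1 - p) ** n
print(f"P(at least one surprising match) = {p_at_least_one:.3f}")  # ≈ 0.632
```

So even a one-in-a-thousand event becomes more likely than not over a thousand exposures.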
Let us then consider the feasibility of an app, like the Facebook app, listening to you in order to serve advertisements. Facebook certainly has strong incentives to serve better-targeted ads, and it does not have a good track record in protecting its users’ privacy. Such listening would most likely breach the European GDPR privacy regulation, but the effectiveness of GDPR enforcement has not yet been confirmed. It therefore seems possible that Facebook would risk breaking the law, given its less-than-perfect track record.
The features or modules which an app would need in order to support such listening functionality include at least voice activity detection (VAD), speech recognition and topic modelling, as well as communication of the results to the cloud. The purpose of voice activity detection is to determine whether the microphone signal contains speech or not. This module would run all the time, so it must be a very simple function that does not require much battery power. The idea is that by doing a simple task all the time at low power, we can avoid doing the complex and power-hungry task of speech recognition most of the time. The second module, speech recognition, would then convert the recorded speech into text. High-quality speech recognition is computationally expensive and drains the battery. In this application, however, we do not need particularly high quality, since for topic modelling it is sufficient to pick up a few words here and there. For example, if the recognizer hears “bla-bla bla-bla, Toyota bla-bla Ford bla-bla”, then it can be fairly sure that the topic is “cars”.
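To make the first module concrete, here is a minimal energy-based VAD sketch. Real VADs also use spectral features and smoothing over time; the threshold and frame size below are assumptions for illustration only:

```python
import numpy as np

def simple_vad(frame, threshold=1e-3):
    """Energy-based voice activity detection: a frame counts as
    'speech-like' if its mean energy exceeds a threshold.
    This is the minimal idea, cheap enough to run on every frame."""
    energy = np.mean(frame.astype(float) ** 2)
    return energy > threshold

# 20 ms frames at 16 kHz = 320 samples per frame
silence = np.zeros(320)
speech_like = 0.1 * np.sin(2 * np.pi * 200 * np.arange(320) / 16000)

print(simple_vad(silence))      # False
print(simple_vad(speech_like))  # True
```

Note that the per-frame work is just a multiply-accumulate over the samples and one comparison, which is why it can run continuously without draining the battery.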
The question is then whether such modules would consume a noticeable amount of power. It is hard to give a number which would be meaningful to a reader. Instead, consider a simpler functionality which is present in most phones: the personal digital assistant. When activated and configured on an Android phone, you can invoke the assistant by saying “Hey Google!”. This is similar to the listening functionality described above in the sense that it uses voice activity detection as its first module. The VAD runs at low power and does not present a problem. When speech is present, the phone then runs a keyword spotter to detect the phrase “Hey Google”. This can be done with relatively low accuracy, because missing the keyword once is not too bad: the user just has to repeat “Hey Google”. All of this can be implemented on a phone with a relatively small penalty in power consumption. You will, however, notice some impact on battery life, because the keyword spotter does require CPU power.
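The power argument can be illustrated with a toy cost model. The per-frame costs and the fraction of frames containing speech below are invented numbers, purely to show why gating the expensive stage behind a cheap always-on VAD saves power:

```python
import random

# Toy model: a cheap always-on VAD gates a costly keyword spotter.
VAD_COST, SPOTTER_COST = 1, 100   # arbitrary cost units per frame
SPEECH_FRACTION = 0.1             # assume speech in 10% of frames

random.seed(0)
frames = [random.random() < SPEECH_FRACTION for _ in range(10_000)]

# Gated: VAD runs always, the spotter only when the VAD fires.
gated = sum(VAD_COST + (SPOTTER_COST if speech else 0) for speech in frames)
# Ungated: the spotter runs on every frame.
ungated = len(frames) * SPOTTER_COST

print(f"gated cost:   {gated}")    # roughly a tenth of the ungated cost
print(f"ungated cost: {ungated}")
```

With these assumed numbers the gated pipeline costs roughly a tenth of running the spotter everywhere, which is the whole point of the two-stage design.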
The big difference from the above listening functionality is that the keyword is known in advance, whereas in topic modelling the number of words you would have to recognize can be very large: how many different product categories can you think of? Roughly speaking, you need a keyword spotter for every category. This is a much more difficult problem, and correspondingly the impact on battery drain would be significant. That would hardly go unnoticed. Alternatively, the app could send the audio to the cloud for processing to reduce the CPU load. That would amount to transmitting all speech continuously to the cloud, which would be blatantly illegal, but more importantly, encoding and transmitting speech continuously requires both bandwidth and CPU power. Again, this would hardly go unnoticed.
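As a toy illustration of the vocabulary problem: a wake-word detector needs exactly one phrase, while a topic spotter needs keyword lists for every category it should recognize. All category names and keywords below are made up:

```python
# A wake-word detector matches a single fixed phrase.
WAKE_WORDS = {"hey google"}

# A topic spotter needs keywords per category, and the number of
# categories (and thus keywords) grows with every product vertical.
TOPIC_KEYWORDS = {
    "cars":   {"toyota", "ford", "sedan", "mileage"},
    "travel": {"flight", "hotel", "visa", "itinerary"},
    # ... in practice, thousands more categories
}

def spot_topics(words):
    """Return every category whose keyword set intersects the words."""
    words = {w.lower() for w in words}
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if kws & words}

print(spot_topics(["bla", "Toyota", "bla", "Ford"]))  # {'cars'}
```

The lookup itself is cheap; the expensive part, which this sketch glosses over, is that the speech recognizer feeding it must reliably produce thousands of distinct words instead of one fixed phrase.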
To me, the most plausible explanation is thus that this is entirely a psychological fallacy: we find causation in coincidences. Though Facebook would benefit from listening, and though it has a poor track record in protecting privacy, it is technologically infeasible to implement such listening without being noticed. A competent engineer or researcher would certainly discover such abuse.
Why, then, is the “Facebook is listening” legend so pervasive and persistent? I think the root cause is bad user-interface design. User-interfaces do not have a way of reporting to the user that the microphone is recording, so to the user it sounds plausible that an app could be recording. The microphone could be recording even when the device is in your pocket or somewhere else in the room. Unlike tactile interfaces such as keyboards, touch-screens and push buttons, microphones can operate at a distance, even unintentionally. It is therefore more important than with tactile interfaces that microphones report to the user when they are operating.
There is a parallel between the user-interfaces of microphones and cameras: they are auditory and visual recorders, respectively. Digital cameras, by convention, play a sound-effect imitating the shutter of an SLR camera, which in Japan is even mandatory through industry self-regulation. The difference is, however, that microphones operate continuously whereas cameras take snapshots, so a single ping from the microphone would hardly be sufficient. Video cameras, in comparison, conventionally feature a red light when recording.
The user-interface design task is then to give feedback about an active microphone: the user needs to know whether a microphone is recording or not. Like the red light indicating that a webcam is recording, or a mailbox with its flag up, we need a signal that the microphone is active. A visual indicator is probably not enough, since a microphone can record even when the device is hidden. The only solution I can think of is an acoustic indicator; a continuous feed of ambient sound, perhaps? Something audible but non-intrusive. I do not know whether that is possible. Or perhaps it is trivial. But I do think that such feedback would be a considerable improvement to devices with speech interfaces, and therefore it sits at the top of my list of unsolved problems in the design of speech user-interfaces.