Sense Home Energy Monitor
electrical signal dishwasher

How is Speech Recognition Similar to Disaggregation?

Many members of the Sense team come from a speech recognition background, and that field is not as different from energy monitoring as you might think.

Several members of the Sense team have a background in speech processing, and we often talk about the similarities between automatic speech recognition and electrical load disaggregation. Those similarities are real and interesting, but the differences between the two problems are perhaps even more illuminating.

Let’s start with the similarities. Most obviously, both automatic speech recognition and load disaggregation deal with time signals, which we analyze for cues about the sources that generated them: the words that were spoken, in the case of speech recognition, or the appliances that turned on and off, in the case of load disaggregation. Since the sources are physical systems in both cases (humans and electrical devices, respectively), similar mathematical tools are useful for analyzing the signals: spectral features, hidden Markov models, deep neural networks, and so forth.

speech recognition spectrogram
A spectrogram used in speech recognition. Compare it to the first photo in this post, which is a spectrogram used for device detection that shows the energy signature of a washing machine.
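
To make "spectral features" a little more concrete, here is a minimal sketch of how a spectrogram like the ones pictured can be computed: cut the signal into overlapping frames, window each frame, and take the magnitude of its Fourier transform. This is an illustration only, not the processing Sense actually runs. The function names, the frame length, the hop size and the 60 Hz test tone sampled at 960 Hz are all assumptions chosen for the example, and the direct DFT is used for readability rather than speed.

```python
import cmath
import math


def frame_spectrum(frame):
    """Magnitude spectrum of one frame via a direct DFT (O(n^2), fine for a sketch)."""
    n = len(frame)
    # A Hann window reduces the leakage caused by cutting the signal into frames.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=64, hop=32):
    """One magnitude spectrum per overlapping frame (time runs along the outer list)."""
    return [
        frame_spectrum(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Hypothetical test signal: a 60 Hz tone sampled at 960 Hz (values are assumptions).
SAMPLE_RATE = 960
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 60 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)
# Find the strongest frequency bin in the first frame and convert it to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

Each row of `spec` is one column of the picture: plotting the rows side by side, with magnitude mapped to color, gives the familiar spectrogram image, whether the input is a microphone signal or a current waveform.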

On a deeper level, both signal classes reflect underlying hierarchies: speech consists of phonemes, which combine to form words, which are parts of phrases and sentences. Similarly, elements such as pump motors, heating elements, lights and electronic control systems are the components which together make up electrical devices, which also combine in “meaningful” patterns: your washer and dryer are likely to be used in sequence, and if you enjoy warm snacks while watching TV, your microwave and TV set will also tend to co-occur (and at particular times of day and night).

However, load disaggregation is very different from automatic speech recognition, as Mike carefully warns incoming speech scientists (and as our analyses keep confirming!). For one thing, the time and power scales for load disaggregation are much wider than for automatic speech recognition. A phoneme typically spans tens to hundreds of milliseconds, and most words last between a few tenths of a second and a few seconds. In contrast, the cues for an electrical device can occur in a fraction of a millisecond (e.g. the high-frequency noise of a switching power supply) or take several minutes to unfold: the big heating elements of an electric dryer, for example, have a very typical power pattern over 10 minutes or more, and accurately distinguishing between different light sources may require tracking an entire 24-hour cycle. Also, electrical devices tend to produce near-perfect replicas of their signatures each time they are used in a particular mode, whereas spoken sounds are quite variable, even when spoken by a single person.
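
The repeatability point suggests one simple idea: if a device really does replay the same power pattern, a stored copy of that pattern can be slid along the signal to find where it fits best. The sketch below shows that idea in its most basic form. It is a toy under stated assumptions, not a description of how Sense detects devices: `best_match` is a hypothetical helper, the wattages are made up, and the signal contains no noise and no other devices.

```python
def best_match(signal, template):
    """Return (offset, error) where the template fits the signal with least mean squared error."""
    best_offset, best_error = 0, float("inf")
    for offset in range(len(signal) - len(template) + 1):
        window = signal[offset:offset + len(template)]
        error = sum((s - t) ** 2 for s, t in zip(window, template)) / len(template)
        # Strict comparison keeps the earliest offset when several fit equally well.
        if error < best_error:
            best_offset, best_error = offset, error
    return best_offset, best_error


# Hypothetical stored signature: off, a start-up spike, then a steady draw (watts).
template = [0.0, 900.0, 600.0, 500.0, 500.0]

# A later recording in which the same device switches on at sample 7.
signal = [0.0] * 7 + [900.0, 600.0, 500.0, 500.0, 500.0, 500.0] + [0.0] * 4

offset, error = best_match(signal, template)
```

An exact replay gives an error of zero at the right offset. The same approach would work poorly on speech, where two utterances of the same word never line up this neatly.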

Another crucial difference is that electrical devices, by their very nature, overlap in typical usage. A load disaggregation algorithm that fails when you use your kettle and toaster oven simultaneously is therefore not very useful. In automatic speech recognition, in contrast, we almost never try to recognize overlapping speech.
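
The kettle-and-toaster case can be sketched in a few lines. The power measured at the meter is the sum of whatever is running, so one classic way to handle overlap is to look at the size of each step change rather than at the absolute level. The example below assumes idealized rectangular loads sampled once per second; the wattages, the threshold and the function names are illustrative assumptions, and real devices are rarely this tidy.

```python
def aggregate(*loads):
    """Total power at the meter: the sample-by-sample sum of the individual loads."""
    return [sum(samples) for samples in zip(*loads)]


def step_events(power, threshold=100.0):
    """Return (index, delta_watts) for each sample-to-sample jump of at least threshold."""
    events = []
    for i in range(1, len(power)):
        delta = power[i] - power[i - 1]
        if abs(delta) >= threshold:
            events.append((i, delta))
    return events


# Hypothetical loads, one sample per second (wattages are illustrative).
kettle = [1500.0 if 2 <= t < 8 else 0.0 for t in range(12)]
toaster = [1200.0 if 5 <= t < 10 else 0.0 for t in range(12)]

total = aggregate(kettle, toaster)
events = step_events(total)
```

Even while both devices are on, the +1200 W step at sample 5 and the -1500 W step at sample 8 still point to the toaster and the kettle respectively, because each step reflects only the device that changed state.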

These similarities and differences make load disaggregation a particularly exciting area of research: one needs to draw on the excellent research that has taken place in speech processing over the past five decades, while also being creative with innovative algorithms that account for the specifics of electrical devices. The problem is not entirely solved, and progress is rapid as well as rewarding!