Sensory's TrulyHandsfree software, which InsideDSP last covered at its v3 introduction in early 2013, precedes limited-vocabulary speech recognition with voice detection involving a specific key word or phrase. With the latest version, 4.0, Sensory adopts convolutional (i.e. "deep learning") neural network (CNN) techniques. Jeff Bier began a recent editorial with the following statement:
Lately, neural network algorithms have been gaining prominence in computer vision and other fields where there's a need to extract insights based on ambiguous data.
"Other fields," as you may already realize, include speech recognition. Company CEO Todd Mozer was quick to point out, during a recent product briefing, that conventional neural network processing has been at the core of the company's DNA since its founding 21 years ago. Deep learning techniques, however, are a more recent innovation. When asked to differentiate them from prior approaches, Mozer noted that there wasn't a black-vs-white distinction. Instead, he suggested, they were more of a "shades of gray" evolution: "Deep learning approaches involve more data, bigger models, and more layers in the neural network."
Mozer notes that TrulyHandsfree's bailiwick is vocabularies of 50-100 words at most, and more typically a 10-20 word command set. Beyond this point, the company's fuller-featured TrulyNatural product is the preferable approach. TrulyNatural and Sensory's newest TrulySecure voice-plus-face recognition biometric product have both implemented CNN algorithms from the beginning. The company initially thought that CNNs' processing and memory requirements would be excessive for TrulyHandsfree's target applications, but a re-architecting of the approach between the v3 and v4 releases resulted in a more modular algorithm that demands as little as a 1 MByte memory footprint.
Regarding the additional processing burden of the CNN approach versus prior algorithm techniques, Mozer answered in terms of power consumption (an increase of 10-20%) rather than MIPS. His response also assumed that a "tiny" DSP was used only for front-end keyword triggering, with the bulk of the speech recognition task tackled by a "back-end" application processor. When asked whether a general-purpose CPU or DSP was up to the CNN task, versus a more tailored processor such as Synopsys' recently announced core, Mozer stated that this depends on the size of the vocabulary. Any processor can handle deep learning functions for a simple trigger, he suggested, if it can access enough memory from both capacity and bandwidth standpoints. However, as vocabulary size and/or accuracy expectations grow, CNN-tailored processors become increasingly attractive options.
Regarding accuracy, Sensory's announcement claims that "internal testing shows the new features of TrulyHandsfree offer a 60-80% decrease in word error rate compared to the previous version of TrulyHandsfree." Roughly 80% of that improvement, according to Mozer, comes from the use of CNN algorithms. The remainder is derived from three primary factors:
- Spotting techniques which ensure that the core part of the user's spoken request can be recognized in the middle of speech or when surrounded by ambient noise;
- Reverb and echo cancellation improvements that better handle harsh acoustic environments, with no performance downside in quiet environments; and
- Filterbank features, as an alternative to traditional Mel Frequency Cepstral Coefficients (MFCC) for front-end speech feature extraction (Figure 1).
Figure 1. The filterbank techniques optionally employed in Sensory's latest-generation v4 TrulyHandsfree voice detection and speech recognition algorithms improve accuracy versus traditional Mel Frequency Cepstral Coefficients (MFCC), at the expense of a larger required memory footprint.
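To make the accuracy claim above concrete, the following minimal sketch works through the arithmetic. It assumes that the "roughly 80%" CNN share applies proportionally to the overall 60-80% relative reduction, which is one plausible reading of Mozer's comment rather than a breakdown Sensory has published. The function names and the 10% baseline word error rate are hypothetical, chosen only for illustration.

```python
def split_wer_reduction(total_reduction, cnn_share=0.8):
    """Split a relative WER reduction into a CNN part and an 'everything else' part.

    Assumes the contributions combine proportionally (an illustrative assumption).
    """
    cnn_part = total_reduction * cnn_share
    return cnn_part, total_reduction - cnn_part


def wer_after(baseline_wer, total_reduction):
    """Word error rate remaining after a relative reduction is applied."""
    return baseline_wer * (1.0 - total_reduction)


BASELINE_WER = 0.10  # hypothetical v3 error rate, not a Sensory figure

for total in (0.60, 0.80):  # the two ends of the quoted 60-80% range
    cnn_part, other_part = split_wer_reduction(total)
    print(
        f"total reduction {total:.0%}: "
        f"CNN about {cnn_part:.0%}, other factors about {other_part:.0%}, "
        f"WER {BASELINE_WER:.0%} -> {wer_after(BASELINE_WER, total):.1%}"
    )
```

Under these assumptions, the CNN change alone would account for roughly a 48-64% relative reduction in word error rate, with the three factors listed above sharing the remaining 12-16%.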
Mozer notes that all speech recognizers begin by digitizing an analog signal and then performing feature extraction on it (i.e. translating audio information into features that can be subsequently analyzed). Earlier versions of TrulyHandsfree have exclusively used MFCCs for feature extraction; in v4, filterbanks are used instead of or in addition to MFCCs. MFCCs have the smaller memory footprint of the two alternatives, while filterbank features can be more accurate. The filterbank approach particularly makes sense when the TrulyHandsfree algorithm is followed by TrulyNatural, which is filterbank-only in its implementation.
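The relationship between the two feature types can be illustrated with the generic textbook pipeline, sketched below with NumPy. This is not Sensory's implementation. The 16 kHz sample rate, 25 ms frames, 10 ms hop, 40 filters, and 13 cepstral coefficients are conventional values assumed for illustration, and all function names are hypothetical. The sketch shows that MFCCs are a truncated cosine transform of the log filterbank energies, which is why they are more compact but discard some spectral detail.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank


def frame_signal(signal, frame_len, hop):
    """Slice the signal into overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


def log_filterbank_features(signal, sample_rate=16000, n_fft=512, n_filters=40):
    """Log mel filterbank energies, one row of n_filters values per frame."""
    frames = frame_signal(signal, frame_len=400, hop=160)  # 25 ms / 10 ms at 16 kHz
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(energies + 1e-10)  # small offset avoids log(0)


def mfcc_from_filterbank(log_fbank, n_ceps=13):
    """MFCCs as the first n_ceps terms of an unnormalized DCT-II of the log energies."""
    n_filters = log_fbank.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return log_fbank @ basis.T


# Synthetic one-second test signal: a 440 Hz tone plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
audio = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(t.size)

fbank = log_filterbank_features(audio)
mfcc = mfcc_from_filterbank(fbank)
print("filterbank features per frame:", fbank.shape[1])
print("MFCC features per frame:", mfcc.shape[1])
```

With these assumed settings each frame carries 40 filterbank values versus 13 MFCCs, which is consistent with the memory-versus-accuracy tradeoff Mozer describes, although the actual dimensions used in TrulyHandsfree are not stated.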
Mozer declined to provide license and royalty fee specifics, aside from noting that the latest generation product is being sold at the same price as prior-generation TrulyHandsfree offerings. TrulyHandsfree v4 is now available for licensing; SDKs and language development tools are also available to Sensory's customers. However, Mozer indicated that no off-the-shelf tools were currently available to assist potential customers in evaluating the tradeoffs of various processor types and memory capacities versus accuracy results and vocabulary set sizes, including whether CNNs (versus traditional algorithms) and filterbanks (versus MFCCs) are necessary. Instead, these tradeoffs are typically explored via interaction between customers and Sensory's engineers.