How Do I Apply Voice Biometrics?
There are many variables to consider when integrating voice biometric technology into your business process(es). Two of the most critical considerations are: what will be the source(s) for your speech samples, and what can or will be contained in them? Each is discussed in greater detail below.
Speech Sample Sources
Voice biometric systems process speech samples; therefore, the first step in implementation is knowing how you will get speech samples from your end users. You need a good way to capture speech from a specific user, without capturing speech from other users, or unwanted sounds and noises.
The most common device to capture speech samples is the telephone. Whether a mobile phone, landline phone, or VoIP phone, people speaking on telephones can have their speech recorded using a range of techniques made for this purpose, possibly without them ever knowing. If you have an IVR system, the IVR will likely be able to record audio that is then passed to a voice biometric system.
A second way to capture speech is through a web browser on a computer or tablet, using the embedded microphone. Because nearly every modern laptop and tablet has a microphone, capturing audio through the computer is practical for many applications. Voice capture is just as easy on a mobile phone.
Identify Desired Use Case
Relative to voice biometrics, VBG uses the term "use case" to describe how speech samples are collected. Use cases also determine the content of the speech samples being processed by the voice biometric system. VBG supports all possible speech capture methods; the following sections explain the options.
An "active" use case means the person speaking is repeating something they are required to speak. They are active and knowing participants in the process. The use of active prompting for specific content is also referred to as a "text dependent" use of voice biometrics. One minor downside of requiring participation is that end users have to follow instructions and behave in an expected manner. In some cases, such as monitoring a parolee, you may not have a willing or compliant participant -- which can impact system accuracy.
On the other hand, active approaches have several advantages. First, because the speech content is known in advance, less time is needed to create a voiceprint. Also, active approaches tend to use short phrases, so they are easy for end users to repeat and require very little storage space. And finally, active use cases integrate easily into many computer systems and usage scenarios.
Active Method 1 - Static Text Passphrase
In this active variant, the user speaks the same phrase 2-3 times to create a voiceprint, then once to verify. For example, the phrase may be "At VBG, my voice is my unique identity." Note that the passphrase can be the same for everyone or unique to each person, although unique phrases create operational complexity.
Most voice biometric vendors provide static text passphrases as their primary offering because of their simplicity. However, a weakness of this approach is the ease with which a third party could record the person speaking and then use the recording to authenticate (what is called a "recorded playback attack"). This is a genuine concern given the quality of modern recording devices and a determined fraudster. And although there are techniques to detect "liveness", VBG generally recommends using RandomPIN™ (see below) for applications that are at higher risk of playback attacks.
Active Method 2 - Static Numeric Passphrase
This is identical to the Static Text Passphrase, except that the static phrase is a number. One very common approach for this method is to have the end user repeat his or her mobile phone number 3 times to create a voiceprint. A phone number is easy to remember and changes infrequently. Like text-based passphrases, this method is also susceptible to recorded playback attacks. If static passphrases must be used, VBG recommends they be combined with an outbound call to a mobile phone to help strengthen confidence (i.e., possession of the phone acts like a token, which is another factor).
Active Method 3 - RandomPIN™
With the RandomPIN™ use case, we require the end user to speak a set of diverse digit strings in order to enroll. To verify, we then ask the user to repeat a random number, typically 4 or 5 digits in length. A new number is generated for every verify attempt, for every user, for every call. The VBG system can provide the random string (of any length) or you can create your own. More than 60% of our customers choose this approach due to its robustness against fraud, speed of verification, and simplicity for the person speaking. There is nothing for the end user to remember, and numbers are easy to speak for young and old alike and across cultures.
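If you choose to create your own random string rather than use the one VBG provides, the generation step might look like the minimal Python sketch below. The function names generate_random_pin and spoken_prompt are hypothetical and are not part of any VBG API. The sketch assumes only that the prompt is a string of decimal digits, and that a fresh one is drawn for every verify attempt.

```python
import secrets

DIGITS = "0123456789"


def generate_random_pin(length: int = 5) -> str:
    """Return a string of `length` random decimal digits (hypothetical helper)."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets uses the operating system's secure random source, so the next
    # prompt is hard to predict from earlier ones.
    return "".join(secrets.choice(DIGITS) for _ in range(length))


def spoken_prompt(pin: str) -> str:
    """Space the digits out so a prompt reads them one at a time."""
    return " ".join(pin)


if __name__ == "__main__":
    # Call once per verify attempt; never reuse a prompt across attempts.
    pin = generate_random_pin(4)
    print("Please say:", spoken_prompt(pin))
```

Using a secure random source instead of a simple pseudo-random generator is a design choice made here, on the assumption that an unpredictable prompt is what defeats a pre-recorded sample.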
A "passive" use case means that the person speaking does not have to speak anything in particular. In fact, there is no prompting for speech at all, and the end user may often be unaware that they are involved in a voice biometric process. For example, you can capture the speech of a caller during their conversation with a call center agent. The recording of the caller's leg of the conversation can then be used to either create a voiceprint, or be compared to one or more voiceprints for verification or identification. Passive use cases are also referred to as "conversational" or "text independent" forms of voice biometrics.
The most common source of passive speech samples is from call centers. Before considering this approach, please note that many call center systems record both agent and caller into the same (mono) channel. In these situations, you will have additional work to do to capture the audio before it gets to the recording system. The good news: there are a number of new technologies available, and specialty partners we work with, so please contact us if you suspect this may be an issue.
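As a first check on the mono-channel issue described above, you can inspect the channel count of a sample recording. The sketch below uses Python's standard wave module and assumes your recordings are PCM WAV files. The helper names are hypothetical. Note that two channels do not guarantee that agent and caller are on separate channels, so treat the result as a hint only.

```python
import wave


def channel_count(path: str) -> int:
    """Return the number of audio channels in a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()


def needs_speaker_separation(path: str) -> bool:
    """Flag recordings with fewer than two channels.

    Assumption: in a mono recording the agent and caller are mixed
    together, so the caller's speech cannot be isolated by channel.
    """
    return channel_count(path) < 2
```

For example, needs_speaker_separation("call_0001.wav") (a hypothetical file name) would return True for a mono recording, which suggests the audio should be captured before it reaches the recording system.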
The primary benefit of passive approaches is that you do not need to train end users or ask them to say anything specific. Many business managers consider this the ideal, "frictionless" user experience for voice biometrics. The downside of passive approaches is that you are relying on whatever the speaker happens to be saying as input for the voice biometric system. In order to create a robust voiceprint using passive approaches, we need to capture all the phonemes of the language the user is speaking. This means more time is needed (compared to active approaches).
As a rule of thumb, end users will typically speak all phonemes of their language roughly twice in a 2-3 minute conversation. The VBG system removes silence and non-speech signals from passive recordings to arrive at a measure called "seconds of usable speech" (or SUS). Enrollment samples with SUS values of 60-90 or more will typically provide very good results. We can create voiceprints with only a few seconds of speech, but they will lack phonetic diversity (and will not lead to robust or accurate performance). So, consider 30 seconds of usable speech to be the minimum amount of speech for a successful passive enrollment (assuming this speech has phonetic diversity). And remember, saying "yes", "no", "umm" and other short-syllable words multiple times will not provide good speech diversity.
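The SUS guidance above can be sketched as a simple pre-check before submitting an enrollment sample. This Python example is illustrative only. How the VBG system actually measures SUS is not described here, so the sketch assumes you already have one speech/non-speech flag per fixed-length frame from a voice activity detector of your choice. The function names and the labels "insufficient", "minimum" and "good" are hypothetical.

```python
from typing import Iterable

MIN_SUS_SECONDS = 30.0   # minimum for a passive enrollment, per the text above
GOOD_SUS_SECONDS = 60.0  # lower end of the 60-90 second range, per the text above


def usable_speech_seconds(frame_is_speech: Iterable[bool], frame_ms: int = 20) -> float:
    """Total duration of the frames flagged as speech, in seconds.

    Assumption: each flag covers one frame of `frame_ms` milliseconds.
    """
    speech_frames = sum(1 for is_speech in frame_is_speech if is_speech)
    return speech_frames * frame_ms / 1000.0


def assess_enrollment(sus_seconds: float) -> str:
    """Map a SUS value onto the rule-of-thumb thresholds."""
    if sus_seconds < 0:
        raise ValueError("sus_seconds cannot be negative")
    if sus_seconds < MIN_SUS_SECONDS:
        return "insufficient"
    if sus_seconds < GOOD_SUS_SECONDS:
        return "minimum"
    return "good"
```

This check covers duration only. It cannot tell whether the speech has phonetic diversity, so a recording made up mostly of "yes" and "no" could pass it and still produce a weak voiceprint.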
Passive Method - VBG NaturalSpeech™
The VBG NaturalSpeech™ method addresses all forms of passive speech collection. This use case can accept any speech recording for enrollment or verification, provided that it contains only one speaker and is in a format we accept. Once you have a robust VBG NaturalSpeech™ voiceprint, it is possible to verify with only a few seconds of speech -- a short sentence, phrase, digits, almost anything. And, VBG NaturalSpeech™ allows you to create multiple separate fraudster (i.e., "blacklist") voiceprint databases to determine if fraud is occurring.