How Do I Apply Voice Biometrics?
There are many variables to consider when integrating voice biometric technology into your business process(es). Two of the most critical considerations are: what will be the source(s) for your speech samples, and what will be contained in them? Each is discussed in greater detail below.
Speech Sample Sources
Voice biometric systems process speech samples; therefore, the first step in implementation is knowing how you will get speech samples from your end users. You need a good way to capture speech from a specific user, without capturing speech from other users, or unwanted sounds and noises.
The most common device to capture speech samples is the telephone. Whether a mobile phone, landline phone, or VoIP phone, people speaking on telephones can have their speech recorded using a range of techniques made for this purpose, possibly without them ever knowing. If you have an IVR system, the IVR will likely be able to record audio that is then passed to a voice biometric system.
A second way to capture speech is through a web browser on a computer or tablet, using the embedded microphone. Because nearly every modern laptop and tablet has a microphone, capturing audio through the computer is practical for many applications. Voice capture is just as easy on a mobile phone.
However, it is important to note that Voice Biometrics Group (VBG) technology can work with almost any form of recorded speech, provided that it is in a supported format. For example, one of our customers records people speaking during a language test and uses that audio to create voiceprints. We also have call center customers who send speech samples to us after the call. Regardless of how you get the speech samples to us, make sure you capture only the audio of the desired individual speaker.
Identify Desired Use Case
Relative to voice biometrics, the term "use case" refers to how speech samples are collected. Use cases also determine the content of the speech samples being processed by the voice biometric system. Voice Biometrics Group supports all possible speech capture methods; the following sections explain the options.
An "active" use case means the person speaking is repeating something they are required to speak. They are active and knowing participants in the process. One minor downside of requiring participation is that end users have to follow instructions and behave in an expected manner. In some cases, such as monitoring a parolee, you may not have a willing or compliant participant -- which can impact system accuracy.
On the other hand, active approaches have several advantages. First, because the speech content is known in advance, less time is needed to create a voiceprint. Second, active approaches tend to use short phrases, so they are easy for end users to repeat and require very little storage space. Finally, fraudsters will know that an active system is in place, and the mere threat of a voice biometric check reduces the rate of fraud. Some people will of course try to crack the voice biometric process, but our customers tell us that overall attempts at cheating decline significantly.
Active Method 1 - Static Text Passphrase
In this active variant, the user speaks the same phrase 2-3 times to create a voiceprint, then once to verify. For example, the phrase may be, "at VBG, my voice is my unique identity." Note that your passphrase can be the same for everyone, or you could make the phrase unique for each person, although unique phrases create operational complexity.
Most voice biometric vendors provide static text passphrases as their primary offering because of their simplicity. However, a weakness of this approach is the ease with which a third party could record the person speaking and then use the recording to authenticate (what is called a "recorded playback attack"). This is a genuine concern given the quality of modern recording devices and a determined fraudster. And although there are techniques to detect "liveness", VBG generally recommends using RandomPIN™ (see below) for applications that are at higher risk of playback attacks.
Active Method 2 - Static Numeric Passphrase
This is identical to the Static Text Passphrase, except that the static phrase is a number. One very common approach for this method is to have end users repeat their mobile phone number 3 times to create their voiceprint. A phone number is easy to remember and changes infrequently. Like text-based passphrases, this method is also susceptible to recorded playback attacks. However, if static passphrases must be used, VBG recommends that their use be combined with an outbound call to a mobile phone to help strengthen confidence (i.e., provide another factor -- possession of a cell phone).
Active Method 3 - RandomPIN™
With the RandomPIN™ use case, we require the end user to speak a set of diverse digit strings in order to enroll. To verify, we then ask the user to repeat a random number, typically 4 or 5 digits in length. A new number is generated for every verify attempt, for every user, for every call. The VBG system can provide the random string (of any length) or you can create your own. More than 60% of our customers choose this approach due to its robustness against fraud, speed of verification, and simplicity for the person speaking. There is nothing for the end user to remember, and numbers are easy to speak for young and old alike and across cultures.
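If you choose to create your own random strings, the idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the VBG implementation: the function names generate_random_pin and spoken_prompt are hypothetical, and the default length of 5 simply reflects the typical 4 or 5 digits mentioned above.

```python
import secrets

DIGITS = "0123456789"


def generate_random_pin(length: int = 5) -> str:
    """Return a fresh string of random decimal digits for one verify attempt.

    Hypothetical helper for illustration; call it once per verify attempt so
    that no two prompts are deliberately reused.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets draws from the operating system's secure random source, so a
    # prompt cannot be predicted from earlier prompts.
    return "".join(secrets.choice(DIGITS) for _ in range(length))


def spoken_prompt(pin: str) -> str:
    """Separate the digits with spaces so they are read out one at a time."""
    return " ".join(pin)


if __name__ == "__main__":
    pin = generate_random_pin(4)
    print("Please repeat:", spoken_prompt(pin))
```

Generating the digits with a secure random source rather than a predictable sequence is the design choice that matters here: a recording of an earlier attempt is unlikely to match the next prompt.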
A "passive" use case means that the person speaking does not have to speak anything in particular. In fact, there is no prompting for speech at all, and the end user may often be unaware that they are involved in a voice biometric process. For example, you can capture the speech of a caller during their conversation with a call center agent. The recording of the caller's leg of the conversation can then be used to either create a voiceprint, or be compared to one or more voiceprints for verification or identification. Sometimes passive voice biometrics is called "conversational biometrics".
The most common source of passive speech samples is call center recordings. Before considering this approach, please note that many call center systems record both agent and caller into the same (mono) channel. In these situations, you will have additional work to do to capture the audio before it gets to the recording system. VBG has a number of partners who specialize in recording IVR and call center speech, so we may be able to assist you should you not have the ability to record individual callers into a separate channel.
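If your recording system does write agent and caller to separate channels of one stereo file, isolating the caller can be sketched as below. This is a minimal example under assumptions of our own: the file is uncompressed PCM WAV, the caller sits on a channel index you already know, and the function names are hypothetical. It cannot separate speakers that were already mixed into a single mono channel.

```python
import wave


def deinterleave(raw: bytes, n_channels: int, sample_width: int,
                 channel_index: int) -> bytes:
    """Return the samples of one channel from interleaved PCM frames."""
    if not 0 <= channel_index < n_channels:
        raise ValueError("channel_index is out of range")
    frame_size = n_channels * sample_width   # bytes per multi-channel frame
    offset = channel_index * sample_width    # where our channel starts in a frame
    mono = bytearray()
    # Walk the buffer one whole frame at a time, ignoring any trailing partial frame.
    for start in range(0, len(raw) - frame_size + 1, frame_size):
        mono += raw[start + offset:start + offset + sample_width]
    return bytes(mono)


def extract_channel(src_path: str, dst_path: str, channel_index: int) -> None:
    """Write one channel of a multi-channel WAV file to a new mono WAV file."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())
    mono = deinterleave(raw, n_channels, sample_width, channel_index)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono)
```

For example, extract_channel("call.wav", "caller.wav", 1) would keep only the second channel, assuming that is where the caller was recorded.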
The primary benefit of passive approaches is that you do not need to train end users or ask them to say anything specifically. Many business managers consider this the ideal, "frictionless" user experience for voice biometrics. The downside of passive approaches is that you are relying on whatever the speaker happens to be saying as input for the voice biometric system. In order to create a robust voiceprint using passive approaches, we need to capture all the phonemes of the language the user is speaking, preferably twice. This means more time is needed (compared to active approaches).
As a rule of thumb, end users will typically speak all phonemes of their language roughly twice in a 2-3 minute conversation. The VBG system removes silence and non-speech signals from passive recordings to arrive at a measure called "seconds of usable speech" (or SUS). Enrollment samples having SUS values of 60-90 or more will typically provide very good results. We can create voiceprints with only a few seconds of speech, but they will lack phonetic diversity (and will not lead to robust or accurate performance). So, consider 30 seconds of usable speech to be the minimum amount of speech for a successful passive enrollment (assuming this speech has phonetic diversity). And remember, saying "yes", "no", "umm" and other short-syllable words multiple times will not provide good speech diversity.
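To make the SUS idea concrete, here is a rough sketch that estimates usable speech and compares it to the 30 and 60 second guidelines above. It is not how the VBG system computes SUS: we substitute a simple per-frame energy threshold for real silence and non-speech removal, and the function names, the 20 ms frame size, and the 0.02 threshold are all assumptions chosen for illustration. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
from typing import Sequence


def seconds_of_usable_speech(samples: Sequence[float], sample_rate: int,
                             frame_ms: int = 20,
                             rms_threshold: float = 0.02) -> float:
    """Estimate usable speech by counting frames whose RMS energy passes a threshold."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("sample_rate and frame_ms must give at least one sample per frame")
    usable_frames = 0
    # Only whole frames are examined; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms >= rms_threshold:
            usable_frames += 1
    return usable_frames * frame_len / sample_rate


def enrollment_readiness(sus: float) -> str:
    """Map a SUS value onto the guidelines stated in the text."""
    if sus < 30:
        return "insufficient"   # below the stated minimum for passive enrollment
    if sus < 60:
        return "minimum"        # usable, but short of the 60-90 second range
    return "good"               # 60 seconds or more


if __name__ == "__main__":
    # One second of constant signal followed by one second of silence at 8 kHz.
    demo = [0.5] * 8000 + [0.0] * 8000
    print(seconds_of_usable_speech(demo, 8000))
```

Keep in mind that an energy threshold only measures how much sound is present. It says nothing about phonetic diversity, so a high value from repeated "yes" and "no" answers would still make a poor enrollment sample.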
Passive Method - NaturalSpeech™
VBG has created the NaturalSpeech™ method to address all forms of passive speech collection. This use case can accept any speech recording for enrollment or verification, provided that it contains only one speaker and is in a format we accept. Once you have a robust NaturalSpeech™ voiceprint, it is possible to verify with only a few seconds of speech -- a short sentence, phrase, digits, almost anything. NaturalSpeech™ also allows you to create multiple separate fraudster (i.e., "blacklist") voiceprint databases to determine if fraud is occurring.