First, Some Definitions
What is voice biometrics? The term "voice biometrics" typically refers to a sophisticated set of software programs that processes speech samples for two main purposes: (1) to model the unique vocal characteristics from an individual's speech sample(s) and create a voiceprint, or (2) to compare a speech sample to one or more stored voiceprints in order to obtain a match probability. Once you have a voiceprint for someone, you can perform several useful functions. The two most popular are verification (confirming that a speaker is who they claim to be by comparing their speech to the stored voiceprint for that identity) and identification (comparing a speech sample to multiple voiceprints to find the best match).
Verification is most often used to protect whatever comes after the verification process. On the other hand, identification is most often used to detect the presence of fraud.
Assuming one or both of these functions makes conceptual sense for your business, next consider basic, practical implications of using speech within your applications.
Easy access to speech samples? Voice biometric technology is extremely useful, but only if it is easy for you to get speech samples from your end users. If your end users will typically be in a sports stadium, operating construction equipment, or in any other very noisy environment, then voice biometric technology may not be the best fit. Also, consider how natural it is for your end users to speak. Will they have ready access to a phone or other microphone device? And consider whether there are better technologies for your end users in their intended usage environment. For example, a fingerprint sensor might make more sense than voice biometrics for logging into a laptop.
There is good news, however: with nearly everyone carrying a mobile phone these days, there are many valid scenarios for using voice biometrics.
Will Voice Biometrics Be Accurate Enough?
Just as no security system is infallible, it is important to recognize that no biometric technology is 100% accurate. System accuracy could be anywhere between 95 and 99%, a broad range. Voice biometric accuracy is also in this range for a variety of reasons. However, even with its slight imperfections, voice biometrics is an extremely valuable tool.
This is especially true when recognizing that voice biometric systems are typically used as part of a multi-factor authentication process. A 97% accuracy level may not sound very good as a single factor. However, if you achieve 97% confidence with User ID and Password (something you know), and also achieve 97% confidence with a token (something you have), and also achieve a 97% confidence with voice biometrics (something you are), then most fraud and risk professionals will find this combination of probabilities quite acceptable.
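To make the combination of probabilities concrete, here is a minimal sketch. It rests on two assumptions that are not stated above: that "97% confidence" means a 3% chance an impostor defeats that factor, and that the three factors fail independently of one another. Real factors are rarely fully independent, so treat the result as a rough illustration. The function name is hypothetical.

```python
import math

def combined_breach_probability(factor_confidences):
    """Probability that an impostor defeats every factor.

    Assumes each factor fails independently (a simplifying assumption).
    """
    return math.prod(1.0 - c for c in factor_confidences)

# Three factors at 97% each: password, token, voice biometrics.
p = combined_breach_probability([0.97, 0.97, 0.97])
print(f"Chance all three factors are defeated: {p:.6f}")
```

Under these assumptions the chance of all three factors failing together is about 27 in a million, compared with 3 in 100 for any single factor.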
Some More Definitions
False Acceptance Rate (FAR). The "False Acceptance Rate" or "FAR" is the rate at which a biometric system lets an impostor through. The lower the FAR, the better.
False Rejection Rate (FRR). The "False Rejection Rate" or "FRR" is the rate at which a biometric system denies a bona fide user access. Again, the lower the FRR, the better.
Equal Error Rate (EER). The "Equal Error Rate" or "EER" is the point where FAR=FRR. An EER = 0 is ideal, although no commercial voice biometrics vendor has achieved this with real-world use cases.
Voice biometric scoring systems are probabilistic, not deterministic, so there are trade-offs between these errors to consider. For instance, if you set your confidence levels "high" to prevent impostors, you may end up blocking more valid users, which causes annoyance. Setting a confidence level "lower" will provide more convenience for your valid users, but you may end up letting in more impostors.
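The trade-off can be seen with a small sketch that computes FAR and FRR at several thresholds and then scans for the threshold where the two are closest. The scores are made up for illustration, the function names are hypothetical, and the rule "accept when score >= threshold" is an assumption about how a scoring system might be wired, not a description of any particular product.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

def approximate_eer(genuine_scores, impostor_scores):
    """Return (threshold, FAR, FRR) where |FAR - FRR| is smallest."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in candidates:
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best

# Made-up scores for illustration only (not real measurements).
genuine = [0.91, 0.85, 0.78, 0.95, 0.60, 0.88, 0.73, 0.97, 0.82, 0.90]
impostor = [0.20, 0.35, 0.50, 0.65, 0.10, 0.42, 0.30, 0.80, 0.25, 0.15]

for threshold in (0.5, 0.7, 0.9):
    far, frr = far_frr(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  FAR={far:.0%}  FRR={frr:.0%}")

t, far, frr = approximate_eer(genuine, impostor)
print(f"approximate EER near threshold {t:.2f}: FAR={far:.0%}, FRR={frr:.0%}")
```

With this toy data, raising the threshold from 0.5 to 0.9 drives FAR from 30% to 0% while FRR climbs from 0% to 60%. The two rates meet at 10% near a threshold of 0.73.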
VBG will work with you to provide an optimal balance between security and convenience. Also note that EER is measured for a single attempt. By allowing retries in your application, you can increase the probability that a valid user gets through on a second or third retry, even if you originally set confidence levels high to reject impostors.
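The effect of retries can be estimated with simple arithmetic. The sketch below assumes each attempt is independent, which is a simplification (a noisy room affects every attempt), and uses hypothetical per-attempt rates. It also shows the other side of the trade: every extra attempt gives an impostor another chance as well.

```python
def pass_probability(per_attempt_rejection_rate, attempts):
    """Chance of being accepted at least once within `attempts` tries,
    assuming each attempt is independent (a simplification)."""
    return 1.0 - per_attempt_rejection_rate ** attempts

frr = 0.05  # hypothetical: valid user rejected 5% of the time per attempt
far = 0.01  # hypothetical: impostor accepted 1% of the time per attempt

for attempts in (1, 2, 3):
    valid_ok = pass_probability(frr, attempts)
    # An impostor is rejected with probability (1 - far) on each attempt.
    impostor_ok = pass_probability(1.0 - far, attempts)
    print(f"{attempts} attempt(s): valid user accepted {valid_ok:.4%}, "
          f"impostor accepted {impostor_ok:.4%}")
```

With these example rates, three attempts lift a valid user's chance of acceptance from 95% to about 99.99%, while an impostor's chance rises from 1% to roughly 3%.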
Beware of Claims!
Note that EER, FAR, and FRR results are derived from a set of audio samples used to process and obtain these measurements. Beware of extremely low EER values advertised by some vendors, as laboratory-derived EER results can be easily manipulated by removing samples that negatively impact results. EER results are only as good as the data sampling that was performed for their calculation. Unexpected real-world results can (and will) occur if your sampling is not truly representative of the end user population, specific languages and dialects, types of devices being used, environment where speech is being collected, etc.
We recommend that you compare voice biometric systems based on real-world users, including running trials within your intended production environment, rather than relying on published EER results with no insight into the data used for the test.
The vast majority of VBG's production customers are tuned to 1-3% EER, more than enough to function as a very reliable component of multi-factor authentication processes. You don't have to take our word for it either, as we offer a 60-day free trial!
Another significant factor impacting real-world EER is the content of the speech sample and presence of noise. If a speech sample is too noisy, or if the wrong information is spoken to the voice biometric system, then the voice biometric engine may have more difficulty using the speech sample to make an accurate determination.
Using different types of devices can also influence results. For example, mobile phone networks use different compression techniques compared to landline phones -- this affects the voice biometric process which models the unique vocal characteristics from speech samples.
What Kind of ROI Will I Get?
Assuming that voice biometrics is relevant to your business problem, and is sufficiently accurate for you, the next question to consider is whether implementing a voice biometric project will give you sufficient return on investment (ROI). This is an important question for VBG, too. If we cannot provide sufficient accuracy and performance at a cost that enables a successful business case for you, we won’t have a business.
The business case for voice biometrics typically involves one or more of three business drivers:
Reducing Exposure to Fraud
Fraud reduction is always specific to your particular situation. Sometimes employees commit fraud, while other times fraudsters target call center agents with social engineering. Typically, the savings from fraud losses are quite large -- and tend to be well in excess of the cost to implement a solution.
Reducing Costs for Authentication
A second driver, reducing costs for authentication, is a common need within IVR systems and call centers. Voice biometrics can automate the authentication process on a large percentage of calls that would otherwise require live agent "handle time". For example, the cost to authenticate a call with an agent is equal to the seconds of agent time for the authentication multiplied by the cost of the agent per second.
If an IVR that includes voice biometrics authenticates the caller before transferring to an agent, or if a voice biometrics system authenticates the caller during the beginning of the call with the agent, then that saves agent time. Savings in agent time can often be as little as 20 seconds, or up to 60 seconds or more.
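The cost formula above (seconds of agent time multiplied by agent cost per second) is easy to turn into a rough savings estimate. In the sketch below, the call volume, agent cost, and automation rate are hypothetical placeholders to be replaced with your own figures. The result is gross agent-time savings only and does not subtract the cost of the biometric system.

```python
def authentication_cost(seconds_per_call, agent_cost_per_hour, calls):
    """Agent-handled authentication cost: seconds x cost per second x calls."""
    cost_per_second = agent_cost_per_hour / 3600.0
    return seconds_per_call * cost_per_second * calls

# All figures below are hypothetical placeholders; substitute your own.
calls_per_year = 1_000_000
agent_cost_per_hour = 36.00   # fully loaded agent cost, assumed
automation_rate = 0.80        # share of calls authenticated by voice, assumed

for seconds_saved in (20, 60):
    saved = authentication_cost(seconds_saved, agent_cost_per_hour,
                                calls_per_year * automation_rate)
    print(f"{seconds_saved}s saved per call -> about ${saved:,.0f} per year")
```

With these placeholder numbers, saving 20 seconds per automated call is worth about $160,000 a year, and saving 60 seconds about $480,000.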
Improving the Customer Experience
The third driver for voice biometrics is improving the customer experience (making it easier, faster, less obtrusive, etc.). In the call center scenario, not only does it save time for the agent, but it also saves time for the caller. And many callers are sophisticated enough to recognize you are using advanced technology to protect them and provide them with greater security (as compared to asking questions like your mother's maiden name).
Measuring the return on improved customer experience and the impact to revenue and customer retention is difficult. Therefore, the first two drivers, reducing fraud exposure and reducing costs for authentication, are what typically push companies to adopt voice biometric technology.
Voice biometric systems will provide savings, even when factoring in the costs for the biometric system and the resources needed to integrate these systems into your business process(es) and maintain them over time. And while companies typically have a wide range of constraints when considering voice biometric systems, VBG offers many flexible deployment options and pricing plans to accommodate almost any conceivable need.
How Do I Apply Voice Biometrics?
There are many variables to consider when integrating voice biometric technology into your business process(es). Two of the most critical considerations are: what will be the source(s) for your speech samples, and what can or will be contained in them? Each is discussed in greater detail below.
Speech Sample Sources
Voice biometric systems process speech samples; therefore, the first step in implementation is knowing how you will get speech samples from your end users. You need a good way to capture speech from a specific user, without capturing speech from other users, or unwanted sounds and noises.
The most common device to capture speech samples is the telephone. Whether a mobile phone, landline phone, or VoIP phone, people speaking on telephones can have their speech recorded using a range of techniques made for this purpose, possibly without them ever knowing. If you have an IVR system, the IVR will likely be able to record audio that is then passed to a voice biometric system.
A second way to capture speech is through a web browser on a computer or tablet, using the embedded microphone. Because nearly every modern laptop and tablet has a microphone, capturing audio through the computer is practical for many applications. Likewise, voice capture is also easily done on a mobile phone.
VBG technology can work with almost any form of streaming or recorded speech, provided that it is in a supported format. To see some of the many different integrations VBG currently supports, please refer to our Integration Examples page.
Identify Desired Use Case
Relative to voice biometrics, VBG uses the term "use case" to describe how speech samples are collected. Use cases also determine the content of the speech samples being processed by the voice biometric system. VBG supports all possible speech capture methods; the following sections explain the options.
Active Methods
An "active" use case means the person speaking is repeating something they are required to speak. They are active and knowing participants in the process. The use of active prompting for specific content is also referred to as a "text dependent" use of voice biometrics. One minor downside of requiring participation is that end users have to follow instructions and behave in an expected manner. In some cases, such as monitoring a parolee, you may not have a willing or compliant participant -- which can impact system accuracy.
On the other hand, active approaches have several advantages. First, because the speech content is known in advance, less time is needed to create a voiceprint. Also, active approaches tend to use short phrases, so they are easy for end users to repeat and require very little storage space. And finally, active use cases integrate easily into many computer systems and usage scenarios.
Active Method 1 - Static Text Passphrase
In this active variant, the user speaks the same phrase 2-3 times to create a voiceprint, then once to verify. For example, the phrase may be, "at VBG, my voice is my unique identity." Note that your passphrase can be the same for everyone or you could make the phrase unique for each person, although unique phrases create operational complexity.
Most voice biometric vendors provide static text passphrases as their primary offering due to its simplicity. However, a weakness of this approach is the ease with which a third party could record the person speaking and then use the recording to authenticate (what is called a "recorded playback attack"). This is a genuine concern given the quality of modern recording devices and a determined fraudster. And although there are techniques to detect "liveness", VBG generally recommends using RandomPIN™ (see below) for applications that are at higher risk of playback attacks.
Active Method 2 - Static Numeric Passphrase
This is identical to the Static Text Passphrase, except that the static phrase is a number. One very common approach for this method is to have the end user repeat his or her mobile phone number 3 times to create their voiceprint. This is easy to remember, changes infrequently, etc. Like text-based passphrases, this method is also susceptible to recorded playback attacks. However, if static passphrases must be used, VBG recommends they be combined with an outbound call to a mobile phone to help strengthen confidence (i.e., possession of the phone is like a token, another factor).
Active Method 3 - RandomPIN™
With the RandomPIN™ use case, we require the end user to speak a set of diverse digit strings in order to enroll. To verify, we then ask the user to repeat a random number, typically 4 or 5 digits in length. A new number is generated for every verify attempt, for every user, for every call. The VBG system can provide the random string (of any length) or you can create your own. More than 60% of our customers choose this approach due to the robustness against fraud, speed of verification, and simplicity for the person speaking. There is nothing for the end user to remember, and numbers are easy to speak for young and old alike and across cultures.
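If you choose to create your own random string, a minimal sketch might look like the following. This is not VBG's generator. The function names are hypothetical, and using Python's `secrets` module (rather than `random`) is a design choice made here so that the challenge is hard to predict.

```python
import secrets

def random_digit_string(length=5):
    """Return `length` random decimal digits to use as a spoken challenge."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return "".join(secrets.choice("0123456789") for _ in range(length))

def spoken_prompt(digits):
    """Space the digits out so a prompt reads them one at a time."""
    return " ".join(digits)

# Generate a fresh challenge for each verification attempt.
challenge = random_digit_string(5)
print(f"Please say: {spoken_prompt(challenge)}")
```

Because a new string is drawn on every attempt, a recording of an earlier session is unlikely to contain the digits requested next time.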
Passive Methods
A "passive" use case means that the person speaking does not have to speak anything in particular. In fact, there is no prompting for speech at all, and the end user may often be unaware that they are involved in a voice biometric process. For example, you can capture the speech of a caller during their conversation with a call center agent. The recording of the caller's leg of the conversation can then be used to either create a voiceprint, or be compared to one or more voiceprints for verification or identification. Passive use cases are also referred to as "conversational" or "text independent" forms of voice biometrics.
The most common source of passive speech samples is from call centers. Before considering this approach, please note that many call center systems record both agent and caller into the same (mono) channel. In these situations, you will have additional work to do to capture the audio before it gets to the recording system. The good news: there are a number of new technologies available, and specialty partners we work with, so please contact us if you suspect this may be an issue.
The primary benefit of passive approaches is that you do not need to train end users or ask them to say anything specifically. Many business managers consider this the ideal, "frictionless" user experience for voice biometrics.
The VBG system removes silence and non-speech signals from passive recordings to arrive at a measure called “seconds of usable speech” (or SUS). Enrollment samples having SUS values of 60-75 or more will typically provide very good results. On the other hand, we can create voiceprints with only a few seconds of speech, but they will probably lack phonetic diversity (and will not lead to robust or accurate performance).
So, consider 30 seconds of usable speech to be the minimum amount of speech for a successful passive enrollment. And remember, saying "yes", "no", "umm" and other short, single-syllable words multiple times will not provide good speech diversity.
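As a rough illustration of these thresholds, the sketch below totals the speech segments found in a recording and classifies the result. It assumes some voice activity detector has already produced (start, end) times in seconds. That detector, the function names, and the category labels are hypothetical and do not describe how the VBG system computes SUS. Duration alone also says nothing about phonetic diversity.

```python
def seconds_of_usable_speech(speech_segments):
    """Sum the durations of (start, end) speech segments, in seconds."""
    return sum(end - start for start, end in speech_segments)

def enrollment_readiness(sus):
    """Classify an enrollment sample using the thresholds from the text."""
    if sus >= 60:
        return "very good"
    if sus >= 30:
        return "minimum met"
    return "too short"

# Hypothetical output of a voice activity detector: (start, end) in seconds.
segments = [(0.8, 12.4), (15.0, 31.5), (40.2, 55.9)]
sus = seconds_of_usable_speech(segments)
print(f"SUS = {sus:.1f}s -> {enrollment_readiness(sus)}")
```

Here a recording of about 56 seconds yields 43.8 seconds of usable speech once the gaps are removed, which clears the 30-second minimum but falls short of the 60-75 second range.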
Verify or Identify or Both?
By far the most common problem companies solve with voice biometrics is verification. Voice verification allows you to take a speech sample from someone claiming a specific identity, and compare it to the stored voiceprint for that identity. This is a 1:1 process that is extremely fast. Verification needs only a couple of seconds of speech and the voice biometric scoring process is sub-second. VBG systems process many millions of verification requests a year.
However, some situations, particularly fraud detection and prevention, require that a speech sample be compared to multiple voiceprints in order to determine the best match. Many scenarios exist for voice-based fraud to occur, such as cheating on language tests, where an individual pays a professional test-taker to take their exam for them.
Far more prevalent is the fraud that occurs in a call center or IVR system, such as duplicate registrations and account takeovers (ATO). VBG systems detect these types of fraud. Multiple fraudster "blacklists" can be set up to scan all incoming speech for suspicious activity.
It is important to note that depending on the size of the blacklist, comparing a speech sample to all voiceprints in the database may take significant processing time.
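The difference in processing effort follows from the shape of the two operations, as the sketch below shows. Verification makes one comparison, while identification makes one comparison per voiceprint on the list. Everything here is an assumption for illustration: voiceprints are represented as short lists of numbers, cosine similarity stands in for the actual scoring method, and the threshold of 0.8 is arbitrary. None of it describes VBG's engine.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def verify(sample, claimed_voiceprint, threshold=0.8):
    """1:1 check: a single comparison against the claimed identity."""
    return cosine_similarity(sample, claimed_voiceprint) >= threshold

def identify(sample, voiceprints, threshold=0.8):
    """1:N scan: compare against every voiceprint; work grows with N."""
    best_id, best_score = None, threshold
    for speaker_id, voiceprint in voiceprints.items():
        score = cosine_similarity(sample, voiceprint)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id  # None when nothing reaches the threshold

# Toy vectors standing in for voiceprints (illustrative values only).
blacklist = {"fraudster_a": [0.9, 0.1, 0.2], "fraudster_b": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

print(verify(sample, blacklist["fraudster_a"]))
print(identify(sample, blacklist))
```

In this toy example the sample verifies against the first voiceprint and the scan returns `fraudster_a`. A blacklist with thousands of entries would repeat the comparison thousands of times per sample.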