
Considering Voice Biometrics?

Start with this document to determine if voice biometric technology is right for you.

Last Updated Feb 2021

First, a Definition

What is voice biometrics? The term "voice biometrics" typically refers to a sophisticated set of software programs that process speech samples for two main purposes: (1) to model the unique vocal characteristics in an individual's speech sample(s) and create a voiceprint, or (2) to compare a speech sample to one or more stored voiceprints in order to obtain a match probability. Once you have a voiceprint for someone, you can perform several useful functions, the two most popular being:

  • Verification. If someone is claiming a specific identity and you have a stored voiceprint for them, you can confirm their identity by comparing a speech sample from them to their stored voiceprint. This is called verification, and is also known as voice verification, speaker verification, or voice authentication. It is a very fast 1:1 matching process.
  • Identification. If there is no specific identity claim, you can determine who the individual is by comparing a speech sample against all available voiceprints. This is called identification, and is also known as speaker identification or voice identification. It is a 1:many matching process, which can take a substantial amount of time.
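To make the 1:1 versus 1:many distinction concrete, here is a minimal sketch that treats each voiceprint as a fixed-length numeric vector and scores matches with cosine similarity. That representation is an assumption (it is common in speaker recognition research, but this document does not describe how VBG's voiceprints are built), and the vectors, user names, and 0.8 threshold are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, claimed_id, voiceprints, threshold=0.8):
    """1:1 match: compare the sample only with the claimed identity's voiceprint."""
    score = cosine_similarity(sample, voiceprints[claimed_id])
    return score >= threshold, score

def identify(sample, voiceprints):
    """1:many match: score the sample against every stored voiceprint, best match first."""
    scores = [(cosine_similarity(sample, vp), user) for user, vp in voiceprints.items()]
    return sorted(scores, reverse=True)

# Toy voiceprints and sample (hypothetical values).
voiceprints = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]
print(verify(sample, "alice", voiceprints))
print(identify(sample, voiceprints)[0][1])  # best-matching user
```

Verification touches one stored voiceprint no matter how many are enrolled, while identification touches all of them, which is why the cost of identification grows with the size of the voiceprint database.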

Assuming one or both of these functions makes conceptual sense for your business, next consider basic, practical implications of using speech within your applications.

Easy access to speech samples? Voice biometric technology is extremely useful, but only if it is easy for you to get speech samples from your end users. If your end users will typically be in a sports stadium, operating construction equipment, or in any other very loud or noisy environment, then voice biometric technology may not be the best fit. Also, consider how natural it is for your end users to speak. Will they have ready access to a phone or other microphone device? And consider whether there are better technologies for your end users in their intended usage environment. For example, a fingerprint sensor might make more sense than voice biometrics for logging into a laptop.

The good news, however, is that with nearly everyone carrying a mobile phone these days, there are many good and valid scenarios for using voice biometrics.

Will Voice Biometrics Be Accurate Enough?

Just as no security system is infallible, it is important to recognize that no biometric technology is 100% accurate. Biometric system accuracy can fall anywhere between 90% and 99%, which is a broad range, and voice biometric accuracy varies within this range for a variety of reasons. However, even with its imperfections, voice biometrics is an extremely valuable tool.

This is especially true when recognizing that voice biometric systems are typically used as part of a multi-factor authentication process. A 95% accuracy level may not sound very good as a single factor. However, if you achieve 95% confidence with User ID and Password (something you know), 95% confidence with a token (something you have), and 95% confidence with voice biometrics (something you are), then most fraud and risk professionals will find this combination of probabilities acceptable.
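The arithmetic behind that conclusion can be sketched in a few lines of Python. The sketch reads "95% confidence" as a 95% chance that a factor stops an impostor, and it assumes the three factors fail independently, which is an optimistic simplification (a stolen phone, for instance, may compromise more than one factor). Under those assumptions, the chance that an impostor defeats all three factors is 0.05 x 0.05 x 0.05 = 0.000125, or about 1 in 8,000.

```python
def probability_all_factors_fooled(confidences):
    """Chance an impostor defeats every factor, treating factors as independent."""
    p = 1.0
    for confidence in confidences:
        p *= 1.0 - confidence  # chance this single factor is fooled
    return p

# Hypothetical per-factor confidence levels from the example above.
factors = {"user ID + password": 0.95, "token": 0.95, "voice biometrics": 0.95}
p_fooled = probability_all_factors_fooled(factors.values())
print(f"P(all three fooled) = {p_fooled:.6f}")
print(f"Combined confidence = {1 - p_fooled:.6f}")
```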

The usual measure of a voice biometric system's accuracy is the "Equal Error Rate", or "EER". EER is the point at which the "False Acceptance Rate" or "FAR" (i.e., letting an impostor through) equals the "False Rejection Rate" or "FRR" (i.e., denying access to a valid user). Voice biometric scoring systems are based on statistical probabilities, so there are trade-offs between these errors to consider. For instance, if you set your confidence levels "high" to keep impostors out, you may end up blocking more valid users, which frustrates them. Setting a confidence level "lower" will provide more convenience for your valid users, but you may end up letting in more impostors. VBG will work with you to find the optimal balance between security and convenience. Also note that EER is measured for a single attempt. By allowing retries in your application, you can increase the probability that a valid user gets through on a second or third attempt, even if you originally set confidence levels high to reject impostors (keeping in mind that each retry also gives an impostor another chance).
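The sketch below shows one simple way to compute FAR, FRR, and EER from a set of match scores, and how retries shift the per-session error rates. The scores are synthetic toy values, the threshold sweep is a basic discrete approximation (evaluation tools often interpolate between thresholds instead), and the retry formulas assume every attempt is independent, which repeated attempts by the same person rarely are.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: share of impostor scores accepted; FRR: share of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a threshold; return (threshold, EER) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]

def retry_rates(far, frr, attempts):
    """Per-session rates when up to `attempts` tries are allowed (independent attempts assumed)."""
    session_frr = frr ** attempts            # valid user rejected on every attempt
    session_far = 1 - (1 - far) ** attempts  # impostor accepted on at least one attempt
    return session_far, session_frr

# Synthetic scores: genuine = same-speaker trials, impostor = different-speaker trials.
genuine = [0.91, 0.85, 0.78, 0.88, 0.60, 0.95, 0.82, 0.70]
impostor = [0.30, 0.45, 0.52, 0.20, 0.65, 0.38, 0.41, 0.72]
threshold, eer = equal_error_rate(genuine, impostor)
print(threshold, eer)
print(retry_rates(far=0.02, frr=0.10, attempts=3))
```

On these toy scores the FAR and FRR curves cross at a threshold of 0.70, giving an EER of 12.5%. With a hypothetical per-attempt FAR of 2% and FRR of 10%, allowing three attempts cuts the chance of rejecting a valid user to about 0.1%, but raises the chance of admitting an impostor to about 5.9%.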

Note that EER, FAR, and FRR results are derived from the particular set of audio samples used to compute them. Beware of extremely low EER values advertised by some vendors, as laboratory-derived EER results can be easily manipulated by removing samples that negatively impact results. EER results are only as good as the data sampling that was performed for their calculation. Unexpected real-world results can (and will) occur if your sampling is not truly representative of the end user population, specific languages and dialects, types of devices being used, environment where speech is being collected, etc. Therefore, we recommend that you compare voice biometric systems based on real-world users, including running trials within your intended production environment, rather than relying on published EER results with no insight into the data used for the test.

Another significant factor impacting real-world EER is the content of the speech sample and the presence of noise. If a speech sample is too noisy, or if the wrong information is spoken to the voice biometric system, then the voice biometric engine may have more difficulty using the speech sample to make an accurate determination. Using different types of devices can also influence results. For example, mobile phone networks use different compression techniques than landline phones, and this affects how the voice biometric process models the unique vocal characteristics in speech samples.

What Kind of ROI Will I Get?

Assuming that voice biometrics is relevant to your business problem, and is sufficiently accurate for you, the next question to consider is whether implementing a voice biometric project will give you sufficient return on investment (ROI). This is an important question for VBG, too. If we cannot provide sufficient accuracy and performance at a cost that enables a successful business case for you, we won’t have a business.

The business case for voice biometrics typically involves one or more of three business drivers:

  • Reducing exposure to fraud
  • Reducing costs for authentication
  • Improving the customer experience

Fraud reduction is always specific to your particular situation. Sometimes employees commit fraud, while other times fraudsters target call center agents with social engineering. VBG cannot share specific fraud reduction numbers from our clients; however, the cost savings are large -- and tend to be well in excess of the cost to implement a solution.

A second driver, reducing costs for authentication, is a common need within IVR systems and call centers. Voice biometrics can automate the authentication process on a large percentage of calls that would otherwise require live agent "handle time". For example, the cost to authenticate a call with an agent is equal to the seconds of agent time spent on authentication multiplied by the cost of the agent per second. If an IVR that includes voice biometrics authenticates the caller before transferring to an agent, or if a voice biometrics system authenticates the caller at the beginning of the call with the agent, then that saves agent time. Savings in agent time can be as little as 20 seconds per call, or as much as 60 seconds or more.
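The cost formula above translates directly into a small calculator. The call volume, share of calls authenticated automatically, and agent cost below are hypothetical placeholders to replace with your own figures; the 20 and 60 second values are the range quoted above.

```python
def annual_authentication_savings(calls_per_year, automation_rate,
                                  seconds_saved_per_call, agent_cost_per_hour):
    """Agent cost avoided = automated calls x seconds of agent time saved x agent cost per second."""
    agent_cost_per_second = agent_cost_per_hour / 3600
    automated_calls = calls_per_year * automation_rate
    return automated_calls * seconds_saved_per_call * agent_cost_per_second

# Hypothetical inputs: 1M calls/year, half authenticated automatically, $36/hour agent cost.
low = annual_authentication_savings(1_000_000, 0.5, 20, 36.0)
high = annual_authentication_savings(1_000_000, 0.5, 60, 36.0)
print(f"20 s saved per call: ${low:,.0f} per year")
print(f"60 s saved per call: ${high:,.0f} per year")
```

With these placeholder inputs ($36 per hour works out to $0.01 per second), the savings range from about $100,000 to about $300,000 per year, before subtracting the cost of the voice biometric system itself.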

The third driver for voice biometrics is improving the customer experience (making it easier, faster, less obtrusive, etc.). In the call center scenario, not only does it save time for the agent, but it also saves time for the caller. And many callers are sophisticated enough to recognize that you are using advanced technology to protect them and provide them with greater security (as compared to asking questions like your mother's maiden name).

Measuring the return on improved customer experience and the impact to revenue and customer retention is difficult. Therefore, the first two drivers, reducing fraud exposure and reducing costs for authentication, are what typically push companies to adopt voice biometric technology.

Voice biometric systems will provide savings, even when factoring in the costs for the biometric system and the resources needed to integrate these systems into your business process(es) and maintain them over time. And while companies typically have a wide range of constraints when considering voice biometric systems, VBG offers many flexible deployment options and pricing plans to accommodate almost any conceivable need.

How Do I Apply Voice Biometrics?

There are many variables to consider when integrating voice biometric technology into your business process(es). Two of the most critical considerations are: what will be the source(s) for your speech samples, and what can or will be contained in them? Each is discussed in greater detail below.

Speech Sample Sources

Voice biometric systems process speech samples; therefore, the first step in implementation is knowing how you will get speech samples from your end users. You need a good way to capture speech from a specific user, without capturing speech from other users, or unwanted sounds and noises.

The most common device to capture speech samples is the telephone. Whether a mobile phone, landline phone, or VoIP phone, people speaking on telephones can have their speech recorded using a range of techniques made for this purpose, possibly without them ever knowing. If you have an IVR system, the IVR will likely be able to record audio that is then passed to a voice biometric system.

A second way to capture speech is through a web browser on a computer or tablet, using the embedded microphone. Because nearly every modern laptop and tablet has a microphone, capturing audio through the computer is practical for many applications. Voice capture is just as easy on a mobile phone.

Identify Desired Use Case

Relative to voice biometrics, VBG uses the term "use case" to describe how speech samples are collected. Use cases also determine the content of the speech samples being processed by the voice biometric system. VBG supports all possible speech capture methods; the following sections explain the options.

Active Methods

An "active" use case means the person speaking is repeating something they are required to speak. They are active and knowing participants in the process. The use of active prompting for specific content is also referred to as a "text dependent" use of voice biometrics. One minor downside of requiring participation is that end users have to follow instructions and behave in an expected manner. In some cases, such as monitoring a parolee, you may not have a willing or compliant participant -- which can impact system accuracy.

On the other hand, active approaches have several advantages. First, because the speech content is known in advance, less time is needed to create a voiceprint. Also, active approaches tend to use short phrases, so they are easy for end users to repeat and require very little storage space. And finally, active use cases integrate easily into many computer systems and usage scenarios.

Active Method 1 - Static Text Passphrase

In this active variant, the user speaks the same phrase 2-3 times to create a voiceprint, then once to verify. For example, the phrase may be, "at VBG, my voice is my unique identity." Note that your passphrase can be the same for everyone, or you could make the phrase unique for each person, although unique phrases create operational complexity.

Most voice biometric vendors provide static text passphrases as their primary offering due to their simplicity. However, a weakness of this approach is the ease with which a third party could record the person speaking and then use the recording to authenticate (what is called a "recorded playback attack"). This is a genuine concern, given the quality of today's recording devices and the persistence of a determined fraudster. And although there are techniques to detect "liveness", VBG generally recommends using RandomPIN™ (see below) for applications that are at higher risk of playback attacks.

Active Method 2 - Static Numeric Passphrase

This is identical to the Static Text Passphrase, except that the static phrase is a number. One very common approach for this method is to have the end user repeat his or her mobile phone number 3 times to create their voiceprint. This is easy to remember, changes infrequently, etc. Like text-based passphrases, this method is also susceptible to recorded playback attacks. However, if static passphrases must be used, VBG recommends they be combined with an outbound call to a mobile phone to help strengthen confidence (i.e., possession of the phone is like a token, another factor).

Active Method 3 - RandomPIN™

With the RandomPIN™ use case, we require the end user to speak a set of diverse digit strings in order to enroll. To verify, we then ask the user to repeat a random number, typically 4 or 5 digits in length. A new number is generated for every verify attempt, for every user, for every call. The VBG system can provide the random string (of any length) or you can create your own. More than 60% of our customers choose this approach due to its robustness against fraud, speed of verification, and simplicity for the person speaking. There is nothing for the end user to remember, and numbers are easy to speak for young and old alike, across cultures.
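If you choose to generate your own RandomPIN™ strings, a sketch like the following covers the application side. It uses Python's standard `secrets` module so that prompts are not predictable; the function names and prompt wording are hypothetical, and the sketch assumes your flow also confirms that the spoken digits match the prompt. Under that assumption, a replayed recording of an earlier attempt would match a new 5-digit prompt only if the same number happened to be drawn again, roughly 1 time in 100,000.

```python
import secrets

DIGITS = "0123456789"

def random_pin_prompt(length=5):
    """Draw a fresh, unpredictable digit string for a single verification attempt."""
    return "".join(secrets.choice(DIGITS) for _ in range(length))

def spoken_prompt(pin):
    """Space the digits so a text-to-speech prompt reads them one at a time."""
    return "Please say: " + " ".join(pin)

pin = random_pin_prompt()  # call again for every attempt, user, and call
print(spoken_prompt(pin))
```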

Passive Methods

A "passive" use case means that the person speaking does not have to speak anything in particular. In fact, there is no prompting for speech at all, and the end user may often be unaware that they are involved in a voice biometric process. For example, you can capture the speech of a caller during their conversation with a call center agent. The recording of the caller's leg of the conversation can then be used to either create a voiceprint, or be compared to one or more voiceprints for verification or identification. Passive use cases are also referred to as "conversational" or "text independent" forms of voice biometrics.

The most common source of passive speech samples is from call centers. Before considering this approach, please note that many call center systems record both agent and caller into the same (mono) channel. In these situations, you will have additional work to do to capture the audio before it gets to the recording system. The good news: there are a number of new technologies available, and specialty partners we work with, so please contact us if you suspect this may be an issue.
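If your recording platform can store the caller and agent legs on separate channels of a stereo file, isolating the caller is straightforward. The sketch below uses Python's standard `wave` module to copy one channel of an interleaved stereo WAV file into a mono file. It assumes uncompressed PCM audio and that you know which channel carries the caller (here a hypothetical `channel=1`; the mapping varies by platform). It does not help with recordings that are already mixed to mono, which still need the upstream capture described above.

```python
import wave

def extract_channel(in_path, out_path, channel):
    """Write one channel (0 or 1) of an interleaved stereo PCM WAV file to a mono WAV file."""
    with wave.open(in_path, "rb") as src:
        n_channels = src.getnchannels()
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if n_channels != 2:
        raise ValueError("expected a stereo (two-channel) recording")
    frame_size = width * n_channels  # one frame = one sample per channel
    offset = channel * width
    mono = b"".join(
        frames[i + offset:i + offset + width] for i in range(0, len(frames), frame_size)
    )
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(mono)

# Usage (hypothetical file names): extract_channel("call.wav", "caller.wav", channel=1)
```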

The primary benefit of passive approaches is that you do not need to train end users or ask them to say anything specifically. Many business managers consider this the ideal, "frictionless" user experience for voice biometrics. The downside of passive approaches is that you are relying on whatever the speaker happens to be saying as input for the voice biometric system. In order to create a robust voiceprint using passive approaches, we need to capture all the phonemes of the language the user is speaking. This means more time is needed (compared to active approaches).

As a rule of thumb, end users will typically speak all phonemes of their language roughly twice in a 2-3 minute conversation. The VBG system removes silence and non-speech signals from passive recordings to arrive at a measure called "seconds of usable speech" (or SUS). Enrollment samples having SUS values of 60-90 seconds or more will typically provide very good results. We can create voiceprints with only a few seconds of speech, but they will lack phonetic diversity (and will not lead to robust or accurate performance). So, consider 30 seconds of usable speech to be the minimum amount of speech for a successful passive enrollment (assuming this speech has phonetic diversity). And remember, saying "yes", "no", "umm" and other short, single-syllable words multiple times will not provide good speech diversity.
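The idea behind SUS can be illustrated with a crude energy-based speech detector: split the audio into short frames, drop frames that are too quiet to be speech, and total the rest. This is only a sketch, not VBG's method; the 20 ms frame size, the energy threshold, and samples being floats in [-1, 1] are assumptions, and a real system would also need to judge phonetic diversity, which this does not measure. The quality bands follow the rule of thumb above.

```python
def seconds_of_usable_speech(samples, sample_rate=8000, frame_ms=20, energy_threshold=0.001):
    """Approximate SUS: total duration of frames loud enough to count as speech."""
    frame_len = sample_rate * frame_ms // 1000
    speech_frames = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean-square energy of the frame
        if energy > energy_threshold:  # quiet frames (silence, faint background) fall below this
            speech_frames += 1
    return speech_frames * frame_ms / 1000

def enrollment_quality(sus_seconds):
    """Rule of thumb: 60+ seconds of usable speech is very good; 30 seconds is the minimum."""
    if sus_seconds >= 60:
        return "very good"
    if sus_seconds >= 30:
        return "meets minimum"
    return "below minimum"
```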

Passive Method - VBG NaturalSpeech™

The VBG NaturalSpeech™ method addresses all forms of passive speech collection. This use case can accept any speech recording for enrollment or verification, provided that it contains only one speaker and is in a format we accept. Once you have a robust VBG NaturalSpeech™ voiceprint, it is possible to verify with only a few seconds of speech -- a short sentence, phrase, digits, almost anything. And, VBG NaturalSpeech™ allows you to create multiple separate fraudster (i.e., "blacklist") voiceprint databases to determine if fraud is occurring.

Verify or Identify or Both?

By far the most common problem companies solve with voice biometrics is verification. Voice verification allows you to take a speech sample from someone claiming a specific identity, and compare it to the stored voiceprint for that identity. This is a 1:1 process that is extremely fast. Verification needs only a couple of seconds of speech, and the voice biometric scoring process is sub-second. VBG systems process many millions of verification requests a year.

However, some situations, particularly fraud detection and prevention, require that a speech sample be compared to multiple voiceprints in order to determine the best match. Voice-based fraud can occur in many scenarios, such as rampant cheating on language tests, where an individual pays a professional test-taker to take the exam for them. VBG technology can detect the presence of a professional test-taker by analyzing past tests and comparing them with current speech samples.

Far more prevalent is the fraud that occurs in a call center or IVR system. Some utilities and other organizations with field-based sales personnel have issues where friends or co-workers of a salesperson pretend to be a customer agreeing to purchase services, so that the salesperson can earn a commission dishonestly. VBG systems detect this type of fraud. Within a call center conducting retail sales, "blacklists" can be set up and the speech of a customer making a purchase checked against them. Is this person claiming to be a new customer actually an existing customer? Or is this person someone who previously defrauded you? VBG systems also detect these types of fraud.

It is important to note that depending on the size of the blacklist, comparing a speech sample to all voiceprints in the database may take significant processing time -- potentially hours. If you suspect that you will require large blacklists and many comparisons per day, VBG can arrange for a customized system to be deployed in our data center -- or can even arrange to put a system in your own data center.
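Because a 1:many search scores each sample against every voiceprint, processing time grows with both daily volume and list size. A back-of-the-envelope estimate makes the scaling visible. The 1 ms per comparison figure below is a hypothetical placeholder, not a VBG benchmark, and real systems may use indexing or pre-filtering to reduce the work.

```python
def identification_compute_hours(samples_per_day, voiceprints, ms_per_comparison, workers=1):
    """Estimate daily compute hours for 1:many search (every sample vs. every voiceprint)."""
    comparisons = samples_per_day * voiceprints
    seconds = comparisons * ms_per_comparison / 1000
    return seconds / 3600 / workers

# Hypothetical workload and per-comparison cost.
single = identification_compute_hours(10_000, 100_000, ms_per_comparison=1.0)
parallel = identification_compute_hours(10_000, 100_000, ms_per_comparison=1.0, workers=32)
print(f"{single:.0f} h on one worker, {parallel:.1f} h across 32 workers")
```

With those placeholder numbers, 10,000 samples a day against 100,000 voiceprints is one billion comparisons: about 278 hours of single-worker compute per day, or under 9 hours spread across 32 parallel workers.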

Using Both. Many customers use both verification and identification because they have separate business processes that benefit from each. Some customers even use verification and identification together within a single scenario. Consider the use of "continuous verification" by a prison system using VBG technology. A prisoner is verified throughout the call, approximately every 10 seconds, to confirm that he or she has not handed off the phone to another prisoner. If a score change is detected during a verify event, the same speech sample is also run through identification against the entire database to determine who the phone was given to. There are other similar scenarios, and VBG technology is flexible enough to be adapted to almost any conceivable situation.
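The continuous verification flow can be sketched as a loop over roughly 10-second windows. The engine calls are passed in as functions because the real scoring and identification APIs are not described here. Treating a score below a fixed threshold as the "score change" trigger is an assumption, and the stand-in scores and identifier in the demo are made up.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple, TypeVar

W = TypeVar("W")

def split_windows(samples: Sequence[float], sample_rate: int = 8000,
                  seconds: int = 10) -> List[Sequence[float]]:
    """Cut call audio into consecutive windows of about `seconds` seconds."""
    step = sample_rate * seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def continuous_verification(
    windows: Iterable[W],
    verify_score: Callable[[W], float],
    identify: Callable[[W], str],
    threshold: float = 0.7,
) -> List[Tuple[int, float, Optional[str]]]:
    """Verify every window; when a score drops below threshold, identify who is speaking."""
    events: List[Tuple[int, float, Optional[str]]] = []
    for index, window in enumerate(windows):
        score = verify_score(window)
        speaker = identify(window) if score < threshold else None
        events.append((index, score, speaker))
    return events

# Toy stand-ins for the real engine calls (hypothetical).
fake_scores = {"w0": 0.91, "w1": 0.88, "w2": 0.32}
events = continuous_verification(["w0", "w1", "w2"], fake_scores.__getitem__,
                                 lambda window: "inmate-0042")
print(events)
```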

What About Languages?

Since voice biometric technology deals exclusively with speech samples, language becomes an important topic. Fortunately, our technology is both text-independent and language-independent. That is, VBG technology can be deployed for any language and for any use case. Our core technology works by measuring how individuals uniquely say things -- not by what language they are speaking (or what they are saying).

And while we are able to score any language and use case with our core technology, we do perform separate speech collection projects and create separate models for each language and use case pairing. This is done to improve accuracy and to customize engine settings to your particular situation. For example, if you have one application where you are using our RandomPIN™ use case for U.S. English speakers, and another application where you are using a Static Text Passphrase for Latin American Spanish users, you should not expect to have robust performance (low EER) using one model. Separate language-use case models allow us to account for the different phonemes and expected content while creating voiceprints or performing verify or identify comparisons. Not every voice biometrics vendor takes this approach.
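In application code, this approach amounts to routing each request by the exact pair of language and use case rather than by language alone. The sketch below is hypothetical: the model identifiers, use case labels, and the choice to fail rather than fall back to another model are assumptions made to mirror the reasoning above, not VBG's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelKey:
    language: str  # BCP 47 tag, e.g. "en-US" or "es-419" (Latin American Spanish)
    use_case: str  # e.g. "RandomPIN", "StaticText", "NaturalSpeech"

# Hypothetical model identifiers: one entry per language and use case pairing.
MODELS = {
    ModelKey("en-US", "RandomPIN"): "model-en-US-randompin",
    ModelKey("es-419", "StaticText"): "model-es-419-statictext",
}

def select_model(language: str, use_case: str) -> str:
    """Return the model built for this exact pairing; never borrow another pairing's model."""
    key = ModelKey(language, use_case)
    if key not in MODELS:
        raise LookupError(f"no model for {language} / {use_case}; a speech collection project is needed")
    return MODELS[key]

print(select_model("en-US", "RandomPIN"))
```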

In the event VBG does not have your language and use case pre-modeled, we will work with you to capture a large enough sample of your language to set it up on our voice biometric engine. There are multiple ways to collect speech, but for completely new languages that VBG has never modeled before, there are few or no VBG fees.

Don't Forget About Consent!

With the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA), biometric privacy laws in a number of U.S. states (such as the Illinois Biometric Information Privacy Act, or BIPA), and a host of other existing and pending laws around the world, consent should now be considered mandatory. Obtaining consent is not unique to voice biometrics, or to biometric vendors, or even to authentication service providers in general; it applies to any company that captures or stores any form of information that can be considered "personally identifiable information" (PII).

Fortunately, most people understand the need to protect their identity, and they are increasingly willing to take extra steps to ensure their private information remains private. And VBG has built-in consent management to help you manage this process for your end users, from initially obtaining consent to perhaps later revoking it.
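On the application side, the same lifecycle can be enforced with a simple rule: no enrollment without active consent. The sketch below is hypothetical (it is not VBG's consent management API) and keeps everything in memory. It deletes the stored voiceprint when consent is revoked, which is one possible policy to confirm with your legal and compliance teams rather than a requirement stated here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional

@dataclass
class ConsentRecord:
    granted_at: Optional[datetime] = None
    revoked_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.granted_at is not None and self.revoked_at is None

@dataclass
class VoiceprintStore:
    consent: Dict[str, ConsentRecord] = field(default_factory=dict)
    voiceprints: Dict[str, bytes] = field(default_factory=dict)

    def grant(self, user_id: str) -> None:
        """Record a fresh grant of consent (timestamped in UTC)."""
        self.consent[user_id] = ConsentRecord(granted_at=datetime.now(timezone.utc))

    def enroll(self, user_id: str, voiceprint: bytes) -> None:
        """Store a voiceprint only if the user currently has active consent on file."""
        record = self.consent.get(user_id)
        if record is None or not record.active:
            raise PermissionError(f"no active consent on file for {user_id}")
        self.voiceprints[user_id] = voiceprint

    def revoke(self, user_id: str) -> None:
        """Mark consent as revoked and (policy assumption) delete the stored voiceprint."""
        record = self.consent.setdefault(user_id, ConsentRecord())
        record.revoked_at = datetime.now(timezone.utc)
        self.voiceprints.pop(user_id, None)
```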

VBG strongly recommends that you speak with your legal and compliance teams, or appropriate outside advisors, to make sure you are abiding by all necessary laws and regulations in ALL locations where you have end users of your systems.

What About Implementation?

At this point we've explained the situations where voice biometrics makes sense, addressed accuracy expectations, discussed financial considerations and provided a calculator, and provided background on possible use cases. If you are ready to get started, VBG strongly recommends that you participate in a free trial of either the VBG Enterprise™ system or VBG Pro™.

We'll set up and conduct your free trial in our U.S. hosting facility, providing you with API access credentials, access to online development resources and code samples, etc. And we will work with your team to get you up and running quickly!

In preparing to test VBG technology, there are three general approaches available to you:

  • Custom Integration. Use REST, SOAP, VXML, or Streaming Media APIs to send audio to our platform directly.
  • Standard Plug-Ins. Use our RADIUS Server, ADFS Server, or Windows Login plug-ins and adapters.
  • User Interaction Options. Use our built-in IVR dialogs, mobile push app, or WebRTC tools for local headsets and microphones.

These options are covered in greater detail within our Integration Examples page. There are of course many other considerations when implementing a voice biometrics solution. However, the best way to get started is to prototype with a free trial account.

Finally, if you have limited time and/or development resources, you may want to consider a VBG partner or one of our existing platform integrations to help simplify any integration concerns. Please see our Partners page for the latest listing of companies we are currently working with.

Thanks for reading this document!
