Wondering if Voice Biometrics is for You?

If you are new to voice biometrics and have a goal of determining whether it may be useful to your business situation, then this section is for you.

Voice Biometrics is a technology. As the name implies, it means measuring the characteristics of your voice biology - how you make sound, the characteristics of your vocal tract, and other information that a recording of your voice reveals to a sophisticated voice biometric engine.

Measuring voice is useful because of a simple fact: nearly everyone has a set of voice characteristics that distinguishes them from other people, much as everyone has (nearly) unique fingerprints. Companies like Voice Biometrics Group have developed algorithms that accept samples of a person’s voice and extract the characteristics of the voice. We then store those characteristics as a “voiceprint.” If I have a voiceprint of someone, then I can use it for two valuable purposes:

  • I can confirm the identity the person claims by comparing the characteristics of the new voice sample to the voiceprint in the database. This is like using your voice as a password. This use case is Authentication. You may hear this use case called Voice Verification, Speaker Verification, or Speaker Authentication. The point of Authentication is most often to protect whatever comes after the authentication step.

  • I can determine who the person is by comparing the characteristics of a new voice sample against all the voiceprints in the database. This is useful for fraud detection such as checking whether a person claiming to be a new customer is in fact someone in your fraudster voiceprint database or a repeat customer. This use case is Identification. You may hear it called Speaker Identification or Voice Identification. The point of Identification is most often to detect fraud within a system.
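To make the difference between the two use cases concrete, here is a minimal sketch. It assumes a voiceprint can be reduced to a short list of numbers and compared with cosine similarity; both are illustrative assumptions, not a description of how any vendor's engine works. The names `verify` and `identify` and the 0.9 threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, claimed_id, voiceprints, threshold=0.9):
    """Authentication (1:1): compare the sample with the one claimed voiceprint."""
    return cosine_similarity(sample, voiceprints[claimed_id]) >= threshold

def identify(sample, voiceprints, threshold=0.9):
    """Identification (1:N): compare the sample with every stored voiceprint.

    Returns the best-matching speaker ID, or None if nothing reaches the threshold.
    """
    best_id, best_score = None, threshold
    for speaker_id, voiceprint in voiceprints.items():
        score = cosine_similarity(sample, voiceprint)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id

# Toy three-number "voiceprints" (made-up values for illustration only).
voiceprints = {
    "jane": [0.9, 0.1, 0.3],
    "raj": [0.2, 0.8, 0.5],
}
new_sample = [0.88, 0.12, 0.31]

print(verify(new_sample, "jane", voiceprints))  # claim "I am Jane"
print(verify(new_sample, "raj", voiceprints))   # claim "I am Raj"
print(identify(new_sample, voiceprints))        # no claim: who is this?
```

Notice that Authentication makes one comparison, while Identification makes one comparison per voiceprint in the database, which is why Identification takes longer as the database grows.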

If you are still reading at this point, you probably already have a problem that one or both of these two capabilities will solve. For example, authentication is everywhere and in everything related to accessing information or systems. Non-biometric methods commonly in place today are "something you know," such as passwords and personal information like the house number where you lived when you were 12, or "something you have," such as your mobile phone or a security fob. However, these classic "something you know" and "something you have" non-biometric authentication methods are vulnerable to fraud through social engineering and theft. Adding a biometric factor reduces the vulnerability of your system to unauthorized access by adding “something you are.”

Furthermore, professional fraud is becoming prevalent, especially for retail organizations serving customers over the phone and online. Identifying a repeat fraudster by comparing a voice sample against a blacklist of known fraudster voiceprints gives you the ability to take action to detect and prevent fraud. Fraud detection is not the only way to use Identification. One of our customers uses Identification to automatically determine the identity of patients in a healthcare clinical trial, enabling a double-blind study such that the medical clinician does not know the identity of the patient when the patient reports data. How cool is that!? Check out these pages to see more details of how our customers are using voice biometrics to solve their problems.

Will Voice Biometrics be Accurate Enough?

If you are at the exploration stage, understanding the level of accuracy is an essential factor in deciding whether voice biometrics will be helpful to you.

First, you should recognize that no biometric is 100% accurate. For example, a 2014 study on iris recognition determined system accuracy could be between 90 and 99%, a broad range. Voice is likewise in this range for a variety of reasons. However, even with imperfection, a biometric is a valuable factor in security with a clear ROI.

The reason is that security is about managing risk. No method can ever provide 100% certainty. Therefore, security experts combine factors to lower risk. And each factor will typically have a score, the combination of which results in a risk profile. Voice Biometrics provides a contribution to the overall risk score, enabling a security expert to ignore lower risk transactions and focus on higher risk transactions. Or in the case of fraud detection, even if Voice Biometrics does not catch all fraud, if it catches some, the company is usually better off.

The general measure of “accuracy” for a Voice Biometric system is the Equal Error Rate, or EER. Two types of errors may occur on a single transaction. One error is letting an impostor through (False Accept), and the other is blocking a valid user (False Reject). The EER is the error rate at the confidence setting where the False Accept rate and the False Reject rate are equal. Because a Voice Biometric system is a statistical system based on probabilities, these errors are a trade-off. If you set your confidence levels higher to prevent impostors, you will also block more valid users, which causes annoyance. You influence this trade-off by telling your Voice Biometric vendor how you want your configuration to balance security against annoyance. Note that EER applies to a single transaction. By allowing retries, you materially increase the probability that a valid user gets through on a second or third attempt, even if you originally set confidence levels to strongly reject impostors.
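The trade-off and the effect of retries can be sketched with a few lines of arithmetic. The scores below are made up for illustration, the function names are hypothetical, and the retry calculation assumes each attempt is independent, which real attempts (same noisy room, same cold) often are not.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at one threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))

def approximate_eer(genuine_scores, impostor_scores):
    """Scan candidate thresholds and return (threshold, eer) where the
    False Accept and False Reject rates are closest to each other."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for threshold in candidates:
        far, frr = error_rates(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, far, frr)
    _, threshold, far, frr = best
    return threshold, (far + frr) / 2

def pass_probability(single_attempt_pass_rate, attempts):
    """Chance of passing at least once, assuming independent attempts."""
    return 1 - (1 - single_attempt_pass_rate) ** attempts

# Made-up match scores: valid users against their own voiceprint,
# and impostors against someone else's voiceprint.
genuine_scores = [0.95, 0.91, 0.88, 0.84, 0.79, 0.72, 0.66, 0.93, 0.90, 0.58]
impostor_scores = [0.10, 0.22, 0.35, 0.41, 0.47, 0.52, 0.60, 0.68, 0.30, 0.18]

threshold, eer = approximate_eer(genuine_scores, impostor_scores)
print(f"Approximate EER {eer:.0%} at threshold {threshold}")

far, frr = error_rates(genuine_scores, impostor_scores, threshold)
print(f"Valid user passes within 3 tries: {pass_probability(1 - frr, 3):.1%}")
print(f"Impostor passes within 3 tries:   {pass_probability(far, 3):.1%}")
```

The last two lines show both sides of allowing retries: valid users get through far more often, but an impostor also gets more chances, so the number of retries is itself part of the security-versus-annoyance balance.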

A measured EER is not exactly the error rate you will see in the real world. EER depends on the data set used for the measurement, and it can easily be manipulated by ignoring samples that hurt results. Therefore, we recommend that you compare Voice Biometric systems based on real-world users, including running trials within your environment, rather than relying on published EER results with no insight into the data used for the test.

Lastly, recognize that real-world implementations are subject to variances that go beyond the mathematics of the Voice Biometric engine. By far the biggest reason for frustration in a voice system is not getting clear audio. If the audio is noisy or if the wrong information is spoken to the Voice Biometric system, then the Voice Biometric engine, just the same as a person, will have trouble using the audio to make an accurate determination. Colds and other factors that affect a person's voice will in turn affect the integrity of matching a voice sample to a voiceprint. The extent of the impact depends on how much the person's voice is altered.

Also, audio sounds different depending on the device and communication channel used, and that difference influences results. Mobile phone networks use different compression techniques than landline phones and therefore affect your voice pattern. Therefore, we recommend providing simple training to help anyone who will be speaking to a Voice Biometric system learn how to be a good user of the system. We will provide you with the materials.

If you can tolerate real-world results in the range of 90% and up, including issues due to noise and other factors in the environment, then Voice Biometrics is accurate enough for your needs and you have the potential to reap benefits. The next question to answer is whether getting those benefits makes sense financially.

What Kind of ROI Will I Get?

At this point, the two decision barriers you passed are knowing that Voice Biometrics is relevant to your business problem and that Voice Biometrics is sufficiently accurate. The next question to answer is whether implementing a Voice Biometric project will give you a return on your investment. This is an important question for us, too. If we do not provide sufficient accuracy and performance at a cost that enables a successful business case for you, we don’t have a business.

You will find that the business case for voice biometrics typically involves one or more of three business drivers:

  • Reducing exposure to fraud
  • Reducing costs for authentication
  • Improving the customer experience

Fraud reduction is always specific to your particular situation. Sometimes fraud is from internal employees, sometimes from professionals who target call center agents with social engineering. Voice Biometrics Group does not have specific fraud reduction numbers to share for any particular client, but the numbers are usually large, well in excess of the cost to implement a solution. Read through client deployments to see examples of business drivers for live production deployments that may be similar to your situation.

A second driver, reducing costs for authentication, most often arises in the context of a call center. Voice Biometrics automates the authentication process on a large percentage of calls that normally require agent time. In this scenario, a licensing model with a fee per transaction usually yields an ROI in less than a year. For example, the cost to authenticate a call with an agent is equal to the seconds of agent time for the authentication multiplied by the cost of the agent per second. If an IVR that includes Voice Biometrics authenticates the caller before transferring to an agent, or if a Voice Biometrics system authenticates the caller during the beginning of the call with the agent (using a passive technique described later), then that saves agent time. Savings in agent time can often be at least 20 seconds and as much as 72 seconds. When you run numbers with our ROI Calculator, you should find savings such that the overall business case works, even if not every call is automated and even with setup and startup costs.
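The agent-time arithmetic above can be sketched as a back-of-the-envelope calculation. Every number in the example (call volume, automation rate, agent cost, per-transaction fee, setup cost) is a hypothetical placeholder, not a quoted price or a client result; only the 20 seconds saved comes from the range mentioned above. This is not the ROI Calculator itself.

```python
def annual_net_savings(calls_per_year, automation_rate, seconds_saved_per_call,
                       agent_cost_per_hour, fee_per_transaction):
    """Agent time saved on automated calls, minus per-transaction fees."""
    agent_cost_per_second = agent_cost_per_hour / 3600
    automated_calls = calls_per_year * automation_rate
    agent_savings = automated_calls * seconds_saved_per_call * agent_cost_per_second
    transaction_fees = automated_calls * fee_per_transaction
    return agent_savings - transaction_fees

def payback_months(setup_cost, net_savings_per_year):
    """Months needed to recover setup costs; None if there are no net savings."""
    if net_savings_per_year <= 0:
        return None
    return setup_cost / (net_savings_per_year / 12)

# Hypothetical contact center: replace every value with your own numbers.
net = annual_net_savings(
    calls_per_year=1_000_000,
    automation_rate=0.7,          # not every call is automated
    seconds_saved_per_call=20,    # low end of the range mentioned above
    agent_cost_per_hour=36.0,
    fee_per_transaction=0.05,
)
print(f"Net annual savings: ${net:,.0f}")
print(f"Payback period: {payback_months(50_000, net):.1f} months")
```

Because the savings scale with seconds saved per call, it is worth rerunning the numbers with the time you actually measure in your own call center.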

In addition to the traditional call center savings, voice biometrics makes a new set of applications possible that would be far too costly to consider otherwise. Some of the examples in our client deployments fit this category, such as automating offender monitoring and verifying student participation in online education.

The third driver for Voice Biometrics is making the authentication process easier for customers. In the call center scenario, not only does it save time for the agent, but it also saves time for the caller. Furthermore, the caller will hopefully recognize you are using advanced technology to protect them and provide greater security while also reducing time spent telling an agent where you lived when you were 12.

The return on improved customer experience, and its impact on revenue and customer retention, is difficult to measure. The Return On Effort metric may be one measure, as well as Customer Satisfaction and Net Promoter Scores. However, most companies have multiple competing ideas for improving customer experience as well as many other projects that provide a “hard” ROI. Therefore, the first two drivers, reducing fraud exposure and reducing costs for authentication, are more commonly the drivers that compel companies to act.

Calculating ROI depends on both what you invest and what you get. The three business drivers above define what you get. What you invest falls into costs for the biometric system plus the resources you must apply to incorporate voice biometrics into your business processes. We have found that most companies have different types of constraints that require flexible pricing models in order to fit the business case. We have flexible pricing and are able to adapt our pricing to accommodate special needs. Use our Voice Biometric ROI Calculator to assess the business impact of applying voice biometrics in a typical contact center.

How to Apply Voice Biometrics Technology

If your application has passed the third barrier of having a viable business case, you may now move on to consider the technical feasibility of implementing Voice Biometrics within your environment.

First, Get Audio

Voice Biometric systems analyze voice recordings; therefore, the first step in implementation is knowing how you will get voice samples of the person for whom you are creating a voiceprint. You need a way to capture audio of the person you want to authenticate and/or identify such that you have only that voice and not a mix of that voice and other voices or sounds.

The most common method to capture audio for a voiceprint is over the telephone. When you are speaking on the phone, your voice is obviously going over the telephone line, whether a mobile phone, landline phone, or VoIP phone, and it can be recorded using a range of equipment made for this purpose, possibly without you even knowing. If you have an IVR, the IVR will likely be able to record audio that is then passed to a Voice Biometric engine.

A second way is through a web browser on a computer, tablet, or phone equipped with a microphone. The web browser serves the purpose of a telephone. Because nearly every modern laptop has a microphone, capturing audio through the computer is practical for many applications where the user will be near or on a computer. Likewise, voice capture is easily done on a mobile device with a native application.

Lastly, you can use any other method that gets audio through a microphone and records it in a format suitable for a voice biometric engine. For example, one of our customers records people speaking during a spoken language test and uses that audio to create voiceprints. Another example is using stored recordings in your call center. Regardless of how you get the audio, note that you want to capture only the audio of the desired speaker. If you include other audio, then you will have to do extra work to separate the audio before using it with a voice biometric engine.

Then Determine User Experience

Once you have a means of capturing audio, your next step is to determine what, if anything, you will ask the person of interest to do in order to capture that person’s voice and send it to the Voice Biometric engine to make a voiceprint. This step is called “enrollment.” Once the user is enrolled, you can authenticate the user by capturing a new voice sample that is compared against the voiceprint from enrollment. And you can compare that new sample against all other voiceprints in the database to determine the identity of the speaker. Several options exist for enrollment and verification. Each has pros and cons. The following section explains the options.

Passive Method, Also Known as NaturalSpeech™

The first distinction among options is “active” vs. “passive.” A passive enrollment means that the person speaking does not have to take any action specific to enrolling. You will capture the audio of the person from a conversation the person is having with a call center agent, for example, or from how that person is responding to a teacher. You will then submit the audio to the voice biometric system to generate a voiceprint. VBG has developed the NaturalSpeech™ use case for these scenarios.

The most common source of audio data is recordings in a call center. However, usually these recordings are saved in a form that mixes all the speakers into a single audio track and compresses the voice as much as possible. Consequently, these recordings are often not going to be an easy source for voice biometric systems. You will likely have some work to do to either capture the audio before it gets to the recording system or perform post-processing on the audio in order to distinguish the speakers.

If you have audio in a form that is easily ingested by the voice biometric system, the obvious benefit of this passive approach is that you do not need to train the speaker or ask her to do anything extra in order to enroll her voiceprint. Many business managers consider this the ideal User Experience for voice biometrics.

The downside of this approach is that you are relying on whatever the speaker happens to be saying as the audio data for the voiceprint. Not controlling the content of what is spoken creates inefficiencies. The first inefficiency relates to the fact that creating a viable voiceprint requires sampling enough unique spoken sounds from a person, what we call “phonetic diversity.” You will get a high quality voiceprint by having the person speak all the phonemes in her language at least twice. For English, this means 44 phonemes sampled twice.

As a rule of thumb, it turns out that a person will speak all phonemes at least twice in a two- to three-minute conversation. After removing silence and the other person’s voice, the true spoken speech, what we call “Seconds of Usable Speech,” may be around 30 seconds, but closer to 60 or 90 seconds is better. Although you can use less speech to create a voiceprint, the voiceprint will not necessarily be a good representation of the speaker’s voice characteristics. Consider 30 seconds of usable speech to be the minimum amount of speech for a successful passive enrollment, assuming this speech has phonetic diversity. Thirty seconds of someone saying "yes" and "no" many times does not create a robust voiceprint.
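As a rough sketch of this rule of thumb, the check below adds up one speaker's talk time and counts phoneme coverage before deciding whether a recording is worth enrolling. It assumes some upstream tool has already labeled who spoke when and counted the phonemes spoken; that tool, the segment format, and the function names are all hypothetical.

```python
MIN_USABLE_SECONDS = 30       # minimum for a passive enrollment
TARGET_USABLE_SECONDS = 60    # lower end of the preferred 60-to-90-second range
ENGLISH_PHONEME_COUNT = 44
MIN_OCCURRENCES = 2           # each phoneme spoken at least twice

def usable_seconds(segments, target_speaker):
    """Total talk time of one speaker; silence and other voices are excluded.

    Each segment is a (speaker, start_seconds, end_seconds) tuple.
    """
    return sum(end - start
               for speaker, start, end in segments
               if speaker == target_speaker)

def assess_enrollment(segments, target_speaker, phoneme_counts):
    """Classify a recording as 'insufficient', 'minimum', or 'good'."""
    seconds = usable_seconds(segments, target_speaker)
    covered = sum(1 for count in phoneme_counts.values()
                  if count >= MIN_OCCURRENCES)
    has_diversity = covered >= ENGLISH_PHONEME_COUNT
    if seconds < MIN_USABLE_SECONDS or not has_diversity:
        return "insufficient"
    if seconds < TARGET_USABLE_SECONDS:
        return "minimum"
    return "good"

# A short call: the caller speaks for 37.5 seconds in total.
segments = [("caller", 0.0, 12.5), ("agent", 12.5, 20.0), ("caller", 20.0, 45.0)]

# Placeholder labels standing in for the 44 English phonemes, each heard twice.
varied_speech = {f"phoneme_{i}": 2 for i in range(ENGLISH_PHONEME_COUNT)}
# Only "yes" and "no" repeated: plenty of audio, very few distinct sounds.
yes_no_only = {"y": 40, "eh": 40, "s": 40, "n": 40, "ow": 40}

print(assess_enrollment(segments, "caller", varied_speech))
print(assess_enrollment(segments, "caller", yes_no_only))
```

The second case fails on phonetic diversity alone, no matter how many seconds of "yes" and "no" were recorded.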

A second inefficiency comes from managing audio files. This can be a complex programming task that links multiple systems together to keep track of people and their matching audio and the part of the audio used for enrollment. The impacts are higher cost and longer time to implement.

Active Method

In contrast to the passive approach, the active method requires knowing participation of the person whose voice you are capturing. The obvious downside of requiring participation is that the speaking party has to follow instructions and behave in an expected manner. Furthermore, active participation requires the speaker to devote time to the process, especially for enrollment. You may find that you do not trust your customers or employees to comply with the requested behavior, or that they may not be willing to devote time to the process. These concerns are two common reasons why companies choose not to implement a voice biometric project at all or insist on using only the passive method.

On the other hand, the active use case has distinct advantages. First, because the nature of what is spoken is known in advance, the voice biometric engine can be more efficient. Our engine works with both phrases and numbers. It turns out that speaking numbers (or you can say “digits”) presents distinct advantages for a voice biometric system because it is easy to validate the content in many languages, the amount of speech is relatively short, and numbers are easy for people to remember.

A second advantage of the active use case is awareness. By simply having the threat of a voice biometric system, the rate of fraud decreases. Some people will of course try to crack the voice biometric, but our customers tell us that the overall attempts at cheating decline significantly.

A third advantage is operational simplicity. In a passive system discussed earlier, what we call a NaturalSpeech™ system where the person can speak anything, acquiring enough audio and managing audio data adds complexity. With an active use case, on the other hand, the speech samples are immediate and short and require no separate storage or management.

Active Method Options

With an active method for applying voice biometrics, you have further decisions to make regarding the user experience and the level of security. Specifically, you have the following options for what a person speaks in order to create a voiceprint:

Static Passphrase (Text):

As the name implies, this means the user speaks the same thing (“static”) that is a phrase of text. For example, the phrase may be, “My voice is my password. Please verify me.” The person speaks this phrase three times or so to enroll and one time to verify. This is simple and easy. The phrase can be the same for everyone or you could make the phrase unique for each person, although unique phrases create operational complexity.

Static Text is what most voice biometric vendors provide as their most common use case due to simplicity. A weakness of this approach is the ease with which a third party could record the person speaking and then use the recording to authenticate. This is a genuine concern given the quality of recording devices and a determined fraudster. Vendors use “liveness testing” to thwart recorded playbacks. Liveness testing typically looks for machine noise from recording devices and other audio characteristics. These checks are not effective, so more recently, liveness testing means adding a second prompt to collect a sample that is not the static passphrase. Because of the extra step, we recommend using RandomPIN™ for applications that are at higher risk of playback attacks.

Static Passphrase (Number):

As the name implies, this means speaking the same number each time. Numbers tend to be easier to remember and in many cases a person already has at least one long digit string memorized, like a phone number, that you can use for a voiceprint. You require the person to speak the number string three times for enrollment and one time for verification. Like Static Text, this type of voiceprint is more vulnerable to recording by a fraudster, so we recommend combining its use with other techniques, such as outbound telephone calls to a trusted phone number. If outbound calls are not possible, these Static techniques might be best left for lower security scenarios.


Random Number (RandomPIN™):

To thwart a fraudster who tries to make a recording of a person’s voice, you can require the person to speak a random number rather than a fixed phrase. VBG has developed the RandomPIN™ use case for these scenarios. With the RandomPIN™ use case, you require the person to speak a set of six digit strings for enrollment, each five digits long, which we have crafted for phonetic diversity. To verify, you ask for a single five-digit string that is different every time. You could instead use a four-digit string, which is easier for people to remember when speaking back. To make the system more secure with four-digit strings, you can ask for a random four-digit string two times, thus collecting eight digits, a good approach if the user is an infrequent user but the security level needs to be high. Our system can provide you the random string or you can create your own. More than 60% of our customers choose this approach due to the robustness against fraud, speed of verification, and simplicity for the person speaking. She has nothing to remember and she will likely be highly competent at speaking numbers.
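If you choose to create your own random strings, a minimal sketch might look like the following. The function names are hypothetical, and the strings here are purely random: they are a stand-in for, not a reproduction of, enrollment strings crafted for phonetic diversity.

```python
import secrets

DIGITS = "0123456789"

def random_digit_string(length=5):
    """A digit string drawn from a cryptographically strong random source."""
    return "".join(secrets.choice(DIGITS) for _ in range(length))

def enrollment_prompts(count=6, length=5):
    """Six five-digit strings for the person to speak during enrollment.

    Purely random here; an assumption standing in for crafted strings.
    """
    return [random_digit_string(length) for _ in range(count)]

def verification_prompt(previous_prompt=None, length=5):
    """A fresh string that never repeats the prompt used last time."""
    while True:
        prompt = random_digit_string(length)
        if prompt != previous_prompt:
            return prompt

def double_four_digit_prompts():
    """Two different four-digit strings, collecting eight digits in total."""
    first = verification_prompt(length=4)
    second = verification_prompt(previous_prompt=first, length=4)
    return [first, second]

print(enrollment_prompts())
print(verification_prompt(previous_prompt="12345"))
print(double_four_digit_prompts())
```

Because the verification string changes on every attempt, a recording of an earlier session does not contain the digits in the order the next prompt asks for.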

Verification vs. Identification - Or Both?

By far the most common problem companies solve with voice biometrics is Authentication. This is verifying that the person claiming to be Jane is in fact Jane by comparing a new voice sample with Jane’s voiceprint in the database. This is a fast process, taking a few seconds to capture a voice and a fraction of a second to compare the new voice sample with a voiceprint.

Some situations, particularly fraud detection and prevention, require comparing the new voice sample against a large number of voiceprints. The most common example is a “Blacklist” scenario. The Blacklist is a list of known fraudsters for whom you have a voiceprint. For fraud prevention, when signing up a new customer, capture a voice sample, then compare the new customer’s voice sample against all the voiceprints in the Blacklist. If you get a “hit,” then you know to be wary of that new customer.

Depending on the size of the Blacklist, comparing against all the voiceprints in the database may take significant processing time, perhaps hours, so keep that in mind if you are interested in Identification. Read our customer stories to learn about other innovative ways to use Identification besides a Blacklist, particularly a clever application that performs Continuous Verification and then Identification if it discovers fraud.

What About Languages?

Fortunately, language is not a big issue. Voice Biometrics technology uses the way people make sounds as a way to reveal unique characteristics. Because the voice biometric engine measures how you speak rather than what you say, it does not care about language. However, tuning the voice biometric engine for the best results means taking into account that different languages use different sounds. Spanish, for example, has a different set of phonemes than English (fewer, in fact, with some overlap). For this reason, we separate languages as well as use cases, resulting in use-case-language combinations.

Suppose, for example, that you have a population of users in the US who need to use biometrics in both Spanish and English. We would set you up with one account for Random Number English and a separate account for Random Number Spanish as the way to optimize results.

We currently support 35 languages in Production. In the unlikely event we do not have your language, we will work with you to capture a large enough sample of your language to set up your language on our voice biometric engine. We call the capture of voice samples a Data Collection Study. For each new language, we like to gather voice samples from 75 males and 75 females. That is usually sufficient to give us an excellent start. You will find that some biometric companies require substantially more data collection. This is due to the fact that their underlying algorithm requires more data for training, often on the order of a few thousand people.

What About Implementation Process?

If you made it this far, you’re ready to learn how to implement a solution. Three possibilities exist:

  • You directly access our RESTful APIs. This option is appropriate for companies who wish to embed voice biometrics into their own product or want to include voice biometrics into a business process. For this option, skip over to our Developer Page to learn more about our RESTful APIs and programming model. Check out our mobile app development page to learn how native mobile applications also can leverage our RESTful APIs. Request a Free Trial to let your developers work with our APIs. Most developers can get familiar with our API in minutes using our online tools and up and running in a few days using their favorite development environment.

  • We deploy an IVR-based system for you and integrate with your business process. We offer a cloud or premises-based IVR system to save you the hassle of implementing a system. For example, see our resource page on Multi-Factor Authentication to learn how to leverage our IVR quickly and easily in your business process. Request a Free Trial to test out our complete IVR and voice biometric solution in either English or Spanish. Changing our IVR application to your language of choice requires only a small fee for new prompts.

  • Finally, if you are looking for a turn-key solution where voice biometrics is an element within a broader security solution, we will connect you with one of our partners. Simply fill out a contact form and make note of this in your request. We will participate with our partner to design and implement a solution appropriate for you.