Glossary of Key Terms
A collection of commonly used terms related to voice biometric technology
Automatic Speech Recognition
Automatic Speech Recognition (ASR), also known as voice recognition, is a technology where spoken words are recognized by the system for specific content. ASR is used in IVR systems, smartphones, cars, computers, and televisions for commands or dictation. ASR is useful in the context of voice biometrics for recognizing and validating the content of spoken voice in text-dependent interactions.
Biometric / Voice Biometric
Biometrics refers to the measurement of a physical feature or repeatable action of an individual. There are many forms of biometric measurements. Examples include fingerprints, retinal/iris scans, facial recognition, hand geometry, signatures, DNA, and of course voiceprints.
Voice biometrics is the measurement of how a person speaks. Voice biometric systems use the measurement of spoken voice to distinguish between people, usually to validate or invalidate a claimed identity or determine identity from a group of people.
Classification is the process for categorizing a person by evaluating voice samples from that person. Examples include male or female, young or old, or emotional state. Classification systems can be used within Identification systems to help reduce the number of comparisons necessary and to increase confidence in scoring. For example, if the person’s gender is clearly male, then it is not necessary to compare against voiceprints that are known to be female.
EER / FAR / FRR
EER, or Equal Error Rate, is the operating point where the percentage of false acceptances is equal to the percentage of false rejections. The lower the EER value, the better, as it is desirable to be both very good at recognizing valid system users as well as very good at screening out imposters and fraudsters. This is the most common term used to judge the accuracy of biometric and other security systems. It can be misleading because it is often a laboratory measurement based on a finite set of samples.
FAR, or False Acceptance Rate, is the percentage of times when the system will incorrectly let an imposter or fraudster in as a valid user. This scenario is sometimes also referred to as a 'Type II' error. Giving unauthorized users access to any system can have profound implications, so it is very important to tune biometric systems to low FAR levels.
FRR, or False Rejection Rate, is the percentage of times when the system will incorrectly reject a valid user. This scenario is sometimes also referred to as a 'Type I' error. Rejecting a valid user is an inconvenience and this can have implications for long-term user acceptance. To help manage these types of errors, tuning is recommended, along with offering retries.
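The relationship between FAR, FRR, and EER can be illustrated with a minimal sketch. It assumes each verification attempt yields a numeric score, that a score at or above the threshold counts as an acceptance, and that the scores below are invented for illustration. The function names are hypothetical and do not belong to any VBG product.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) as fractions at the given threshold.

    Assumption: a score >= threshold counts as an acceptance.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def approximate_eer(genuine, impostor):
    """Try every observed score as a threshold and return the
    (threshold, FAR, FRR) triple where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best


# Made-up scores for illustration only.
genuine_scores = [0.91, 0.85, 0.78, 0.66, 0.95, 0.72, 0.88, 0.59, 0.81, 0.90]
impostor_scores = [0.12, 0.35, 0.48, 0.61, 0.22, 0.70, 0.41, 0.18, 0.55, 0.30]

threshold, far, frr = approximate_eer(genuine_scores, impostor_scores)
print(f"threshold={threshold:.2f} FAR={far:.0%} FRR={frr:.0%}")
```

With these made-up scores, the two rates meet at 10% at a threshold of 0.66. Raising the threshold lowers FAR at the cost of a higher FRR, which is the trade-off that tuning adjusts. With only ten samples per class the result is coarse, which illustrates why an EER measured on a finite sample set can be misleading.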
In the context of voice biometrics, enrollment is the capture of voice samples from an individual with the intent of creating a voiceprint from the unique characteristics of the person’s speech patterns. Enrollment may be either Active or Passive.
Active Enrollment means the person is knowingly engaged in the enrollment process and is speaking what the biometric system asked the person to speak. Passive Enrollment means the person’s voice samples are captured and delivered to the voice biometric system in the general course of other activity, such as when speaking with a call center agent. Without some notification, the person will not be aware that his/her voice is being captured. Passive Enrollment generally requires longer duration of voice samples to fully represent the person’s voice characteristics, such as in the VBG NaturalSpeech™ method.
Classic (non-Deep Learning) voice biometric engines analyze each submitted audio sample and extract unique vocal characteristics using sophisticated audio signal processing techniques. The process that performs these tasks is commonly referred to as 'feature extraction'.
In an Identification application, audio is captured and passed to the voice biometric system without any claim of identity. The voice biometric system extracts information from the audio samples and then compares it to a set of previously enrolled voiceprints. The best match from the set of voiceprints is returned as the identity of the person. If the identity is known for certain to be in the set of voiceprints, then this is a Closed Set Identification. The best match is most likely the identity of the person.
However, in many applications, such as a blacklist fraud detection application, the person may not have a voiceprint enrolled in the database. When identity is not certain, this is called an Open Set Identification. In Open Set Identification, the best match is simply the best possible match of tested voiceprints. The application must then also look at the absolute values of the matching process to determine if the match is strong enough to warrant further research into whether this was in fact that person.
Identification, whether Closed Set or Open Set, requires comparisons against multiple voiceprints. Large voiceprint databases can result in significant computing resources being needed. Therefore, custom deployments may need to be designed to provide appropriate response times.
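The difference between Closed Set and Open Set Identification can be sketched as follows. This is an illustration under stated assumptions, not a description of any real engine: voiceprints are stood in for by short numeric vectors, cosine similarity stands in for the matching score, and the names identify, enrolled, and min_score are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Stand-in matching score; real engines use proprietary scoring."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(sample, voiceprints, min_score=None):
    """Compare one sample against every enrolled voiceprint (1:N).

    min_score=None behaves like Closed Set Identification: the best
    match is always returned. With a min_score it behaves like Open Set
    Identification: (None, score) is returned when even the best match
    is too weak.
    """
    best_id, best_score = None, float("-inf")
    for person_id, voiceprint in voiceprints.items():
        score = cosine_similarity(sample, voiceprint)
        if score > best_score:
            best_id, best_score = person_id, score
    if min_score is not None and best_score < min_score:
        return None, best_score
    return best_id, best_score


# Toy vectors for illustration only.
enrolled = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
    "carol": [0.4, 0.4, 0.9],
}
unknown_sample = [0.1, 0.1, -0.9]

print(identify([0.85, 0.15, 0.35], enrolled))
print(identify(unknown_sample, enrolled))
print(identify(unknown_sample, enrolled, min_score=0.8))
```

The second call still names a best match even though the sample resembles nobody in the set, which is why an open-set application must also check the absolute score. The loop visits every voiceprint, so the cost grows with the size of the database.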
Interactive Voice Response (IVR)
Interactive Voice Response refers to technology in which a person is able to interact with a computer application through a telephone call. The person either uses the telephone keypad or speaks to provide input. The computer application issues commands that cause recorded or synthesized computer voice to play back to the person.
The use of voice biometric authentication in conjunction with IVR systems is a common practice these days. IVR systems provide a convenient means to capture voice samples. Furthermore, IVR systems are often the entry point for contacting a business over the phone, making the IVR system the ideal point to automate the authentication process.
One of the potential weaknesses of a biometric system is its susceptibility to a recorded copy of the biometric being used. For example, for facial biometrics, a fraudster may attempt to use a picture of the correct person to trick the system. Therefore, many biometric systems perform "liveness testing" in order to test whether the biometric sample is from a live person (or not).
For voice, companies use different techniques, such as listening for exactly the same audio as what was submitted in an earlier transaction or identifying noise from recording devices. For applications that are at risk to recorded playback attacks, VBG recommends using RandomPIN™, which inherently includes liveness testing in the process.
More recently, VBG researchers have been developing sophisticated anti-spoofing algorithms using advanced deep learning techniques. We are now able to more accurately determine if a submitted speech sample is a recording, or is synthetic speech.
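The simplest of the techniques mentioned above, checking whether exactly the same audio was submitted in an earlier transaction, can be sketched as follows. This is a deliberately naive illustration: it only catches byte-for-byte identical audio, whereas a production system would need a comparison that tolerates re-encoding and channel noise. The class name ReplayDetector is hypothetical.

```python
import hashlib


class ReplayDetector:
    """Flags audio that is byte-for-byte identical to an earlier submission."""

    def __init__(self):
        self._seen = set()  # digests of all audio submitted so far

    def is_replay(self, audio_bytes):
        digest = hashlib.sha256(audio_bytes).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


detector = ReplayDetector()
first_call = b"\x00\x01\x02\x03"  # placeholder for real audio data
print(detector.is_replay(first_call))  # first submission
print(detector.is_replay(first_call))  # identical audio submitted again
```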
Multi-Factor Authentication (MFA)
An authentication factor refers to a piece of information and/or a technique used to authenticate a person’s identity. Factors are something you have, such as a picture ID or token; something you know, such as password, PIN, or answer to a shared secret; or something you are, such as a biometric in the form of a fingerprint, voiceprint, etc.
Multi-Factor Authentication, or MFA, is a system where multiple factors are obtained during a single session to authenticate an individual, resulting in greater confidence in the authentication. VBG strongly encourages the use of MFA whenever possible.
VBG NaturalSpeech™ is a passive use case supported by VBG where a voiceprint is created from one or more speech samples of unknown content. In order to capture enough phonetic diversity in a person’s voiceprint, the voice samples must be of sufficient duration and content. For best results, typically 45-60 seconds of speech are needed. However, when verifying against a VBG NaturalSpeech™ voiceprint, only 2 seconds of speech are needed.
Voice biometric systems rely on analyzing voice samples. A Playback Attack is when an impostor uses a recording of a person's voice during a verification event in order to fraudulently be "passed" by the voice biometric system.
RandomPIN™ is an active use case supported by VBG where a person enrolls by speaking several sets of digit strings. To verify, the VBG system generates a random number, typically four or five digits. Because the verification phrase changes every time, RandomPIN™ is less susceptible to recorded playback attacks compared to fixed passphrases. The use of random numbers is also very convenient: there is nothing for users to remember.
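The general idea of a random-digit challenge can be sketched as follows. This is not VBG's implementation. It assumes a speech recognizer supplies the digits it heard and a voice biometric engine supplies a similarity score. Both inputs, the 0.9 threshold, and the function names are hypothetical.

```python
import secrets


def generate_pin(length=5):
    """Return a random digit string to prompt the caller with."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def check_response(prompted_pin, recognized_digits, voice_score, threshold=0.9):
    """Pass only if the spoken digits match the prompt AND the voice matches.

    recognized_digits: digits heard by a speech recognizer (hypothetical input).
    voice_score: similarity from a biometric engine (hypothetical input).
    """
    content_ok = recognized_digits == prompted_pin
    voice_ok = voice_score >= threshold
    return content_ok and voice_ok


pin = generate_pin()
print("Please say:", pin)
print(check_response(pin, pin, voice_score=0.95))
```

A recording of an earlier session would carry the right voice but almost certainly the wrong digits, so the content check fails even when the voice score is high.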
A text-dependent voice biometric interaction means that the content of what the person is speaking is essential to the biometric analysis. The use of text-dependent prompting can help to minimize the time needed for enrollment, as well as simplify the process for end users.
Both of VBG's active use cases, Static Passphrase and RandomPIN™, are text-dependent.
A text-independent voice biometric interaction means that the content of what the person speaks is not relevant to the biometric analysis. That is, the voice biometric system is able to analyze the voice regardless of what the person is saying. Enrolling a person’s voice with no constraints on the content means that more speech is typically necessary to capture the unique characteristics of a person’s voice. VBG gets good results when creating voiceprints with 45-60+ seconds of speech.
Our passive use case, VBG NaturalSpeech™, is a good example of a text-independent method. We use this technique to process post-call recordings, or actively authenticate a caller in real-time as they are speaking with an agent.
In a Verification application, audio is captured and passed to the voice biometric system with a specific claim of identity. The voice biometric system extracts information from the audio sample and then compares it to the stored voiceprint for the claimed individual. A threshold score is used to determine whether the user should "pass" and continue in the system or "fail" and perform some alternate process. Unlike Identification, which is a 1:N comparison of a sample to many voiceprints, Verification is a 1:1 process that is extremely fast from a processing perspective.
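The 1:1 decision described above can be sketched as follows. It is a sketch only: toy vectors stand in for voiceprints, cosine similarity stands in for the engine's score, and the 0.9 threshold and the names similarity and verify are hypothetical.

```python
import math


def similarity(sample, voiceprint):
    """Stand-in score; a real engine would use its own scoring method."""
    dot = sum(x * y for x, y in zip(sample, voiceprint))
    norm_s = math.sqrt(sum(x * x for x in sample))
    norm_v = math.sqrt(sum(y * y for y in voiceprint))
    return dot / (norm_s * norm_v)


def verify(claimed_id, sample, voiceprints, threshold=0.9):
    """Compare the sample against the claimed identity only (1:1)."""
    voiceprint = voiceprints.get(claimed_id)
    if voiceprint is None:
        return "fail"  # nobody is enrolled under the claimed identity
    return "pass" if similarity(sample, voiceprint) >= threshold else "fail"


# Toy vectors for illustration only.
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(verify("alice", sample, enrolled))
print(verify("bob", sample, enrolled))
```

Only one comparison is made regardless of how many voiceprints are enrolled, which is why Verification stays fast as the database grows.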
Voice biometric technology works because we all have unique vocal characteristics, even if we say the same thing!
“Feature Extraction” is used to analyze audio samples and distill unique characteristics into a personal voiceprint.
All voiceprints are encrypted and secured within the VBG database. Later, voiceprints are used to verify you.
A voiceprint is a mathematical representation of the unique physiological and behavioral features of a person's voice stored in electronic format. A voice biometric system creates a voiceprint from samples of a person’s voice. A voiceprint is not a recording or file that can be played back or otherwise listened to. Rather, a voiceprint is derived from audio analysis and a feature extraction process, then stored in a proprietary format for each voice biometric company. Voiceprints cannot be reverse-engineered back into original speech, so this gives voiceprints very high security with respect to information storage concerns.
VoiceXML is a markup language based on XML (Extensible Markup Language) that was designed so developers could write interactive voice applications using the same type of programming model as web applications. It is an industry standard and popular within call centers and IVR systems. It does not require specific hardware to run, nor does it require proprietary extensions for any of the major telephone systems providers. Many client applications leverage the simplicity and power of VoiceXML within their IVR systems to gather speech samples from their users and send them to the VBG system.