Basic Voice Biometric Concepts
What is "Voice Biometrics"?
As an area of study, voice biometrics is rooted in the disciplines of mathematics and digital signal processing. Literally, a "biometric" refers to a measurement of some aspect of your body. Voice biometric software measures and models both physiological aspects of your vocal tract (lips, glottis, etc.) and behavioral aspects of your speech (pitch, tone, cadence, etc.). The ability to uniquely model an individual's voice makes voice biometric software attractive for many authentication scenarios.
What is a Voice Biometric Decision Engine?
A voice biometric decision "engine" is a collection of software functions that processes audio signals, extracts relevant vocal information (or features), and creates a statistically representative model of the original speech. Later, new speech samples can be submitted to the engine and compared to the stored model, and the engine generates scores or match probabilities. To improve the accuracy of the decision engine, it is not uncommon for multiple mathematical algorithms to feed into the final decision.
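The extract, model, and score flow described above can be sketched in a few lines of Python. This is a toy illustration, not how any particular engine works: it assumes the feature extraction step has already produced one numeric vector per audio frame, it uses a simple mean vector as the "model", and it uses cosine similarity as the score. The names build_voiceprint, cosine_score, and fused_score are hypothetical, and the weighted average is only one possible way to combine several algorithms.

```python
import math

def build_voiceprint(frames):
    """Average per-frame feature vectors into one model vector (a toy voiceprint)."""
    if not frames:
        raise ValueError("at least one feature frame is required")
    dims = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dims)]

def cosine_score(a, b):
    """Cosine similarity between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fused_score(scores, weights):
    """Combine scores from several algorithms with a weighted average (assumed scheme)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Made-up feature frames standing in for real extracted features.
enrolled_model = build_voiceprint([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
sample_model = build_voiceprint([[2.0, 2.0, 2.0]])
score = cosine_score(sample_model, enrolled_model)
```

Production engines use far richer statistical models than a mean vector, but the shape of the pipeline is the same: features in, model out, score on comparison.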
Typical High-Level Engine Functions
Enrollment is the process where speech samples are submitted to the engine, the engine extracts unique vocal information from the samples, and then the engine constructs a unique voiceprint. VBG's technology allows enrollment to be very flexible. Users can be prompted with simple passphrases, numbers, or natural speech -- almost anything.
Verification is the process where a user first makes a claim of identity (by supplying a user ID, account number, or other identifier) and then submits a speech sample to the engine. The engine creates a temporary voiceprint from the sample, compares it to the user's stored voiceprint, and determines whether there is a match. This one-to-one matching process is fast, accurate, and efficient, making voice verification systems appropriate for high-volume commercial use.
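As a rough sketch of one-to-one matching, the Python below looks up only the voiceprint for the claimed identity and compares the new sample against it. Everything here is an assumption for illustration: voiceprints are plain lists of numbers, the score is cosine similarity, and the 0.8 threshold is arbitrary. The verify function and the enrolled dictionary are hypothetical names, not part of any real engine's API.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(claimed_id, sample_print, enrolled, threshold=0.8):
    """One-to-one check: compare the sample only to the claimed user's voiceprint."""
    stored = enrolled.get(claimed_id)
    if stored is None:
        return False, 0.0  # no enrollment exists for this identity claim
    score = cosine_score(sample_print, stored)
    return score >= threshold, score

# Made-up stored voiceprints keyed by user ID.
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
sample = [0.88, 0.12, 0.31]

accepted, score = verify("alice", sample, enrolled)
rejected, low_score = verify("bob", sample, enrolled)
```

Because only one comparison is made per attempt, the cost of a verification does not grow with the number of enrolled users.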
Identification is the process where a user submits speech samples to the engine without first making any claim of identity. The engine creates a temporary voiceprint for the user, compares it to all stored voiceprints in the database, and determines the best possible match. Because databases can contain millions of voiceprints, identification systems are not yet practical for real-time commercial use. However, identification systems are quite useful for offline forensic analysis.
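The one-to-many search can be sketched the same way. The Python below is a minimal illustration under assumed conditions (voiceprints as plain number lists, cosine similarity as the score); identify and enrolled are hypothetical names. It scores the sample against every stored voiceprint and returns the highest-scoring entry.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify(sample_print, enrolled):
    """One-to-many search: score the sample against every stored voiceprint."""
    if not enrolled:
        return None, 0.0
    scores = {uid: cosine_score(sample_print, vp) for uid, vp in enrolled.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]

# Made-up stored voiceprints keyed by user ID.
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
best_id, best_score = identify([0.1, 0.75, 0.55], enrolled)
```

The loop touches every record, so the work grows linearly with the size of the database, which is the scaling problem the paragraph above describes.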
Classification means that a voice sample is categorized according to some previously defined rules, for instance male or female, young or old, or perhaps according to language or dialect. Classification systems can be used within verification and identification systems to help pare down the set of comparative samples and provide further confidence in scoring.
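To illustrate how classification can narrow the comparison set, the Python sketch below filters enrolled records by class labels before any voiceprint scoring takes place. It assumes a classifier has already predicted labels for the incoming sample and that each enrolled record stores the same kind of labels. The record layout, the label names, and the pare_down function are all hypothetical.

```python
def pare_down(enrolled, predicted_labels):
    """Keep only the records whose stored labels match every predicted label."""
    return {
        uid: record
        for uid, record in enrolled.items()
        if all(record["labels"].get(key) == value
               for key, value in predicted_labels.items())
    }

# Made-up enrollment records with class labels assigned at enrollment time.
enrolled = {
    "u1": {"labels": {"language": "en", "age_band": "adult"}},
    "u2": {"labels": {"language": "es", "age_band": "adult"}},
    "u3": {"labels": {"language": "en", "age_band": "senior"}},
}

# Labels a classifier is assumed to have predicted for the new sample.
candidates = pare_down(enrolled, {"language": "en"})
narrower = pare_down(enrolled, {"language": "en", "age_band": "adult"})
```

Only the surviving candidates would then be scored. A misclassified sample would exclude the true speaker from the search, so a real system would need to weigh that risk against the speed gain.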
Segmentation is a capability where the voice biometric engine can evaluate and separate multiple speakers from a single audio sample. A variety of audio processing techniques are applied to the sample in an attempt to identify the individual speakers within the recording and segment (separate) their speech for further processing. This is a highly advanced capability that is typically used in custom applications for law enforcement officials and military investigators.
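A heavily simplified sketch of the idea follows, in Python. It assumes the recording has already been cut into short windows and that each window has been reduced to a feature vector. Each window is then assigned to the most similar speaker seen so far, or to a new speaker if nothing is similar enough. The greedy clustering, the 0.9 threshold, and the segment_speakers name are assumptions for illustration; real segmentation systems are considerably more sophisticated.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def segment_speakers(windows, threshold=0.9):
    """Greedily label each window with a speaker index, then merge adjacent windows."""
    centroids, counts, labels = [], [], []
    for vec in windows:
        scores = [cosine_score(vec, c) for c in centroids]
        if scores and max(scores) >= threshold:
            idx = scores.index(max(scores))
            n = counts[idx]
            # Update the running mean for the matched speaker.
            centroids[idx] = [(c * n + v) / (n + 1) for c, v in zip(centroids[idx], vec)]
            counts[idx] += 1
        else:
            centroids.append(list(vec))  # start a new speaker
            counts.append(1)
            idx = len(centroids) - 1
        labels.append(idx)
    # Merge consecutive windows with the same label into
    # (speaker, first_window, last_window) segments.
    segments = []
    for i, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], i)
        else:
            segments.append((label, i, i))
    return segments

# Made-up window features: two speakers taking turns.
windows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
segments = segment_speakers(windows)
```

Each resulting segment could then be passed to verification or identification as if it were a single-speaker sample.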
Speech Prompting Techniques
A number of different prompting (or challenge) techniques can be used to obtain speech samples within voice biometric systems. The technique used depends on a number of factors such as: level of security desired, type of people using the system, amount of time available, design preferences, etc. The two main classifications of prompting techniques are:
Text-Independent means that the voiceprint being compared is not necessarily based on the same words or phrases as were used to create the original voiceprint. During enrollment, speech is sampled to gather as much phonetic information as possible. During verification, almost any reasonable words, phrases, or numbers can be prompted for, with responses ideally containing similar phonemes. Text-independent challenge techniques typically require a longer enrollment process in order to sample a diverse range of phonemes. However, text-independent systems are generally viewed as being more flexible and user-friendly, and some believe they offer better security characteristics.
Text-Dependent means that the voiceprint being compared is based on the same words or phrases (or a subset) as were previously captured. Because only a limited set of words, phrases, or numbers is used, text-dependent systems are typically considered to be easier for users and simpler to implement. And because there are limits on the phrases being captured, enrollment is typically shorter. For the same reason, however, text-dependent systems are sometimes viewed as being less secure: hackers and fraudsters presumably have fewer combinations to try, or can possibly record static phrases and play them back to pass authentication.
In practice, there are very good and appropriate reasons to use both types of challenges. There are too many variables in each application scenario to automatically favor one technique over another. Lately, many companies are beginning to experiment with mixed use cases where both text-independent and text-dependent challenges are used in multiple prompting loops, all in an effort to create even higher security.
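One way such a mixed prompting flow might look is sketched below in Python. It is an assumption-laden illustration: the first loop repeats the enrolled passphrase (text-dependent), the second asks for freshly generated random digits (treated here as a text-independent challenge), and the user passes only if every loop meets an arbitrary 0.8 threshold. The score_response callback is a hypothetical stand-in for the engine, and build_challenges and run_prompting_loops are made-up names.

```python
import random

def build_challenges(enrolled_passphrase, rng, digit_count=4):
    """Build one text-dependent prompt and one randomized text-independent prompt."""
    digits = " ".join(str(rng.randrange(10)) for _ in range(digit_count))
    return [
        {"mode": "text-dependent", "prompt": enrolled_passphrase},
        {"mode": "text-independent", "prompt": f"Please say the digits {digits}"},
    ]

def run_prompting_loops(challenges, score_response, threshold=0.8):
    """Run every prompting loop and require each one to meet the threshold."""
    results = []
    for challenge in challenges:
        score = score_response(challenge)  # hypothetical call into the engine
        results.append((challenge["mode"], score, score >= threshold))
    return all(passed for _, _, passed in results), results

rng = random.Random(7)  # seeded only to make the example repeatable
challenges = build_challenges("my voice is my password", rng)

# Stub scorer: the replayed passphrase scores well, the fresh digits do not.
def stub_scorer(challenge):
    return 0.9 if challenge["mode"] == "text-dependent" else 0.7

passed, results = run_prompting_loops(challenges, stub_scorer)
```

In this stubbed run the overall result is a rejection, because a recording of the static passphrase does not help with the randomly generated second prompt.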
