Recent conference papers and publications

VBG PhD Research Interests

Highlights from Our Research Team

For a number of years, VBG has maintained an active relationship with a Tier 1 Research University. Through VBG's sponsorship of PhD students, we are able to work together to explore topics of interest in artificial intelligence, including deep learning, machine learning, natural language processing, and digital signal processing, among others. Not all of the research is commercially focused, but many projects do have commercial implications. Here we highlight some of the research team's recent work that VBG is adapting for its products.

Abstract

The performance of automatic speaker verification (ASV) systems can be degraded by voice spoofing attacks. Most existing work has aimed to develop standalone spoofing countermeasure (CM) systems, and relatively little has targeted an integrated spoofing-aware speaker verification (SASV) system. In the recent SASV challenge, the organizers encouraged the development of such integration by releasing official protocols and baselines. In this paper, we build a probabilistic framework for fusing the ASV and CM subsystem scores. Based on this framework, we further propose fusion strategies for direct inference and fine-tuning to predict the SASV score. Surprisingly, these strategies significantly improve the SASV equal error rate (EER) from the baseline's 19.31% to 1.53% on the official evaluation trials of the SASV challenge. We verify the effectiveness of the proposed components through ablation studies and provide insights with a score distribution analysis.

Link:

https://arxiv.org/abs/2202.05253

Team Researchers:
Neil Zhang, Ge Zhu, Zhiyao Duan
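The exact fusion formulation is given in the linked preprint; as a rough illustration of the general idea (not the paper's exact method), each subsystem score can be calibrated into a probability and the two combined, so that a trial scores highly only if it looks like both the target speaker and bona fide speech:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_scores(asv_score, cm_score):
    """Fuse an ASV score and a CM score into a single SASV score by
    treating each (after sigmoid calibration) as an independent
    probability of a bona fide target trial, then multiplying."""
    p_target = sigmoid(asv_score)   # P(target speaker | ASV score)
    p_bonafide = sigmoid(cm_score)  # P(bona fide speech | CM score)
    return p_target * p_bonafide

# A spoofed trial may fool the ASV subsystem (high asv_score),
# but a low CM score pulls the fused probability down.
high_both = fuse_scores(3.0, 3.0)
spoofed = fuse_scores(3.0, -3.0)
```

The multiplicative combination means either subsystem can veto a trial, which is the intuition behind treating the two decisions jointly rather than in isolation.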

Abstract

In this paper, we present the UR-AIR system submission to the logical access (LA) and speech deepfake (DF) tracks of the ASVspoof 2021 Challenge. The LA and DF tasks focus on synthetic speech detection (SSD), i.e., detecting text-to-speech and voice conversion spoofing attacks. Unlike previous ASVspoof challenges, this year's LA task presents codec and transmission channel variability, while the new DF task presents general audio compression. Building upon our previous research on improving the robustness of SSD systems to channel effects, we propose a channel-robust synthetic speech detection system for the challenge. To mitigate the channel variability issue, we use an acoustic simulator to apply transmission codecs, compression codecs, and convolutional impulse responses to augment the original datasets. For the neural network backbone, we propose the Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN) as our primary model. We also incorporate one-class learning with channel-robust training strategies to further learn a channel-invariant speech representation. Our submission achieved an EER of 20.33% in the DF task, and an EER of 5.46% with a min-tDCF of 0.3094 in the LA task.

Link:

https://arxiv.org/abs/2107.12018

Team Researchers:
Neil Zhang, Ge Zhu, Zhiyao Duan
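The impulse-response augmentation step described above can be sketched as a simple convolution of the waveform with a channel response. The filter taps and normalization here are illustrative only, not the challenge system's actual acoustic simulator:

```python
import numpy as np

def apply_impulse_response(speech, ir):
    """Simulate a transmission channel by convolving a waveform with a
    channel impulse response, then trimming back to the original length."""
    out = np.convolve(speech, ir)[:len(speech)]
    # Normalize to avoid clipping after convolution.
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out

# Hypothetical usage: a 3-tap low-pass-like channel applied to a tone.
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440 * t)
ir = np.array([0.6, 0.3, 0.1])
augmented = apply_impulse_response(speech, ir)
```

Training on many such channel-shifted copies of each utterance exposes the detector to channel variability it would otherwise only meet at test time.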

Abstract

Spoofing countermeasure (CM) systems are critical in speaker verification; they aim to discern spoofing attacks from bona fide speech trials. In practice, however, acoustic condition variability in speech utterances may significantly degrade the performance of CM systems. In this paper, we conduct a cross-dataset study on several state-of-the-art CM systems and observe significant performance degradation compared with their single-dataset performance. Observing differences in the average magnitude spectra of bona fide utterances across the datasets, we hypothesize that channel mismatch among these datasets is one important cause. We then verify this by demonstrating a similar degradation for CM systems trained on original data but evaluated on channel-shifted data. Finally, we propose several channel-robust strategies (data augmentation, multi-task learning, adversarial learning) for CM systems and observe significant performance improvements in cross-dataset experiments.

Link:

https://arxiv.org/abs/2104.01320

Team Researchers:
Neil Zhang, Ge Zhu, Zhiyao Duan
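The spectral comparison that motivates the channel-mismatch hypothesis can be sketched as follows. Here synthetic signals stand in for real datasets, and the smoothing filter is a hypothetical stand-in for an unknown channel:

```python
import numpy as np

def average_magnitude_spectrum(utterances, n_fft=512):
    """Average FFT magnitude spectrum over a list of 1-D waveforms;
    systematic differences between datasets hint at channel mismatch."""
    spectra = [np.abs(np.fft.rfft(u, n=n_fft)) for u in utterances]
    return np.mean(spectra, axis=0)

rng = np.random.default_rng(0)
clean = [rng.standard_normal(512) for _ in range(10)]
# Simulate a channel-shifted copy of the same data (simple smoothing filter).
shifted = [np.convolve(u, np.ones(8) / 8, mode="same") for u in clean]

spec_clean = average_magnitude_spectrum(clean)
spec_shifted = average_magnitude_spectrum(shifted)
# The smoothing channel attenuates high frequencies on average, so the
# two "datasets" show systematically different average spectra.
```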

Abstract

Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks, such as impersonation, replay, text-to-speech, and voice conversion. Recently, researchers have developed anti-spoofing techniques to improve the reliability of ASV systems against spoofing attacks. However, most methods have difficulty detecting unknown attacks in practical use, which often have statistical distributions different from those of known attacks. In particular, the rapid development of synthetic voice spoofing algorithms is generating increasingly powerful attacks, putting ASV systems at risk from unseen attacks. In this work, we propose an anti-spoofing system to detect unknown synthetic voice spoofing attacks (i.e., text-to-speech or voice conversion) using one-class learning. The key idea is to compact the bona fide speech representation and inject an angular margin to separate the spoofing attacks in the embedding space. Without resorting to any data augmentation methods, our proposed system achieves an equal error rate (EER) of 2.19% on the evaluation set of the ASVspoof 2019 Challenge logical access scenario, outperforming all existing single systems (i.e., those without model ensembles).

Link:

https://arxiv.org/abs/2010.13995

Team Researchers:
Neil Zhang, Zhiyao Duan
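A minimal sketch in the spirit of the one-class idea above: pull bona fide embeddings toward a target direction and push spoof embeddings away, with an angular margin between the two. The margin values, the scale factor `alpha`, and the target direction `w` are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def oc_softmax_loss(embeddings, labels, w, m_bona=0.9, m_spoof=0.2, alpha=20.0):
    """Illustrative one-class loss with an angular margin: bona fide
    embeddings (label 0) are pushed to cosine similarity above m_bona
    with the target direction w, spoof embeddings (label 1) below
    m_spoof, compacting bona fide speech in the embedding space."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = w / np.linalg.norm(w)
    cos = x @ w
    # The margin term flips direction with the label.
    margins = np.where(labels == 0, m_bona - cos, cos - m_spoof)
    return np.mean(np.log1p(np.exp(alpha * margins)))
```

Because only the bona fide class is compacted, attacks the model has never seen still fall outside the margin, which is what makes the one-class formulation attractive for unknown attacks.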

Abstract

State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies of speech utterances as input features. With the ability of deep neural networks to learn representations from raw data, recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a new speaker embedding called raw-x-vector for speaker verification in the time domain, combining a multi-scale waveform encoder and an x-vector network architecture. We show that the proposed approach outperforms existing raw-waveform-based speaker verification systems by a large margin. We also show that the proposed multi-scale encoder improves over single-scale encoders for both the proposed system and another state-of-the-art raw-waveform-based speaker verification system. A further analysis of the learned filters shows that the multi-scale encoder focuses on different frequency bands at its different scales while producing a flatter overall frequency response than any of the single-scale counterparts.

Link:

https://arxiv.org/abs/2010.12951

Team Researchers:
Ge Zhu, Zhiyao Duan
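A toy sketch of the multi-scale encoder idea above: parallel banks of 1-D convolution filters with different kernel sizes, whose outputs are stacked as channels. The kernel sizes, filter counts, and the use of random (untrained) filters are assumptions for illustration only, not the raw-x-vector architecture itself:

```python
import numpy as np

def multi_scale_encode(waveform, kernel_sizes=(3, 7, 15), n_filters=4, seed=0):
    """Encode a raw waveform with several parallel filter banks; short
    kernels see fine temporal detail, long kernels see coarser structure,
    and the outputs are concatenated along the channel axis."""
    rng = np.random.default_rng(seed)
    features = []
    for k in kernel_sizes:
        filters = rng.standard_normal((n_filters, k))
        for f in filters:
            # mode="same" keeps the time axis aligned across scales.
            features.append(np.convolve(waveform, f, mode="same"))
    # Shape: (n_filters * len(kernel_sizes), len(waveform))
    return np.stack(features)

feats = multi_scale_encode(np.ones(100))
```

In a trained system the filters are learned end to end; the point of the sketch is only how different kernel lengths yield different effective frequency resolutions at each scale.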

Abstract

Commercial speaker verification systems are an important component in security services for various domains, such as law enforcement, government, and finance. These systems are sensitive to noise present in the input signal, which leads to inaccurate verification results and hence security breaches. Traditional speech enhancement (SE) methods have been employed to improve the performance of speaker verification systems. However, to the best of our knowledge, the impact of state-of-the-art speech enhancement techniques has not been analyzed for text-independent automatic speaker verification (ASV) systems using real-world utterances. In this work, our contribution is twofold. First, we propose two deep neural network (DNN) architectures for SE, and we compare the performance of the proposed networks with existing work. We evaluate the resulting SE networks using the objective measures of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). Second, we analyze the performance of ASV systems when SE methods are used as front-end processing to remove non-stationary background noise. We compare the resulting equal error rate (EER) using our DNN-based SE approaches, as well as existing SE approaches, on real customer data and the freely available RedDots dataset. Our results show that our DNN-based SE approaches provide benefits for speaker verification performance.
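The EER metric reported throughout these papers is the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of computing it by sweeping a threshold over the observed scores:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Compute the EER of a verification system: sweep a decision
    threshold over all observed scores and return the error rate at the
    point where false-acceptance and false-rejection rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        frr = np.mean(target_scores < t)      # targets wrongly rejected
        far = np.mean(nontarget_scores >= t)  # non-targets wrongly accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

With perfectly separated score distributions the EER is 0; heavily overlapping distributions push it toward 50%, which is why the drop from 19.31% to 1.53% in the SASV work above is a substantial improvement.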


Contact Us

Do You Have Any Questions?

Please let us know how we can help you and we'll respond promptly!

VBG wants to hear from you