Research on Asynchronous Multimodal Fusion and Few-shot Recognition in Cued Speech

Introduction/Objectives

To address the ambiguity inherent in lip-reading, Professor R. Orin Cornett of Gallaudet University invented Cued Speech (CS) in 1967, a communication method that uses hand cues to complement lip-reading (known in Chinese as 线索语). In this system, hand position encodes vowels and hand shape encodes consonants (see Figure 1). Specifically, English CS uses four hand positions to encode monophthongs, two hand movements to encode diphthongs, and eight hand shapes to encode consonants. To address the open problems and challenges in automatic CS recognition, this project studies asynchronous multimodal fusion in CS from the perspective of tensor decomposition, and uses knowledge distillation, a transfer learning technique, to tackle CS recognition from few samples.

Impact/Challenge

Impact:

This project can facilitate communication between deaf people and the hearing population, and therefore has social significance and application value. The research is also relevant to lip-reading, early education for people with hearing impairment, speech correction and therapy, robotics, audio-visual conversion and human-computer interaction. In addition, asynchronous multimodal fusion and dynamic decoding from weakly labeled data have attracted attention in other fields, for example multimodal speech recognition in the cocktail party problem and automatic recognition of human facial expressions. The solutions proposed in this project may therefore also advance work on these two problems.

Challenge:

1) For asynchronous multimodal fusion in CS, previous methods did not account for variation in the time delay, ignored the nonlinear asynchronous relationship among the CS modalities (lips, hand shape and hand position), and did not model the pairwise relationships between modalities during fusion. They also could not provide an interpretable analysis of how each modality affects recognition performance.

2) Very little data is available for the dynamic recognition of CS, so recognition performance on this task is currently low and needs further improvement.

Solutions/Contributions

This project is the first to study nonlinear fusion of the asynchronous modalities in CS from the perspective of tensor decomposition. (a) A new fusion model based on three tensors is established; it considers not only the relationship among all three modalities but also the pairwise relationships between lips and hand position and between lips and hand shape, which better suits the characteristics of the three modalities in CS. (b) The tensor singular value decomposition (T-SVD) and Low Tubal Rank Decomposition (LTRD) are used for the first time to reduce the number of parameters in the model. Notably, the two models proposed in this project can also be applied to asynchronous multimodal fusion problems in other fields.
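As a rough illustration of how a low-tubal-rank decomposition can reduce parameters, the sketch below computes a truncated t-SVD of a third-order tensor with NumPy. It is not the project's fusion model: the function names `truncated_tsvd` and `tsvd_parameter_count`, the tensor sizes and the rank are hypothetical, and the tensor is random rather than a learned fusion weight.

```python
import numpy as np


def truncated_tsvd(tensor, rank):
    """Return a tubal-rank-`rank` approximation of a real 3-way tensor."""
    n3 = tensor.shape[2]
    # Move to the Fourier domain along the third (tube) axis.
    t_hat = np.fft.fft(tensor, axis=2)
    approx_hat = np.zeros_like(t_hat)
    for k in range(n3):
        # Ordinary matrix SVD of each frontal slice in the Fourier domain.
        u, s, vh = np.linalg.svd(t_hat[:, :, k], full_matrices=False)
        # Keep only the leading `rank` singular triplets of this slice.
        approx_hat[:, :, k] = (u[:, :rank] * s[:rank]) @ vh[:rank, :]
    # Back to the original domain; for real input the imaginary part
    # is only numerical noise, so it is dropped.
    return np.fft.ifft(approx_hat, axis=2).real


def tsvd_parameter_count(n1, n2, n3, rank):
    """Entries stored by the truncated factors U, S (diagonal tubes) and V."""
    return rank * n3 * (n1 + n2 + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((6, 5, 4))  # hypothetical tensor, not real data
    approx = truncated_tsvd(weights, rank=2)
    rel_err = np.linalg.norm(approx - weights) / np.linalg.norm(weights)
    print("full entries:", weights.size)
    print("truncated entries:", tsvd_parameter_count(6, 5, 4, rank=2))
    print("relative error:", rel_err)
```

The trade-off is the usual one for truncation: a smaller rank stores fewer entries but approximates the original tensor less closely, so the rank would have to be chosen against recognition accuracy.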

The innovations of the CS dynamic recognition model based on few-shot learning are as follows: (a) a teacher-student network for automatic CS recognition is constructed using speech signals for the first time; (b) the KL divergence and the cosine similarity between teacher features and student features are added to the existing loss function, forming a new loss function by linear combination; (c) an LSTM replaces the previously used HMM-GMM to perform end-to-end dynamic decoding of CS.
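The linear combination in (b) might look roughly like the following pure-Python sketch. Several details are assumptions made only for illustration and are not stated in the project description: features are turned into distributions with a softmax before the KL term, the KL direction is teacher to student, the cosine term enters as 1 minus the similarity, and the weights `alpha` and `beta` and all function names are hypothetical.

```python
import math


def softmax(values):
    """Normalize a feature vector into a probability distribution."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def distillation_loss(task_loss, teacher_feat, student_feat, alpha=0.5, beta=0.5):
    """Existing task loss plus weighted KL and cosine terms (assumed form)."""
    # Distribution-level mismatch between teacher and student features.
    kl_term = kl_divergence(softmax(teacher_feat), softmax(student_feat))
    # Direction-level mismatch: zero when the features point the same way.
    cos_term = 1.0 - cosine_similarity(teacher_feat, student_feat)
    return task_loss + alpha * kl_term + beta * cos_term


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]  # hypothetical speech-side (teacher) features
    student = [3.0, 2.0, 1.0]  # hypothetical CS-side (student) features
    print(distillation_loss(1.0, teacher, student))
```

With identical teacher and student features both added terms vanish and the loss reduces to the existing task loss, which is the behavior one would expect from such a combination.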

Key innovations

1) The characteristics and innovation of the problem

The topic of this project, automatic CS recognition based on deep learning, belongs to a relatively new interdisciplinary research field whose scientific problems span speech processing, image recognition, multimodal fusion and other disciplines. Moreover, building on the original Chinese CS proposed by the applicant, this project will be the first study of automatic recognition of Chinese CS.

2) The characteristics and innovation of the method

A) Solve the problem of nonlinear fusion of the asynchronous modalities in CS from the new perspective of tensor decomposition. Solving this problem will make the multimodal fusion process better match the inherent complexity of CS, reduce the interference caused by asynchrony, and make the fused features reflect CS information more accurately, thereby improving the recognition accuracy of the system.

B) To address the currently poor performance of automatic CS recognition, we aim to use speech signals for transfer learning and to establish an end-to-end LSTM dynamic decoding model based on a teacher-student network. For this model, the project will construct a loss function that takes into account both the semantic relevance of the features and the similarity of their distributions. Solving this problem will allow automatic CS recognition to overcome the weak labels and small data volumes that limited previous studies, so that the parameters can be optimized globally, further improving the efficiency and robustness of the model.

Figure 1. English Cued Speech chart

Team

Liu Li.

Projects

Research on Asynchronous Multimodal Fusion and Few-shot Recognition in Cued Speech