Lip reading by alternating between spatiotemporal and spatial convolutions

Tsourounis, Dimitrios; Kastaniotis, Dimitris; Fotopoulos, Spiros

Please use this identifier to cite or link to this item: https://hdl.handle.net/123456789/1514

Type:	Journal article
Title:	Lip reading by alternating between spatiotemporal and spatial convolutions
Author:	[EL] Τσουρούνης, Δημήτριος [EN] Tsourounis, Dimitrios; [EL] Καστανιώτης, Δημήτριος [EN] Kastaniotis, Dimitris; [EL] Φωτόπουλος, Σπύρος [EN] Fotopoulos, Spiros
Date:	20/05/2021
Abstract:	Lip reading (LR) is the task of predicting speech using only the visual information of the speaker. In this work, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from LR sequences are studied for the first time. In this context, a new learnable module named ALSOS (Alternating Spatiotemporal and Spatial Convolutions) is introduced in the proposed LR system. The ALSOS module consists of spatiotemporal (3D) and spatial (2D) convolutions along with two conversion components (3D-to-2D and 2D-to-3D) that provide a sequence-to-sequence mapping. The designed LR system uses the ALSOS module in between ResNet blocks, as well as Temporal Convolutional Networks (TCNs) in the backend for classification. The whole framework is composed of feedforward convolutional layers along with residual layers and can be trained end-to-end directly from the image sequences in the word-level LR problem. The ALSOS module can capture spatiotemporal dynamics and can be advantageous in the task of LR when combined with the ResNet topology. Experiments with different combinations of ALSOS with ResNet are performed on a dataset in the Greek language simulating a medical support application scenario and on the popular large-scale LRW-500 dataset of English words. Results indicate that the proposed ALSOS module can improve the performance of an LR system. Overall, inserting the ALSOS module into the ResNet architecture yielded higher classification accuracy, since it incorporates the contribution of the temporal information captured at different spatial scales of the framework.
Language:	English
Pages:	17
DOI:	10.3390/jimaging7050091
EISSN:	2313-433X
Subject category:	[EL] Μηχανική και Τεχνολογίες, άλλοι τομείς [EN] Engineering and Technologies, miscellaneous
Keywords:	lip reading; temporal convolutional networks; spatiotemporal processing
Copyright holder:	Copyright 2021 by the authors. Licensee MDPI, Basel, Switzerland.
Rights terms and conditions:	This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Publisher URL of the item:	https://www.mdpi.com/2313-433X/7/5/91
Journal URL:	https://www.mdpi.com/journal/jimaging
Publication source title:	Journal of Imaging
Issue:	5
Volume:	7
Item pages (in source):	Article no. 91
Notes:	This research is co-financed by Greece and the European Union (European Social Fund-ESF) through the Operational Programme «Human Resources Development, Education and Lifelong Learning 2014–2020» in the context of the project “Lip Reading Greek words with Deep Learning” (MIS 5047182).
Appears in collections:	Research groups
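
The abstract mentions two conversion components (3D-to-2D and 2D-to-3D) that let spatial and spatiotemporal convolutions alternate over the same sequence. The record does not say how these conversions are implemented. The NumPy sketch below shows one common way to do it, assuming the time axis is folded into the batch axis so that every frame can be processed by a 2D layer; the function names `to_2d` and `to_3d` and the tensor layout (batch, channels, time, height, width) are illustrative assumptions, not taken from the article.

```python
import numpy as np


def to_2d(x):
    """Fold time into the batch axis: (B, C, T, H, W) -> (B*T, C, H, W).

    Assumed layout; each output item is one frame of one sequence.
    """
    b, c, t, h, w = x.shape
    # Move time next to batch, then merge the two axes.
    return x.transpose(0, 2, 1, 3, 4).reshape(b * t, c, h, w)


def to_3d(y, batch):
    """Inverse of to_2d: (B*T, C, H, W) -> (B, C, T, H, W)."""
    bt, c, h, w = y.shape
    t = bt // batch
    # Split the merged axis back into (batch, time), then restore the layout.
    return y.reshape(batch, t, c, h, w).transpose(0, 2, 1, 3, 4)


# Toy input: 2 sequences, 3 channels, 4 frames of 5x5 pixels.
x = np.arange(2 * 3 * 4 * 5 * 5, dtype=np.float32).reshape(2, 3, 4, 5, 5)
frames = to_2d(x)                 # per-frame view for a spatial (2D) layer
restored = to_3d(frames, batch=2)  # sequence view for a spatiotemporal (3D) layer
```

Under these assumptions the round trip is lossless, so a 2D layer applied to `frames` sees individual frames while a following 3D layer applied to `restored` sees whole sequences.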

Files in this item:

The full text of this item is not currently available from the repository.