Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study


GALIC, J. , GROZDIC, D.


Author keywords
artificial neural networks, audio databases, automatic speech recognition, hidden markov models, support vector machines


About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 3 - 12
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03001
Web of Science Accession Number: 001062641900001
SCOPUS ID: 85172345871

Abstract


Automatic Speech Recognition (ASR) systems are known to perform poorly in adverse conditions, which reflects their high sensitivity and low robustness. Because building extensive speech databases is costly and time-consuming, improving robustness through the synthetic generation of speech data from pre-existing natural speech has become a prominent research area. This paper examines the impact of standard data augmentation techniques (pitch shift, time stretch, volume control, and their combination) on the accuracy of isolated-word ASR systems. Three machine learning models, namely Hidden Markov Models (HMM), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN), are evaluated on two Serbian corpora of isolated words. The Whi-Spe speech database in neutral phonation is used for augmentation and training, and a purpose-built Python-based software tool performs the augmentation. The experiments show a statistically significant reduction in Word Error Rate (WER) for the CNN-based recognizer on both test datasets, achieved with a single augmentation technique based on pitch shifting.
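Two of the quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' augmentation tool, whose implementation this page does not describe. The function names `apply_gain` and `word_error_rate`, the decibel parameterization of volume control, the clipping to [-1, 1], and the toy labels are all assumptions made for illustration. The WER function further assumes that each utterance holds exactly one word and that the recognizer returns exactly one word, so every error is a substitution.

```python
def apply_gain(samples, gain_db):
    """Volume-control augmentation (hypothetical helper).

    Scales a waveform, given as floats in [-1.0, 1.0], by a gain in
    decibels and clips the result back into [-1.0, 1.0].
    """
    factor = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a dB gain
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


def word_error_rate(references, hypotheses):
    """Isolated-word WER in percent (hypothetical helper).

    Assumes one reference word and one recognized word per utterance,
    so WER is the share of utterances recognized incorrectly.
    """
    if not references or len(references) != len(hypotheses):
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(ref != hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * errors / len(references)


if __name__ == "__main__":
    # Toy waveform and labels, invented for this example only.
    louder = apply_gain([0.1, -0.2, 0.4], gain_db=6.0)
    print(louder)
    print(word_error_rate(["one", "two", "three", "two"],
                          ["one", "two", "two", "two"]))
```

Pitch shifting and time stretching need a signal-processing step such as a phase vocoder, so they are left out of this sketch. A comparison like the one reported would compute the WER of each recognizer once with and once without the augmented training data, on the same test set.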



Copyright ©2001-2025
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania

All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.

