IJISA Vol. 10, No. 3, 8 Mar. 2018
Keywords: Language Models (LMs), Acoustic Models (AMs), Kaldi, Automatic Speech Recognition (ASR), Word Error Rates (WERs)
In this work, Language Models (LMs) and Acoustic Models (AMs) are developed with the speech recognition toolkit Kaldi, for both noisy and enhanced speech data, to build an Automatic Speech Recognition (ASR) system for the Kannada language. The speech data used to develop the ASR models was collected in an uncontrolled environment from farmers of different dialect regions of Karnataka state. The collected speech data is preprocessed with a proposed noise-elimination method for the degraded speech: a combination of Spectral Subtraction with Voice Activity Detection (SS-VAD) and the Minimum Mean Square Error Spectrum Power estimator based on Zero Crossing (MMSE-SPZC). Word-level transcription and validation of the speech data is carried out with an Indic language transliteration tool (IT3 to UTF-8). The Indian Language Speech Label (ILSL12) set is used to build the Kannada phoneme set and lexicon. Of the transcribed and validated speech data, 75% is used for system training and 25% for testing. The LMs are generated from Kannada language resources, and the AMs are developed using Gaussian Mixture Models (GMMs) and Subspace Gaussian Mixture Models (SGMMs). The proposed method is studied in detail and used to enhance the degraded speech data, and the Word Error Rates (WERs) of the ASR models for noisy and enhanced speech data are reported and discussed. The developed ASR models can be used in a spoken query system to access real-time agricultural commodity price and weather information in the Kannada language.
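To make the enhancement step concrete, the short Python sketch below illustrates plain magnitude spectral subtraction driven by a simple energy-based voice activity detector. It is only a rough approximation of the SS-VAD component described above and does not implement the paper's full SS-VAD/MMSE-SPZC pipeline; the frame length, hop, over-subtraction factor alpha, spectral floor beta, and the energy threshold are assumed values chosen for demonstration.

    import numpy as np

    def spectral_subtract(noisy, sr, frame_ms=25, hop_ms=10, alpha=2.0, beta=0.02):
        """Illustrative magnitude spectral subtraction with an energy-based VAD."""
        n_fft = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        win = np.hanning(n_fft)
        # Frame the noisy signal and take the short-time Fourier transform.
        starts = range(0, len(noisy) - n_fft, hop)
        spec = np.array([np.fft.rfft(noisy[i:i + n_fft] * win) for i in starts])
        mag, phase = np.abs(spec), np.angle(spec)
        # Energy-based VAD: low-energy frames are treated as noise-only
        # (the threshold below is an assumed heuristic, not the paper's VAD).
        energy = mag.sum(axis=1)
        noise_mask = energy < energy.min() + 0.25 * (energy.mean() - energy.min())
        noise_est = mag[noise_mask].mean(axis=0) if noise_mask.any() else mag.mean(axis=0)
        # Over-subtract the noise estimate and apply a spectral floor to limit musical noise.
        clean_mag = np.maximum(mag - alpha * noise_est, beta * mag)
        # Resynthesize by overlap-add, reusing the noisy phase.
        out = np.zeros(len(noisy))
        for k, i in enumerate(starts):
            out[i:i + n_fft] += np.fft.irfft(clean_mag[k] * np.exp(1j * phase[k]), n_fft) * win
        return out

A practical system would additionally normalize the overlap-add window gain and update the noise estimate over time rather than using a single average, before passing the enhanced audio to Kaldi feature extraction.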
Thimmaraja Yadava G, H S Jayanna, "Creation and Comparison of Language and Acoustic Models Using Kaldi for Noisy and Enhanced Speech Data", International Journal of Intelligent Systems and Applications (IJISA), Vol. 10, No. 3, pp. 22-32, 2018. DOI: 10.5815/ijisa.2018.03.03