International Journal of Image, Graphics and Signal Processing (IJIGSP)

ISSN: 2074-9074 (Print), ISSN: 2074-9082 (Online)

Published By: MECS Press

IJIGSP Vol.5, No.9, Jul. 2013

Improved Frame Level Features and SVM Supervectors Approach for The Recogniton of Emotional States from Speech: Application to Categorical and Dimensional States

Full Text (PDF, 332KB), PP.8-13



Imen Trabelsi, Dorra Ben Ayed, Noureddine Ellouze

Index Terms

Speech emotion recognition, valence, arousal, MFCC, GMM Supervector, SVM


The purpose of a speech emotion recognition system is to classify a speaker's utterances into emotional states such as disgust, boredom, sadness, neutral and happiness. The features commonly used in speech emotion recognition (SER) are global, utterance-level prosodic features. In our work, we instead evaluate the impact of frame-level feature extraction. The speech samples come from the Berlin emotional database, and the features extracted from these utterances are energy, different variants of mel-frequency cepstral coefficients (MFCC), and velocity and acceleration features. The idea is to apply GMM-UBM, an approach that has proved successful in the speaker recognition literature, to emotion identification tasks. In addition, we propose a classification scheme for labeling emotions with a continuous, dimensional approach.
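As an illustration of the velocity and acceleration features mentioned above, the following is a minimal sketch of how such dynamic features are often derived from a matrix of static frame-level coefficients (for example MFCC plus energy). It assumes the common regression-based delta formula with a window of two frames on each side and edge padding. The abstract does not state which formula or window the authors used, and the function names `delta` and `stack_frame_features` are hypothetical.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta over time for a (frames, coeffs) matrix.

    Assumed formula (not taken from the paper):
        d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    """
    if width < 1:
        raise ValueError("width must be >= 1")
    num_frames = features.shape[0]
    # Repeat the first and last frames so every frame has `width` neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_frame_features(static: np.ndarray, width: int = 2) -> np.ndarray:
    """Concatenate static, velocity (delta) and acceleration (delta-delta)."""
    velocity = delta(static, width)
    acceleration = delta(velocity, width)
    return np.hstack([static, velocity, acceleration])


if __name__ == "__main__":
    # Toy input: 10 frames, 2 coefficients rising linearly over time.
    ramp = np.arange(10)[:, None] * np.array([1.0, 2.0])
    print(stack_frame_features(ramp).shape)
```

Each frame then carries three times as many values as the static vector alone, which is the kind of frame-level representation a GMM-UBM could be trained on.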

Cite This Paper

Imen Trabelsi, Dorra Ben Ayed, Noureddine Ellouze, "Improved Frame Level Features and SVM Supervectors Approach for The Recogniton of Emotional States from Speech: Application to Categorical and Dimensional States", IJIGSP, vol.5, no.9, pp.8-13, 2013. DOI: 10.5815/ijigsp.2013.09.02

