Search for: speech-recognition
Total 131 records

    Robust phoneme recognition using MLP neural networks in various domains of MFCC features

    , Article 2010 5th International Symposium on Telecommunications, IST 2010, 4 December 2010 through 6 December 2010, Tehran ; 2010 , Pages 755-759 ; 9781424481835 (ISBN) Dabbaghchian, S ; Sameti, H ; Ghaemmaghami, M. P ; BabaAli, B ; Sharif University of Technology
    2010
    Abstract
    This paper focuses on enhancing MFCC features using a set of MLP neural networks in order to improve phoneme recognition accuracy under different noise types and SNRs. A neural network can be applied in any of the domains between the MFCC feature extraction blocks: the FFT, MEL, LOG, DCT and DELTA domains. These domains differ in complexity and yield different degrees of improvement, so a comparative study is carried out to find the best domain. Furthermore, a set of MLP NNs, rather than a single network, is used to handle various noise types at different SNR levels; each network is trained on a specific noise type and SNR. The database used in the simulations is created by artificially... 
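
    As background, the sketch below (Python with NumPy/SciPy, not the authors' code) shows the standard MFCC blocks named in the abstract -- FFT, MEL, LOG, DCT and DELTA -- and the point where a trained MLP enhancer could be inserted; the filter-bank construction and the `enhance` hook are illustrative assumptions.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular mel-spaced filters (simplified construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frame(frame, fb, n_ceps=13, enhance=None):
    """One frame through the MFCC blocks; `enhance` is a hypothetical MLP."""
    spec = np.abs(np.fft.rfft(frame, n=512)) ** 2             # FFT domain
    mel_energies = fb @ spec                                   # MEL domain
    log_energies = np.log(mel_energies + 1e-10)                # LOG domain
    ceps = dct(log_energies, type=2, norm='ortho')[:n_ceps]    # DCT domain
    return enhance(ceps) if enhance is not None else ceps

def deltas(feats, width=2):
    """DELTA domain: first-order regression deltas over a +/- width window."""
    pad = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    n = len(feats)
    num = sum(t * (pad[width + t:n + width + t] - pad[width - t:n + width - t])
              for t in range(1, width + 1))
    return num / (2 * sum(t * t for t in range(1, width + 1)))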

    The combination of CMS with PMC for improving robustness of speech recognition systems

    , Article 13th International Computer Society of Iran Computer Conference on Advances in Computer Science and Engineering, CSICC 2008, Kish Island, 9 March 2008 through 11 March 2008 ; Volume 6 CCIS , 2008 , Pages 825-829 ; 18650929 (ISSN); 3540899847 (ISBN); 9783540899846 (ISBN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2008
    Abstract
    This paper addresses the robustness of automatic speech recognition systems for real applications in the presence of noise. The PMCC algorithm is proposed to combine the PMC technique with the CMS method. The proposed algorithm exploits the normalization ability of CMS within PMC, taking advantage of both methods to compensate for the effects of additive and convolutional noise. We have also investigated VTLN for speaker normalization and MLLR and MAP for speaker and acoustic adaptation. Different combinations of these methods are used to achieve robustness and make the system usable in real applications. Our evaluations are done on 4 different real noisy tasks on the Nevisa recognition... 
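
    For readers unfamiliar with CMS, a minimal Python sketch (an illustration, not the paper's implementation): subtracting the per-utterance cepstral mean removes stationary convolutional (channel) effects, which is the normalization ability that PMCC borrows.

import numpy as np

def cms(cepstra):
    """Cepstral mean subtraction: cepstra is a (n_frames, n_coeffs) array."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)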

    Using augmented reality and automatic speech recognition techniques to help deaf and hard of hearing people

    , Article ACM International Conference Proceeding Series ; 2012 ; 9781450312431 (ISBN) Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    2012
    Abstract
    Recently, many studies have shown that Augmented Reality (AR) and Automatic Speech Recognition (ASR) can help people with disabilities. In this paper we implement an innovative system for helping deaf people by combining AR, ASR, and AVSR technologies. The system instantly takes a narrator's speech, converts it into readable text, and shows the text directly on an AR display. We show that the system's accuracy exceeds 85 percent on average when different ASR engines are used alongside an AVSR engine in different noisy environments. We also show in a survey that, on average, more than 90 percent of deaf people need such a system as an assistant on portable devices, rather than relying on text alone or sign language alone... 

    Spectral subtraction in model distance maximizing framework for robust speech recognition

    , Article 2008 9th International Conference on Signal Processing, ICSP 2008, Beijing, 26 October 2008 through 29 October 2008 ; 2008 , Pages 627-630 ; 9781424421794 (ISBN) BabaAli, B ; Sameti, H ; Safayani, M ; Sharif University of Technology
    2008
    Abstract
    This paper presents a novel discriminative parameter calibration approach based on Model Distance Maximizing (MDM) to improve the performance of our previously proposed robustness method, spectral subtraction (SS) in a likelihood-maximizing framework. In the previous work, the spectral over-subtraction factor of SS was adjusted with the conventional ML approach, which uses only the true model without considering other confusable models and is therefore likely to reach a suboptimal solution. In MDM, by maximizing the dissimilarities among models, the performance of our speech recognizer-based spectral subtraction method can be further improved. Experimental results... 
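
    The quantity being calibrated is the over-subtraction factor of spectral subtraction; the hedged NumPy sketch below (with an assumed flooring constant beta) shows where that factor enters. The MDM calibration itself is not reproduced here.

import numpy as np

def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power-spectral subtraction; alpha is the over-subtraction factor."""
    cleaned = noisy_power - alpha * noise_power
    return np.maximum(cleaned, beta * noisy_power)   # spectral flooring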

    Noise and speaker robustness in a Persian continuous speech recognition system

    , Article 2007 9th International Symposium on Signal Processing and its Applications, ISSPA 2007, Sharjah, 12 February 2007 through 15 February 2007 ; 2007 ; 1424407796 (ISBN); 9781424407798 (ISBN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2007
    Abstract
    In this paper, VTLN speaker normalization and the MLLR and MAP adaptation methods are investigated in a Persian HMM-based, speaker-independent, large-vocabulary continuous speech recognition system. Speaker and environmental-noise robustness are achieved for this system in real-world applications. A search-based method is used in VTLN to find speaker-specific warping factors, which are applied to the signal's spectrum to normalize the effect of vocal tract length variation between speakers. In the MLLR framework, Gaussian mean and covariance transformations with global and full adaptation are evaluated, using regression-tree-based adaptation in a batch-supervised fashion. Also the standard... 
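
    A minimal sketch of search-based VTLN, under the assumption of a simple linear frequency warp and an externally supplied acoustic-model scoring function (`score_fn` is hypothetical): the warping factor with the highest model score is selected for each speaker.

import numpy as np

def warp_axis(spectrum, alpha):
    """Linearly rescale the frequency axis of one power spectrum by alpha."""
    n = len(spectrum)
    src = np.clip(np.arange(n) * alpha, 0, n - 1)
    return np.interp(src, np.arange(n), spectrum)

def best_warp(spectra, score_fn, alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid-search the warping factor that maximizes the model score."""
    return max(alphas, key=lambda a: score_fn([warp_axis(s, a) for s in spectra]))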

    Audio-visual speech recognition techniques in augmented reality environments

    , Article Visual Computer ; Vol. 30, issue. 3 , March , 2014 , pp. 245-257 ; ISSN: 01782789 Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    Abstract
    Many recent studies show that Augmented Reality (AR) and Automatic Speech Recognition (ASR) technologies can be used to help people with disabilities. Most of these studies, however, have been confined to their own specialized field. Audio-Visual Speech Recognition (AVSR) is one of the advances in ASR technology that combines audio, video, and facial expressions to capture a narrator's voice. In this paper, we combine AR and AVSR technologies to make a new system to help deaf and hard-of-hearing people. Our proposed system can take a narrator's speech instantly, convert it into readable text, and show the text directly on an AR display. Therefore, in this system, deaf people can read the... 

    Combining augmented reality and speech technologies to help deaf and hard of hearing people

    , Article Proceedings - 2012 14th Symposium on Virtual and Augmented Reality, SVR 2012 ; 2012 , Pages 174-181 ; 9780769547251 (ISBN) Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    2012
    Abstract
    Augmented Reality (AR), Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS) can be used to help people with disabilities. In this paper, we combine these technologies to make a new system for helping deaf people. The system takes the narrator's speech, converts it into readable text, and shows it directly on an AR display. To improve the accuracy of the system, we use Audio-Visual Speech Recognition (AVSR) as a backup for the ASR engine in noisy environments. In addition, we use the TTS system to make our system more usable for deaf people. The results of testing the system show that its accuracy is over 85 percent on average in different places. Also, the result of a... 

    A model distance maximizing framework for speech recognizer-based speech enhancement

    , Article AEU - International Journal of Electronics and Communications ; Volume 65, Issue 2 , February , 2011 , Pages 99-106 ; 14348411 (ISSN) Babaali, B ; Sameti, H ; Falk, T. H ; Sharif University of Technology
    2011
    Abstract
    This paper presents a novel discriminative parameter calibration approach based on the model distance maximizing (MDM) framework to improve the performance of our previously proposed method based on spectral subtraction (SS) in a likelihood-maximizing framework. In the previous work, spectral over-subtraction factors were adjusted with the conventional maximum-likelihood (ML) approach, which utilized only the true model and ignored other confusable models, and thus was likely to reach suboptimal solutions. In the proposed MDM framework, improved speech recognition performance is obtained by maximizing the dissimilarities among models. Experimental results based on FARSDAT, TIMIT... 

    An improved parallel model combination method for noisy speech recognition

    , Article Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2009 ; 2009 , Pages 237-242 ; 9781424454792 (ISBN) Veisi, H ; Sameti, H ; Sharif University of Technology
    Abstract
    In this paper a novel method, called PC-PMC, is proposed to improve the performance of automatic speech recognition systems in noisy environments. The method is based on the parallel model combination (PMC) technique and exploits the normalization ability of Cepstral Mean Subtraction (CMS) and the compression and decorrelation capabilities of Principal Component Analysis (PCA). It combines the additive-noise compensation of PMC with the convolutive-noise removal ability of CMS and PCA. The first problem to be solved in realizing PC-PMC is that the PMC algorithm requires invertible modules in the front-end of the system, while CMS normalization is not an invertible process. Also, it is... 
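
    As background, a simplified mean-only sketch of the PMC domain mapping (an assumption based on the standard log-add formulation; covariances and the CMS/PCA stages that PC-PMC adds are omitted): clean-speech and noise cepstral means are mapped to the linear spectral domain, added, and mapped back.

import numpy as np
from scipy.fftpack import dct, idct

def pmc_combine_means(mu_speech_cep, mu_noise_cep, gain=1.0):
    """Combine clean-speech and noise cepstral mean vectors (log-add sketch)."""
    lin_speech = np.exp(idct(mu_speech_cep, type=2, norm='ortho'))  # to linear spectrum
    lin_noise = np.exp(idct(mu_noise_cep, type=2, norm='ortho'))
    # add in the linear spectral domain, return to the cepstral domain
    return dct(np.log(gain * lin_speech + lin_noise), type=2, norm='ortho')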

    Flat-start single-stage discriminatively trained HMM-based models for ASR

    , Article IEEE/ACM Transactions on Audio Speech and Language Processing ; Volume 26, Issue 11 , 2018 , Pages 1949-1961 ; 23299290 (ISSN) Hadian, H ; Sameti, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2018
    Abstract
    In recent years, end-to-end approaches to automatic speech recognition have received considerable attention as they are much faster in terms of preparing resources. However, conventional multistage approaches, which rely on a pipeline of hidden Markov model (HMM)-GMM training and tree-building steps, still give state-of-the-art results on most databases. In this study, we investigate flat-start, single-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large-vocabulary continuous speech recognition. We thoroughly look into different issues that arise in such a setup and propose a standalone system, which achieves... 

    Acoustic modeling from frequency-domain representations of speech

    , Article Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2 September 2018 through 6 September 2018 ; Volume 2018-September , 2018 , Pages 1596-1600 ; 2308457X (ISSN) Ghahremani, P ; Hadian, H ; Lv, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology
    International Speech Communication Association  2018
    Abstract
    In recent years, various studies have proposed new methods for DNN-based feature extraction and for joint acoustic-model training and feature learning from the raw waveform for large-vocabulary speech recognition. However, conventional pre-processed features such as MFCC and PLP are still preferred in state-of-the-art speech recognition systems as they are perceived to be more robust. Moreover, the raw-waveform methods, most of which operate on the time-domain signal, do not significantly outperform the conventional methods. In this paper, we propose a frequency-domain feature-learning layer which allows acoustic model training directly from the waveform. The main distinctions from... 

    Filter-bank design based on dependencies between frequency components and phoneme characteristics

    , Article European Signal Processing Conference, 29 August 2011 through 2 September 2011 ; Septembe , 2011 , Pages 2142-2145 ; 22195491 (ISSN) Mohammadi, S. H ; Sameti, H ; Tavanaei, A ; Soltani Farani, A ; Sharif University of Technology
    2011
    Abstract
    Mel-frequency cepstral coefficients, which rely on Mel-scaled filters, are widely used for feature extraction in speech recognition systems. A new filter-bank based on the dependencies between frequency components and phoneme characteristics is proposed; F-ratio and mutual information are used to measure these dependencies. In the resulting filter-bank, the frequency resolution of the sub-band filters is inversely proportional to the computed dependency values, and this filter-bank replaces the Mel-scaled filters for feature extraction. A phoneme recognition experiment on the FARSDAT Persian-language database showed that features extracted with the proposed filter-bank reach higher accuracy (63.92%)... 
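
    The F-ratio mentioned above can be illustrated with a short sketch (an assumption about the exact formulation): the between-class variance of per-phoneme means divided by the average within-class variance, computed per frequency component.

import numpy as np

def f_ratio(features_by_class):
    """features_by_class: list of (n_samples_c, n_bins) arrays, one per phoneme."""
    class_means = np.array([c.mean(axis=0) for c in features_by_class])
    between = ((class_means - class_means.mean(axis=0)) ** 2).mean(axis=0)
    within = np.array([c.var(axis=0) for c in features_by_class]).mean(axis=0)
    return between / (within + 1e-12)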

    Improving the performance of speech recognition systems using fault-tolerant techniques

    , Article 2008 9th International Conference on Signal Processing, ICSP 2008, Beijing, 26 October 2008 through 29 October 2008 ; 2008 , Pages 579-582 ; 9781424421794 (ISBN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2008
    Abstract
    In this paper, the use of fault-tolerant techniques in speech recognition systems is studied and evaluated in order to make these systems robust to noise. Recognizer redundancy is implemented to exploit the strengths of several recognition methods, each of which performs acceptably in a specific condition. Duplication-with-comparison and NMR methods are evaluated with majority and plurality voting on a Persian telephony speech-enabled IVR system. The evaluations yield two promising outcomes: first, redundancy improves performance considerably; second, it enables us to detect outputs with low confidence. © 2008 IEEE  
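
    A minimal sketch of recognizer redundancy with majority voting (an illustration; the duplication-with-comparison and NMR variants differ in how many recognizers run and how disagreements are handled): accept a hypothesis only when enough recognizers agree, and flag the rest as low-confidence.

from collections import Counter

def vote(hypotheses, min_agree=2):
    """hypotheses: recognized strings from independent recognizers."""
    best, count = Counter(hypotheses).most_common(1)[0]
    return best, count >= min_agree   # (hypothesis, confident?)

    For example, vote(["yes", "yes", "no"]) returns ("yes", True), while three mutually disagreeing outputs would be returned with a low-confidence flag.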

    Robust parsing for word lattices in continuous speech recognition systems

    , Article 2007 9th International Symposium on Signal Processing and its Applications, ISSPA 2007, Sharjah, 12 February 2007 through 15 February 2007 ; 2007 ; 1424407796 (ISBN); 9781424407798 (ISBN) Momtazi, S ; Sameti, H ; Fazel Zarandi, M ; Bahrani, M ; Sharif University of Technology
    2007
    Abstract
    One of the roles of a Natural Language Processing (NLP) model in Continuous Speech Recognition (CSR) systems is to find the best sentence hypothesis by ranking all n-best sentences according to the grammar. This paper describes a robust parsing algorithm for Spoken Language Recognition (SLR) which utilizes a technique that improves the efficiency of parsing. This technique integrates grammatical and statistical approaches, and by using a best-first parsing strategy improves the accuracy of recognition. Preliminary experimental results using a Persian continuous speech recognition system show effective improvements in accuracy with little change in recognition time. The word error rate was... 

    An evolutionary decoding method for HMM-based continuous speech recognition systems using particle swarm optimization

    , Article Pattern Analysis and Applications ; Vol. 17, issue. 2 , 2014 , pp. 327-339 Najkar, N ; Razzazi, F ; Sameti, H ; Sharif University of Technology
    Abstract
    The main recognition procedure in modern HMM-based continuous speech recognition systems is the Viterbi algorithm, which finds the best acoustic sequence for the input speech by dynamic programming over the search space. In this paper, dynamic programming is replaced by a search method based on particle swarm optimization. The main idea is to generate the initial population of particles as speech segmentation vectors; the particles then converge toward the best segmentation through an update rule applied over iterations. A new particle representation and recognition process is introduced which is consistent with the nature of continuous... 
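
    The particle update itself would follow the standard PSO equations; the sketch below (an assumption, not the paper's particle representation) shows one velocity/position update over segmentation vectors.

import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """pos, vel, pbest: (n_particles, dim) arrays; gbest: (dim,) best segmentation."""
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel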

    Likelihood-maximizing-based multiband spectral subtraction for robust speech recognition

    , Article Eurasip Journal on Advances in Signal Processing ; Volume 2009 , 2009 ; 16876172 (ISSN) Babaali, B ; Sameti, H ; Safayani, M ; Sharif University of Technology
    2009
    Abstract
    Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and clean-trained acoustic models. Nevertheless, correlation can be expected... 
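
    The multiband variant assigns a separate over-subtraction factor to each frequency band; a hedged NumPy sketch follows (band edges, per-band factors, and the flooring constant are illustrative assumptions).

import numpy as np

def multiband_ss(noisy_power, noise_power, band_edges, alphas, beta=0.01):
    """band_edges: bin indices delimiting bands; alphas: one factor per band."""
    out = np.empty_like(noisy_power)
    for (lo, hi), a in zip(zip(band_edges[:-1], band_edges[1:]), alphas):
        sub = noisy_power[..., lo:hi] - a * noise_power[..., lo:hi]
        out[..., lo:hi] = np.maximum(sub, beta * noisy_power[..., lo:hi])
    return out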

    Spectral subtraction in likelihood-maximizing framework for robust speech recognition

    , Article INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association, Brisbane, QLD, 22 September 2008 through 26 September 2008 ; December , 2008 , Pages 980-983 ; 19909772 (ISSN) Baba Ali, B ; Sameti, H ; Safayani, M ; Sharif University of Technology
    2008
    Abstract
    Spectral Subtraction (SS), as a speech enhancement technique, was originally designed to improve the quality of the speech signal as judged by human listeners. It usually improves the quality and intelligibility of speech signals, whereas speech recognition systems need compensation techniques capable of reducing the mismatch between noisy speech features and the clean-trained models. This paper proposes a novel approach to this problem by treating SS and the speech recognizer as two interconnected components sharing the common goal of improved speech recognition accuracy. Experimental evaluations on a real recorded database and the TIMIT database show that the proposed method can... 

    A novel noise-immune, fuzzy approach to speaker-independent, isolated-word speech recognition

    , Article 2006 World Automation Congress, WAC'06, Budapest, 24 June 2006 through 26 June 2006 ; 2006 ; 1889335339 (ISBN); 9781889335339 (ISBN) Halavati, R ; Shouraki, S. B ; Razaghpour, M ; Tajik, H ; Cholakian, A ; Sharif University of Technology
    IEEE Computer Society  2006
    Abstract
    This paper presents a novel approach to isolated-word speech recognition using fuzzy modeling, specifically designed to ignore noise. The approach converts the speech spectrogram into a linguistic fuzzy description and compares this representation with fuzzy linguistic descriptions of words. The method is evaluated in single-speaker and multiple-speaker tests, and the results are compared with a widely used speech recognition approach, showing much higher noise resistance. Copyright - World Automation Congress (WAC) 2006  

    Design and Performance Improvement of a Spoken Term Detection System

    , M.Sc. Thesis Sharif University of Technology Ghadirinia, Marzieh (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    Recently, the widespread use of video and radio data has made efficient speech information retrieval systems highly crucial. In the present work, our focus is on spoken term detection, one of the most important approaches to such retrieval. The present system includes two main steps, the first of which is speech processing by means of automatic speech recognition; in the recognition step, we apply a large-vocabulary recognizer. In recent approaches, the main concern is retrieving words that are out of vocabulary (OOV). The state of the art for tackling this problem is to use proxy keywords, which are in-vocabulary words that can be recognized in place of OOV words. Such proxies have... 

    Speaker phone mode classification using Gaussian mixture models

    , Article SPA 2011 - Signal Processing: Algorithms, Architectures, Arrangements, and Applications - Conference Proceedings, 29 September 2011 through 30 September 2011 ; September , 2011 , Pages 112-117 ; 9781457714863 (ISBN) Eghbal Zadeh, H ; Sobhan Manesh, F ; Sameti, H ; BabaAli, B ; Sharif University of Technology
    2011
    Abstract
    This study focuses on classifying the speakerphone mode of phones using GMMs. Speech data recorded with the speakerphone both enabled and disabled on cell phones and landline telephones were collected, processed, and classified into two categories. Different GMM mixture numbers (1 to 4) and wave-file sizes of 10, 20, 40 and 80 kB were tested to find the optimal condition for classification. The GMM method attained an 87.99% correct classification rate on test data. This classification is important for speech-enabled IVR systems [1], dialog systems, and many other speech processing systems, in the sense that it could help load the optimum model for increasing system...
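
    A hedged sketch of the two-class GMM decision described above (using scikit-learn as an assumed substitute for whatever toolkit the study used): fit one mixture per speakerphone mode and pick the mode with the higher per-utterance log-likelihood.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_mode_gmms(feats_enabled, feats_disabled, n_mix=4):
    """feats_*: (n_frames, n_dims) training features for each speakerphone mode."""
    g_on = GaussianMixture(n_components=n_mix, random_state=0).fit(feats_enabled)
    g_off = GaussianMixture(n_components=n_mix, random_state=0).fit(feats_disabled)
    return g_on, g_off

def classify_mode(utterance_feats, g_on, g_off):
    """Pick the mode whose GMM gives the higher average log-likelihood."""
    return "enabled" if g_on.score(utterance_feats) > g_off.score(utterance_feats) else "disabled"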