Wednesday, June 5, 2019

Speaker Independent Speech Recognizer Development

Chapter 4
Methodology and Implementation

This chapter describes the methodology and implementation of the speaker independent speech recognizer for the Sinhala language and the Android mobile application for voice dialing. The research has two main phases. The first is to build the speaker independent Sinhala speech recognizer to recognize the digits spoken in the Sinhala language. The second is to build an Android application by integrating the trained speech recognizer. This chapter covers the tools, algorithms, theoretical aspects, standards and file structures used throughout the research process.

4.1 Research phase 1: Build the speaker independent Sinhala speech recognizer for recognizing the digits

In this section the development of the speaker independent Sinhala speech recognizer is described step by step. It covers the phonetic dictionary, the language model, the grammar file, the acoustic speech database and the creation of the trained acoustic model.

4.1.1 Data preparation

The system is a Sinhala speech recognition voice dialer. Since no previously built speech database was available for this purpose, the speech had to be collected from scratch to develop the system.

Data collection

The first stage of every speech recognizer is the collection of sound signals. The database should contain recordings from a sufficiently varied set of speakers. The size of the database depends on the task at hand. For this application only a small number of words was considered: the research targets only the written Sinhala vocabulary that can be applied to voice dialing. Altogether twelve words were considered, the ten digits and two initial calling words, amatanna and katakaranna. The database has two parts, the training part and the testing part. Usually about 1/10th of the full speech data is used for the testing part.
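As a rough illustration of the one-tenth rule of thumb, the sketch below holds out a fraction of utterance IDs for testing. It is a minimal example under stated assumptions, not the procedure used in this research: the function name `split_utterances`, the ID format and the fixed seed are all hypothetical.

```python
import random


def split_utterances(utterance_ids, test_fraction=0.1, seed=0):
    """Shuffle utterance IDs reproducibly and hold out a fraction for testing."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)           # fixed seed keeps the split repeatable
    n_test = round(len(ids) * test_fraction)   # about 1/10th of the data by default
    return ids[n_test:], ids[:n_test]          # (training part, testing part)


# Hypothetical utterance IDs, for illustration only.
all_ids = [f"utt_{i:04d}" for i in range(1000)]
train_ids, test_ids = split_utterances(all_ids)
print(len(train_ids), len(test_ids))
```

Shuffling before splitting avoids putting all recordings of one speaker or one prompt into the same part, which matters for a speaker independent system.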
In this research 3000 speech samples were used for training and 150 speech samples were used for testing.

Speech database

Before collecting data, a speech database was designed. The database consists of Sinhala speech samples taken from a variety of people of different ages. Since no such database relevant to voice dialing had been published for the Sinhala language, speech had to be collected from native Sinhala speakers.

Prompt sheet

To create the speech database, the first step was to prepare a prompt sheet listing the sentences for all the recordings. It contains 100 sentences that differ from each other, produced by generating the numbers randomly. 50 sentences start with the word amatanna and the other 50 start with the word katakaranna. The prompt sheet used for this research is attached in Appendix A.

Recording

The sentences in the prompt sheet were recorded by thirty (30) native speakers, since this is a speaker independent application. The speakers were selected by age and divided into eight age groups. Four people, two females and two males, were selected from each group except one, which contained only two people, one female and one male. Each speaker was given 100 sentences to speak, so altogether 3000 speech samples were recorded for training. Descriptions of the speakers, such as gender and age, can be found in Appendix A. If a recording contained an error due to background noise or filler sounds, the speaker was asked to repeat it until a correct sound signal was obtained. Since the proposed system is a discrete system, the speakers had to make a short pause at the start and end of the recording and also between the words. Speech was recorded in a quiet room, at night, using a condenser recording microphone.
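The prompt sheet described above (100 distinct sentences, half starting with each calling word, with randomly generated numbers) could be produced along these lines. This is a minimal sketch, not the script used in the research. The digit tokens "0" to "9" are placeholders for the ten Sinhala digit words, and the ten-digit number length, the seed and the function name `make_prompt_sheet` are assumptions made for illustration.

```python
import random

CALL_WORDS = ["amatanna", "katakaranna"]   # the two calling words in the vocabulary
DIGITS = [str(d) for d in range(10)]       # placeholders for the ten Sinhala digit words


def make_prompt_sheet(per_word=50, digits_per_number=10, seed=0):
    """Return per_word distinct prompts for each calling word (length is assumed)."""
    rng = random.Random(seed)
    prompts, seen = [], set()
    for word in CALL_WORDS:
        count = 0
        while count < per_word:
            number = " ".join(rng.choice(DIGITS) for _ in range(digits_per_number))
            sentence = f"{word} {number}"
            if sentence not in seen:       # keep every sentence on the sheet different
                seen.add(sentence)
                prompts.append(sentence)
                count += 1
    return prompts


sheet = make_prompt_sheet()
print(len(sheet))
print(sheet[0])
```

Writing the digits separated by spaces matches the discrete style of the recordings, where each word is followed by a short pause.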
The sounds were recorded at a sampling rate of 44.1 kHz using a mono channel and saved in *.wav format.

Sampling frequency and format of speech audio files

The speech recordings were saved in the MS WAV file format. The Praat software was used to convert the 44.1 kHz recordings to 16 kHz, since the training samples must have a sampling frequency of 16 kHz. The audio files had an average length of 11 seconds. Since there should be a silence at the beginning and the end of each utterance, and it should not exceed 0.2 seconds, the Praat software was used to edit all 3000 sound signals.

4.1.2 Pronunciation dictionary

The pronunciation dictionary was written by hand, since the voice dialing system uses very few words: only 12 words from the Sinhala vocabulary. To create the dictionary, the International Phonetic Alphabet for the Sinhala language and dictionaries previously created with CMU Sphinx were used. The acoustic phones, however, were chosen approximately, by studying the different types of databases made available through Carnegie Mellon University's Sphinx forum (CMU Sphinx Forum).

Two dictionaries were implemented for this system. One is for the speech utterances and the other is for filler sounds. The filler sounds cover the silences at the beginning, in the middle and at the end of the speech utterances. Both dictionaries can be found in Appendix A. They are referred to as the language dictionary and the filler dictionary.

4.1.3 Creating the grammar file

The grammar file was also created by hand, since the number of words used for the system is very small. JSGF (JSpeech Grammar Format) was used to implement the grammar file. The grammar file can be found in Appendix A.

4.1.4 Building the language model

Word search is restricted by a language model.
It identifies the matching words by comparing them with the words previously recognized by the model, and restricts the matching process by removing the words that are not possible. The N-gram language model is the most common type of language model in use today. It is a finite state language model and it contains statistics of word sequences. In a search space where this restriction is applied, a high accuracy rate can be obtained if the language model is a good one, because the language model can then predict the next word properly. It usually restricts the word search to the words included in the vocabulary.

The language model was built using the cmuclmtk software. First the reference text was created; that text (svd.text) can be found in Appendix A. It was written in a specific format: the speech sentences were delimited by <s> and </s> tags. Then the vocabulary file was generated with the following command.

    text2wfreq < svd.text | wfreq2vocab > svd.vocab

The generated vocabulary file was then edited to remove unwanted words (numbers and misspellings). When misspellings were found, they were fixed in the input reference text. The generated vocabulary file (svd.vocab) can be found in Appendix A.

Then the ARPA format language model was generated using these commands.

    text2idngram -vocab svd.vocab -idngram svd.idngram < svd.text
    idngram2lm -vocab_type 0 -idngram svd.idngram -vocab svd.vocab -arpa svd.arpa

Finally the CMU binary form of the language model (DMP file) was generated using the command

    sphinx_lm_convert -i svd.arpa -o svd.lm.DMP

The final output, containing the language model needed for the training process, is the svd.lm.DMP file. This is a binary file.

4.1.5 Acoustic model

Before starting the acoustic model creation, the following file structure was arranged as described by the CMU Sphinx toolkit guide. The name of the speech database is svd (Sinhala Voice Dial).
The content of these files is given in Appendix A.

svd.dic - Phonetic dictionary
svd.phone - Phoneset file
svd.lm.DMP - Language model
svd.filler - List of fillers
svd_train.fileids - List of files for training
svd_train.transcription - Transcription for training
svd_test.fileids - List of files for testing
svd_test.transcription - Transcription for testing

All these files were placed in one directory named etc. The wav speech samples were placed in another directory named wav. These two directories were placed in a directory named after the database (svd). Before starting the training process, a further directory is needed that contains the svd directory together with the required packages: the pocketsphinx, sphinxbase and sphinxtrain directories. All the packages and the svd directory were put into this directory, and the training process was started from there.

Setting up the training scripts

The command prompt terminal was used to run the scripts of the training process. Before starting, the terminal's working directory was changed to the svd database directory and the following command was run.

    python ../sphinxtrain/scripts/sphinxtrain -t svd setup

This command copied all the required configuration files into the etc subdirectory of the database directory and prepared the database for training. The two configuration files created were feat.params and sphinx_train.cfg. Both are given in Appendix A.

Set up the database

These values were filled in at configuration time.
The experiment name is used to name the model files and log files in the database.

    $CFG_DB_NAME = "svd";
    $CFG_EXPTNAME = "$CFG_DB_NAME";

Set up the format of database audio

Since the database contains speech utterances in wav format, recorded as MS WAV, the extension and the type were set to wav and mswav accordingly.

    $CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
    $CFG_WAVFILE_EXTENSION = 'wav';
    $CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw

Configure path to files

This was done automatically, given the right file structure in the running directory. The file names must match exactly. The paths were assigned to the variables used in the main training of the models.

    $CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
    $CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
    $CFG_FILLERDICT = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
    $CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
    $CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";
    $CFG_FEATPARAMS = "$CFG_LIST_DIR/feat.params";

Configure model type and model parameters

The continuous and semi-continuous model types can be used in PocketSphinx. The continuous type is used for continuous speech recognition. The semi-continuous type is used for the discrete speech recognition process. Since this application uses discrete speech, semi-continuous model training was used.

    #$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
    $CFG_HMM_TYPE = '.semi.'; # PocketSphinx
    $CFG_FINAL_NUM_DENSITIES = 8;
    # Number of tied states (senones) to create in decision-tree clustering
    $CFG_N_TIED_STATES = 1000;

The last value sets the number of senones used to train the model. With more senones, the model can discriminate sounds more precisely. But if too many senones are used, the model may not be able to recognize unseen sounds.
So the word error rate can be much higher on unseen sounds.

The approximate number of senones and number of densities is provided in the table below.

Configure sound feature parameters

The default for sound files in Sphinx is a rate of 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file is generated automatically with the recommended values. The recommended values are:

    # Feature extraction parameters
    $CFG_WAVFILE_SRATE = 16000.0;
    $CFG_NUM_FILT = 40; # For wideband speech it's 40, for telephone 8 kHz a reasonable value is 31
    $CFG_LO_FILT = 133.3334; # For telephone 8 kHz speech the value is 200
    $CFG_HI_FILT = 6855.4976; # For telephone 8 kHz speech the value is 3500

Configure decoding parameters

The following were configured in etc/sphinx_train.cfg.

    $DEC_CFG_DICTIONARY = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.dic";
    $DEC_CFG_FILLERDICT = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.filler";
    $DEC_CFG_LISTOFFILES = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
    $DEC_CFG_TRANSCRIPTFILE = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.transcription";
    $DEC_CFG_RESULT_DIR = "$DEC_CFG_BASE_DIR/result";
    # These variables, used by the decoder, have to be user defined, and may affect the decoder output
    $DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";
    $DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/$CFG_DB_NAME.lm.DMP";

Training

After setting all these paths and parameters in the configuration file as described above, training proceeded. To start the training process the following command was run.

    python ../sphinxtrain/scripts/sphinxtrain run

The scripts launched jobs on the machine and took a few minutes to run.

Acoustic model

After the training process, the acoustic model was located at the following path in the database directory. Only this folder is needed for the speech recognition tasks.

    model_parameters/svd.cd_semi_200

4.1.6 Testing results

150 speech samples were used as testing data.
The alignment results could be obtained after the training process. They were located at the following path in the database directory.

    result/svd.align

4.1.7 Parameters to be optimized

Word error rate

The WER is given as a percentage. In the standard definition used by the CMU Sphinx tools, it is calculated from the number of words N in the reference transcription and the numbers of inserted (I), deleted (D) and substituted (S) words in the recognized output:

\[ \mathrm{WER} = \frac{I + D + S}{N} \times 100\% \]

Accuracy

Accuracy is also given as a percentage. It is the counterpart of the WER: the share of reference words that were recognized correctly. In the same standard definition, insertions are not counted:

\[ \mathrm{Accuracy} = \frac{N - D - S}{N} \times 100\% \]

To obtain an optimal recognition system, the WER should be minimized and the accuracy should be maximized. The parameters of the configuration file were changed from time to time until an optimal recognition system was obtained, with the minimum WER and a high accuracy rate.

4.2 Research phase 2: Build the voice dialing mobile application

In this section, the implementation of the voice dialer Android mobile application is described. The application was developed in the Java programming language using the Eclipse IDE. It was tested on both the emulator and an actual device. The application is able to recognize the digits spoken by any speaker and dial the recognized number. This requires the trained acoustic model, the pronunciation dictionary, the language model and the grammar files. Speech recognition was performed with these models on the mobile device itself, using the pocketsphinx library. It is a library written in C for embedded speech recognition, used here on the Android platform.

The step by step implementation and integration of the necessary components are discussed in detail in this section.

Resource files

When adding the resource files to the Android application, they were placed in the assets/ directory of the project.
Then the physical path was given to make them available to pocketsphinx. After adding them, the assets directory contained the following resource files.

Dictionary
    svd.dic
    svd.dic.md5

Grammar
    digits.gram
    digits.gram.md5
    menu.gram
    menu.gram.md5

Language model
    svd.lm.DMP
    svd.lm.DMP.md5

Acoustic model
    feat.params
    feat.params.md5
    mdef
    mdef.md5
    means
    means.md5
    mixture_weights
    mixture_weights.md5
    noisedict
    noisedict.md5
    transition_matrices
    transition_matrices.md5
    variances
    variances.md5

assets.lst
    models/dict/svd.dic
    models/grammar/digits.gram
    models/grammar/menu.gram
    models/hmm/en-us-semi/feat.params
    models/hmm/en-us-semi/mdef
    models/hmm/en-us-semi/means
    models/hmm/en-us-semi/mixture_weights
    models/hmm/en-us-semi/noisedict
    models/hmm/en-us-semi/sendump
    models/hmm/en-us-semi/transition_matrices
    models/hmm/en-us-semi/variances
    models/lm/svd.lm.DMP

Setup the recognizer

First the recognizer has to be set up by adding the resource files. The model parameters produced by the training process were added as the HMM in the application. The recognition process depends mainly on these resource files. Since the grammar files and the language model were added as assets, both can be used for the recognition process alongside the HMM. Utterances can be recognized using either the grammar files or the language model. The whole process is coded in the Java programming language.

4.3 Architecture of the developed Speech Recognition System
