In building speech-recognition-based applications, robustness to different noisy background conditions is an important challenge. In this paper, a bimodal approach is proposed to improve the robustness of a Hindi speech recognition system, and the importance of different types of visual features is studied for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the Two-Dimensional Discrete Cosine Transform (2D-DCT), Principal Component Analysis (PCA), the Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT), and the Two-Dimensional Discrete Wavelet Transform followed by PCA (2D-DWT-PCA). The audio features are extracted using Mel Frequency Cepstral Coefficients (MFCC) together with their static and dynamic components. Overall, 48 features, i.e. 39 audio features and 9 visual features, are used to measure the performance of the AVASR system. In addition, the performance of the AVASR system on noisy speech signals generated with the NOISEX database is evaluated at different signal-to-noise ratios (SNR: 30 dB to −10 dB) using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality audio-visual database of continuous Hindi sentences spoken by different subjects.
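To make the 2D-DCT visual-feature step concrete, the following is a minimal sketch, not the authors' implementation: it applies an orthonormal 2D-DCT to a mouth region-of-interest and keeps the nine lowest-frequency coefficients (matching the 9 visual features in the abstract). The ROI size and the top-left 3×3 coefficient-selection scheme are assumptions for illustration only.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal type-II DCT matrix: row j holds the j-th DCT basis vector.
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] *= 1.0 / np.sqrt(n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

def visual_features_2d_dct(mouth_roi, n_features=9):
    # Separable 2D-DCT: transform rows, then columns.
    h, w = mouth_roi.shape
    coeffs = dct_matrix(h) @ mouth_roi.astype(float) @ dct_matrix(w).T
    # Keep the low-frequency top-left block (3x3 for n_features = 9);
    # this selection scheme is an assumption, not stated in the paper.
    k = int(np.sqrt(n_features))
    return coeffs[:k, :k].flatten()

roi = np.random.rand(32, 32)  # stand-in for a grayscale mouth ROI frame
feats = visual_features_2d_dct(roi)
print(feats.shape)  # one 9-dimensional visual feature vector per frame
```

Per video frame, this yields a 9-dimensional visual feature vector that would be concatenated with the 39 MFCC-based audio features to form the 48-dimensional bimodal feature.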
Simultaneous perception of audio and visual stimuli often causes concealment or misrepresentation of the information actually contained in these stimuli. Such effects are called the "image proximity effect" or the "ventriloquism effect" in the literature. Until recently, most research carried out to understand their nature was based on subjective assessments. The authors of this paper propose a methodology based on both subjective and objectively retrieved data. In this methodology, the objective data reflect the screen areas that attract the most attention; these data were collected and processed by an eye-gaze tracking system. To support the proposed methodology, two series of experiments were conducted: one with the commercial eye-gaze tracking system Tobii T60, and another with the Cyber-Eye system developed at the Multimedia Systems Department of the Gdańsk University of Technology. In most cases, the visual-auditory stimuli were presented using 3D video. It was found that the eye-gaze tracking system did objectivize the results of the experiments. Moreover, the tests revealed a strong correlation between the localization of the visual stimulus on which a participant's gaze focused and the value of the "image proximity effect". It was also shown that gaze tracking may be useful in experiments aimed at evaluating the proximity effect when the presented visual stimuli are stereoscopic.
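The reported correlation between gaze localization and the proximity-effect value could be quantified as in the following minimal sketch; the per-trial variables and all numbers are hypothetical illustrations, not data from the paper.

```python
import numpy as np

# Hypothetical per-trial data: distance (in pixels) between the gaze
# fixation point and the on-screen sound-source position, and the
# subjective proximity-effect rating given for the same trial.
gaze_offset_px = np.array([12.0, 35.0, 58.0, 80.0, 104.0, 131.0])
proximity_rating = np.array([4.8, 4.1, 3.5, 2.9, 2.2, 1.6])

# Pearson correlation between gaze localization and the rated effect.
r = np.corrcoef(gaze_offset_px, proximity_rating)[0, 1]
print(f"r = {r:.3f}")  # strongly negative for this illustrative data
```

A strong (here negative) coefficient of this kind is one way the objective eye-gaze data can corroborate the subjective assessments.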