To determine speech intelligibility using the test suggested by Ozimek et al. (2009), the subject composed sentences with the words presented on a computer screen. However, the number and the type of these words were chosen arbitrarily. The subject was always presented with 18, similarly sounding words. Therefore, the aim of this study was to determine whether the number and the type of alternative words used by Ozimek et al. (2009), had a significant influence on the speech intelligibility. The aim was also to determine an optimal number of alternative words: i.e., the number that did not affect the speech reception threshold (SRT) and not unduly lengthened the duration of the test. The study conducted using a group of 10 subjects with normal hearing showed that an increase in the number of words to choose from 12 to 30 increased the speech intelligibility by about 0.3 dB/6 words. The use of paronyms as alternative words as opposed to random words, leads to an increase in the speech intelligibility by about 0.6 dB, which is equivalent to a decrease in intelligibility by 15 percentage points. Enlarging the number of words to choose from, and switching alternative words to paronyms, led to an increase in response time from approximately 11 to 16 s. It seems that the use of paronyms as alternative words as well as using 12 or 18 words to choose from is the best choice when using the Polish Sentence Test (PST).
This study sought to evaluate the effect of speech intensity on performance of the Callsign Acquisition Test (CAT) and Modified Rhyme Test (MRT) presented in noise. Fourteen normally hearing listeners performed both tests in 65 dB A white background noise. Speech intensity varied while background noise remained constant to form speech-to-noise ratios (SNRs) of -18, -15, -12, -9, and -6 dB. Results showed that CAT recognition scores were significantly higher than MRT scores at the same SNRs; however, the scores from both tests were highly correlated and their relationship for the SNRs tested can be expressed by a simple linear function. The concept of CAT can be easily ported to other languages for testing speech communication under adverse listening conditions.
A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition. It enables one to use more accurate acoustic models since the length of phonemes provide more information for parametrization. Our method relies on the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios that are typical for phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function, which represent the change of the energy distribution in the frequency domain. The exact definition of this method is described in the paper. The final decision on localization of boundaries is taken by analysis of the event function. Boundaries are, therefore, extracted using information from all subbands. The method was developed on a small set of Polish hand segmented words and tested on another large corpus containing 16 425 utterances. A recall and precision measure specifically designed to measure the quality of speech segmentation was adapted by using fuzzy sets. From this, results with F-score equal to 72.49% were obtained.
Speech enhancement is fundamental for various real time speech applications and it is a challenging task in the case of a single channel because practically only one data channel is available. We have proposed a supervised single channel speech enhancement algorithm in this paper based on a deep neural network (DNN) and less aggressive Wiener filtering as additional DNN layer. During the training stage the network learns and predicts the magnitude spectrums of the clean and noise signals from input noisy speech acoustic features. Relative spectral transform-perceptual linear prediction (RASTA-PLP) is used in the proposed method to extract the acoustic features at the frame level. Autoregressive moving average (ARMA) filter is applied to smooth the temporal curves of extracted features. The trained network predicts the coefficients to construct a ratio mask based on mean square error (MSE) objective cost function. The less aggressive Wiener filter is placed as an additional layer on the top of a DNN to produce an enhanced magnitude spectrum. Finally, the noisy speech phase is used to reconstruct the enhanced speech. The experimental results demonstrate that the proposed DNN framework with less aggressive Wiener filtering outperforms the competing speech enhancement methods in terms of the speech quality and intelligibility.
The human voice is one of the basic means of communication, thanks to which one also can easily convey the emotional state. This paper presents experiments on emotion recognition in human speech based on the fundamental frequency. AGH Emotional Speech Corpus was used. This database consists of audio samples of seven emotions acted by 12 different speakers (6 female and 6 male). We explored phrases of all the emotions – all together and in various combinations. Fast Fourier Transformation and magnitude spectrum analysis were applied to extract the fundamental tone out of the speech audio samples. After extraction of several statistical features of the fundamental frequency, we studied if they carry information on the emotional state of the speaker applying different AI methods. Analysis of the outcome data was conducted with classifiers: K-Nearest Neighbours with local induction, Random Forest, Bagging, JRip, and Random Subspace Method from algorithms collection for data mining WEKA. The results prove that the fundamental frequency is a prospective choice for further experiments.
The paper investigates the interdependence between the perceptual identification of the vocalic quality of six isolated Polish vowels traditionally defined by the spectral envelope and the fundamental frequency F0. The stimuli used in the listening experiments were natural female and male voices, which were modified by changing the F0 values in the ±1 octave range. The results were then compared with the outcome of the experiments on fully synthetic voices. Despite the differences in the generation of the investigated stimuli and their technical quality, consistent results were obtained. They confirmed the findings that in the perceptual identification of vowels of key importance is not only the position of the formants on the F1 × F2 plane but also their relationship to F0, the connection between the formants and the harmonics and other factors. The paper presents, in quantitative terms, all possible kinds of perceptual shifts of Polish vowels from one phonetic category to another in the function of voice pitch. An additional perceptual experiment was also conducted to check a broader range of F0 changes and their impact on the identification of vowels in CVC (consonant, vowel, consonant) structures. A mismatch between the formants and the glottal tone value can lead to a change in phonetic category.
The paper presents the results of sentence and logatome speech intelligibility measured in rooms with induction loop for hearing aid users. Two rooms with different acoustic parameters were chosen. Twenty two subjects with mild, moderate and severe hearing impairment using hearing aids took part in the experiment. The intelligibility tests composed of sentences or logatomes were presented to the subjects at fixed measurement points of an enclosure. It was shown that a sentence test is more useful tool for speech intelligibility measurements in a room than logatome test. It was also shown that induction loop is very efficient system at improving speech intelligibility. Additionally, the questionnaire data showed that induction loop, apart from improving speech intelligibility, increased a subject’s general satisfaction with speech perception
The paper analyzes the estimation of the fundamental frequency from the real speech signal which is obtained by recording the speaker in the real acoustic environment modeled by the MP3 method. The estimation was performed by the Picking-Peaks algorithm with implemented parametric cubic convolution (PCC) interpolation. The efficiency of PCC was tested for Catmull-Rom, Greville, and Greville two- parametric kernel. Depending on MSE, a window that gives optimal results was chosen.
Speaker‘s emotional states are recognized from speech signal with Additive white Gaussian noise (AWGN). The influence of white noise on a typical emotion recogniztion system is studied. The emotion classifier is implemented with Gaussian mixture model (GMM). A Chinese speech emotion database is used for training and testing, which includes nine emotion classes (e.g. happiness, sadness, anger, surprise, fear, anxiety, hesitation, confidence and neutral state). Two speech enhancement algorithms are introduced for improved emotion classification. In the experiments, the Gaussian mixture model is trained on the clean speech data, while tested under AWGN with various signal to noise ratios (SNRs). The emotion class model and the dimension space model are both adopted for the evaluation of the emotion recognition system. Regarding the emotion class model, the nine emotion classes are classified. Considering the dimension space model, the arousal dimension and the valence dimension are classified into positive regions or negative regions. The experimental results show that the speech enhancement algorithms constantly improve the performance of our emotion recognition system under various SNRs, and the positive emotions are more likely to be miss-classified as negative emotions under white noise environment.
Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates the novel combination of precedence, binaural and statistical independence cues for enhancing reverberant speech, prior to ASR, under these adverse acoustical conditions when two microphone signals are available. Results of the enhancement are evaluated in terms of relevant signal measures and accuracy for both English and Polish ASR tasks. These show inconsistencies between the signal and recognition measures, although in recognition the proposed method consistently outperforms all other combinations and the spectral-subtraction baseline.
The aim of this work was to measure subjective speech intelligibility in an enclosure with a long reverberation time and comparison of these results with objective parameters. Impulse Responses (IRs) were first determined with a dummy head in different measurement points of the enclosure. The following objective parameters were calculated with Dirac 4.1 software: Reverberation Time (RT), Early Decay Time (EDT), weighted Clarity (C50) and Speech Transmission Index (STI). For the chosen measurement points, a convolution of the IRs with the Polish Sentence Test (PST) and logatome tests was made. PST was presented at a background of a babble noise and speech reception threshold - SRT (i.e. SNR yielding 50% speech intelligibility) for those points were evaluated. A relationship of the sentence and logatome recognition vs. STI was determined. It was found that the final SRT data are well correlated with speech transmission index (STI), and can be expressed by a psychometric function. The difference between SRT determined in condition without reverberation and in reverberation conditions appeared to be a good measure of the effect of reverberation on speech intelligibility in a room. In addition, speech intelligibility, with and without use of the sound amplification system installed in the enclosure, was compared.
Speech emotion recognition is an important part of human-machine interaction studies. The acoustic analysis method is used for emotion recognition through speech. An emotion does not cause changes on all acoustic parameters. Rather, the acoustic parameters affected by emotion vary depending on the emotion type. In this context, the emotion-based variability of acoustic parameters is still a current field of study. The purpose of this study is to investigate the acoustic parameters that fear affects and the extent of their influence. For this purpose, various acoustic parameters were obtained from speech records containing fear and neutral emotions. The change according to the emotional states of these parameters was analyzed using statistical methods, and the parameters and the degree of influence that the fear emotion affected were determined. According to the results obtained, the majority of acoustic parameters that fear affects vary according to the used data. However, it has been demonstrated that formant frequencies, mel-frequency cepstral coefficients, and jitter parameters can define the fear emotion independent of the data used.
Although the emotions and learning based on emotional reaction are individual-specific, the main features are consistent among all people. Depending on the emotional states of the persons, various physical and physiological changes can be observed in pulse and breathing, blood flow velocity, hormonal balance, sound properties, face expression and hand movements. The diversity, size and grade of these changes are shaped by different emotional states. Acoustic analysis, which is an objective evaluation method, is used to determine the emotional state of people’s voice characteristics. In this study, the reflection of anxiety disorder in people’s voices was investigated through acoustic parameters. The study is a case-control study in cross-sectional quality. Voice recordings were obtained from healthy people and patients. With acoustic analysis, 122 acoustic parameters were obtained from these voice recordings. The relation of these parameters to anxious state was investigated statistically. According to the results obtained, 42 acoustic parameters are variable in the anxious state. In the anxious state, the subglottic pressure increases and the vocalization of the vowels decreases. The MFCC parameter, which changes in the anxious state, indicates that people can perceive this situation while listening to the speech. It has also been shown that text reading is also effective in triggering the emotions. These findings show that there is a change in the voice in the anxious state and that the acoustic parameters are influenced by the anxious state. For this reason, acoustic analysis can be used as an expert decision support system for the diagnosis of anxiety.
This study examined whether differences in reverberation time (RT) between typical sound field test rooms used in audiology clinics have an effect on speech recognition in multi-talker environments. Separate groups of participants listened to target speech sentences presented simultaneously with 0-to-3 competing sentences through four spatially-separated loudspeakers in two sound field test rooms having RT = 0:6 sec (Site 1: N = 16) and RT = 0:4 sec (Site 2: N = 12). Speech recognition scores (SRSs) for the Synchronized Sentence Set (S3) test and subjective estimates of perceived task difficulty were recorded. Obtained results indicate that the change in room RT from 0.4 to 0.6 sec did not significantly influence SRSs in quiet or in the presence of one competing sentence. However, this small change in RT affected SRSs when 2 and 3 competing sentences were present, resulting in mean SRSs that were about 8-10% better in the room with RT = 0:4 sec. Perceived task difficulty ratings increased as the complexity of the task increased, with average ratings similar across test sites for each level of sentence competition. These results suggest that site-specific normative data must be collected for sound field rooms if clinicians would like to use two or more directional speech maskers during routine sound field testing.
The 13th-century Persian poet Saʿdi from Shiraz is considered to be one of the most prominent representatives of medieval Persian ethical literature. His works full of moralizing anecdotes were well known and widely read not only in Persia, but in the other parts of the Islamic world as well. Due to his highly humanistic approach, the relations between people were one of the most important issues discussed by the poet. This article is an attempt to define the status of ‘speech’ in Saʿdi’s moral imagination and to show how it becomes a key instrument in shaping relations with others. In the poet’s opinion, the right words reasonably spoken, just like an appropriate silence, shape the relationship between people and help them avoid conflict and open dispute. Quarrels and confrontations, according to the poet, not only damage a person literally by exposing his flaws and imperfections of character, thereby compromising his reputation (aberu), but may also undermine the basis of social life, generating hostility between people. That is why Saʿdi urges his readers to use soft and gentle speech in dealing with people and always behave in a conciliatory manner in response to aggression and rudeness. Highlighting the moral aspect of speech, Saʿdi shows how kind words form an invisible veil between people, which should be preserved if man desires to maintain his image, good name and dignity.
The subject of this article is an analysis of the earliest of Karl Marx’s articles, Comments on the Latest Prussian Censorship Instruction. The essence of his views presented in that article was to protest against the restriction of the right to free expression of opinions by journalists. Marx pointed out that the new Prussian Censorship Instruction only seemed to liberalize censorship, but in fact in many aspects tightened the rules, for example, reinforced those that pertained to religious criticism. He thought that the Prussian Censorship Instruction was not an enactment of law, because by limiting freedom, lawmakers acted against the essence of the press, law and state. Marx thought that a press law was needed to guarantee freedom of the press and that censorship should be abolished entirely.
LABLITA-Suite. Resources for the acquisition of Italian as a second language – LABLITA-suite provides technology-enhanced learning resources for the acquisition of Italian L2. IMAGACT allows for mastering the semantic properties of action verbs in the early phases of language acquisition. The LABLITA corpus of Spoken Italian can be used for training learners for face to face conversations. RIDIRE and CORDIC provide corpus linguistic tools for accessing Italian phraseology, which is useful for enhancing writing capabilities in the various domains of language usage.
The most challenging in speech enhancement technique is tracking non-stationary noises for long speech segments and low Signal-to-Noise Ratio (SNR). Different speech enhancement techniques have been proposed but, those techniques were inaccurate in tracking highly non-stationary noises. As a result, Empirical Mode Decomposition and Hurst-based (EMDH) approach is proposed to enhance the signals corrupted by non-stationary acoustic noises. Hurst exponent statistics was adopted for identifying and selecting the set of Intrinsic Mode Functions (IMF) that are most affected by the noise components. Moreover, the speech signal was reconstructed by considering the least corrupted IMF. Though it increases SNR, the time and resource consumption were high. Also, it requires a significant improvement under nonstationary noise scenario. Hence, in this article, EMDH approach is enhanced by using Sliding Window (SW) technique. In this SWEMDH approach, the computation of EMD is performed based on the small and sliding window along with the time axis. The sliding window depends on the signal frequency band. The possible discontinuities in IMF between windows are prevented by the total number of modes and the number of sifting iterations that should be set a priori. For each module, the number of sifting iterations is determined by decomposition of many signal windows by standard algorithm and calculating the average number of sifting steps for each module. Based on this approach, the time complexity is reduced significantly with suitable quality of decomposition. Finally, the experimental results show the considerable improvements in speech enhancement under non-stationary noise environments.
The article addresses prosodic characteristics of the new intonation contour which can be observed in spontaneous speech in the contemporary Russian language. The study focuses on the attempt made to identify the most relevant criteria of this new intonation construction and to explain the reasons underlying its occurrence.
The aim of this study was to create a single-language counterpart of the International Speech Test Signal (ISTS) and to compare both with respect to their acoustical characteristics. The development procedure of the Polish Speech Test Signal (PSTS) was analogous to the one of ISTS. The main difference was that instead of multi-lingual recordings, speech recordings of five Polish speakers were used. The recordings were cut into 100–600 ms long segments and composed into one-minute long signal, obeying a set of composition rules, imposed mainly to preserve a natural, speech-like features of the signal. Analyses revealed some differences between ISTS and PSTS. The latter has about twice as high volume of voiceless fragments of speech. PSTS’s sound pressure levels in 1/3-octave bands resemble the shape of the Polish long-term average female speech spectrum, having distinctive maxima at 3–4 and 8–10 kHz which ISTS lacks. As PSTS is representative of Polish language and contains inputs from multiple speakers, it can potentially find an application as a standardized signal used during the procedure of fitting hearing aids for patients that use Polish as their main language.
The Chinese word identification and sentence intelligibility are evaluated by grades 3 and 5 students in the classrooms with different reverberation times (RTs) from three primary school under different signal-to-noise ratios (SNRs). The relationships between subjective word identification and sentence in- telligibility scores and speech transmission index (STI) are analyzed. The results show that both Chinese word identification and sentence intelligibility scores for grades 3 and 5 students in the classroom in- creased with the increase of SNR (and STI), increased with the increase of the age of students, and decreased with the increase of RT. To achieve a 99% sentence intelligibility score, the STIs required for grades 3, grade 5 students, and adults are 0.71, 0.61, and 0.51, respectively. The required objective acoustical index determined by a certain threshold of the word identification test might be underestimated for younger children (grade 3 students) in classroom but overestimated for adults. A method based on the sentence test is more useful for speech intelligibility evaluation in classrooms than that based on the word test for different age groups. Younger children need more favorable classroom acoustical environment with a higher STI than older children and adults to achieve the optimum speech communication in the classroom.
Laughter is one of the most important paralinguistic events, and it has specific roles in human conversation. The automatic detection of laughter occurrences in human speech can aid automatic speech recognition systems as well as some paralinguistic tasks such as emotion detection. In this study we apply Deep Neural Networks (DNN) for laughter detection, as this technology is nowadays considered state-of-the-art in similar tasks like phoneme identification. We carry out our experiments using two corpora containing spontaneous speech in two languages (Hungarian and English). Also, as we find it reasonable that not all frequency regions are required for efficient laughter detection, we will perform feature selection to find the sufficient feature subset.
Despite various speech enhancement techniques have been developed for different applications, existing methods are limited in noisy environments with high ambient noise levels. Speech presence probability (SPP) estimation is a speech enhancement technique to reduce speech distortions, especially in low signalto-noise ratios (SNRs) scenario. In this paper, we propose a new two-dimensional (2D) Teager-energyoperators (TEOs) improved SPP estimator for speech enhancement in time-frequency (T-F) domain. Wavelet packet transform (WPT) as a multiband decomposition technique is used to concentrate the energy distribution of speech components. A minimum mean-square error (MMSE) estimator is obtained based on the generalized gamma distribution speech model in WPT domain. In addition, the speech samples corrupted by environment and occupational noises (i.e., machine shop, factory and station) at different input SNRs are used to validate the proposed algorithm. Results suggest that the proposed method achieves a significant enhancement on perceptual quality, compared with four conventional speech enhancement algorithms (i.e., MMSE-84, MMSE-04, Wiener-96, and BTW).