Search results


Number of results: 7

Abstract

The human voice is one of the basic means of communication, through which one can also easily convey emotional state. This paper presents experiments on emotion recognition in human speech based on the fundamental frequency. The AGH Emotional Speech Corpus was used. This database consists of audio samples of seven emotions acted by 12 different speakers (6 female and 6 male). We explored phrases of all the emotions, both all together and in various combinations. The Fast Fourier Transform and magnitude spectrum analysis were applied to extract the fundamental tone from the speech audio samples. After extracting several statistical features of the fundamental frequency, we studied whether they carry information on the emotional state of the speaker by applying different AI methods. The analysis was conducted with the following classifiers from the WEKA data mining collection: K-Nearest Neighbours with local induction, Random Forest, Bagging, JRip, and the Random Subspace Method. The results show that the fundamental frequency is a promising choice for further experiments.
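The pipeline described above can be illustrated with a short, hypothetical sketch (not the authors' code): estimate F0 per frame from the FFT magnitude spectrum, summarise the contour with a few statistics, and feed these to a classifier. The frame length, the F0 search range, and the use of scikit-learn's Random Forest in place of the WEKA implementations are all assumptions of this illustration.

```python
import numpy as np
from numpy.fft import rfft, rfftfreq
from sklearn.ensemble import RandomForestClassifier

def f0_track(signal, sr, frame_len=2048, hop=512, fmin=60.0, fmax=400.0):
    """Per-frame F0 estimate: the strongest magnitude-spectrum peak
    inside a plausible voice range (fmin..fmax)."""
    freqs = rfftfreq(frame_len, d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(rfft(frame))
        f0.append(freqs[band][np.argmax(mag[band])])
    return np.array(f0)

def f0_features(signal, sr):
    """Statistical descriptors of the F0 contour used as classifier input."""
    f0 = f0_track(signal, sr)
    return [f0.mean(), f0.std(), f0.min(), f0.max(), np.median(f0)]

# Hypothetical usage with a list of (waveform, emotion label) pairs:
# X = [f0_features(wave, sr) for wave, _ in samples]
# y = [label for _, label in samples]
# clf = RandomForestClassifier(n_estimators=200).fit(X, y)
```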
Go to article

Abstract

The paper analyzes the estimation of the fundamental frequency from a real speech signal obtained by recording the speaker in a real acoustic environment modeled by the MP3 method. The estimation was performed with the Picking-Peaks algorithm with parametric cubic convolution (PCC) interpolation. The efficiency of PCC was tested for the Catmull-Rom, Greville, and Greville two-parametric kernels. Based on the MSE, the window giving optimal results was chosen.
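As an illustration only, the following sketch refines an integer spectral peak bin by one-parameter cubic convolution interpolation (a = -0.5 yields the Catmull-Rom kernel). The function names, the fine-grid search, and the handling of the kernel support are my assumptions, not the paper's implementation.

```python
import numpy as np

def cubic_kernel(s, a=-0.5):
    """Keys' one-parameter cubic convolution kernel (a = -0.5: Catmull-Rom)."""
    s = abs(s)
    if s <= 1.0:
        return (a + 2) * s**3 - (a + 3) * s**2 + 1
    if s < 2.0:
        return a * s**3 - 5 * a * s**2 + 8 * a * s - 4 * a
    return 0.0

def pcc_interp(mag, x, a=-0.5):
    """Interpolate the magnitude spectrum at fractional bin x from four neighbours."""
    i = int(np.floor(x))
    y = 0.0
    for m in range(-1, 3):
        idx = i + m
        if 0 <= idx < len(mag):
            y += mag[idx] * cubic_kernel(x - idx, a)
    return y

def refine_peak(mag, k, a=-0.5, steps=200):
    """Fine-grid search around integer peak bin k for the interpolated maximum."""
    xs = np.linspace(k - 1, k + 1, steps)
    ys = [pcc_interp(mag, x, a) for x in xs]
    return xs[int(np.argmax(ys))]  # fractional bin; F0 = bin * sr / n_fft
```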
Go to article

Abstract

Although emotions and learning based on emotional reactions are individual-specific, their main features are consistent across all people. Depending on a person's emotional state, various physical and physiological changes can be observed in pulse and breathing, blood flow velocity, hormonal balance, voice properties, facial expression, and hand movements. The diversity, size, and grade of these changes are shaped by different emotional states. Acoustic analysis, an objective evaluation method, is used to determine the emotional state from people's voice characteristics. In this study, the reflection of anxiety disorder in people's voices was investigated through acoustic parameters. The study is a cross-sectional case-control study. Voice recordings were obtained from healthy people and patients. With acoustic analysis, 122 acoustic parameters were obtained from these voice recordings. The relation of these parameters to the anxious state was investigated statistically. According to the results, 42 acoustic parameters vary in the anxious state. In the anxious state, the subglottic pressure increases and the vocalization of the vowels decreases. The MFCC parameter, which changes in the anxious state, indicates that people can perceive this condition while listening to speech. It was also shown that text reading is effective in triggering emotions. These findings show that the voice changes in the anxious state and that the acoustic parameters are influenced by it. For this reason, acoustic analysis can be used as an expert decision support system for the diagnosis of anxiety.
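A minimal, hypothetical sketch of this kind of analysis is given below: it extracts a handful of acoustic descriptors per recording (mean MFCCs only, standing in for the study's 122 parameters) and tests whether they differ between the anxious and control groups. The file lists and the choice of the Mann-Whitney test are assumptions.

```python
import numpy as np
import librosa
from scipy.stats import mannwhitneyu

def acoustic_params(path, n_mfcc=13):
    """Mean MFCCs of a recording, standing in for the study's parameter set."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# anxious_files and control_files are hypothetical lists of recording paths:
# anxious = np.array([acoustic_params(p) for p in anxious_files])
# control = np.array([acoustic_params(p) for p in control_files])
# for i in range(anxious.shape[1]):
#     _, p = mannwhitneyu(anxious[:, i], control[:, i])
#     print(f"MFCC {i}: p = {p:.4f}")  # small p: the parameter differs between groups
```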
Go to article

Abstract

The aim of this study was to create a single-language counterpart of the International Speech Test Signal (ISTS) and to compare the two with respect to their acoustical characteristics. The development procedure for the Polish Speech Test Signal (PSTS) was analogous to that of the ISTS. The main difference was that, instead of multilingual recordings, speech recordings of five Polish speakers were used. The recordings were cut into 100–600 ms segments and composed into a one-minute signal, obeying a set of composition rules imposed mainly to preserve the natural, speech-like character of the signal. Analyses revealed some differences between the ISTS and the PSTS. The latter contains about twice as many voiceless fragments of speech. The PSTS's sound pressure levels in 1/3-octave bands resemble the shape of the Polish long-term average female speech spectrum, with distinctive maxima at 3–4 and 8–10 kHz which the ISTS lacks. As the PSTS is representative of the Polish language and contains input from multiple speakers, it can potentially find application as a standardized signal used during the procedure of fitting hearing aids for patients who use Polish as their main language.
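The composition step and the 1/3-octave-band check could look roughly like the sketch below; the random segment selection, the omission of the full set of composition rules, and the 4th-order Butterworth band filters are assumptions of this illustration, not the PSTS procedure itself.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def compose_test_signal(recordings, sr, duration_s=60.0, rng=None):
    """Concatenate randomly chosen 100-600 ms segments until the target length.
    The actual composition rules (speaker order, pauses, etc.) are omitted."""
    rng = rng or np.random.default_rng()
    target = int(duration_s * sr)
    out, total = [], 0
    while total < target:
        rec = recordings[rng.integers(len(recordings))]
        seg_len = int(rng.uniform(0.1, 0.6) * sr)
        start = rng.integers(0, max(1, len(rec) - seg_len))
        seg = rec[start:start + seg_len]
        out.append(seg)
        total += len(seg)
    return np.concatenate(out)[:target]

def third_octave_level(signal, sr, fc):
    """Level (dB re full scale) in the 1/3-octave band centred at fc."""
    lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)
    sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
    band = sosfilt(sos, signal)
    return 10 * np.log10(np.mean(band ** 2) + 1e-12)
```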
Go to article

Abstract

Speech and music signals are multifractal phenomena. The time displacement profiles of speech and music signals show strikingly different scaling behaviour. However, a full complexity analysis of their frequency and amplitude has not been made so far. We propose a novel complex-network-based approach (the Visibility Graph) to study the scaling behaviour of the frequency-wise amplitude variation of speech and music signals over time, and then extract their PSVG (Power of Scale-freeness of the Visibility Graph). This analysis shows that the scaling behaviour of the amplitude profile of music varies considerably from frequency to frequency, whereas it is almost consistent for the speech signal. The left auditory cortical areas are proposed to be neurocognitively specialised in speech perception and the right ones in music. Hence we conclude that the human brain might have adapted to the distinctly different scaling behaviour of speech and music signals and developed different decoding mechanisms, as if following so-called Fractal Darwinism. Using this method, we can capture all non-stationary aspects of the acoustic properties of the source signal at the deepest level, which has considerable neurocognitive significance. Further, we propose a novel non-invasive application for detecting neurological illness (here, autism spectrum disorder, ASD) using the quantitative parameters deduced from the variation of the scaling behaviour of speech and music.
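A brute-force sketch of the visibility-graph analysis is shown below (suitable only for short series): it builds the natural visibility graph of a time series and estimates a scaling exponent from the slope of the log-log degree distribution. Taking that slope as the PSVG is an assumption of this illustration, not a reproduction of the authors' procedure.

```python
import numpy as np

def visibility_degrees(y):
    """Node degrees of the natural visibility graph (brute-force O(n^3) check)."""
    n = len(y)
    deg = np.zeros(n, dtype=int)
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                y[c] < y[b] + (y[a] - y[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                deg[a] += 1
                deg[b] += 1
    return deg

def psvg(y):
    """Scaling exponent taken as the slope of P(k) vs k on log-log axes."""
    deg = visibility_degrees(y)
    k, counts = np.unique(deg[deg > 0], return_counts=True)
    p = counts / counts.sum()
    slope, _ = np.polyfit(np.log(k), np.log(p), 1)
    return -slope
```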
Go to article

Abstract

Chinese word identification and sentence intelligibility were evaluated by grade 3 and grade 5 students in classrooms with different reverberation times (RTs) from three primary schools, under different signal-to-noise ratios (SNRs). The relationships between the subjective word identification and sentence intelligibility scores and the speech transmission index (STI) were analyzed. The results show that both Chinese word identification and sentence intelligibility scores for grade 3 and grade 5 students in the classroom increased with increasing SNR (and STI), increased with the age of the students, and decreased with increasing RT. To achieve a 99% sentence intelligibility score, the STIs required for grade 3 students, grade 5 students, and adults are 0.71, 0.61, and 0.51, respectively. The required objective acoustical index determined by a given threshold of the word identification test might be underestimated for younger children (grade 3 students) in the classroom but overestimated for adults. A method based on the sentence test is more useful for speech intelligibility evaluation in classrooms for different age groups than one based on the word test. Younger children need a more favorable classroom acoustic environment, with a higher STI, than older children and adults to achieve optimum speech communication in the classroom.
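One way to derive such thresholds is to fit a psychometric curve of intelligibility versus STI and read off the STI at the 99% point. The sketch below does this on invented data with an assumed logistic form; it is an illustration only, not the procedure used in the study.

```python
import numpy as np
from scipy.optimize import curve_fit, brentq

def logistic(x, x0, k):
    """Intelligibility (0..1) modelled as a logistic function of STI."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical (invented) measurements for one listener group:
sti = np.array([0.30, 0.40, 0.50, 0.60, 0.70, 0.80])
score = np.array([0.55, 0.75, 0.88, 0.95, 0.985, 0.995])

(x0, k), _ = curve_fit(logistic, sti, score, p0=[0.4, 10.0])
sti_99 = brentq(lambda x: logistic(x, x0, k) - 0.99, 0.0, 1.5)
print(f"STI required for a 99% score (toy data): {sti_99:.2f}")
```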
Go to article

Abstract

In order to understand commands given by voice by an operator, user, or any human, a robot needs to focus on a single source, to acquire a clear speech sample, and to recognize it. A two-step approach to the deconvolution of speech and sound mixtures in the time domain is proposed. First, we apply a deconvolution procedure constrained so that the de-mixing matrix has fixed diagonal values and no non-zero delay parameters on the diagonal. We derive an adaptive rule for the modification of the deconvolution matrix. The individual outputs extracted in this first step may, however, still be self-convolved. We try to eliminate this corruption by a de-correlation process applied independently to every individual output channel.
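A simplified two-channel sketch of such a two-step scheme is given below: step 1 adapts FIR cross-filters (with the diagonal fixed to unity) by a basic decorrelation-driven rule, and step 2 whitens each output with an LPC inverse filter. The adaptation rule, filter lengths, and LPC order are assumptions of this sketch, not the paper's derivation.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def separate(x1, x2, L=64, mu=1e-4, iters=5):
    """Step 1: y1 = x1 - w12*x2, y2 = x2 - w21*x1, with FIR cross-filters
    adapted by a simple rule that pushes the output cross-correlation toward zero."""
    w12, w21 = np.zeros(L), np.zeros(L)
    for _ in range(iters):
        y1 = x1 - lfilter(w12, [1.0], x2)
        y2 = x2 - lfilter(w21, [1.0], x1)
        for k in range(L):
            w12[k] += mu * np.mean(y1[k:] * y2[:len(y2) - k])
            w21[k] += mu * np.mean(y2[k:] * y1[:len(y1) - k])
    y1 = x1 - lfilter(w12, [1.0], x2)
    y2 = x2 - lfilter(w21, [1.0], x1)
    return y1, y2

def whiten(y, order=12):
    """Step 2: reduce residual self-convolution with an LPC inverse (whitening) filter."""
    r = np.correlate(y, y, mode='full')[len(y) - 1:len(y) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])  # linear prediction coefficients
    return lfilter(np.concatenate(([1.0], -a)), [1.0], y)
```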
Go to article
