We have been working on a number of things recently. We have finished our web experiments, in which we looked at the influence of wind noise on the perceptual quality of speech. In this experiment, people were asked to listen to recordings with added wind noise, rate the quality, attempt to repeat what was said, and rate the difficulty of the task. We varied the wind noise samples in terms of level and ‘gustiness’. We are analyzing the data at the moment, attempting to understand how level and gustiness relate to sound quality for this particular case.
Wind Noise Detector
In addition to these subjective tests, we have developed a ‘wind noise detector’. This algorithm listens to an audio stream and detects the presence of wind noise. The detector compresses the information within the audio stream by extracting ‘audio features’, which are efficient representations of sounds. The amount of data required to represent an uncompressed digital audio stream is very large, and building a detector that works directly on the raw audio stream is not practical. Features must therefore be extracted that represent the information in the stream much more efficiently. Luckily, an understanding of how sound is processed by the human auditory system gives us a way of compressing the information stream, throwing away the perceptually unimportant parts while keeping the salient features. This is how MP3 and other lossy compression methods achieve their high compression ratios. See the later section for more information on feature extraction.
Teaching a machine to detect wind noise
Audio Features – Mel-Frequency Cepstrum Coefficients
The audio feature representation called Mel-Frequency Cepstrum Coefficients (MFCCs) is commonly used in speech recognition to compress the information stream prior to the recognition stage. The MFCC is a spectral representation of a signal over a (usually short, e.g. 20 ms) time period. A spectral representation means that, rather than representing the signal in the time domain (i.e. how the pressure fluctuates over time), the representation shows the levels of the different frequency components within the analysis time period (often referred to as a window). The ‘Mel’ part refers to the frequencies over which the spectrum is evaluated. A Fourier transform has linearly spaced frequency components; however, this is not how the human auditory system works. Human hearing is sensitive to frequency on a roughly logarithmic scale: a given change in frequency is much more noticeable for a low-pitched sound than the same change at a higher pitch. The Mel scale attempts to represent how the human auditory system perceives pitch.
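As a rough illustration of the Mel warping, the sketch below converts between Hz and mels. The conversion formula used here (2595 · log10(1 + f/700)) is one commonly quoted variant and is an assumption on our part, as are the function names; other definitions of the Mel scale exist.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels (assumed formula: one common variant of the Mel scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz: float, high_hz: float, count: int) -> list:
    """Return `count` frequencies (in Hz) equally spaced on the Mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]


# The same 100 Hz change covers far more mels at low pitch than at high pitch.
low_step = hz_to_mel(300.0) - hz_to_mel(200.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
print(f"200 -> 300 Hz:   {low_step:.1f} mel")
print(f"8000 -> 8100 Hz: {high_step:.1f} mel")

# Frequencies equally spaced in mels bunch together at the low end in Hz.
print([round(f) for f in mel_spaced_frequencies(0.0, 22050.0, 8)])
```

Frequencies that are evenly spaced in mels end up closely packed at low frequencies and widely spread at high frequencies, which is the warping described above.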
Cepstrum – The cepstrum is a representation of a signal in which the inverse Fourier transform of the log spectrum is computed. A property of the logarithm is that processes that were previously multiplicative become additive, which enables the component parts of a signal to be separated more easily. For example, a speech spectrum can be thought of as the product of the spectrum of the speech source and the frequency response of the vocal tract. The vocal tract produces resonances, or ‘formants’. By computing the cepstrum, the formant and speech source components can be separated out: low ‘quefrency’ components represent the spectral envelope of the formants, and higher components represent the speech source.
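The multiplicative-to-additive property can be checked numerically. The sketch below is a toy example, not a speech model: the ‘source’ is a pulse followed by a weaker repeat 32 samples later, the ‘tract’ is a simple decaying response, and all names and values are made up for illustration. It uses a slow, naive DFT so that it needs only the Python standard library.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (slow, but fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of a real signal."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def circular_convolve(a, b):
    """Circular convolution: its spectrum is the product of the two spectra."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]


N, PERIOD = 128, 32                       # hypothetical toy values
source = [0.0] * N
source[0], source[PERIOD] = 1.0, 0.5      # pulse plus a weaker repeat
tract = [0.6 ** t for t in range(N)]      # smooth, decaying response
signal = circular_convolve(source, tract)

c_signal = real_cepstrum(signal)
c_source = real_cepstrum(source)
c_tract = real_cepstrum(tract)

# Product of spectra -> sum of cepstra.
max_error = max(abs(c_signal[t] - (c_source[t] + c_tract[t])) for t in range(N))
print("largest additive error:", max_error)

# Away from the lowest quefrencies, the largest component marks the pulse spacing.
peak = max(range(8, N // 2), key=lambda t: c_signal[t])
print("high-quefrency peak at sample", peak)
```

In this toy case the smooth response contributes mainly to the first few cepstral values, while the repeat in the source shows up as an isolated peak further along the quefrency axis, which is what makes the two parts easy to separate.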
The Mel-frequency cepstrum is therefore a representation of the spectral envelope of a signal in which the frequency scale is warped to be representative of the human auditory system. Typically this reduces the data in a 20 ms window sampled at 44.1 kHz from 882 samples to 12 MFCCs. This is a very efficient representation, and much of the salient information is preserved.
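To make the data reduction concrete, here is a minimal sketch of a textbook MFCC pipeline applied to a single 20 ms frame at 44.1 kHz (882 samples). It is not the code used in our detector. The Hamming window, the 26 triangular Mel filters, the Mel formula, the DCT-II step and the choice to drop the zeroth coefficient are all assumptions taken from common practice, and the test frame is a synthetic two-tone signal rather than real wind noise.

```python
import cmath
import math

SAMPLE_RATE = 44100   # Hz
FRAME_MS = 20         # analysis window length
NUM_FILTERS = 26      # assumed number of Mel filters
NUM_COEFFS = 12       # MFCCs kept per frame


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # assumed Mel formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return the power at each DFT bin up to Nyquist."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank_energies(power, sample_rate, num_filters):
    """Sum the power under triangular filters equally spaced on the Mel scale."""
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    edges = [mel_to_hz(top_mel * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = nyquist / (len(power) - 1)
    energies = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for b, p in enumerate(power):
            f = b * bin_hz
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)   # falling slope
        energies.append(total)
    return energies


def mfcc(frame, sample_rate=SAMPLE_RATE):
    """Log Mel energies followed by a DCT-II; coefficient 0 is dropped."""
    energies = mel_filterbank_energies(power_spectrum(frame), sample_rate, NUM_FILTERS)
    log_e = [math.log(max(e, 1e-12)) for e in energies]   # floor avoids log(0)
    return [sum(le * math.cos(math.pi * i * (j + 0.5) / NUM_FILTERS)
                for j, le in enumerate(log_e))
            for i in range(1, NUM_COEFFS + 1)]


frame_length = SAMPLE_RATE * FRAME_MS // 1000
frame = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 2000 * t / SAMPLE_RATE)
         for t in range(frame_length)]
coeffs = mfcc(frame)
print(frame_length, "samples ->", len(coeffs), "coefficients")
```

In practice a library routine with a fast Fourier transform would replace the naive transform used here; the slow version is kept only so the sketch has no dependencies.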