14 posts in 'Project/음성평가시스템' (Speech Evaluation System)

Mel Frequency Cepstral Coefficient (MFCC) tutorial

The first step in any automatic speech recognition system is to extract features i.e. identify the components of the audio signal that are good for identifying the linguistic content and discarding all the other stuff which carries information like background noise, emotion etc.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract including tongue, teeth etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to accurately represent this envelope. This page will provide a short tutorial on MFCCs.

Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980's, and have been state-of-the-art ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were the main feature types for automatic speech recognition (ASR). This page will go over the main aspects of MFCCs, why they make a good feature for ASR, and how to implement them.

Steps at a Glance

We will give a high level intro to the implementation steps, then go in depth why we do the things we do. Towards the end we will go into a more detailed description of how to calculate MFCCs.

Frame the signal into short frames.
For each frame calculate the periodogram estimate of the power spectrum.
Apply the mel filterbank to the power spectra, sum the energy in each filter.
Take the logarithm of all filterbank energies.
Take the DCT of the log filterbank energies.
Keep DCT coefficients 2-13, discard the rest.

There are a few more things commonly done. Sometimes the frame energy is appended to each feature vector. Delta and Delta-Delta features are usually also appended. Liftering is also commonly applied to the final features.

Why do we do these things?

We will now go a little more slowly through the steps and explain why each of the steps is necessary.

An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn't change much (when we say it doesn't change, we mean statistically, i.e. it is statistically stationary; obviously the samples themselves are constantly changing even on short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don't have enough samples to get a reliable spectral estimate; if it is longer the signal changes too much throughout the frame.

The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear) which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.

The periodogram spectral estimate still contains a lot of information not required for Automatic Speech Recognition (ASR). In particular the cochlea cannot discern the difference between two closely spaced frequencies. This effect becomes more pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions. This is performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations. We are only interested in roughly how much energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and how wide to make them. See below for how to calculate the spacing.

Once we have the filterbank energies, we take the logarithm of them. This is also motivated by human hearing: we don't hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear. Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.
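To make the remark about cepstral mean subtraction a little more concrete, here is a minimal sketch in Python. The idea it illustrates: a fixed channel multiplies the spectrum, so after the logarithm it becomes a constant offset added to every frame, and subtracting the per-coefficient mean over time removes it. The function name `cepstral_mean_subtraction` and the toy numbers are assumptions for illustration only.

```python
def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean over all frames.

    features: list of frames, each a list of log-domain coefficients.
    """
    num_frames = len(features)
    dim = len(features[0])
    means = [sum(frame[j] for frame in features) / num_frames for j in range(dim)]
    return [[frame[j] - means[j] for j in range(dim)] for frame in features]

# Toy example (made-up numbers): the same two frames, once "clean" and once
# with a constant log-domain offset standing in for a channel effect.
clean = [[1.0, 2.0], [3.0, 6.0]]
offset = [0.5, -1.0]
coloured = [[c + o for c, o in zip(frame, offset)] for frame in clean]

normalised_clean = cepstral_mean_subtraction(clean)
normalised_coloured = cepstral_mean_subtraction(coloured)
```

After normalisation the two feature sets agree, which is the property that a cube-root compression would not give us.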

The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons this is performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies which means diagonal covariance matrices can be used to model the features in e.g. a HMM classifier. But notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.

What is the Mel scale?

The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear.

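The formula for converting from frequency to Mel scale is:

$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right) \qquad (1)$

To go from Mels back to frequency:

$M^{-1}(m) = 700 \left(\exp\left(\frac{m}{1125}\right) - 1\right) \qquad (2)$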
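As a quick check of the two conversions, here is a small Python sketch. It assumes the common form of the Mel scale with the constants 1125 and 700, which reproduces the values quoted later on this page for 300Hz and 8000Hz to within rounding. The function names are my own.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in Hz to Mels, assuming M(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Mels back to Hz, the inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

low_mel = hz_to_mel(300.0)     # close to the 401.25 Mels quoted in the text
high_mel = hz_to_mel(8000.0)   # close to the 2834.99 Mels quoted in the text
round_trip = mel_to_hz(hz_to_mel(1000.0))  # should give back 1000 Hz
```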

Implementation steps

We start with a speech signal, we'll assume sampled at 16kHz.

1. Frame the signal into 20-40 ms frames. 25ms is standard. This means the frame length for a 16kHz signal is 0.025*16000 = 400 samples. Frame step is usually something like 10ms (160 samples), which allows some overlap to the frames. The first 400 sample frame starts at sample 0, the next 400 sample frame starts at sample 160 etc. until the end of the speech file is reached. If the speech file does not divide into a whole number of frames, pad it with zeros so that it does.
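The framing in step 1 can be sketched as follows. This is only an illustration under the numbers given above (400 sample frames, 160 sample step, zero padding at the end); `frame_signal` is a hypothetical helper name, and the test signal is just a ramp so the frame boundaries are easy to see.

```python
def frame_signal(signal, frame_len=400, frame_step=160):
    """Split a list of samples into overlapping frames, zero-padding the tail."""
    n = len(signal)
    if n <= frame_len:
        num_frames = 1
    else:
        # one frame at sample 0, plus ceil((n - frame_len) / frame_step) more
        num_frames = 1 + -(-(n - frame_len) // frame_step)
    padded_len = (num_frames - 1) * frame_step + frame_len
    padded = list(signal) + [0.0] * (padded_len - n)
    return [padded[i * frame_step: i * frame_step + frame_len]
            for i in range(num_frames)]

# 1000 samples whose value equals their index.
frames = frame_signal([float(i) for i in range(1000)])
```

With 1000 input samples this gives 5 frames; the second frame starts at sample 160, and the last frame ends in padding zeros.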

The next steps are applied to every single frame; one set of 12 MFCC coefficients is extracted for each frame. A short aside on notation: we call our time domain signal $s(n)$. Once it is framed we have $s_i(n)$ where $n$ ranges over 1-400 (if our frames are 400 samples) and $i$ ranges over the number of frames. When we calculate the complex DFT, we get $S_i(k)$, where the $i$ denotes the frame number corresponding to the time-domain frame. $P_i(k)$ is then the power spectrum of frame $i$.

2. To take the Discrete Fourier Transform of the frame, perform the following:

$S_i(k) = \sum_{n=1}^{N} s_i(n) h(n) e^{-j 2\pi k n / N}, \qquad 1 \le k \le K$

where $h(n)$ is an $N$ sample long analysis window (e.g. hamming window), and $K$ is the length of the DFT. The periodogram-based power spectral estimate for the speech frame $s_i(n)$ is given by:

$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2$

This is called the Periodogram estimate of the power spectrum. We take the absolute value of the complex Fourier transform, and square the result. We would generally perform a 512 point FFT and keep only the first 257 coefficients.
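Step 2 can be sketched in plain Python as below. It is a slow, direct DFT written for readability rather than speed (a real implementation would call an FFT routine). The assumptions are mine: a symmetric Hamming window, zero padding of the 400 sample frame up to the 512 point DFT, and division by the frame length as the normalisation.

```python
import cmath
import math

def hamming(n_samples):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def periodogram(frame, nfft=512):
    """First nfft//2 + 1 bins of |DFT(window * frame)|^2 / len(frame)."""
    n_samples = len(frame)
    if n_samples > nfft:
        raise ValueError("frame must not be longer than the DFT length")
    windowed = [s * w for s, w in zip(frame, hamming(n_samples))]
    windowed += [0.0] * (nfft - n_samples)  # zero-pad up to the DFT length
    power = []
    for k in range(nfft // 2 + 1):
        acc = 0j
        for n, x in enumerate(windowed):
            acc += x * cmath.exp(-2j * math.pi * k * n / nfft)
        power.append(abs(acc) ** 2 / n_samples)
    return power

# One 25 ms frame of a 1000 Hz sine sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(400)]
power = periodogram(frame)
peak_bin = max(range(len(power)), key=lambda k: power[k])
```

For this test tone the largest value lands in bin 32, since 1000 / 16000 * 512 = 32.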

3. Compute the Mel-spaced filterbank. This is a set of 20-40 (26 is standard) triangular filters that we apply to the periodogram power spectral estimate from step 2. Our filterbank comes in the form of 26 vectors of length 257 (assuming the FFT settings from step 2). Each vector is mostly zeros, but is non-zero for a certain section of the spectrum. To calculate filterbank energies we multiply each filterbank with the power spectrum, then add up the coefficients. Once this is performed we are left with 26 numbers that give us an indication of how much energy was in each filterbank. For a detailed explanation of how to calculate the filterbanks see below. Here is a plot to hopefully clear things up:

Plot of Mel Filterbank and windowed power spectrum

4. Take the log of each of the 26 energies from step 3. This leaves us with 26 log filterbank energies.

5. Take the Discrete Cosine Transform (DCT) of the 26 log filterbank energies to give 26 cepstral coefficients. For ASR, only the lower 12-13 of the 26 coefficients are kept.

The resulting features (12 numbers for each frame) are called Mel Frequency Cepstral Coefficients.
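Steps 4 and 5 are short enough to sketch directly. The version below assumes an unnormalised DCT-II (many toolkits apply an extra orthonormal scaling), a small floor to avoid taking the log of zero, and it keeps coefficients 2-13 in 1-based counting, i.e. it drops the first (DC) term. The function names are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def mfcc_from_energies(filterbank_energies, num_kept=12):
    """Log-compress the filterbank energies, take the DCT, keep coefficients 2..13."""
    floor = 1e-10  # assumption: guard against log(0) for empty filters
    log_energies = [math.log(max(e, floor)) for e in filterbank_energies]
    cepstrum = dct_ii(log_energies)
    return cepstrum[1:1 + num_kept]  # skip the first (DC) coefficient

# 26 made-up filterbank energies, just to show the shapes involved.
energies = [1.0 + 0.1 * i for i in range(26)]
mfcc = mfcc_from_energies(energies)
flat = mfcc_from_energies([2.0] * 26)  # a flat spectrum has no shape to encode
```

A completely flat set of energies gives kept coefficients that are all (numerically) zero, because only the dropped DC term responds to overall level.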

Computing the Mel filterbank

To get the filterbanks shown in figure 1(a) we first have to choose a lower and upper frequency. Good values are 300Hz for the lower and 8000Hz for the upper frequency. Of course if the speech is sampled at 8000Hz our upper frequency is limited to 4000Hz. Then follow these steps:

Using equation 1, convert the upper and lower frequencies to Mels. In our case 300Hz is 401.25 Mels and 8000Hz is 2834.99 Mels.
For this example we will do 10 filterbanks, for which we need 12 points. This means we need 10 additional points spaced linearly between 401.25 and 2834.99. This comes out to:
```
m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 
       1949.99, 2171.24, 2392.49, 2613.74, 2834.99
```
Now use equation 2 to convert these back to Hertz:
```
f(i) = 300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 
       3261.62, 4122.63, 5170.76, 6446.70, 8000
```
Notice that our start- and end-points are at the frequencies we wanted.
Now we create our filterbanks. The first filterbank will start at the first point, reach its peak at the second point, then return to zero at the 3rd point. The second filterbank will start at the 2nd point, reach its max at the 3rd, then be zero at the 4th etc. A formula for calculating these is as follows:
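$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\ 0 & k > f(m+1) \end{cases}$
where $M$ is the number of filters we want, and $f()$ is the list of M+2 Mel-spaced frequencies.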
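The steps above can be sketched in Python as follows. The Mel conversion constants (1125 and 700) are the ones that reproduce the numbers in this worked example. One detail is an assumption on my part and is not spelled out above: to build filters over FFT bins, each frequency is mapped to a bin index with floor((nfft + 1) * f / sample_rate). All function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def mel_points(low_hz, high_hz, n_filters):
    """n_filters + 2 frequencies (Hz) spaced linearly on the Mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]

def mel_filterbank(n_filters=10, nfft=512, sample_rate=16000,
                   low_hz=300.0, high_hz=8000.0):
    """Triangular filters as n_filters lists of length nfft//2 + 1."""
    # Assumed mapping from Hz to FFT bin index.
    bins = [int(math.floor((nfft + 1) * f / sample_rate))
            for f in mel_points(low_hz, high_hz, n_filters)]
    bank = [[0.0] * (nfft // 2 + 1) for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):          # rising edge
            bank[m - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right + 1):     # peak and falling edge
            if right > centre:
                bank[m - 1][k] = (right - k) / (right - centre)
    return bank

points = mel_points(300.0, 8000.0, 10)
bank = mel_filterbank()
```

`points` comes out very close to the f(i) list above (small differences are rounding), and each filter rises from zero at its left point to 1 at its centre and falls back to zero at its right point.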

The final plot of all 10 filters overlaid on each other is:

Plot of 10 filter Mel Filterbank: a Mel filterbank containing 10 filters. This filterbank starts at 0Hz and ends at 8000Hz, so it is a guide only; the worked example above starts at 300Hz.

Deltas and Delta-Deltas

Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but it seems like speech would also have information in the dynamics i.e. what are the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance by quite a bit (if we have 12 MFCC coefficients, we would also get 12 delta coefficients, which would combine to give a feature vector of length 24).

To calculate the delta coefficients, the following formula is used:

$d_t = \dfrac{\sum_{n=1}^{N} n \left( c_{t+n} - c_{t-n} \right)}{2 \sum_{n=1}^{N} n^2}$

where $d_t$ is a delta coefficient, from frame $t$ computed in terms of the static coefficients $c_{t+N}$ to $c_{t-N}$. A typical value for $N$ is 2. Delta-Delta (Acceleration) coefficients are calculated in the same way, but they are calculated from the deltas, not the static coefficients.
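A small Python sketch of the delta computation, using the usual regression form: a weighted sum of differences between frames n steps ahead and n steps behind, divided by twice the sum of n squared, with N = 2. How to treat the first and last frames is not specified above, so repeating the edge frames is an assumption here, and `delta` is a hypothetical function name.

```python
def delta(features, N=2):
    """Delta coefficients for a list of frames (each frame is a list of floats).

    Frames before the start or after the end are replaced by the nearest
    existing frame (an assumption about edge handling).
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(num_frames):
        row = []
        for j in range(len(features[t])):
            acc = 0.0
            for n in range(1, N + 1):
                ahead = features[min(t + n, num_frames - 1)][j]
                behind = features[max(t - n, 0)][j]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

# A coefficient that rises by exactly 1 per frame has a delta of 1
# away from the edges; a constant coefficient has a delta of 0.
ramp = [[float(t), 5.0] for t in range(6)]
d = delta(ramp)
dd = delta(d)  # delta-deltas are the same operation applied to the deltas
```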

References

Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366.

Huang, X., Acero, A. and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.


Posted by 십자성군

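Feature Extraction 2. STFT of English Speech

We go through the Matlab code and explain it as we proceed.

Things like the Hamming window are available as Matlab functions, but they are implemented by hand here with the later Java implementation in mind.

The speech sample used is the file umbrella.wav.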

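%test STFT

clear

[Y,FS,NBITS, Readinfo]=wavread('umbrella.wav');

%Load the file umbrella.wav.

%Note: wavread has been removed from newer Matlab releases; audioread and audioinfo replace it.

%Y : signal samples

%FS : sampling frequency, 22050Hz

%NBITS: BPS (bits per sample), 16 bit

%Readinfo : file header structure containing assorted information

sound(Y,FS); %play the sound

len=length(Y); %length of the speech signal in samples

time=len/FS; %playback time of the file: 22050 samples are captured per second, so if Y has 20160 samples the playback time is about 0.91 s

figure;

x=linspace(0,time,len); %time axis of the plot: 0 to about 0.91 s, divided into as many points as the signal has samples

plot(x,Y); %plot the waveform

%start stft

y=Y(1:len,1); %copy the first channel of the speech signal into y

%% windowing

R =2^9; % R: window length used for windowing

W =0.54-0.46*cos(2*pi*(1:R)/R); W=W'; % W: Hamming window, computed from its formula, as a column vector

N = 2^(nextpow2(length(W)));%2^9; % N: FFT resolution, i.e. the frequency resolution, which controls how sharp the plot looks

L = ceil(R*0.1); % L: number of non-overlap samples

overlap = R-L; % the larger the overlap, the better the resolution (whether the plot looks hollow or filled in); with R=2^9 this evaluates to 460

%%Type1

c=1;

h=(1+(len-R))/(1+R/2); %hop size, must be an integer. The lower it is, the higher the time resolution. It is derived from R so that an integer is picked automatically

h=fix(h*3/3); %truncate to an integer. FS/h is the number of frames per second (the value referred to as overlap below); the overlap in samples is R-h

d = zeros((1+R/2),1+fix((len-R)/h)); %one row per frequency bin, one column per frame

for b = 0:h:(len-R) %(b+1:b+R) has the same length as the window

u = W.*y((b+1):(b+R)); %multiply the frame by the Hamming window

t = fft(u); %analyse the windowed frame in the frequency domain with the fft

d(:,c) = t(1:(1+R/2)); %store the values for each window position

c = c+1;

end;

tvec=linspace(0,time,size(d,2)); %time vector: the playback time divided to match the size of the plot array (the original used an undefined variable timeleng here)

freq=linspace(0,FS/2,size(d,1)); %frequency vector, same idea as above

[X,Y]=meshgrid(tvec,freq); %create the two 2-D planes X and Y needed for a 3-D plot

figure;

mesh(X,Y,abs(d)); %draw the spectrogram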

STFT Type1

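%%Type2

%Using the built-in Matlab function

[S F T] = spectrogram(y,W,overlap,N,FS);

%S : short-time Fourier transform of the signal (complex matrix)

%F : array of frequencies that can be observed

%T : array of time instants that can be observed. Its resolution depends on the overlap

%and also on R (the window size)

%R>Overlap

Z=abs(S);

[mX,mY]=meshgrid(T,F); %create the 2-D planes mX and mY

figure;

mesh(mX,mY,Z);

figure;

contour(mX,mY,Z); %draw the contour plot

STFT_Type2

Contour graph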

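Right now you can see by eye that the accuracy(?) of the Type1 and Type2 spectrograms differs... although it may of course just look that way because of the viewing angle.

For Type1, the value the code comment calls the overlap, FS/h, comes out to about 290. Strictly speaking this is the number of frames per second; the overlap in samples is R-h.

For Type2, on the other hand, the overlap was set to 490 samples (as listed above, R-L evaluates to 460).

Let's change the Type1 expression for h, h=fix(h*3/3);, as follows.

h=2^(nextpow2(h)); The overlap then comes out to about 344. And the graph is

The left one is Type1 and the right one is Type2 with the same overlap as before. Almost identical, right?

There is no particular need to use nextpow2. Lowering h raises the overlap. The larger the overlap, the denser and less hollow the graph becomes.

For the limits on arguments such as the overlap value passed to the Matlab function, please try the function yourself to check.

N above is nfft, the FFT resolution. If this value is low the graph becomes jagged and spiky, because it is less continuous.

R is the window length, and with the Matlab function this value also has to be chosen with some thought. A higher value can hold more information, but if it is too large the amount of computation becomes excessive.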
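To make the relationship between the window length R, the hop size h and the FFT length N concrete, here is a small Python sketch. The numbers are assumptions taken from the listing: FS = 22050, R = N = 512, and a 20160 sample signal, for which the Type1 formula gives h = fix(19649/257) = 76. The function name `stft_layout` is hypothetical.

```python
def stft_layout(num_samples, sample_rate, window_len, hop, nfft):
    """Summarise how window length, hop size and FFT length shape an STFT grid."""
    return {
        "overlap_samples": window_len - hop,                 # samples shared by neighbouring frames
        "num_frames": 1 + (num_samples - window_len) // hop, # columns of the spectrogram
        "frames_per_second": sample_rate / hop,              # density along the time axis
        "bin_spacing_hz": sample_rate / nfft,                # density along the frequency axis
    }

# Type1 with the hop derived from the signal length (assumed 20160 samples).
type1 = stft_layout(num_samples=20160, sample_rate=22050,
                    window_len=512, hop=76, nfft=512)

# A smaller hop, here the L = 52 non-overlapping samples passed to spectrogram.
type2 = stft_layout(num_samples=20160, sample_rate=22050,
                    window_len=512, hop=52, nfft=512)
```

A smaller hop means more overlap and more frames, which is why the plot looks denser, while the spacing of the frequency bins only changes with N.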

STF

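This is a two-dimensional power spectrogram. It would be nice to be able to implement this myself, but I have not quite managed that yet.

That part will be implemented later, so that is it for the STFT today!

The next post applies the Mel filter bank.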


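Feature Extraction 1. Preemphasis

Let's apply preemphasis to the English pronunciation of "Apple".

Problem: the sample data was recorded in a soundproofed space, so preemphasis matters less. It would have been better if there were some light coughing and the like mixed in..

Addendum: preemphasis and noise removal are postponed until later.

Feature extraction comes first (because the audio source is clean); noise removal and preemphasis will be done after the similarity evaluation.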
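For reference, preemphasis is usually a one-line first-order high-pass filter, y(n) = x(n) - a * x(n-1). The sketch below is in Python rather than Matlab, and the coefficient 0.97 is a commonly used value that I am assuming here; the post does not specify one.

```python
def preemphasis(signal, coeff=0.97):
    """First-order preemphasis: y[n] = x[n] - coeff * x[n-1], with y[0] = x[0]."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out

# A constant (zero-frequency) input is almost entirely removed after the first sample,
# while a rapidly alternating input is boosted.
flat = preemphasis([1.0, 1.0, 1.0, 1.0])
alternating = preemphasis([1.0, -1.0, 1.0, -1.0])
```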


Project Workflow

Implement in Matlab and establish that the results are reliable, then reimplement in Java.


Flow_chart

Design section. Incomplete.


십자성군의 비밀의 방
