Elmar Melcher, Joseana Fechine, Adalberto Teixeira, Jorgeluis Guerra, Karina Medeiros
Center of Electrical Engineering and Informatics CEEI
Campina Grande, PB Brazil
For voice processing it is important to ensure that the signal to be analyzed actually contains relevant information, especially if the system operates in real time. This paper presents an IP-core speech detector for real-time systems that identifies segments of silence or voice and is used to pre-process input signals for Speaker Recognition and Verification Systems. The IP-core was designed to be adaptable to different usage environments and classifies samples as voice or silence based on their energy.
In speech processing systems it is always necessary to pre-process the input signal. Part of this pre-processing consists of identifying the speech itself and the silence segments of the captured voice signal. By silence we mean the segments of the signal that contain only noise, in other words, the voiceless segments. These are usually the initial parts, the segments between phrases and the final parts of the captured signal. Detecting what really is voice and what is not is the purpose of a Speech Detector, also called an Endpoint Detector. This paper presents an IP-core named Voice Detector (VD), a component block of the SPVR project, which was successfully verified, prototyped in FPGA and already fabricated as part of the SPVR ASIC.
The silence segments of the phrase carry no important information for the feature extraction phase of a speaker verification system, so their detection directly affects system performance and the accuracy of the resulting hypotheses. Unless some part of the implementation uses these pieces of silence (for example, to treat or filter the noise present in the utterance), they are useless and have to be discarded as early as possible, both to save processing time in the following steps and to ensure that these segments do not affect decision making. This consideration is even more significant for real-time systems, in which saving computing time is more critical. This is why the first block of a speech processing system is usually a speech detector. Thus, this IP-core can be used to cut these unused silence segments out of the speech signal.
II. ALGORITHM AND IMPLEMENTATION
Basically, the algorithm of this implementation performs the detection based on the energy of subframes, which are sets of 55 audio samples, approximately 5 ms of signal at the sampling frequency of 11,025 Hz used.
There are two basic phases in speech detection: first the IP-core searches for the beginning of the voice; once the beginning is detected, the IP-core switches to searching for the end; once the end is found, it goes back to searching for the beginning, repeating this loop indefinitely.
Figure 01: Voice Detector basic algorithm
Figure 02: Beginning and end points in a speech signal.
The energy calculation is based on the following equation for the energy of a discrete signal:

$$E = \sum_{n=0}^{size-1} x[n]^2$$

where x[n] is a voice sample of a subframe and size is the number of voice samples that make up a subframe (55 voice samples in this case).
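As an illustration only, the subframe energy can be sketched in software as below. This is a minimal floating-point sketch, not the fixed-point RTL of the IP-core; the function names `subframe_energy` and `split_into_subframes` are hypothetical, and dropping a trailing partial subframe is an assumption of the sketch.

```python
SUBFRAME_SIZE = 55        # samples per subframe, as stated in the text
SAMPLE_RATE_HZ = 11_025   # sampling frequency, as stated in the text


def subframe_energy(samples):
    """Return the sum of the squared samples of one subframe."""
    return sum(x * x for x in samples)


def split_into_subframes(signal, size=SUBFRAME_SIZE):
    """Split a sample sequence into consecutive full subframes.

    Assumption of this sketch: a trailing partial subframe is dropped.
    """
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


# One subframe lasts 55 / 11025 s, which is roughly 5 ms.
subframe_duration_ms = 1000 * SUBFRAME_SIZE / SAMPLE_RATE_HZ
```

With samples normalized to [-1:1), the value returned by `subframe_energy` for a full subframe can never exceed 55.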
In the first phase, the search for the beginning of speech, the calculated energy of each subframe is compared with an energy threshold to evaluate how much energy the subframe has. When a certain number of consecutive subframes each have a total energy higher than the threshold, this set of subframes is considered the beginning of the speech. From this moment on the signal samples are no longer discarded: they are output, including the set of subframes detected as the beginning, until the end of the speech is detected.
The search for the end is very similar to the search for the beginning. The difference is that the comparison is now made with another energy threshold, usually not the same as the one used in the previous phase, and the IP-core now looks for a certain number of consecutive subframes with total energy lower than this threshold.
Figure 03: Solution for hysteresis: start and endSubframesAmount.
A set of four thresholds is used as configuration inputs for the VD IP-core: two for energy values (startEnergyThreshold and endEnergyThreshold) and two for the number of consecutive subframes (startSubframesAmount and endSubframesAmount) that must be analyzed to decide on the beginning and end points of the speech.
Due to hysteresis in voice signals, this IP-core uses two different thresholds for subframe counting in the evaluation of the beginning and end points. As depicted in figure 03, if the start threshold were also used for the end, the end point would not be detected correctly, since there is a drop in energy before the speech actually ends.
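The two-phase search with the four thresholds can be sketched as follows. This is a simplified software sketch operating on a list of precomputed subframe energies, not the IP-core implementation; the function name `detect_voice` is hypothetical. How a broken run of candidate subframes is handled (discarded while searching for the beginning, sent while searching for the end) is an assumption, as the text does not state it explicitly.

```python
def detect_voice(energies, start_energy_threshold, end_energy_threshold,
                 start_subframes_amount, end_subframes_amount):
    """Return one boolean per subframe: True if sent, False if discarded."""
    keep = [False] * len(energies)
    searching_start = True
    run = []  # indices of consecutive candidate subframes not yet decided

    for i, energy in enumerate(energies):
        if searching_start:
            if energy < start_energy_threshold:
                run.clear()  # run broken: candidates stay discarded (assumption)
            else:
                run.append(i)
                if len(run) >= start_subframes_amount:
                    for j in run:  # the beginning includes the triggering subframes
                        keep[j] = True
                    run.clear()
                    searching_start = False
        else:
            if energy > end_energy_threshold:
                for j in run:  # run broken: pending subframes are sent (assumption)
                    keep[j] = True
                run.clear()
                keep[i] = True
            else:
                run.append(i)
                if len(run) >= end_subframes_amount:
                    run.clear()  # end found: these subframes stay discarded
                    searching_start = True
    # Subframes still pending when the input ends are left undecided (False).
    return keep
```

For example, with the made-up energies `[0.1, 0.1, 3.0, 3.0, 3.0, 1.0, 3.0, 0.5, 0.5, 0.1, 0.1]`, a start energy threshold of 2.0, an end energy threshold of 0.8 and both subframe amounts set to 2, the subframes at indices 2 to 6 are kept. The subframe with energy 1.0 stays inside the speech segment because it is compared with the lower end threshold.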
As the sample values are normalized to [-1:1), the maximum interval for the energy of a subframe is [0:55]. However, the interval chosen for the energy thresholds is [0:4), since the energy of the voiceless segments does not rise above this, provided that the voice was acquired acceptably and the noise levels are low. This choice is justified by the fact that the VD IP-core uses a fixed-point representation for the samples and the energy thresholds, and fewer bits for the integer part means more bits for the fractional part of the number, increasing the precision of the representation.
The number of subframes required before a beginning or end decision is within [0:25]. If the two energy thresholds and/or the two subframe thresholds are set to zero, the VD is disabled and the whole signal passes through it.
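The trade-off between integer and fractional bits can be illustrated with a short calculation. The 16-bit unsigned word assumed below is an assumption of this sketch (chosen to match the 16-bit sample memory); the text does not state the exact fixed-point format of the thresholds. The helper names are hypothetical.

```python
import math

WORD_BITS = 16  # assumption: word width, matching the 16-bit sample memory


def fractional_bits(upper_bound, word_bits=WORD_BITS):
    """Fractional bits left in an unsigned word covering [0, upper_bound).

    The bound is treated as exclusive, so a power of two such as 4
    needs exactly log2(4) = 2 integer bits.
    """
    integer_bits = max(0, math.ceil(math.log2(upper_bound)))
    return word_bits - integer_bits


def quantize(value, frac_bits):
    """Round a non-negative value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


frac_narrow = fractional_bits(4)   # thresholds restricted to [0:4)
frac_wide = fractional_bits(55)    # thresholds covering the full energy range
```

Under this assumption, restricting the thresholds to [0:4) leaves 14 fractional bits instead of 10, so the threshold resolution is 16 times finer (2^-14 instead of 2^-10).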
Figure 04: Voice Detector inputs and output.
The VD IP-core uses fixed-point representation for all real numbers. Both input data and output data are represented in fixed-point. The four thresholds share the same input, even though they have different representations (the energy thresholds are fixed-point while the subframe-counting thresholds are integers). The two input protocols and the output protocol are AMBA AXI 3.0 compliant.

Usually in non-real-time systems, the speech signal is fully recorded and then the maximum and minimum energy of the signal are calculated. In this implementation, the IP-core searches for speech and silence while the voice is still being sampled and digitized, so the thresholds have to be chosen beforehand, using recorded voice samples and selecting the values that best separate speech from silence. This has to be redone if the environment changes, but the threshold selection process can be easily automated using the reference model of the IP-core. Thus, real-time processing makes the system more environment-dependent, but the ability to change the thresholds makes it adaptable to different environments.
A. FINITE-STATE MACHINE (FSM)
After reset, the first state of the FSM is RECEIVE_SAMPLE. In this state the voice samples are received and always stored in the memory, since all samples are analysed, either for the beginning or for the end evaluation. Whenever the number of stored samples fills a subframe, the FSM changes to the SUBFRAME_ANALYSIS state. If there is no valid sample to be stored and there are samples to be sent, the state changes to PREPARE_SEND; otherwise, the FSM remains in RECEIVE_SAMPLE.
Figure 05: VD finite state machine.
In the SUBFRAME_ANALYSIS state, the energy of each subframe is analysed. If the IP-core is searching for the beginning of speech and the energy of the subframe is lower than the higherEnergyThreshold, the subframe is marked to be discarded; otherwise the IP-core waits until the number of subframes required by startSubframesAmount is reached. When this number is reached, the beginning of the speech has been found, so the IP-core switches to searching for the end and marks these subframes as valid samples to be sent. If the IP-core is searching for the end of speech and the energy of the subframe is higher than the lowerEnergyThreshold, the subframe is marked to be sent; otherwise the IP-core waits until the number of subframes required by endSubframesAmount is reached. When this number is reached, the end of the speech has been found, so the IP-core switches back to searching for the beginning and marks these subframes as samples to be discarded. After the analysis of the subframe, the FSM goes back to the RECEIVE_SAMPLE state.
The PREPARE_SEND state only prepares a sample to be sent: the sample is transferred from the memory to the output and the protocol is activated. Then, in the TRY_SEND state, the VD IP-core tries to send the sample and then goes back to the RECEIVE_SAMPLE state. If the sample is not sent, the same sample will be used in the next attempt.
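The state transitions described above can be summarized in a small software sketch. The state names come from the text; the condition names (`subframe_full`, `input_valid`, `has_samples_to_send`) are hypothetical, and giving the full-subframe condition priority over sending is an assumption of this sketch. The actual IP-core is SystemVerilog RTL and handles the memory and protocol signals that are omitted here.

```python
from enum import Enum, auto


class State(Enum):
    RECEIVE_SAMPLE = auto()
    SUBFRAME_ANALYSIS = auto()
    PREPARE_SEND = auto()
    TRY_SEND = auto()


def next_state(state, subframe_full=False, input_valid=False,
               has_samples_to_send=False):
    """Return the next FSM state for the given current state and conditions."""
    if state is State.RECEIVE_SAMPLE:
        if subframe_full:
            # Enough samples stored to analyse one subframe.
            return State.SUBFRAME_ANALYSIS
        if not input_valid and has_samples_to_send:
            # Nothing to store right now, so use the time to send.
            return State.PREPARE_SEND
        return State.RECEIVE_SAMPLE
    if state is State.SUBFRAME_ANALYSIS:
        return State.RECEIVE_SAMPLE
    if state is State.PREPARE_SEND:
        return State.TRY_SEND
    # TRY_SEND: whether or not the sample was accepted, go back to receiving;
    # an unsent sample is simply offered again on the next attempt.
    return State.RECEIVE_SAMPLE
```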
A SPRAM of 4096x16 bits was used as a circular buffer to store the samples while they are evaluated and are waiting to be sent. The minimal size of the SPRAM is estimated by the equation:
Min memory size = Subframe size * (startSubframesAmount + endSubframesAmount)
The SPRAM was deliberately oversized: we chose a size about 50% greater than that given by the previous equation, capable of storing almost 3 times the number of voice samples required for the detection of the beginning or of the end.
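These figures can be checked with a short calculation. The sketch assumes that both subframe amounts are set to 25, the upper end of the configuration range given earlier; the function and constant names are hypothetical.

```python
SUBFRAME_SIZE = 55          # samples per subframe
MAX_SUBFRAMES_AMOUNT = 25   # upper end of the [0:25] configuration range
SPRAM_WORDS = 4096          # depth of the 4096x16 SPRAM


def min_memory_size(start_subframes_amount, end_subframes_amount,
                    subframe_size=SUBFRAME_SIZE):
    """Minimal number of sample words, following the equation in the text."""
    return subframe_size * (start_subframes_amount + end_subframes_amount)


# Worst case: both amounts at their maximum (assumption of this sketch).
worst_case = min_memory_size(MAX_SUBFRAMES_AMOUNT, MAX_SUBFRAMES_AMOUNT)

# Ratio of the SPRAM depth to the worst-case minimum (about 1.49).
margin = SPRAM_WORDS / worst_case

# Ratio of the SPRAM depth to the samples of one detection window (about 2.98).
windows = SPRAM_WORDS / (SUBFRAME_SIZE * MAX_SUBFRAMES_AMOUNT)
```

Under this assumption the minimum is 2750 words, so the 4096-word SPRAM is roughly 50% larger and holds almost 3 times the 1375 samples of a single start or end window.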
To manage the receiving, the sending and the discarding of the samples stored in memory, four pointers are used: receive_address (points to the current position where the samples are being stored), send_address (points to the next sample to be sent), valid_address (indicates the address of the next valid samples, after a block of discarded samples) and jump_address (address with the next valid samples when purge_pending flag is active).
Figure 06: Memory
III. DEVELOPMENT METHODOLOGY
The VD IP-core was implemented in SystemVerilog RTL using the same development process as SPVR itself and its other blocks: the ipPROCESS. The functional verification followed the Brazil-IP Verification Methodology (BVM), which uses the OVM library. For the verification, a transaction-level (TL) object-oriented reference model was implemented, also in SystemVerilog, as was the entire testbench. Thereafter, the back-end was carried out for a 0.35 μm fabrication technology.
The functional verification stimuli of the VD IP-core were sinusoidal, quadratic, sawtooth and noisy signals with pseudo-random amplitudes and frequencies varying during the simulation, as well as real voice files. The verification process ensures that the project complies with the specification and, moreover, allowed the netlist generated from the layout to be re-verified after each step of the back-end phase, to ensure that no functional error or timing violation was introduced into the design.
We have obtained the following results for the VD IP-core:
- Total logic elements: 426
- Total inferred memory bits: 65,536
Figure 07 shows the area occupied by the VD block in the SPVR layout:
Figure 07: SPVR layout
The total core area of the SPVR layout is 41 mm².
The Speech Detector IP-core presented has proved to be an efficient endpoint detector, provided that appropriate thresholds are used. Thus, it can be used in different environments. By observing only the signal behavior, it is possible to distinguish significant samples from non-significant ones, making the selection in real time and thereby optimizing the processing of the whole system. It is a good solution for integration in voice processing systems, having been validated by the reliable methodology applied in its development and confirmed by the good results obtained.
 OPPENHEIM, Alan V.; WILLSKY, Alan S.; NAWAB, S. Hamid (1998). Signals and Systems. Pearson Education; 2nd Edition, 1996.
 FECHINE, J. M.; TEIXEIRA JÚNIOR, A. G.; MELO, F. G. L.; ESPINOLA, S.; PAIXÃO, L. L. SPVR: An IP-core for Real-Time Speaker Verification. In: IP Based SoC Design Conference & Exhibition, 2010, Grenoble.
 LIMA, M., Aziz, A., Alves, D., Lira, P., Schwambach, V., Barros, E. (2005), ipPROCESS: Using a Process to Teach IP-core Development. IEEE, 0-7695-2374-9/05.
 OLIVEIRA, Helder Fernando de Araujo (2010). BVM: Reformulação da Metodologia de Verificação Funcional VeriSC. Dissertation - Federal University of Campina Grande.
 GLASSER, Mark. Open Verification Methodology Cookbook. Springer; 1st Edition, 2009.
 RABINER, L. R.; SCHAFER, R. W. (1978), Digital Processing of Speech Signals. Prentice Hall, Upper Saddle River, New Jersey.
 QI LI; Jinsong Zheng; Tsai, A.; Qiru Zhou (2002), Robust endpoint detection and energy normalization for real-time speech and speaker recognition. Multimedia Commun. Res. Lab, Lucent Technol. Bell Labs, Murray Hill, NJ.
 AMBA AXI Protocol v1.0 Specification (ARM IHI 0022).