Roughly two types of systems for voice synthesis have been proposed. One is based on the time domain pitch synchronous overlap-add (TD-PSOLA [2]), which synthesizes a voice using the short time waveform directly extracted from the input signal. The other is based on a vocoder [3], which analyzes a voice in terms of its pitch (fundamental frequency; F0) and timbre (spectral envelope) and synthesizes it with the estimated parameters.
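TD-PSOLA synthesizes speech directly from short-time waveforms extracted from the input signal; a vocoder analyzes the pitch (fundamental frequency) and timbre (spectral envelope) of the speech and synthesizes it from the estimated parameters.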
TD-PSOLA and vocoders have trade-offs. TD-PSOLA synthesizes voice with better quality than vocoders; however, vocoders can manipulate pitch and voice timbre independently.
TD-PSOLA gives better synthesis quality, but a vocoder can control pitch and timbre independently.
STRAIGHT [7] and TANDEM-STRAIGHT [8] have been proposed to solve this problem. They use pitch synchronous analysis [9] to improve the estimation performance of the spectral envelope.
What is pitch synchronous analysis? To be continued.
Furthermore, aperiodicity is used as a parameter to represent not only the periodic signal but also the aperiodic signal.
AP (aperiodicity) can represent both periodic and aperiodic signals.
In STRAIGHT and TANDEM-STRAIGHT, the aperiodicity is defined as the spectrum to synthesize both periodic and aperiodic signals. The periodic and the aperiodic spectra are calculated using the spectral envelope and aperiodicity, and the periodic and aperiodic signals are individually calculated.
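AP is a spectrum from which both periodic and aperiodic signals can be generated. The periodic and aperiodic spectra are computed from the spectral envelope and AP, and the periodic and aperiodic signals are computed separately.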
This approach cannot represent the phase of the input voice because the periodic signal is calculated as the minimum phase response, and the vocal tract response h(t) generally includes not only minimum phase response but also maximum phase response. To accurately synthesize a voice, it is essential to extract the phase of the input voice. We used a waveform-based parameter as a new parameter instead of aperiodicity.
To be continued.
PLATINUM extracts the waveform-based parameter to reconstruct the input voice.
**PLATINUM extracts a waveform-based parameter and uses it to reconstruct the input voice.**
The proposed system equals vocoder-based systems except that it uses the excitation signal instead of aperiodicity, which therefore suggests that it is possible for the proposed system to independently manipulate the F0 and spectral envelope like vocoder-based systems.
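This system is equivalent to vocoder-based systems, except that it replaces AP with an excitation signal, which suggests that it can control pitch (F0) and timbre (spectral envelope) independently.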
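To make the idea of independent control concrete, here is a minimal sketch, not the PLATINUM implementation. It assumes a constant F0 and a single fixed response per pitch period; the function name `synthesize` and the toy values are hypothetical. The same response is overlap-added at different pitch intervals, so the pitch changes while the response (and hence the timbre it encodes) stays the same.

```python
def synthesize(response, f0, fs, n_samples):
    """Overlap-add one fixed response at every pitch period (constant F0)."""
    period = round(fs / f0)  # samples between successive pitch marks
    out = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for i, value in enumerate(response):
            if start + i < n_samples:  # drop samples past the end of the buffer
                out[start + i] += value
    return out


# Toy response standing in for "excitation convolved with vocal tract response".
response = [1.0, 0.5]
low = synthesize(response, f0=1000, fs=8000, n_samples=24)   # period of 8 samples
high = synthesize(response, f0=2000, fs=8000, n_samples=24)  # period of 4 samples
```

Doubling `f0` halves the spacing between copies of the response without touching the response itself. A real system would also follow a time-varying F0 contour and use a different response for each frame.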
The observed spectrum Y(ω) is defined as the product of the spectral envelope H(ω) and the target spectrum X(ω) for reconstructing the waveform, that is, Y(ω) = H(ω) X(ω). The target spectrum X(ω) is therefore given by X(ω) = Y(ω) / H(ω). Since the phase of H(ω) for vocoder-based systems is generally the minimum phase, the maximum phase of the input voice is included in X(ω). The power of X(ω) is nearly flat, provided that the spectral envelope is accurately estimated. If H(ω) does not include any zeros, the inverse spectrum can be calculated reliably.
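The observed spectrum Y is the product of the spectral envelope H and the target spectrum X used to reconstruct the waveform, so X is obtained by dividing Y by H. Since the phase of the spectral envelope is minimum phase, the maximum-phase part of the input signal ends up in X. The power of X is nearly flat as long as the spectral envelope is estimated accurately. If H contains no zeros, its inverse can be computed reliably.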
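The division of Y by H can be illustrated with a small, self-contained sketch. It uses a naive DFT from the standard library and a toy two-tap minimum-phase response `h`. In the actual method H would come from the estimated spectral envelope, so all names and values here are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, O(n^2), fine for tiny examples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_convolve(a, b):
    """Circular convolution, the time-domain counterpart of multiplying DFTs."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]


def target_spectrum(y, h, eps=1e-12):
    """Return X = Y / H; refuse to divide when H has a (near) zero bin."""
    Y, H = dft(y), dft(h)
    if min(abs(v) for v in H) < eps:
        raise ValueError("H has a zero; the inverse spectrum is unreliable")
    return [yk / hk for yk, hk in zip(Y, H)]


# Toy data: h has its only zero at z = -0.5 (inside the unit circle), so it is minimum phase.
h = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 1.0, -0.3, 0.0, 0.0, 0.0, 0.0]   # hypothetical excitation
y = circular_convolve(x, h)                      # "observed" waveform
x_recovered = [v.real for v in idft(target_spectrum(y, h))]
```

Because `h` has no zeros on the unit circle, the division is well defined and `x_recovered` matches `x` up to rounding error.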
To estimate X(ω), determining the temporal positions for windowing is an important problem. PLATINUM uses the F0 contour and waveform. First, the voiced section is estimated based on the F0 contour, and the temporal position with maximum value of y(t)^2 is then extracted as the basic temporal position. The other positions are automatically calculated based on the basic position and F0 contour.
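When estimating X, deciding where to place the analysis windows is the key issue. PLATINUM uses the F0 contour and the waveform. First, the voiced section is estimated from the F0 contour; then the time position where y(t)^2 is largest is taken as the base position, and the remaining positions are computed automatically from the base position and the F0 contour.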
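The position search described above can be sketched roughly as follows. This is a simplified illustration, not the authors' code. It assumes F0 is given per frame with 0 marking unvoiced frames, handles only the first voiced run, and steps by the rounded pitch period; `window_positions` and `hop` are hypothetical names.

```python
def window_positions(y, f0, fs, hop):
    """Window centres (sample indices) for the first voiced section.

    y: waveform samples; f0: F0 per frame in Hz (0 = unvoiced);
    fs: sampling rate in Hz; hop: samples per F0 frame.
    """
    # 1. Voiced section: first contiguous run of frames with F0 > 0.
    voiced = [i for i, f in enumerate(f0) if f > 0]
    if not voiced:
        return []
    first = voiced[0]
    last = first
    while last + 1 < len(f0) and f0[last + 1] > 0:
        last += 1
    lo, hi = first * hop, min(len(y), (last + 1) * hop)

    # 2. Base position: the sample with the largest y(t)^2 in the voiced section.
    base = max(range(lo, hi), key=lambda t: y[t] * y[t])

    def period_at(t):
        return round(fs / f0[t // hop])  # local pitch period in samples

    # 3. Walk forwards and backwards from the base, one pitch period at a time.
    marks = [base]
    t = base + period_at(base)
    while t < hi:
        marks.append(t)
        t += period_at(t)
    t = base - period_at(base)
    while t >= lo:
        marks.append(t)
        t -= period_at(t)
    return sorted(marks)


# Toy signal: 100 Hz pulses at fs = 8000 Hz (period of 80 samples), strongest at t = 420.
fs, hop = 8000, 40
f0 = [0.0] * 2 + [100.0] * 20 + [0.0] * 2
y = [0.0] * (len(f0) * hop)
for t in range(100, 880, 80):
    y[t] = 1.0
y[420] = 2.0
marks = window_positions(y, f0, fs, hop)
```

In this toy case the base position is the strongest pulse at sample 420, and stepping by 80 samples in both directions lands on every other pulse in the voiced section.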
Summary
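F0 (fundamental frequency) determines how high or low the pitch is: female voices tend to be higher and male voices lower. sp (spectral envelope) represents timbre, the quality that makes a guitar sound different from a piano. ap (aperiodicity) describes the aperiodic, noise-like part of the voice, such as breathiness. It does not carry the spoken content (for example "你好吗"), and the four tones of Mandarin pinyin are carried mainly by the F0 contour rather than by ap. Replacing ap with an extracted excitation signal can give better results.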