Sunday, March 4, 2012

Vocaloid Technology-Synthesis Engine

The Synthesis Engine receives score information from the Score Editor in dedicated MIDI messages called Vocaloid MIDI, adjusts the pitch and timbre of the selected samples in the frequency domain, and splices them to synthesize singing voices. When Vocaloid runs as a VSTi accessible from a DAW, the bundled VST plug-in bypasses the Score Editor and sends these messages directly to the Synthesis Engine.

Timing adjustment
In singing, the consonant onset of a syllable is uttered before the vowel onset. The starting position of a note, called "Note-On", must coincide with the vowel onset, not with the start of the syllable. Vocaloid keeps the "synthesized score" in memory and adjusts sample timing so that the vowel onset falls exactly on the "Note-On" position. Without this adjustment, the vowel would sound late relative to the note.
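
The idea can be sketched in a few lines of Python. This is only an illustration, not Vocaloid's actual code: the Syllable class, its field names, and the millisecond units are assumptions made for the example. It shows that a sample has to start earlier than "Note-On" by the length of its consonant part.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    """Hypothetical record for one sung syllable (names are illustrative)."""
    note_on_ms: float    # "Note-On" position taken from the score
    consonant_ms: float  # length of the consonant part before the vowel onset


def sample_start_ms(syllable: Syllable) -> float:
    """Start the sample early so that the vowel onset lands on Note-On."""
    return syllable.note_on_ms - syllable.consonant_ms


def schedule(syllables: list[Syllable]) -> list[float]:
    """Compute the playback start time of every syllable in the score."""
    return [sample_start_ms(s) for s in syllables]


if __name__ == "__main__":
    score = [Syllable(1000.0, 80.0), Syllable(1500.0, 40.0)]
    print(schedule(score))
```

Because each start time depends on a note that has not sounded yet, the engine needs the score in memory ahead of playback, which is consistent with the "synthesized score" described above.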

Pitch conversion
Because the samples are recorded at different pitches, pitch conversion is required when concatenating them. The engine calculates the desired pitch from the notes and the attack and vibrato parameters, and then selects the necessary samples from the library.
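
As a rough sketch of how a desired pitch might be computed, the Python below combines a note number with a sinusoidal vibrato. Everything here is an assumption for illustration: equal temperament with A4 = 440 Hz at MIDI note 69, a sine-shaped vibrato measured in cents, and the function and parameter names. The attack parameter is left out, and the real engine's formula is not documented in this post.

```python
import math

A4_HZ = 440.0   # assumed reference tuning
A4_NOTE = 69    # MIDI note number of A4


def target_pitch_hz(note: int, t_sec: float,
                    vibrato_depth_cents: float = 0.0,
                    vibrato_rate_hz: float = 5.5) -> float:
    """Desired pitch at time t_sec: note pitch plus a sinusoidal vibrato offset."""
    semitones = note - A4_NOTE
    vibrato_cents = vibrato_depth_cents * math.sin(
        2.0 * math.pi * vibrato_rate_hz * t_sec)
    return A4_HZ * 2.0 ** (semitones / 12.0 + vibrato_cents / 1200.0)


def conversion_ratio(sample_pitch_hz: float, note: int, t_sec: float,
                     **vibrato: float) -> float:
    """Factor by which a recorded sample's pitch must be scaled."""
    return target_pitch_hz(note, t_sec, **vibrato) / sample_pitch_hz


if __name__ == "__main__":
    # A sample recorded at A4, used for a note one octave higher.
    print(conversion_ratio(440.0, 81, 0.0))
```

The ratio is what a pitch conversion step would consume: a value of 2.0 means the sample must be raised by one octave.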

Timbre manipulation
The engine smooths the timbre around the junctions between samples. The timbre of a sustained vowel is generated by interpolating the spectral envelopes of the surrounding samples. For example, when concatenating the diphone sequence "s-e, e, e-t" for the English word "set", the spectral envelope of the sustained ē at each frame is generated by interpolating between the ē at the end of "s-e" and the ē at the beginning of "e-t".
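
A minimal sketch of this interpolation, assuming each spectral envelope is a plain list of magnitudes per frequency bin and that the blend is linear over the sustained frames. The post does not say which interpolation Vocaloid uses, so the linear weighting and the function name are illustrative choices.

```python
def interpolate_envelopes(env_start: list[float], env_end: list[float],
                          n_frames: int) -> list[list[float]]:
    """Build n_frames envelopes that move linearly from env_start to env_end.

    env_start stands for the vowel at the end of "s-e", env_end for the vowel
    at the beginning of "e-t"; both must have the same number of bins.
    """
    if len(env_start) != len(env_end):
        raise ValueError("envelopes must have the same number of bins")
    frames = []
    for i in range(n_frames):
        # Weights run strictly between 0 and 1, so no frame duplicates an endpoint.
        w = (i + 1) / (n_frames + 1)
        frames.append([(1.0 - w) * a + w * b
                       for a, b in zip(env_start, env_end)])
    return frames


if __name__ == "__main__":
    for frame in interpolate_envelopes([0.0, 2.0], [4.0, 2.0], 3):
        print(frame)
```

Bins where the two envelopes already agree stay constant, while differing bins glide from one sample's value to the other's, which is what removes an audible jump at the junction.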

Transforms
After pitch conversion and timbre manipulation, the engine applies transforms such as the inverse fast Fourier transform (IFFT) to output the synthesized voice.
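
To show what the inverse transform does, here is a small Python sketch that turns one frame of complex spectrum values back into time-domain samples. Note that it is a direct inverse DFT, which gives the same result as an IFFT but runs in O(n^2) time; it is used here only to keep the example dependency-free. The function name and frame layout are assumptions, and a real engine would use an optimized FFT and overlap successive frames.

```python
import cmath


def inverse_dft(spectrum: list[complex]) -> list[complex]:
    """Direct inverse DFT: x[t] = (1/n) * sum_k X[k] * exp(2*pi*i*k*t/n)."""
    n = len(spectrum)
    samples = []
    for t in range(n):
        acc = 0j
        for k in range(n):
            acc += spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
        samples.append(acc / n)
    return samples


if __name__ == "__main__":
    # Only the DC bin is set, so every output sample has the same value.
    print([round(s.real, 6) for s in inverse_dft([4, 0, 0, 0])])
```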
