
Abstract

Singing voice synthesis has achieved remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as mixed voice, falsetto, and breathy tones, limiting the expressive potential of synthetic voices. To address this, we introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger employs a flow-matching-based model to accurately reproduce these techniques. To enhance the diversity of training data, we develop a technique detection model that annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control.

overall

Technique-Controllable Singing Voice Synthesis (SVS)

To assess the performance of TechSinger and the baseline models on the technique-controllable singing voice synthesis task, we randomly select test-set samples from unseen singers as targets and synthesize them under different control strategies. To represent singing techniques compactly, we use the technique IDs listed below.

Technique ID: 0-No Technique; 1-Mixed Voice; 2-Falsetto; 3-Breathy; 4-Pharyngeal; 5-Vibrato; 6-Glissando; 7-Bubble; 8-Strong; 9-Weak.
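The "Phoneme with Technique" lines on this page attach one or more technique IDs to each phoneme, for example `iou(1,6)` for mixed voice plus glissando. As a reading aid, the following minimal sketch decodes such a line into technique names. The helper name `parse_annotation` is hypothetical and not part of TechSinger, and mapping an unlabeled token such as a bare `<AP>` to ID 0 is an assumption made for this example.

```python
import re

# Technique IDs as listed on this page.
TECHNIQUES = {
    0: "No Technique", 1: "Mixed Voice", 2: "Falsetto", 3: "Breathy",
    4: "Pharyngeal", 5: "Vibrato", 6: "Glissando", 7: "Bubble",
    8: "Strong", 9: "Weak",
}

# A token is a phoneme optionally followed by "(id[,id...])",
# e.g. "iou(1,6)", "<AP>(0)", or a bare "<AP>".
_TOKEN = re.compile(r"^(?P<ph>.+?)(?:\((?P<ids>\d+(?:,\d+)*)\))?$")


def parse_annotation(line):
    """Hypothetical helper: return a list of (phoneme, [technique names])."""
    parsed = []
    for token in line.split():
        m = _TOKEN.match(token)
        if m.group("ids"):
            ids = [int(i) for i in m.group("ids").split(",")]
        else:
            ids = [0]  # assumption: an unlabeled token carries no technique
        parsed.append((m.group("ph"), [TECHNIQUES[i] for i in ids]))
    return parsed


if __name__ == "__main__":
    for phoneme, names in parse_annotation("l(1) iou(1,6) <AP> z(1) ou(1)"):
        print(phoneme, "+".join(names))
```

Because tokens are split on whitespace, the same helper applies to the Chinese, English (ARPAbet), and IPA-based phoneme sequences shown in the samples.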

GT

GT denotes the control strategy in which the techniques are taken from the annotated (ground-truth) technique sequences.

Word: 抓 不 住 爱 情 的 我 <AP> 总 是 眼 睁 睁 看 它 溜 走

Phoneme with Technique: zh(1) ua(1) b(1) u(1) zh(1) u(1) ai(1) q(1) ing(1) d(1) e(1) uo(1) <AP> z(1) ong(1) sh(1) i(1) ian(2) zh(2) eng(2) zh(2) eng(2) k(1) an(1) t(1) a(1) l(1) iou(1,6) z(1) ou(1)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: 想 你 时 你 在 脑 海

Phoneme with Technique: x(0) iang(0) n(0) i(0) sh(0) i(0) n(0) i(6) z(0) ai(0) n(0) ao(6) h(0) ai(0)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: 一 次 就 好 <AP> 我 带 你 去 看 天 荒 地 老

Phoneme with Technique: i(1,6) c(1) i(1) j(1) iou(1) h(1) ao(1) <AP> uo(1) d(1) ai(1) n(1) i(1) q(1) u(1) k(1) an(1) t(2,6) ian(2) h(2,6) uang(2) d(2) i(2) l(2,6) ao(2,6)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: i would never leave when she needs <AP> me most <SP>

Phoneme with Technique: AY1(1) W(1) UH1(1) D(1) N(1) EH1(1) V(1) ER0(1,6) L(1) IY1(1) V(1) W(1) EH1(1) N(1) SH(1) IY1(1,6) N(1) IY1(1) D(1) Z(1) <AP>(0) M(1) IY1(1) M(1) OW1(1) S(1) T(1) <SP>(0)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: <SP> au fond du temple saint <AP>

Phoneme with Technique: <SP>(0) o(0) f(0) ɔ̃(5) d(0) y(0) t(0) ɑ̃(5) p(5) l(5) s(0) ɛ̃(5) <AP>(0)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: <AP> esta noche es para amar <AP>

Phoneme with Technique: <AP>(0) e(1) s(1) t̪(1) a(1) n(1) o(1) tʃ(1) e(1) e(1) s(1) p(1) a(1) ɾ(1) a(1) a(1) m(1) a(1) ɾ(1) <AP>(0)

Ground Truth
DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Random

Random denotes the control strategy in which the model generates the techniques automatically and at random.

Word: 我 知 道 <AP> 那 些 夏 天 <AP> 就 像 青 春 一 样 回 不 来

Phoneme Sequence: uo zh i d ao <AP> n a x ie x ia t ian <AP> j iou x iang q ing ch un i iang h ui b u l ai

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: 离 别 没 说 再 见 <AP> 你 是 否 心 酸

Phoneme Sequence: l i b ie m ei sh uo z ai j ian <AP> n i sh i f ou x in s uan

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: <SP> edelweiß <AP> edelweiß <AP>

Phoneme Sequence: <SP> eː d ɛ l v a ɪ s <AP> eː d ɛ l v a ɪ s <AP>

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Prompt-Guided

Prompt-Guided denotes the control strategy in which the techniques are predicted by our technique predictor from the given natural-language prompt.

Word: 就 在 那 里 曾 是 你 和 我 <AP> 爱 过 的 地 方

Prompt: Generate a Chinese song where a female singer sings in medium vocal range using breathy technique.

Predicted Technique Sequence: j(3) iou(3) z(3) ai(3) n(3) a(3) l(3) i(3) c(3) eng(3) sh(3) i(3) n(3) i(3) h(3) e(3) uo(3) <AP>(0) ai(3) g(3) uo(3) d(3) e(3) d(3) i(3) f(3) ang(3)

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: 一 壶 清 酒 一 身 尘 灰

Prompt: Generate a Chinese pop song where a Tenor sings using mixed voice.

Predicted Technique Sequence: i(1) h(1) u(1) q(1) ing(1) j(1) iou(1) i(1) sh(1) en(1) ch(1) en(1) h(1) uei(1,5)

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: 一 个 多 情 的 痴 情 的 绝 情 的 无 情 的 人 来 给 我 伤 痕

Prompt: Create a pop song where a Chinese female singer sings using mixed voice and strong vocal.

Predicted Technique Sequence: i(1,8) g(1,8) e(1,8) d(1,8) uo(1,8) q(1,8) ing(1,8) d(1,8) e(1,8) ch(1,8) i(1,8) q(1,8) ing(1,8) d(1,8) e(1,8) j(1,8) ve(1,8) q(1,8) ing(1,8) d(1,8) e(1,8) r(1,8) en(1,8) l(1) ai(1) g(1) ei(1) uo(1) sh(1) ang(1) h(1) en(1)

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Word: i will be brave <AP> i will not let anything take

Prompt: Create a English song performed by a alto using vibrato.

Predicted Technique Sequence: AY1(0) W(0) IH1(0) L(0) B(0) IY1(0) B(0) R(0) EY1(5) V(0) <AP> AY1(5) W(0) IH1(5) L(0) N(0) AA1(0) T(0) L(0) EH1(0) EH1(0) N(0) IY0(0) TH(0) IH2(0) NG(0) T(0) EY1(5) K(0)

DiffSinger Visinger2 StyleSinger TechSinger(Ours)

Ablation Study

We conduct ablation studies on singing voice synthesis (SVS) to show the effectiveness of the individual design choices in TechSinger. Pitch indicates whether the flow-matching pitch predictor is used (as opposed to only the diffusion decoder), Postnet indicates whether the postnet is used, and CFG indicates whether the classifier-free guidance strategy is applied.
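For readers unfamiliar with classifier-free guidance, the sketch below shows the commonly used blending rule: the model is evaluated once with the technique condition and once without it, and the two predictions are extrapolated by a guidance scale. This page does not state the exact formulation or scale used in TechSinger, so the function name `cfg_combine`, the parameter `scale`, and the formula itself are assumptions for illustration only.

```python
def cfg_combine(cond, uncond, scale):
    """Blend conditional and unconditional predictions element-wise.

    Assumed (standard) rule: uncond + scale * (cond - uncond).
    scale = 0 ignores the condition, scale = 1 returns the conditional
    prediction unchanged, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy two-dimensional predictions, not real model outputs.
    conditional = [1.0, 2.0]
    unconditional = [0.0, 1.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, cfg_combine(conditional, unconditional, scale))
```

Under this rule, the "w/o CFG" setting would correspond to using the conditional prediction alone.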

Word: 你 会 不 会 忽 然 的 出 现

Phoneme with Technique: n(3) i(3) h(3) uei(3) b(3) u(3) h(3) uei(3) h(3) u(3) r(3) an(3) d(3) e(3) ch(3) u(3) x(3) ian(3,5)

Ground Truth TechSinger w/o Pitch w/o Postnet w/o CFG

Word: 为 了 爱 孤 军 奋 斗 <AP> 早 就 吃 够 了 爱 情 的 苦

Phoneme with Technique: uei(1) l(1) e(1) ai(1) g(1) u(1) j(1) vn(1) f(1) en(1) d(1) ou(1) <AP> z(2) ao(2) j(2) iou(2) ch(2) i(2) g(2) ou(2) l(2) e(2) ai(2) q(2) ing(2) d(2) e(2) k(2) u(2)

Ground Truth TechSinger w/o Pitch w/o Postnet w/o CFG