
Table 5.3 shows the MOS results of the three best TTS systems.

TTS system                    MOS
FT2 + HG2                     3.8 ± 0.28
FT3 + HG2                     4.07 ± 0.32
FT6 + HG2                     3.1 ± 0.34
Original audio recordings     4.56 ± 0.19

Table 5.3: MOS scores of the tested TTS systems with 95% confidence intervals.

The best MOS score, 4.07, is achieved by model FT3, followed by model FT2. For comparison, Tacotron 2 achieves a score of 4.52, FastSpeech 3.83, and DeepVoice 3.78. All three of these models were trained on the English LJSpeech dataset [23].
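The intervals in Table 5.3 come from averaging the individual listener ratings. A minimal sketch of such a MOS and 95% confidence interval computation is shown below; the ratings in the example are illustrative, not the actual evaluation data.

```python
import numpy as np
from scipy import stats

# Hypothetical listener ratings (1-5) for one TTS system; illustrative only,
# not the ratings collected in the actual evaluation.
ratings = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 5])

mos = ratings.mean()        # mean opinion score
sem = stats.sem(ratings)    # standard error of the mean

# 95% confidence interval from the Student's t distribution
# (second positional argument is the degrees of freedom).
low, high = stats.t.interval(0.95, len(ratings) - 1, loc=mos, scale=sem)

print(f"MOS = {mos:.2f} ± {(high - low) / 2:.2f}")
```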

Model FT3 generates the most natural speech of all the models. It preserves the voice characteristics of the original speaker well, but when synthesizing longer sentences the speech noticeably starts to sound artificial.

The base ForwardTacotron model FT1 does not produce speech of good quality, but it is good enough as a starting point for adapting to other, previously unseen speakers. Model FT2 produced results almost as good as FT3, even though it needed only half as many training steps. The main shortcoming of FT2 is that it retained the speaking rate of the base model FT1, which contributed to poorer results when synthesizing longer sentences.

The worst results are achieved by the models that were adapted with less data. Among these, the best results are achieved by FT6, which was trained on two hours of speech. Models FT4, FT5 and FT6 imitate the speaker's voice well, but they drop parts of words; this is least noticeable with FT6. This could be a consequence of the ForwardTacotron architecture, which does not use an attention layer that would ensure that no words are dropped from sentences. Less training data also means a greater chance that the model does not see enough phoneme combinations to learn prosody correctly. Models with more data achieved better quality.

With models FT4 and FT6 we quickly see that the quality of the synthesized voice does not improve with an additional 30 minutes of speech. Both models imitate the speaker's voice well, but the generated speech contains too many artifacts for a robust TTS system.

In the comparison of the HiFi-GAN models, the adapted model HG2 achieved the best results, while the base model HG1, which was trained for only 160k steps, achieved the worst. To obtain good results with universal vocoders, they need to be trained for at least 500k steps, which can take several weeks. The base HiFi-GAN model generates speech with many artifacts when synthesizing a previously unseen speaker. For speakers seen during training, it best generates the voices that had more data in the training set. The Griffin-Lim algorithm generates intelligible speech, but it still sounds artificial. The universal HiFi-GAN model is able to synthesize many different voices in different languages. The generated speech is intelligible, but it does not fully preserve the voice of the original speaker; it is, however, a good basis for adapting the model to a single speaker.
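As a reference point for the Griffin-Lim baseline mentioned above, a mel spectrogram can be inverted into a waveform without any neural vocoder. A minimal sketch using librosa follows; the file name and spectrogram parameters are illustrative and would have to match those used by the acoustic model.

```python
import librosa
import soundfile as sf

# Load a reference recording and compute its mel spectrogram
# (in practice the spectrogram would come from ForwardTacotron).
y, sr = librosa.load("sample.wav", sr=22050)   # illustrative file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Invert the mel spectrogram with the Griffin-Lim algorithm.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=60)

sf.write("griffin_lim.wav", y_hat, sr)
```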

None of the ForwardTacotron models recognize the context of the text, so they cannot correctly inflect words or expand abbreviations. They also have problems with the stress placement of some words. A solution would be to incorporate a phonetic alphabet into the training of the models.
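One possible way to introduce a phonetic alphabet, as suggested above, would be to convert the training transcripts to IPA before training. The sketch below assumes the phonemizer package with the espeak-ng backend (which includes Slovenian); it is an assumption about a possible implementation, not the pipeline actually used in the thesis.

```python
from phonemizer import phonemize

# Illustrative transcripts; in practice these would be the dataset's sentences.
sentences = [
    "Danes je lep dan.",
    "Sinteza govora je zanimiva.",
]

# Convert graphemes to IPA phonemes; requires espeak-ng installed with
# Slovenian ('sl') language support.
phonemes = phonemize(sentences, language="sl", backend="espeak",
                     strip=True, preserve_punctuation=True)

for text, ipa in zip(sentences, phonemes):
    print(f"{text} -> {ipa}")
```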


Chapter 6

Conclusion

In this thesis we created a Slovenian speech synthesizer. We built six datasets and used the ForwardTacotron and HiFi-GAN architectures to build five different TTS systems. We examined how much data is needed to adapt a new speech synthesis model from an already trained base ForwardTacotron model. We found that at least 60 minutes of speech are needed for good synthesis quality, while 30 minutes of a new speaker's speech already suffice to imitate the voice. We compared the three best TTS systems and the original audio recordings using the MOS score.

We also built two HiFi-GAN models and found that training a universal vocoder takes much more time than training a universal Tacotron model. We adapted an existing universal model to our speaker and obtained very good quality of the generated voice.
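The adaptation itself follows the usual warm-start recipe: the universal generator's weights are loaded into the model and training continues on the target speaker's data at a reduced learning rate. A minimal, self-contained PyTorch sketch of that mechanism follows; the tiny generator, checkpoint path, and hyperparameters are illustrative, not the actual HiFi-GAN training script.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the HiFi-GAN generator; the real model comes from the
# hifi-gan repository [2], this module only illustrates the warm-start mechanics.
class TinyGenerator(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
        )

    def forward(self, mel):
        return self.net(mel)

# Pretend this checkpoint is the pretrained universal vocoder (illustrative path).
torch.save({"generator": TinyGenerator().state_dict()}, "universal_generator.pt")

# Warm start: load the universal weights instead of training from scratch.
generator = TinyGenerator()
state = torch.load("universal_generator.pt", map_location="cpu")
generator.load_state_dict(state["generator"])

# Continue training on the target speaker's data with a reduced learning rate.
optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4, betas=(0.8, 0.99))

mel = torch.randn(4, 80, 200)    # dummy batch: (batch, n_mels, frames)
audio = torch.randn(4, 1, 200)   # dummy target waveform segments
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(generator(mel), audio)  # real training adds GAN losses
    loss.backward()
    optimizer.step()
```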

The general ForwardTacotron model FT1 is suitable for further improvements.

The model could be upgraded with a larger amount of male voices, while for a universal model FT1 would need to be adapted with a large amount of audio recordings of both female and male voices. For correct stress placement, the ForwardTacotron model would have to be trained with a phonetic alphabet. The text input to ForwardTacotron would also need to be processed and the words correctly annotated to achieve more natural speech.
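A first step toward such text processing could be a simple normalization pass that expands common abbreviations before the text reaches the model. The sketch below is illustrative only; a real front end would need a much larger, case- and inflection-aware rule set.

```python
import re

# Illustrative table of Slovenian abbreviations; not a complete front end.
ABBREVIATIONS = {
    "dr.": "doktor",
    "npr.": "na primer",
    "itd.": "in tako dalje",
    "št.": "številka",
}

def normalize(text: str) -> str:
    """Expand known abbreviations so the TTS model never has to read them raw."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Obiskali smo dr. Novaka, npr. včeraj ob št. 5 itd."))
# -> "Obiskali smo doktor Novaka, na primer včeraj ob številka 5 in tako dalje"
```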


Training one's own universal vocoder without the warm-start method requires an enormous amount of data and time. To create a universal vocoder it is therefore better to take an existing universal model and adapt it on a larger set of male and female voices.

References

[1] ForwardTacotron. Available at: https://github.com/as-ideas/ForwardTacotron/. [Accessed 30 July 2021].

[2] HiFi-GAN. Available at: https://github.com/jik876/hifi-gan. [Accessed 30 July 2021].

[3] Brazen head. Available at: https://en.wikipedia.org/wiki/Brazen_head, 2015. [Accessed 30 July 2021].

[4] Abien Fred Agarap. Deep Learning using Rectified Linear Units (ReLU). Available at: https://arxiv.org/pdf/1803.08375.pdf, 2019.

[5] J. Allen. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(3):235–238, 1977.

[6] Antti Alastalo. Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet. Master's thesis, Aalto University, School of Science, 2021.

[7] Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep Voice: Real-time neural text-to-speech. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 195–204, 2017.


[8] International Phonetic Association. The international phonetic alphabet. Available at: https://www.internationalphoneticassociation.org/sites/default/files/IPA_Kiel_2015.pdf, 2015. [Accessed 30 July 2021].

[9] Mariette Awad and Rahul Khanna. Hidden Markov Model, pages 81–104. Apress, 2015.

[10] Johan Bjorck, Carla Gomes, Bart Selman, and Kilian Q. Weinberger. Understanding Batch Normalization. Available at: https://arxiv.org/pdf/1806.02375.pdf, 2018.

[11] Boyang Zhang, Jared Leitner, and Sam Thornton. Audio Recognition using Mel Spectrograms and Convolution Neural Networks. Available at: http://noiselab.ucsd.edu/ECE228_2019/Reports/Report38.pdf, 2019. [Accessed 30 July 2021].

[12] Boštjan Vesnicer, France Mihelič, and Nikola Pavešić. Vrednotenje na prikritih markovovih modelih temelječega sistema za umetno tvorjenje slovenskega govora. Jezikovne tehnologije: zbornik B 7. mednarodne multikonference Informacijska družba IS, pages 98–102, 2004.

[13] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.

[14] D. Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

[15] Matjaž Gams. Govorec - sistem za slovenski računalniški govor. Novice IJS, št. 83, pages 3–4, 2000.

[16] Matjaž Gams. Sintetizator govora za slovenščino eBralec. Zbornik konference Jezikovne tehnologije in digitalna humanistika, pages 180–185, 2016.

[17] Fredrick Geissler. Notes, 32(4):775–777, 1976.

[18] D. Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

[19] Heiga Zen, Andrew Senior, and Mike Schuster. Statistical parametric speech synthesis using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7962–7966, 2013.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[21] A.J. Hunt and A.W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376, 1996.

[22] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS 2014), 2014.

[23] Keith Ito and Linda Johnson. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.

[24] Jerneja Žganec Gros, Nikola Pavešić, and France Mihelič. Text-to-Speech synthesis: a complete system for the Slovenian language. Journal of Computing and Information Technology, pages 11–19, 1997.

[25] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Available at: https://arxiv.org/pdf/2010.05646.pdf, 2020.

[26] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. Available at: https://arxiv.org/pdf/1910.06711.pdf, 2019.

[27] Sneha Lukose and Savitha S. Upadhya. Text to speech synthesizer-formant synthesis. In 2017 International Conference on Nascent Technologies in Engineering (ICNTE), pages 1–4, 2017.

[28] Xudong Mao, Qing Li, Haoran Xie, Raymond Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2017.

[29] Gregoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.

[30] Pertti Palo. A review of articulatory speech synthesis. Master's thesis, Helsinki University of Technology, 2006.

[31] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, 2013.

[32] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A Flow-based Generative Network for Speech Synthesis, pages 3617–3621, 2019.

[33] Simon Rozman. Sinteza govornega signala na osnovi metode HNM. Master's thesis, Univerza v Ljubljani, Fakulteta za računalništvo in informatiko, 2005.

[34] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783, 2018.

[35] Blaž Simčič. Metoda glavnih komponent in manjkajoči podatki. Master's thesis, Univerza v Ljubljani, Fakulteta za družbene vede, 2014.

[36] Y. Stylianou. Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Transactions on Speech and Audio Processing, 9(1):21–29, 2001.

[37] Tom Šabanov. Generiranje slovenskega govora na podlagi učnih množic več govorcev. Available at: https://tomsabanov.gitlab.io/generiranje-slovenskega-govora-tacotron/, 2021. [Accessed 30 July 2021].

[38] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio, 2016.

[39] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3918–3926, 2018.

[40] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. Available at: https://arxiv.org/pdf/1703.10135.pdf, 2017.

[41] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, Robust and Controllable Text to Speech. Available at: https://arxiv.org/pdf/1905.09263.pdf, 2019. [Accessed 30 July 2021].

[42] Heiga Zen, Keiichi Tokuda, and Alan W. Black. An HMM-based speech synthesis system applied to English. In IEEE Speech Synthesis Workshop, pages 227–230, 2002.

[43] Jerneja Žganec Gros. eBralec – sintetizator govora za slovenščino. Available at: http://videolectures.net/jota_zganec_gros_ebralec/, 2018. [Accessed 30 July 2021].