Abstract:
Text-to-speech systems generally require large amounts of annotated speech data, with the quality of both the annotations and the speech itself being a major factor. As a result, most research has been performed on highly curated data collected in lab settings. This limitation matters little when generating widely spoken varieties such as American English, but it becomes a real obstacle when generating less common accents: it is not viable to assemble a focus group and record a new dataset every time a new type of accented speech is needed. This report explores an approach to generating less common accented speech using a popular labeled dataset to learn the language and a separate unlabeled dataset of the accent we wish to learn. The approach uses GANs (Generative Adversarial Networks), introduced by Ian Goodfellow in 2014. Broadly, it aims to balance the trade-off between comprehensibility and accent replication by using an ASR (Automatic Speech Recognition) model in conjunction with a discriminator trained to recognize the target accent. This balance is maintained by training the generative model on a weighted sum of the two error functions. We intend to use the proposed model to recreate Chinese-accented English as a sanity check to show the correctness of our model, after which we will tackle more obscure tasks such as animal-accented speech.
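
The weighted combination of the two error signals mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the report's actual implementation; the function name `generator_loss` and the weighting parameter `alpha` are assumptions introduced here for clarity.

```python
def generator_loss(asr_loss: float, accent_disc_loss: float, alpha: float = 0.5) -> float:
    """Combine the two training signals for the generator.

    asr_loss:          penalizes incomprehensible output (ASR transcription error).
    accent_disc_loss:  penalizes output the discriminator rejects as off-accent.
    alpha:             hypothetical weight trading comprehensibility against
                       accent fidelity (alpha=1.0 ignores the accent entirely).
    """
    return alpha * asr_loss + (1.0 - alpha) * accent_disc_loss


# Example: equal weighting of an ASR loss of 1.0 and a discriminator loss of 3.0
combined = generator_loss(1.0, 3.0, alpha=0.5)  # -> 2.0
```

Tuning `alpha` is how the comprehensibility-accent replication gap described above would be balanced in practice.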