Voice banking involves recording a list of sentences into a computer. When enough recordings have been captured, software chops them up into individual sounds, known as phonetic units. A synthetic voice can then be built out of these phonetic units; this is called concatenative speech synthesis. The number of sentences or statements needed to build a good-quality English-language synthetic voice using this process varies, but is somewhere between 600 and 3500. This will take at least 8 hours of recording time. Most people break it up over a few weeks, which is recommended because voice quality deteriorates over the course of a long session. So 20 minutes to half an hour in the morning (when most people’s voices are clearer) would be a good approach. The more recordings made, the better the quality of the resulting voice will be.
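To make the idea of building a voice out of phonetic units a little more concrete, here is a toy sketch in Python. It is only an illustration and not how any particular voice banking service works: the phoneme labels, the sample values in `inventory`, and the function names `crossfade_concat` and `synthesize` are all made up for this example. It looks up one recorded unit per phoneme and joins the units, blending a few samples at each seam so the joins are less abrupt.

```python
# Toy concatenative synthesis. All names and data here are illustrative
# assumptions, not taken from any real voice banking software.

def crossfade_concat(units, overlap=4):
    """Join lists of samples, blending up to `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(phonemes, inventory, overlap=4):
    """Look up a recorded unit for each phoneme label and join them."""
    missing = [p for p in phonemes if p not in inventory]
    if missing:
        raise KeyError(f"no recorded unit for: {missing}")
    return crossfade_concat([inventory[p] for p in phonemes], overlap)


# Hypothetical inventory: phoneme label -> short list of audio samples.
# A real inventory would hold thousands of units cut from the recordings.
inventory = {
    "k": [0.0, 0.2, 0.4, 0.2, 0.0, -0.2],
    "ae": [0.1, 0.5, 0.9, 0.5, 0.1, -0.3, -0.5, -0.3],
    "t": [0.0, 0.3, 0.0, -0.3, 0.0],
}

audio = synthesize(["k", "ae", "t"], inventory, overlap=2)
print(len(audio), "samples")
```

The sketch also hints at why more recordings help: a phoneme that was never recorded simply cannot be produced (here it raises an error), and a larger inventory gives the software more units to choose from.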

There are a number of services offering voice banking and we have listed some that we are aware of below. The technology used varies from service to service and this post isn’t intended to be a guide to which service may be appropriate to a particular user. Our advice would be to investigate all options before making a decision as this process will be a considerable investment of time and in some cases money.

A person might choose to bank their voice for a number of reasons. The most common reason would be if someone has been diagnosed with a progressive illness like Motor Neurone Disease (MND/ALS) or similar that will result in the loss of speech. A voice is a very personal thing, and being able to keep this aspect of individuality and identity can be important. The MND Association has detailed information on voice banking on their website here. People unable to speak from birth can also take advantage of this technology. The VocalID service (although expensive) seems to offer good options in this regard. A family member could donate their voice by going through the voice banking process (or the user could choose an appropriate donated voice). This synthetic voice could then be modified with filters modelled on the user’s own vocalisations. The result is a unique and personal voice with some of the regional qualities (accent, pronunciation) that reflect their background and heritage. Irish AAC users have historically had little choice when it came to selecting a voice, with most grudgingly accepting the BBC newsreader upper-class English voice that was ubiquitous in communication devices. In Ireland, where accents can vary significantly over such small geographical areas, how you speak is perhaps even more tied to your identity than in other countries. Hopefully in the near future we will be hearing AAC users communicating in Cork, Limerick and Dublin accents!

ModelTalker

For research purposes I used the ModelTalker service to create a synthetic voice. I wanted to see how well it dealt with the Irish accent. The ModelTalker service is run out of the Nemours Speech Research Laboratory (SRL) in the Nemours Center for Pediatric Auditory and Speech Sciences (CPASS) at the Alfred I. duPont Hospital for Children in Wilmington, Delaware. It is not a commercial service, costing only a nominal $100 to download your voice once banked. They offer an Online Recorder that works directly in the Chrome browser, or you can download and install their MTVR app if you are using Windows. The only investment you need to make to begin banking your voice is a decent-quality USB headset. I used the Andrea NC-181 (about €35). For the best quality they recommend you record about 1600 sentences, but they can build a voice from 800. As this was just an experiment I recorded the minimum 800.

At the beginning of each session you go through a sound check. Consistency is an important factor contributing to the overall quality of the finished voice. This is why you need to keep using the same computer and microphone throughout the whole process, ideally in the same location. When you begin you will hear the first statement read out; you then record the statement yourself. A colour code will give you feedback on whether the recording was acceptable or not. Red means it wasn’t good enough to use, so you should try again. Yellow means okay but could be better, and green means perfect, so move on. I found the Irish accent resulted in a lot of yellow. Don’t let this worry you too much.

A nice feature for Irish people who want to engage in this process is the ability to record custom sentences. They recommend that you at least record your own name. So many names and places in Ireland are anglicised versions of Irish that it would be worthwhile spending a bit of time on these custom sentences. “Siobhán is from Drogheda”, for example, would be incomprehensible using most text-to-speech voices. At the end of each session you upload your completed sentences, which are added to your inventory (if using the browser-based recorder they are added as you go). When you feel you have enough completed you can request your voice. When the voice is ready you need to audition it; this process allows you to fine-tune how it sounds. I made a screen recording of this process and I will add it to this post when I have edited it down to a manageable length.

Click play below to hear a sample of my synthesized voice. Yes, unfortunately I do kind of sound like that :)

Speech synthesis is an area of technology that is progressing rapidly thanks to the interest of big multinationals like Google (listen to their DeepMind-powered WaveNet voices here) and Adobe (which caused a stir, and even concern in some quarters, with Project VoCo in 2016). Looking at these two examples, it’s not hard to imagine that a high-quality unique voice could be built from a short sample in the near future.