OK, so at this point everyone should be familiar with the computer voice. The Siri voice, for example, is functional, but still has no life. What is the problem, and how do they fix it? To put it as simply as possible, the technology, at least on an individual device, is still based on word-by-word pronunciation. Call Siri attractive and she will say "there, there." It sounds like two completely separate words instead of a full phrase. It's similar with the Droid version: it knows to lower the pitch of the second word because it is at the end of the segment, but it is just not treating the phrase, the combination of words, the way we would. We naturally assign pitch and timing to various words in a group, so that we wind up emphasizing particular words to have a particular effect. For instance, everyone will read that last sentence and naturally put a slight delay after the comma, but also after the word "timing" and after the second instance of the word "words." There are also more subtle delays taking place in other portions of the sentence. I will write it out in a way that illustrates these timing breaks.
We naturally assign pitch and timing
to various words
in a group,
so that we wind up emphasizing
particular words
to have a particular effect.
I think, to put it as simply as possible, the program needs to be reading ahead and making grammatical determinations. This shouldn't be too big an issue, since Word has been doing this for us for a while. What I'm describing would be a "next step" in the analysis of the text on the part of the computer. The grammar drives the pitch and timing of the phrase, and it doesn't need to be a comma or period to produce a noticeable amount of delay. I have read others claiming that more data would have to be embedded during the writing process in order to put emphasis where it should be, and I say nonsense. We are able to figure out where it goes. The AI should be able to as well.
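The read-ahead idea above can be sketched as a toy rule: scan the sentence for punctuation and a small set of boundary words, and attach a pause or pitch cue to each chunk. This is only an illustration under assumed rules; the boundary-word list and cue names are my own invention, and a real text-to-voice front end would need a full grammatical parse rather than a word list.

```python
# Toy prosodic chunking: split a sentence into rough phrase groups and
# attach a pause/pitch cue to each. The boundary words and cue names are
# illustrative assumptions, not how any real TTS engine works.

# Function words that often begin a new prosodic phrase (an assumption).
BOUNDARY_WORDS = {"to", "in", "so"}

def chunk_phrases(sentence):
    """Split a sentence into rough prosodic chunks with pause hints."""
    words = sentence.replace(",", " ,").split()
    chunks, current = [], []
    for word in words:
        if word == ",":
            # A comma gets a longer pause.
            chunks.append((" ".join(current), "long_pause"))
            current = []
        elif word.lower() in BOUNDARY_WORDS and current:
            # A boundary word starts a new chunk with a short pause before it.
            chunks.append((" ".join(current), "short_pause"))
            current = [word]
        else:
            current.append(word)
    if current:
        # Final chunk gets the end-of-sentence pitch drop.
        chunks.append((" ".join(current), "end_fall"))
    return chunks

sentence = ("We naturally assign pitch and timing to various words in a group, "
            "so that we wind up emphasizing particular words to have a particular effect.")
for text, cue in chunk_phrases(sentence):
    print(f"{cue:>11}: {text}")
```

Run on the example sentence, this recovers most of the breaks written out above, purely from the text, which is the point: no extra markup from the writer is needed.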
There is a second issue with current text-to-voice technology. Currently, when I set up a customer with their own messaging software that includes a text-to-voice solution, they are required to read a lengthy script of words for the computer to sample and piece together a basic vocabulary of sounds.
To put this simply, the computer is taking particular words and sounds and piecing together other words from those samples. The technology already exists for voice to text, yet it is not being utilized in this, its most needed application. The software should simply be sampling from any length of speech and determining the words it is hearing, then (as mentioned above) analyzing the grammar cues from the sentence structure, then comparing the actual pitches and delays with the grammar determinations it has made, and assigning a character to the voice it is hearing. Combine this with the actual tone of the voice and the way certain letters and letter combinations sound, and a realistic synthesis occurs. The data pulled from the sample is put against a saved template.
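The stages just described can be laid out as a pipeline skeleton. Every stage below is a stub with made-up numbers; a real system would plug in actual speech-to-text, a grammar analyzer, and pitch tracking. The field names and template values are illustrative assumptions only.

```python
# Hypothetical skeleton of the sampling pipeline: transcribe the speech,
# measure its actual prosody, and compare against a saved standard
# template to extract the speaker's "character". All values are stubs.

from dataclasses import dataclass

@dataclass
class VoiceProfile:
    """The character pulled from a sample, expressed against the template."""
    avg_pitch_hz: float
    pause_scale: float   # how much longer this speaker pauses than the template
    pitch_range: float   # expressiveness relative to the template

# The saved standard template the sample is put against (made-up values).
TEMPLATE = VoiceProfile(avg_pitch_hz=120.0, pause_scale=1.0, pitch_range=1.0)

def transcribe(audio):
    """Stage 1: voice to text. Stubbed with a fixed transcript."""
    return "we naturally assign pitch and timing"

def measure_prosody(audio):
    """Stage 2: measure the actual pitches and delays. Stubbed numbers."""
    return {"avg_pitch_hz": 96.0, "pause_scale": 1.3, "pitch_range": 0.8}

def profile_speaker(audio):
    """Stage 3: compare measurements to the template to get the character."""
    measured = measure_prosody(audio)
    return VoiceProfile(
        avg_pitch_hz=measured["avg_pitch_hz"],
        pause_scale=measured["pause_scale"] / TEMPLATE.pause_scale,
        pitch_range=measured["pitch_range"] / TEMPLATE.pitch_range,
    )

print(transcribe(None))
print(profile_speaker(None))
```

The design point is that the sample only has to be *compared* against the template; nothing about the sample has to be recorded in any particular scripted form.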
I will use the analogy of a Barbie-type doll. Currently, sounds are chopped and combined like parts of a doll glued together. I'm talking about a new doll being formed complete after simply observing another doll. Any samples used (hair color, height, facial features, etc.) are simply compared to what the software has already saved as a standard template for what a doll should look like; the samples don't have to be gathered by taking snapshots of the doll in certain, particular poses at particular angles. The software would just need a few shots, and then it would make the calculations and go immediately into synthesis mode. Obviously, the more footage you get, the more accurate the facsimile of the original.
The real fun, and the inevitable conclusion of this system being utilized, is simply recording a particular voice (perhaps Morgan Freeman from the movie War of the Worlds) with your smartphone, then letting your app synthesize that voice, and having it sound perfect. Everyone would be available. You could easily have whatever voice you want for your GPS or voicemail message just by listening to that person for a short while. You could take any book PDF file you have and have Morgan Freeman read it to you, without having to pay Morgan Freeman to sit in a room and actually read through the whole thing.
Issues: obviously, if a paragraph were full of improper writing and poor sentence structure, the grammar determinations might be ugly. But it would read that way in text form already, so that isn't the fault of the software; that's the fault of the writer.
IDEA – whew, this is a first for me, so allow me to revel in a small delight. A wonderful idea comes to me whilst writing about another. This is immense. Since the software will be making grammar determinations and reading the sentences back to you, the following scenario would be possible, and it is absolutely revolutionary.
The user, in this case a high school student, is preparing an essay. He simply reads into the PC or phone, and the device takes dictation and prepares the essay. When he is finished, he asks the device for a quick review. Making comparisons to its existing standardized database of proper English stylings, as mentioned above, the device comes across a particular sentence and says,
"The prepositional phrase 'under the bridge' in sentence 15 would flow better if placed before the word 'because.' May I demonstrate?"
The device could then read the sentence back with the improvement, or it could be set up to automatically correct for improved flow. Why not, while you're at it, look up the professor online and, after determining his or her hometown, modify the story to contain references to landmarks in the town of their youth?
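The review scenario could be sketched as a single hard-coded rule: spot a prepositional phrase trailing a "because" clause and propose moving it forward. This is a toy under stated assumptions; the preposition list and the pattern are my own invention, and a real reviewer would need an actual parser, not one regular expression.

```python
# Toy "quick review" rule: if a sentence ends with a prepositional phrase
# after a "because" clause, suggest placing the phrase before "because".
# The preposition list in the pattern is an illustrative assumption.

import re

PATTERN = re.compile(r"(.*)\bbecause\b(.*?)((?:under|over|behind) the \w+)(\W*)$")

def suggest_reorder(sentence):
    """Return a suggestion string, or None if the rule doesn't apply."""
    m = PATTERN.match(sentence)
    if not m:
        return None
    before, clause, phrase, tail = m.groups()
    # Rebuild the sentence with the phrase moved ahead of "because".
    suggestion = f"{before}{phrase} because{clause.rstrip()}{tail}"
    return (f'The prepositional phrase "{phrase}" might flow better '
            f'before "because": {suggestion}')

print(suggest_reorder("He hid because he was scared under the bridge."))
```

Given "He hid because he was scared under the bridge.", the rule proposes "He hid under the bridge because he was scared." and stays silent on sentences it can't analyze, which is roughly the behavior the scenario calls for.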