#26916 - nmcconnell - Tue Sep 28, 2004 11:03 pm
I was reading a post on this forum from 2003 about speech synth, which got me thinking -
it seems like the most complicated aspects of speech synthesis could all be prerendered. This might make Text-To-Speech relatively easy on the GBA. This would only work for text that already exists at compile time, but it seems to me like it should be feasible. The Festival Text-To-Speech site lists the following steps for converting text to sound:
------------------------------------------------------------------------
http://festvox.org/docs/manual-1.4.3/festival_14.html#SEC47
Tokenization
Converting the string of characters into a list of tokens. Typically this means whitespace-separated tokens of the original text string.
Token identification
Identification of general types for the tokens. Usually this is trivial, but it requires some work to identify tokens of digits as years, dates, numbers etc.
Token to word
Convert each token to zero or more words, expanding numbers, abbreviations etc.
Part of speech
Identify the syntactic part of speech for the words.
Prosodic phrasing
Chunk utterance into prosodic phrases.
Lexical lookup
Find the pronunciation of each word from a lexicon/letter-to-sound rule system, including phonetic and syllable structure.
Intonational accents
Assign intonation accents to appropriate syllables.
Assign duration
Assign duration to each phone in the utterance.
Generate F0 contour (tune)
Generate tune based on accents etc.
Render waveform
Render waveform from phones, duration and F0 target values. This itself may take several steps, including unit selection (be they diphones or other sized units), imposition of desired prosody (duration and F0) and waveform reconstruction.
-------------------------------------------------------------------------
So, I'm thinking that everything except rendering the waveform, and probably generating the F0 contour, could be done in advance. This would mean that all the most complicated steps could be handled by existing code libraries running on the development system, and only the last few steps would have to be set up to run on the GBA.
To be clear (and to avoid pedantry in the replies), yes, I realise that the whole thing could be prerendered to a wave file, but if the last two steps were done on the GBA, could you store the prepared speech in a format that is, say, less than ten times bigger than just the plain text?
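To make the size question concrete, here is a rough sketch of what such an intermediate format might look like if the build-time tool emitted one (phone, duration, F0 target) record per phone. Everything in it is an assumption for illustration: the phone list, the 3-byte record layout, the 5 ms and 2 Hz quantization steps, and the hand-written segment values are all made up, and none of it comes from Festival.

```python
import struct

# Hypothetical phone inventory; a real voice would define its own, larger set.
PHONES = ["pau", "h", "@", "l", "ou", "w", "er", "d"]
PHONE_ID = {p: i for i, p in enumerate(PHONES)}

DUR_STEP_MS = 5   # assumed duration resolution (max 255 * 5 = 1275 ms)
F0_STEP_HZ = 2    # assumed pitch resolution (0 means unvoiced, max 510 Hz)


def pack_utterance(segments):
    """Pack (phone, duration_ms, f0_hz) triples into 3 bytes per phone."""
    out = bytearray()
    for phone, dur_ms, f0_hz in segments:
        dur = min(255, max(0, round(dur_ms / DUR_STEP_MS)))
        f0 = min(255, max(0, round(f0_hz / F0_STEP_HZ)))
        out += struct.pack("BBB", PHONE_ID[phone], dur, f0)
    return bytes(out)


def unpack_utterance(data):
    """Inverse of pack_utterance, up to quantization error."""
    return [
        (PHONES[p], d * DUR_STEP_MS, f * F0_STEP_HZ)
        for p, d, f in struct.iter_unpack("BBB", data)
    ]


# Invented values for "hello world"; a real front end would supply these.
text = "hello world"
segments = [
    ("h", 60, 0), ("@", 80, 120), ("l", 70, 124), ("ou", 150, 130),
    ("w", 60, 118), ("er", 140, 112), ("l", 70, 106), ("d", 80, 100),
]
packed = pack_utterance(segments)
ratio = len(packed) / len(text.encode("ascii"))
print(len(packed), "bytes packed,", round(ratio, 2), "x the plain text")
```

With these made-up numbers the packed form is 24 bytes against 11 bytes of text, so a little over twice the size. That figure covers only the per-utterance data; the diphone (or other unit) database that the renderer needs on the cartridge is a separate, one-time cost that this sketch does not try to estimate.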
Does anyone have any experience with this? Can you comment on feasibility?
Also, I'm totally new to speech synth (and audio in general, really), so if there are some good learning resources, can you let me know? Thanks.
--Neil