gbadev.org forum archive

I was reading a post on this forum from 2003 about speech synth, which got me thinking:

It seems like the most complicated aspects of speech synthesis could all be prerendered. This might make Text-To-Speech relatively easy on the GBA. It would only work for text that already exists at compile time, but it seems to me like it should be feasible. The Festival Text-To-Speech site lists the following steps for converting text to sound:
------------------------------------------------------------------------
http://festvox.org/docs/manual-1.4.3/festival_14.html#SEC47

Tokenization
Converting the string of characters into a list of tokens. Typically this means whitespace separated tokens of the original text string.

Token identification
Identification of general types for the tokens. Usually this is trivial, but it requires some work to identify tokens of digits as years, dates, numbers etc.

Token to word
Convert each token to zero or more words, expanding numbers, abbreviations etc.

Part of speech
Identify the syntactic part of speech for the words.

Prosodic phrasing
Chunk utterance into prosodic phrases.

Lexical lookup
Find the pronunciation of each word from a lexicon/letter to sound rule system including phonetic and syllable structure.

Intonational accents
Assign intonation accents to appropriate syllables.

Assign duration
Assign duration to each phone in the utterance.

Generate F0 contour (tune)
Generate tune based on accents etc.

Render waveform
Render waveform from phones, duration and F0 target values. This itself may take several steps, including unit selection (be they diphones or other sized units), imposition of desired prosody (duration and F0) and waveform reconstruction.

-------------------------------------------------------------------------

So, I'm thinking that everything except rendering the waveform, and probably generating the F0 contour, could be done in advance. This would mean that all the most complicated steps could be handled by existing code libraries running on the development system, and only the last few steps would have to be set up to run on the GBA.

To be clear (and to avoid pedantry in the replies), yes, I realise that the whole thing could be prerendered to a wave file. But if the last two steps were done on the GBA, could you store the speech in a format that is, say, less than ten times bigger than just the plain text?
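
To put a rough number on the "ten times the plain text" budget, here is a minimal sketch in C. The speaking rate and average word length are assumptions chosen for illustration, not figures from this thread, and the function name is hypothetical.

```c
#include <assert.h>

/* Bytes of encoded speech data allowed per second of audio, when the
 * budget is expressed as a multiple of the plain-text size.
 * words_per_min and chars_per_word are assumed speaking statistics. */
unsigned speech_budget_bytes_per_sec(unsigned words_per_min,
                                     unsigned chars_per_word,
                                     unsigned size_ratio)
{
    assert(words_per_min > 0 && chars_per_word > 0);
    /* Plain-text bytes consumed per minute of speech, scaled by the ratio. */
    unsigned bytes_per_min = words_per_min * chars_per_word * size_ratio;
    return bytes_per_min / 60;
}
```

Assuming 150 words per minute and 6 characters per word (counting the space), a ratio of 10 works out to 150 bytes per second, or 1200 bit/s.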

Does anyone have any experience with this? Can you comment on feasibility?

Also, I'm totally new to speech synth (and audio in general, really) so if there are some good learning resources, can you let me know? thanks.

--Neil

Totally doable. There's a wide range of quality to choose from, so you can simplify the encoded data to save space at the cost of quality.

Very simple algorithms might use data that's close to the size of the original text -- a typical 6-letter word might be composed of 4 or 5 phonemes, and each of those could be encoded in a byte or two if you supply minimal information.
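
As a sketch of what "a byte or two" per phoneme could look like, the C fragment below packs one phone into 16 bits. The field layout (6-bit phoneme id, 5-bit duration, 5-bit pitch index) and all names are hypothetical; a real synthesizer would pick field widths to suit its phoneme set and prosody model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit phone record:
 *   bits 15..10  phoneme id               (0..63)
 *   bits  9..5   duration in 10 ms steps  (0..31)
 *   bits  4..0   pitch table index        (0..31)
 */
typedef uint16_t PhoneRec;

/* Pack the three fields into one record; values must fit their fields. */
PhoneRec phone_pack(unsigned id, unsigned dur, unsigned pitch)
{
    assert(id < 64 && dur < 32 && pitch < 32);
    return (PhoneRec)((id << 10) | (dur << 5) | pitch);
}

/* Field accessors, as the playback side would use them. */
unsigned phone_id(PhoneRec r)    { return r >> 10; }
unsigned phone_dur(PhoneRec r)   { return (r >> 5) & 31u; }
unsigned phone_pitch(PhoneRec r) { return r & 31u; }
```

At 2 bytes per phone, a 6-letter word made of 5 phonemes costs 10 bytes against 6 bytes of text, which is under twice the size of the original.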

Last night, in my pile of Atari disks, I ran across a copy of S.A.M., the software speech synthesizer. If the Atari could do it with 64K of RAM, 96K of disk space, a 1.79 MHz CPU and no PCM channels, the GBA should easily be able to pull it off :)

If you know ahead of time what you are going to be saying, you should get very good results in a minimal amount of space.

Sounds like a fun project, I'll be interested in seeing how this works out :)
_________________
dennis

Thanks sajiimori, SmileyDude.

Since my original post, I've run across Festival Lite (flite), which is a free ANSI-C speech synthesis library. I'm going to spend a little time seeing how much hassle would be involved in getting it running on the GBA. The major difficulty is likely to be memory constraints, as this library renders a whole utterance to a waveform at once, which is likely to take up more memory than is available.

Anyway, I'm going to check it out. If that flops, I'll be looking for something simpler.

Thanks again for your encouragement.

--Neil

Perhaps you know the tune "Das Boot"?
The robotic voice in that tune is made by an Atari speech synthesizer.
That synthesizer can play any given piece of text in realtime on an Atari 520, so there is no need to prerender anything if you don't want to.

[Synth vs. wave]
You claimed that you're willing to spend about 200 bytes per second on speech data. If you just want to compress a speech waveform, you might get acceptable results at low bitrates with an LPC vocoder, but you'll need to hire a voice actor.

[Festival]
An entire utterance at 18 kHz should fit in EWRAM if it is shorter than about 14 seconds per breath. However, I wonder whether the more advanced techniques, such as those in Festival, would run in real time at 16 MHz.
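
The 14-second figure can be checked with a short calculation. This sketch assumes the whole 256 KiB of EWRAM is given over to the waveform, 8-bit mono samples, and a sample rate of 18157 Hz (one common GBA rate near 18 kHz); the function name is hypothetical.

```c
#include <assert.h>

#define EWRAM_BYTES (256u * 1024u)  /* GBA external work RAM */

/* Whole seconds of audio that fit in buf_bytes at the given sample
 * rate and sample size. */
unsigned max_utterance_seconds(unsigned buf_bytes,
                               unsigned sample_rate_hz,
                               unsigned bytes_per_sample)
{
    assert(sample_rate_hz > 0 && bytes_per_sample > 0);
    return buf_bytes / (sample_rate_hz * bytes_per_sample);
}
```

With these assumptions, max_utterance_seconds(EWRAM_BYTES, 18157, 1) gives 14. Any EWRAM used for other data, or a 16-bit intermediate buffer, shortens that accordingly.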
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

There was even SAM and RECITER on the C64 (I think the Atari version was a port of this). The software and docs can be found here:
http://www.members.tripod.com/the-cbm-files/speak/

Samples and English-to-phonetic tables are there too.

I think it should be rather easy to port it to the GBA or DS.

If you want better samples, you can find them here:
http://www.student.oulu.fi/~vtatila/reviews_of_speech_synths.html

Coding > Text to Speech, voice synthesis (can it be done?)

#26916 - nmcconnell - Tue Sep 28, 2004 11:03 pm

#26918 - sajiimori - Tue Sep 28, 2004 11:19 pm

#26930 - SmileyDude - Wed Sep 29, 2004 5:33 pm

#26932 - nmcconnell - Wed Sep 29, 2004 5:55 pm

#26935 - ampz - Wed Sep 29, 2004 6:44 pm

#26938 - tepples - Wed Sep 29, 2004 9:18 pm

#70113 - amsterdam - Sat Feb 04, 2006 12:11 pm