#16405 - krom - Sun Feb 15, 2004 2:58 am
    A WIP audio engine demo called Sand2 has been announced by Tursi:
Hi there! I wanted to report that I've released my first GBA demo -
just a simple little scroller, but its main purpose is to highlight
my WIP audio engine.
Check out the audio engine demo at: http://harmlesslion.com/games/sand2     
 
    #16408 - tepples - Sun Feb 15, 2004 5:57 am
    Music, at 1.63 bits per sample? Wow. That port of GSM RPE-LTP beats even the 2.7 bits per sample codec I was working on to replace 8ad. I wonder if it would work well on anything other than DDR music.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
    
 
    #16427 - Tursi - Mon Feb 16, 2004 1:40 am
    It's not *exactly* 5:1, more like 4.8:1 (33 bytes in -> 160 bytes out). Close enough, though. The internal accuracy is 13 bits, so some is wasted here.
GSM is meant, of course, for 8-13khz speech. Using it for music like this is a gross abuse of the codec, and I'm still using the default tables - they could probably be tweaked better for music, though possibly at a greater accuracy cost. Dunno yet.
Just to answer the question, though, I linked a different song into my Sand demo and uploaded it to a temporary folder:
http://www.harmlesslion.com/temp/Sand2_Pentium44k.zip
This one plays 'All About the Pentiums' at 44khz, which is roughly the top end for the CPU in the current version of the code. The real GBA gets about 5-7 fps on the graphics while the music is playing; emulators tend to vary (if it stutters a lot, your emu isn't keeping up).
The GSM compressed version (mono, 44khz) came out to 1907k, which compares well against the MP3 (stereo, 160kbps) at 4193k (stereo->mono and slightly less than half the size).
The codec really prefers rich music with a good range, especially something that can mask its decoding errors (especially with my GBA port, since so much has been removed for speed reasons). Tinny, simple music (bells, xylophone, that sorta thing), and quiet/bassy sounds (such as the rumbling after my thunderclap at the beginning of the demo) so far generate a lot of noise and sound pretty lousy. I'll be working on trying to compensate for that.
Stereo can be achieved, of course, for twice the CPU usage, though I don't have a demo of that at this time.
I realized that 'Sandstorm' made a lousy demonstration, since it's so repetitive and synthy, but I was pretty addicted to the tune when I started. ;) It also sounded decent through GSM at 16khz.     
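The size figures in this post follow from the frame layout mentioned above (33 bytes in, 160 samples out). As a rough sketch of that arithmetic, and not code taken from the demo, the helpers below compute the bits spent per sample and the compressed bytes consumed per second of mono audio. The names `gsm_millibits_per_sample` and `gsm_bytes_per_second` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

enum {
    GSM_FRAME_BYTES   = 33,   /* one Toast .gsm frame, per the post above */
    GSM_FRAME_SAMPLES = 160   /* samples decoded from one frame */
};

/* Compressed bits spent per decoded sample, in thousandths
   (a return value of 1650 means 1.650 bits per sample). */
static uint32_t gsm_millibits_per_sample(void)
{
    return (GSM_FRAME_BYTES * 8u * 1000u) / GSM_FRAME_SAMPLES;
}

/* Compressed bytes consumed per second of mono audio, rounded up.
   Hypothetical helper: the sample rate is whatever the player uses. */
static uint32_t gsm_bytes_per_second(uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    uint64_t bytes = (uint64_t)sample_rate_hz * GSM_FRAME_BYTES;
    return (uint32_t)((bytes + GSM_FRAME_SAMPLES - 1) / GSM_FRAME_SAMPLES);
}
```

Under these assumptions, a 44100 Hz mono stream needs a little over 9000 bytes per second, and stereo decoded as two independent streams would need twice that.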
 
    #16430 - DekuTree64 - Mon Feb 16, 2004 3:17 am
    Hey, the quality is darn impressive for that kind of compression rate. And if you used that to compress a voice-only channel and played it alongside a regular MOD-type song, it would sound totally awesome. If you've ever played Tales of Phantasia for the SNES (Japanese only though, so you have to find a ROM), they did something to that effect in the opening demo. Very cool, especially for how hard I've heard it is to work with the SNES sound chip. 
Another possibility would be to compress individual sounds and decompress to EWRAM for MOD playing. Could save a bunch of ROM space at the cost of limiting the max samples for a song/EWRAM usage for everything else.
In any case, great work so far.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
    
 
    #16462 - tepples - Tue Feb 17, 2004 5:13 am
    Plan on releasing the source code for the optimized decoder any time soon?
    
 
    #16512 - Tursi - Wed Feb 18, 2004 8:56 am
    Considering it. I want to spend a bit more time trying to clean up the audio first, at least do a simple filter on it to stop the pops. :)
    
 
    #16515 - tepples - Wed Feb 18, 2004 3:55 pm
    I looked at your program's audio output in VisualBoyAdvance, and it seems you're returning to zero for several samples during each "pop". Such dropouts have nothing to do with the GSM codec at all and everything to do with losing synchronization between the FIFOs and the DMA. That's why I plan to use a more "GBA friendly" sample rate such as 18157 Hz, where I can switch buffers on vblank.
Did you make changes to the encoder that change its bitstream format away from the .gsm format that Toast uses? Or did you just optimize the free decoder a bit and stick it in IWRAM?
 
    #16560 - tepples - Thu Feb 19, 2004 5:47 pm
    Update: While waiting, I took a crack at porting the free GSM code myself. I stripped out all encoding code, put the rest into IWRAM, and took a few shortcuts in Short_term_synthesis_filtering() (aka the Workhorse). I managed to get performance on GBA hardware up to about 30 kHz at 100% CPU, or 18 kHz (my planned sustained rate) at 60% CPU.
This has implications even for commercial games. At 30 Kbits/s, or about 225 KBytes/min, will this signal the beginning of the end of MIDI-sounding music in 2D GBA games?
    
 
    #16561 - poslundc - Thu Feb 19, 2004 6:00 pm
     | tepples wrote: | 
  | This has implications even for commercial games. At 30 Kbits/s, or about 225 KBytes/min, will this signal the beginning of the end of MIDI-sounding music in 2D GBA games? | 
I doubt it will at 60% CPU consumption... any idea what it would be like more around 11 KHz? (Based on 11 KHz probably being a reasonable minimum for acceptable playback quality on the GBA, although MOD instruments are only recorded at 8 KHz, so who knows?)
Dan. 
    #16567 - Tursi - Thu Feb 19, 2004 7:48 pm
    Heyas. :)
It's just a straight port, all data has been encoded with Toast.
I got my performance by moving to IWRAM, a great deal of code optimization (do away with some of the code tricks they pull here and there, the optimizer does a lousy job with tricks), manual loop unrolls, and playing with the optimization options to see what worked best. 
I hadn't yet had a chance to look at the output waveform, since my real work is eating most of my time at the moment. The zero bytes aren't what I was expecting. I'm a bit surprised that the DMA would fall out of sync regardless of the sample rate (do the interrupts tend to lag?) - and I would expect it to be much worse in the 44khz sample?
Tepples - I'd be interested in sharing notes to see what you trimmed from Short_term_synthesis_filtering(). :) I don't believe I changed the codepath in there at all, so if we put our heads together maybe we can improve the CPU use even more.
Poslundc - Although it's not 100% accurate to do so, the CPU usage as the bitrate increases is relatively linear, so knowing the usage at one bitrate, you can reasonably estimate the usage at another.
    
 
    #16573 - Miked0801 - Thu Feb 19, 2004 9:58 pm
    How large is its ROM/EWRAM/IWRAM footprint?  If you're taking a larger portion of RAM, it's still not useful.
    
 
    #16579 - tepples - Thu Feb 19, 2004 10:49 pm
     | poslundc wrote: | 
  | I doubt [waveform codecs such as GSM RPE-LTP] will [replace .s3m and the like in GBA games] at 60% CPU consumption | 
How much CPU time do commercial 2D games use for game and graphics logic?
 | Tursi wrote: | 
  | I hadn't yet had a chance to look at the output waveform | 
My port is set up to decode an entire waveform to EWRAM before playing any of it. If you have clicks, you might want to try adding that as a debug feature. Or you might want to try building the decoder as a native program, which I did to familiarize myself with the codebase before attempting the GBA port.
 | Tursi wrote: | 
  | The zero bytes aren't what I was expecting. I'm a bit surprised that the DMA would fall out of sync regardless of the sample rate (do the interrupts tend to lag?) | 
DMA, such as a big copy triggered on vblank or in the main thread, will delay an interrupt from triggering. I notice that you're doing a bit of palette animation in the clickiest part of the demo; do you use DMA to copy a block of palette data? I just synchronize the entire engine to vblank so that I can make sure that an interrupt will never happen at the same time as a DMA. This does, however, restrict the choice of sample rates, which is why I chose 18 kHz.
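To illustrate why a rate near 18157 Hz suits a vblank-synchronized engine, here is a minimal sketch of the arithmetic. It assumes the commonly documented GBA timing figures (a 16777216 Hz system clock and 280896 CPU cycles per video frame); the function names are hypothetical. A timer period that divides the frame length evenly yields a whole number of samples per frame, so the buffer switch can always happen on vblank.

```c
#include <assert.h>
#include <stdint.h>

#define GBA_CPU_HZ        16777216u  /* 2^24 Hz system clock */
#define GBA_FRAME_CYCLES  280896u    /* 228 scanlines * 1232 cycles each */

/* Samples played per video frame for a given timer period (in CPU cycles),
   or 0 if the period does not divide the frame evenly. */
static uint32_t samples_per_frame(uint32_t timer_period)
{
    assert(timer_period > 0);
    if (GBA_FRAME_CYCLES % timer_period != 0)
        return 0;
    return GBA_FRAME_CYCLES / timer_period;
}

/* Nominal sample rate in Hz for a timer period, rounded to nearest. */
static uint32_t sample_rate_hz(uint32_t timer_period)
{
    assert(timer_period > 0);
    return (GBA_CPU_HZ + timer_period / 2) / timer_period;
}
```

With a period of 924 cycles this gives 304 samples per frame at about 18157 Hz, whereas a period of 1049 cycles (roughly 16 kHz) does not divide the frame evenly, so the buffer boundary would drift relative to vblank.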
 | Tursi wrote: | 
  | Tepples - I'd be interested in sharing notes to see what you trimmed from Short_term_synthesis_filtering(). | 
1. get rid of the 'register' keywords; GCC usually seems to do a better job dynamically
2. change the 'word' (i.e. s16) variables to 'int' variables and ditch the '0x0ffff &' in the expressions, so I could...
3. change the GSM_SUB and GSM_ADD to regular subtraction and addition
There were also some calls to gsm_sub() and the like that I got rid of. The lowercase version of gsm_sub() incurs the overhead of a function call; the macro version GSM_SUB() doesn't.
Here's what my cut-down GSM decoder library looks like so far; it's under the same license as Toast.
Did you use any GCC optimization flags other than -O3?
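For readers unfamiliar with the trade-off in points 2 and 3 above, here is an illustrative sketch. It is not the actual libgsm macro or either poster's code: it only contrasts a saturating 16-bit add with a plain add on native ints. All three function names are hypothetical. The plain version skips the clamp and the 16-bit masking, which is cheaper on a 32-bit CPU, but the caller then has to be sure the value still fits before narrowing it.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 16-bit add: clamp the result into [-32768, 32767]. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Plain add on native ints: no clamp, no masking. */
static int plain_add(int a, int b)
{
    return a + b;
}

/* Narrow back to 16 bits; the caller must guarantee the value fits. */
static int16_t narrow16(int v)
{
    assert(v >= INT16_MIN && v <= INT16_MAX);
    return (int16_t)v;
}
```

The difference only shows up when an intermediate value leaves the 16-bit range: `sat_add16` pins it to the limit, while `plain_add` keeps the larger value in a 32-bit register.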
 | Tursi wrote: | 
  | I don't believe I changed the codepath in there at all | 
I used PALRAM[0] to zero in on the identity of the bottleneck. PALRAM[0] profiling changes the background color at the beginning of each function to produce horizontal bars proportional in thickness to the time taken by a function. I assigned yellow to Short_term_synthesis_filtering() and found that it took most of the time.
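A minimal sketch of that profiling trick, assuming the standard GBA layout (background palette entry 0 at address 0x05000000, 15-bit colors with red in the low bits). The helper name `profile_mark` and the macros are hypothetical; the pointer is passed in so the same function can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* GBA colors are 15-bit: red in bits 0-4, green 5-9, blue 10-14. */
#define RGB15(r, g, b) ((uint16_t)((r) | ((g) << 5) | ((b) << 10)))

/* Backdrop color (background palette entry 0) on GBA hardware. */
#define GBA_BACKDROP ((volatile uint16_t *)0x05000000)

/* Write a marker color at the start of a function being profiled.
   On hardware, pass GBA_BACKDROP; the band of that color on screen
   is as tall as the time spent until the next marker is written. */
static void profile_mark(volatile uint16_t *backdrop, uint16_t color)
{
    assert(backdrop != 0);
    *backdrop = color;
}

/* Example use on hardware (yellow while the filter runs):
   profile_mark(GBA_BACKDROP, RGB15(31, 31, 0)); */
```

The appeal of this approach is that it costs a single halfword store per marker, so it barely disturbs the timing being measured.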
 | Tursi wrote: | 
  | Although it's not 100% accurate to do so, the CPU usage as the bitrate increases is relatively linear | 
In my tests, it is linear. It takes a nearly constant time to decode one 160-sample RPE-LTP frame; the only difference from one sample rate to another is how many RPE-LTP frames per second it uses.
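If the per-frame decode cost really is constant, the CPU load at another sample rate can be extrapolated from one measurement. The sketch below does only that scaling; the function name is hypothetical, and the result is an estimate rather than a measurement, since it ignores any fixed per-interrupt overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a measured CPU load to another sample rate, assuming a constant
   decode cost per 160-sample frame. Loads are in tenths of a percent
   (600 means 60.0%). */
static uint32_t estimate_cpu_permille(uint32_t known_rate_hz,
                                      uint32_t known_permille,
                                      uint32_t target_rate_hz)
{
    assert(known_rate_hz > 0);
    uint64_t scaled = (uint64_t)known_permille * target_rate_hz;
    return (uint32_t)((scaled + known_rate_hz / 2) / known_rate_hz);
}
```

Starting from the 60% at 18 kHz quoted earlier in the thread, this gives roughly 37% at 11 kHz and 100% at 30 kHz, the latter matching the other measured figure.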
    #16594 - Tursi - Fri Feb 20, 2004 3:33 am
    Miked0801: The decoder itself currently places everything, including its lookup tables, in IWRAM, and uses about 8904 bytes (according to the MAP file). (And it leaves about 376 bytes of code in ROM - just setup functions.) The lookup tables could be placed in EWRAM or even ROM without a major impact, but they use less than 200 bytes.
Tepples:
 | Quote: | 
  | My port is set up to decode an entire waveform to EWRAM before playing any of it. | 
Might be a good idea for troubleshooting, yes. I still need to test your theory about the DMA falling behind (probably this weekend.) My demo does not do very much DMA besides the playback itself, and it double-buffers the audio, processing the 'next' buffer as soon as it swaps the banks, so it's very unlikely that it's a buffer underrun. It is possible that the buffer swap (which is interrupt-triggered) is being delayed by something, especially since my audio interrupt may be triggered at literally any time. I considered using the VBlank and can't remember why I decided not to use a compatible rate (I think when I made my decision I was still in the early stages of the engine and targeted 16khz as a magic value).
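The double-buffering scheme described here can be sketched roughly as follows. This is not the Sand2 code: the buffer size, the struct, and the function name are all assumptions made for illustration. The swap routine stands in for the interrupt handler, flipping which bank is being played and handing back the other bank so the decoder can start refilling it immediately.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_SAMPLES 160  /* assumed: one GSM frame of decoded 8-bit samples */

typedef struct {
    int8_t  buf[2][BUF_SAMPLES];
    uint8_t playing;     /* index of the bank the hardware is reading */
} PingPong;

/* Called from the swap interrupt: flip banks and return the bank that
   just finished playing, which is now free to be refilled. */
static int8_t *pingpong_swap(PingPong *pp)
{
    assert(pp != 0);
    pp->playing ^= 1u;
    return pp->buf[pp->playing ^ 1u];
}
```

In a scheme like this, any delay between the hardware running off the end of the playing bank and the swap actually happening is audible, which is why the timing of the swap interrupt matters more than the decode speed.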
 | Quote: | 
  | 1. get rid of the 'register' keywords; GCC usually seems to do a better job dynamically
 2. change the 'word' (i.e. s16) variables to 'int' variables and ditch the '0x0ffff &' in the expressions, so I could...
 3. change the GSM_SUB and GSM_ADD to regular subtraction and addition
 | 
So I'd already done 2 and 3; in fact, I simplified a lot of the math code (at some cost in accuracy). I never thought to try 1, though, so I will go ahead and see how that helps. :)
 | Quote: | 
  | Did you use any GCC optimization flags other than -O3? | 
I did use -funroll-loops, and I felt it did help.
I'll certainly look at Short_term_synthesis_filtering(), but after a quick look I don't see that I can do much (besides insisting on register variables, which I'll remove and test). Might try unrolling the loop.
 | Quote: | 
  | In my tests, it is linear. It takes a nearly constant time to decode one 160-sample RPE-LTP frame; the only difference from one sample rate to another is how many RPE-LTP frames per second it uses. | 
While this is true, there's some overhead in the interrupt calls in my decoder, which of course would happen more often. I'm not sure if there's additional cost, there probably is NOT, but I tend to err on the side of caution when providing estimates. ;)     
 
    #16604 - Paul Shirley - Fri Feb 20, 2004 4:30 pm
    removed
    
 
    #21562 - Tursi - Mon May 31, 2004 11:54 pm
    This is late, but I wanted to let everyone know that I have now released the full source code to Sand2, including the optimized GSM decoder.
After seeing the PSP and GB-DS at E3, I had a change of heart regarding hoarding it. ;)
The code may be used freely in any *free* production, but I'll want some compensation for my time if you use it in anything for charge. ;) As for how to plug it into your own code - play with it a bit, it's not too hard. My demo shows two different ways (interrupt driven in the main demo code, and polled in the 'Tursi' screen code).
http://harmlesslion.com/games/sand2
I'd be very pleased if anyone who used it dropped me an email so I could see what they did. :)     
 
    #21698 - tepples - Thu Jun 03, 2004 5:02 am
    I just sent some money your way for inspiring, nay, challenging me to write my own GSM decoder, by proving that it 1. was possible in real time on a GBA and 2. didn't totally screw up dance music.
I took a brief look at your optimized decoder ("combined.c" which corresponds roughly in scope to my decoder's "gsmcode.c"), and it appears you got rid of each instance of struct gsm_state *S, making it a global variable instead. Did that optimization in fact make the code run nearly twice as fast, as it would appear from comparison of your results to mine? It would also seem to restrict the decoder to one simultaneous mono GSM stream. I can think of a few cases where this restriction would be a pain.
However, it appears your "non-commercial" license precludes linking your codec with Free software (as defined by gnu.org). I'd suggest a copyleft license such as the GNU GPL if you don't want the proprietary software industry to make money off your hard work.
Oh, and regarding Vblank-compatible sample rates: I just made a tool for those.
Honest comparison so far:
Tursi's GSM decoder: non-commercial license only, one simultaneous stream, uses 30% of CPU
Tepples's GSM decoder: permissive license, multiple simultaneous streams in theory, uses 60% of CPU
 
    #21700 - Tursi - Thu Jun 03, 2004 5:36 am
    Thanks very much for your donation! I appreciate that and the comments. I'm glad to have inspired a challenge. ;)
To be honest, I don't imagine the change to a global variable provided the speed boost - in fact, testing with other globals actually cost me speed; as far as I could tell, the compiler was less likely to store global pointers in registers than locals. The biggest reason I made it a global was to reduce stack usage, back before I increased the size of my interrupt stack.
I didn't really think about the multi-streaming considerations of that, but at the same time, my thinking on stereo was to decode one channel, then the other. Each 33-byte GSM frame is completely independent of the frames around it, and so as long as you decode whole blocks at a time, concurrency is not a problem. This does double the CPU usage, and I was considering ways to pack the data to work around it, but I didn't really have time to get very far. (One thing that I wanted to try to take advantage of was the 13-bit resolution of GSM, when we only need 8 bits. I considered packing 6 bits per channel, but being a lossy codec, I haven't yet come up with a decent way to pack the channels.)
The largest speed boost, I believe, came from the fact that I completely removed one stage of the decoder. I can't remember anymore what that stage was, though. I also believe I started from an older version of the decoder than you did, though again, I have no idea where I found it anymore. :) Not very useful, eh? I played a *lot* with removing/simplifying stages in the decoder to see what was needed and get some idea what was really going on.
The other thing that helped a lot was the removal of the clipping code. This means that some samples could sound very bad if this isn't taken into account, although playing with the de-emphasis value can compensate some for that, which is why it's exposed.
It's worth noting that this version doesn't click anymore. ;) That turned out to be letting the interrupt timer tick one too many times before reloading the buffers. Sometimes we made it, sometimes we didn't. ;)
I'm not really a fan of GPL for every case, but you're right, my license prevents people from using it in their own GPLd software, which would be the most useful place for something like this (ie: Pogoshell? ;) ). On the other hand, I did the work specifically to use it in a non-free, non-open project, and I felt releasing it as GPL then using it as not myself was, at best, a small conflict. But, that may be a lesser issue now...
I will consider this. :)
    
 
    #21701 - tepples - Thu Jun 03, 2004 6:25 am
     | Tursi wrote: | 
  | I'm not really a fan of GPL for every case, but you're right, my license prevents people from using it in their own GPLd software, which would be the most useful place for something like this (ie: Pogoshell? ;) ). | 
A PogoShell plug-in would be a great idea. But I thought of one thing: If one were to use GSM RPE-LTP in a PogoShell plug-in, then one would have to specify the sample rate. The PC codecs (such as QuickTime, which installed itself as the *.gsm handler on my winbox) assume 8000 Hz, the same rate used in actual mobile telephony, but 8000 Hz isn't so hot for music.
 | Quote: | 
  | On the other hand, I did the work specifically to use it in a non-free, non-open project | 
So would the code be a "work made for hire" under some contract with a hypothetical employer?
 | Quote: | 
  | and I felt releasing it as GPL then using it as not myself was, at best, a small conflict. | 
If you own copyright in the work, then dual-licensing is perfectly fine, as seen in LZO.
    #25695 - Tursi - Fri Aug 27, 2004 5:07 am
     | Quote: | 
  | So would the code be a "work made for hire" under some contract with a hypothetical employer? | 
I still see hits for this thread in my referrer log, so I should probably clear up this one thing - the hypothetical employer was myself, so this is not a work for hire and only myself and the original authors hold any claim to it. :)
The project got shelved anyway. So if someone wants to use this, just ask me. Even if it's going into a GPL project, I'll probably say yes (and then lose it, but whatever ;) ).