gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

Audio > New sound clipping algorithm

#168603 - Ruben - Sat May 09, 2009 11:57 am

Hey guys.
So I was trying to optimize the juice out of my latest sound mixer and I came up with a slightly faster clipping algorithm for signed samples. Here's the code (I'll try to comment it):

Code:
@ Before the loop . . .
mov     ip, #0x7F               @ BYTE_MAX

@ During the loop . . .

mov     r5, r6, asr #0x14       @ Scale our data down (in this
                                @ case, it's 0~127 volumed data
                                @ and pre-scaled by .3, and
                                @ packed as RRRRLLLL, so this
                                @ would get the right data.)
movs    lr, r5, asr #0x07       @ Ok, so we start off by getting
                                @ the highest bits which we don't
                                @ want
mvnmis  lr, lr                  @ If it was negative, ~x so we can
                                @ see if the data really did
                                @ overflow.
subne   r5, ip, r5, asr #0x1F   @ This trick is similar to the one
                                @ found in TONC for "safe" division.
                                @ if(overflowed) move 127-sign
                                @ Sign will be 0 or -1, and
                                @ 127 - -1 = 127 + 1 = 0x80 (-128 as a byte)
                                @ If it was 0 (or positive number)
                                @ 127 - 0 = 127
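
For anyone who wants to check the trick off-target, here is a minimal C sketch of the movs/mvnmis/subne sequence, compared against a plain clamp. The function names are made up for illustration, and it assumes `>>` on a negative `int32_t` is an arithmetic shift (true for GCC and Clang, but implementation-defined in ISO C). Only the low byte is compared, since the trick leaves 128 (0x80) in the register on negative overflow.

```c
#include <stdint.h>

/* Plain clamp to the signed byte range, used as the reference. */
int32_t clip_reference(int32_t x)
{
    if (x > 127)  return 127;
    if (x < -128) return -128;
    return x;
}

/* C model of movs/mvnmis/subne with ip = 127. */
int32_t clip_shift_trick(int32_t x)
{
    int32_t hi = x >> 7;        /* movs   lr, r5, asr #7: 0 or -1 if x fits in a byte */
    if (hi < 0)
        hi = ~hi;               /* mvnmis lr, lr */
    if (hi != 0)
        x = 127 - (x >> 31);    /* subne  r5, ip, r5, asr #31: gives 127 or 128 */
    return x;
}

/* Hypothetical self-check: counts inputs in [lo, hi] whose low byte
   differs from the reference clamp. */
int count_clip_mismatches(int32_t lo, int32_t hi)
{
    int bad = 0;
    for (int32_t x = lo; x <= hi; x++)
        if ((uint8_t)clip_shift_trick(x) != (uint8_t)clip_reference(x))
            bad++;
    return bad;
}
```

After the `asr #0x14` the value is a 12-bit signed number, so `count_clip_mismatches(-2048, 2047)` covers every input the loop can produce.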

#168613 - kusma - Mon May 11, 2009 7:04 pm

Ruben wrote:
and I came up with a slightly faster clipping algorithm

Faster than what?

#168614 - Ruben - Mon May 11, 2009 7:11 pm

Faster than cmp, movgt, cmn, movlt ;)

#168615 - Tyler24 - Mon May 11, 2009 8:21 pm

If the # of instructions is the same, and they aren't accessing memory, why are these faster? I thought all instructions that didn't involve memory and stuff took the same # of CC to run?

... or not?

#168616 - Ruben - Mon May 11, 2009 9:01 pm

Please note that it says *before* loop and *during* loop.. :P
The actual clipping with "standard" stuff is..

Code:
mov     r5, r6, asr #0x14
cmp     r5, #0x7F
movgt   r5, #0x7F
cmn     r5, #0x80
mvnlt   r5, #0x7F


EDIT: Oh and there's an instruction that doesn't necessarily run at 1c all the time: MUL (along with MLA, SMULL, UMULL, MLAL, etc). The multiplication for 32-bit (MUL/MLA) is ~2 cycles, but takes longer depending on the number of significant bits in the 2nd operand. MLA is MUL+ADD, so it's ~3 cycles, and I'm not sure about the timings for 64-bit multiply, but I think it was 1 or 2 extra cycles.
Other than MUL and its friends, I think all non-memory instructions run at 1c + waitstate.
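
As a rough illustration of the MUL timing, here is a small C sketch of the early-termination rule as I understand it from the ARM7TDMI data sheet: MUL takes 1S + mI cycles and MLA takes 1S + (m+1)I, where m is 1 to 4 depending on how many upper bytes of the multiplier operand Rs (the last register in `mul Rd, Rm, Rs`) are all zeros or all ones. The function names are hypothetical and wait states are ignored.

```c
#include <stdint.h>

/* Number of multiplier array cycles m for MUL/MLA, chosen by the value of Rs.
   Upper bits that are all 0 or all 1 let the multiply finish early. */
int mul_array_cycles(uint32_t rs)
{
    uint32_t top24 = rs & 0xFFFFFF00u;
    uint32_t top16 = rs & 0xFFFF0000u;
    uint32_t top8  = rs & 0xFF000000u;
    if (top24 == 0 || top24 == 0xFFFFFF00u) return 1;
    if (top16 == 0 || top16 == 0xFFFF0000u) return 2;
    if (top8  == 0 || top8  == 0xFF000000u) return 3;
    return 4;
}

/* Total cycles without wait states: MUL = 1S + mI, MLA = 1S + (m+1)I. */
int mul_cycles(uint32_t rs) { return 1 + mul_array_cycles(rs); }
int mla_cycles(uint32_t rs) { return 2 + mul_array_cycles(rs); }
```

Under this model a signed 8-bit value in Rs gives the 2-cycle MUL mentioned above, while a full 32-bit value costs 5.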

#168618 - FluBBa - Mon May 11, 2009 9:30 pm

I didn't quite understand it at first, but I see how it works now. Really good.
I just got an idea, though it's not as bulletproof as yours: it only works for values between -256 & +255. Instead of
Code:

movs    lr, r5, asr #0x07
mvnmis  lr, lr
subne   r5, ip, r5, asr #0x1F

you can do
Code:

teq    r5, r5, lsl #0x18
subpl   r5, ip, r5, asr #0x1F

If I remember correctly the teq instruction is an eors without a destination, I think that should work.
_________________
I probably suck, my not is a programmer.

#168619 - Ruben - Mon May 11, 2009 9:50 pm

Yeah, teq is EOR[S] without a destination.. but.. @_________@

Give me a few minutes to work that out :P

#168620 - FluBBa - Mon May 11, 2009 10:03 pm

Hmm, that might have to be a
Code:

submi   r5, ip, r5, asr #0x1F

The code compares the sign bit of the byte with the sign bit of the whole (long)word. If they are the same the value has not overflown, if they differ the value has overflown.
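
A minimal C sketch of that sign comparison, using the submi form and ip = 127. The names are made up for illustration, and it assumes `>>` on a negative `int32_t` is an arithmetic shift. The self-check compares the low byte against a plain clamp.

```c
#include <stdint.h>

/* C model of: teq r5, r5, lsl #24 ; submi r5, ip, r5, asr #31 */
int32_t clip_teq_trick(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t diff = ux ^ (ux << 24);   /* teq: bit 31 = word sign ^ byte sign (bit 7) */
    if (diff & 0x80000000u)            /* MI: the two signs differ, so clip */
        x = 127 - (x >> 31);           /* 127, or 128 (0x80 as a byte) */
    return x;
}

/* Hypothetical self-check: counts inputs in [lo, hi] whose low byte
   differs from a plain clamp to -128..127. */
int count_teq_mismatches(int32_t lo, int32_t hi)
{
    int bad = 0;
    for (int32_t x = lo; x <= hi; x++) {
        int32_t want = x > 127 ? 127 : (x < -128 ? -128 : x);
        if ((uint8_t)clip_teq_trick(x) != (uint8_t)want)
            bad++;
    }
    return bad;
}
```

In this model every input from -256 to 255 clips correctly, while an input such as 300 (bit 7 clear, sign clear) passes through unclipped, which matches the stated range limit.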

#168621 - kusma - Mon May 11, 2009 10:36 pm

Did you compare your clipper to mine? ;) Pimpmobile does clipping of unsigned samples with DC-offset correction in 3.25 cycles per sample (not including memory accesses).

#168622 - Ruben - Mon May 11, 2009 10:38 pm

kusma: Yeahuh :P It's pretty good, too, but I mixed in signed mode using yet another trick I came up with (namely, filling in the sign bits manually) so yeah. :P

#168623 - kusma - Mon May 11, 2009 10:57 pm

Ruben wrote:
I mixed in signed mode using yet another trick I came up with (namely, filling in the sign bits manually) so yeah. :P

How do you do that efficiently? I mean, "ldrb, mla, add" is a pretty nice loop (when under-sampling). I can't see any room for manual sign-extension without losing performance.

#168624 - Ruben - Mon May 11, 2009 11:02 pm

It's one cycle slower than unsigned mixing. Kinda like...


Code:
muls  lr, r2, lr @ lr = y0 + s(y1-y0), r2 = 00RR00LL, 0~127 each
orrmi lr, lr, #0x7F0000
bicpl lr, lr, #0x7F0000
add   \Out, \Out, lr, asr #7 @ In interpolated mixing, pre-shifting sounds better :-S
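
One way to read the snippet, sketched in C: the multiply scales both channels at once, and the orrmi/bicpl pair overwrites bits 16-22 (the low bits of the right lane) with the sample's sign, so that after the `asr #7` those bits become the sign extension of the left lane. This is only a model under assumptions: the names are hypothetical, the volume word is taken to be 0x00RR00LL with both fields in 0~127, and it relies on two's complement conversion and arithmetic `>>` (GCC/Clang behaviour). The self-check only covers left volumes 1~127; with a left volume of 0 and a negative sample, this model leaves the filled bits in the left lane.

```c
#include <stdint.h>

/* C model of: muls lr, r2, lr ; orrmi/bicpl lr, lr, #0x7F0000 ; lr asr #7 */
int32_t scale_packed(int8_t sample, uint32_t vol)
{
    int32_t prod = (int32_t)sample * (int32_t)vol;  /* both lanes in one multiply */
    uint32_t bits = (uint32_t)prod;
    if (prod < 0)
        bits |= 0x007F0000u;     /* orrmi: fill bits 16-22 with ones */
    else
        bits &= ~0x007F0000u;    /* bicpl: clear bits 16-22 */
    return (int32_t)bits >> 7;   /* bits 16-22 end up in bits 9-15 */
}

/* Left lane of a packed result, read as a signed 16-bit value. */
int32_t left_lane(int32_t packed)
{
    int32_t lo = (int32_t)((uint32_t)packed & 0xFFFFu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* Hypothetical self-check: for right volume rr and left volumes 1..127,
   counts cases where the left lane is not (sample * ll) >> 7. */
int count_left_lane_mismatches(uint32_t rr)
{
    int bad = 0;
    for (int s = -128; s <= 127; s++)
        for (int ll = 1; ll <= 127; ll++)
            if (left_lane(scale_packed((int8_t)s, (rr << 16) | (uint32_t)ll))
                    != ((s * ll) >> 7))
                bad++;
    return bad;
}
```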


And my mixer is more quality centered than speed centered; it uses Q23 linear interpolation and Q23 stepping

#168625 - kusma - Mon May 11, 2009 11:22 pm

Ruben wrote:
It's one cycle slower than unsigned mixing. Kinda like...

That looks to me like it's two cycles slower.

Ruben wrote:
And my mixer is more quality centered than speed centered; it uses Q23 linear interpolation and Q23 stepping

I'm not sure what you mean by Q23. Care to explain?

By the way, I've considered adding linear interpolation for the oversampling-case (since that's the only place it makes sense - aliasing is aliasing no matter if it's -slightly- low-pass filter-ish'ed by the lerping), at pretty low cost. I'm also considering adding stereo panning at only a constant performance cost in the clipper (and not by modifying the mixing-loop at all) by expanding the volume to 0xRRRRLLLL, and mixing the lower volume channels first. This is something you can't really do with signed samples ;)

But first I've got to listen to the "crowd" and add a sensible API for sound effects.

#168626 - Ruben - Mon May 11, 2009 11:33 pm

Well, in most cases, when mixing unsigned you gotta clear the "bad" bits. In this case, I'm clearing/filling the bad bits, and it only takes 1 cycle longer (cos it has to take care of both cases). The ldrb, mla, add is only useful for very few channels or very little volume difference, unless you've got mono, in which case you can just go ahead and mla all you want since you'll have 32 bits of data.

And by Q23, I mean 9.23 fixed point. That is, something like..


Code:
@ Before loop
mul    r4, ip, r4 @ r4 = Freq<<23/MFreq

.macro MixSamp Out
mul    lr, r3, r10 @ lr = s(y1-y0) (r3 = Error/Position, .23)
add    lr, r9, lr, asr #23 @ lr = y0 + s(y1-y0)
muls  lr, r2, lr @ 00RR00LL * Samp
orrmi lr, lr, #0x7F0000 @ Fill sign
bicpl lr, lr, #0x7F0000 @ Clear top bits
add   \Out, \Out, lr, asr #7

add   r3, r3, r4 @ Position += Delta
movs  lr, r3, lsr #23 @ (int)Delta
beq   1f @ (int)Delta == 0

ldrsb r9,[r2, lr]! @ Get next sample

cmp   r2, r1, lsr #4 @ if(Src >= End)
blge  .LOutOfData

ldrsb r10, [r2, #1]


1:
.endm

@ Then the mixing code...

.LOutOfData:
ldr       r2, [sp]
tst       r1, #1
ldrne    r2, [r2, #0x14-0x40]
ldrnesb r9, [r2]
movne   pc, lr

mov  r0, #0
strb r0, [r2, #-0x40]
b    .LChannelKill
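
For readers unfamiliar with the notation, here is a small C sketch of the 9.23 fixed-point arithmetic the macro relies on: the step is `(freq << 23) / mix_freq`, the top 9 bits of the position count whole samples, and the low 23 bits are the interpolation fraction. All names are hypothetical, this is not the mixer's actual code, and it assumes arithmetic `>>` for negative values.

```c
#include <stdint.h>

#define FRAC_BITS 23   /* 9.23 fixed point: 9 integer bits, 23 fraction bits */

/* Step added to the position per output sample, computed in 64 bits
   so that freq << 23 cannot overflow. */
uint32_t q23_step(uint32_t freq, uint32_t mix_freq)
{
    return (uint32_t)(((uint64_t)freq << FRAC_BITS) / mix_freq);
}

/* Whole source samples contained in a position value. */
uint32_t q23_whole(uint32_t pos) { return pos >> FRAC_BITS; }

/* Fractional part of a position value, in [0, 2^23). */
uint32_t q23_frac(uint32_t pos) { return pos & ((1u << FRAC_BITS) - 1u); }

/* Linear interpolation y0 + s*(y1 - y0), with s = frac / 2^23.
   frac < 2^23 and |y1 - y0| <= 255, so the product fits in 32 bits. */
int32_t q23_lerp(int8_t y0, int8_t y1, uint32_t frac)
{
    int32_t diff = (int32_t)y1 - (int32_t)y0;
    return y0 + (((int32_t)frac * diff) >> FRAC_BITS);
}
```

For example, playing a 16384 Hz sample at a 32768 Hz mix rate gives a step of `1 << 22`, so every other output sample lands halfway between two source samples.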


EDIT: Anyway, I'm off to school now.. I'll see if I can get in here during my business and accounting class.

#168627 - kusma - Tue May 12, 2009 12:01 am

Ruben wrote:
Well, in most cases, when mixing unsigned you gotta clear the "bad" bits. In this case, I'm clearing/filling the bad bits, and it only takes 1 cycle longer (cos it has to take care of both cases). The ldrb, mla, add is only useful for very few channels or very little volume difference, unless you've got mono, in which case you can just go ahead and mla all you want since you'll have 32 bits of data.

I'm not sure I understand / agree with you here. What I'm talking about is filling each 32-bit word with one left and one right sample. 16 bits should be enough headroom and sub-sample precision for 8-bit sound output, so there's no need to clear anything, just let the multiplier move the bits where you need them.

Quote:
Anyway, I'm off to school now.. I'll see if I can get in here during my business and accounting class.

I'm going to bed myself - it's 01:00 AM here, and I have an important meeting in the morning ;)

#168631 - Ruben - Tue May 12, 2009 6:35 am

Ah, I see what you mean, but there are 3 problems with that:

-Speed: You have to multiply twice for left/right, and do an extra 32-bit load/store, in which case I may as well put the playback buffer in IWRAM and write to that
-You have to have double the caching buffer for left/right in 32-bit each
-ZOMG! ALIASING!! :-O
I'm not sure why it happens, but from what I've noticed, keeping the precision gives the final sound a whistly quality, which sounds downright terrible, imo. That's why I'm doing ASR 7 now; I only realized this ~3 days ago. I can hazard a guess, though: keeping the accuracy introduces slight scaling error, which, when added up across the other channels, creates something rather awkward.

#168653 - Ruben - Wed May 13, 2009 9:42 am

FluBBa:
Ohhhhh, now I see what it does. Yeah, it should be submi, because it's the last bit of the whole thing. Hm, I think that works with 16-bit, too

Code:
@ Before loop . . .
mov   ip, #0x7F00           @ ip = 0x7F00
add   ip, ip, #0xFF         @ ip = 0x7FFF = SHORT_MAX

@ During loop . . .
teq   r5, r5, lsl #0x10     @ Sign ^ gSign
submi r5, ip, r5, asr #0x1F @ If bit 31 is on, then
                            @ it means that the signs
                            @ differ; clip


EDIT: Oh and the teq/submi thing isn't --quite-- as foolproof, as it assumes that the 2nd last bit out of n+1 (where n is the number of bits) will be different than the sign whenever the value has overflowed. It usually will be, but if the value needs more than n+1 bits, it can fail.