gbadev.org forum archive

Hey guys.
So I was trying to optimize the juice out of my latest sound mixer and I came up with a slightly faster clipping algorithm for signed samples. Here's the code (I'll try to comment it):

Code:

@ Before the loop . . .
mov ip, #0x7F @ BYTE_MAX

@ During the loop . . .

mov r5, r6, asr #0x14 @ Scale our data down (in this
@ case, it's 0~127 volumed data
@ and pre-scaled by .3, and
@ packed as RRRRLLLL, so this
@ would get the right data.)
movs lr, r5, asr #0x07 @ Ok, so we start off by getting
@ the highest bits which we don't
@ want
mvnmis lr, lr @ If it was negative, take ~x so we can
@ see if the data really did
@ overflow.
subne r5, ip, r5, asr #0x1F @ This trick is similar to the one
@ found in TONC for "safe" division.
@ if(overflowed) move 127-sign
@ Sign will be 0 or -1, and
@ 127 - -1 = 127 + 1 = 0x80 = -128 (as a byte)
@ If it was 0 (or positive number)
@ 127 - 0 = 127

Ruben wrote:

and I came up with a slightly faster clipping algorithm

Faster than what?

Faster than cmp, movgt, cmn, movlt ;)

If the # of instructions is the same, and they aren't accessing memory, why is this faster? I thought all instructions that didn't involve memory and stuff took the same # of clock cycles to run?

... or not?

Please note that it says *before* loop and *during* loop.. :P
The actual clipping with "standard" stuff is..

Code:

mov r5, r6, asr #0x14
cmp r5, #0x7F
movgt r5, #0x7F
cmn r5, #0x80
mvnlt r5, #0x7F
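
For anyone who wants to check that the two sequences agree, here is a rough C model of both (not code from the thread). The names `clip8_movs`, `clip8_cmp` and the `asr` helper are made up for this sketch. It assumes the clipped value ends up stored as a byte, which is the only way 0x80 reads as -128.

```c
#include <stdint.h>

/* Portable arithmetic shift right; C leaves >> on negative values
   implementation-defined, so negative inputs go through ~x. */
static int32_t asr(int32_t x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Model of: movs lr, r5, asr #7 / mvnmis lr, lr / subne r5, ip, r5, asr #31 */
static int32_t clip8_movs(int32_t r5)
{
    const int32_t ip = 0x7F;     /* BYTE_MAX, loaded before the loop */
    int32_t lr = asr(r5, 7);     /* 0 or -1 when r5 fits in a signed byte */
    if (lr < 0)                  /* mvnmis: invert only when negative */
        lr = ~lr;
    if (lr != 0)                 /* subne: unwanted bits left over, so clip */
        r5 = ip - asr(r5, 31);   /* 127 - 0 = 127, or 127 - (-1) = 0x80 */
    return r5;
}

/* Model of: cmp r5, #0x7F / movgt / cmn r5, #0x80 / mvnlt */
static int32_t clip8_cmp(int32_t r5)
{
    if (r5 > 0x7F)
        r5 = 0x7F;
    if (r5 < -0x80)
        r5 = ~0x7F;              /* mvn of 0x7F gives -128 */
    return r5;
}
```

The two models only differ in the upper bits of a clipped negative value (0x80 versus 0xFFFFFF80), so comparing the low byte of each result is the fair test.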

EDIT: Oh and there's an instruction that doesn't necessarily run at 1c all the time: MUL (along with MLA, SMULL, UMULL, SMLAL, UMLAL, etc). The multiplication for 32-bit (MUL) is ~2 cycles, but takes longer depending on the number of significant bits in the 2nd operand. MLA is MUL+ADD, so it's ~3 cycles, and I'm not sure about the timings for 64-bit multiply, but I think it was 1 or 2 extra cycles.
Other than MUL and its friends, I think all non-memory instructions run at 1c + waitstate.
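
To put numbers on the MUL remark, here is a small C sketch of the early-termination rule as I recall it from the ARM7TDMI documentation: MUL costs 1S + mI and MLA costs 1S + (m+1)I, where m depends on how many of the upper bytes of the multiplier operand are pure sign extension. The function names are hypothetical, and waitstates are left out.

```c
#include <stdint.h>

/* m = 1 if bits 31..8 of the multiplier operand are all 0 or all 1,
   m = 2 if bits 31..16 are, m = 3 if bits 31..24 are, otherwise 4. */
static int mul_m(int32_t rs)
{
    uint32_t u = (uint32_t)rs;
    if ((u >> 8) == 0 || (u >> 8) == 0xFFFFFFu)
        return 1;
    if ((u >> 16) == 0 || (u >> 16) == 0xFFFFu)
        return 2;
    if ((u >> 24) == 0 || (u >> 24) == 0xFFu)
        return 3;
    return 4;
}

/* Cycle counts without waitstates. */
static int mul_cycles(int32_t rs) { return 1 + mul_m(rs); }  /* 1S + mI     */
static int mla_cycles(int32_t rs) { return 2 + mul_m(rs); }  /* 1S + (m+1)I */
```

Under that rule an 8-bit sample as the last operand gives the 2-cycle MUL mentioned above, while a packed volume such as 0x7F007F in that position would cost 4.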

I didn't quite understand it at first, but I see how it works now. Really good.
Though I just got an idea, but it's not as bulletproof as yours: it only works for values between -256 and +255. Instead of

Code:

movs lr, r5, asr #0x07
mvnmis lr, lr
subne r5, ip, r5, asr #0x1F

you can do

Code:

teq r5, r5, lsl #0x18
subpl r5, ip, r5, asr #0x1F

If I remember correctly the teq instruction is an eors without a destination, I think that should work.
_________________
I probably suck, my not is a programmer.

Yeah, teq is EOR[S] without a destination.. but.. @_________@

Give me a few minutes to work that out :P

Hmm, that might have to be a

Code:

submi r5, ip, r5, asr #0x1F

The code compares the sign bit of the byte with the sign bit of the whole (long)word. If they are the same the value has not overflowed, if they differ the value has overflowed.
_________________
I probably suck, my not is a programmer.
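
A rough C model of the teq/submi pair, for checking the claimed -256 to +255 range. This is a sketch with made-up names (`clip8_teq`, `clamp8`, `asr`), not code from the thread.

```c
#include <stdint.h>

/* Portable arithmetic shift right. */
static int32_t asr(int32_t x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Model of: teq r5, r5, lsl #24 / submi r5, ip, r5, asr #31 */
static int32_t clip8_teq(int32_t r5)
{
    const int32_t ip = 0x7F;
    /* Bit 31 of the XOR is (sign of the word) ^ (bit 7, sign of the byte). */
    uint32_t x = (uint32_t)r5 ^ ((uint32_t)r5 << 24);
    if (x & 0x80000000u)             /* mi: the two signs differ, so clip */
        r5 = ip - asr(r5, 31);
    return r5;
}

/* Plain reference clamp to the signed byte range. */
static int32_t clamp8(int32_t v)
{
    return v > 127 ? 127 : (v < -128 ? -128 : v);
}
```

Within -256..255 the low byte matches the reference clamp. A value like 256 has bit 7 clear and a positive sign, so it slips through unclipped.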

Did you compare your clipper to mine? ;) Pimpmobile does clipping of unsigned samples with DC-offset correction in 3.25 cycles per sample (not including memory accesses).

kusma: Yeahuh :P It's pretty good, too, but I mixed in signed mode using yet another trick I came up with (namely, filling in the sign bits manually) so yeah. :P

Ruben wrote:

I mixed in signed mode using yet another trick I came up with (namely, filling in the sign bits manually) so yeah. :P

How do you do that efficiently? I mean, "ldrb, mla, add" is a pretty nice loop (when under-sampling). I can't see any room for manual sign-extension without losing performance.

It's one cycle slower than unsigned mixing. Kinda like...

Code:

muls lr, r2, lr @ lr = y0 + s(y1-y0), r2 = 00RR00LL, 0~127 each
orrmi lr, lr, #0x7F0000
bicpl lr, lr, #0x7F0000
add \Out, \Out, lr, asr #7 @ In interpolated mixing, pre-shifting sounds better :-S
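
Here is how I read the sign-fill step, as a C sketch under assumptions: the volume is packed as 0x00RR00LL with both halves in 0..127, the sample is a signed byte, and the names (`scale_packed`, `left_lane`, `asr`) are invented for illustration. After the multiply, bits 16..22 hold the low bits of the right-channel product. Overwriting them with the sign means that, once shifted down by 7, the lower 16 bits are a properly sign-extended left-channel value.

```c
#include <stdint.h>

/* Portable arithmetic shift right. */
static int32_t asr(int32_t x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Model of: muls / orrmi #0x7F0000 / bicpl #0x7F0000, then "lr, asr #7". */
static int32_t scale_packed(uint32_t vol, int32_t s)
{
    int32_t p = (int32_t)vol * s;    /* R*s in the top half, L*s in the bottom */
    if (p < 0)
        p |= 0x7F0000;               /* orrmi: fill bits 16..22 with ones */
    else
        p &= ~0x7F0000;              /* bicpl: clear bits 16..22 */
    return asr(p, 7);                /* value that gets added to \Out */
}

/* Read the lower 16 bits back as a signed 16-bit lane. */
static int32_t left_lane(int32_t packed)
{
    int32_t lo = (int32_t)((uint32_t)packed & 0xFFFFu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}
```

For left volumes 1..127 the lower lane comes out as `(L * s) >> 7`. This sketch only reasons about the left lane and does not cover a left volume of 0; the right lane picks up a borrow from the lower half when the sample is negative, so it is not checked here.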

And my mixer is more quality centered than speed centered; it uses Q23 linear interpolation and Q23 stepping

Ruben wrote:

It's one cycle slower than unsigned mixing. Kinda like...

That looks to me like it's two cycles slower.

Ruben wrote:

And my mixer is more quality centered than speed centered; it uses Q23 linear interpolation and Q23 stepping

I'm not sure what you mean by Q23. Care to explain?

By the way, I've considered adding linear interpolation for the oversampling-case (since that's the only place it makes sense - aliasing is aliasing no matter if it's -slightly- low-pass filter-ish'ed by the lerping), at pretty low cost. I'm also considering adding stereo panning at only a constant performance cost in the clipper (and not by modifying the mixing-loop at all) by expanding the volume to 0xRRRRLLLL, and mixing the lower volume channels first. This is something you can't really do with signed samples ;)

But first I've got to listen to the "crowd" and add a sensible API for sound effects.

Well, in most cases, when mixing unsigned you gotta clear the "bad" bits. In this case, I'm clearing/filling the bad bits, and it only takes 1 cycle longer (cos it has to take care of both cases). The ldrb, mla, add is only useful for very few channels or very little volume difference, unless you've got mono, in which case you can just go ahead and mla all you want since you'll have 32 bits of data.

And by Q23, I mean 9.23 fixed point. That is, something like..

Code:

@ Before loop
mul r4, ip, r4 @ r4 = Freq<<23/MFreq

.macro MixSamp Out
mul lr, r3, r10 @ lr = s(y1-y0) (r3 = Error/Position, .23)
add lr, r9, lr, asr #23 @ lr = y0 + s(y1-y0)
muls lr, r2, lr @ 00RR00LL * Samp
orrmi lr, lr, #0x7F0000 @ Fill sign
bicpl lr, lr, #0x7F0000 @ Clear top bits
add \Out, \Out, lr, asr #7

add r3, r3, r4 @ Position += Delta
movs lr, r3, lsr #23 @ (int)Position
beq 1f @ (int)Position == 0

ldrsb r9,[r2, lr]! @ Get next sample

cmp r2, r1, lsr #4 @ if(Src >= End)
blge .LOutOfData

ldrsb r10, [r2, #1]

1:
.endm

@ Then the mixing code...

.LOutOfData:
ldr r2, [sp]
tst r1, #1
ldrne r2, [r2, #0x14-0x40]
ldrnesb r9, [r2]
movne pc, lr

mov r0, #0
strb r0, [r2, #-0x40]
b .LChannelKill

EDIT: Anyway, I'm off to school now.. I'll see if I can get in here during my business and accounting class.

Ruben wrote:

Well, in most cases, when mixing unsigned you gotta clear the "bad" bits. In this case, I'm clearing/filling the bad bits, and it only takes 1 cycle longer (cos it has to take care of both cases). The ldrb, mla, add is only useful for very few channels or very little volume difference, unless you've got mono, in which case you can just go ahead and mla all you want since you'll have 32 bits of data.

I'm not sure I understand / agree with you here. What I'm talking about is filling each 32-bit word with one left and one right sample. 16 bits should be enough headroom and sub-sample precision for 8-bit sound output, so there's no need to clear anything, just let the multiplier move the bits where you need them.

Quote:

Anyway, I'm off to school now.. I'll see if I can get in here during my business and accounting class.

I'm going to bed myself - it's 01:00 AM here, and I have an important meeting in the morning ;)

Ah, I see what you mean, but there are 3 problems with that:

-Speed: You have to multiply twice for left/right, and do an extra 32-bit load/store, in which case I may as well put the playback buffer in IWRAM and write to that
-You have to have double the caching buffer for left/right in 32-bit each
-ZOMG! ALIASING!! :-O
From what I've noticed, I'm not sure why it happens, but keeping precision gives the final sound a whistly quality, which sounds downright terrible, imo, which is why I'm doing ASR 7 now; only realized this ~3 days ago. I can hazard a guess though: Keeping the accuracy will introduce slight error from scaling at all, which, when added up with other channels, will create something rather awkward.

FluBBa:
Ohhhhh, now I see what it does. Yeah, it should be submi, because it's the last bit of the whole thing. Hm, I think that works with 16-bit, too

Code:

@ Before loop . . .
mov ip, #0x7F00 @ ip = 0x7F00
add ip, ip, #0xFF @ ip = 0x7FFF = SHORT_MAX

@ During loop . . .
teq r5, r5, lsl #0x10 @ Sign ^ gSign
submi r5, ip, r5, asr #0x1F @ If bit 31 is on, then
@ it means that the signs
@ differ; clip

EDIT: Oh and the teq/submi thing isn't --quite-- as foolproof, as it assumes that whenever the value overflows, the 2nd last bit out of n+1 (where n is the number of bits), i.e. bit n-1, will be different from the sign. It usually will be, but if it isn't (the value needs more than n+1 bits), it will fail.
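
To make that caveat concrete, here is a small C sketch that generalises the test to n bits and compares it with a real range check. The names `teq_detects` and `overflows` are made up here, and n is assumed to be between 2 and 31.

```c
#include <stdint.h>

/* What "teq v, v, lsl #(32-n)" followed by a mi condition tests:
   bit 31 of the XOR is (sign of the word) ^ (bit n-1). */
static int teq_detects(int32_t v, int n)
{
    uint32_t u = (uint32_t)v;
    return ((u ^ (u << (32 - n))) >> 31) != 0;
}

/* True overflow test for an n-bit signed range. */
static int overflows(int32_t v, int n)
{
    int32_t max = (int32_t)((1u << (n - 1)) - 1);
    return v > max || v < -max - 1;
}
```

For n = 16 the two agree for every value that fits in 17 bits (-65536..65535). Just outside that range, 65536 and -65537 both overflow but have bit 15 equal to the sign, so the teq test misses them.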

Audio > New sound clipping algorithm

#168603 - Ruben - Sat May 09, 2009 11:57 am

#168613 - kusma - Mon May 11, 2009 7:04 pm

#168614 - Ruben - Mon May 11, 2009 7:11 pm

#168615 - Tyler24 - Mon May 11, 2009 8:21 pm

#168616 - Ruben - Mon May 11, 2009 9:01 pm

#168618 - FluBBa - Mon May 11, 2009 9:30 pm

#168619 - Ruben - Mon May 11, 2009 9:50 pm

#168620 - FluBBa - Mon May 11, 2009 10:03 pm

#168621 - kusma - Mon May 11, 2009 10:36 pm

#168622 - Ruben - Mon May 11, 2009 10:38 pm

#168623 - kusma - Mon May 11, 2009 10:57 pm

#168624 - Ruben - Mon May 11, 2009 11:02 pm

#168625 - kusma - Mon May 11, 2009 11:22 pm

#168626 - Ruben - Mon May 11, 2009 11:33 pm

#168627 - kusma - Tue May 12, 2009 12:01 am

#168631 - Ruben - Tue May 12, 2009 6:35 am

#168653 - Ruben - Wed May 13, 2009 9:42 am