gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Optimize mixing routine

#34164 - ProblemBaby - Fri Jan 14, 2005 2:46 pm

Hello, does anyone have any good ideas for how I can speed this up:

for (n = 0; n < MAX_CHANNELS; n++, pChannel++)
{
    if (pChannel->Data)
        MixChannel(TempBuffer, pChannel, SampleCount);
}

Code:

.global MixChannel
MixChannel:
      STMFD sp!, { r4-r11 }
      
      @ r0: DestBuffer
      @ r1: Channel
      @ r2: Count
      
      LDR r3, [r1, #MODULECHANNEL_POSITION]
      LDR r4, [r1, #MODULECHANNEL_INCREMENT]
      LDR r5, [r1, #MODULECHANNEL_DATA]
      LDR r7, [r1, #MODULECHANNEL_LENGTH]
      LDR r8, [r1, #MODULECHANNEL_LOOPLENGTH]
      LDRH r12, [r1, #MODULECHANNEL_FINALVOLUME]

      STMFD sp!, { r1 }
      MOV r1, r12

   Copy:
      MOV r9, r3, ASR#(SOUND_MIXING_ACCURACY)
      ADD r3, r3, r4
      
      LDRSB r11, [r5, r9]         @ Sampledata
      
      AND r10, r1, #0xFF         @ Byte 0 = LeftVolume
      MUL r9, r10, r11         @ Volume
      
      MOV r10, r1, ASR#8         @ Byte 1 = RightVolume
      MUL r10, r10, r11         @ Volume


      LDMFD r0!, { r11, r12 }      
      
      ADD r11, r11, r9
      ADD r12, r12, r10

      STMFD r0!, { r11, r12 }
      ADD r0, r0, #8


   
      
      CMP r3, r7
      BLT Keep
      
      CMP r8, #0
      BEQ End

   Back:
      SUB r3, r3, r8
      CMP r3, r7
      BGE Back
      
      B Keep
   End:
      MOV r8, #0
      LDMFD sp!, { r1 }
      STR r8, [r1, #MODULECHANNEL_DATA]
      B End2
      
   Keep:
      
      SUBS r2, r2, #1
      BNE Copy
      LDMFD sp!, { r1 }
      STR r3, [r1, #MODULECHANNEL_POSITION]
   
   End2:
      
      LDMFD sp!, { r4-r11 }
      BX lr
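
For readers less fluent in ARM assembly, a rough C rendering of the routine above might look like the following. It is only a sketch: the struct layout, the value of SOUND_MIXING_ACCURACY and the name MixChannelRef are assumptions made for illustration (the real field offsets come from the MODULECHANNEL_* constants, which the post does not show), and it assumes Count is greater than zero.

```c
#include <stdint.h>
#include <stddef.h>

#define SOUND_MIXING_ACCURACY 12   /* assumed number of fractional bits */

/* Hypothetical layout; only the field names come from the assembly. */
typedef struct {
    int32_t       Position;     /* fixed-point read position */
    int32_t       Increment;    /* fixed-point step per output sample */
    const int8_t *Data;         /* signed 8-bit sample data, NULL = inactive */
    int32_t       Length;       /* fixed-point end position */
    int32_t       LoopLength;   /* fixed-point loop length, 0 = no loop */
    uint16_t      FinalVolume;  /* byte 0 = left, byte 1 = right */
} ModuleChannel;

void MixChannelRef(int32_t *dest, ModuleChannel *ch, uint32_t count)
{
    int32_t pos  = ch->Position;
    int32_t volL = ch->FinalVolume & 0xFF;
    int32_t volR = ch->FinalVolume >> 8;

    for (uint32_t i = 0; i < count; i++) {
        int32_t s = ch->Data[pos >> SOUND_MIXING_ACCURACY];
        pos += ch->Increment;

        dest[0] += s * volL;    /* interleaved left/right accumulators */
        dest[1] += s * volR;
        dest += 2;

        if (pos >= ch->Length) {
            if (ch->LoopLength == 0) {
                ch->Data = NULL;    /* one-shot sample finished */
                return;             /* position is not written back, as in the asm */
            }
            do {
                pos -= ch->LoopLength;
            } while (pos >= ch->Length);
        }
    }
    ch->Position = pos;
}
```

Written this way, the per-sample cost is easy to see: one byte load, two multiplies, a load and store of two accumulators, and an end-of-sample check on every iteration.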


As you can see, the TempBuffer is u32. That's for two reasons:
1. I don't have to do two shifts in the loop, just one in the copy-to-mixbuffer routine.
2. The volume and the overall mix are more accurate (well, I don't think I will hear much difference).
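
The "one shift in the copy routine" idea could be sketched in C roughly as below. This is an illustration under assumptions, not the poster's code: the names CopyToMixBuffer and clamp_s8, the separate left/right output buffers, the clamping, and the shift amount are all hypothetical.

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 8-bit range. */
static int8_t clamp_s8(int32_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* temp holds interleaved left/right 32-bit accumulators (sample * volume sums).
 * A single shift per output value scales them back to 8 bits.
 * Note: >> on a negative int is implementation-defined in C; this sketch
 * assumes the usual arithmetic shift (like ARM's ASR). */
void CopyToMixBuffer(int8_t *left, int8_t *right,
                     const int32_t *temp, uint32_t count, unsigned shift)
{
    for (uint32_t i = 0; i < count; i++) {
        left[i]  = clamp_s8(temp[2 * i]     >> shift);
        right[i] = clamp_s8(temp[2 * i + 1] >> shift);
    }
}
```

With 8-bit volumes, a shift of 8 brings a single full-volume channel back to the 8-bit range; several loud channels summed together would then be clamped.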

Is it worth it?
Do you see anything that could speed this function up a lot?

Right now it takes ~6% CPU per channel just for this function (31536 Hz),
effect/envelope/volume column processing not included.

I've read about mixers that take about 1% per channel. How is that possible?

#34166 - FluBBa - Fri Jan 14, 2005 4:12 pm

In PCEAdvance I mix 6 channels with stereo separation at 18kHz and it takes around 10% cpu.
But I add the data unsigned and then add an offset, and the waveform data is only 32 bytes for each channel.
The mix loop mixes all channels at once and then only writes 2 bytes to the destination.
You can download the source from my homepage, the mixer is in sound.s
_________________
I probably suck, my not is a programmer.


Last edited by FluBBa on Sun Jan 16, 2005 1:04 am; edited 1 time in total

#34173 - ProblemBaby - Fri Jan 14, 2005 8:19 pm

How is it possible to mix it unsigned??

#34174 - DekuTree64 - Fri Jan 14, 2005 8:24 pm

ProblemBaby wrote:
Is it worth it?
Do you see anything that could speed this function up a lot?

Right now it takes ~6% CPU per channel just for this function (31536 Hz),
effect/envelope/volume column processing not included.

I've read about mixers that take about 1% per channel. How is that possible?


Nope, you'll never get it down to 1%. The beauty of mixer optimizing is that you don't have to do everything directly. As long as you end up with all the channels incrementing and multiplying by their volumes and added together in the end, anything goes.

The easiest speedup is to load and store multiple samples from the temp buffer with ldmia/stmia. As Flubba says, unsigned mixing helps too (add 128 to all your sample values before you put them in ROM, so you can load unsigned bytes. Then subtract 128*volume to get back to signed after mixing).

Still, it won't be incredibly fast. Try to come up with creative ways to get the job done, like mixing multiple channels in one pass, packing multiple values into registers to get more mileage out of each multiply, or not loading a new sample at all if only the fractional portion of the position changed.
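
One way to read "packing multiple values into registers" is sketched below in C: two unsigned 8-bit samples share one 32-bit word, so a single multiply scales both. The function names are made up for illustration, and the sketch assumes unsigned samples and an 8-bit volume. Each product is at most 255*255 = 65025, which fits in 16 bits, so nothing carries from the low half into the high half.

```c
#include <stdint.h>

/* Put sample s0 in bits 0-15 and sample s1 in bits 16-31 of one word. */
static uint32_t pack_pair(uint8_t s0, uint8_t s1)
{
    return (uint32_t)s0 | ((uint32_t)s1 << 16);
}

/* One multiply scales both samples by the same 8-bit volume. */
static uint32_t scale_pair(uint32_t packed, uint8_t vol)
{
    return packed * vol;
}

/* Unpack the two 16-bit products again. */
static uint32_t low_product(uint32_t scaled)  { return scaled & 0xFFFF; }
static uint32_t high_product(uint32_t scaled) { return scaled >> 16; }
```

With signed samples a negative low product would borrow from the high half, which is one reason the packed approach goes together with unsigned mixing.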
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#34182 - ProblemBaby - Sat Jan 15, 2005 12:10 am

FluBBa: your code looks very interesting, but I don't understand it =)
How do you handle loops?

Code:

pcmmixloop
   ldrb r0,[r10,r4,lsr#27]         ;Channel 0
   add r4,r4,r4,lsl#16
   ldrb r1,[r10,r4,lsr#27]
   add r4,r4,r4,lsl#16
   orr r0,r0,r1,lsl#16
vol0_L
   movs r1,#0x00               ;volume left
   mul r2,r0,r1
vol0_R
   movs r1,#0x00               ;volume right
   mul r3,r0,r1


What is this doing in detail? How is the pos/freq stored?
If you get down to ~2% per channel, I am very interested!

Thanks in advance

#34199 - DekuTree64 - Sat Jan 15, 2005 9:40 am

Indeed some interesting code. Looks like r10 is the data address, and r4 stores the position in the upper 16 bits, and the increment in the lower 16. Similarly, r5-r9 are the pos/inc for the other 5 channels. Then the volumes are plugged into the mov 0 instructions by means of opcode hacking.
What's interesting, though, is that the source data all seem to be 32 bytes apart. I don't see any code to cache samples in there. Do your channels only play short 32-sample chip sounds (and loop by natural position overflow)?
You might actually be able to cache 32 samples for each channel and mix until one of the channels runs out, and then load more. Must think about that idea.
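
The register layout described above can be modelled in C as follows. This is a sketch of the reading given in this post, not FluBBa's actual source: the function name is hypothetical, and the 5-bit integer / 11-bit fraction split of the position is inferred from the lsr #27 in the quoted code.

```c
#include <stdint.h>

/* state: position in the upper 16 bits (5 integer bits, 11 fraction bits),
 *        increment in the lower 16 bits.
 * Mirrors:  ldrb r0,[r10,r4,lsr#27]
 *           add  r4,r4,r4,lsl#16                                        */
static uint8_t fetch_and_step(const uint8_t wave[32], uint32_t *state)
{
    /* The top 5 bits of the word select one of the 32 wave bytes. */
    uint8_t s = wave[*state >> 27];

    /* Shifting left by 16 moves the increment up under the position and
     * clears the low half, so the add advances the position while the
     * stored increment stays unchanged. Overflow wraps back to index 0. */
    *state += *state << 16;
    return s;
}
```

With an increment of 0x0800 (1.0 in this format) the index advances by one sample per call and wraps to 0 after 32 calls, which is the "loop by natural position overflow" behaviour.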

#34206 - ProblemBaby - Sat Jan 15, 2005 1:19 pm

Hmm, yeah, I don't understand this instruction:
Code:

ldrb r0,[r10,r4,lsr#27]


Does it shift r4 by 27, or r0?
What accuracy do you use, and how big are the samples it can mix?

#34225 - ProblemBaby - Sat Jan 15, 2005 10:51 pm

And what is the advantage of mixing them unsigned?

#34226 - tepples - Sat Jan 15, 2005 11:03 pm

In some mixer designs, unsigned mixing gives you access to more addressing modes, which might shave one or two cycles per sample. However, adding the bias to all the channels at the end might eat up all your savings.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#34228 - ProblemBaby - Sat Jan 15, 2005 11:20 pm

I thought BIAS just added a value to the samples,
not subtracted one?

DekuTree: I read on the AAS site that their mixer runs at ~1% per active channel.

#34232 - DekuTree64 - Sun Jan 16, 2005 12:30 am

ProblemBaby wrote:
I thought BIAS just added a value to the samples,
not subtracted one?

DekuTree: I read on the AAS site that their mixer runs at ~1% per active channel.


Sorry, I meant you'd never get that particular algorithm down to 1%. It's quite possible using crazy new ideas that lessen the work (though I've never managed it).

And the bias can be thought of as adding or subtracting, depending on whether you make it a negative value or not :). Basically you're reversing the conversion to unsigned, which was to add 128.
The advantage is that if you have say 4 channels, you can add up the bias values for each of them into one master bias, and only have to add that in once at the end. If it saves you one cycle per sample on each of the 4 channels, then even if adding it in took 3 cycles, you'd still get a net gain.

#34275 - ProblemBaby - Sun Jan 16, 2005 11:47 pm

I didn't get that.. =)

Is it possible to change the BIAS so that, instead
of storing each signed byte separately in the final destbuffer,
I can make them unsigned, ORR/shift them together and store four bytes at once?
Well, that is possible with signed data too, but I've heard that it takes about the same time to ORR together four signed bytes as to store them separately.

Thanks in advance

#34285 - DekuTree64 - Mon Jan 17, 2005 2:52 am

Hmm, let's see if I can give a bit clearer of an explanation. Say you have an original 8-bit signed sample, which we'll call S, which ranges -128 to 127. To convert to unsigned, you add 128, pushing the range up to 0 to 255. Then you store that unsigned sample (which is S+128) in ROM.

In your mixer, you load the unsigned sample with ldrb, which can do something like ldrb rTemp, [rData, rPos, lsr #16] to get rid of the fractional portion of rPos without an explicit add instruction (ldrsb has no shifted-register addressing mode, so it needs the address calculated separately first).
So, the value you just loaded is S+128, and you need to subtract that 128 to get it back to what the hardware wants. However, just subtracting it would completely defeat the purpose, since the sub would take just as long as the add that you saved by using ldrb instead of ldrsb.

Now if you multiply that still-unsigned sample by the volume, you get:
newSample = (S+128)*vol = (S*vol + 128*vol)
S*vol is what you actually want to add into the mix anyway, and you can still get it back by subtracting 128*vol:
newSample = (S*vol + 128*vol - 128*vol) = S*vol
And then you can add that straight into the mix. The -128*vol is what we call the bias value.

When you have multiple channels, you get something more like this:
mixedSamples = (S1*vol1 + 128*vol1 - 128*vol1) + (S2*vol2 + 128*vol2 - 128*vol2) + (S3*vol3 + ...

The important thing is that the volume doesn't change, so 128*vol for each of the channels is a constant. If you reorder things like this:
mixedSamples = (S1*vol1 + 128*vol1) + (S2*vol2 + 128*vol2) + (S3*vol3 + 128*vol3) + (-128*vol1 - 128*vol2 - 128*vol3)

The Sx*volx + 128*volx are the original unsignedSample*vol, and the -128*volx are all constants. Because they're constants, we can add them all up ahead of time:
bias = (-128*vol1 - 128*vol2 - 128*vol3)

And the formula above looks just the same, except with all the -128*volx replaced by that bias:
mixedSamples = (S1*vol1 + 128*vol1) + (S2*vol2 + 128*vol2) + (S3*vol3 + 128*vol3) + bias

And that's it. End result, you saved one add w/ shift to calculate the address to load from on each channel, but added one add instruction per final mixed sample. With 3 channels here, you still saved 2 cycles per final sample, and the savings only grow with more channels.
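
The algebra above can be checked with a small C sketch. The function names and the fixed three-channel setup are assumptions chosen to match the example, not code from any real mixer: it mixes the same samples once as signed values and once as unsigned values plus a single precomputed bias.

```c
#include <stdint.h>

#define NUM_CH 3

/* Reference: mix signed samples directly. */
static int32_t mix_signed(const int8_t s[NUM_CH], const uint8_t vol[NUM_CH])
{
    int32_t sum = 0;
    for (int c = 0; c < NUM_CH; c++)
        sum += s[c] * vol[c];
    return sum;
}

/* bias = -128*vol1 - 128*vol2 - 128*vol3, computed once while volumes are constant. */
static int32_t master_bias(const uint8_t vol[NUM_CH])
{
    int32_t bias = 0;
    for (int c = 0; c < NUM_CH; c++)
        bias -= 128 * vol[c];
    return bias;
}

/* Mix unsigned samples (each stored as S+128), then add the bias once. */
static int32_t mix_biased(const uint8_t u[NUM_CH], const uint8_t vol[NUM_CH],
                          int32_t bias)
{
    int32_t sum = 0;
    for (int c = 0; c < NUM_CH; c++)
        sum += u[c] * vol[c];
    return sum + bias;
}
```

For any samples S and unsigned copies S+128, mix_biased returns the same value as mix_signed, so the per-channel correction collapses into one add per output sample.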

#34300 - ProblemBaby - Mon Jan 17, 2005 9:45 am

DekuTree64: Thanks a lot for the explanation!
Now I just have to ask one thing about the bias: is the bias value you talk about something of my own, or can I set REG_SGBIAS to this value and have it work? I think I read that its maximum is 512, but maybe I'm wrong.

And then I wonder if it is somehow possible to load/store data from an address in some way other than using LDR/STR.
You talked about opcode hacking.