gbadev.org forum archive

So I decided to start writing my FFIV sequel from scratch again, and I'm close to running out of cycles. I'm thinking about speeding up the only 'optimizable' part of the sound mixing code: the 16-bit to 8-bit conversion, 4 samples at once. (That is, quickly making 4x 8-bit samples from 4x 16-bit samples.)

What I have at the moment is this...

Code:

@ r0: Destination (8-bit, word aligned)
@ r1: Source (16-bit, word aligned)
@ r2: Count
@ r4: Mask 0x00FF00FF
.LMix:
ldmia r1!, {r5-r8} @ RRRRLLLL

and r5, r4, r5, lsr #0x08 @ 00RR00LL
and r6, r6, r4, lsl #0x08 @ RR00LL00
orr r5, r5, r6 @ RRRRLLLL

mov r6, r5, lsr #0x10 @ 0000RRRR
bic r5, r5, r6, lsl #0x10 @ 0000LLLL
mov r6, r6, ror #0x10 @ RRRR0000

and r7, r4, r7, lsr #0x08 @ 00RR00LL
and r8, r8, r4, lsl #0x08 @ RR00LL00
orr r7, r7, r8 @ RRRRLLLL

orr r5, r5, r7, lsl #0x10 @ LLLLLLLL
orr r6, r6, r7, lsr #0x10 @ RRRRRRRR (halfwords still swapped)
mov r6, r6, ror #0x10 @ RRRRRRRR (halfwords in order)

str r6, [r0, #BUFFER_LEN*BUFCNT]
str r5, [r0], #0x04

subs r2, r2, #0x04
bne .LMix

I know there's gotta be a way to speed this up by getting rid of the ror's but... HOW? Can anyone help me here? Please? ><'
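For reference, the result the loop has to produce can be written out byte by byte in C. This is only a sketch for checking an optimized version on a PC: `pack4_reference` is a made-up name, and the layout (right sample in the upper halfword of each source word, left sample in the lower) is an assumption read off the RRRRLLLL comment.

```c
#include <stdint.h>

/* Hypothetical reference helper, not part of the mixer.
   Assumption: in[i] holds one stereo frame, right sample in
   bits 31..16 and left sample in bits 15..0. */
static void pack4_reference(const uint32_t in[4],
                            uint32_t *left, uint32_t *right)
{
    uint32_t l = 0, r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t l8 = (in[i] >> 8) & 0xFFu;   /* top byte of the left sample  */
        uint32_t r8 = (in[i] >> 24) & 0xFFu;  /* top byte of the right sample */
        l |= l8 << (8 * i);                   /* frame i lands in byte i      */
        r |= r8 << (8 * i);
    }
    *left = l;    /* goes to [r0]                       */
    *right = r;   /* goes to [r0, #BUFFER_LEN*BUFCNT]   */
}
```

Running this and a model of the assembly on the same four words, then comparing the two output words, is a quick way to catch a misplaced byte.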

Well I don't have any brilliant ideas at the moment, but this should do it in one cycle less:

Code:

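and r5, r4, r5, lsr #8 @ 00R100L1
and r6, r6, r4, lsl #8 @ R200L200
orr r5, r5, r6 @ R2R1L2L1

and r7, r4, r7, lsr #8 @ 00R300L3
and r8, r8, r4, lsl #8 @ R400L400
orr r6, r7, r8 @ R4R3L4L3

mov r7, r5, lsl #16 @ L2L10000
mov r8, r6, lsl #16 @ L4L30000
orr r7, r8, r7, lsr #16 @ L4L3L2L1 (left word)

bic r8, r6, r8, lsr #16 @ R4R30000
orr r8, r8, r5, lsr #16 @ R4R3R2R1 (right word)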

I'll post again if I come up with anything better.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

Thanks for the idea, Deku! It sped up a fair bit, considering that it does 528 samples per frame, with 16 channels + linear interpolation.

Latest code by Deku, which I believe is as fast as it can get:

Code:

@ r0: Dest
@ r1: Source
@ r2: Count
@ r3: FFFFFFFF (used as r3, lsr #0x10 to get the 0000FFFF mask)
@ r4: Mask (00FF00FF)

.LMix:
ldmia r1!, {r5-r8}

and r5, r4, r5, lsr #0x08 @ 00R100L1
and r6, r6, r4, lsl #0x08 @ R200L200
orr r5, r5, r6 @ R2R1L2L1

and r7, r4, r7, lsr #0x08 @ 00R300L3
and r8, r8, r4, lsl #0x08 @ R400L400
orr r6, r7, r8 @ R4R3L4L3

and r7, r5, r3, lsr #0x10 @ 0000L2L1
orr r7, r7, r6, lsl #0x10 @ L4L3L2L1

bic r8, r6, r3, lsr #0x10 @ R4R30000
orr r8, r8, r5, lsr #0x10 @ R4R3R2R1

str r8, [r0, #BUFFER_LEN*BUFCNT]
str r7, [r0], #0x04

subs r2, r2, #0x04
bne .LMix
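The loop body above can be followed one instruction at a time with a small C model. This is a sketch rather than the mixer's actual source: `pack4_model` is a hypothetical name, each statement stands for one ARM instruction, and the shifted second operand (which costs no extra cycle when the shift amount is an immediate) is written as an explicit shift.

```c
#include <stdint.h>

/* Hypothetical register-for-register model of the loop body.
   r3 and r4 hold the constants listed in the header comments. */
static void pack4_model(const uint32_t in[4],
                        uint32_t *left, uint32_t *right)
{
    const uint32_t r3 = 0xFFFFFFFFu, r4 = 0x00FF00FFu;
    uint32_t r5 = in[0], r6 = in[1], r7 = in[2], r8 = in[3];

    r5 = r4 & (r5 >> 8);     /* and r5, r4, r5, lsr #8   -> 00R100L1 */
    r6 = r6 & (r4 << 8);     /* and r6, r6, r4, lsl #8   -> R200L200 */
    r5 = r5 | r6;            /* orr r5, r5, r6           -> R2R1L2L1 */

    r7 = r4 & (r7 >> 8);     /* and r7, r4, r7, lsr #8   -> 00R300L3 */
    r8 = r8 & (r4 << 8);     /* and r8, r8, r4, lsl #8   -> R400L400 */
    r6 = r7 | r8;            /* orr r6, r7, r8           -> R4R3L4L3 */

    r7 = r5 & (r3 >> 16);    /* and r7, r5, r3, lsr #16  -> 0000L2L1 */
    r7 = r7 | (r6 << 16);    /* orr r7, r7, r6, lsl #16  -> L4L3L2L1 */

    r8 = r6 & ~(r3 >> 16);   /* bic r8, r6, r3, lsr #16  -> R4R30000 */
    r8 = r8 | (r5 >> 16);    /* orr r8, r8, r5, lsr #16  -> R4R3R2R1 */

    *left = r7;              /* str r7, [r0], #4                    */
    *right = r8;             /* str r8, [r0, #BUFFER_LEN*BUFCNT]    */
}
```

That is ten data-processing instructions per four stereo frames, against twelve in the first version.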

How about this? Untested, but I think the idea is sound.

Code:

@ BbAa,DdCc,FfEe,HhGg -> GECA,HFDB

@ r0 : dst (8bit)
@ r1 : src (16bit)
@ r2 : count
@ r4 : 0x00FF00FF
.Lmix:
ldmia r1!, {r5-r8} @ BbAa,DdCc,FfEe,HhGg

@ Downsample r5 and r6
and r5, r4, r5, lsr #8 @ 0B0A
and r6, r4, r6, lsr #8 @ 0D0C
orr r5, r5, r6, lsl #8 @ DBCA

@ Downsample r7 and r8
and r7, r4, r7, lsr #8 @ 0F0E
and r8, r4, r8, lsr #8 @ 0H0G
orr r6, r7, r8, lsl #8 @ HFGE (kept in r6 for the swap)

@ Swap (DB) and (GE)
eor r6, r6, r5, lsr #16 @ HF(G^D)(E^B)
eor r5, r5, r6, lsl #16 @ GECA
eor r6, r6, r5, lsr #16 @ HFDB

Wow, that actually worked! Thanks! ^_^'
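The three eor instructions are the classic XOR swap, applied to halfwords instead of whole registers. A minimal C sketch of just that step, assuming the two packed words arrive as DBCA and HFGE (`swap_halves` is a name invented for this example):

```c
#include <stdint.h>

/* Hypothetical helper: swaps the upper halfword of *lo with the
   lower halfword of *hi using three XORs and no scratch register. */
static void swap_halves(uint32_t *lo, uint32_t *hi)
{
    uint32_t a = *lo;   /* DBCA */
    uint32_t b = *hi;   /* HFGE */

    b ^= a >> 16;       /* HF(G^D)(E^B) */
    a ^= b << 16;       /* GECA         */
    b ^= a >> 16;       /* HFDB         */

    *lo = a;
    *hi = b;
}
```

Because each eor folds its shift into the second operand, the swap costs three instructions and leaves both results in the registers they started in.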

ASM > Four-in-one 16->8-bit

#170357 - Ruben - Fri Sep 18, 2009 8:58 pm

#170358 - DekuTree64 - Fri Sep 18, 2009 9:28 pm

#170359 - Ruben - Fri Sep 18, 2009 9:51 pm

#170360 - Ruben - Fri Sep 18, 2009 10:18 pm

#170404 - Cearn - Mon Sep 21, 2009 6:41 pm

#170413 - Ruben - Mon Sep 21, 2009 11:17 pm