gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies.

ASM > Four-in-one 16->8-bit

#170357 - Ruben - Fri Sep 18, 2009 8:58 pm

I decided to start writing my FFIV sequel from scratch again and am close to running out of cycles, so I'm thinking about speeding up the only 'optimizable' part of the sound mixing code: the 16-bit to 8-bit conversion, 4 samples at once. (That is, quickly make 4x 8-bit samples from 4x 16-bit samples.)

What I have at the moment is this...
Code:
@ r0: Destination (8-bit, word aligned)
@ r1: Source (16-bit, word aligned)
@ r2: Count
@ r4: Mask 0x00FF00FF
.LMix:
    ldmia   r1!, {r5-r8}          @ RRRRLLLL
   
    and     r5, r4, r5, lsr #0x08 @ 00RR00LL
    and     r6, r6, r4, lsl #0x08 @ RR00LL00
    orr     r5, r5, r6            @ RRRRLLLL
   
    mov     r6, r5, lsr #0x10     @ 0000RRRR
    bic     r5, r5, r6, lsl #0x10 @ 0000LLLL
    mov     r6, r6, ror #0x10     @ RRRR0000
   
    and     r7, r4, r7, lsr #0x08 @ 00RR00LL
    and     r8, r8, r4, lsl #0x08 @ RR00LL00
    orr     r7, r7, r8            @ RRRRLLLL
   
    orr     r5, r5, r7, lsl #0x10 @ LLLLLLLL
    orr     r6, r6, r7, lsr #0x10 @ RRRRRRRR (halves swapped)
    mov     r6, r6, ror #0x10     @ RRRRRRRR (halves in order)
   
    str     r6, [r0, #BUFFER_LEN*BUFCNT]
    str     r5, [r0], #0x04
   
    subs    r2, r2, #0x04
    bne     .LMix

I know there's gotta be a way to speed this up by getting rid of the ror's but... HOW? Can anyone help me here? Please? ><'
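
For reference, here is a minimal C sketch of what the loop above computes, which may be handy for checking any faster version against. It assumes (going by the RRRRLLLL comments) that each source word holds the right sample in its upper halfword and the left sample in its lower halfword, and it models the right-channel store at BUFFER_LEN*BUFCNT as a separate array. The names `mix_down_ref`, `dst_l` and `dst_r` are hypothetical and not part of the original code.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference model (assumed layout, see above).
 * count is in samples and must be a multiple of 4.
 * Each output word packs the high bytes of four consecutive samples,
 * first sample in the lowest byte. */
static void mix_down_ref(uint32_t *dst_l, uint32_t *dst_r,
                         const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        uint32_t l = 0, r = 0;
        for (size_t j = 0; j < 4; j++) {
            uint32_t s = src[i + j];
            l |= ((s >> 8)  & 0xFFu) << (8 * j);  /* high byte of left  */
            r |= ((s >> 24) & 0xFFu) << (8 * j);  /* high byte of right */
        }
        dst_l[i / 4] = l;
        dst_r[i / 4] = r;
    }
}
```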

#170358 - DekuTree64 - Fri Sep 18, 2009 9:28 pm

Well I don't have any brilliant ideas at the moment, but this should do it in one cycle less:
Code:
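and r5, r4, r5, lsr #8   @ 00R100L1
and r6, r6, r4, lsl #8   @ R200L200
orr r5, r5, r6           @ R2R1L2L1

and r7, r4, r7, lsr #8   @ 00R300L3
and r8, r8, r4, lsl #8   @ R400L400
orr r6, r7, r8           @ R4R3L4L3

mov r7, r5, lsl #16      @ L2L10000
mov r8, r6, lsl #16      @ L4L30000
orr r7, r8, r7, lsr #16  @ L4L3L2L1

bic r8, r6, r8, lsr #16  @ R4R30000
orr r8, r8, r5, lsr #16  @ R4R3R2R1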

I'll post again if I come up with anything better.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#170359 - Ruben - Fri Sep 18, 2009 9:51 pm

Thanks for the idea, Deku! It sped up a fair bit, considering that it does 528 samples per frame, with 16 channels + linear interpolation.

#170360 - Ruben - Fri Sep 18, 2009 10:18 pm

Latest code by Deku, which I believe is as fast as it can get:
Code:
@ r0: Dest
@ r1: Source
@ r2: Count
@ r3: 0xFFFFFFFF (used as r3, lsr #0x10 to get the mask 0x0000FFFF)
@ r4: Mask (00FF00FF)

.LMix:
   ldmia  r1!, {r5-r8}
   
   and    r5, r4, r5, lsr #0x08 @ 00R100L1
   and    r6, r6, r4, lsl #0x08 @ R200L200
   orr    r5, r5, r6            @ R2R1L2L1
   
   and    r7, r4, r7, lsr #0x08 @ 00R300L3
   and    r8, r8, r4, lsl #0x08 @ R400L400
   orr    r6, r7, r8            @ R4R3L4L3
   
   and    r7, r5, r3, lsr #0x10 @ 0000L2L1
   orr    r7, r7, r6, lsl #0x10 @ L4L3L2L1

   bic    r8, r6, r3, lsr #0x10 @ R4R30000
   orr    r8, r8, r5, lsr #0x10 @ R4R3R2R1
   
   str    r8, [r0, #BUFFER_LEN*BUFCNT]
   str    r7, [r0], #0x04
   
   subs   r2, r2, #0x04
   bne    .LMix
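
As a rough cross-check, the packing step in the loop above can be modelled in C like this. It is only a sketch under the same layout assumption as the comments (right sample in the high halfword, left in the low one); `pack4` is a hypothetical name, and the constants stand in for r4 and for r3 shifted right by 16.

```c
#include <stdint.h>

/* Model of the packing step on four loaded words w[0..3]. */
static void pack4(const uint32_t w[4], uint32_t *left, uint32_t *right)
{
    const uint32_t m = 0x00FF00FFu;          /* r4 */

    uint32_t a = ((w[0] >> 8) & m)           /* 00R100L1 */
               | (w[1] & (m << 8));          /* R200L200 -> R2R1L2L1 */
    uint32_t b = ((w[2] >> 8) & m)           /* 00R300L3 */
               | (w[3] & (m << 8));          /* R400L400 -> R4R3L4L3 */

    *left  = (a & 0x0000FFFFu) | (b << 16);  /* L4L3L2L1 */
    *right = (b & 0xFFFF0000u) | (a >> 16);  /* R4R3R2R1 */
}
```

Each C expression maps onto one or two of the data-processing instructions, since the shifts ride along for free in ARM's shifter operand.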

#170404 - Cearn - Mon Sep 21, 2009 6:41 pm

How about this? Untested, but I think the idea is sound.

Code:
@ BbAa,DdCc,FfEe,HhGg -> GECA,HFDB

    @ r0 : dst (8bit)
    @ r1 : src (16bit)
    @ r2 : count
    @ r4 : 0x00FF00FF
.Lmix:
    ldmia   r1!, {r5-r8}            @ BbAa,DdCc,FfEe,HhGg
   
    @ Downsample r5 and r6
    and     r5, r4, r5, lsr #8      @ 0B0A
    and     r6, r4, r6, lsr #8      @ 0D0C
    orr     r5, r5, r6, lsl #8      @ DBCA
   
    @ Downsample r7 and r8
    and     r7, r4, r7, lsr #8      @ 0F0E
    and     r8, r4, r8, lsr #8      @ 0H0G
    orr     r7, r7, r8, lsl #8      @ HFGE
   
    @ Swap (DB) and (GE)
    eor     r7, r7, r5, lsr #16     @ HF(G^D)(E^B)
    eor     r5, r5, r7, lsl #16     @ GECA
    eor     r7, r7, r5, lsr #16     @ HFDB
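
The swap at the end is the classic three-XOR exchange, applied to the high halfword of one packed word and the low halfword of the other. A minimal C sketch of just that step, assuming the two packed words are DBCA and HFGE as in the comments (`swap_halves`, `lo` and `hi` are hypothetical names):

```c
#include <stdint.h>

/* Exchange the high halfword of *lo with the low halfword of *hi
 * using three XORs, mirroring the three eor instructions.
 * In:  *lo = DBCA, *hi = HFGE
 * Out: *lo = GECA, *hi = HFDB */
static void swap_halves(uint32_t *lo, uint32_t *hi)
{
    *hi ^= *lo >> 16;   /* hi = HF(G^D)(E^B)             */
    *lo ^= *hi << 16;   /* lo = GECA (D and B cancel)    */
    *hi ^= *lo >> 16;   /* hi = HFDB (G and E cancel)    */
}
```

Compared with the and/orr/bic/orr merge, this needs three instructions instead of four and no 0x0000FFFF mask register.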

#170413 - Ruben - Mon Sep 21, 2009 11:17 pm

Wow, that actually worked! Thanks! ^_^'