gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

Coding > which is faster? DMA3 or asm linear buffer copy?

#22328 - Marill - Fri Jun 18, 2004 6:43 am

I'm not too sure about the speeds. I am trying to evaluate which method I should use to copy sprites into VRAM.

Immediately, DMA3 comes to mind.

But looking at the SGADE source code, instead of using DMA3 to perform the transfer, they used a fast asm linear buffer copy instead.

This interests me because for all the other loading of sprites into memory, SGADE uses a DMA3 copy (e.g. SoSpriteMemManagerLoad).

But in the SoSpriteMemManagerLoadFromImage() function, the asm linear buffer copy is used instead of a DMA3 copy.

Does anyone have any idea why this is so?

I am thinking it may be a speed issue. Which is faster, DMA3 or the asm?

The asm is here in its full glory:
(I checked the SGADE license and I think it allows me to post source code here. If I am mistaken, please let me know!)

Code:

@ ---------------------------------------------------------------------------------
@ Title:    SpriteCopy
@ File:     sprite_copy.text.iwram.s
@ Author:   Willem Kokke (Gabriele Scibilia)
@ Created:  March 20 2002 (June 29 2003)
@
@ Info:This file contains the assembly implementation
@      of a function to copy a linear buffer to sprite format, which is
@      organised in a linear array of 8*8 blocks
@
@      This function is located in iwram
@
@      For optimal performance, make sure the source buffer is in iwram
@      If your destination buffer is in vram, make sure to only use this in vblank,
@      since the access times to vram are undefined while refreshing the display
@
@      The function Willem wrote takes ~ 2% cpu time for a 64*64 sprite, Gabriele
@      unrolled the loop and now it is even faster. The latest implementation
@      takes about 14% for a 240*160 fullscreen (while the oldest was ~ 18%)


@ ---------------------------------------------------------------------------------
@ Initialize;
@ ---------------------------------------------------------------------------------

        .ARM
        .ALIGN
        .GLOBL  SoTileSetCopyFromLinearBuffer

@ ---------------------------------------------------------------------------------
@ Externals;
@ ---------------------------------------------------------------------------------


@ ---------------------------------------------------------------------------------
@ SpriteCopy
@
@ Prototype:
@
@ __attribute__ (( long_call ))
@ void SpriteCopy( u32* source, u32* dest, u32 width, u32 height );
@
@ Parameters:
@
@ source:   r0 = the start of the linear buffer in iwram
@ dest:     r1 = the start of the sprite in vram
@ width:    r2 = the width of the iwram buffer in pixels
@ height:   r3 = the height of the iwram buffer in pixels
@ ---------------------------------------------------------------------------------

SoTileSetCopyFromLinearBuffer:

   @ Store the registers we crush on the stack;

       stmfd   sp!,{r0-r12,r14}

   @ calculate the number of 8*8 blocks across the width and height
   @ r4 indicates how many pixels to copy per line
   @ r3 indicates how many lines of pixels to copy

       mov     r4, r2


   @ Copy an 8*8 block from the linear buffer to vram
   @ r14 is the temporary source pointer


   Copy8x8Block:

       mov     r14,r0              @ set the start position for the new block
       add     r0, r14, #8         @ save the start position for the next block
       ldmia   r14,{r5, r6}        @ load 8 bytes from iwram
       add     r14,r14, r2         @ increase the source pointer with "width" bytes
       ldmia   r14,{r7, r8}        @ load 8 bytes from iwram
       add     r14,r14, r2         @ increase the source pointer with "width" bytes
       ldmia   r14,{r9, r10}       @ load 8 bytes from iwram
       add     r14,r14, r2         @ increase the source pointer with "width" bytes
       ldmia   r14,{r11,r12}       @ load 8 bytes from iwram
       stmia   r1!,{r5 -r12}       @ store 32 bytes in vram, and writeback the pointer
       add     r14,r14, r2         @ increase the source pointer with "width" bytes

       ldmia   r14,{r5, r6}        @ repeat for the remaining 4 rows of the 8*8 block
       add     r14,r14, r2
       ldmia   r14,{r7, r8}
       add     r14,r14, r2
       ldmia   r14,{r9, r10}
       add     r14,r14, r2
       ldmia   r14,{r11,r12}
       stmia   r1!,{r5 -r12}

       subs    r4, r4, #8          @ Subtract 8 from the total number of pixels to copy
       bne     Copy8x8Block        @ Zero left?? then go to next row, else copy the next block

       subs    r3, r3, #8          @ Decrease the number of rows with 8
       beq     CopyEnd             @ Zero left?? then branch to end

       mov     r4, r2              @ A new line, so a new blocks per row counter
       add     r0, r14, #8         @ Move the source pointer to the next 8 rows

       b       Copy8x8Block        @ Start the next row

   @ we're finished, restore the registers and let's get outta here

   CopyEnd:

       ldmfd   sp!,{r0-r12,r14}
       bx      lr

@ ---------------------------------------------------------------------------------
@ EOF;
@ ---------------------------------------------------------------------------------
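
For readers who find the register shuffling hard to follow, here is a minimal C sketch of the same reordering, assuming an 8bpp linear bitmap whose width and height are multiples of 8. The name linear_to_tiles_8bpp is made up for illustration and is not part of SGADE. It copies one byte at a time, so it is meant for checking results on a PC or for filling a work RAM buffer, not for writing to VRAM directly (byte-sized writes to VRAM are not reliable on the GBA).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for the tile reordering (not SGADE code).
 * src:    linear 8bpp bitmap, `width` bytes per row
 * dst:    output buffer, filled with consecutive 8x8 tiles of 64 bytes each
 * width, height: size in pixels, both assumed to be multiples of 8 */
void linear_to_tiles_8bpp(const uint8_t *src, uint8_t *dst,
                          size_t width, size_t height)
{
    assert(width % 8 == 0 && height % 8 == 0);

    for (size_t ty = 0; ty < height; ty += 8) {         /* each row of tiles   */
        for (size_t tx = 0; tx < width; tx += 8) {      /* each tile in a row  */
            for (size_t row = 0; row < 8; row++) {      /* 8 lines per tile    */
                const uint8_t *line = src + (ty + row) * width + tx;
                for (size_t col = 0; col < 8; col++)    /* 8 pixels per line   */
                    *dst++ = line[col];
            }
        }
    }
}
```

Running this and the assembly routine on the same input and comparing the output buffers would be one way to sanity-check a port.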

#22329 - Marill - Fri Jun 18, 2004 6:52 am

From the comments in the source code

Quote:
The function Willem wrote takes ~ 2% cpu time for a 64*64 sprite, Gabriele unrolled the loop and now it is even faster


Okies I think I missed that previously.

So let's take a 64x64 sprite.

According to the comments above, this asm takes less than 2% CPU time.

How long will DMA3 halt the CPU for a 64x64 sprite transfer, assuming a 256 color sprite?

Thanks in advance! :)

#22330 - tepples - Fri Jun 18, 2004 7:05 am

A 256 color sprite cel at 64x64 pixels takes 4096 bytes (64 * 64 pixels at one byte each).

Assuming 3/1 wait state (the default is 4/2; most commercial games switch to 3/1 on start), each copy of 2 bytes takes 3 cycles (1 wait, 1 read ROM, 1 write VRAM). A 4096 byte DMA transfer from ROM will take 4096 / 2 * 3 = 6144 cycles, or 2.19 percent of one frame's total time, or 7.33 percent of one frame's vblank time, or just under five scanlines. If you need subscanline interrupt latency (such as if you're doing hblank tricks or serial communication, or if your mixer's double buffer is clocked off an interrupt rather than off vblank), use the BIOS function CpuFastSet to do such copies.
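
The estimate above can be reproduced for any transfer size with a few lines of C. This is only a sketch: the function names are made up, the 3 cycles per halfword figure is the 3/1 wait state assumption from the post, and the timing constants (1232 cycles per scanline, 228 scanlines per frame, 68 of them in vblank) are the commonly documented GBA figures rather than something stated in this thread. DMA startup overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CYCLES_PER_SCANLINE = 1232,  /* commonly documented GBA figure        */
    SCANLINES_PER_FRAME = 228,   /* 160 visible lines + 68 vblank lines   */
    VBLANK_SCANLINES    = 68
};

/* Cycles for a 16-bit DMA of `bytes` bytes, given the cost of one
 * 2-byte unit (3 under the 3/1 wait state assumption). */
uint32_t dma16_cycles(uint32_t bytes, uint32_t cycles_per_halfword)
{
    assert(bytes % 2 == 0);
    return bytes / 2 * cycles_per_halfword;
}

/* Share of a whole frame, in percent. */
double percent_of_frame(uint32_t cycles)
{
    return 100.0 * cycles / (CYCLES_PER_SCANLINE * SCANLINES_PER_FRAME);
}

/* Share of the vblank period only, in percent. */
double percent_of_vblank(uint32_t cycles)
{
    return 100.0 * cycles / (CYCLES_PER_SCANLINE * VBLANK_SCANLINES);
}
```

Passing the cel size in bytes and 3 cycles per halfword to dma16_cycles gives the cycle count, which the other two helpers turn into the frame and vblank percentages.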

Thing is, that assembly code also converts the cel from a raw linear bitmap (as would be found in .bmp) to a tile-format bitmap. DMA can't do this conversion. Most commercial games seem to store their sprite cels in ROM either as tile-format bitmaps or in some compressed format.

And is there a specific reason you need to use 256-color sprites? Using 256-color sprites means you need to use the same palette for all sprites unless you use complicated hblank tricks. Most commercial games actually use 16-color sprites.

#22332 - Marill - Fri Jun 18, 2004 7:40 am

thanks for the explanation, tepples. I am using 16 color sprites.

The reason I assumed 256 color sprites is that the higher level SGADE function that calls this asm works on 256 color sprites, so I'm just using 256 color sprites for comparison with a DMA copy.

#22337 - Marill - Fri Jun 18, 2004 1:20 pm

okies I think I understand why the linear buffer copy is used instead, thanks tepples for the pointers.

You can DMA the data into VRAM if your sprite data has already been formatted into 8x8 blocks. SGADE uses DMA there because its sprite data is pre-formatted into 8x8 blocks (1D mode).
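
As a rough illustration of that DMA path, here is a sketch of starting an immediate 32-bit DMA3 transfer for data that is already in tile format. This is not SGADE's implementation: Dma3Regs and dma3_copy32 are made-up names, and the register block is passed in as a pointer so the logic can be tried off-device. On hardware the three DMA3 registers start at 0x040000D4.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of the DMA3 registers (hypothetical struct name).
 * On a real GBA this block lives at 0x040000D4. */
typedef struct {
    volatile uint32_t sad;  /* source address                              */
    volatile uint32_t dad;  /* destination address                         */
    volatile uint32_t cnt;  /* unit count (low 16 bits) + control (high)   */
} Dma3Regs;

#define DMA_ENABLE (1u << 31)  /* start the transfer          */
#define DMA_32BIT  (1u << 26)  /* transfer 32-bit units       */

/* Start an immediate 32-bit copy of `bytes` bytes from src to dst.
 * Addresses are passed as 32-bit values, as the hardware sees them. */
void dma3_copy32(Dma3Regs *regs, uint32_t src, uint32_t dst, uint32_t bytes)
{
    /* Size must be a whole number of words that fits the 16-bit count. */
    assert(bytes % 4 == 0 && bytes / 4 > 0 && bytes / 4 <= 0xFFFF);

    regs->sad = src;
    regs->dad = dst;
    regs->cnt = DMA_ENABLE | DMA_32BIT | (bytes / 4);  /* writing this starts it */
}
```

On the device the call would look something like dma3_copy32((Dma3Regs *)0x040000D4, (uint32_t)tiles, 0x06010000, size), where tiles and size are placeholders for your own pre-tiled data and 0x06010000 is the start of sprite tile memory.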

For loading of images, the SGADE images are stored in a linear buffer, and thus cannot be DMA'ed into VRAM directly. The asm is used to convert the linear buffer into 8x8 format while copying it into VRAM.

it all makes sense now! thanks! ;)