gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

Coding > Loading a linear framebuffer into Vram

#7715 - wizardgsz - Mon Jun 23, 2003 2:18 pm

Does anyone have a fast routine to load a linear framebuffer into OBJ/BG VRAM?
This naive C implementation takes far too many cycles :(

Hope to hear from you soon!

Code:

void CopyFromLinearBuffer( u16* a_Dest, u16* a_Source, u32 a_WidthInBlocks, u32 a_HeightInBlocks )
{
   // Dummy counters;
   u32 x, y;

   // Source data pointers to the 8x8 block
   u16 *s0, *s1, *s2, *s3, *s4, *s5, *s6, *s7;

   // Number of 16-bit halfwords per image row: 8 pixels per block at 8bpp = 4 halfwords per block
   u32 widthInHalfwords = ( a_WidthInBlocks * 8 ) / 2;

   // Let's start from the first upper-left block
   for ( y = 0; y < a_HeightInBlocks; y++ )
   {
      s0 = &a_Source[ widthInHalfwords * y * 8 ];
      s1 = s0 + widthInHalfwords;
      s2 = s1 + widthInHalfwords;
      s3 = s2 + widthInHalfwords;
      s4 = s3 + widthInHalfwords;
      s5 = s4 + widthInHalfwords;
      s6 = s5 + widthInHalfwords;
      s7 = s6 + widthInHalfwords;

      // Copy every 8x8 block in this row of blocks, one block per iteration
      for ( x = 0; x < a_WidthInBlocks; x++ )
      {
         *a_Dest++ = *s0++;
         *a_Dest++ = *s0++;
         *a_Dest++ = *s0++;
         *a_Dest++ = *s0++;

         *a_Dest++ = *s1++;
         *a_Dest++ = *s1++;
         *a_Dest++ = *s1++;
         *a_Dest++ = *s1++;

         *a_Dest++ = *s2++;
         *a_Dest++ = *s2++;
         *a_Dest++ = *s2++;
         *a_Dest++ = *s2++;

         *a_Dest++ = *s3++;
         *a_Dest++ = *s3++;
         *a_Dest++ = *s3++;
         *a_Dest++ = *s3++;

         *a_Dest++ = *s4++;
         *a_Dest++ = *s4++;
         *a_Dest++ = *s4++;
         *a_Dest++ = *s4++;

         *a_Dest++ = *s5++;
         *a_Dest++ = *s5++;
         *a_Dest++ = *s5++;
         *a_Dest++ = *s5++;

         *a_Dest++ = *s6++;
         *a_Dest++ = *s6++;
         *a_Dest++ = *s6++;
         *a_Dest++ = *s6++;

         *a_Dest++ = *s7++;
         *a_Dest++ = *s7++;
         *a_Dest++ = *s7++;
         *a_Dest++ = *s7++;
      }
   }
}

_________________
http://www.geocities.com/gabriele_scibilia/

#7716 - niltsair - Mon Jun 23, 2003 2:23 pm

Use DMA. There are plenty of examples of it on this site; just search for it.

#7717 - wizardgsz - Mon Jun 23, 2003 2:27 pm

DMA or a cpuCopy SWI call for just 8 bytes?
I thought setting up a DMA copy for that was expensive.

Many thanks
Ga

#7718 - niltsair - Mon Jun 23, 2003 2:32 pm

There are at least 8x8 bytes to copy for a tile.

#7721 - wizardgsz - Mon Jun 23, 2003 3:45 pm

niltsair wrote:
There are at least 8x8 bytes to copy for a tile.


Yeah! But since I'm copying from a linear framebuffer, there are 8 non-consecutive rows (8 bytes per row) to consider. Am I wrong? :-?
Setting up 8 DMA copies could be slower than a simple 4-halfword copy repeated 8 times (once per row), couldn't it?

#7722 - niltsair - Mon Jun 23, 2003 3:53 pm

They all seem to be consecutive.

#7723 - wizardgsz - Mon Jun 23, 2003 3:58 pm

niltsair wrote:
They all seem to be consecutive.


Ehm, the routine above does work...
It copies 4 halfwords from each of 8 different rows of a linear framebuffer into one consecutive 64-byte tile/block in VRAM.
The source isn't consecutive memory.

I'm certainly missing something, excuse me ;)

Ga

#7731 - tepples - Mon Jun 23, 2003 6:15 pm

OP is trying to copy from a linear framebuffer to a tile-based framebuffer. Such a copy does not take place in "consecutive memory".
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
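
For readers of this archive, the difference between the two layouts can be sketched in C. This is not from the original posts: tiled_offset and linear_offset are hypothetical helper names, and the sketch assumes an 8bpp image whose tiles are stored back to back (64 bytes each), which is what the routine in the first post writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): byte offset of pixel (x, y)
   in a plain linear 8bpp buffer that is width_px pixels wide. */
static size_t linear_offset(size_t x, size_t y, size_t width_px)
{
    assert(x < width_px);
    return y * width_px + x;
}

/* Hypothetical helper: byte offset of the same pixel in a tiled buffer.
   Assumes 8x8 tiles of 64 bytes stored one after another, left to right,
   then top to bottom, and a width that is a multiple of 8. */
static size_t tiled_offset(size_t x, size_t y, size_t width_px)
{
    assert(width_px % 8 == 0 && x < width_px);
    size_t tiles_per_row = width_px / 8;
    size_t tile_index = (y / 8) * tiles_per_row + (x / 8);
    return tile_index * 64 + (y % 8) * 8 + (x % 8);
}
```

Under these assumptions, moving one pixel row down inside a tile advances the tiled offset by 8 bytes but the linear offset by the full image width, so the 8 rows of one tile are never contiguous in the source.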

#7752 - Paul Shirley - Tue Jun 24, 2003 12:06 am

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 10:04 pm; edited 1 time in total

#7759 - wizardgsz - Tue Jun 24, 2003 7:20 am

Quote:

1: use u32's instead of u16's in the copy, less instructions -> faster code


Can I use 32-bit accesses to VRAM?

Quote:

This is a classic piece of code that will be dramatically easier to write well in assembler.


Well, my asm skills are about zero :)
Many thanks for your valuable suggestions.
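
For readers of this archive: GBA VRAM accepts 16-bit and 32-bit writes (it is 8-bit writes that misbehave), so the u32 suggestion quoted above can be sketched roughly as follows. The function name copy_tiles_u32 and its parameter names are made up for this sketch, and it assumes an 8bpp source with word-aligned rows; it is not the code the quoted poster had in mind.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: copy a linear 8bpp image into back-to-back 64-byte tiles
   using 32-bit loads and stores. All names here are hypothetical. */
static void copy_tiles_u32(uint32_t *dst, const uint32_t *src,
                           uint32_t width_in_blocks, uint32_t height_in_blocks)
{
    assert(dst && src);
    /* One tile row is 8 bytes = 2 words, so an image row is 2 words per block. */
    uint32_t words_per_row = width_in_blocks * 2;

    for (uint32_t by = 0; by < height_in_blocks; by++) {
        for (uint32_t bx = 0; bx < width_in_blocks; bx++) {
            /* First word of the top row of this block in the linear image. */
            const uint32_t *row = src + by * 8 * words_per_row + bx * 2;
            for (uint32_t r = 0; r < 8; r++) {
                *dst++ = row[0];      /* left 4 pixels of this tile row  */
                *dst++ = row[1];      /* right 4 pixels of this tile row */
                row += words_per_row; /* same columns, next image row    */
            }
        }
    }
}
```

Compared with the u16 version, each tile takes 16 stores instead of 32; whether a compiler turns that into proportionally faster code on the target is something to measure rather than assume.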

#7761 - Paul Shirley - Tue Jun 24, 2003 8:59 am

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 10:04 pm; edited 1 time in total

#7957 - wizardgsz - Sun Jun 29, 2003 1:17 pm

I'm glad to read your tips.
I received a fast ldmia/stmia implementation that works 8 bytes at a time (8 ldmia/stmia), and I edited it a little to work 16 bytes at a time (four 32-bit registers, 4 loads and stores). I'll try to figure out how to use 8 registers and do it with 2 sets of loads and stores as suggested. Thank you again!

#7972 - wizardgsz - Sun Jun 29, 2003 6:32 pm

Excuse me again; I hope I'm working in the right direction.

I set up a 64-byte block copy (rows 0-3 first, then rows 4-7). Is this the way you suggested?
I measured about 14% of CPU time for a full-screen 240x160 image.
The code below is only illustrative; repeat it twice for a full 8x8 block:

Code:

@ source:   r14 = the start of the linear buffer
@ dest:     r1  = the start of the tileset in vram
@ width:    r2  = the width of the buffer in pixels

ldmia   r14,{r5, r6}        @ load 8 bytes from iwram
add     r14,r14, r2         @ increase the source pointer with "width" bytes
ldmia   r14,{r7, r8}        @ load 8 bytes from iwram
add     r14,r14, r2         @ increase the source pointer with "width" bytes
ldmia   r14,{r9, r10}       @ load 8 bytes from iwram
add     r14,r14, r2         @ increase the source pointer with "width" bytes
ldmia   r14,{r11,r12}       @ load 8 bytes from iwram
stmia   r1!,{r5 -r12}       @ store 32 bytes in vram, and writeback the pointer
add     r14,r14, r2         @ increase the source pointer with "width" bytes


I have to thank all the readers!

#7975 - Paul Shirley - Sun Jun 29, 2003 7:15 pm

removed

Last edited by Paul Shirley on Sun Mar 28, 2004 10:04 pm; edited 1 time in total

#8247 - wizardgsz - Sat Jul 05, 2003 10:40 am

Quote:

Getting there, you can improve it substantially by allocating 4 source pointers and using writeback mode in the ldmia. Remember: you've got a lot of registers to play with so use them.


Yeahh!! That could be a real speed-up; I could not have imagined how much asm can speed things up.

Today I'm trying to modify my routine using your new hint... without success so far :(

First, I haven't got 3 registers left, and I'm running out of registers for the loop counters (across width and height).
I don't know asm well (I know almost nothing about it), sorry.
Anyhow, I'm trying to write just one "row" of tiles, but I get weird results; the source pointers point somewhere other than where they should.
The loop itself works (I verified it by assigning made-up 32-bit values myself instead of using the ldmia instructions, just some ldr xx, =((150)+(150<<8)+(150<<16)+(150<<24))), but with ldmia it loads (and then stores) garbage.

Assuming a prototype like this:

Code:

void Copy( u32* source, u32* dest, u32 width, u32 height );
@ source:   r0 = the start of the linear buffer in iwram
@ dest:     r1 = the start of the sprite in vram
@ width:    r2 = the width of the iwram buffer in pixels
@ height:   r3 = the height of the iwram buffer in pixels


I'm trying this:

Code:

        @ r11 source pointer
        @ r12 destination pointer

            mov     r11,r0
            mov     r12,r1

        @ r14 indicates how many pixels to copy per line
        @ r3 indicates how many lines of pixels to copy

            mov     r14,r2


        CopyNextRow:

        @ r8,r9,r10,r11 source row pointers for row,row+1,row+2,row+3

            mov     r8, r11
            add     r9, r8, r14
            add     r10,r9, r14
            add     r11,r10,r14

        Copy8x8Block:

        @ Copy a 8*8 block from the linear buffer to vram
        @ r0-r7 working store for half a char

            ldmia   r8!, {r0-r1}        @ load 8 bytes from iwram, and writeback the pointer
            ldmia   r9!, {r2-r3}        @ load 8 bytes from iwram, and writeback the pointer
            ldmia   r10!,{r4-r5}        @ load 8 bytes from iwram, and writeback the pointer
            ldmia   r11!,{r6-r7}        @ load 8 bytes from iwram, and writeback the pointer
            stmia   r12!,{r0-r7}        @ store 32 bytes in  vram, and writeback the pointer

            ldmia   r8!, {r0-r1}        @ repeat this 2 times for a full 8*8 block
            ldmia   r9!, {r2-r3}
            ldmia   r10!,{r4-r5}
            ldmia   r11!,{r6-r7}
            stmia   r12!,{r0-r7}

        @ I need r14 from "row" to "row" so don't trash it!
        @ I run out of registers

            subs    r14, r14, #8        @ Subtract 8 from the total number of pixels to copy
            bne     Copy8x8Block        @ Zero left?? then go to next row, else copy the next block


@            subs    ??, ??, #8          @ Decrease the number of rows with 8
@            beq     CopyEnd             @ Zero left?? then branch to end

@            b       CopyNextRow         @ Start the next row

        @ we're finished

        CopyEnd:
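
For readers of this archive, a note that is not part of the original posts. In the loop as posted, the second group of ldmia instructions reuses r8-r11 after writeback, so it appears to read the next 8 columns of rows 0-3 rather than rows 4-7 of the same tile. The loads into r0-r3 also overwrite the incoming arguments, and r14 is counted down to zero even though it holds the row stride. The C model below is a sketch of the access pattern the tile layout seems to need, assuming an 8bpp image whose width and height are multiples of 8; copy_four_pointers, p0-p3 and stride are illustrative names, not the author's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of a four-row-pointer copy (p0-p3 play the role of
   r8-r11). src is a linear 8bpp image, dst receives 64-byte tiles. */
static void copy_four_pointers(const uint32_t *src, uint32_t *dst,
                               uint32_t width_px, uint32_t height_px)
{
    assert(width_px % 8 == 0 && height_px % 8 == 0);
    uint32_t stride = width_px / 4;              /* words per image row */

    for (uint32_t y = 0; y < height_px; y += 8) {
        const uint32_t *p0 = src + y * stride;   /* rows y .. y+3 */
        const uint32_t *p1 = p0 + stride;
        const uint32_t *p2 = p1 + stride;
        const uint32_t *p3 = p2 + stride;

        for (uint32_t x = 0; x < width_px; x += 8) {
            /* Top half of the tile: rows 0-3, two words from each pointer. */
            dst[0] = p0[0];  dst[1] = p0[1];
            dst[2] = p1[0];  dst[3] = p1[1];
            dst[4] = p2[0];  dst[5] = p2[1];
            dst[6] = p3[0];  dst[7] = p3[1];
            /* Bottom half: same columns, four image rows further down. */
            dst[8]  = p0[4 * stride];  dst[9]  = p0[4 * stride + 1];
            dst[10] = p1[4 * stride];  dst[11] = p1[4 * stride + 1];
            dst[12] = p2[4 * stride];  dst[13] = p2[4 * stride + 1];
            dst[14] = p3[4 * stride];  dst[15] = p3[4 * stride + 1];

            dst += 16;                           /* one 64-byte tile written */
            p0 += 2; p1 += 2; p2 += 2; p3 += 2;  /* next 8 pixel columns */
        }
    }
}
```

In assembler the same idea would mean either offsetting the four pointers by 4 rows minus 8 bytes between the two halves, or keeping the width and the loop counters somewhere other than the registers used as load targets.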
