#443 - Vortex - Tue Jan 07, 2003 7:35 pm
Hello !
All of the tutorials recommend using the following code to display an image in mode 3:
Code: |
void ShowImage( u8* image )
{
u16* img = (u16*) image;
for( u16 y = 0; y < ScreenHeight; y ++ )
for( u16 x = 0; x < ScreenWidth; x ++ )
VideoBuffer[ x + y * ScreenWidth ] = *img++;
}
|
I am wondering whether there are any benefits to using memcpy() instead:
Code: |
void ShowImage( u8* image )
{
memcpy( VideoBuffer, image, ScreenWidth * ScreenHeight * 2 ); // memcpy works with bytes
}
|
The same question applies for using memset() in ClearScreen.
Thanks
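For the ClearScreen half of the question, a minimal sketch might look like the following. It is only an illustration: `FakeVram`, `ClearScreenBlack` and the `colour` parameter are made-up names, and the buffer is an ordinary array so the code also builds off-target (on hardware, VideoBuffer would point at 0x06000000). The one thing it is meant to show is that memset() fills bytes, so it can only produce colours whose two bytes are identical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;
static_assert(sizeof(u16) == 2, "mode 3 pixels are 16 bits wide");

enum { ScreenWidth = 240, ScreenHeight = 160 };

/* Hypothetical stand-in for VRAM so the sketch builds on a PC as well. */
static u16 FakeVram[ScreenWidth * ScreenHeight];
static u16 *VideoBuffer = FakeVram;

/* memset() writes the same byte everywhere, so both bytes of each pixel
   end up equal. That is fine for black (0x0000) but cannot produce,
   for example, pure red (0x001F). */
void ClearScreenBlack(void)
{
    memset(VideoBuffer, 0, ScreenWidth * ScreenHeight * sizeof(u16));
}

/* Clearing to an arbitrary 15-bit colour needs a halfword (or wider) fill. */
void ClearScreen(u16 colour)
{
    u16 *dest = VideoBuffer;
    int i = ScreenWidth * ScreenHeight;
    do {
        *dest++ = colour;
    } while (--i > 0);
}
```

Whether the memset() version is actually faster depends on how the C library implements it, so it is worth measuring on the target.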
#445 - Touchstone - Tue Jan 07, 2003 7:45 pm
Definitely a speed increase if you use memcpy, because you'll get rid of the horrible [x + y * ScreenWidth] thing. You'll get some increase from having just one loop instead of two as well. But what you really want to do is DMA copy the whole thing instead of doing it with the CPU. DMA is synchronous with the CPU, so don't plan to use the CPU while copying, but at least you will get sequential accesses instead of a lot of random accesses. I think. :)
_________________
You can't beat our meat
#505 - jaymzjulian - Wed Jan 08, 2003 5:36 am
but you should be aware that memcpy() is still *horrendously slow* on the GBA - better to use DMA or an asm function with a list of LDM/STM pairs
- jj
#697 - Nessie - Thu Jan 09, 2003 8:47 pm
There is another way it could be coded that would offer a performance boost over your code...but probably wouldn't be quite as fast as a DMA or optimized memcpy.
Code: |
void ShowImage( u8* image )
{
u16* img = (u16*) image;
u16* dest = VideoBuffer;
for ( u16 i = 0; i < ScreenHeight * ScreenWidth; i++ )
{
*dest++ = *img++;
}
}
|
...anyway, hopefully you get the idea even if the code isn't quite right.
#698 - Lord Graga - Thu Jan 09, 2003 8:55 pm
Even more optimized it is:
Code: |
void ShowImage( u8* image )
{
u16* img = (u16*) image;
for ( u16 i = 0; i < 38400; i++ ) VideoBuffer = *img++;
}
|
#702 - Nessie - Thu Jan 09, 2003 9:12 pm
I think I know what you're getting at, but you still have to increment your videoBuffer pointer.
Also, any decent compiler will automatically compute the result of any constant expression, e.g.:
Code: |
if ( dumbVariable > (4 * 2 * 3 * 2 + 5 * 2 * 4 * 2))
{
}
|
...should be compiled as:
Code: |
if ( dumbVariable > 128 ) //(8 * 6 + 10 * 8 == 48 + 80 == 128 )
{
}
|
...I usually leave arithmetic with constants intact since it makes the code easier to read and maintain, and it in no way affects performance unless your compiler is crap.
You can verify how your compiler handles this by checking out the disassembly.
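A quick way to see that the compiler evaluates such expressions itself, without reading the disassembly, is a compile-time assertion. This is only a sketch assuming a C11 compiler; `exceeds_threshold` is a made-up helper name. It shows that the expression is a compile-time constant, not which instructions end up in the binary, so the disassembly is still the final word.

```c
#include <assert.h>

/* static_assert needs an integer constant expression, so these lines only
   compile if the compiler can work out the arithmetic at build time. */
static_assert((4 * 2 * 3 * 2 + 5 * 2 * 4 * 2) == 128,
              "48 + 80 is evaluated by the compiler");

enum { ScreenWidth = 240, ScreenHeight = 160 };
static_assert(ScreenWidth * ScreenHeight == 38400, "mode 3 pixel count");

/* Hypothetical helper using the same comparison as in the post above. */
int exceeds_threshold(int dumbVariable)
{
    return dumbVariable > (4 * 2 * 3 * 2 + 5 * 2 * 4 * 2);
}
```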
#704 - Touchstone - Thu Jan 09, 2003 9:27 pm
Hey Graga, your code is broken. VideoBuffer is a pointer.
I don't think you'll get significantly faster code than Nessie's without unrolling the loop. You could of course do it like this: Code: |
void ShowImage(u8* _pImage)
{
u16* pSrc = (u16*)_pImage;
u16* pDst = VideoBuffer;
int i = ScreenWidth * ScreenHeight;
do
{
*pDst++ = *pSrc++;
} while( --i > 0);
} |
This code doesn't need as many registers and variables in the asm code as Nessie's, since the expression while( --i > 0 ); can compile to this asm code: Code: |
subs r0, r0, #1 // r0 = i
bne loopStart |
whereas Nessie's for statement typically compiles to: Code: |
add r0, r0, #1 // r0 = i
cmp r0, r1 // r1 = ScreenWidth*ScreenHeight
blt loopStart |
#721 - Nessie - Thu Jan 09, 2003 11:41 pm
Ah, the fine art of optimizations.... thanks for the tip Touchstone..
#723 - Touchstone - Thu Jan 09, 2003 11:52 pm
Happy I could be of some advice for anyone.
#740 - abstim - Fri Jan 10, 2003 2:50 am
most good compilers will make that optimization too.
#803 - Splam - Fri Jan 10, 2003 7:28 pm
If you're going to do it by hand instead of using DMA, then at least code it in asm in the first place. Don't presume what the compiler will convert your C to; you'd be surprised if you looked at even the simplest function and the crap compilers can make of it. After converting the most optimised C routine I could possibly make (for converting a Commodore 64 screen to a GBA screen) to asm, I got a ~600% speed increase.
As jaymzjulian said, DMA or ldm/stm (using as many free registers as you have) would be best. Use u32 as well, which saves on loop iterations and therefore instruction fetches; an unrolled loop would be even faster.
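For anyone staying in C, the u32 and unrolling advice could be sketched roughly as below. The names `CopyWords32`, `ShowImage32`, `FakeVram` and `FakeImage` are invented for illustration, the buffers are plain arrays so the code builds off-target, and it assumes 4-byte-aligned pointers and a word count that is a non-zero multiple of 4. How close the compiler gets to hand-written ldm/stm is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

enum { ScreenWidth = 240, ScreenHeight = 160 };

/* Copies `words` 32-bit words, four per loop iteration.
   Assumes aligned pointers and a non-zero multiple of 4 for `words`. */
void CopyWords32(u32 *dest, const u32 *src, unsigned words)
{
    assert(words != 0 && words % 4 == 0);
    do {
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
        dest[3] = src[3];
        dest += 4;
        src += 4;
        words -= 4;
    } while (words != 0);
}

/* Hypothetical off-target buffers standing in for VRAM and the image. */
static u32 FakeVram[ScreenWidth * ScreenHeight / 2];
static u32 FakeImage[ScreenWidth * ScreenHeight / 2];

void ShowImage32(void)
{
    /* 240 * 160 pixels * 2 bytes = 76800 bytes = 19200 words */
    CopyWords32(FakeVram, FakeImage, ScreenWidth * ScreenHeight / 2);
}
```

Each iteration moves 16 bytes with one loop test, compared with 2 bytes per test in the u16 versions earlier in the thread.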
#815 - Nessie - Fri Jan 10, 2003 8:34 pm
Actually, I think everyone agreed that DMA or ASM would be the best solution.
But for people who maybe aren't very far along in learning GBA programming, knowing some simple speed-up tricks using standard C is probably more than adequate. Also, this question wasn't even asked in the ASM section, so perhaps people thought it more appropriate to explore faster C versions of the original code.
#823 - Vortex - Fri Jan 10, 2003 10:34 pm
Thank you for the answers. One more question: is it possible to use the suggested DMA approach in Mode 4? Does the 16-bit-per-read/write limitation of video memory apply to DMA access too?
I guess the mode 4 DMA code should look like this:
Code: |
REG_DMA3SAD = PaletteAndImageData;
REG_DMA3DAD = 0x6000000;
REG_DMA3CNT = ScreenWidth * ScreenHeight / 2;
|
Thanks again
#857 - Splam - Sat Jan 11, 2003 10:15 am
There isn't a 16bit read/write limit on vram, you can write 32bit if you want.
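A rough sketch of a 32-bit Mode 4 DMA copy follows. Treat it as an assumption-laden illustration: the register addresses and control bits are the commonly documented DMA3 values, while `GBA_TARGET`, `Mode4DmaControlWord` and `ShowImageMode4` are made-up names. Note that the count register also needs the enable bit set before anything is transferred, and that the Mode 4 palette lives in palette RAM at 0x05000000, so it would need its own copy rather than going to VRAM with the pixels.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

enum { ScreenWidth = 240, ScreenHeight = 160 };

/* DMA3 control bits, upper half of the 32-bit count/control register. */
#define DMA_ENABLE 0x80000000u /* bit 31: start the transfer */
#define DMA_32BIT  0x04000000u /* bit 26: move 32-bit units, not 16-bit */

/* Mode 4 uses one byte per pixel, so a frame is 240 * 160 = 38400 bytes,
   which is 9600 32-bit units. */
u32 Mode4DmaControlWord(void)
{
    u32 units = (ScreenWidth * ScreenHeight) / 4;
    return DMA_ENABLE | DMA_32BIT | units;
}

#ifdef GBA_TARGET /* hypothetical macro: define only when building for hardware */
#define REG_DMA3SAD (*(volatile u32 *)0x040000D4)
#define REG_DMA3DAD (*(volatile u32 *)0x040000D8)
#define REG_DMA3CNT (*(volatile u32 *)0x040000DC)

void ShowImageMode4(const void *image)
{
    REG_DMA3SAD = (u32)(uintptr_t)image;
    REG_DMA3DAD = 0x06000000;            /* start of VRAM */
    REG_DMA3CNT = Mode4DmaControlWord(); /* setting the enable bit starts the copy */
}
#endif
```

With 16-bit transfers the unit count would be ScreenWidth * ScreenHeight / 2 instead, and DMA_32BIT would be left out.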
#869 - Costis - Sat Jan 11, 2003 5:36 pm
Hi,
Yes, VRAM can be written in 16/32-bit bus widths. However, you shouldn't use memcpy or memset to copy data or clear VRAM as it uses an 8-bit copy width. Copying over 32 bits at a time would be optimal. Actually, you could just have a fast ldmia/stmia loop if you don't want to use DMA. Here's an ARM code example of an ldmia/stmia loop implementation:
Code: |
CopyImageMode3
; r0 holds the address of the image data to be copied
mov r1, #0x6000000 ; VRAM base area in Mode 3
stmfd sp!, {r4-r12, lr}
mov r2, #160 ; Y line counter
ImageCopyLoop
; Copy one scanline per loop
ldmia r0!, {r3-r12} ; Load 40 bytes into r3 through r12 (10 registers)
stmia r1!, {r3-r12} ; Store 40 bytes to VRAM; 12 pairs = 480 bytes = one scanline
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
ldmia r0!, {r3-r12}
stmia r1!, {r3-r12}
subs r2, r2, #1
bne ImageCopyLoop
ldmfd sp!, {r4-r12, pc}
|
I haven't tested this exact code out yet, but I've used that trick many times to clear small buffers, etc. I believe that someone discovered some time ago that DMA is still faster than using this method. Also, the above code can be rolled/unrolled as much as you would like to suit your needs.
Costis
#870 - Splam - Sat Jan 11, 2003 6:12 pm
Yep, if I had to do it in code then that's the way I'd do it, and yes, DMA is still faster. It doesn't matter how much you unroll the loop, because once DMA is set up there is no overhead (instruction fetches etc.).