#21398 - isildur - Fri May 28, 2004 7:36 pm
I am writing this 3d/Vector engine using mode 4. And, as most of us coding 3d for the GBA, I'm always trying to make the fastest code possible (at least as fast as I can come up with). Since I am using lots of pixel plotting and line drawing, I have to do a lot of single-pixel reads and writes, and because we can't access VRAM one byte at a time, I always have to load 2 bytes, mask one of them to put in a new pixel, and finally write the 2 bytes back. So I was wondering something.
Would it be faster to use a 240 x 160 buffer in EWRAM and do all the pixel and line drawing in there, then copy that buffer to vram in one chunk at the end of each frame? If I am not wrong, EWRAM is accessible one byte at a time?
Did someone here ever try that approach? What is your opinion?
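To make the comparison in the question concrete, here is a minimal C sketch of the two plotting styles. It is only an illustration: the function names and the plain arrays standing in for VRAM and for an EWRAM buffer are assumptions, not code from this thread.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 240
#define SCREEN_H 160

/* Plot into a surface that only accepts 16-bit writes (as mode 4 VRAM does):
   read 16 bits, replace one byte, write 16 bits back. */
void plot_halfword(uint16_t *surface, unsigned x, unsigned y, uint8_t colour)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    unsigned index = (y * SCREEN_W + x) >> 1;   /* two pixels per halfword */
    uint16_t pair = surface[index];
    if (x & 1u)
        pair = (uint16_t)((pair & 0x00FFu) | ((uint16_t)colour << 8)); /* odd x: high byte */
    else
        pair = (uint16_t)((pair & 0xFF00u) | colour);                  /* even x: low byte */
    surface[index] = pair;
}

/* Plot into a byte-addressable buffer (what an EWRAM copy would allow). */
void plot_byte(uint8_t *buffer, unsigned x, unsigned y, uint8_t colour)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    buffer[y * SCREEN_W + x] = colour;
}
```

The byte version is a single store, while the halfword version needs a load, a mask and a store. Whether that saving outweighs copying the whole buffer to VRAM every frame is the question the rest of the thread tries to answer.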
#21402 - dagamer34 - Fri May 28, 2004 8:14 pm
Possible, yes. But you have to realize that you will be spending a good number of cycles waiting for the DMA to finish copying the buffer from EWRAM to VRAM. Your best bet is to switch to either mode 4 or 5, depending on your needs.
_________________
Little kids and Playstation 2's don't mix. :(
#21404 - isildur - Fri May 28, 2004 8:36 pm
dagamer34 wrote: |
Your best bet is to switch to either mode 4 or 5, depending on your needs. |
... I am in mode 4... :)
Also, I don't absolutely have to use DMA... stmia could be used... but I will test it. It's a matter of how many cycles I would save or lose in the end. Because I would probably gain lots of cycles if I have direct access to pixel bytes. And all my drawing uses lines or single pixels, no sprites or bitmaps.
#21406 - Lord Graga - Fri May 28, 2004 9:18 pm
dagamer34 wrote: |
Possible, yes. But you have to realize that you will be spending a good number of cycles waiting for the DMA to finish copying the buffer from EWRAM to VRAM. Your best bet is to switch to either mode 4 or 5, depending on your needs. |
I suggest the BIOS function CPUFastSet. Read it up.
#21408 - sajiimori - Fri May 28, 2004 9:48 pm
Why would the CPU be faster than DMA when copying from EWRAM to VRAM?
At first I thought an external buffer would be faster if you were plotting enough pixels to make up for the extraneous copy, but now I'm thinking about how much slower EWRAM is than VRAM. Plotting pixels into it will be faster, but clearing it will be slower. I'm thinking you'd have to plot more than a screenful of pixels per frame before you make up for the loss.
If you could stand using mode 5, that would definitely make things a lot faster.
#21409 - Lord Graga - Fri May 28, 2004 9:52 pm
sajiimori wrote: |
Why would the CPU be faster than DMA when copying from EWRAM to VRAM?
At first I thought an external buffer would be faster if you were plotting enough pixels to make up for the extraneous copy, but now I'm thinking about how much slower EWRAM is than VRAM. Plotting pixels into it will be faster, but clearing it will be slower. I'm thinking you'd have to plot more than a screenfull of pixels per frame before you make up for the loss.
If you could stand using mode 5, that would definitely make things a lot faster. |
First of all, page flipping is the best method, no doubt about that. But if you plan going to another mode, here's the reason why CPUFastSet is faster than DMA.
First of all, CPUFastSet is one of the fastest memory copy functions you'll ever see for GBA. It's done with a "trick", where you use the stmia instruction to store several words at a time, which speeds up the whole process.
Second, DMA also verifies the data that it is copying.
#21412 - isildur - Fri May 28, 2004 10:01 pm
sajiimori wrote: |
If you could stand using mode 5, that would definitely make things a lot faster. |
Mode 5 is too small...
Right now, I don't have a big performance problem, but I know I will when I add more and more polys...
I will experiment with the stmia and dma and cpufastset. We shall see... ;)
#21415 - sajiimori - Fri May 28, 2004 10:24 pm
Quote: |
First of all, page flipping is the best method, no doubt about that.
|
I'm not sure who you're talking to. We were talking about an external buffer, not the inactive pages of modes 4 and 5.
Quote: |
First of all, CPUFastSet is one of the fastest memory copy functions you'll ever see for GBA. It's done from a "trick", where you use the stmia instruction to load several bytes at the time, and therefore, speed up the whole processs.
Second, DMA also verifies the data that it is copying.
|
How many extra cycles does DMA use to verify each word? Is it enough to overshadow the additional instruction fetches and branches when using the CPU?
#21442 - Lupin - Sat May 29, 2004 8:10 am
Why don't you write 2 pixels at a time? The only thing you would have to do is decrease the accuracy on the polygon edges, but I think it wouldn't really matter because you still have a high resolution.
With clever optimizations you can unroll your loops and write 16 pixels in one instruction (assuming you have 4 registers free in your drawing loop). The 16 bit writes of vram are not a disadvantage, they are really a good advantage when it comes to speed.
Someone on the board (I think it was Derek) already tried to use external buffers and it was too slow...
#21449 - isildur - Sat May 29, 2004 2:17 pm
Lupin wrote: |
With clever optimizations you can unroll your loops and write 16 pixels in one instruction (assuming you have 4 registers free in your drawing loop).
|
That would be good only for horizontal lines of at least 16 pixels. I don't see how it would work with diagonal lines. I already have an optimized horiz line routine to draw filled polys. It writes 32 bits at a time when it can, but I always have to check whether the first and last pixels are aligned, so the advantage is there only for long lines... If I could just write directly from the first pixel to the last without all this checking, it seems it would be faster.
Lupin wrote: |
Someone on the board (I think it was Derek) already tried to use external buffers and it was too slow...
|
Yeah, maybe the speed gain is lost when copying the buffer. But I still have a clear screen function at the beginning of each frame, so maybe there is something to do about that, I would not need to clear the VRAM...
#21459 - tepples - Sat May 29, 2004 4:08 pm
Lord Graga wrote: |
First of all, CPUFastSet is one of the fastest memory copy functions you'll ever see for GBA. It's done from a "trick", where you use the stmia instruction to load several bytes at the time, and therefore, speed up the whole process. |
DMA fills the data bus with data at all times, taking zero time for anything like 'ldmia' or 'stmia' instructions. It pauses only for EWRAM and ROM wait states.
Quote: |
Second, DMA also verifies the data that it is copying. |
No it doesn't. If it did, it wouldn't be able to copy to write-only registers such as the PCM audio FIFOs. It does, however, reread the source address each time.
I know of only two advantages of CPUFastSet over DMA:
- CPUFastSet is interruptible, which is important if you're trying to copy while doing hblank effects or serial communication.
- CPUFastSet doesn't reread the source address, making it good for memset() (as opposed to memcpy()).
isildur wrote: |
Mode 5 is too small |
Not if you rot/scale it. Then it's blocky, but it fills the screen and makes fill rate that much faster.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#21482 - isildur - Sat May 29, 2004 10:04 pm
tepples wrote: |
isildur wrote: | Mode 5 is too small |
Not if you rot/scale it. Then it's blocky, but it fills the screen and makes fill rate that much faster. |
Blocky, that's what I don't want ;-)
#21586 - sasq - Tue Jun 01, 2004 3:31 pm
Drawing lines and individual pixels is a bit more expensive in mode 4, but polygon-filling (as mentioned) only requires handling the odd edge starts and stops; the rest can be written 16 bits at a time (writing 32 bits at a time doesn't give that much speed increase and requires more special handling, so I think 16 bits at a time is faster since VRAM is 16-bit, unless your surfaces are really large).
FYI (and to brag :) my mode 4 gouraud-shader is only 2.5 (or 3 counting the VRAM stall, which I'm not exactly sure when it occurs) cycles per pixel in the inner (unrolled) loop...
#21590 - isildur - Tue Jun 01, 2004 4:20 pm
Ok, I've done some actual hardware testing and here are my results:
My timing was done using frequency 00 so I'm not sure if it translates to actual cpu cycles... Also, I was in mode 4.
Standard method drawing directly to VRAM:
Clearing VRAM using cpufastset timed 0x63BE
*Clearing VRAM using DMA timed 0x8CF8
Plotting 5000 single pixels to VRAM timed 0x213400
Total (using clearscreen with cpufastset): 0x2197BE
---------------------------------------------------------------------
Drawing to EWRAM screen buffer method:
Clearing the EWRAM buffer using cpufastset timed 0xF9B8
Plotting 5000 single pixels to EWRAM timed 0x1360F
Copying the EWRAM buffer to VRAM using cpufastset timed 0x602B
Total: 0x28FF2
So the buffered method would be 13 times faster!
Of course I will have to test again with a complete implementation of all my drawing routines. But these results are quite interesting, enough to convince me to go that way.
Edit: By the way, the pixel plotting was done with a call to my plotpixel routine for each pixel, so the timings would be a lot lower if it had been done in a single loop. The pixel position in VRAM or in the buffer was computed for each pixel.
To sasq: I will take your advice about writing lines in 16 bits chunks instead of 32 bits, it makes sense, thanks.
#21595 - Miked0801 - Tue Jun 01, 2004 6:22 pm
Actually, your results tell me that your plot pixel routine is painfully slow and needs to be optimized. Yes, it is easier to work with a double buffer, but I know that you could get direct plotting to be faster than updating the whole screen. Could you post your plot pixel routine? There are a bunch of previous threads with good ASM plot routines that might do it better.
#21600 - isildur - Tue Jun 01, 2004 7:13 pm
No problem, if you have a better mode 4 plotpixel, I'd be glad to see it :)
Code: |
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@
@ void PlotPixelA(u32 x, u32 y) // color is the current color
@
PlotPixelA:
ldr r3, =gba_bank @ Get address of bank.
ldr r2, [r3] @ Load bank.
ldr r3, [r3, r2, lsl #2] @ Load bank address.
add r0, r0, r1, lsl #8 @ calculate pixel pos
sub r0, r0, r1, lsl #4
add r0, r0, r3
ldr r2, =gba_colour @ point at colour
ldr r2, [r2] @ load colour word
tst r0, #0x1 @ halfword aligned?
beq PlotPixelAaligned
@ not aligned
ldrh r1, [r0, #-1]! @ grab short holding byte we're writing
bic r1, r1, #0xff00 @ blank pixel we're writing
orr r1, r1, r2, lsl #8 @ insert pixel we're writing
strh r1, [r0], #2 @ write & leave aligned to next halfword
bx r14
PlotPixelAaligned:
@ aligned
ldrh r1, [r0] @ grab halfword
bic r1, r1, #0x00ff @ clear left pixel
orr r1, r1, r2, lsr #24 @ insert left pixel
strh r1, [r0] @ plot
bx r14
.pool
|
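The `add ... lsl #8` / `sub ... lsl #4` pair in the routine above computes `y * 240` without a multiply, since 240 = 256 - 16. A small C sketch of the same address calculation follows; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of pixel (x, y) in a 240-pixel-wide, 8-bit surface,
   using the same shift trick as the assembly: y*240 == (y << 8) - (y << 4). */
uint32_t mode4_offset(uint32_t x, uint32_t y)
{
    assert(x < 240u && y < 160u);
    return x + (y << 8) - (y << 4);
}
```

For example, `mode4_offset(0, 1)` gives 240, the start of the second scanline.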
#21602 - sajiimori - Tue Jun 01, 2004 7:43 pm
#21604 - isildur - Tue Jun 01, 2004 8:00 pm
I checked it out but it doesn't work well, at least for me. Some pixels are 2 pixels wide... Maybe it is because like jd mentions it, it was not tested...
Also, I get identical performance when I compare it to the one I am using.
Does it work fine for you?
#21608 - jd - Tue Jun 01, 2004 9:02 pm
isildur wrote: |
I checked it out but it doesn't work well, at least for me. Some pixels are 2 pixels wide... Maybe it is because like jd mentions it, it was not tested...
|
Ok, here's one that is tested:
Code: |
@ CODE_IN_IWRAM void DrawPixel( int x, int y, int col )
DrawPixel:
ldr r12,=VideoBuffer
add r12,r12,r0
ands r0,r0,#0x1
add r12,r12,r1,lsl #8
sub r12,r12,r1,lsl #4
ldreqb r1,[r12,#1]
addeq r1,r2,r1,lsl #8
ldrneb r1,[r12,#-1]
addne r1,r1,r2,lsl #8
strh r1,[r12,-r0]
bx lr
|
isildur wrote: |
Also, I get identical performance when I compare it to the one I am using.
|
Two things to check:
1) That you're putting the pixel plotter into IWRAM.
2) That the test loop isn't wasting a lot of CPU time somewhere.
#21611 - isildur - Tue Jun 01, 2004 9:44 pm
jd: I still have some double pixels with this one. I test it with a starfield I have and compare it with my own plotpixel routine. Yours still doesn't seem to mask out neighboring pixels...
And both my routine and yours were in IWRAM and are speed tested in the same conditions.
My routine would gain some speed if I get rid of the bank selection for VRAM. This should probably be done outside only once per frame. After that we both have much the same number of instructions. I have this extra code to select the current color but yours would have to do something similar outside the routine to pass it as param 3. So that is probably why there is not much speed difference between yours and mine.
So, to get back in the topic. It still looks faster to draw to a virtual buffer to avoid the alignment checking and masking.
To Miked0801: I don't think my pixel plotting routine is 'painfully slow' ;-)
#21612 - sajiimori - Tue Jun 01, 2004 10:01 pm
Passing the color via a register is faster than writing to memory, then loading the memory address and dereferencing it. Given that, the bank optimization, and the elimination of the branch, I agree that there is something wrong if you are getting the same performance from jd's plotter.
But if you really want to use an external buffer, use DMA to copy it to VRAM. CPUFastSet is still good for clearing, though.
#21621 - jd - Tue Jun 01, 2004 11:53 pm
isildur wrote: |
jd: I still have some double pixels with this one. I test it with a starfield I have and compare it with my own plotpixel routine. Yours still doesn't seem to mask out neighboring pixels... |
Are you sure you're using it right? Remember that "col" should just be an 8-bit number. What do you mean by it doesn't seem to mask out neighboring pixels?
isildur wrote: |
So, to get back in the topic. It still looks faster to draw to a virtual buffer to avoid the alignment checking and masking. |
I disagree - the figures you're using to come to this conclusion are incorrect. They indicate that writing to VRAM is taking 435 cycles per pixel. The correct figure should be more like 20 cycles per pixel (including calling overhead) - it's pretty easy to calculate this by looking at the code. Therefore, the only conclusion is that there is a flaw in the figures - either the timing is wrong, the test loop is taking up a lot more CPU time than it should or more than 5000 pixels are being drawn.
Basically, writing a single pixel to VRAM should take about 6 cycles longer than writing a single pixel to EWRAM. Unless you're planning to plot a very large number of pixels per frame, it's quicker to work on VRAM directly.
#21625 - Miked0801 - Wed Jun 02, 2004 12:48 am
Agreed - when you divide to get the average per plot, the VRAM version comes out at roughly 435 cycles per pixel whilst the EWRAM version comes out at roughly 16 cycles per pixel, which makes the VRAM version nearly 30 times slower. This is why I said there must be something way the hell wrong with your pixel plotter. 435 cycles is about what a naive C plotter in a debug build would run at. As you say you are using a nice IWRAM asm version, this just cannot be. Either your timer was being interrupted by VBlank/HBlank processing, you weren't profiling the right code, or I don't know what else. I suggest re-profiling or perhaps posting the .bin for someone else to profile.
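The per-pixel figures quoted here can be reproduced from the posted totals. The helper below is a hypothetical sketch that divides a timer total by the pixel count and rounds to the nearest tick.

```c
#include <assert.h>
#include <stdint.h>

/* Average timer ticks per plotted pixel, rounded to the nearest integer. */
uint32_t ticks_per_pixel(uint32_t total_ticks, uint32_t pixels)
{
    assert(pixels > 0);
    return (total_ticks + pixels / 2) / pixels;
}
```

With the posted numbers, `ticks_per_pixel(0x213400, 5000)` gives 435 for VRAM and `ticks_per_pixel(0x1360F, 5000)` gives 16 for EWRAM, a ratio of about 27.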
#21626 - Miked0801 - Wed Jun 02, 2004 12:53 am
BTW, with 2 wait states to access EWRAM vs 0 wait state VRAM, I don't see why, even with the read/mask, the plotter would be faster to EWRAM at all. The only thing I can think of is you accidentally loaded the pixel plotter into EWRAM instead of IWRAM, where 2 wait states per access with 2 accesses per instruction would absolutely destroy performance. Also, how are you determining where to plot pixels? Is this function stable in execution time or could it be the culprit?
#21628 - isildur - Wed Jun 02, 2004 1:15 am
Ok, let's make something clear now: I never said my timings were cpu cycles. I said I was using the timer frequency 00. I use this a bit like a GetTickCount() in win programming. Just before entering the test loop, I start the timer and then just out of the test loop, I get how many ticks have passed. Don't worry, my timer overflows to the next one so I add the overflow too :)
I use the same loop for every test to give me an idea of the speed of a routine. So it is NOT cpu cycles. I know it would have been better to use real cpu cycles but, to me, the results are the same. One method takes more time than the other.
jd wrote: |
What do you mean by it doesn't seem to mask out neighboring pixels?
|
Well, at some screen positions, it plots 2 pixels side by side instead of just one. It does this at certain x positions regardless of the y position. So if it does this at 4,0 it will have this problem for every pixel plotted at that x position, so I guess it's because it doesn't mask out one of the bytes: it writes both bytes with the color value instead of just the pixel supposed to be plotted.
Miked0801 wrote: |
The only thing I can think of is you accidentally loaded the pixel plotter into EWRAM instead of IWRAM |
It is running in IWRAM, don't worry, I checked ;-)
So, I don't know what to believe, my timing tests or facts about wait states and so on...
#21630 - jd - Wed Jun 02, 2004 1:54 am
isildur wrote: |
Ok, lets put something clear now, I never said my timings were cpu cycles. I said I was using the timer frequency 00.
|
Timer frequency 00 is 16.7MHz - i.e. CPU cycles. But it wouldn't matter even if it weren't - a pixel plot to VRAM definitely does not take 27 times longer than a pixel plot to EWRAM. One extra memory write and a few conditional instructions simply cannot consume hundreds of cycles - it's clearly impossible. Since we've ruled out a serious flaw in the pixel plot routine itself, the problem must be with the test code or its timing.
isildur wrote: |
Well, at some screen positions, it plots 2 pixels side by side instead of just one. It does this at certain x positions regardless of the y position. So if it does this at 4,0 it will do this problem for every pixel plotted at that x position, so I guess it's because it doesn't mask out one of the bytes, It writes both bytes with the color value instead of just the pixel supposed to be plotted. |
Are you sure you're using it right? Remember that the "col" parameter should only use the lower 8-bits - the rest of the parameter must be filled with zeros. This is different from the way your pixel plot routine seems to work (it bit-shifts down gba_colour by 24 bits if the pixel is even, which is highly unusual). Try adding "and r2,r2,#0xff" at the beginning to see if that fixes the problem.
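A C model of the quoted DrawPixel may help show why a colour value with its upper bytes set produces doubled pixels. This is a sketch under the assumption that it mirrors the assembly step by step (read the neighbouring byte, add, store the low 16 bits); the function name and the byte array standing in for VRAM are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the quoted DrawPixel: read the neighbouring byte, combine it
   with the colour by addition, and store the low 16 bits as one halfword. */
void draw_pixel_model(uint8_t *surface, unsigned x, unsigned y, uint32_t col)
{
    assert(x < 240u && y < 160u);
    unsigned offset = y * 240u + x;
    uint32_t combined;
    if ((x & 1u) == 0)
        combined = col + ((uint32_t)surface[offset + 1] << 8); /* even x: neighbour is the high byte */
    else
        combined = surface[offset - 1] + (col << 8);           /* odd x: neighbour is the low byte */
    unsigned base = offset & ~1u;                              /* halfword-aligned position */
    surface[base]     = (uint8_t)(combined & 0xFFu);           /* little-endian halfword store */
    surface[base + 1] = (uint8_t)((combined >> 8) & 0xFFu);
}
```

In this model, passing `0x05050505` at an even x adds the colour's second byte into the neighbouring pixel, so two pixels light up. Masking the colour to `0x05` first leaves the neighbour untouched, and odd x positions are unaffected either way because the extra bytes are shifted out of the stored halfword.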
#21632 - isildur - Wed Jun 02, 2004 2:16 am
Ok, for the plotpixel problem, I will try at work tomorrow, I don't have the code here.
For the timing problem, can you point me towards a good method to time my code on the gba?
#21641 - sasq - Wed Jun 02, 2004 9:21 am
Ok some quick calculations;
Setting a pixel in mode4 after calculating the address (which is the same for EWRAM) would take something like 7 cycles plus an eventual VRAM stall, so let's say 8.
ldrh, tst, orreq, orrne, strh
In the EWRAM case it's just a strb, which is 4 cycles with 3/3/6 waitstates (and setting them lower isn't safe AFAIK).
So each pixel takes 4 cycles longer in mode4.
Copying the screen from EWRAM to VRAM should take about 9 cycles per 32-bit word (6 + 2 mem access + ~1 cycle for opcodes), which means 86400 cycles (not including VRAM stall).
Which means that in this case, if you draw more than 21600 pixels the EWRAM case should be faster.
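The break-even estimate above is a short piece of arithmetic, sketched below in C. The function names are hypothetical; the inputs (9 cycles per copied word, 4 cycles saved per pixel) are the estimates from this post, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Cost of copying a 240x160 8-bit screen as 32-bit words. */
uint32_t screen_copy_cycles(uint32_t cycles_per_word)
{
    return (240u * 160u / 4u) * cycles_per_word;
}

/* Pixels per frame above which the buffered approach wins, given a fixed
   per-frame copy cost and a per-pixel saving (both in cycles). */
uint32_t break_even_pixels(uint32_t copy_cycles, uint32_t saving_per_pixel)
{
    assert(saving_per_pixel > 0);
    return copy_cycles / saving_per_pixel;
}
```

`screen_copy_cycles(9)` gives 86400 and `break_even_pixels(86400, 4)` gives 21600. Note that this ignores the cost of clearing the buffer, which would push the break-even point higher.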
#21647 - isildur - Wed Jun 02, 2004 2:20 pm
jd wrote: |
Try adding "and r2,r2,#0xff" at the beginning to see if that fixes the problem. |
Yep, that fixed it. Like you said, r2 had all 4 bytes set to the color index.
I will now work on my timing functions, obviously there's a problem with them...
#21649 - isildur - Wed Jun 02, 2004 4:49 pm
I still don't understand what can be wrong with my timers. This is the code I use to time my functions:
Code: |
void StartCycleCount(u32 freq)
{
// start the timer
TM2CNT_H = freq | TIMER_ENABLE;
// zero the timer
TM2CNT_L = 0;
// cascade timer
TM3CNT_H = TIMER_CASCADE | TIMER_ENABLE;
// zero the timer
TM3CNT_L = 0;
}
u32 StopCycleCount()
{
u32 cycles;
u32 overflow;
cycles = TM2CNT_L;
overflow = TM3CNT_L;
//Disable timers
TM2CNT_H = 0;
TM3CNT_H = 0;
// cycles + overflow * 65536
cycles += (overflow << 16);
return cycles;
}
// example
StartCycleCount(TIMER_FREQUENCY_0);
PlotPixel(25, 57);
numCycles = StopCycleCount();
|
Is there something wrong with this?
Another test:
Code: |
gba_vsync();
StartCycleCount(TIMER_FREQUENCY_256);
gba_vsync();
numCycles = StopCycleCount();
|
For this example, numCycles == 1097. So 1097 * 60 = 65820, which is quite close to 65536. A vsync at TIMER_FREQUENCY_256 should be about 1092. So it looks like it's reporting time the way it should.
I used this same method to make my EWRAM and VRAM tests. If someone sees something, then I'd like to know.
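As a cross-check on the timing helpers, here is a hedged C sketch of the two pieces of arithmetic involved: combining the cascaded counters, and the expected tick count for one frame. The function names are made up, and the frame length of 228 scanlines of 1232 cycles is the commonly documented figure for the GBA rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a 16-bit counter and its cascaded overflow counter into 32 bits. */
uint32_t combine_timers(uint16_t low, uint16_t overflow)
{
    return (uint32_t)low | ((uint32_t)overflow << 16);
}

/* Whole timer ticks in one video frame for a given prescaler (1, 64, 256, 1024).
   Assumes a frame of 228 scanlines of 1232 cycles each. */
uint32_t ticks_per_frame(uint32_t prescaler)
{
    assert(prescaler == 1u || prescaler == 64u || prescaler == 256u || prescaler == 1024u);
    return (228u * 1232u) / prescaler;
}
```

Under that assumption `ticks_per_frame(256)` is 1097, which matches the measured vsync reading exactly; the 1092 figure comes from assuming exactly 60 frames per second.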
#21654 - Miked0801 - Wed Jun 02, 2004 6:43 pm
Do you have a crt0.s that allows multiple interrupts (interrupts allowed to interrupt)? If so, any other interrupt can come in while you are playing with your interrupts and screw up your timing. For safety, I'd turn off the IME register at the beginning of your routine and re-enable on exit.
#21665 - sajiimori - Wed Jun 02, 2004 7:47 pm
*edit* nevermind...
#21666 - isildur - Wed Jun 02, 2004 7:55 pm
Miked0801 wrote: |
Do you have a crt0.s that allows multiple interrupts (interrupts allowed to interrupt)? If so, any other interrupt can come in while you are playing with your interrupts and screw up your timing. For safety, I'd turn off the IME register at the beginning of your routine and re-enable on exit. |
I use the default crt0 from devkitarm. Also, I am not using any interrupts.
And if it was that, I would get different readings every time. But when I time a routine, I always get the same timing for that given routine. If interrupts would mess my timing, then I would get different readings, I guess.
But like I said in my previous post, the same timing method times a call to vsync correctly, so why would it not time other routines correctly?
#21668 - sajiimori - Wed Jun 02, 2004 8:14 pm
It might be time to post more of your code. I just ran this and got figures around 7BF00 (using jd's plotter):
Code: |
main()
{
while(1)
{
int i;
REG_TM2CNT_H = 0;
REG_TM3CNT_H = 0;
REG_TM2CNT_H = TIME_FREQUENCY_SYSTEM | TIME_ENABLE;
REG_TM3CNT_H = TIME_OVERFLOW | TIME_ENABLE;
for(i = 0; i < 5000; ++i)
DrawPixel(0, 0, 0);
vba_print(
"%X\n", REG_TM2CNT_L + (REG_TM3CNT_L << 16));
}
}
|
#21670 - isildur - Wed Jun 02, 2004 8:41 pm
sajiimori wrote: |
It might be time to post more of your code. I just ran this and got figures around 7BF00 (using jd's plotter):
|
Which is around 101 cycles per pixel!
I tested the same loop on hardware with my pixel plotter and I had an even better time: 0x5578D, which is 70 cycles per pixel. With jd's plotter I got 0x5F3A2, which is 78 cycles per pixel.
Also testing with coords 0,0 for 5000 times is maybe not the best test. My first tests for the EWRAM comparison used a loop of 1000 times 5 different coords to get all the plotter's case figures (alignment).
I tested with sajiimori's timing loop and also with the one I was using and the results are the same, so my timing method must be correct.
So sajiimori, are we saying we both basically have the same results? :)
#21673 - sajiimori - Wed Jun 02, 2004 9:21 pm
Maybe now, but before you said 0x213400 for 5k pixels, which is not even close.
edit: BTW, if I optimize the calling code, it's under 60 cycles per pixel.
#21674 - isildur - Wed Jun 02, 2004 9:55 pm
sajiimori wrote: |
Maybe now, but before you said 0x213400 for 5k pixels, which is not even close.
|
That's very true, I just ran the tests again and to my surprise, I got very different results from the first speed tests... I don't understand :(
I will run those tests again tomorrow and post my results.
#21677 - Miked0801 - Wed Jun 02, 2004 10:22 pm
Just to be sure, you are compiling with either -O2 or -O3 right?
#21682 - jd - Thu Jun 03, 2004 12:13 am
When testing something as fast as a pixel plotter the testing loop can easily dominate the measurement. To get a more accurate figure I wrote the following bit of assembler to plot 2500 pixels at a time:
Code: |
.TEXT
.SECTION .iwram,"ax",%progbits
.ALIGN
.ARM
.GLOBAL Put2500Pixels
@ CODE_IN_IWRAM void Put2500Pixels( int x, int y, int colour, void* video_buffer );
Put2500Pixels:
stmfd r13!,{r4,r5,r6,r7}
mov r4,r0
mov r5,r1
mov r6,r2
mov r7,r3
ldr r12,=2500
loop:
mov r0,r4 @ x
mov r1,r5 @ y
mov r2,r6 @ colour
mov r3,r7 @ video_buffer
add r3,r3,r0
ands r0,r0,#0x1
add r3,r3,r1,lsl #8
sub r3,r3,r1,lsl #4
ldreqb r1,[r3,#1]
addeq r1,r2,r1,lsl #8
ldrneb r1,[r3,#-1]
addne r1,r1,r2,lsl #8
strh r1,[r3,-r0]
subs r12,r12,#1
bgt loop
ldmfd r13!,{r4,r5,r6,r7}
bx lr
.pool
|
You'll notice that we aren't branching to the function, although there is a "subs" and "bgt" operation as a result of the test code which more than makes up for the missing "bx lr". The performance of this code was measured on real hardware as shown below:
Code: |
ptimer = GetTimer();
Put2500Pixels( 100, 100, 5, (void*)0x06000000 );
Put2500Pixels( 101, 100, 5, (void*)0x06000000 );
ntimer = GetTimer();
printf_special( "%d ticks\n", ntimer - ptimer );
|
The two calls to "Put2500Pixels" were to test the performance with even and odd pixels. This gives a figure of approximately 20 cycles per pixel, which is what you'd expect.
#21685 - Miked0801 - Thu Jun 03, 2004 12:37 am
Could you run that plotter to EWRAM for a comparison please (and perhaps IWRAM for fun)?
#21689 - jd - Thu Jun 03, 2004 1:14 am
Miked0801 wrote: |
Could you run that plotter to EWRAM for a comparison please (and perhaps IWRAM for fun)? |
VRAM: ~20 cycles/pixel
EWRAM: ~14 cycles/pixel
IWRAM: ~12 cycles/pixel
You can see that the test loop and setting the parameters are really dominating here. Note that I used the following pixel plot routine for the EWRAM and IWRAM tests:
Code: |
.GLOBAL Put2500Pixels2
@ CODE_IN_IWRAM void Put2500Pixels2( int x, int y, int colour, void* video_buffer );
Put2500Pixels2:
stmfd r13!,{r4,r5,r6,r7}
mov r4,r0
mov r5,r1
mov r6,r2
mov r7,r3
ldr r12,=2500
loop2:
mov r0,r4 @ x
mov r1,r5 @ y
mov r2,r6 @ colour
mov r3,r7 @ video_buffer
add r3,r3,r1,lsl #8
sub r3,r3,r1,lsl #4
strb r2,[r3,r0]
subs r12,r12,#1
bgt loop2
ldmfd r13!,{r4,r5,r6,r7}
bx lr
|
To go back to the original question, you'd need to be plotting ~20,000 pixels per frame to counteract the extra copying and blanking overhead of working from EWRAM.
#21776 - jd - Fri Jun 04, 2004 6:44 pm
I've managed to fix the faster VRAM pixel plotter mentioned earlier. The test code is below:
Code: |
.GLOBAL Put2500Pixels3
@ CODE_IN_IWRAM void Put2500Pixels3( int x, int y, int colour, void* video_buffer );
Put2500Pixels3:
stmfd r13!,{r4,r5,r6,r7}
mov r4,r0
mov r5,r1
mov r6,r2
mov r7,r3
ldr r12,=2500
loop3:
mov r0,r4 @ x
mov r1,r5 @ y
mov r2,r6 @ colour
mov r3,r7 @ video_buffer
add r3,r3,r1,lsl #8
sub r3,r3,r1,lsl #4
eor r0,r0,#1
ldrb r1,[r3,r0]!
ands r0,r0,#1
addeq r1,r1,r2,lsl #8
addne r1,r2,r1,lsl #8
strh r1,[r3,-r0]
subs r12,r12,#1
bgt loop3
ldmfd r13!,{r4,r5,r6,r7}
bx lr
|
The updated figures are as follows:
VRAM: ~19.5 cycles/pixel
EWRAM: ~14 cycles/pixel
IWRAM: ~12 cycles/pixel
This is a rather weird result - I would expect the new VRAM plotter to be 1 cycle faster than the old one. Has anyone got any ideas why this isn't the case?
#21781 - jd - Fri Jun 04, 2004 9:15 pm
* Deleted *
#21879 - isildur - Tue Jun 08, 2004 6:29 pm
Ok, sorry for the delay, I had too much real job work... I did all my tests again and have the results. I don't know what happened when I did the first tests, like Mike said, maybe I had turned off optimization or something.
So basically, like jd and sajiimori said, you start gaining speed with the EWRAM buffered method when you draw over 20000 pixels.
Drawing 5000 pixels:
------------------------
VRAM
Clearing the screen with a cpufastset and plotting 5000 pixels in a C loop calling assembly routines took: 0x564AA; for 20000 it took: 0x14317C
EWRAM buffer
Clearing the buffer, drawing 5000 pixels and copying the buffer to VRAM took: 0x6A2E9; for 20000 it took: 0x13893B
=============================================
So, I guess that in a raster intensive game (3D for instance) you still have to draw all the 38400 pixels each frame. There won't be any bitmap block copy or sprites. Correct me if I'm wrong but in a case like that, the EWRAM buffered method would be faster. Not by much but still faster.
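One way to estimate where the two methods cross over from these four measurements is to assume each method's cost grows linearly with the pixel count between the two measured points. That linear model is an assumption made for this sketch, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the pixel count at which two methods cost the same, assuming each
   method's cost grows linearly between two measured points (n1 and n2 pixels). */
double crossover_pixels(double n1, double n2,
                        double a1, double a2,   /* method A cost at n1 and n2 */
                        double b1, double b2)   /* method B cost at n1 and n2 */
{
    assert(n2 > n1);
    double slope_a = (a2 - a1) / (n2 - n1);
    double slope_b = (b2 - b1) / (n2 - n1);
    assert(slope_a != slope_b);
    /* Solve a1 + slope_a*(n - n1) == b1 + slope_b*(n - n1) for n. */
    return n1 + (b1 - a1) / (slope_a - slope_b);
}
```

Calling `crossover_pixels(5000, 20000, 0x564AA, 0x14317C, 0x6A2E9, 0x13893B)` with VRAM as method A and EWRAM as method B puts the crossover somewhere around 15,000 pixels for this particular test, between the two measured points. With only two samples per method this is a rough indication, not a precise figure.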
#21889 - jd - Tue Jun 08, 2004 9:00 pm
isildur wrote: |
So, I guess that in a raster intensive game (3D for instance) you still have to draw all the 38400 pixels each frame. There won't be any bitmap block copy or sprites. Correct me if I'm wrong but in a case like that, the EWRAM buffered method would be faster. Not by much but still faster. |
No, because in a situation like that you'd be able to draw your polygons as a series of spans. This would allow you to write out all but the pixels at the beginning and end of the spans with 16 or 32 bit writes, which would almost completely eliminate the downside of VRAM (the lack of 8-bit writes) whilst maintaining the main advantages over EWRAM (faster memory access, no need to copy to VRAM when you're done). Unless your polygons were extremely small, VRAM would still be faster in this case.
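The span approach described here can be sketched in C as follows. This is an illustrative version only: the function name and the array standing in for VRAM are assumptions, and a real routine would live in IWRAM assembly as discussed earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Fill pixels [x0, x1) on row y of a 240-wide, 8-bit surface that only
   accepts 16-bit writes. Only the unaligned ends need read-modify-write. */
void fill_span(uint16_t *surface, unsigned x0, unsigned x1, unsigned y, uint8_t colour)
{
    assert(x0 <= x1 && x1 <= 240u && y < 160u);
    uint16_t *row = surface + (y * 240u) / 2u;
    uint16_t both = (uint16_t)(colour | ((uint16_t)colour << 8));

    if (x0 < x1 && (x0 & 1u)) {                 /* odd start: keep the low byte */
        row[x0 >> 1] = (uint16_t)((row[x0 >> 1] & 0x00FFu) | ((uint16_t)colour << 8));
        ++x0;
    }
    while (x0 + 2u <= x1) {                     /* aligned middle: whole halfwords */
        row[x0 >> 1] = both;
        x0 += 2u;
    }
    if (x0 < x1) {                              /* single trailing pixel: keep the high byte */
        row[x0 >> 1] = (uint16_t)((row[x0 >> 1] & 0xFF00u) | colour);
    }
}
```

At most two read-modify-write operations happen per span regardless of its length, which is why the lack of 8-bit writes matters less for filled polygons than for individual pixels.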
#21892 - isildur - Tue Jun 08, 2004 9:40 pm
jd wrote: |
No, because in a situation like that you'd be able to draw your polygons as a series of spans. This would allow you to write out all but the pixels at the beginning and end of the spans with 16 or 32 bit writes, which would almost completely eliminate the downside of VRAM (the lack of 8-bit writes) whilst maintaining the main advantages over EWRAM (faster memory access, no need to copy to VRAM when you're done). Unless your polygons were extremely small, VRAM would still be faster in this case. |
Yes, I know about those 16 or 32 bit writes, I do them in my horiz line plotter. The 16 bit writes would also be done to the EWRAM buffer without the first and last pixel align check. I will have to test this when I'm finished with my polygon drawing routine conversion to assembly.
#21894 - jd - Tue Jun 08, 2004 10:04 pm
isildur wrote: |
Yes, I know about those 16 or 32 bit writes, I do them in my horiz line plotter. The 16 bit writes would also be done to the EWRAM buffer without the first and last pixel align check. |
Yes. However, VRAM is faster than EWRAM so writing directly will always be faster except for very short spans - and even then you'll need to do a large number of them before the speed gain is enough to counteract the need to copy the data across afterwards with the EWRAM method.
#22665 - ken2 - Sat Jun 26, 2004 4:30 pm
I read a while back in this thread about rotating/scaling Mode 5 to make it fullscreen. Does anyone have a location on a decent tutorial for this?
#22673 - dagamer34 - Sat Jun 26, 2004 6:45 pm
ken2 wrote: |
I read a while back in this thread about rotating/scaling Mode 5 to make it fullscreen. Does anyone have a location on a decent tutorial for this? |
Yeah, you play around with the rot/scale registers for BG 2 to get the effect you want. Here's what you want:
Code: |
*(unsigned short*)0x4000020 = 0;        // BG2PA
*(unsigned short*)0x4000022 = -256;     // BG2PB
*(unsigned short*)0x4000024 = 128;      // BG2PC
*(unsigned short*)0x4000026 = 0;        // BG2PD
*(unsigned short*)0x4000028 = 319 << 7; // BG2X (low halfword)
|
And to set the rot/scale registers back to normal (any mode, not just mode 5)
Code: |
*(unsigned short*)0x4000020 = 256;      // BG2PA
*(unsigned short*)0x4000022 = 0;        // BG2PB
*(unsigned short*)0x4000024 = 0;        // BG2PC
*(unsigned short*)0x4000026 = 256;      // BG2PD
*(unsigned short*)0x4000028 = 0;        // BG2X (low halfword)
|
Sorry it's not "glossy", but it works.
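To see what the first set of values does, the sketch below maps each screen pixel to the mode 5 bitmap pixel it would sample. It assumes the usual affine background rule (bitmap position = reference point + matrix times screen position, in 8.8 fixed point) and a vertical reference point of 0, which the post does not set explicitly; the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Background 2 affine parameters as written in the post (8.8 fixed point).
   REF_Y = 0 is an assumption: the post does not write that register. */
enum { PA = 0, PB = -256, PC = 128, PD = 0, REF_X = 319 << 7, REF_Y = 0 };

/* Map a screen pixel to the bitmap pixel it samples. */
void screen_to_bitmap(int sx, int sy, int *bx, int *by)
{
    assert(sx >= 0 && sx < 240 && sy >= 0 && sy < 160);
    *bx = (REF_X + PA * sx + PB * sy) >> 8;   /* drop the 8 fractional bits */
    *by = (REF_Y + PC * sx + PD * sy) >> 8;
}
```

Under these assumptions the bitmap is rotated by 90 degrees: the 160 screen rows map one-to-one onto the bitmap's 160 columns, and the 240 screen columns map onto 120 of its 128 rows with each row doubled. Every screen pixel lands inside the 160x128 bitmap, so the picture fills the screen at an effective 120x160 resolution.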