gbadev.org forum archive

The program I'm working on graphs a function y=f(x) and that function is changed frequenty (possibly every frame if possible). Currently I'm working on the graphics part of this program and I'm using dummy values u16 solution[240] to store the y values of the function.

The problem I'm getting is that it's taking about 20 frames just to draw the bar graph once and I was hoping someone can point out where I'm going wrong.

Code:

void DrawVLine(u16 x, u16 ymin, u16 ymax, u8 c)
{
int loop;

if (x & 0x0001)
{
   x=x>>1;
   for (loop = (ymin*120+x); loop < (ymax*120+x); loop+=120)
   {
      VBuffer[loop] = (VBuffer[loop] & 0x00FF) | (c << 8);
   }
}
else
{
   x=x>>1;
   for (loop = (ymin*120+x); loop < (ymax*120+x); loop+=120)
   {
      VBuffer[loop] = (VBuffer[loop] & 0xFF00) | c;
   }
}
}

void DrawBarGraph(u16 c)
{
u16 loop;
s16 yval;
for (loop=0;loop<240;loop++)
{
   yval=solution[loop];
   if (yval<=120 && yval>0)
      DrawVLine(loop, yval,120,c);
}
}

void clearscreen(u16 c)
{

   REG_DM3SAD = (u32)(&c);
   REG_DM3DAD = (u32)VBuffer;
   REG_DM3CNT = 120*160 |DMA_16FILLNOW;
}

my main loop:
while(1) {
   clearscreen(1);
   DrawBarGraph(64);
   WaitForVsync();
   Flip();
   //DMA Copy the OAM Table
   REG_DM3SAD = (u32)sprites;
   REG_DM3DAD = (u32)OAM;
   REG_DM3CNT = 128 * 4 |DMA_16NOW;
   //END DMA
}

This is my first program with GBA so I'm not sure I picked the correct video mode for my goal. My reasoning was since I needed max resolution mode 5 was out and since Mode 3 uses twice the memory it'd take twice as long to clearscreen (and my program can be done with 3-4 colors fine). But since I'm here I wanted to also ask the experts if Mode 4 was the correct decision.

I guess I could store the difference between the current values and the previous values and only draw/erase the difference, but even so the worse case scenario would be just as laggy...

I made a change that sped it up about 4 fold I think by drawing 2 lines at a time. Now I have half as many iterations in drawbargraph and the draw2Vline function no longer needs to lookup old value and mask it in innermost loop.

But it's still not enough...

Code:

void Draw2VLine(u16 x, u16 ymin1, u16 ymin2, u16 ymax, u8 c)
{
int loop;
u16 c16;
x=x>>1;

if (ymin1 > ymin2 )
{
   c16=c << 8;
   for (loop = (ymin2*120+x); loop < (ymin1*120+x); loop+=120)
      VBuffer[loop] = c16;//Just the right pixel
   c16=c | (c<<8);
   for (;loop < (ymax*120+x);loop+=120)
      VBuffer[loop] = c16;//will write C to both high and low
}
else
{
   c16=c;
   for (loop = (ymin1*120+x); loop < (ymin2*120+x); loop+=120)
      VBuffer[loop] = c16;//Just the left pixel
   c16=c | (c<<8);
   for (;loop < (ymax*120+x);loop+=120)
      VBuffer[loop] = c16;//will write C to both high and low
}
}

void DrawBarGraph(u16 c)
{
u16 loop;
for (loop=0;loop<240;loop+=2)
{//Note: range checking was moved elsewhere
   Draw2VLine(loop, solution[loop],solution[loop+1],120,c);
}
}

If you could make your bar-graph horizontal instead of vertical, you'd have much simpler, faster code as you could DMA your data to VRAM per f(x) line function at one time.

Still, I don't see why this is running as slow as you claim. You are compiled optimized and have your test data in IWRAM right? This code even as thumb in ROM will run plenty fast.

Mode 4 seems a good solution for your needs though mode 5 rotated 90 degrees so it fits on screen with double buffering would work as well (though perhaps not quite as fast.)

Wait, if you wanted a vertical graph, you could write it horizontally and rotate the screen 90 degrees giving you the speed I was talking about above with your functionality. :)

You could try using lines and then fill the space (not floodfill though).
For each line you just plot one pixel on each horizontal scanline (or build a scantable) then you draw horizontal lines (fill the space) between all the previous lines. Not sure it will be that much faster though.
_________________
I probably suck, my not is a programmer.

I dunno Mike -- I agree that the code looks about as fast as it's gonna get using this method, but should we really expect to be able to redraw a whole mode 4 screen in 1/60s, without DMA?

The rotated mode 5 trick should be enough, but it costs some resolution.

Stray7Xi, do you know what kinds of values to expect in the 'solution' array? If they are often or always all above a certain point, you could DMA fill the bottom area and draw shorter lines. If you're not sure how low they go, but they are usually all high, you could search for the lowest one and DMA fill below that.

If you expect them to be around the middle on average, you could DMA fill the bottom half of the screen. When drawing a line that's less than half-height, you'd erase the remainder of the filled portion. When drawing a line that's more than half-height, you'd just add the additional pixels. That could save a lot of plotting.

The point is that knowledge about your data set can help you optimize for specific cases.

This should be quite a bit faster as it eliminates the complex end-loop test and makes a few other changes. Make sure you compile it as ARM and put it in IWRAM for maximum performance. Note that this code is untested so I might have screwed up somewhere.

Code:

void Draw2VLine( int x, int ymin1, int ymin2, int ymax, int c )
{
int loop, c16;
u16* vram_address;
x >>= 1;

if ( ymin1 > ymin2 )
{
c16 = c << 8;
vram_address = &VBuffer[(ymin2*120)+x];
for( loop = ymin1-ymin2; loop > 0; --loop )
{
*vram_address = (u16)c16;
vram_address += 120;
}

c16 |= c;
for( loop = ymax-ymin1; loop > 0; --loop )
{
*vram_address = (u16)c16;
vram_address += 120;
}
}
else
{
c16 = c;
vram_address = &VBuffer[(ymin1*120)+x];
for( loop = ymin2-ymin1; loop > 0; --loop )
{
*vram_address = (u16)c16;
vram_address += 120;
}

c16 |= c<<8;
for( loop = ymax-ymin2; loop > 0; --loop )
{
*vram_address = (u16)c16;
vram_address += 120;
}
}
}

To optimise this further you could try manually unrolling the inner loop or writing it in assembler. Ideally, the "+= 120" would be done as a post-increment as part of the strh instruction - I'm not sure if the compiler will be smart enough to pick this up. The reason why I switched to using ints rather the u16/u8s in the code is because otherwise the compiler might generate unnecessary ands.

If you do write a fully unrolled assembler version (which will basically be lots of strh's and bx into them), you should get a fill rate of about 11 million pixels per second, which in the worst case would work out at ~300 FPS. Although this ignores how long it takes to evaluate f(x), I doubt performance will be an issue once the code is fully optimised. :)

Last edited by jd on Tue Apr 27, 2004 12:22 am; edited 1 time in total

jd, doesn't the optimizer take care of all that? The loop invariants, I mean.

sajiimori wrote:

jd, doesn't the optimizer take care of all that? The loop invariants, I mean.

If it's clever then it should - although unfortunately GCC ARM is rarely that smart. Certainly something is holding back the performance as it's easily possible to redraw the screen completely in less than 1/60th of sec with the CPU.

What the heck, here's a fully optimised version - again, it's untested. It's also rather wasteful of IWRAM and will fail spectacularly if one of the lines is taller than 160 pixels - although this is the height of the screen so it shouldn't be a major limitation. It will also fail if ymax is less than ymin1 or ymin2.

C code:

Code:

void Draw2VLine( int x, int ymin1, int ymin2, int ymax, int c )
{
int c16;
u16* vram_address;

x >>= 1;
if ( ymin1 > ymin2 )
{
c16 = c << 8;
vram_address = DrawSpan( &VBuffer[(ymin2*120)+x], ymin1 - ymin2, c16 );
c16 |= c;
DrawSpan( vram_address, ymax - ymin1, c16 );
}
else
{
vram_address = DrawSpan( &VBuffer[(ymin1*120)+x], ymin2 - ymin1, c );
DrawSpan( vram_address, ymax - ymin2, c | (c<<8) );
}
}

Header for assembler code:

Code:

#define CODE_IN_IWRAM __attribute__ ((section (".iwram"), long_call))

CODE_IN_IWRAM u16* DrawSpan( u16* vram_address, int loop, int c16 );

Assembler:

Code:

.TEXT
.SECTION .iwram0,"ax",%progbits
.ALIGN
.ARM

.GLOBAL DrawSpan

DrawSpan:
adr r3,_end
sub r3,r3,r1,lsl #2
bx r3
.rept 160 @ Might have to increase this to avoid error with "adr" above
strh r2,[r0],#240
.endr
_end:
bx lr

You could speed it up marginally by putting the whole thing in assembler (rather than just the inner loops) but it would only be a slight gain. (Assuming the C code is compiled as ARM code in IWRAM.)

Wow... thanks, wasn't expecting so much help.

I implemented the C code, it appears to be running at ~20 fps now. I'm not sure I'm using the best way to measure FPS, just using a 1024 timer and storing the value at beginning and end of while loop (as well as before and after the bargraph function). Then diving the 16million by number of cycles the timer says. It seemed accurate enough when the function was running real slow, but now I can't verify by eye if it's giving me accurate info anymore ;)

I'll implement the asm code tomorrow when I get time to look up stuff on arm asm.

I definately will fool around with Sajiimori's suggestion about knowledge of data. I'll have to see what the data looks like (the hardware that's supplying the data isn't done yet). may do the clearscreen to half dark, half clear. I may instead of clearing the screen each frame I'll use previous frame and draw the difference. I guess that'd mean I have to switch from buffer flipping to DMA'ing the backbuffer to the front so the data would be current. But by removing the clearscreen that shouldn't cost anything.

You guys have been really helpful, thank you :)

Quote:

But by removing the clearscreen that shouldn't cost anything.

It costs the memory reads required to access the old frame buffer. Clearing the screen only requires writes, so it's twice as fast.

Ack that explains my results. Was figuring I'd be drawing half as much (with the dummy data I'm still using) but it was taking almost the same amount of time (still a bit faster).

I guess I'll go back to page flipping and store the data for the previous 2 frames so I can do the comparison. I should of been doing that anyways since it would of saved me a large DMA.

BTW I tried to join the #gbadev channel today, it's on efnet right? It said address was banned. First time trying to join, Is there a lot of subnets banned?

Out of curiosity, I compiled Stray7Xi's second version of the code using -O3. Here are the 4 inner loops:

Code:

; for (loop = (ymin2*120+x); loop < (ymin1*120+x); loop+=120)
; VBuffer[loop] = c16;
.L7:
mov r3, r1, asl #1
add r1, r1, #120
cmp r1, lr
strh r2, [r3, ip] @ movhi
blt .L7

; for (;loop < (ymax*120+x);loop+=120)
; VBuffer[loop] = c16;
.L12:
mov r3, r1, asl #1
add r1, r1, #120
cmp r1, r0
strh r2, [r3, ip] @ movhi
ldmgefd sp!, {r4, r5, pc}
b .L12

; for (loop = (ymin1*120+x); loop < (ymin2*120+x); loop+=120)
; VBuffer[loop] = c16;
.L18:
mov r3, r1, asl #1
add r1, r1, #120
cmp r1, r2
strh r5, [r3, ip] @ movhi
blt .L18

; for (;loop < (ymax*120+x);loop+=120)
; VBuffer[loop] = c16;
.L23:
mov r3, r1, asl #1
add r1, r1, #120
cmp r1, r0
strh r2, [r3, ip] @ movhi
ldmgefd sp!, {r4, r5, pc}
b .L23

It appears GCC is clever enough after all. Note that the 2nd and 4th inner loops each have an extra instruction because they return upon completion.

sajiimori wrote:

It appears GCC is clever enough after all. Note that the 2nd and 4th inner loops each have an extra instruction because they return upon completion.

Yes, although it does miss the optimisations that can be made when accessing VBuffer so I guess that's why Stray7Xi still got some performance gain from the revised C code. I'm pretty sure the code must be executing in ROM otherwise it would be a lot faster than 20 FPS (more like 80).

Ya it likely is in ROM, I started looking at the details of loading it into IWRAM. Next time I have some free time I'll have to pour over the GCC documentation. This is my first project with gcc so I was just using the default compile options.

I'll prally get to this tomorrow. I appreciate all the help and input

Stray7Xi wrote:

Ya it likely is in ROM, I started looking at the details of loading it into IWRAM. Next time I have some free time I'll have to pour over the GCC documentation. This is my first project with gcc so I was just using the default compile options.

I see. Generally, you either want to use thumb code in ROM (for non-critical code) or ARM code in IWRAM (for performance-critical code - on average this is about 4x faster than thumb code in ROM). The default (at least on my version of devkitadv) is ARM code in ROM, which is a really bad combination (ARM is slower and uses more space than thumb when executed from ROM).

Quote:

Yes, although it does miss the optimisations that can be made when accessing VBuffer so I guess that's why Stray7Xi still got some performance gain from the revised C code.

No, I compiled your C version and the 4 inner loops were equivalent to the ones I posted earlier. They use a slightly different strategy, but should be identical in speed. Of course, neither version makes GCC produce optimal assembly, but the point was that GCC can make the first version as fast as the second version.

Code:

; for( loop = ymin1-ymin2; loop > 0; --loop )
; {
; *vram_address = (u16)c16;
; vram_address += 120;
; }
.L7:
sub ip, ip, #1
cmp ip, #0
strh r3, [r0, #0] @ movhi
add r0, r0, #240
bgt .L7

; for( loop = ymax-ymin1; loop > 0; --loop )
; {
; *vram_address = (u16)c16;
; vram_address += 120;
; }
.L12:
sub ip, ip, #1
cmp ip, #0
strh r3, [r0, #0] @ movhi
add r0, r0, #240
ldmlefd sp!, {r4, r5, r6, pc}
b .L12

; for( loop = ymin2-ymin1; loop > 0; --loop )
; {
; *vram_address = (u16)c16;
; vram_address += 120;
; }
.L18:
sub ip, ip, #1
cmp ip, #0
strh r5, [r0, #0] @ movhi
add r0, r0, #240
bgt .L18

; for( loop = ymax-ymin2; loop > 0; --loop )
; {
; *vram_address = (u16)c16;
; vram_address += 120;
; }
.L23:
sub ip, ip, #1
cmp ip, #0
strh r3, [r0, #0] @ movhi
add r0, r0, #240
ldmlefd sp!, {r4, r5, r6, pc}
b .L23

sajiimori wrote:

No, I compiled your C version and the 4 inner loops were equivalent to the ones I posted earlier. They use a slightly different strategy, but should be identical in speed. Of course, neither version makes GCC produce optimal assembly, but the point was that GCC can make the first version as fast as the second version.

You're right, but it's mainly because GCC has made a right dog's dinner of the second version - it does sub then cmp (rather than just a subs) and also doesn't do the increment as part of the strh operation, both of which are pretty obvious optimisations. Still, that does mean that the reason why Stray7Xi got a speed gain with the alternative code is a bit of a mystery.

gbadev.org forum archive

C/C++ > Graphing a Function, speed problem

#19794 - Stray7Xi - Sun Apr 25, 2004 6:46 pm

#19797 - Stray7Xi - Sun Apr 25, 2004 7:54 pm

#19801 - Miked0801 - Sun Apr 25, 2004 9:12 pm

#19836 - FluBBa - Mon Apr 26, 2004 3:02 pm

#19846 - sajiimori - Mon Apr 26, 2004 6:36 pm

#19859 - jd - Tue Apr 27, 2004 12:06 am

#19860 - sajiimori - Tue Apr 27, 2004 12:20 am

#19861 - jd - Tue Apr 27, 2004 12:25 am

#19863 - jd - Tue Apr 27, 2004 12:38 am

#19881 - Stray7Xi - Tue Apr 27, 2004 7:14 am

#19892 - sajiimori - Tue Apr 27, 2004 7:01 pm

#19899 - Stray7Xi - Tue Apr 27, 2004 9:08 pm

#19904 - sajiimori - Tue Apr 27, 2004 9:44 pm

#19912 - jd - Tue Apr 27, 2004 11:02 pm

#19921 - Stray7Xi - Wed Apr 28, 2004 1:28 am

#19926 - jd - Wed Apr 28, 2004 2:34 am

#19932 - sajiimori - Wed Apr 28, 2004 5:50 am

#19935 - jd - Wed Apr 28, 2004 5:59 am