#12050 - johnny_north - Tue Oct 28, 2003 5:02 am
Could someone suggest how I could futher optimize this function:
Code: |
void hbl(){
static bool skipFlag = false;
static u16 passCnt = 0;
if(REG_VCOUNT <115){
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
cnectr+=2;
skipFlag = false;
return;
}
else if(REG_VCOUNT <160){
cnectr-=2;
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
return;
}
else if(!skipFlag){
skipFlag = true;
passCnt++;
cnectr = passCnt*232;
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
cnectr+=2;
}
} |
Right now, it's compiled in ARM running from IWRAM
cnectr (initialized to zero) is a global u16 used as an index into the circle[] array
circle[] is a large u16 array of left and right window value pairs stored in rom
232 is the number of l&r window pairs needed to produce one frame of the spotlight (see below).
optimization -O3
This function is called for 60 frames and produces a shrinking spotlight effect similar to the Mario games. Currently, the effect produces some garbage horizontal lines (very minimaly) that have improved as it is optimised. It appears to run too slow from THUMB/rom and hangs the hardware, but I'd like for it not to be taking up IWRAM if possible. I'm not opposed to trying asm if necessary.
Thanks.
#12052 - Burton Radons - Tue Oct 28, 2003 7:05 am
REG_VCOUNT is being accessed twice because it's volatile, although stashing it in a local will only be helpful if it puts it in a register.
circle should be made so that it can be accessed as a simple u16 and fed directly into REG_WIN0H. cnectr should be made into a pointer instead of an index. skipFlag shouldn't be necessary - service HBlank before VBlank!
Doing all that cuts off 45% of the instructions; here's my modified copy.
Code: |
extern const u16 circle [];
const u16 *cnectr;
void hbl ()
{
static u16 passCnt = 0;
u32 vcount = REG_VCOUNT;
if(vcount <115)
REG_WIN0H = *cnectr++;
else if(vcount <160)
REG_WIN0H = *--cnectr;
else if(vcount==160){
passCnt++;
cnectr = &circle[passCnt*232];
REG_WIN0H = *cnectr++;
}
}
|
Naturally, the 232 there should be changed to whatever's appropriate.
#12053 - sajiimori - Tue Oct 28, 2003 7:15 am
Code: |
passCnt++;
cnectr = &circle[passCnt*232];
|
If passCnt isn't being used for anything else, would it be faster to avoid the multiplication?
Code: |
passCnt += 232;
cnectr = &circle[passCnt];
|
#12054 - DekuTree64 - Tue Oct 28, 2003 7:20 am
You could also make it even smaller if you just make a table of 160 WIN0H values and then just do
REG_WIN0H = table[REG_VCOUNT];
in your interrupt.
Then recalculate your table on VBlank to shrink the circle or whatever. It looks like you're using a bunch of precalculated tables for your circle right now, so if you're not sure how to do it dynamically, the easiest way is to use something like the Bresenham circle drawing algorithm, except don't draw pixels, just keep track of the smallest/largest x values for each scanline. Then those will be your window values. FOr any scanlines above/below the circle, just set the window size to 0, 0 so it doesn't show.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#12055 - Touchstone - Tue Oct 28, 2003 10:19 am
You could break it up into three seperate functions, like this: Code: |
static bool skipFlag = false;
static u16 passCnt = 0;
void hbl0()
{
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
cnectr+=2;
skipFlag = false;
if(REG_VCOUNT == 114)
setHBLFunction(hbl1);
}
void hbl1()
{
cnectr-=2;
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
if( REG_VCOUNT == 159)
setHBLFunction(hbl2);
}
void hbl2()
{
if(!skipFlag)
{
skipFlag = true;
passCnt++;
cnectr = passCnt*232;
REG_WIN0H = (circle[cnectr]<<8| circle[cnectr+1]);
cnectr+=2;
}
if (REG_VCOUNT == 227)
setHBLFunction(hbl0);
} |
Something like that maybe? Atleast you get the idea of splitting it into seperate quicker functions instead of just having one slower.
If you are having problem with the hardware registers being updated to late (i.e. when the line has already begun being rendered) you can make sure that the first thing each function do is to set the hardware register and then it process and makes ready for the next line.
_________________
You can't beat our meat
#12056 - tom - Tue Oct 28, 2003 10:47 am
DekuTree64 wrote: |
You could also make it even smaller if you just make a table of 160 WIN0H values and then just do
REG_WIN0H = table[REG_VCOUNT];
in your interrupt. |
this will also allow you to use HDMA transfers instead of the horizontal blank interrupt.
#12072 - johnny_north - Tue Oct 28, 2003 11:20 pm
Thanks everyone for the response. I?m especially intrigued with the HDMA ? I?ve never used it before.
Implement like this?
Set up DMA 3 with increment source (&circle[]), fixed dest (REG_WIN0H), DMA repeat, 16 bit transfer type, DMA start timing on hblank, interrupt at the end of word?
If I wanted to start the transfer on the last hidden scan line (so that the operation will affect scan line 0 first) do I set bit15 to enable the DMA in the second to last hblank?
Do I clear bit15 on the 158th hblank?
#12074 - DekuTree64 - Wed Oct 29, 2003 12:00 am
Yeah, DMA would be the fastest. You do have to reset it on VBlank though. First set DM3CNT to 0 to disable it (otherwise it won't work on hardware) then set DM3SAD back to the start of the table and start it up again. The DMA won't trigger during the VBlank, so you don't have to worry about that. And yes, the top scanline of the screen is a problem. Since HBlank occurrs at the end of the line, you actually need to offset your table entries by one, so the first entry is the WIN0H for the second scanlinem second entry for third scanline, and last entry (159) for scanline 0, so it sets it up for the next frame at the end of the current frame.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#12077 - johnny_north - Wed Oct 29, 2003 1:39 am
Ok, I want to change my const circle dec/def so that I can reinterpret the pairs of u8s as u16s and skip the orring:
I change
Code: |
const u16 circle[]={.... |
to
Code: |
const u8 circle[]={.... |
the compiler wont let me make the following assignment:
Code: |
static const u16 *cnectr = const_cast<u16*>circle; |
circle is now a pointer to const u16 data and cnectr is a pointer to const u16 data. What's the correct way to do this?
#12080 - tepples - Wed Oct 29, 2003 2:50 am
If "the compiler won't let [you] make" an assignment, what error does g++ emit? I can't suggest a solution without a verbatim copy of the error message.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#12081 - johnny_north - Wed Oct 29, 2003 3:14 am
The compiler was complaining of a parse error before the ";" on the line:
Code: |
static const u16 *cnectr = const_cast<u16*>circle; |
This works:
Code: |
static const u16 *cnectr = (const u16*)circle; |
But I got the impression that the first was favored and the second depreciated in the current standard. I assume I'm goofing the sytax somehow.
#12082 - tepples - Wed Oct 29, 2003 3:24 am
johnny_north wrote: |
The compiler was complaining of a parse error before the ";" |
In that case, it appears you need to surround the argument with parentheses:
Code: |
static const u16 *cnectr = const_cast<u16*>(circle); |
Think of const_cast as a template function.
But I don't think you need the const_cast here; const_cast is for removing constness. C++ adds constness wherever implied. Use static_cast instead, and again, you need the parentheses:
Code: |
static const u16 *cnectr = static_cast<u16*>(circle); |
Don't use dynamic_cast if you can avoid it, as dynamic_cast will trigger run-time type information (RTTI) to be included in your program, which slows things down.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#12083 - Burton Radons - Wed Oct 29, 2003 3:25 am
Code: |
static const u16 *cnectr = const_cast<u16*>(circle); |
It's a function, rather than a syntax primitive like a C cast.
#12086 - johnny_north - Wed Oct 29, 2003 4:12 am
Thanks for the correction guys and for the link. Now when I compile this with the correct syntax:
Code: |
static const u16 *cnectr = const_cast<u16*>(circle); |
I get the error
invalid static_cast from type `const unsigned char[27840]' to type
`u16*'
I was doing this to save myself some work by trying to get this:
Code: |
const u8 circle[] = {0, 209, 1, 208, 2, 207.... |
to be behave like:
Code: |
const u16 circle[] = {0<<8|209, 1<<8|208, 2<<8|207... |
instead, I think I'm getting this behavior as the window seems to be turned inside out:
Code: |
const u16 circle[] = {209<<|0, 208<<8|1, 207<<|2... |
Also, for those wanting to try this at home, I also observed that
Code: |
else if(vcount==160){ |
doesn't seem to work for me if I have the vblank interrupt turned on.
#12087 - tepples - Wed Oct 29, 2003 4:24 am
johnny_north wrote: |
Now when I compile this with the correct syntax:
Code: | static const u16 *cnectr = const_cast<u16*>(circle); |
I get the error
invalid static_cast from type `const unsigned char[27840]' to type `u16*' |
You may need to use a reinterpret_cast instead of a static_cast. As I understand it, a static_cast lets the compiler throw an error for anything but a cast to a class's parent, but a reinterpret_cast has almost exactly the same semantics as a C style cast.
Quote: |
instead, I think I'm getting this behavior as the window seems to be turned inside out:
Code: | const u16 circle[] = {209<<|0, 208<<8|1, 207<<|2... |
|
There are two things that could screw you up:- The GBA's ARM7TDMI processor is little-endian, which means values are stored with the least significant byte first; 0x12345678 is 0x78 0x56 0x34 0x12.
- The GBA's video chip expects the right side of the window in the least significant byte.
Therefore, put the right side of the window first.
Quote: |
Code: | else if(vcount==160){ |
doesn't seem to work for me if I have the vblank interrupt turned on. |
Probably means your vblank ISR runs longer than a scanline. This isn't always a bad thing; some NES games, such as Super Mario Bros., run entirely in a vblank ISR.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#12092 - sajiimori - Wed Oct 29, 2003 7:45 am
Aren't you guys forgetting that the array is declared const, and thus stored in ROM, so all discussions of const casting and whatnot are pointless?
BTW tepples, how the heck do you know so much about the implementation of all these console games? It's interesting stuff.
#12097 - tom - Wed Oct 29, 2003 1:08 pm
DekuTree64 wrote: |
Since HBlank occurrs at the end of the line, you actually need to offset your table entries by one, so the first entry is the WIN0H for the second scanlinem second entry for third scanline, and last entry (159) for scanline 0, so it sets it up for the next frame at the end of the current frame. |
Interesting, I never thought about doing it this way =)
I usually set the stuff for the first scanline manually (during vblank, where i set up the hdma transfer). Simple and works fine.
#12099 - tepples - Wed Oct 29, 2003 4:08 pm
sajiimori wrote: |
Aren't you guys forgetting that the array is declared const, and thus stored in ROM, so all discussions of const casting and whatnot are pointless? |
When feasible, I try to implement C or C++ programs on one platform such that a port to another platform is straightforward, and I encourage others to make similar design choices.
Quote: |
BTW tepples, how the heck do you know so much about the implementation of all these console games? It's interesting stuff. |
I helped in a disassembly project that in effect produced the source code for Super Mario Bros. I also traced a few NES games in nesticle, the old emulator that was decent for its time but which can be defeated by four lines of assembly code. (I now use primarily FCE Ultra to test NES programs.)
tom wrote: |
I usually set the stuff for the first scanline manually (during vblank, where i set up the hdma transfer). Simple and works fine. |
I used the method you describe in TOD. Unlike DekuTree's "put scanline 0 at the end of the array" method, the "update scanline 0 in vblank" method avoids a one-frame delay of scanline 0's data.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#12102 - DekuTree64 - Wed Oct 29, 2003 5:56 pm
tepples wrote: |
I used the method you describe in TOD. Unlike DekuTree's "put scanline 0 at the end of the array" method, the "update scanline 0 in vblank" method avoids a one-frame delay of scanline 0's data. |
Yeah, that's how I did it before too. The line 0 at the end of the table trick was just something I came up with a while back, but haven't actually had a chance to use yet. I guess it would only work for something that will stay the same every frame (good call, Tepples^_^). But still, the first entry to be copied by the DMA should be for the second scanline.
Also, about the window being inverted, the upper byte is the left side and the lower byte is the right side. Using a u8 table so you can access the left/right edges individually is a good idea if you go with the dynamic circle drawing function idea, since that will make checking if the current pixels of the circle are farther left/right than the current ones stored for each scanline easier.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku