#27883 - ImInABand - Fri Oct 22, 2004 11:39 pm
Is it possible to make a background for mode 4 that is larger than 240 x 160? If so, how?
#27884 - DekuTree64 - Fri Oct 22, 2004 11:54 pm
Nope, you're stuck with it. As far as I know, you can't wrap the mode4 BG either, and without wrapping the only way to scroll around a larger virtual map is to redraw the entire screen.
I haven't tested setting the wrapping bit in BG2CNT, but I'd seriously doubt it would change anything.
What did you have in mind to do? Most likely it would be much easier in the tiled modes. Mode4 is usually best only for drawing completely arbitrary shapes (3D, plasma, bump mapping, etc. demos).
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#27885 - ImInABand - Fri Oct 22, 2004 11:57 pm
That is the thing: I know nothing of the tiled modes, and I was unsure if I had to learn a different mode to make a map suitable for a platforming game.
Which tiled mode would you recommend, and what do you need to do to input tile data, tile palettes, and form together tile maps?
#27886 - ScottLininger - Sat Oct 23, 2004 12:00 am
Depends on how smooth you need the "scrolling" to be. You basically have to copy in the entire screen every frame while scrolling, which is slow.
My battleship game uses a larger BG in mode 4, but as you can see, the scrolling is jerky when you have music and stuff going on:
http://www.thingker.com/gba/BluetoothBattleship.zip
Though here's a version without music that is much smoother:
http://www.thingker.com/gba/bship_nomusic.zip
So it is very possible. It's just not desirable for every application, and obviously really large maps will make for really large ROMs.
Here's the function that draws in a portion of a larger graphic onto the screen:
Code: |
void DrawMap(int scrollX, int scrollY)
{
    int drawX, drawY;
    for(drawY = 0; drawY < 160; drawY++) {
        for(drawX = 0; drawX < 120; drawX++) {
            // videoBuffer and mapData are u16*, so each write covers two 8-bit pixels
            videoBuffer[drawY*120 + drawX] = mapData[(drawY+scrollY)*mapWidth + drawX+scrollX];
        }
    }
} |
One could probably get better performance out of this if you used DMA for each of the 160 "stripes" of pixels you need to copy rather than doing it a pixel (or really two pixels) at a time. But I've never tried that.
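For what it's worth, the per-stripe DMA idea might be sketched like this (hypothetical code, not Scott's -- the DMA3 register addresses are the standard GBA ones, and mapData/videoBuffer are assumed to be u16 pointers as in the function above):

```c
/* Sketch of the per-row DMA idea above: one DMA3 transfer per scanline
   instead of a pixel-pair at a time. Register addresses follow the usual
   GBA memory map; mapWidth is the virtual map width in u16 units. */
#define REG_DMA3SAD  (*(volatile unsigned int*)0x040000D4)  /* source      */
#define REG_DMA3DAD  (*(volatile unsigned int*)0x040000D8)  /* destination */
#define REG_DMA3CNT  (*(volatile unsigned int*)0x040000DC)  /* control     */
#define DMA_ENABLE   0x80000000
#define DMA_16       0x00000000   /* 16-bit transfer units */

void DrawMapDMA(const unsigned short* mapData, unsigned short* videoBuffer,
                int mapWidth, int scrollX, int scrollY)
{
    const unsigned short* src = mapData + scrollY*mapWidth + scrollX;
    unsigned short* dst = videoBuffer;
    int row;
    for(row = 0; row < 160; row++)
    {
        REG_DMA3SAD = (unsigned int)src;
        REG_DMA3DAD = (unsigned int)dst;
        REG_DMA3CNT = DMA_ENABLE | DMA_16 | 120;  /* 120 u16 = 240 pixels */
        src += mapWidth;   /* next source row */
        dst += 120;        /* next screen line */
    }
}
```

Note the pointer arithmetic also does the strength reduction Gene describes later in the thread: one multiply before the loop, one add per row.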
Hope that helps,
Scott
#27889 - tepples - Sat Oct 23, 2004 2:03 am
If you only need to scroll up and down, you can reset the Y scroll on a vcount trigger to simulate wrapping.
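tepples' one-liner packs in a lot, so here is one way it might be fleshed out (this sketch is my reading, not his code -- in mode 4 the bitmap is BG2, so the "Y scroll" is the BG2Y affine reference point, and an interrupt dispatcher that routes IRQ 2 to the handler is assumed):

```c
/* Hedged sketch of the vcount trick: trigger an IRQ when VCOUNT reaches
   the line where the source bitmap runs out, then reset the BG2 Y
   reference point so display continues from the top of the buffer,
   simulating vertical wrap in mode 4. */
#define REG_DISPSTAT (*(volatile unsigned short*)0x04000004)
#define REG_BG2Y     (*(volatile unsigned int*)0x0400002C)
#define REG_IE       (*(volatile unsigned short*)0x04000200)

void setupVcountWrap(int wrapLine)
{
    /* DISPSTAT bits 8-15 select the VCOUNT match line,
       bit 5 enables the VCOUNT-match IRQ */
    REG_DISPSTAT = (REG_DISPSTAT & 0x00FF) | (wrapLine << 8) | (1 << 5);
    REG_IE |= (1 << 2);  /* IRQ 2 = VCOUNT match */
}

void vcountHandler(void)
{
    REG_BG2Y = 0;  /* jump the affine Y reference back to the top */
}
```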
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#28019 - Gene Ostrowski - Tue Oct 26, 2004 5:47 am
Quote: |
One could probably get better performance out of this if you used DMA for each of the 160 "stripes" of pixels you need to copy rather than doing it a pixel (or really two pixels) at a time. But I've never tried that. |
Yes, using DMA for this makes a HUGE difference, at least in my experience doing this sort of thing. Oh yeah, getting rid of those horrid multiplies in the loop will help too-- before the loop starts determine the startaddress pixel with the one multiply and one add. Then, after each DMA simply add SRCMAPWIDTH pixels to the startaddress and avoid muls altogether.
I use DMA like this in cases where I want to copy the block for my "maps"
in tile mode from a larger megamap and don't want to deal with "strip-scrolling" or whatever the term is when you only copy the strip(s) that change... anyway, it's WAY faster than nested loops.
_________________
------------------
Gene Ostrowski
#28020 - sajiimori - Tue Oct 26, 2004 5:57 am
Quote: |
Oh yeah, getting rid of those horrid multiplies in the loop will help too |
I used to think the same thing, until I learned the ways of gcc -O3 -S. Try it out -- you might be surprised.
#28022 - Gene Ostrowski - Tue Oct 26, 2004 6:32 am
No, I'm not really surprised-- the gcc compiler actually does a very good job optimizing. I was at first, until I realized that the guys who wrote the optimizing portion of the compiler know a lot more about ARM architecture than I do.
I've just always found that I shouldn't rely on the compiler's optimization routines to take care of something that I should be doing. I feel that if you provide the compiler with "better input" it will be able to produce better optimizations, in general.
Good stuff in, good stuff out. Crap in, optimized crap out.
There's only so much a compiler can do to the code, without "knowing" what you were trying to do in the first place. I would be very surprised (and kind of scared), if the compiler was smart enough to replace all that address calculation code with the single offset add instruction and knew what offset it needed.
_________________
------------------
Gene Ostrowski
#28023 - sajiimori - Tue Oct 26, 2004 6:56 am
We have different ideas about what makes "crap" code. I think it's more important to write simply and clearly than to do by hand what the computer can do for you.
Quote: |
I would be very surprised (and kind of scared), if the compiler was smart enough to replace all that address calculation code with the single offset add instruction and knew what offset it needed.
|
Here's the inner loop on gcc 3.2.3 for x86 (I don't have an ARM compiler on me right now):
Code: |
L10:
movw (%ecx), %ax
addl $2, %ecx
movw %ax, (%edx)
addl $2, %edx
decl %ebx
jns L10
|
Scared yet?
#28047 - Gene Ostrowski - Tue Oct 26, 2004 4:08 pm
Heh. I, too, think it's more important to write simply and clearly. We can start to debate style again, but I'd write it more like this:
Code: |
src = &mapData[scrollY*MAPSIZEX + scrollX];
dst = videoBuffer;
for(drawY = 0; drawY < 160; drawY++)
{
    memcpy(dst, src, BYTESTOCOPY);  // copy a row
    src += MAPSIZEX;                // move source to next row
    dst += 120;                     // move dest to next line
}
|
I think mine is simpler and cleaner, since it's very clear exactly what the loop is doing. But again, it's style...
Plus, the fact that you even had an "inner" loop at all is ugly. It probably shifted the mul to the outer loop, because it detected that there were no dependencies, but I'd wager it's still being calculated in there somewhere. The compiler can't exactly determine the intent of what you wanted.
I'll bet the compiler would optimize my code more effectively.
_________________
------------------
Gene Ostrowski
#28052 - tepples - Tue Oct 26, 2004 4:21 pm
This conversion of inner-loop multiplication to differential addition is called "strength reduction" (as defined in FOLDOC), and any decent optimizer will try it to some extent. Given a function f, it involves computing a function g such that g(x, f(x)) = f(x + 1) and g is simpler than f.
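For a concrete (if trivial) instance of that definition: with f(x) = x*width, g(x, f(x)) = f(x) + width, so each loop iteration replaces a multiply with an add -- exactly what happens to the row-offset calculation in the code above. A tiny self-check (my illustration, not from the thread):

```c
/* Verifies that the strength-reduced sequence (repeated adds) matches
   the multiplied form f(x) = x*width at every step.
   Returns 1 if they agree for all x in [0, n), 0 otherwise. */
int strength_reduction_matches(int width, int n)
{
    int offset = 0;                    /* f(0) */
    for(int x = 0; x < n; x++)
    {
        if(offset != x * width)
            return 0;
        offset += width;               /* g(x, f(x)) = f(x) + width = f(x+1) */
    }
    return 1;
}
```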
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#28058 - Gene Ostrowski - Tue Oct 26, 2004 4:55 pm
Yes, Tepples, you are correct. Any decent compiler will attempt it to some extent.
But why have a compiler attempt something (and possibly give up), when it's unnecessary in the first place? Tepples, tsk tsk, you of all people know not to leave it to the compiler, otherwise why all the hand-optimizing ASM code in the other threads :) I'm just kidding, and hope that you are just making observations about compiler operations simply to stir the conversation...
It boils down to a "better" implementation. Feed a compiler a bubble sort and I'll poop myself if it "decides" that a quick sort would work better for your data, and so decides to assemble a quick sort instead. It's only going to do so much with what you give it.
By the way, my "crap in crap out" comment wasn't meant to degrade or insult the code, or offend anyone or their ideas. It's merely making a point. Scott's original post indicated that his scrolling was slow, and based on the code he provided, I gave him an alternative on how to code it so that the compiler would spit out code that ran it many times faster. All the -O3 -S options probably wouldn't improve it too much the way he has it now. I think that Scott would benefit from not "leaving it up to the compiler", as is being suggested. He'll be a better coder in the long run because of it.
I feel that once you begin to leave things up to the compiler, you start to get lazy on how you approach a problem. I've fallen into that trap many times myself, with all the Java, ColdFusion, ASP, and C++ stuff that I do these days in my professional life. It's good to "get back to the roots" with a nice embedded system like the GBA to make me think about these things (performance, memory footprints, implementations, etc.) again.
_________________
------------------
Gene Ostrowski
#28063 - keldon - Tue Oct 26, 2004 5:18 pm
Your limitation is merely the language and not the compiler. However some simple loops are simple enough for the compiler to give a decent optimisation.
And Gene, careful with those back-slashes, you might take someone's eye out =D dest looks better than dst. And uurgh, you've got the opening curly bracket on its own line; that, my friend, is uncool :)
On a serious note though people seem to cuddle their grey desktop systems like writing stuff on their OWN makes them any closer to it; when in fact they're becoming a victim of their own wonders. They're the same people who fall slave to what you call the 'Java Trap', etc. I code in ASM because at times it makes plenty more sense, but not because of speed. Nessie (ASM) runs slower than 1/4 the speed of NESTICLE (C++). However working in ASM allows you to do stuff you simply cannot represent in any current HLL.
#28083 - sajiimori - Tue Oct 26, 2004 7:46 pm
Gene, you have your "inner loop" in a separate function, so what? If the compiler inlined it, would that make your version ugly? What does it even mean to say that the compiler will optimize your version "better", since almost all the time is going to be spent in memcpy? And who cares about 1 mul per 160 pixels?
Quote: |
Scott's original post indicated that his scrolling was slow, and based on the code he provided, I gave him an alternative on how to code it so that the compiler would spit out code that ran it many times faster. |
The alternative is DMA, period. You got that part right in any case.
Quote: |
Feed a compiler a bubble sort and I'll poop myself if it "decides" that a quick sort would work better for your data, and so decides to assemble a quick sort instead. |
Use a language that has innate concepts of sorting and the compiler could very well do that. The real question is: should it make the change, if it thinks it can prove that one version will always be faster in a given case? Is it even possible to prove such a thing? Strength reduction and loop invariant optimization can both be objectively proven better, and that's why your analogy fails.
Quote: |
All the -O3 -S options probably wouldn't improve it too much the way he has it now. |
I showed you what -O3 does to his code, and it is obviously very different from a literal translation. So your point would be what? That compiling it with -O3 again won't make it better? And -S isn't even an optimization option -- it outputs an .s file.
Quote: |
I feel that once you begin to leave things up to the compiler, you start to get lazy on how you approach a problem. |
I'm not worried. I can't afford to be arbitrarily lazy when I have actual performance requirements.
#28089 - Gene Ostrowski - Tue Oct 26, 2004 9:20 pm
Quote: |
Gene, you have your "inner loop" in a seperate function, so what? If the compiler inlined it, would that make your version ugly? What does it even mean to say that the compiler will optimize your version "better", since almost all the time is going to be spent in memcpy? And who cares about 1 mul per 160 pixels? |
No, it was just a separate function for clarity's sake. But memcpy (on architectures that support it) is usually implemented as an opcode that handles multiple, sequential memory moves efficiently. By telling the compiler to use it, it will use the fastest possible block-memory move code it knows. Not necessarily so with the other code. Plus with my code, I can replace the memcpy with the DMA call and be done.
I care about 1 mul per 160 pixels. Not to mention the multitude of other cycles wasted in that code by letting the compiler decide how to do it. And if there's cycles wasted there, chances are cycles are wasted all over the place. It all adds up quickly.
Quote: |
Use a language that has innate concepts of sorting and the compiler could very well do that. The real question is: should it make the change, if it thinks it can prove that one version will always be faster in a given case? Is it even possible to prove such a thing? Strength reduction and loop invariant optimization can both be objectively proven better, and that's why your analogy fails. |
No, the analogy is to show "intent". Strength reduction and loop invariant optimizations are not the issue. You obviously have had training in CompSci, so you know it (the change) can't be proved, so of course no, it shouldn't make the change.
Quote: |
I showed you what -O3 does to his code, and it is obviously very different from a literal translation. So your point would be what? That compiling it with -O3 again won't make it better? And -S isn't even an optimization option -- it outputs an .s file. |
My point is: look at the functions as a whole, look at both disassemblies, and time them both. Not just some inner loop. As far as the -O3 -S options? I was merely quoting you on what you suggested in your earlier response.
Scott, if you are following this thread, make the change and recompile and see how much improvement there is.
Perhaps you never spent days on end hand-coding ASM routines to optimize them beyond anything the compiler could ever produce. Perhaps if you had, you'd have a better understanding of what I'm talking about.
We are obviously debating style here-- you are comfortable leaving it to the compiler, I am not. I doubt at this point either one of us will be swayed.
Oh well, definitely interesting discussion, tho.
_________________
------------------
Gene Ostrowski
#28092 - ScottLininger - Tue Oct 26, 2004 9:32 pm
Gene Ostrowski wrote: |
Scott, if you are following this thread, make the change and recompile and see how much improvement there is. |
Of course I'm following the thread. :)
I wrote that code over a year ago during my first real GBA project. If I were to bother "optimizing" it, it would be by building my project over again in Mode 0. Not worth it. I've got bigger performance fish to fry, these days.
I do have some Mode4 drawing routines that might benefit from the discussion above. I'll check it all out. Thanks, guys!
-Scott
#28095 - poslundc - Tue Oct 26, 2004 9:44 pm
Gene Ostrowski wrote: |
I care about 1 mul per 160 pixels. Not to mention the multitude of other cycles wasted in that code by letting the compiler decide how to do it. And if there's cycles wasted there, chances are cycles are wasted all over the place. It all adds up quickly. |
You want your code to be sculpted, and that's cool. I dig that.
You don't want to be smart about where you do it, though. For that, I'd fire your ass, if I was in any position of respectability and not just a low-level code monkey myself.
Yes, wasted cycles add up. That's why when we need the speed, we optimize the code that's wasting the most time. That's why we profile our code. Anything else is unnecessary, so from a perspective of getting stuff done quicker, why do it?
Quote: |
Perhaps you never spent days on end hand-coding ASM routines to optimize them beyond anything the compiler could ever produce. Perhaps if you had, you'd have a better understanding of what I'm talking about. |
I know I've done it.
Then I've discovered that after all that effort... the processor is actually spending 90% of its time in some inner loop somewhere else.
Egg on my face.
Dan.
#28100 - Gene Ostrowski - Tue Oct 26, 2004 10:14 pm
Quote: |
You don't want to be smart about where you do it, though. For that, I'd fire your ass, if I was in any position of respectability and not just a low-level code monkey myself. |
Eeek. Fired for writing an efficient block of code out-of-the box instead of a poor one? I spent about thirty seconds to think about that routine. A significant speedup for thirty seconds of thinking is worth it to me. And apparently to my boss as well.
I didn't say I'd spend a week hand-coding that routine in ASM if it was only going to be called once at the beginning of the program. Of course I profile my code, and only optimize it where it needs to be optimized. Not smart about it? Am I having an out of body experience here?
Quote: |
I know I've done it.
Then I've discovered that after all that effort... the processor is actually spending 90% of its time in some inner loop somewhere else.
Egg on my face. |
Hello? And I'm being fired for the 30 seconds it took to speed up the part of his code that he already indicated was slow? Perhaps you forgot to follow your own advice and profile the code first?
I'm actually quite surprised about the comments in this post, coming from you. Shame on you.
Shame on me for even responding. I'll just assume that you were tired when you replied to this...
_________________
------------------
Gene Ostrowski
#28104 - sajiimori - Tue Oct 26, 2004 10:43 pm
Gene, get over yourself. Ok, I'm done so have your last word if you want it.
#28106 - keldon - Tue Oct 26, 2004 10:49 pm
my popcorn's getting cold; it's like watching Tyson in his prime =D
#28107 - poslundc - Tue Oct 26, 2004 10:49 pm
Gene Ostrowski wrote: |
Quote: | You don't want to be smart about where you do it, though. For that, I'd fire your ass, if I was in any position of respectability and not just a low-level code monkey myself. |
Eeek. Fired for writing an efficient block of code out-of-the box instead of a poor one? I spent about thirty seconds to think about that routine. A significant speedup for thirty seconds of thinking is worth it to me. And apparently to my boss as well. |
Most bosses want to optimize programmer time, not excess cycles. The extra cycles that add up make absolutely no difference* so long as you stay beneath the threshold of your program's refresh rate. It's those 30 seconds that really add up over time.
There are other big time-related factors to consider as well. Readability. Portability. Encapsulation. Verifiability. These are all much more valuable things to most bosses than cranking out extra speed that doesn't make a difference in the end product.
* - You can optimize to preserve battery life, I suppose, although this is a secondary concern in most apps.
Quote: |
Quote: | I know I've done it.
Then I've discovered that after all that effort... the processor is actually spending 90% of its time in some inner loop somewhere else.
Egg on my face. |
Hello? And I'm being fired for the 30-seconds it took to speed up the part of his code that he already indicated was slow? Perhaps you forgot to follow your own advise and profile the code first?
I'm actually quite surprised about the comments in this post, coming from you. Shame on you. |
I did. But I was 13 or 14 at the time and learned my lesson then.
The point is: if you've already optimized everything then there is nothing to profile, it will either be fast enough or it won't. The idea is to optimize as little as possible, only where it's necessary. Your time is better spent elsewhere. This has been the rule since the invention of the compiler, and it's hard to deny the simple logic to it.
I totally understand why you would want to write good code from the getgo. But when you say you care about 1 mul per 160 pixels, or the arbitrary generic actions of the compiler, I'm telling you that you're caring about the wrong thing. You tell me that you only optimize where it needs to be optimized, but that does not reconcile with these statements.
Quote: |
I'll just assume that you were tired when you replied to this... |
Usually a safe assumption, although my sparkling wit manages to persevere somehow.
Dan.
#28119 - Gene Ostrowski - Wed Oct 27, 2004 12:41 am
Lol. I think now you're just arguing for argument's sake.
You yell at me, nay, fire me for optimizing in the wrong place.
Uumm, I haven't optimized anything yet!
All I did was write the routine in such a way that the compiler will be able to generate some decent code. In fact, it may be good enough that I don't even have to waste any time later trying to optimize it.
That thirty seconds of code improvement would have probably saved Scott many hours of time, as his test cycle turnaround would have been improved while he played with new features in his code. If one of my programmer contractors came to me with something like this, I'd probably give him my parking space for a month.
Good Lord, this has gotten way out of hand.
I give up. You guys "win". I apparently have no idea how to develop software effectively-- apparently the last twenty years of my life have just been a fluke, and I've gotten by on blind luck.
[Goes into corner and cries like a baby.]
Sigh.
_________________
------------------
Gene Ostrowski
#28128 - keldon - Wed Oct 27, 2004 2:23 am
could you kindly pick my lottery numbers please kind sir =D
#28131 - tepples - Wed Oct 27, 2004 3:40 am
Gene Ostrowski wrote: |
But why have a compiler attempt something (and possibly give up), when it's unnecessary in the first place? Tepples, tsk tsk, you of all people know not to leave it to the compiler, otherwise why all the hand-optimizing ASM code in the other threads :) I'm just kidding, and hope that you are just making observations about compiler operations simply to stir the conversation... |
Of course I was. I just wanted people to understand the theory of what the optimizer was doing to the code. In practice, a couple points separate gcc -O3 from hand-optimized asm code:
- The undecidability of the Halting Problem implies that no algorithm can determine when every algorithm has been optimally strength-reduced. A compiler can't always find that simpler function g where g(x, f(x)) = f(x+1), but some compilers are better than others at that.
- Unlike assembly language, which has provided overflow flags for signed and unsigned addition on many architectures since the 6502, C provides no way to act on whether or not an integer addition has overflowed the result data type. Some compilers are better than others at inferring when to use the carry flag.
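In C that information has to be reconstructed by hand, e.g. (my illustration, not tepples'): unsigned addition wraps modulo 2^N, so a wrapped sum is smaller than either operand, and a good compiler may turn this comparison into a direct read of the carry flag.

```c
/* Detect unsigned overflow without access to a carry flag:
   the sum wrapped if and only if it is smaller than an operand. */
unsigned int add_with_carry(unsigned int a, unsigned int b, int* carry)
{
    unsigned int sum = a + b;   /* well-defined: unsigned wraps mod 2^32 */
    *carry = (sum < a);         /* wrapped iff sum < a */
    return sum;
}
```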
keldon wrote: |
Nessie (ASM) runs slower than 1/4 the speed of NESTICLE (C++). |
That's because Nesticle's CPU and PPU engines were coded with some bullsh*t optimizations that adversely affected emulation accuracy, caused in part by a lack of knowledge of the exact behavior of the NES hardware. For a more fair comparison, try VisualBoyAdvance (C++) vs. no$gba freeware (x86 asm).
sajiimori wrote: |
And -S isn't even an optimization option -- it outputs an .s file. |
It outputs a .s file, but a .s file is the first step toward 1. seeing what gcc -O3 is doing register-wise with your C program's inner loops and 2. if necessary, providing a starting point for hand-optimization when your carry flag instinct says so.
poslundc wrote: |
Then I've discovered that after all that effort... the processor is actually spending 90% of its time in some inner loop somewhere else. |
To keep egg off my face, I like to use palette writes to gauge how much CPU time is spent in each function. I have used this technique when optimizing a port of the "Toast" GSM 06.10 decoder.
poslundc wrote: |
You can optimize to preserve battery life, I suppose, although this is a secondary concern in most apps. |
Secondary my gluteus maximus. Have you read the PlayStation Portable lot check guidelines? Sony's fighting to keep the egg of Sega's handhelds off its face, and battery life is considered even more important than use of any 3D graphics. Even on the GBA, some players have noticed that sometimes the battery light will be red in one game and green in another.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#28132 - sajiimori - Wed Oct 27, 2004 4:20 am
Quote: |
You can optimize to preserve battery life, I suppose, although this is a secondary concern in most apps. |
I agree with tepples, but your point is important anyway.
It might be better to say: when you have a limited amount of time to spend on optimization, you should make a conscious effort to focus on the bottlenecks because, all things being equal, spending an hour on a bottleneck will recover more resources than spending that hour elsewhere. So, paradoxically, by not wasting time optimizing most of your code, you end up with better performance in the end.
Again, that only applies if you have limited programmer hours.
#28134 - Datch - Wed Oct 27, 2004 4:53 am
tepples wrote: |
To keep egg off my face, I like to use palette writes to gauge how much CPU time is spent in each function. I have used this technique when optimizing a port of the "Toast" GSM 06.10 decoder. |
Excuse me if I'm off topic, but what is this "palette writes" trick? Am I the only one who doesn't catch what you're talking about?
_________________
Visit SR-388 now and get a 40% discount!
#28137 - tepples - Wed Oct 27, 2004 6:01 am
It's a profiling technique. Changing the first entry of palette RAM changes the background color in real time, with each scanline representing 1232 cycles. If you change the background color or the fade level at various points in processing, you can get a sense of how much time each subroutine takes compared to the others.
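In code, the trick might look something like this (a sketch based on tepples' description; the palette RAM address is the standard one, and the stage functions and color choices are hypothetical placeholders of mine):

```c
/* Palette-write profiling: writes to palette entry 0 (the backdrop)
   take effect immediately, so each colored band on screen shows how
   many scanlines (1232 cycles each) a stage of the frame took. */
#define BACKDROP (*(volatile unsigned short*)0x05000000)
#define RGB15(r,g,b) ((r) | ((g)<<5) | ((b)<<10))

extern void updateAI(void);      /* hypothetical per-frame stages */
extern void mixAudio(void);
extern void waitForVBlank(void);

void gameLoop(void)
{
    for(;;)
    {
        BACKDROP = RGB15(31, 0, 0);   /* red band = time spent in AI    */
        updateAI();
        BACKDROP = RGB15(0, 31, 0);   /* green band = time in the mixer */
        mixAudio();
        BACKDROP = RGB15(0, 0, 0);    /* black = idle until next vblank */
        waitForVBlank();
    }
}
```

The height of each band, read against the 160 visible scanlines, gives a rough percentage of frame time per stage with zero instrumentation overhead beyond one store.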
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.