gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > Auto Optimisation

#171615 - Azenris - Sat Dec 05, 2009 2:57 am

I can't remember if I read this or if it's just something I made up in my head that's completely not true!

When compiling, do the divides and multiplies in your program automatically get changed into bit shifts where possible?

Anyway, just asking; I can't remember if/where I read it.
_________________
My Homebrew Games

#171616 - Rajveer - Sat Dec 05, 2009 3:52 am

Yeah, I remember reading somewhere on these forums that divides/multiplications by a power of 2 get turned into a bit shift by the compiler. Not sure though.

#171617 - Azenris - Sat Dec 05, 2009 3:57 am

Ahh ok thanks.

Yeah, I think I read it here too, somewhere.
_________________
My Homebrew Games

#171618 - Dwedit - Sat Dec 05, 2009 4:25 am

Division by a constant is optimized, division by a variable is not.
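
A minimal C sketch of the difference (illustrative, not from the original post; function names are made up):

Code:
    /* Division by a constant power of two: at -O1 and above, GCC can
       reduce this to a single logical shift right. */
    unsigned div_by8(unsigned x) { return x / 8; }

    /* Division by a non-power-of-two constant is also optimized, via a
       multiply-by-magic-reciprocal sequence instead of a shift. */
    unsigned div_by10(unsigned x) { return x / 10; }

    /* Division by a variable: the divisor isn't known at compile time,
       so on the DS's ARM cores (no divide instruction) this becomes a
       call to a runtime routine such as __aeabi_uidiv. */
    unsigned div_var(unsigned x, unsigned y) { return x / y; }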
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#171623 - ninjalj - Sat Dec 05, 2009 12:12 pm

Rajveer wrote:
Yeah, I remember reading somewhere on these forums that divides/multiplications by a power of 2 get turned into a bit shift by the compiler. Not sure though.


On ARM, multiplication by 2^n-1 and 2^n+1 should also be changed to add/rsb instructions (IIRC), and on x86, anything that is synthesizable with LEA.
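
A quick sketch of what that looks like (illustrative; the commented assembly is the kind of code GCC typically emits at -O2):

Code:
    /* 9 = 2^3 + 1: a single add with a shifted operand. */
    int mul9(int x) { return x * 9; }   /* ARM: add r0, r0, r0, lsl #3 */
                                        /* x86: lea eax, [rdi+rdi*8]   */

    /* 7 = 2^3 - 1: a single reverse-subtract with a shifted operand. */
    int mul7(int x) { return x * 7; }   /* ARM: rsb r0, r0, r0, lsl #3 */
                                        /* x86: lea eax, [rdi*8]
                                                sub eax, edi            */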

#171627 - Abcd1234 - Sat Dec 05, 2009 11:54 pm

Azenris wrote:
I can't remember if I read this or if it's just something I made up in my head that's completely not true!

When compiling, do the divides and multiplies in your program automatically get changed into bit shifts where possible?

Anyway, just asking; I can't remember if/where I read it.


Yes, that's called strength reduction, and with gcc, traditionally it's included in the set of optimizations enabled with -O2.

#171628 - sajiimori - Sun Dec 06, 2009 1:00 am

Also, even when the compiler optimizes division by a constant power of 2, it's only a single shift if the left-hand side is an unsigned value. If it's signed, extra opcodes are generated to get the rounding towards zero right.
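
A sketch of the difference (illustrative; compile with something like arm-eabi-gcc -O2 -S to see the real output):

Code:
    /* Unsigned: one logical shift right. */
    unsigned udiv4(unsigned x) { return x / 4; }
    /*  mov r0, r0, lsr #2 */

    /* Signed: C requires rounding towards zero, but an arithmetic shift
       rounds towards negative infinity, so the compiler must first bias
       negative values - roughly: */
    int sdiv4(int x) { return x / 4; }
    /*  mov r3, r0, asr #31      @ r3 = all ones if x < 0, else 0
        add r0, r0, r3, lsr #30  @ adds 3 to negative x, 0 otherwise
        mov r0, r0, asr #2 */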

#171629 - Quirky - Sun Dec 06, 2009 5:26 pm

You might find this article interesting: "state of the art C compiler optimization tricks", http://lambda-the-ultimate.org/node/3674

And here is the PDF on Google Docs.

It focuses on x86, but arm-eabi-gcc gives similar results with a few of the examples when I tried them.

#171637 - sajiimori - Mon Dec 07, 2009 9:03 pm

Really interesting doc! But I wouldn't give it to a junior programmer without discussing each point with them. Almost every statement in the presentation can very easily be misinterpreted or overgeneralized.

Case in point, the caveat I already mentioned (about signed integer strength reduction) is omitted from the PDF. It just declares an unsigned int and says it's fast, but doesn't mention that it's slower for signed values.

It's very possible that the finer points were noted by the speaker, without being on the slides.

#171638 - Abcd1234 - Mon Dec 07, 2009 9:23 pm

sajiimori wrote:
Really interesting doc! But I wouldn't give it to a junior programmer without discussing each point with them. Almost every statement in the presentation can very easily be misinterpreted or overgeneralized.

Case in point, the caveat I already mentioned (about signed integer strength reduction) is omitted from the PDF. It just declares an unsigned int and says it's fast, but doesn't mention that it's slower for signed values.

It's very possible that the finer points were noted by the speaker, without being on the slides.


While that's certainly true, I think the general message is simple and accurate: don't micro-optimize unless you've profiled the code, both before and after your optimization. Odds are that a) the compiler will do a better job than you, since it's surprisingly clever and will likely take advantage of details of the underlying architecture that you aren't even aware of (the Opteron branch-prediction workaround is a great example), and b) if you do try to be clever, you might just end up tricking the compiler into generating *less* optimal code.

#171642 - Exophase - Mon Dec 07, 2009 11:18 pm

GCC isn't nearly as good on ARM as it is on x86. Beating GCC on ARM is easy (IMO, anyway). It's been improving over the years, but still pretty slowly.

Compilers may know tricks that you don't, and they might be better at scheduling for complex architectures than you are. But there are a lot of things they're just not as good at, generally. And no matter how good a compiler is, it won't have the degree of insight into a problem that the programmer often will. Runtime profiling can help, but only so much, and really it's a pain in the ass.

Some optimizations are just easier to write in ASM. Sometimes you can coerce the compiler into generating what you want, but at great difficulty and at the expense of code clarity or other optimizations. For instance, to get GCC to emit a sibcall so I could get tail recursion without my stack blowing up (yes, relying on an optimization, but never mind that right now), I had to put the body of the function in another, non-static function so it didn't get inlined. Otherwise GCC would fail to match the stack frames because of the locals, despite the fact that the locals are all dead by the time the sibcall is made. GCC could be smarter than this, but the fact is that it isn't. This was on x86, for what it's worth - I don't usually look at the ASM generated on x86, but sometimes I have to. On ARM I look at it all the time.
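
A hypothetical reconstruction of that shape (names and details made up; the point is that the recursive function's frame holds no locals):

Code:
    struct node { struct node *next; int value; };

    /* Non-static so, per the trick described above, GCC keeps it out of
       line and the locals stay out of the recursive function's frame. */
    int visit_one(struct node *n)
    {
        int scratch[32];            /* the kind of locals that blocked the sibcall */
        scratch[0] = n->value;
        return scratch[0] < 0;
    }

    int visit_all(struct node *n)
    {
        if (!n)
            return 0;
        if (visit_one(n))
            return -1;
        return visit_all(n->next);  /* tail position: GCC at -O2 can emit this
                                       as a sibcall, so deep lists don't blow
                                       the stack */
    }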

I think my main beef with this article is precisely that it seems to assume all coding is done on cutting-edge, out-of-order, deeply superscalar, deeply pipelined CPUs, when all sorts of coding is done on simpler embedded CPUs, and it's exactly these CPUs that NEED close attention to optimization detail. The ARM9 in the DS, for instance, has relatively simple static scheduling, so the compiler really doesn't have an advantage over an ARM coder worth much of anything. On the other hand, the compiler isn't that amazing at folding memory increments, or even at optimal register allocation and various platform-specific tricks.
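
For reference, "folding memory increments" means using ARM's post-increment addressing modes; a small sketch:

Code:
    /* Each *src++ / *dst++ can fold into a single load/store with
       post-increment, e.g.  ldr r3, [r1], #4  then  str r3, [r0], #4.
       A compiler that misses the fold emits a separate add per pointer. */
    void copy_words(unsigned *dst, const unsigned *src, int n)
    {
        while (n--)
            *dst++ = *src++;
    }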

I do agree with the article that the memory hierarchy is very important, but again, it's exaggerated here for really high-MHz platforms with deep memory hierarchies that quickly hit large cycle-count latencies. It matters on the DS, especially with only a little L1 cache, but main memory just isn't as slow, relatively speaking, as it is on modern x86. To say that this kind of optimization is all that matters anymore is just insane. You can still make huge wins shaving off cycles here and there in inner loops, purely in non-memory code.

Finally, I think the dogma of "profile before doing anything" is overstated. Here I take profiling to mean using an automated method of finding which pieces of code are hit most often. The process can itself perturb the program and skew the results. I'm not referring to benchmarking in general, which is always useful. As general advice to a general audience, "always profile" is probably good, because I think a lot of people don't have enough intuition to reliably know which parts of their program need optimization, but as far as I'm concerned, some things are just going to be dead obvious.

Basically, I think people need to be ready for anything and willing to examine problems on a case-by-case basis. There has long been a backlash against the (incorrect) attitude that dropping down to ASM will always result in better code than what a compiler generates - and what has replaced it is the equally incorrect attitude that the compiler is always perfect and you're wasting your time trying to beat it. But yes, actually checking performance before and after doing anything is a no-brainer.

#171644 - sajiimori - Tue Dec 08, 2009 12:47 am

Well said. =)