gbadev.org forum archive

While writing that pixel plotter, I was remembering all the things that help me to determine if my code is doing all that it can. Here are a few of the guidelines that kept popping up in my head. Keep in mind that these are for use after your algorithm is deemed good (which the below may not be as I'm still coding this blind). My first attempt didn't work so it doesn't matter how fast it was :P
I'll repost the code here to make it easy for reference:

Code:

PlotPixel:
add r3,r1,r1,#lsl 8 @ Get x240 by 256 - 16
sub r3,r3,r1,#lsr 4 @ ...
movs r0,r0,#0,#lsr 1 @ x/2 + address and use carry for low/high
add r3,r3,r0 @ Add x to address
rsc r0,r0,#1 @ Use reverse subtract carry to reverse the carry for address read.
ldrb r0,[r3,r0] @ Read the byte (r0 is 0 or 1 for offset)
orrcs r0,r0,r2,#lsl 8 @ if Carry, move color value high
orrcc r0,r2,r0,#lsl 8 @ if not, color low, video value high
strh r0,[r3] @ store the result with base address only.

bx r14 @ return

1. How many instructions are using the built in shifter?
In the above, only 2 non load instructions are without shifting which tells me there are working hard.

2. How many branches are in the code?
In good ARM, you don't need to branch hardly at all. In the above, the 2 orr use conditional compiling to save time.

3. Are you using the mov instruction?
Mov is a very lazy opcode. All the other arithmetic opcodes have mov's functionality built in. That's one reason I believe the above has at least 1 more cycle to be had in the removal of the mov. It's only saving grace is that it is shifting and setting a flag at the same time.

4. Are you minimizing Memory accesses?
The above reads once and writes once. Hard to be that :)

5. Are you using 8-bit constants or shiftable 8-bit constants?
If not, you need an expensive load or funky shifting to get your values. The above only uses 0 and 1.

6. Please tell me you are avoiding globals?
Globals cost thrice - once for loading the address, once for loading the value, and 4 bytes of ROM for address storage you need to avoid.

7. Are you commenting your code so that you'll be able to read it tomorrow?
The above is barely commented enough. The section on the byte address stuff isn't very clear.

8. Are you avoiding pushes/pops?
While it's nice to use all the ARM registers, it takes time to push/pop when calling from C code. If you can, use only the safe registers. The above only uses r0-r3 (with r1 and r2 never changed)

Others who are experienced with ARM optimization, please chime in! I'm interested in comparing notes and seeing how many things I'm missing.

Mike

Optimizations aside, just glancing at your code there are a couple of things I don't understand... either errors or things that are different on the Nintendo assembler from GAS, I guess...

1. Prefixing the barrel-shift operation with a number-sign, and then not prefixing your constants. I don't think this works in GAS.

2. movs r0, r0, #0, #lsr 1 - this would definitely come up as having too many operands in GAS.

Anyway, as far as optimizations go, you've covered most of the bases... I'd also mention unrolling loops wherever it's practical to do so, using the block-load/store instructions wherever possible, and effective use of the flag-set option to reduce the need for the comparison commands.

I also like to see code that uses "logical" condition mnemonics (since many of them overlap in functionality) that match the documentation/comments. eg. if I am comparing two variables against each other, I like to use gt/ge/lt/le, as if I were using the equivalent C comparison operators.

I also like to see comments that describe the purpose of the code instead of just repeating what the code does. For example, from your code "Add x to address" is far more descriptive than "Add r0 to r3". At the same time documenting register use is often useful, so I might have commented it "r3 <- address + x". Consistent notation in comments is also important.

That's all I've got at the moment.

Dan.

For what it's worth, here's one in 9 instructions. Didn't test it on hardware, so I can't say for sure wether VRAM cares about the lower bit for storing, but it works fine in VBA.

Code:

PlotPixel:
tst r0, #1             @check odd/even pixel
rsb r1, r1, r1, LSL #4
add r1, r0, r1, LSL #4 @r1=x+(y*16-y)*16, or x+y*240
eor r1, r1, #1          @if setting odd, load even, else load odd
ldrb r0, [r3, r1]
orreq r0, r2, r0, LSL #8 @if even, put old in top
orrne r0, r0, r2, LSL #8 @if odd, put new in top
strh r0, [r3, r1]       @store, lower bit doesn't matter
bx    lr

As for optimization notes, one of my favorites, which is good for sound mixers, is using a single multiply to do 2 values, either by putting one sample in the bottom and another in the top 16 bits, and multiplying by volume, or for stereo, left volume in bottom and right volume in top, and multiplying a single sample by that.
Also, carefully ordering your globals in a struct so you can ldmia bunches of related ones at a time is a nice trick.

Another one I came up with for an S-buffer 3D engine I was scheming on a while back is using unrolled loops to render pixels to registers and stmia them, and when you need to break and load a new span, bl to the span loader so you can just bx lr back to where you were in the loop. Basically a regular function call, but with both functions knowing what the other is doing. That's another nice one, if you have a normal function that gets called a lot from a specific function, but is a little big to inline, you can just write a specific version for that caller so it knows exactly which regs are available and such.
And a nifty one to get around storing both a texture's size and bitmask for looping in each span is storing the size as its power of two, like 3 for an 8x8, 6 for 64x64, etc.. Then to generate the texture mask, all you have to do is set a register to 1 and do like rsb reg, 1, 1<<sizeShift

Can't think of many general optimizations aside from the ones already mentioned, but there's plenty of task-specific tricks out there.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

My experience and my ARM documentation say that writing an odd address with strh[] is undefined. Unless you are sure this works on hardware, I'd be very wary of that. That's what I'm having so much trouble with: put low bit in flags, reverse it, and clear it before strh. Losts of stuff to do before that last store. Still, this way is much more efficient than anding off a ldrh call :)

One more thing for the optimization list:

9. Like mov; cmp, tst, cmn, and any other test only opcodes are also very lazy. Though not quite as bad as mov, they can almost always be replaced by arithmetic routines setting flags while playing with data.

Miked0801 wrote:

My experience and my ARM documentation say that writing an odd address with strh[] is undefined. Unless you are sure this works on hardware, I'd be very wary of that.

It says undefined in the ARM ARM, but it's actually defined by the memory controller. The GBA's mem controller ignores the appropriate low bits on word and halfword writes (as it's the simplest thing you can do), so it will work perfectly. =)

Tested it on HW and it works fine.
Although I glanced past Flubba's post in the other thread earlier that day and got the idea for the rsb, but I didn't get a chance to read his whole function, and it turned out to do almost exactly the same thing. Guess that means it really is about as good as it gets. Still, the explicit tst seems like it could be squeezed into an arithmetic instruction somewhere.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

Agreed with some more use of unique indexing. I'll let the sit in the back of my head today and see if we can't get another cycle out of it (I think we can.)

Using the barrel shifter adds one internal cycle when used on normal logical ops (add, mov, orr etc) but I don't think it matters when used together with mem access (ldr & str).
I'm not sure how the writeback works though.
Can you make a speed test on the two different implemntations?
_________________
I probably suck, my not is a programmer.

FluBBa wrote:

Using the barrel shifter adds one internal cycle when used on normal logical ops (add, mov, orr etc)

Only if you are shifting by a register (ie. variable) amount. If you are shifting by a constant, it does not add any additional cycles.

Dan.

WARNING: I haven't written a line of production ARM ASM in my life. (In fact I just started reading this thread for inspiration on learning.)

r3 is the base for the buffer (or I'm hopelessly confused) - would that normally be initialized from an immediate value or from a memory copy? If it's from memory can we set initialize it to (base | 0x0001) and do this modified version of Deku's code?

Code:

PlotPixel:
tst r0, #1          @check odd/even pixel
rsb r1, r1, r1, LSL #4 @r1=y*15
eor r3, r3, r1          @r3=r3^x, (base + (x^1))
ldrb r0, [r3, r1, LSL #4]    @load from r3 + r1*16, (r3 + (y*240))
orreq r0, r2, r0, LSL #8 @if even, put old in top
orrne r0, r0, r2, LSL #8 @if odd, put new in top
strh r0, [r3, r1, LSL #4] @store, lower bit doesn't matter
bx    lr

<<EDIT: As you all probably know this doesn't work because I didn't realize that strh uses a different addressing mode. I'm leaving it up in case the eor idea inspires somebody, but I'll go back to the nice, safe C board now...>>>

- eli

gbadev.org forum archive

ASM > What to look for in well optimized ARM code

#17323 - Miked0801 - Sat Mar 06, 2004 10:05 am

#17330 - poslundc - Sat Mar 06, 2004 2:28 pm

#17377 - DekuTree64 - Sun Mar 07, 2004 1:39 am

#17397 - Miked0801 - Sun Mar 07, 2004 9:24 am

#17403 - torne - Sun Mar 07, 2004 1:10 pm

#17418 - DekuTree64 - Sun Mar 07, 2004 6:29 pm

#17424 - Miked0801 - Sun Mar 07, 2004 8:52 pm

#17485 - FluBBa - Tue Mar 09, 2004 11:11 am

#17494 - poslundc - Tue Mar 09, 2004 4:18 pm

#18480 - ecurtz - Sat Mar 27, 2004 9:30 am