gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

DS development > interlockin' (ARM9 pipeline optimizations)

#139576 - DekuTree64 - Fri Sep 07, 2007 6:23 pm

In this post, kusma wrote:
...and there you lost the single-cycle beauty of SMLAWx: it's only single-cycle if you don't use the result in the next instruction :)

Oops, nice catch :)
The first 2 are ok, since the accumulate term isn't accessed until the second cycle, but the last 2 will indeed interlock. I think it would be good to get it working using this version though, then we can go and reorder things to avoid that. r12 and r14 are still unused, so it should be pretty easy.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#139582 - simonjhall - Fri Sep 07, 2007 7:40 pm

I can't remember who told me, but I thought that in the DS implementation of the ARM ISA, instructions didn't have any latency? As in the units within the chip weren't pipelined - for instance, if you did add $1 $2 $3, $1 would be ready to use in the next instruction without any hazards (although it wouldn't necessarily be in the next clock cycle).

If you can't use the result of a SMLAWx in the next instruction, does this mean that other instructions have similar latencies? How about loads and stores? I know that they take a long time, but am I going to gain anything by pipelining loads and stores?

</off topic>
_________________
Big thanks to everyone who donated for Quake2

#139586 - tepples - Fri Sep 07, 2007 8:29 pm

simonjhall wrote:
I can't remember who told me, but I thought that in the DS implementation of the ARM ISA, instructions didn't have any latency?

That was true of the ARM7. The ARM9 is a more complicated beast, and it likely has interlocks so that it can speed up the average case while maintaining correctness and not introducing "delay slot" semantics into the architecture like on MIPS.

Quote:
As in the units within the chip weren't pipelined - for instance, if you did add $1 $2 $3, $1 would be ready to use in the next instruction without any hazards (although it wouldn't necessarily be in the next clock cycle).

Add is also one of the "16 simple ALU operations".

Quote:
If you can't use the result of a SMLAWx in the next instruction, does this mean that other instructions have similar latencies?

A multiply-add is much more complicated than an add.

Quote:
How about loads and stores? I know that they take a long time, but am I going to gain anything by pipelining loads and stores?

All I can suggest here is profile it and see. I used to do this a lot back in the GBA days when I had more time to spare (didn't have to entertain little cousins, didn't have a housemate's talk radio eating up my concentration, etc.).
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#139589 - DekuTree64 - Fri Sep 07, 2007 8:44 pm

Tepples: Maybe split this to a separate topic to discuss ARM9 cycle times?

simonjhall wrote:
I can't remember who told me, but I thought that in the DS implementation of the ARM ISA, instructions didn't have any latency?

AFAIK, it's a feature of the ARM9 core, and can't be changed.

Quote:
If you can't use the result of a SMLAWx in the next instruction, does this mean that other instructions have similar latencies? How about loads and stores?

Yep. Loads take 3 cycles, but are effectively single-cycle if you don't use the result right away. I ran some tests on this before to verify it. I can't remember how it interacts with waitstates though...

Since a store has no result, I think it's pretty much always single-cycle, at least when using the write buffer, or if the address being written to is in the cache or DTCM. It may stall if it actually has to wait for memory to finish writing, though.
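
To make the load-use point above concrete, here is a toy cycle counter in C. It is only a sketch under stated assumptions: the `Insn` struct, `count_cycles`, and the two helper sequences are made-up names, the 3-cycle load latency is simply the figure quoted in this post rather than a number taken from ARM's documentation, and real ARM946E-S timing also depends on cache, TCM and waitstates, all of which this model ignores.

```c
#define NUM_REGS     16
#define NO_REG       (-1)
#define LOAD_LATENCY 3   /* assumption: the "3 cycles" figure from the post */
#define ALU_LATENCY  1   /* assumption: simple ALU results usable next cycle */

/* One instruction in a toy in-order model: a destination register, up to two
   source registers, and the number of cycles after issue before the result
   can be read by a later instruction. */
typedef struct {
    int dest;
    int src1;
    int src2;
    int result_latency;
} Insn;

/* Returns the total cycle count: one issue slot per instruction plus any
   stall cycles spent waiting for a source register to become ready. */
static int count_cycles(const Insn *code, int n)
{
    int ready[NUM_REGS] = {0};   /* cycle at which each register is readable */
    int cycle = 0;               /* next free issue slot */

    for (int i = 0; i < n; i++) {
        int issue = cycle;
        if (code[i].src1 != NO_REG && ready[code[i].src1] > issue)
            issue = ready[code[i].src1];   /* interlock on first source */
        if (code[i].src2 != NO_REG && ready[code[i].src2] > issue)
            issue = ready[code[i].src2];   /* interlock on second source */
        if (code[i].dest != NO_REG)
            ready[code[i].dest] = issue + code[i].result_latency;
        cycle = issue + 1;
    }
    return cycle;
}

/* Load into r0, then use r0 in the very next instruction. */
static int cycles_load_then_use(void)
{
    const Insn code[] = {
        { 0, 1, NO_REG, LOAD_LATENCY },  /* ldr r0, [r1]     */
        { 2, 0, 3,      ALU_LATENCY  },  /* add r2, r0, r3   (waits for r0) */
        { 4, 5, 6,      ALU_LATENCY  },  /* add r4, r5, r6   */
        { 7, 8, 9,      ALU_LATENCY  },  /* add r7, r8, r9   */
    };
    return count_cycles(code, 4);
}

/* Same four instructions, with the independent adds moved between the
   load and the instruction that consumes r0. */
static int cycles_load_scheduled(void)
{
    const Insn code[] = {
        { 0, 1, NO_REG, LOAD_LATENCY },  /* ldr r0, [r1]     */
        { 4, 5, 6,      ALU_LATENCY  },  /* add r4, r5, r6   */
        { 7, 8, 9,      ALU_LATENCY  },  /* add r7, r8, r9   */
        { 2, 0, 3,      ALU_LATENCY  },  /* add r2, r0, r3   (r0 is ready) */
    };
    return count_cycles(code, 4);
}
```

In this model the first ordering costs 6 cycles and the second costs 4, which is the sense in which a load is "effectively single-cycle" once enough independent work sits between it and its first use. The same struct can describe a multiply-accumulate by giving it a longer `result_latency`, but any specific number there would again be an assumption to check against the core's manual or a real timing test.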

#139592 - simonjhall - Fri Sep 07, 2007 9:41 pm

tepples wrote:
simonjhall wrote:
I can't remember who told me, but I thought that in the DS implementation of the ARM ISA, instructions didn't have any latency?

That was true of the ARM7. The ARM9 is a more complicated beast, and it likely has interlocks so that it can speed up the average case while maintaining correctness and not introducing "delay slot" semantics into the architecture like on MIPS.

Ah, OK. I remember documentation about the ARM9 saying that it has a five-stage pipeline (versus three stages on the ARM7), but I've never heard anyone talking about pipelining code in order to avoid stalls.

Also, if you look at the assembled output of a program, at first glance it seems that there's no pipelining of the code going on. If there are indeed RAW hazards from loads, is there any chance of a compiler switch that lets you set how long you expect a read to take? (That would allow the compiler to schedule a big gap until the result of the load is used, e.g. for slow memory.)

So if you say that add is one of the 16 fast operations, is there a list somewhere saying what takes how long?

EDIT: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0240b/ch01s01s01.html and http://infocenter.arm.com/help/topic/com.arm.doc.ddi0240b/DDI0240A.pdf

#139604 - tepples - Sat Sep 08, 2007 1:57 am

simonjhall wrote:
Also, if you look at the assembled output of a program, at first glance it seems that there's no pipelining of the code going on. If there are indeed RAW hazards from loads, is there any chance of a compiler switch that lets you set how long you expect a read to take? (That would allow the compiler to schedule a big gap until the result of the load is used, e.g. for slow memory.)

You could start with the optimizer section of the GCC 4.1.x manual. The option -fschedule-insns2 appears to do what you ask for; -O2 is supposed to enable this. (-O3 makes inlining too aggressive for my tastes.) Adding -fweb -frename-registers might allow the use of more registers to hide load delays if your algorithm is memory-bound rather than register-bound.

How do you have -march and -mtune set up? The makefile from the devkitARM DS templates recommends -march=armv5te -mtune=arm946e-s.
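
As a rough illustration of what those scheduling flags have to work with, here is a small C loop written so that each pair of loads is issued one iteration before the values are used. Assumptions: `dot_product` and the file name `dot.c` are invented for the example, the toolchain prefix in the comment is a guess at a typical install, and the compiler is free to reorder the loop anyway, so whether hand-ordering like this helps is something to profile rather than take on faith.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build line using the flags mentioned above
 * (the arm-none-eabi- prefix is an assumption about the toolchain):
 *
 *   arm-none-eabi-gcc -O2 -march=armv5te -mtune=arm946e-s \
 *       -fschedule-insns2 -fweb -frename-registers -c dot.c
 */

/* Sum of products of two 16-bit arrays. The values for iteration i are
   loaded while the multiply-add for iteration i-1 is still pending, so the
   source code itself never uses a loaded value in the very next statement. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;

    if (n == 0)
        return 0;

    int16_t x = a[0];
    int16_t y = b[0];

    for (size_t i = 1; i < n; i++) {
        int16_t next_x = a[i];      /* start the next loads early */
        int16_t next_y = b[i];
        sum += (int32_t)x * y;      /* consume the previous iteration's loads */
        x = next_x;
        y = next_y;
    }

    sum += (int32_t)x * y;          /* last pair, loaded in the final iteration */
    return sum;
}
```

Comparing the assembly from `-O2` alone against the full flag set (for example with `-S`) is a quick way to see whether the extra options actually move loads away from their uses for a given function.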

Quote:
So if you say that add is one of the 16 fast operations, is there a list somewhere saying what takes how long?

I was referring to the Thumb type 4 instructions and their ARM counterparts.