gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Constant values break assembly!

#83648 - devmelon - Wed May 17, 2006 5:58 pm

Hello everyone.

I've just picked up ARM coding and I like it, though it puzzles me at times. This is one of them times. When I wrote my program, I came to understand that not all constants are valid to use. For example, if I want to move 0x00000415 to some register, I'd write:
Code:
mov r8, #0x415

But for some strange reason, that doesn't work!
I get the error message Error: invalid constant -- `mov r8,#0x415'.
However, this code works just fine!
Code:
mov r8,     #0x400
orr r8, r8, #0x010
orr r8, r8, #0x005

I tested around a little and thought maybe you can only set a nibble at a time but then I saw some other constants that had different nibbles set, and that was working. Does anyone know what's up? Is it the assembler, me, or something truly magical happening here? Are there papers on constant formats that mov takes? How can I reduce three lines into one? :p

Thanks in advance.

#83652 - DekuTree64 - Wed May 17, 2006 6:10 pm

Constants are stored as an 8-bit value and a shift amount (actually I think it's a rotate right amount, but same concept). So something like 0xff000 is fine, because it can be stored as 0xff << 12. But then 0xff800 does not work, because it would be stored as 0x1ff << 11, but 0x1ff doesn't fit in 8 bits.

Check out the ARM7TDMI data sheet for all the info you need on the instruction set.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#83655 - poslundc - Wed May 17, 2006 6:25 pm

If you want to load complicated constants, try the following:

Code:
     ldr     r0, =0x12345678
     ldr     r1, =0x98765432


This will put the constants in an area of nearby memory called the "constant pool" (close enough that the location is relative to the PC, but out of the range of executable code). The values are then loaded from there.

If you want to control where the constant pool is, you can do so by placing the .pool directive somewhere near your code where execution can never reach it (typically right after your function returns).

Of course, loading from memory takes more cycles than a mov instruction, and each constant takes 4 bytes which can be undesirable when working in fast RAM, so whenever you can use mov it's a good idea to do so.

Edit: anyone know if the assembler is smart enough to replace those ldr statements with mov if it sees the constant is a candidate for direct assembly?

Dan.

#83661 - DekuTree64 - Wed May 17, 2006 7:02 pm

poslundc wrote:
anyone know if the assembler is smart enough to replace those ldr statements with mov if it sees the constant is a candidate for direct assembly?

I think I remember testing this before and it does use mov if possible.

The only case I know of where it's better to do mov and then orr is for 16-bit values. mov-orr takes 8 bytes, and 2 cycles. ldr reg, =constant also takes 8 bytes (4 for the ldr, 4 for the constant), but takes 3 cycles. Pretty insignificant unless you're in some inner loop that's called a LOT of times, and by then you should be caching the constant in a register anyway :)

So generally you can just use ldr.
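
For anyone who wants to see how a mov/orr sequence could be derived by hand, here is a rough Python sketch. The helper names split_into_immediates and emit are made up for illustration, and the greedy split is only an assumption about one reasonable approach: it ignores immediates that wrap around the top of the word, so it is not guaranteed to give the shortest sequence and is not what the assembler itself does.

```python
def split_into_immediates(x):
    """Greedily split a 32-bit value into 8-bit-wide chunks at even bit positions.

    Each chunk fits the 8-bit-value-plus-even-rotation format discussed above.
    Hypothetical helper: wrap-around immediates are not considered.
    """
    x &= 0xFFFFFFFF
    chunks = []
    while x:
        low = (x & -x).bit_length() - 1   # index of the lowest set bit
        shift = low & ~1                  # rotations only come in steps of 2 bits
        chunk = x & (0xFF << shift)       # take an 8-bit window starting there
        chunks.append(chunk)
        x &= ~chunk                       # clear the bits already covered
    return chunks


def emit(reg, x):
    """Return the mov/orr lines that build x in reg from the chunks above."""
    chunks = split_into_immediates(x) or [0]
    lines = [f"mov {reg}, #{chunks[0]:#x}"]
    lines += [f"orr {reg}, {reg}, #{c:#x}" for c in chunks[1:]]
    return lines


for line in emit("r8", 0x415):
    print(line)
```

For 0x415 this gives two instructions (0x15, then 0x400) instead of three. A full 32-bit value such as 0x12345678 comes out as four, which is where ldr from the pool is clearly the better deal.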

#83662 - Cearn - Wed May 17, 2006 7:09 pm

The data processing ARM instructions have 4 rotate bits (r) and 8 bits for the immediate value (d), allowing the numbers x = ROR(d, 2*r). I had a link that explained this, but it's long since departed. Re-eject's quick references are still active though, and you can get this and other info from it at a glance: ARM/THUMB quickrefs.
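
To make the x = ROR(d, 2*r) rule concrete, here is a small Python sketch that searches for a valid (d, r) pair. The function names rol32 and encode_arm_immediate are hypothetical, chosen only for this example; the search simply undoes each of the 16 possible rotations and checks whether what is left fits in 8 bits.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def encode_arm_immediate(x):
    """Return (d, r) such that x == ROR(d, 2*r), or None if no pair exists."""
    x &= MASK32
    for r in range(16):            # 4 rotate bits
        d = rol32(x, 2 * r)        # rotating left undoes a rotate right by 2*r
        if d <= 0xFF:              # 8 bits for the immediate value
            return d, r
    return None


for value in (0x415, 0x400, 0x010, 0x005, 0xFF000, 0xFF800):
    print(hex(value), encode_arm_immediate(value))
```

With this check, 0x415 and 0xff800 come back as None (their set bits span more than 8 bits), while 0x400, 0x010, 0x005 and 0xff000 each have a valid pair.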

poslundc wrote:
Of course, loading from memory takes more cycles than a mov instruction, and each constant takes 4 bytes which can be undesirable when working in fast RAM, so whenever you can use mov it's a good idea to do so.

Each extra instruction for piecing it together from bytes also takes a word, so using loads isn't completely without merit.

poslundc wrote:
Edit: anyone know if the assembler is smart enough to replace those ldr statements with mov if it sees the constant is a candidate for direct assembly?

Code:
// ARM asm:
    ldr r0, =4
    ldr r0, =256
    ldr r0, =257

// VBA disassembly:
E3A0 0004 mov     r0, #0x4
E3A0 0c01 mov     r0, #0x100
E51F 0004 ldr     r0, [$0200077C] (=$00000101)


Guess so.
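
As a cross-check on opcode words like these, the fields can be pulled apart with a few lines of Python. This is only a sketch: the function names are hypothetical, and it handles just the two encodings that show up here (data processing with an immediate operand, and ldr with an immediate offset).

```python
def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def decode_dp_immediate(word):
    """Operand value of a data-processing instruction with an immediate operand."""
    assert (word >> 25) & 0b111 == 0b001, "not an immediate data-processing word"
    rot = (word >> 8) & 0xF        # 4 rotate bits
    imm8 = word & 0xFF             # 8-bit immediate
    return ror32(imm8, 2 * rot)


def decode_ldr_offset(word):
    """Signed byte offset of an ldr/str with a 12-bit immediate offset."""
    assert (word >> 25) & 0b111 == 0b010, "not an immediate-offset load/store"
    offset = word & 0xFFF
    up = (word >> 23) & 1          # U bit: 1 = add offset, 0 = subtract
    return offset if up else -offset


print(hex(decode_dp_immediate(0xE3A00C01)))   # the mov r0, #0x100 word
print(decode_ldr_offset(0xE51F0004))          # the pc-relative ldr word
```

The mov word decodes to rotate field 12 and immediate 1, i.e. ROR(1, 24) = 0x100. The ldr word carries an offset of -4 from pc, and since pc reads as the instruction address plus 8 in ARM state, the literal sits 4 bytes after the ldr itself.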

#83667 - tepples - Wed May 17, 2006 7:44 pm

Cearn wrote:
poslundc wrote:
Of course, loading from memory takes more cycles than a mov instruction, and each constant takes 4 bytes which can be undesirable when working in fast RAM, so whenever you can use mov it's a good idea to do so.

Each extra instruction for piecing it together from bytes also takes a word, so using loads isn't completely without merit.

It becomes more complicated in Thumb. When you piece a constant together from bytes, you use a MOV followed by an ORR (with shifts in the middle if you're using Thumb), and these are read sequentially. When you read from a constant pool, you get two seek penalties (2 cycles in ROM, 0 cycles elsewhere). In 16-bit ROM without prefetch:
  • MOV, LSL, MOV, ORR is 4 instructions that take 8 cycles: wait, read instruction, wait, read instruction, wait, read instruction, wait, read instruction.
  • LDR from pool is 1 instruction and 2 data that takes 11 cycles: wait, read instruction, generate effective address, seek, wait, wait, read data, wait, read data, seek, wait.

_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#83675 - Cearn - Wed May 17, 2006 8:23 pm

tepples wrote:
  • LDR from pool is 1 instruction and 2 data that takes 11 cycles: wait, read instruction, generate effective address, seek, wait, wait, read data, wait, read data, seek, wait.


Shouldn't that be 9 cycles? 1S+1I for the instruction itself, and 1N+1S for the dataload itself (32bit from ROM = 2 16bit reads). Assuming you meant ROM in 3,1 wait, that would come down to (S+I)+(N+S) = (2+1)+(4+2)= 9 cycles. Or am I missing something?

It's still slower than piecing it together, of course, just not by as much. But if you need 3 or more bytes, then ldr wins out again. Argh, initializing large numbers sucks >_<

#83699 - poslundc - Wed May 17, 2006 10:35 pm

I should clarify that when I said "use mov whenever possible" I actually meant "use mov when you have a valid constant to do it in a single instruction."

Although now that I see the ldr/pool format is at least intelligent enough to recognize that scenario (if not intelligent enough to know your memory access waitstates and bus width!) it now seems to me to be the preferable candidate in perhaps all but the most rigorously coded assembly.

Dan.

#83701 - tepples - Wed May 17, 2006 10:52 pm

Cearn wrote:
tepples wrote:
LDR from pool is 1 instruction and 2 data that takes 11 cycles: wait, read instruction, generate effective address, seek, wait, wait, read data, wait, read data, seek, wait.

Shouldn't that be 9 cycles? 1S+1I for the instruction itself, and 1N+1S for the dataload itself (32bit from ROM = 2 16bit reads). Assuming you meant ROM in 3,1 wait, that would come down to (S+I)+(N+S) = (2+1)+(4+2)= 9 cycles. Or am I missing something?

Yeah, you're missing the non-sequential penalty (N-S = 2) for the next instruction.
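
The arithmetic behind the 8, 9 and 11 cycle figures can be written out as a short Python sketch. The values of N, S and I below assume ROM at 3,1 waitstates, Thumb code and no prefetch, as in the posts above; the function names are made up for this example.

```python
# Assumed timings for ROM at 3,1 waitstates, 16-bit bus, no prefetch.
N = 3 + 1   # non-sequential access: 3 waits plus the read itself
S = 1 + 1   # sequential access: 1 wait plus the read itself
I = 1       # internal cycle (address generation)


def thumb_build_16bit():
    """mov, lsl, mov, orr: four sequential instruction fetches."""
    return 4 * S


def ldr_as_documented():
    """(S + I) for the instruction, (N + S) for the two 16-bit data reads."""
    return (S + I) + (N + S)


def ldr_with_refetch_penalty():
    """Same, plus the next instruction fetch costing N instead of S."""
    return ldr_as_documented() + (N - S)


print(thumb_build_16bit(), ldr_as_documented(), ldr_with_refetch_penalty())
```

The difference between the last two figures is exactly the N-S = 2 penalty for the fetch that follows the load.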

#83808 - Cearn - Thu May 18, 2006 6:32 pm

Just did some tests with different waitstates and data in different sections and indeed ldr seems to account for 1Ndata + 1Ncode + 1I cycles. So why does every document I've seen list ldr as 1N+1S+1I ?

#83861 - poslundc - Fri May 19, 2006 1:07 am

Arguably the 1S comes from the understanding that most instructions are being loaded sequentially from memory, and if you do a non-sequential read in a load instruction the 1N isn't from the load instruction, but it's from the 1S of the next instruction actually becoming 1N because what is assumed to be a sequential read is instead non-sequential.

In other words, the timing formats in the ARM specs account for a non-sequential generated from a branch by making a branch take extra cycles, but other than that they don't presume to be aware of any waitstate penalties you may suffer from if a data load causes your next read to be non-sequential.

Cearn, did you test ROM reading from RAM as well as ROM reading from ROM? Were the results the same in both cases?

Dan.

#83974 - tepples - Sat May 20, 2006 12:02 am

Reading a 32 bit value in ROM from EWRAM: 10 cycles
wait, wait, read instruction, compute address, wait, wait, wait, read low bits, wait, read high bits

Reading a 32 bit value in EWRAM from EWRAM: 10 cycles
wait, wait, read instruction, compute address, wait, wait, read low bits, wait, wait, read high bits

Loading a 16 bit value using mov/lsl/mov/orr: 12 cycles
(wait, wait, read instruction) * 4

#83995 - Cearn - Sat May 20, 2006 1:56 am

poslundc wrote:
Arguably the 1S comes from the understanding that most instructions are being loaded sequentially from memory, and if you do a non-sequential read in a load instruction the 1N isn't from the load instruction, but it's from the 1S of the next instruction actually becoming 1N because what is assumed to be a sequential read is instead non-sequential.

In other words, the timing formats in the ARM specs account for a non-sequential generated from a branch by making a branch take extra cycles, but other than that they don't presume to be aware of any waitstate penalties you may suffer from if a data load causes your next read to be non-sequential.

I thought of that, so I also ran tests with either a mov,str (for 1S,2N) or a str,mov (2N,1S) after the ldr. I figured that if the extra non-sequential penalty was applied to the next instruction, then if it's already a non-sequential cycle nothing extra would happen. However, there is no difference between ldr,mov,str and ldr,str,mov in terms of cycles, so there goes that idea. Unless the str also has a similar penalty, I suppose.

poslundc wrote:
Cearn, did you test ROM reading from RAM as well as ROM reading from ROM? Were the results the same in both cases?

I tried code in ROM and reads from ROM and IWRAM for 4,2 / 4,1 / 3,2 / 3,1 waitstates, no prefetch. In all cases it was 1Ndata+1Ncode+1I.

tepples, since you know what's actually going on, where did you get this information from?

#84006 - tepples - Sat May 20, 2006 4:27 am

I mentally map out the sequence of addresses that the CPU accesses. If they're sequential, I add sequential wait states (1 in ROM, 2 in EWRAM), and if they're non-sequential, I add non-sequential wait states (3 in ROM, 2 in EWRAM). (Here, "sequential" is defined not by the ARM spec but by accesses within an address range; for example, code in ROM that LDR's from RAM will still access ROM itself sequentially.) If an instruction causes the next instruction to be non-sequential when it would otherwise be sequential, I add extra wait states to make up for the difference.
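
That bookkeeping can be sketched in Python as a list of bus accesses that gets summed up. The wait state table uses only the numbers quoted in this thread (ROM 3 non-sequential / 1 sequential, EWRAM 2 / 2), each access is assumed to be one 16-bit read, and the names WAITS and cycles are hypothetical.

```python
# region -> (non-sequential waits, sequential waits), as quoted in this thread
WAITS = {
    "ROM": (3, 1),
    "EWRAM": (2, 2),
}


def cycles(steps):
    """Sum cycles for a list of steps.

    A step is either "I" (one internal cycle) or a (region, kind) pair,
    where kind is "N" or "S"; each access costs its waits plus 1 for the read.
    """
    total = 0
    for step in steps:
        if step == "I":
            total += 1
        else:
            region, kind = step
            nonseq, seq = WAITS[region]
            total += (nonseq if kind == "N" else seq) + 1
    return total


# Thumb code in EWRAM loading a 32-bit value (two 16-bit reads) from ROM or EWRAM
rom_from_ewram = [("EWRAM", "S"), "I", ("ROM", "N"), ("ROM", "S")]
ewram_from_ewram = [("EWRAM", "S"), "I", ("EWRAM", "N"), ("EWRAM", "S")]
build_in_ewram = [("EWRAM", "S")] * 4     # mov, lsl, mov, orr

# Thumb code in ROM loading from ROM: the following fetch becomes N instead of S
ldr_rom = [("ROM", "S"), "I", ("ROM", "N"), ("ROM", "S")]
extra = cycles([("ROM", "N")]) - cycles([("ROM", "S")])

print(cycles(rom_from_ewram), cycles(ewram_from_ewram), cycles(build_in_ewram))
print(cycles(ldr_rom) + extra)
```

In EWRAM the sequential and non-sequential waits are the same, so the extra term is zero there and only shows up for code running from ROM.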

#142490 - Aeneas - Tue Oct 09, 2007 4:37 pm

I would be interested to know where this cycle timing data originates.
Which document is this in, or if you measured this, which hardware or software was used to measure it?

#142501 - Dwedit - Tue Oct 09, 2007 6:38 pm

I'm pretty sure you can find it if you google 'arm7tdmi pdf'
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#142503 - Aeneas - Tue Oct 09, 2007 6:45 pm

Would ARM7 clock cycle counts be consistent with ARM9 counts? The ARM9 (ARM926EJ-S) is my particular application.

#142508 - tepples - Tue Oct 09, 2007 9:03 pm

Aeneas wrote:
I would be interested to know where this cycle timing data originates.
Which document is this in

Wait state info is in GBATEK.

And no, ARM9 timings aren't necessarily consistent with ARM7 timings.