#134125 - simonjhall - Tue Jul 10, 2007 8:04 pm
I have a dumb function which allows 8-bit writes to the gba slot by reading in a short, masking out the bits I don't want, then orring in the new bits of the value, like so:
Code: |
//write byte 'value' to address 'ptr'
void byte_write(void *ptr, unsigned char value)
{
    // halfword-aligned address containing the target byte
    unsigned int aligned_addr = ((unsigned int)ptr) & 0xfffffffe;
    unsigned short existing = *(unsigned short *)aligned_addr;
    unsigned short result;
    // odd address = upper byte of the halfword (little-endian)
    unsigned short is_odd = (unsigned int)ptr & 0x1;
    unsigned short value_shift = is_odd << 3;
    unsigned short shifted_value = value << value_shift;
    // keep the byte we are not writing
    unsigned short existing_masked = existing & (0xff00 >> value_shift);
    result = shifted_value | existing_masked;
    *(unsigned short *)aligned_addr = result;
} |
The compiler does a good job of this, and compiles it to:
Code: |
02043ad8 <_Z10byte_writePvh>:
2043ad8: e2002001 and r2, r0, #1 ; 0x1
2043adc: e3c00001 bic r0, r0, #1 ; 0x1
2043ae0: e1d030b0 ldrh r3, [r0]
2043ae4: e1a02982 mov r2, r2, lsl #19
2043ae8: e1a02822 mov r2, r2, lsr #16
2043aec: e3a0ccff mov ip, #65280 ; 0xff00
2043af0: e003325c and r3, r3, ip, asr r2
2043af4: e1833211 orr r3, r3, r1, lsl r2
2043af8: e1c030b0 strh r3, [r0]
2043afc: e12fff1e bx lr |
However I reckon this can be done better. For instance, my C code makes sure that the load (and store) is aligned by taking the destination address and chopping off the bottom bit. But (I'm sure) ldrh won't care about that bottom bit anyway, so I may as well leave it in (and hence remove the and and bic).
What else can I drop to make this as short as possible?
Oh, and I'd prefer to use as few registers as possible.
And finally, there are several names for strb - there are the conditional versions, are there any more?
Ta :-D
_________________
Big thanks to everyone who donated for Quake2
#134127 - Dwedit - Tue Jul 10, 2007 8:48 pm
I made a couple of these (8 instructions)
This one assumes you don't care about writing or reading from an unaligned address, and want to preserve r0 and r1:
Code: |
@r0 = address
@r1 = value
tst r0,#1
ldrh r2,[r0]
biceq r2,r2,#0x00FF
bicne r2,r2,#0xFF00
orreq r2,r2,r1
orrne r2,r2,r1,lsl#8
strh r2,[r0]
bx lr
|
Then another one, for when you don't care about destroying the address and want to avoid unaligned memory access (still 8 instructions). You can easily sacrifice an additional register if you want to preserve r0.
Code: |
@r0 = address
@r1 = value
ands r2,r0,#1
ldrh r2,[r0,-r2]!
biceq r2,r2,#0x00FF
bicne r2,r2,#0xFF00
orreq r2,r2,r1
orrne r2,r2,r1,lsl#8
strh r2,[r0]
bx lr
|
Edit:
This C code produced results somewhat similar to the first example at the top:
C code:
Code: |
void byte_write(u8* mem, u8 value)
{
u32 address=(u32)mem;
int condition = (address&1);
u16 readvalue = *(u16*)mem;
if (condition) readvalue&=0xFF00;
else readvalue &=0x00FF;
if (condition) readvalue|=(value<<8);
else readvalue|=value;
*(u16*)mem=readvalue;
}
|
Produced this:
Code: |
byte_write:
ldrh r2, [r0, #0]
tst r0, #1
and r3, r2, #65280
and r2, r2, #255
orr r3, r3, r1, asl #8
orreq r3, r2, r1
strh r3, [r0, #0]
bx lr
|
I couldn't get the compiler to produce anything resembling the second version, though. It must not like subtracting 1 with an LDRH []! instruction.
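Note that the C listing above masks with 0xFF00 on odd addresses, which keeps the very byte it then ORs the new value into, whereas both assembly versions clear that byte first. A minimal sketch of the C with the masks the other way round, aligning the pointer as the second assembly version does. It assumes little-endian byte order (as on the DS), `byte_write_masked` is a made-up name, and it has only been reasoned through on a host, not run on hardware:

```c
#include <stdint.h>

/* Sketch only: replace one byte via an aligned 16-bit read-modify-write. */
static inline void byte_write_masked(void *ptr, uint8_t value)
{
    uintptr_t addr = (uintptr_t)ptr;
    /* Clear bit 0 so the halfword access is always aligned. */
    volatile uint16_t *half = (volatile uint16_t *)(addr & ~(uintptr_t)1);
    uint16_t v = *half;

    if (addr & 1)
        v = (uint16_t)((v & 0x00FF) | (value << 8)); /* odd: replace high byte */
    else
        v = (uint16_t)((v & 0xFF00) | value);        /* even: replace low byte */

    *half = v;
}
```

The neighbouring byte is preserved in both cases, which is the property the 8-bit write emulation depends on.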
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#134130 - masscat - Tue Jul 10, 2007 9:03 pm
LDRH may or may not care about the least significant bit.
Quoting the ARM Architecture Reference Manual:
Quote: |
If the address is not halfword-aligned, the result is UNPREDICTABLE. |
Same for STRH.
The behaviour may be consistent for the DS ARMs though.
You can get the ARM Architecture Reference Manual by googling for "arm ddi 0100e".
#134136 - simonjhall - Tue Jul 10, 2007 10:42 pm
Kewl, ta dwedit. I reckon I'll be taking the second bit of code from you (the one using the conditional execution). Btw, have you tested these code fragments, so can I be sure that they work? ;-)
And thanks masscat - I think I'll make sure the bottom bit is cleared, just in case it happens to run on a DS which cares about that bit!
Watch this space...
#134149 - Cearn - Wed Jul 11, 2007 1:01 am
Here's one for 7 instructions, without masking.
Code: |
@ void byte_write(void *ptr, u8 value)
byte_write:
eor r2, r0, #1
ldrb r3, [r2] @ Read the *other* byte
ands r2, r0, #1 @ Test and prep for alignment
orreq r3, r1, r3, lsl #8 @ Even: src-byte needs shift
orrne r3, r3, r1, lsl #8 @ Odd: value needs shift
strh r3, [r0,-r2] @ Align and write
bx lr
|
NOTE: UNTESTED. But something like this should be possible. I'll test tomorrow when it's not 2am.
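The same idea can be sketched in C: read the neighbouring byte with an 8-bit load, build the full halfword, and store it aligned, so no masking is needed. This is a rough rendering rather than a drop-in replacement for the assembly; it assumes little-endian byte order and that 8-bit reads from the target memory work, and `byte_write_other` is a name invented here:

```c
#include <stdint.h>

/* Sketch only: combine the new byte with its neighbour, then store 16 bits. */
static inline void byte_write_other(void *ptr, uint8_t value)
{
    uintptr_t addr = (uintptr_t)ptr;
    /* Flipping bit 0 gives the address of the other byte in the halfword. */
    uint8_t other = *(volatile uint8_t *)(addr ^ 1);

    uint16_t half = (addr & 1)
        ? (uint16_t)(other | (value << 8))   /* odd: new byte goes in the high half */
        : (uint16_t)(value | (other << 8));  /* even: neighbour goes in the high half */

    /* Aligned 16-bit store, the equivalent of strh r3, [r0, -r2]. */
    *(volatile uint16_t *)(addr & ~(uintptr_t)1) = half;
}
```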
#134156 - Dwedit - Wed Jul 11, 2007 2:03 am
Wow, that's even awesomer.
I think the best method is probably still the swpb with cache method, which is NDS9 only.
#134166 - simonjhall - Wed Jul 11, 2007 7:44 am
Dwedit wrote: |
I think the best method is probably still the swpb with cache method, which is NDS9 only. |
What does that do? (sorry but I've forgotten most of my ARM!)
#134169 - Ant6n - Wed Jul 11, 2007 8:11 am
How about enabling caching with WB on the GBA slot, and using:
Code: |
pld [r0]
strb r1,[r0]
|
I can't really test it, as I don't have access to a DS right now. Since there is not much information on pld, I'd guess that if a line fill is required, it'll stall the bus for the load of 32 bytes. If executing from ITCM, one could probably have some instructions in between, because they don't need the external bus. Performance could vary depending on how scattered the accesses are. Maybe using a unified cache, which I believe would give 12 KB instead of 4 KB, would increase the chance of a cache hit when running code from ITCM. Again, I can only make assumptions, as the documentation is sparse and I can't test.
Also, things might go wrong if an interrupt occurs in between these instructions and the cache line of [r0] is flushed by the ISR.
#134171 - kusma - Wed Jul 11, 2007 8:37 am
Ant6n wrote: |
Since there is not much information on pld, i'd guess that if a line fill is required, it'll stall the bus for the load of 32 bytes. |
Most ARM9 implementations I've seen have actually treated PLD as a NOP. Not too sure about what the NDS9 does though.
#134173 - simonjhall - Wed Jul 11, 2007 9:22 am
Ant6n wrote: |
Also, things might go wrong if an interrupt occurs inbetween these instructions and the cacheline of [r0] is flushed by the isr. |
Yeah that's what I was worried about.
Tbh after writing the debugger I'd like to run something which works *exactly* the way I expect it to! No fancy stuff, so hopefully the debugging will be easier!
Anyway, why use a pld? What about a ldrb followed by a strb? I'm pretty sure ldrb works :-)
#134174 - Ant6n - Wed Jul 11, 2007 9:24 am
kusma wrote: |
Ant6n wrote: | Since there is not much information on pld, i'd guess that if a line fill is required, it'll stall the bus for the load of 32 bytes. |
Most ARM9-implementations I've seen has actually treated PLD as a NOP. Not too sure about what the NDS9 does though. |
Well, even if pld doesn't do anything, one could still just read the byte (or the corresponding word) and force the line fetch that way. Considering that caching should speed up RAM in general, and that a read has to be performed to write 8 bits anyway, the cache trick might come pretty much for free.
#134184 - nutki - Wed Jul 11, 2007 11:48 am
Cearn wrote: |
Code: | @ void byte_write(void *ptr, u8 value)
byte_write:
...
ldrb r3, [r2] @ Read the *other* byte
...
|
|
Wasn't the whole thing that you cannot 8-bit access the *ptr?
#134186 - simonjhall - Wed Jul 11, 2007 12:02 pm
ldrb should work fine, it's strb that doesn't work correctly when used on the (uncached?) gba slot memory space.
#134196 - kusma - Wed Jul 11, 2007 2:34 pm
I think Cearn's technique here is close to optimal, but you might end up wasting a bit of performance as function call overhead. If you move the register-allocation over to gcc, you can write a nice inline-routine that can be used. If you don't just end up trashing the instruction-cache, that is.
edit: For your convenience, I've rewritten Cearn's code to inline asm. When compiled with optimizations on, it outputs exactly the same as the non-inline-asm version. Here you go:
Code: |
void byte_write(void *ptr, u8 value)
{
int tmp1, tmp2;
asm volatile (
"\n\
eor %0, %[ptr], #1 \n\
ldrb %1, [%0] @ Read the *other* byte \n\
ands %0, %[ptr], #1 @ Test and prep for alignment \n\
orreq %1, %[val], %1, lsl #8 @ Even: src-byte needs shift \n\
orrne %1, %1, %[val], lsl #8 @ Odd: value needs shift \n\
strh %1, [%[ptr],-%0] @ Align and write \n\
"
: /* output */
"=&r"(tmp1), "=&r"(tmp2)
: /* input */
[ptr] "r"(ptr),
[val] "r"(value)
: /* clobbers: ands changes the flags, strh writes memory */
"cc", "memory"
);
}
|
edit2: warning: phpbb has added some spaces between the slashes and line-ends. remove them to get the code to compile ;)
#134214 - Ant6n - Wed Jul 11, 2007 5:38 pm
simonjhall wrote: |
...
Anyway, why use a pld? What about a ldrb followed by a strb? I'm pretty sure ldrb works :-) |
When I first looked into it, it seemed like a pretty smart idea because PLD only takes one cycle. In my project design I was going to execute asm from ITCM, and since pld does not need the return value it'd most likely be possible to have a bunch of instructions between the pld and the strb. That way the line fill and those instructions could execute simultaneously, although I haven't tested whether a line fill and on-die execution can actually overlap. Of course this doesn't apply here.
About the interrupt cases, I think one could get deterministic behaviour if one considers this:
- if the ISR doesn't use cached memory, then everything should be fine (the cache survives across calls)
- if the ISR does use cached memory, then check whether the instruction before the interrupt was a load, and simply return to that same instruction. One could add some extra checks (i.e. whether the instruction afterwards is ldrb), or possibly use pld to identify the cache prepare.
Whether this works sort of depends on the project.
#134216 - DekuTree64 - Wed Jul 11, 2007 5:52 pm
Can't you just use swpb alone, like Dwedit mentioned a few posts ago? It should load the cache line and do the write without the possibility of interrupts in between.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#134257 - Ant6n - Thu Jul 12, 2007 2:29 am
oh right! ('doh)
#134293 - simonjhall - Thu Jul 12, 2007 9:33 am
So again, what does swpb do?
#134296 - chishm - Thu Jul 12, 2007 9:59 am
It's an atomic byte load then store. It locks the bus while the instruction executes, and can be used to implement semaphores.
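In C terms that is a byte-sized atomic exchange. A minimal sketch of the semaphore use, written with C11 `atomic_exchange` as a stand-in for swpb; whether a given toolchain actually lowers this to swpb on the DS is an assumption, and `byte_lock_try` / `byte_lock_release` are names made up for the example:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_uchar byte_lock;   /* 0 = free, 1 = taken */

/* Swap 1 into the lock byte; the old value tells us whether we got it. */
static inline bool byte_lock_try(byte_lock *lock)
{
    return atomic_exchange(lock, 1) == 0;
}

/* Mark the lock as free again. */
static inline void byte_lock_release(byte_lock *lock)
{
    atomic_store(lock, 0);
}
```

Because the read and the write happen as one indivisible operation, two contexts can never both see the old value 0 and both believe they hold the lock.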
_________________
http://chishm.drunkencoders.com
http://dldi.drunkencoders.com
#134449 - Ant6n - Fri Jul 13, 2007 7:28 am
Cearn wrote: |
Here's one for 7 instructions, without masking.
Code: | @ void byte_write(void *ptr, u8 value)
byte_write:
eor r2, r0, #1
ldrb r3, [r2] @ Read the *other* byte
ands r2, r0, #1 @ Test and prep for alignment
orreq r3, r1, r3, lsl #8 @ Even: src-byte needs shift
orrne r3, r3, r1, lsl #8 @ Odd: value needs shift
strh r3, [r0,-r2] @ Align and write
bx lr
|
... |
If I remember correctly, an ldrb causes a two-cycle interlock.
#134680 - simonjhall - Sun Jul 15, 2007 6:55 pm
I forgot to follow this one up - basically I was writing a tool to automatically patch up strbs with a dash of code to emulate the instruction. However it didn't work out as well as I had hoped as other stuff broke it!
It works like this:
- compile the code not to an object file, but to assembly output (-S, not -c)
- find all instances of strb
- figure out what that instruction was going to do
- insert a replacement
- assemble the files
- ship it
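The "find all instances of strb" step has to cope with the different spellings of the mnemonic. A rough sketch of such a matcher, assuming lowercase mnemonics that have already been split from their operands, and assuming the spellings are the older divided syntax str{cond}b{t} (e.g. streqb, strbt) and the unified syntax strb{t}{cond} (e.g. strbeq). `is_byte_store` is a hypothetical helper for illustration, not part of the tool described here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ARM condition-code suffixes (hs/lo are aliases of cs/cc). */
static const char *const conds[] = {
    "eq", "ne", "cs", "hs", "cc", "lo", "mi", "pl", "vs",
    "vc", "hi", "ls", "ge", "lt", "gt", "le", "al"
};

/* Returns 2 if s starts with a condition code, otherwise 0. */
static size_t cond_len(const char *s)
{
    for (size_t i = 0; i < sizeof conds / sizeof conds[0]; i++)
        if (strncmp(s, conds[i], 2) == 0)
            return 2;
    return 0;
}

/* True if the mnemonic is some spelling of a byte store. */
static bool is_byte_store(const char *m)
{
    if (strncmp(m, "str", 3) != 0)
        return false;
    const char *p = m + 3;

    /* Divided syntax: str{cond}b{t} */
    const char *q = p + cond_len(p);
    if (*q == 'b') {
        q++;
        if (*q == 't')
            q++;
        if (*q == '\0')
            return true;
    }

    /* Unified syntax: strb{t}{cond} */
    if (*p == 'b') {
        p++;
        if (*p == 't')
            p++;
        p += cond_len(p);
        return *p == '\0';
    }
    return false;
}
```

Halfword, word and doubleword stores (strh, str, strd) fall through both checks and are left alone.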
I know this works, as I've done it before on another platform (I replaced all load/store instructions with blocking DMA reads and writes), however the ARM ISA is a little more complicated so it means that there's more chance of getting it wrong. However I do have something which I think is working ok, but interrupt handlers are breaking it ;-)
The biggest problem is that on entry to the replaced block I need a couple of spare registers (e.g. for computing the EA) and need to save/restore the flag bits. However there doesn't seem to be a 'store absolute' instruction, so all loads/stores need to be relative to a register...but I can't set an address in a register without trashing its contents! So that leaves $sp and $pc...
Using the stack pointer is a bit risky (as you can't be sure that it actually points to the stack), but what really clobbered me is that interrupts which use a lot of stack may occur and wipe out the stored state.
So that leaves the program counter. I could have gotten away with doing something like this:
Code: |
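str r0, [pc, #12] @ save r0 in the 1st word below (pc reads as this instruction + 8)
str r1, [pc, #12] @ save r1 in the 2nd word
mrs r1, cpsr
str r1, [pc, #8] @ save the flags in the 3rd word
b rest_of_code
.word 0x0 @ saved r0
.word 0x0 @ saved r1
.word 0x0 @ saved cpsr
rest_of_code:
<etc>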
|
...but I couldn't be arsed, plus didn't like the wasted time due to branching. So before I waste yet more time working on this thing, I'm gonna do it by manually going through gallons of code, manually replacing strbs :-(
If anyone wants to pick up my code, just shout. If it works, it'll quite easily allow you to run any program from slot-2 RAM without modifying the compiler.
#134706 - Ant6n - Mon Jul 16, 2007 1:06 am
the swap byte trick didn't work?
#134713 - Dwedit - Mon Jul 16, 2007 3:12 am
I believe you need to enable caching for that memory range before the swap byte trick will work.
#134715 - Dwedit - Mon Jul 16, 2007 3:17 am
Have the changes made to DSLinux's GCC toolchain (to support 8-bit writes to any memory area) been merged into devkitARM yet?
#134718 - wintermute - Mon Jul 16, 2007 4:40 am
Dwedit wrote: |
Have the changes made to DSLinux's GCC toolchain (to support 8-bit writes to any memory area) been merged into devkitARM yet? |
No, and given that it would then require another 4 sets of libraries I'm really not that keen on doing it.
_________________
devkitPro - professional toolchains at amateur prices
#134737 - simonjhall - Mon Jul 16, 2007 8:05 am
Yeah, I thought about swpb, but I don't think you can use the same addressing modes as strb, which means the EA might have to be computed in code. If so, you'd still need a spare register to do that with...
#134744 - DekuTree64 - Mon Jul 16, 2007 9:13 am
I think it's pretty safe to assume that the stack will always be the stack. So then all you need to do is find a fast byte write that doesn't involve conditional instructions. Here's a pretty crappy one. 11 instructions, 2 temp registers, and "undefined" behavior.
Code: |
@ r0 is dest pointer
@ r1 is data byte
stmfd sp!, {r2, r3}
ldr r2, [r0] @ Read 32 bits, rotated so our target is in the low bits
bic r2, r2, #0xff @ Clear space for target
and r3, r1, #0xff @ Upper bits may not be 0...
orr r2, r2, r3 @ Insert the byte
and r3, r0, #3 @ \
mov r3, r3, lsl #3 @ } 32 - (dest & 3) * 8
rsb r3, r3, #32 @ /
mov r2, r2, ror r3 @ Rotate target byte back in place
str r2, [r0] @ Store (unaligned, but should force)
ldmfd sp!, {r2, r3} |
Although I guess using swpb with manually computed addresses for the unsupported modes would be faster.
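The rotate arithmetic can be checked on a host with a small C model. It assumes, as the routine above does, that an unaligned ldr returns the aligned word rotated right by 8 * (address & 3) bits; `ror32` and `insert_byte` are names invented for this sketch:

```c
#include <stdint.h>

/* Rotate right by n bits; n is reduced mod 32 so 0 and 32 both return x. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Model of the routine: replace byte 'byte_index' (0..3) of an aligned word. */
static uint32_t insert_byte(uint32_t word, unsigned byte_index, uint8_t value)
{
    unsigned rot = 8 * (byte_index & 3);

    uint32_t r = ror32(word, rot);   /* assumed result of the unaligned ldr */
    r = (r & ~0xffu) | value;        /* bic + orr: target byte is in the low bits */
    return ror32(r, 32 - rot);       /* rotate the target byte back in place */
}
```

For example, inserting 0xAB at byte 1 of 0x44332211 gives 0x4433AB11, and the aligned case (byte 0) rotates by 32, which leaves the word unchanged.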
#134746 - simonjhall - Mon Jul 16, 2007 9:24 am
DekuTree64 wrote: |
I think it's pretty safe to assume that the stack will always be the stack. |
But what about (optimised) leaf functions which don't actually make stack frames, and don't move the stack pointer? You've got to store your data a certain amount below the stack pointer just in case...
Time to move to the PSP I think, where they don't have this stupid problem ;-)
#134750 - DekuTree64 - Mon Jul 16, 2007 10:27 am
simonjhall wrote: |
But what about (optimised) leaf functions which don't actually make stack frames, and don't move the stack pointer? |
But if anything was ever depending on the survival of data below sp, then you wouldn't be able to switch to the main stack during an interrupt handler. Or are you talking about hand-written functions?
#134774 - Ant6n - Mon Jul 16, 2007 5:17 pm
Switch to the main stack?? Why'd you do that? And also, why would you have an interrupt stack to begin with?
That's dangerous talking there
#134807 - Ant6n - Tue Jul 17, 2007 2:32 am
One way to store a value temporarily without using the stack is to use some specific memory location. To make sure that interrupts etc. don't interfere with each other, one could have one of these for every couple of functions. Assuming that these functions are not used by both ISR and normal code, there should be no error conditions.
Regarding not changing the condition flags: do lsl, asr, lsr and ror change the carry flag? My ARM asm programming manual seems to suggest so, but GBATEK seems to suggest that flags are only set when an instruction uses the 'S' suffix.