#10778 - slip - Mon Sep 15, 2003 2:01 pm
I've got two global variables declared in C code, but I can't seem to use them with inline ASM. Here is my code in a dummy function:
Code: |
u32 skyColor;
u32 floorColor;
someFunc()
{
asm("
.ALIGN
.ARM
mov r0,%0
mov r1,r0
ldr r2,=skyColor
mov r3,#0x4B00
add r1,r1,r3
clearmore:
str r2,[r1]
sub r1,r1,#4
cmp r1,r0
bge clearmore
add r0,r0,#0x4B00 @add half the screen
mov r1,r0
ldr r2,=floorColor
ldr r3,=0x347C @0x3840 is 24 pixels off bottom
add r1,r1,r3
clearmoreb:
str r2,[r1]
sub r1,r1,#4
cmp r1,r0
bge clearmoreb
.pool
" :
/* No output */ :
"r" (pDrawBuffer):
"r1", "r2", "r3");
}
|
The routine just clears the top half of the screen to one color and the bottom half to another. The colors are meant to be skyColor and floorColor, but that doesn't seem to work.
Both the asm routine and the variables are in IWRAM. I'm not sure if this would matter.
I would have thought that if there was something wrong with this the assembler would have complained, or at least the linker??
Can anyone explain to me how I can use global variables in my ASM code like this?
Thanks in advance
_________________
[url="http://www.ice-d.com"]www.ice-d.com[/url]
#10779 - DekuTree64 - Mon Sep 15, 2003 2:05 pm
The problem is you're only getting the address of the variables, not the values stored in them. Use
ldr r2,=skyColor
ldr r2, [r2]
and it should be fine
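To put the difference in plain C terms, here is a minimal sketch. It is only a model of the two loads, not GBA code; the helper names after_first_ldr and after_second_ldr and the example colour value are made up for illustration.

```c
#include <stdint.h>

/* Stand-in for the global from the original post (u32 there);
   the value is an arbitrary example. */
uint32_t skyColor = 0x7C007C00u;

/* Models `ldr r2,=skyColor`: r2 now holds the ADDRESS of the variable. */
static uintptr_t after_first_ldr(void)
{
    return (uintptr_t)&skyColor;
}

/* Models the extra `ldr r2,[r2]`: r2 now holds the WORD stored there. */
static uint32_t after_second_ldr(uintptr_t r2)
{
    return *(const uint32_t *)r2;
}
```

Without the second load, the loop stores the address of skyColor into the screen buffer instead of the colour.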
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#10780 - tom - Mon Sep 15, 2003 2:10 pm
something else i spotted:
when your clearmoreb loop terminates, the cpu runs directly into the literal pool you placed after the bge instruction... try removing the literal pool (gcc will place one after the function anyway, iirc), or jump over it.
#10781 - slip - Mon Sep 15, 2003 2:21 pm
Ahh of course.. Thanks
oh and tom when I take out .pool I get
Error: Literal referenced across section boundary (Implicit dump?)
Error: Literal referenced across section boundary (Implicit dump?)
Error: Literal referenced across section boundary (Implicit dump?)
referring to the lines
ldr r2,=skyColor
ldr r2,=floorColor
ldr r3,=0x347C
=\
#10785 - tepples - Mon Sep 15, 2003 3:37 pm
If you use ldr =, you need to make a constant pool. Try putting a line containing only .pool at the end of each assembly language function that you write, after the 'bx lr' or 'mov pc, lr' or whatever you're using to return.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#10787 - torne - Mon Sep 15, 2003 4:22 pm
Except he's using inline asm, and there's nowhere you can really put a pool, unless you jump over it yourself.. you'd think GCC would be able to work it out.
#10797 - slip - Tue Sep 16, 2003 1:34 am
It's not causing any trouble AFAIK. Should I jump over it or just leave it?
#10800 - torne - Tue Sep 16, 2003 10:43 am
Disassemble your function and see what has happened to the pool. It will probably be 12 bytes, i.e. three garbage instructions. Those will be getting executed, unless GCC has done something improbably cunning. =)
#10808 - tom - Tue Sep 16, 2003 3:19 pm
slip wrote: |
It's not causing any trouble AFAIK. Should I jump over it or just leave it? |
then you're lucky - for this time =)
either jump over it, just to be sure, or do what torne said (look at gcc's assembly output. i'd be interested to see it too).
#10858 - slip - Thu Sep 18, 2003 4:59 pm
Well, I disassembled it using VisualBoy Advance; there are three lines of 'crap':
Code: |
tsteq r0, #0x0
tsteq r0, #0x4
andeq r3,r0,r12, ror r4
|
I don't know thumb, tst is test I'm guessing? What does this code actually do?
This does present a small problem, since this code is in a speed-critical spot.
#10861 - torne - Thu Sep 18, 2003 5:29 pm
It's not thumb, it's ARM.
tst ANDs its arguments and sets flags. All three of those instructions carry the eq condition, so the crap will only get executed if Z is set. The structure of your function is such that Z is always clear when the loop exits (if it were set, the bge would have taken the loop again), so it happens, by extremely lucky coincidence, that those instructions will never execute. They still have to go through the pipeline, however, which will take 3*(time to fetch a 32-bit instruction), where the time depends on which type of memory you are reading from. If this code is running from IWRAM, it'll take three extra cycles. If it's running from ROM, you're looking at more like 15 extra cycles. Also, if you ever change the logic of your function (such that the Z flag could be set) or the addresses of your globals (such that the crap instructions are not the same) then it could break.
You need to fix it, especially if it's speed-critical code. Even branching over the pool will take SOME time that would otherwise not be spent.
If you don't mind gross, gross hackery, then define a dummy function just after that one, which consists of inline asm that just says .pool. This should force gcc to dump the pool to a region which is both close enough, and not actually in the path of execution. The only alternative I can see would be to declare this function in a separate asm file; I have never been sure why people use inline asm for functions which contain no C code. =)
#10869 - tom - Thu Sep 18, 2003 7:23 pm
torne wrote: |
I have never been sure why people use inline asm for functions which contain no C code. =) |
maybe because they're masochists =)
#10872 - torne - Thu Sep 18, 2003 7:41 pm
I think this might be it. I've never used inline asm for anything more than getting an int3 instruction into the middle of a function (on Intel that is). Is there some mystical thing that you can do easily with inline asm that you can't with separate .S files?
#10876 - FluBBa - Thu Sep 18, 2003 8:38 pm
So when are we going to optimize this code?
As Slip said it's in a speed critical spot.
_________________
I probably suck, my not is a programmer.
#10883 - tom - Fri Sep 19, 2003 8:10 am
oh, plenty of suggestions have been brought up:
- jump over it (unnecessary performance hit)
- place a .pool in a dummy function after the function (ugly hack, imagine what a nightmare code becomes when it's full of such things...)
- let it be and hope the code never breaks =)
- use an external .S file where you have full control over everything, including where you place your pools.
- or, if you absolutely want to use inline assembly, why don't you tell gcc to pass skyColor and floorColor to the asm block as you did with pDrawBuffer?
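A minimal sketch of that last option, assuming GCC extended asm with named operands. The colour arrives in a register as an input operand, so no `ldr =` and no literal pool is needed. fill_words and clear_halves are hypothetical names, uint32_t stands in for u32, and the asm path is only compiled for ARM-state builds; anywhere else a plain C loop does the same job.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t skyColor;
uint32_t floorColor;

/* Hypothetical helper: store `color` into `words` consecutive words at dst. */
static void fill_words(uint32_t *dst, uint32_t color, size_t words)
{
    if (words == 0)
        return;
#if defined(__arm__) && !defined(__thumb__)
    __asm__ volatile(
        "1:\n\t"
        "str   %[c], [%[p]], #4\n\t"   /* store, then advance the pointer */
        "subs  %[n], %[n], #1\n\t"     /* count down and set flags */
        "bne   1b\n\t"                 /* loop until the count hits zero */
        : [p] "+r"(dst), [n] "+r"(words)
        : [c] "r"(color)               /* the value, not its address */
        : "cc", "memory");
#else
    /* Portable fallback with the same behaviour. */
    while (words--)
        *dst++ = color;
#endif
}

/* Top half in skyColor, bottom half in floorColor. */
void clear_halves(uint32_t *buf, size_t half_words)
{
    fill_words(buf, skyColor, half_words);
    fill_words(buf + half_words, floorColor, half_words);
}
```

Because GCC loads the globals itself before the asm block, it also decides where any pool it needs ends up.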
#10885 - torne - Fri Sep 19, 2003 10:34 am
As for optimisation, throw the whole function away and use DMA. It will be much faster.
#10887 - FluBBa - Fri Sep 19, 2003 10:58 am
I was more into something like this:
Code: |
clearmore:
str r2,[r1]
sub r1,r1,#4
cmp r1,r0
bge clearmore
|
Which can be optimized to:
Code: |
clearmore:
str r2,[r1],#-4
cmp r1,r0
bge clearmore
|
And for even more speed the loop can be unrolled.
Shouldn't we be helping each other write good asm instead of just helping people get started with asm?
#10890 - torne - Fri Sep 19, 2003 12:55 pm
That is indeed an optimisation; however, the fastest way is still to throw it away and use DMA.
Also, for that particular chunk of code, I would write
Code: |
clearmore:
str r2,[r1]
subs r1,r1,#4
bge clearmore
|
but that's not very interesting as that's the same speed. Shame you can't post-decrement-and-compare-to-zero, it'd be a nice optimisation for loops. Though if you go down that road too far you'll just end up with Intel assembler again, and everyone will go back to wanting to hang themselves rather than deal with assembly. =)
Other optimisations to apply to that code, other than just using DMA:
1) Replace the first add instruction with add r1,r1,#0x4B00 and throw away the mov instruction just before it. This is done later in the function, so why not here too? =)
2) Do multiple stores; put the background colour into, say, r0-r7, then use stmia to write 8 words at once. Though technically there's no reason to do this when you could use DMA.
#10892 - tepples - Fri Sep 19, 2003 3:21 pm
torne wrote: |
2) Do multiple stores; put the background colour into, say, r0-r7, then use stmia to write 8 words at once. Though technically there's no reason to do this when you could use DMA. |
Clearing memory is faster with stmia than with DMA, as stmia doesn't have to repeatedly re-read the background color from the stack. In addition, stmia is interruptible after each set of words written and DMA is not, which can become important in multiplayer communications.
#10893 - torne - Fri Sep 19, 2003 3:26 pm
DMA from a fixed-source is slower than stmia? I tried it and didn't find that, but my test was pretty vague.. is there a thread somewhere which discusses DMA speed? I'm quite willing to believe that I'm wrong...
#10898 - tepples - Fri Sep 19, 2003 3:44 pm
torne wrote: |
DMA from a fixed-source is slower than stmia? I tried it and didn't find that, but my test was pretty vague.. is there a thread somewhere which discusses DMA speed? |
Try the Search button.
Quote: |
I'm quite willing to believe that I'm wrong... |
Given zero wait states, each stmia instruction takes two cycles plus one for each word. DMA takes 2 for each word plus however long it takes to set up the control registers. Thus, an unrolled stmia loop can be faster than DMA for clearing memory, but DMA is still generally faster for copying memory.
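Plugging the numbers from this thread into those figures gives a feel for the gap. This is only back-of-envelope arithmetic under the assumptions already stated (zero wait states, fully unrolled stmia, register loading and loop overhead ignored); the function names and the setup_cycles parameter are invented for the sketch.

```c
#include <stdint.h>

/* Unrolled stmia fill: each instruction costs 2 cycles plus 1 per word. */
static uint32_t stmia_fill_cycles(uint32_t words, uint32_t words_per_stm)
{
    uint32_t full = words / words_per_stm;   /* full-width stmia instructions */
    uint32_t rest = words % words_per_stm;   /* words left for a final, shorter stmia */
    uint32_t cycles = full * (2u + words_per_stm);
    if (rest != 0)
        cycles += 2u + rest;
    return cycles;
}

/* DMA fill: 2 cycles per word plus whatever the register setup costs. */
static uint32_t dma_fill_cycles(uint32_t words, uint32_t setup_cycles)
{
    return 2u * words + setup_cycles;
}
```

For the 0x4B00-byte top half (4800 words) with 8 words per stmia, that works out to 600 * 10 = 6000 cycles, against 2 * 4800 = 9600 plus setup for DMA.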
#10901 - torne - Fri Sep 19, 2003 4:30 pm
Oops. Sorry; should've searched. I tried some more tests and I see it now; must've screwed something up before. Thanks.
#10951 - slip - Sun Sep 21, 2003 10:27 am
I was under the impression that using inline means you don't have to branch off somewhere which would mean faster??
I'm not real good at ASM and know it. I've only written two blitting routines and that small piece there. I'll have a play around with some of the things you've suggested =) thanks...
The code is in my game Target: Mainframe; SimonB posted the news for me the other day. If you've played it you'll probably pick up that it's a little slow. I was thinking about what I can do to speed it up. After reading some of the suggestions for optimizing I'm going to work on my blitters and see if I can gain some performance. Thanks again.
I didn't want to pass floorColor and skyColor as params, because it would mean they'd have to be pushed onto the stack when they were global anyway??
#10954 - torne - Sun Sep 21, 2003 2:50 pm
You don't have to branch to enter the assembly code, but you still have to branch to enter your function! Using inline assembler and declaring a function as inline are not the same thing.
A function that's declared as inline will have its contents copied to every place it's called; it will be put in-line at every call site. This avoids a branch, which is faster, but makes your code bigger, unless the function is only called once.
Inline assembler is included where it appears in the code; if it is a smaller part of an otherwise C function, then the net effect compared to calling an assembler function that's declared in a separate file is the same; it's put inline, which saves a branch. However, if your function consists *solely* of an inline asm block, you save nothing; you still have to branch to get to the function in the first place.
If you want the function to be inlined, thus avoiding a branch, then you should either put the inline asm block directly into the calling function, or declare the function itself as inline. However, neither of those will help you with your pool problem.
In practice, the added cost of having to branch to a function is small in many cases. Avoid premature optimisation; don't worry about inlining things unless you have already tried it without and found that it's too slow. =)
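For illustration, a minimal sketch of declaring the function itself as inline, written in plain C rather than asm so it stands alone. fill_half and draw_background are hypothetical names, not anything from slip's game, and whether GCC actually inlines the call depends on the optimisation level.

```c
#include <stddef.h>
#include <stdint.h>

/* Declared inline: the compiler may paste this body into each call site,
   so no branch-and-return is needed to reach it. */
static inline void fill_half(uint32_t *dst, uint32_t color, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++)
        dst[i] = color;
}

/* The caller is still an ordinary function: you branch to get HERE,
   but not again for each fill_half if it was inlined. */
void draw_background(uint32_t *buf, size_t half_words,
                     uint32_t sky, uint32_t floor_color)
{
    fill_half(buf, sky, half_words);
    fill_half(buf + half_words, floor_color, half_words);
}
```

The same body could hold an inline asm block instead of the C loop; the inline keyword on the function and the asm inside it are independent choices.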