gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

ASM > Getting started the flat way...

#168371 - kohlrak - Sat Apr 25, 2009 10:50 pm

A friend and I were working on an x86 kernel, and I've personally found that the hardware documentation and compatibility issues with it are just annoying. I thought it'd be nice to try working with a system with more standardized and better documented hardware (and I don't mean more documentation, I mean being able to look at examples or maybe a page [instead of 100+] just to use one piece of hardware).

Fortunately for me, my favorite assembler has an ARM version. Not so fortunately for me, nothing here is intended for an assembler with the kind of syntax that I wish to be using. I'm not familiar with using the GNU assembler (I hate it, quite frankly, because its syntax is rather weird and it uses a linker).

So, can anyone do me a favor and come up with a flatter example of hello world or something for the DS? For those of you who don't quite understand what I'm asking, a nice equivalent in x86 asm would be the following:

Code:

org 0x7c00 ;This is where the PXE loader puts us
;zero all segment registers except CS, since no one knows what the BIOS fills them with
   xor   ax,ax
   mov   ds,ax
   mov   es,ax
   mov   fs,ax
   mov   gs,ax
   mov   ss,ax

   ;Setup text mode
   mov ah, 0xF
   int 0x10 ;Get the screen page
   mov si, hello
   mov ah, 0xE ;teletype output

   ;Print the string a letter at a time
@@:   ;anonymous label
   lodsb
   or al, al
   jz @f ;next anonymous label
   int 0x10
   jmp @b ;back a label

@@:
   hlt
   jmp @b ;halt loop

hello db "Hello, world!", 0

#168372 - Dwedit - Sat Apr 25, 2009 11:20 pm

Hello world is a little tricky, since the GBA has neither a built-in font nor built-in text drawing code.
For a hello world program you would need to do something like this (pseudocode, not actual equate names).
Bring your own font and palette.

memset(vram, 0, vramsize)
memcpy(vram + font_location, font_data, fontsize)
memcpy(palette, palette_data, palettesize)
reg_dispcnt = mode_0 | bg0_enable
reg_bg0cnt = (font_location/16384)*4 + (map_location/2048)*256

drawtext(const char* text, x, y, color) {
    addme = (color << 12) + FIRST_CHAR
    u16 *map_addr = vram + map_location
    u16 *text_addr = map_addr + x + y*32
    while (*text != 0) {
        *text_addr = *text + addme
        text_addr++
        text++
    }
}
drawtext("Hello World", 0, 0, 0)


An easier ASM example would be one which turns the screen blue.
It would just need to do this:
reg_dispcnt=0
palette[0]=0x7C00
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#168373 - headspin - Sun Apr 26, 2009 12:42 am

Asm version of hello world here. There are a couple more examples at the RetroBytes Portal website.
_________________
Warhawk DS | Manic Miner: The Lost Levels | The Detective Game

#168374 - Dwedit - Sun Apr 26, 2009 12:49 am

Also known as Hello World - Special battery killer edition. No vblank waits seen in the ARM7 code here.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#168375 - headspin - Sun Apr 26, 2009 1:08 am

Dwedit wrote:
Also known as Hello World - Special battery killer edition. No vblank waits seen in the ARM7 code here.


Okay updated :)
_________________
Warhawk DS | Manic Miner: The Lost Levels | The Detective Game

#168376 - kohlrak - Sun Apr 26, 2009 4:28 am

Alright, thank you. If I ever manage to get this working with fasmarm, I'll make an example... however, should most of the RAM already be zeroed? Oh well, once I get it working with my assembler (due to time restrictions it could take a while) I'll have everything posted here. I've heard complaints about the trouble of learning ARM asm because of having to spend so much time learning just the assembler, so maybe this'll help a bit (this one has lots of features, but you don't have to understand any of them to start out).

#168382 - Dwedit - Sun Apr 26, 2009 2:49 pm

Anyway, my recommendation is to use C for as much of the non-performance-critical code as you can, then use sweet sweet ASM code placed in fast memory for the fast stuff. Compilers are somewhat good at generating code, but if you need a function to run 4 times faster, that's what ASM is for.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#168385 - kusma - Sun Apr 26, 2009 5:29 pm

Dwedit wrote:
Compilers are somewhat good at generating code, but if you need a function to run 4 times faster, that's what ASM is for.

Actually, no. ASM optimizations can usually only buy you around 20-30% performance. Unless of course you're really bad at writing fast C code. And in such cases, going ASM isn't really the best first step ;)

#168387 - Dwedit - Sun Apr 26, 2009 7:23 pm

Okay, 4 times faster if you're moving from Thumb code in slow memory to ARM code in fast memory.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#168395 - Miked0801 - Mon Apr 27, 2009 5:05 am

Depends on what the C code is doing and how many times it's called. I've seen many cases where hand-coded asm doubles or triples the performance of a section of code. That's not counting moving code to fast RAM.

#168396 - kusma - Mon Apr 27, 2009 9:04 am

Dwedit wrote:
Okay, 4 times faster if you're moving from Thumb code in slow memory to ARM code in fast memory.

Yeah, but that has pretty much nothing to do with ASM, -marm and IWRAM_CODE ;)

#168397 - kusma - Mon Apr 27, 2009 9:19 am

Miked0801 wrote:
Depends on what the C code is doing and how many times it's called. I've seen many cases where hand-coded asm doubles or triples the performance of a section of code. That's not counting moving code to fast RAM.

In my experience, this is usually only the case for poorly written C-code. There are some extremely rare cases where you can get something like this, but those are so rare that they aren't worth discussing on a general basis, IMO.

#168398 - kohlrak - Mon Apr 27, 2009 10:38 am

Quote:
Actually, no. ASM optimizations can usually only buy you around 20-30% performance. Unless of course you're really bad at writing fast C code. And in such cases, going ASM isn't really the best first step ;)


I've written two hello world programs on Linux, both using assembly. One used headers I created from scratch, the other made an object file for a linker. That very simple difference resulted in the GNU linker building a binary 10 times the size of the perfectly fine one I made (and I didn't even have it optimized at that point). I used the strip tool, but it didn't take much off. However, since this is ARM and it's simpler, I would assume they can make a better compiler.

Either way, I find HLLs like C too tedious to work with (you constantly have to make sure your types are converted [even though the computer never actually converts them] properly, or the compiler won't compile). The best example is with sockets... I remember doing Winsock in C++, and just trying to access the IP address in the one structure made me never, ever want to do it in C++ again.

#168399 - kusma - Mon Apr 27, 2009 11:00 am

kohlrak wrote:
I've written two hello world programs on Linux, both using assembly. One used headers I created from scratch, the other made an object file for a linker. That very simple difference resulted in the GNU linker building a binary 10 times the size of the perfectly fine one I made (and I didn't even have it optimized at that point). I used the strip tool, but it didn't take much off.

Did you notice how I was talking about performance and not size? ;)

Yes, when it comes to code size, hand-optimizing is often more effective than compilers, since this is not a field that brings as much (perceived?) value to a compiler as performance does.

But do keep in mind that your measurements don't mean that linked code is 10 times as big as directly assembled code; it just means that the linker uses more space up front. Also, since the GBA/NDS doesn't run ELF files, you won't get the same overhead. Playing around with the linker settings can also often save you quite a bit of space. Nevertheless, as long as you link in the CRT, you'll bring in the startup sequence, and that costs some bytes (in exchange for a working runtime environment).

#168401 - Miked0801 - Mon Apr 27, 2009 5:40 pm

Quote:

In my experience, this is usually only the case for poorly written C-code


And this is a most wonderful topic of discussion in and of itself. What is poorly written code, normal code, and fast code :)

I've found that when writing 'fast' C code, it becomes so compiler/platform specific that it might as well be assembler anyway, with its very stringent code alignment/funky accesses/etc. Case in point:

Poor performance, yet easily read code: (PS: non-compiled, written-on-the-fly code, user beware)
Code:

// assume src and dest are 4-byte aligned
void fastcopy(u8 *src, u8 *dest, u32 size)
{
    for (u32 i = 0; i < size; i++)
    {
        dest[i] = src[i];
    }
}


// better code (by speed, not size)
Code:

// assume src and dest are 4-byte aligned
void fastcopy(u32 *src, u32 *dest, u32 size)
{
    for (u32 i = 0; i < size / 4; i++)
    {
        *dest++ = *src++;
    }

    u32 bytesRemaining = size & 0x03;
    if (bytesRemaining)
    {
        u8 *byteDest = (u8 *)dest;
        u8 *byteSrc = (u8 *)src;
        for (u32 i = 0; i < bytesRemaining; i++)
        {
            *byteDest++ = *byteSrc++;
        }
    }
}


And then we get into the realm of platform-specific optimizations, where DS code will probably be different from GBA code and both will be different from any other platform. Where assembler will probably be easier to read/understand than C code; where the compiler is basically getting in the way of your code. Don't get me wrong: this type of stuff should be in WAY less than 1% of any application, and only where its immediate usefulness outweighs the time spent over the YEARS of you having to maintain it.

(and yes, we can get another 50-200% of performance over the better code, depending on how much the compiler sucks.)

For instance, in DS land, cache is king. Huge loop unrolls kill. On GBA, what cache? :) Unroll until you run out of RAM/ROM. Yet unrolled code is ugly to read and harder to maintain in C land without even uglier macros (or inlines).

Now, we can write a better version with jump tables into unrolls, or perhaps self-modifying code on the loop lookups, etc. Both of these will get us a speed boost by eliminating the need to track the loop, but both need to be aligned well on DS or caching kills. And self-modifying code probably kills the DS anyway.

But you get the point. How far do you go in optimizations before you are better off with asm?

#168403 - kusma - Mon Apr 27, 2009 6:55 pm

Miked0801 wrote:
But you get the point. How far do you go in optimizations before you are better off with asm?

I'll see your copy-loop, and raise you a memcpy - or CpuFastCopy(). There, not compiler specific, yet optimized. A copy-loop isn't a good example here anyway, since the algorithmic complexity is so much lower than usual program logic.

Writing assembly is easy. Writing GOOD assembly is very tedious, mostly due to register allocation and scheduling (do you remember which ARM9 instruction sequences cause interlocks off the top of your head? The compiler does...). You can often write C code that is almost equivalent to ASM without having to deal with register allocation and scheduling. And unless you spend hours reading instruction timings and allocating register lifetimes, your compiler will still usually beat you by a good margin.

As a clever man once said, "Your compiler can write better code in microseconds than you can do in hours." Sure, those extra 20-30% might be worth it in your most timing-critical loop; I never denied that. It might be just the little push you need to go from 30 to 60 FPS. But believing that code automatically becomes significantly faster by writing it in assembly is plain stupid.

#168404 - kusma - Mon Apr 27, 2009 6:56 pm

admins: perhaps this sub-thread should be split from the original topic?

#168406 - elwing - Mon Apr 27, 2009 8:01 pm

Hum, what about "...the root of all evil" and other good advice? Writing a game is complex enough; reuse as much as you can from the libs you have. E.g. use the TTE console from tonclib rather than trying to set up your own, and work on your game logic rather than reinventing the wheel or spending 95% of your time on code running less than 5% of the program's lifetime. Once everything is done, do some profiling, do the high-level optimisation and, if you're skilled enough, the low-level assembly optimisation of the critical methods (like memcopy, though in my opinion you should reuse existing assembly when it's already good).

#168425 - Miked0801 - Tue Apr 28, 2009 11:10 pm

kusma, you give our compiler too much credit. 99.9% of the time it does an acceptable job, and sometimes it even does a good job. But it rarely does a great job, and that is where a little tweaking can go a long way in VERY specific and carefully chosen places.

Yes, I chose a copy loop because it was something I could easily expand upon and code decently on the fly without thinking about it. And yet, the 'optimized' CpuFastCopy() that we get is not as fast as it could be. Copies tend to be a top-10 cycle eater when profiling, so it is a valid discussion point.

And have you ever looked at memcpy? The one I've seen used a byte copy loop, bleh.

#168426 - kusma - Tue Apr 28, 2009 11:47 pm

Miked0801 wrote:
kusma, you give our compiler too much credit.

And you're giving it too little credit.
Quote:
99.9% of the time it does an acceptable job, and sometimes it even does a good job.

Absolutely not. In my experience, 95% of the time it does an awesome job (at no effort), and 2% of the time it does a poor job - usually because the code doesn't allow it to optimize further.
Quote:
But it rarely does a great job, and that is where a little tweaking can go a long way in VERY specific and carefully chosen places.

You and I must be from different planets. Either you are much, much smarter than me, or you're writing bad C code and/or have been using bad compilers. Or you're exaggerating to make a point.

Quote:
Yes, I chose a copy loop because it was something I could easily expand upon

Which is exactly why I think it's a bad example - it isn't expanded to a practical level of complexity.
Quote:
and code decently on the fly without thinking about it. And yet, the 'optimized' CpuFastCopy() that we get is not as fast as it could be.

No, but it is fast enough for all practical purposes. Optimizing further is too much effort for the gain.
Quote:
Copies tend to be a top-10 cycle eater when profiling, so it is a valid discussion point.

But there's always an optimized copy around somewhere, unless you're designing your own CPU or something.
Quote:
And have you ever looked at memcpy? The one I've seen used a byte copy loop, bleh.

I've looked at multiple, yes. Some were byte copies, so we swapped them, and some were fast. But these things are something PROFILING should tell you, not random rewriting in assembly.

#168428 - kohlrak - Wed Apr 29, 2009 1:55 am

Quote:
Yes, when it comes to code size, hand-optimizing is often more effective than compilers, since this is not a field that brings as much (perceived?) value to a compiler as performance does.


When memory is the bottleneck, size optimization usually is speed optimization. Though that all depends on the cache.

Quote:
But do keep in mind that your measurements don't mean that linked code is 10 times as big as directly assembled code; it just means that the linker uses more space up front. Also, since the GBA/NDS doesn't run ELF files, you won't get the same overhead. Playing around with the linker settings can also often save you quite a bit of space. Nevertheless, as long as you link in the CRT, you'll bring in the startup sequence, and that costs some bytes (in exchange for a working runtime environment).


Dynamic linking, though, is typically sufficient. Typically a user doesn't mind the startup/load time as much as he or she may mind the download time... Although one may feel the difference is marginal, the more libraries used, the greater the effect.

Quote:
But you get the point. How far do you go in optimizations before you are better off with asm?


Let's not forget that some algorithms are much easier in assembly than C.

Quote:
Writing assembly is easy. Writing GOOD assembly is very tedious, mostly due to register allocation and scheduling (do you remember which ARM9 instruction sequences cause interlocks off the top of your head? The compiler does...). You can often write C code that is almost equivalent to ASM without having to deal with register allocation and scheduling. And unless you spend hours reading instruction timings and allocating register lifetimes, your compiler will still usually beat you by a good margin.


Let's not forget that compilers also prefer the stack over handling specific memory locations... Some may see this as good, but it can be cumbersome when you have to do an sp+offset while doing some fancy math (extra uops).

Quote:
As a clever man once said, "Your compiler can write better code in microseconds than you can do in hours." Sure, those extra 20-30% might be worth it in your most timing-critical loop; I never denied that. It might be just the little push you need to go from 30 to 60 FPS. But believing that code automatically becomes significantly faster by writing it in assembly is plain stupid.


No one ever claimed that. Though really, writing good assembly code is easy after some experience. There are different degrees of optimization: there's no optimization, there's ridiculous optimization, and then there's the stuff in between. Even a beginner can remember simple optimization tricks that compilers don't, such as left shifting by 1 to multiply by 2. But nobody, not even the compiler, counts cycles, which is what some enthusiasts could do.

Quote:
Hum, what about "...the root of all evil" and other good advice? Writing a game is complex enough; reuse as much as you can from the libs you have. E.g. use the TTE console from tonclib rather than trying to set up your own, and work on your game logic rather than reinventing the wheel or spending 95% of your time on code running less than 5% of the program's lifetime. Once everything is done, do some profiling, do the high-level optimisation and, if you're skilled enough, the low-level assembly optimisation of the critical methods (like memcopy, though in my opinion you should reuse existing assembly when it's already good).


The biggest misconception about assembly is that assemblers don't have "include."

Quote:
You and I must be from different planets. Either you are much, much smarter than me, or you're writing bad C code and/or have been using bad compilers. Or you're exaggerating to make a point.


I agree with him; I find it does a very bad job half the time, and I don't mean just size, either.

Quote:
Which is exactly why I think it's a bad example - it isn't expanded to a practical level of complexity.


How about a searching algorithm? I've found people prefer sorting plus a linear search simply because pointers are too tedious in HLLs, whereas a nice bucket-and-chains algorithm can handle an unsorted search pretty effectively if one has a half-decent hashing algorithm.

Quote:
But there's always an optimized copy around somewhere, unless you're designing your own CPU or something.


Is there for the GBA? How about the DS?

Quote:
I've looked at multiple, yes. Some were byte copies, so we swapped them, and some were fast. But these things are something PROFILING should tell you, not random rewriting in assembly.


I've seen a very nifty one on the x86...

Code:

;Assumes rdi and rsi are already pointing where they belong. If not, simple lea...
mov ecx, [arr_sz]
shr ecx, 3
rep movsq


Probably not the fastest example (huge uop counts, I hear), but it's something I came up with off the top of my head. Optimized for a 64-bit processor, assuming that the thing being copied isn't over 4 GB.

EDIT: Come to think of it, on a Linux computer, the code could probably be simplified to...

Code:
shr ecx, 3
rep movsq
ret


because of the calling convention...

Edit2:

That's only seven bytes... while calling a function would actually be 6 bytes... At that point, it would be faster and more efficient to make that an inline function (since the 7-byte function plus the call would be 13 bytes due to the streaming effect, whereas inlining would be only 1 extra byte per call site; and since the ret is actually 1 byte, inlining would cost about as much as calling, in code size).

#168436 - kusma - Wed Apr 29, 2009 10:34 am

kohlrak wrote:
When memory is the bottleneck, size optimization usually is speed optimization. Though that all depends on the cache.

When INSTRUCTION memory is the bottleneck, yes. The NDS has a separate I-cache, and the GBA doesn't have a cache at all. Total performance in this context is often a trade-off between cache misses and unrolling. Unrolling can both reduce branch overhead and give better scheduling.

kohlrak wrote:
Dynamic linking, though, is typically sufficient. Typically a user doesn't mind the startup/load time as much as he or she may mind the download time... Although one may feel the difference is marginal, the more libraries used, the greater the effect.

I can't say I understand your point here. The GBA/NDS doesn't "really" use dynamic linkage, although there exist some hacks that try to implement it.

kohlrak wrote:
Let's not forget that compilers also prefer stack over specific memory location handling... Some may see this good, but it can be cumbersome when you have to do an sp+offset when doing some fancy math (extra uops).

The compiler uses the stack if you tell it to. In general the stack is better than a fixed memory location due to reentrancy, but if you want your variable at a fixed location, sure, use global variables or static local variables.

Also, ARM CPUs are RISC and don't break instructions down into uops.

kohlrak wrote:
No one ever claimed that. Though really, writing good assembly code is easy after some experience. There are different degrees of optimization: there's no optimization, there's ridiculous optimization, and then there's the stuff in between.

No one said it in those words, but it sure as hell was implied ;)

kohlrak wrote:
Even a beginner can remember simple optimization tricks that compilers don't, such as left shifting by 1 to multiply by 2.

Ehh... Compilers do these tricks:
Quote:
$ echo "unsigned int div2(unsigned int a) { return a / 2; }" > test.c && arm-eabi-c++ -O2 -S test.c && cat test.s
.cpu arm7tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file "test.c"
.text
.align 2
.global _Z4div2j
.type _Z4div2j, %function
_Z4div2j:
.fnstart
.LFB2:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
mov r0, r0, lsr #1
bx lr
.LFE2:
.fnend
.size _Z4div2j, .-_Z4div2j
.ident "GCC: (devkitARM release 25) 4.3.3"


I'm using the division-by-two optimization instead of multiplication by two, since multiplication by two is just as fast with an add (given that it can't be paired with an existing instruction in ARM mode), so it's less obvious what to pick.

kohlrak wrote:
But nobody, not even the compiler, counts cycles, which is what some enthusiasts could do.

Where do you get your arguments from? Compilers DO count cycles; I've written a couple, I should know...

kohlrak wrote:
The biggest misconception about assembly is that assemblers don't have "include."

But if you include assembly code, it's not optimized for the context (no inlining + optimizations; macros are your best hope...), so it ends up being suboptimal. Not that that's always important; it's just something your compiler would have done for you more or less at no effort.

Quote:
I agree with him; I find it does a very bad job half the time, and I don't mean just size, either.

You DON'T seem to agree with him. Miked0801 said that it did an acceptable job 99.9% of the time, not a very bad job half of the time, as you claim. If Miked0801 and I are from different planets, the two of US are from different galaxies. I'm guessing you're blindly trusting some rumors you heard from your asm friends, or you're just picking random numbers out of the air.

kohlrak wrote:
How about a searching algorithm? I've found people prefer sorting plus a linear search simply because pointers are too tedious in HLLs, whereas a nice bucket-and-chains algorithm can handle an unsorted search pretty effectively if one has a half-decent hashing algorithm.

Algorithms and data structures are a very important aspect of performance no matter what language you're programming in. You're claiming that these are often more tedious to write in HLLs than assembly; I'd say the opposite.

If you think pointers are tedious in HLLs, it sounds to me like you (or the people you're referring to) aren't experienced in HLLs. Are you sure those experiences are the right ones to base generalizations about a language on, in that case?

kohlrak wrote:
Is there for the GBA? How about the DS?

Yes. They are in the BIOS (although the one in the NDS BIOS has a pretty serious performance bug). ARM memcopies have been beaten to death already.

#168471 - Miked0801 - Wed Apr 29, 2009 7:45 pm

Our compiler is famous for crap like this:

Code:

for(int i=10; i>0; i--)
{
    foo--;
}


becomes:

Code:

    mov  r0, 10

loop
    sub  foo, 1
    sub  r0, 1
    cmp  r0, 0
    bne  loop


and obviously, foo-- is oversimplified and could be changed to a single subtract, but assume there is a little more going on.

ffs, use subs already in my tight loop. Changing the loop to a while loop gives different, worse-optimized code. Changing it to a do/while loop changes it to different code again, sometimes better, sometimes worse. This kind of output doesn't matter except in very special places, and in those places I am forced to resort to asm.

You do know that I am not a big fan of Asm, right kusma? Given a choice, I'll use C++ containers and meta algorithms every time to solve a problem. Asm optimization, while fun, is tedious to maintain. Why? Because as soon as I write it, I am stuck supporting it for the next 10+ years.

Anyways, my other favorite:

Code:

int doSimpleMath(const int x, const int y)
{
    return x+y;
}


usually becomes something like:

Code:

    push {r4, r5}
    add  r0, r0, r1
    pop  {r4, r5}
    bx   lr


Why must you push/pop unused registers all over the place? Yes, inlining usually fixes this, but there are functions that are called too often to inline, and I get sick of the stupid, useless stack crap. Adding funky pragmas sometimes fixes this, depending on the compiler (leaf and such), but grrr.

There are plenty of other head-scratching things my compiler does. Metrowerks makes a substandard C/C++ compiler. And I'm stuck using it.

#168475 - kohlrak - Thu Apr 30, 2009 2:23 am

Quote:
When INSTRUCTION memory is the bottleneck, yes. The NDS has a separate I-cache, and the GBA doesn't have a cache at all. Total performance in this context is often a trade-off between cache misses and unrolling. Unrolling can both reduce branch overhead and give better scheduling.


That's the nice thing about ARMs... They have separate memory for instructions and data... Remember, my knowledge and points will most likely apply to x86, since I have no experience with ARM yet.

Quote:
I can't say I understand your point here. The GBA/NDS doesn't "really" use dynamic linkage, although there exist some hacks that try to implement it.


Sorry, x86 reference...

Quote:
The compiler uses the stack if you tell it to. In general the stack is better than a fixed memory location due to reentrancy, but if you want your variable at a fixed location, sure, use global variables or static local variables.


How often does one need to do this, though? If you're only ever going to use one instance of a function at a time (which is typical unless you're threading [and even there it's rare] or recursing), reentrancy is unimportant.

Quote:
Also, ARM CPUs are RISC and don't break instructions down into uops.


Sorry, another x86 reference, since I'm talking about C vs ASM in general, not specifically about ARM CPUs.

Quote:
No one said it in those words, but it sure as hell was implied ;)


Perhaps to you, but any experienced assembly programmer wouldn't make such an assumption.

Quote:
Ehh... Compilers do these tricks:


Not quite sure I can follow that code, so I cannot comment on it (GNU syntax is difficult for me to read).

Quote:
I'm using the division-by-two optimization instead of multiplication by two, since multiplication by two is just as fast with an add (given that it can't be paired with an existing instruction in ARM mode), so it's less obvious what to pick.


If a bit shift is available, that's typically a good choice. It's usually much faster than an add or subtract, since it's a little more native to the processor's numbering system.

Quote:
Where do you get your arguments from? Compilers DO count cycles; I've written a couple, I should know...


I would find that very rare, especially since cycle counts can vary between revisions, or more often between newer versions (like ARM9 over ARM7, but I don't have any examples to point out, since I don't have experience with ARM).

Quote:
But if you include assembly-code, it's not optimized to the context (no inlining + optimizations, macros is your best hope...), so it ends up being suboptimal. Not that that's always important, it's just something your compiler would have done for you more or less at no effort.


Code:
macro cjmp x, y, [cond, loc]{
   cmp x, y
   j#cond loc }

cjmp eax, ebx, a, loc1, b, loc2


Not the most practical example, but it assembles to what you'll see below:

Code:
   cmp eax, ebx
   ja loc1
   jb loc2


Could that be optimal enough? If you're like me, you'd keep macros that you find very useful, which is essentially what a compiler is for... it does common things for you. The difference is that it saves you the trouble of having to invent those things yourself. However, in the long run, who gains?

Quote:
You DON'T seem to agree with him. Miked0801 said that it did an acceptable job 99.9% of the time, not a very bad job half of the time, as you claim. If Miked0801 and I are from different planets, the two of US are from different galaxies. I'm guessing you're blindly trusting some rumors you heard from your asm friends, or you're just picking random numbers out of the air.


Or, perhaps, I'm using experience. However, let me point out the word "acceptable," in that it's different from the word "good." Some people prefer using buses over cars because, for their situation, they find the wait time acceptable, since most of their work would be done on a laptop or such and time isn't of the essence for them.

Quote:
Algorithms and data structures are a very important aspect of performance no matter what language you're programming in. You're claiming that these are often more tedious to write in HLLs than assembly; I'd say the opposite.


Well then, we disagree. Typecasting is a real pain as well, and that occurs often...

Quote:
If you think pointers are tedious in HLLs, it sounds to me like you (or the people you're referring to) aren't experienced in HLLs. Are you sure those experiences are the right ones to base generalizations about a language on, in that case?


pointer, *pointer, &pointer... That's HLL...

pointer, [pointer]... That's assembly...

There's a reason why many programming courses avoid things like pointers and the mystical "goto."

Quote:
Yes. They are in the BIOS (although the one in the NDS BIOS has a pretty serious performance bug). ARM memcopies have been beaten to death already.


Well then, for the GBA, use the one in the BIOS, in which case there's no need to use an HLL just to use it. ASM can use it too... Then we can spend time on other fancy algorithms that need optimizing.

Quote:
Adding funky pragmas sometimes fixes this, depending on the compiler (leaf and such), but grrr.


That takes a lot of time (more than it would take to program) just to learn a compiler... I thought that was one of the rants against asm (not here, though, which shows that people here are too smart to buy into the stupidity)... Then again, I'd personally turn around and throw the same argument at just about every assembler I've ever used (but fortunately there are assemblers out there that are very quick to learn).

Quote:
There are plenty of other head scratching things my compiler does. Metrowerks makes a substandard c/c++ compiler. And I'm stuck using it.


Remember, getting a product out there faster and less costly is more important than user satisfaction. Since RAM is so cheap these days, people will not only have abundant amounts but, if necessary, will buy more to use our product; since it's so easy to upgrade, that's only a minor inconvenience. [/cliche]