#18403 - Lupin - Thu Mar 25, 2004 4:02 pm
I know there are C functions to do this, but I am generally against using them because I use asm :)
Well, I now have a function that divides the number by 10 and takes the remainder of the division. But I think a divide might be too slow for something that's so simple... do you know a better way to convert integers to strings?
I had another idea just as I wrote this post: we could create an array where each entry is a power of ten, then repeatedly subtract the entries from the number I want to convert and check whether the "lower than" flag gets set; if it does, add the entry back and write the digit count to the appropriate string position. It might be a little slow on higher numbers, though (if one digit of the input is 9, we have to subtract that array entry 9 times).
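Roughly this in C, just to sketch the idea (untested, made-up names, and it leaves the leading zeros in):
Code: |
/* Repeated-subtraction digit extraction: for each power of ten, subtract
   it until the value would go below zero, counting how many times it fit.
   No divides, but up to 9 subtractions per digit. */
void itoa_subtract(unsigned int n, char *out)   /* n in 0..65535 */
{
    static const unsigned int pow10[5] = { 10000, 1000, 100, 10, 1 };
    int i;
    for (i = 0; i < 5; i++)
    {
        char digit = '0';
        while (n >= pow10[i])      /* the "went lower than" check */
        {
            n -= pow10[i];
            digit++;
        }
        *out++ = digit;            /* leading zeros are left in */
    }
    *out = '\0';
} |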
_________________
Team Pokeme
My blog and PM ASM tutorials
#18411 - poslundc - Thu Mar 25, 2004 6:05 pm
Check out my posprintf page. It essentially provides the same functionality as the sprintf routine, letting you print integers (as well as a couple of other things) to strings. It also lets you do things like specify the length of the string to print, pad it with leading zeros, etc.
It's 100% Thumb ASM (so it doesn't consume any of your IWRAM), it's efficient, it's open-source and it uses a very fast base-conversion routine (all additions and shifts for numbers up to +/-65535).
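Usage is the same style as sprintf; something along these lines (see the page for the exact format flags):
Code: |
#include "posprintf.h"   /* from the posprintf package */

void ShowScore(int score, char *buffer)
{
    /* same calling style as sprintf; %d covers the +/-65535 range above */
    posprintf(buffer, "Score: %d", score);
} |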
Dan.
#18413 - Lupin - Thu Mar 25, 2004 8:05 pm
Uhm, if I understand your documentation right, you use the same method that I explained above... but I am searching for a faster way to do this (if there is any :))
How many cycles/instructions does a BIOS divide take?
_________________
Team Pokeme
My blog and PM ASM tutorials
#18415 - sajiimori - Thu Mar 25, 2004 8:13 pm
He just said it's all additions and shifts for a wide-ish range of numbers.
#18417 - DekuTree64 - Thu Mar 25, 2004 8:33 pm
Here's mine:
Code: |
@ itoa: convert the int in r0 (between -65535 and 65535) to a decimal string.
@ Returns a pointer to the string in r0.
.global itoa
.thumb
.align 2
.thumb_func
itoa:
push {r4, r5}
mov r5, r0              @ remember the original value for the sign
bpl itoa_plus           @ the mov set the flags, so skip the negate if positive
neg r0, r0
itoa_plus:
ldr r4, =tempStr        @ global var, declared elsewhere
add r4, #15             @ build the string backwards from the end of the buffer
mov r1, #0
strb r1, [r4]           @ null terminator
ldr r1, =52429          @ 52429/2^19 ~ 1/10, exact for all 16-bit inputs
itoa_loop:
mov r3, r0
mul r3, r1
lsr r3, r3, #19         @ r3 = r0 / 10, by reciprocal multiplication
lsl r2, r3, #3
add r2, r3
add r2, r3              @ r2 = r3 * 10, built from a shift and two adds
sub r0, r2              @ r0 = r0 % 10
add r0, #'0'            @ convert the digit to ASCII
sub r4, #1
strb r0, [r4]           @ store it in front of the digits written so far
mov r0, r3              @ carry on with the quotient
bne itoa_loop           @ until it reaches zero
cmp r5, #0
bpl itoa_return
mov r0, #'-'            @ original value was negative, prepend the sign
sub r4, #1
strb r0, [r4]
itoa_return:
mov r0, r4              @ return a pointer to the first character
pop {r4, r5}
bx lr
.pool |
It's based on reciprocal multiplication to do the divide by 10, then subtracting the result*10 (which can be done with shifts and adds) to get the remainder. You need a global called tempStr for it; since the routine starts writing at tempStr + 15 and works backwards, give it 16 bytes, although the string itself is at most 7 bytes (the longest number you can use is -65535, which is 6 chars + null). You could handle 32-bit numbers if you use ARM, since then you get a 32x32=64 multiply for a more accurate reciprocal (0x1999999A would be the one, I think), but I only use it for HP displays and stuff, which are never over 9999 anyway.
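In C the digit loop boils down to something like this (just a sketch for reference, not what I actually use; the names are made up):
Code: |
/* Reciprocal-multiply n/10 and n%10 for n in 0..65535: 52429/2^19 is just
   above 1/10, and the error is small enough that the truncated result is
   exact for any 16-bit input. */
void itoa16(unsigned int n, char *out)
{
    char buf[6];
    char *p = buf + 5;
    *p = '\0';
    do
    {
        unsigned int q = (n * 52429u) >> 19;       /* n / 10 */
        unsigned int r = n - ((q << 3) + q + q);   /* n % 10: n - q*10 via shift+adds */
        *--p = (char)('0' + r);
        n = q;
    } while (n != 0);
    while ((*out++ = *p++) != '\0')
        ;   /* copy the digits (and terminator) to the destination */
} |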
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#18419 - poslundc - Thu Mar 25, 2004 8:36 pm
Lupin wrote: |
Uhm, if I understand your documentation right, you use the same method that I explained above... |
Oh I don't. Not even remotely. Numbers up to +/-65536 don't require anything except a bunch of shifts and adds.
Quote: |
How many cycles/instructions does a BIOS divide take? |
It depends on the numbers being input into it. You can beat it with your own custom ASM routine or other techniques, but it beats the pants off of gcc's default divide routine.
Dan.
#18422 - Miked0801 - Thu Mar 25, 2004 8:44 pm
BIOS divides take roughly 70-250 cycles to execute, depending on the terms being divided. ~60 cycles of this are overhead getting into the SWI interrupt. Write your own, place it in IWRAM, don't loop, and you can get a function that executes in half this time or better.
#18425 - poslundc - Thu Mar 25, 2004 8:59 pm
Miked0801 wrote: |
~60 cycles of this are overhead getting into the SWI interrupt. |
Whaaaaat...? Is it really that bad?
I mean, what? Branch/link to the SWI handler, branch to the divide routine, push a bunch of registers, pop 'em off at the end and branch back, right? That can't be more than 20-30 cycles to get in and get out.
Dan.
#18431 - Miked0801 - Thu Mar 25, 2004 10:51 pm
33 in, 34 out. That's why I hate using SWI stuff.
SWI - 3 cycles for the branch
3 cycles to branch again to the correct handler
24 cycles to push registers, change the CPU state to system mode, and set up the jump (it is an interrupt, after all!)
3 cycles to actually get into the divide routine.
//------------------------------------
Do the divide somewhat slowly here (5 cycles per bit instead of 3, since it uses a loop to save space instead of unrolling)
//-------------------------------------
3 cycles at the end of the divide to return
28 cycles to pop the registers, reset the CPU state, and prepare to return
3 cycles for the return jump.
So the best case is around 67 + 20 or so cycles, about 87 total.
The worst case is around 67 + 20 + 5*30, about 240 cycles.
You can do sooo much better with an ARM IWRAM routine.
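For reference, the basic scheme is just shift-and-subtract, something like this in C (illustrative only, not the actual BIOS code); an unrolled ARM version in IWRAM gets rid of the per-bit loop overhead:
Code: |
/* Plain shift-and-subtract (restoring) division, one quotient bit per
   step.  A fast IWRAM routine would unroll these 32 steps and skip the
   leading zero bits of the numerator. */
unsigned int my_udiv(unsigned int num, unsigned int den)
{
    unsigned int quot = 0, rem = 0;
    int bit;
    for (bit = 31; bit >= 0; bit--)
    {
        rem = (rem << 1) | ((num >> bit) & 1);   /* bring down the next bit */
        if (rem >= den)
        {
            rem -= den;                          /* it fits: subtract... */
            quot |= 1u << bit;                   /* ...and set this quotient bit */
        }
    }
    return quot;   /* rem ends up holding num % den */
} |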
#18433 - poslundc - Thu Mar 25, 2004 11:12 pm
Wow, I didn't think it was nearly that bad. Maybe I should consider writing my own CPUFastSet while I'm at it.
Dan.
#18437 - Lupin - Thu Mar 25, 2004 11:21 pm
Hmm, I wonder why they even implemented the SWI functions. Some of them might be useful for hardware-specific stuff, but the arithmetic and compression functions seem to be beaten easily by some hand-written IWRAM code...
_________________
Team Pokeme
My blog and PM ASM tutorials
#18438 - Miked0801 - Thu Mar 25, 2004 11:36 pm
Quote: |
Wow, I didn't think it was nearly that bad. Maybe I should consider writing my own CPUFastSet while I'm at it.
|
We already have - though not for speed reasons. It's nice to have your own for error checking of the values passed in (asserts again.) But if you ever need a really fast copy/clear, then you should probably write your own copier. Of course, if you assign a 4-byte aligned, 4-byte-multiple size structure to another, the compiler will use ldm/stm for you, which is really nice. It can beat DMA on smaller assigns due to the overhead of setting up the DMA registers. In ARM, it would probably be smart enough to do 8 registers at a time, which is nice. Try it sometime and check out the output.
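For example (made-up struct, but this is the sort of thing I mean):
Code: |
/* 32 bytes, word-aligned, word-multiple size: a plain assignment of one
   of these typically compiles to ldm/stm bursts rather than a byte copy. */
typedef struct
{
    int x, y;
    int vx, vy;
    unsigned int flags[4];
} Entity;

void CopyEntity(Entity *dst, const Entity *src)
{
    *dst = *src;   /* check the compiler output for ldmia/stmia here */
} |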
#18439 - Miked0801 - Thu Mar 25, 2004 11:39 pm
And Lupin, I totally agree with you. If they couldn't do it right, why do it at all? The only reason to use the built-in stuff is if you are copying huge things where the overhead doesn't hurt, or when it's something not speed-critical (interruptVBlankWait().) Compression can be written faster and better, the math functions ditto, the utilities ditto. When I realized just what kind of code was there and the overhead to use it, I almost cried at the waste of all that 0 wait-state ROM...
#18440 - poslundc - Fri Mar 26, 2004 12:07 am
Miked0801 wrote: |
We already have - though not for speed reasons. It's nice to have your own for error checking of the values passed in (asserts again.) But if you ever need a really fast copy/clear, then you should probably write your own copier. Of course, if you assign a 4-byte aligned, 4-byte-multiple size structure to another, the compiler will use ldm/stm for you, which is really nice. It can beat DMA on smaller assigns due to the overhead of setting up the DMA registers. In ARM, it would probably be smart enough to do 8 registers at a time, which is nice. Try it sometime and check out the output. |
Bah, it'd be quicker to write my own.
The obvious advantage of the BIOS functions is that they don't consume IWRAM to get the 0-waitstate functionality, and they can be useful (I suppose) for stuff that ought to be fast but isn't necessarily time-critical. Why they couldn't write better algorithms for them... I dunno, deadlines or stupidity, I guess.
Dan.