gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Need help with div-function

#7969 - Lupin - Sun Jun 29, 2003 5:24 pm

This function uses 2 LUTs to cover nearly all 32bit integer values by using approximitations. It seems to work fine with small numbers, but as soon as I want to divide larger numbers it doesn't work, that is because I need to shift the value down after multiplying (see the last 2 instructions).

If you notice any part of the code you'd optimize, please tell me if there's any way possible to optimize it.

Code:
   @Negative numbers are not working atm
   cmp   r1, #0      @check the denomerator against 0
   movlt   r5, #1      @r5 stores if the number is negative or positive
   rsblt   r1, r1, #0   @turn the number positive

   mov   r3, r1, lsr #16      @Index into divtab (r1 /4096)
   mov   r2, r3, lsl #16      @Shift it back 16 bits
   sub   r2, r1, r2      @The difference

   ldr   r4,.DivDat      @load divtab
   ldr   r5,.DiffDat      @load difftab
   ldr   r6,[r4,r3]      @r6 = divtab[r3]
   ldr   r7,[r5,r3]      @r7 = difftab[r3]
   
   mul   r7, r2, r7      @difference * difftab[r3]
   add   r4, r6, r7      @1/r1 = divtab[r3] + r2*difftab[r3]

   @Here I would need an 32b*32b=64b multiplication, because the value gets quite big here....
   mul   r0, r4, r0      @r4*r0 (r0*(1/r1))
   @..but here I shift it 20 bits right and it should again fit into an 32bit register
   mov   r0, r0, lsr #20      @Shift r0 20 bits right
   bx   lr


My intent on writing this function is to create an fast divide function wich covers all real integers without using swi or an lookup table wich is larger then 1mb :)

I would like to solve my problem without using an 32b*32b=64b multiplication if possible, because I think it's slow (maybe slower then an software divide :))

#7973 - DekuTree64 - Sun Jun 29, 2003 6:40 pm

For the last 2 instructions, use
Code:

mull  r0, r1, r4, r0
mov   r0, r0, lsr #20
orr   r0, r0, r1, LSL #12

According to the ARM manual, mull only takes one more cycle than mul, and then the orr only takes one cycle too, so that's not bad at all.

#8009 - Lupin - Mon Jun 30, 2003 5:04 pm

Thanks, but it seems like I need an 32x32=64 multiply here too:

mul r7, r2, r7 @difference * difftab[r3]
add r4, r6, r7 @1/r1 = divtab[r3] + r2*difftab[r3]


Did you mean "smull" by "mull"? mull doesn't seem to exist. The first two registers used by smull are the low and high words, right?

#8013 - DekuTree64 - Mon Jun 30, 2003 7:19 pm

Ah, sorry about that, yes, it is smull, and yes, the first 2 regs are the low/high words of the result.
And for optimizations, you don't really need 2 tables. Just use something like
Code:

   ldr   r4,.DivDat      @load divtab
   add   r4, r4, r3, LSL #2 @ldr doesn't do the shift for you
   ldr   r5,[r4]      @r5 = divtab[r3]
   ldr   r7,[r4, #4] @divtab[r3+1]
   sub   r7, r7, r5
   smull r6, r7, r2, r7
   add   r4, r5, r6, LSR #16  @since r2 was 16 bit fp, and both table vals were 20 bit, you have to correct the result of that mul before adding
   add   r4, r4, r7, LSL #16   @also, r4 is now the reciprocal, so if you're going to divide by this number a lot of times, make a function like this, that returns r4 right now, and another function to do a 12.20 fp mul to use on it (EDIT: if you do this, do that cmp/rsbne at the end of this function on the reciprocal before returning it)
   smull r0, r2, r4, r0
   mov    r0, r0, LSR #20
   orr   r0, r0, r2, LSL #12
   cmp   r1, #0
   rsbne r0, r0, #0 @I think you forgot this before
   bx lr


Hopefully that works, but if not, then maybe you can get some ideas from it anyway^^
Actually, you might even be able to use some clever shifting trickery to use smlal and get those 2 adds for free (smlal adds the result of the mul to whatever's in the 2 result regs), but I think it would involve at least 2 instructions, and smlal takes one more cycle than smull, so it wouldn't be worth it.
Also, I've never used anything like this, but I think you could get away with using
Code:

   ldr   r5,[r4, r3, LSL #2]!      @r5 = divtab[r3] and write the address back to r4 to save that add
   ldr   r7,[r4, #4] @divtab[r3+1]

And then it should actually be FASTER than loading in a seprate table, since sub if faster than ldr.

Oh, and you probably just left this out for clarity, but don't forget to store/load r4-r7 at the start/end of that function

#8072 - Lupin - Tue Jul 01, 2003 4:19 pm

I'm sorry, but I don't understand this (it doesn't seem to work that way though). I need some more input about 64 bit numbers.

Lets say I want to shift an 64bit number, I couldn't just take the high-part and shift it, because the low part doesn't get affected, so how do I do it?

These 2 lines are the most confusing to me:

Why do you shift with 16 here?
add r4, r5, r6, LSR #16

Why do you add the difference between the numbers here?
add r4, r4, r7, LSL #16

Why do you do a left shift with 12 here?
orr r0, r0, r2, LSL #12

I'm sorry that I'm so dumb, but I couldn't get the point :(
The code works fine though until it comes to the first multiplication.

#8084 - DekuTree64 - Tue Jul 01, 2003 8:01 pm

Ok, I went to put that into my demo since I need a fast div for calcularing slopes of triangle edges, and it didn't work at all. So I messed with it a while and got it working. Here's the full code (note: according to ARM's APCS doc, r12 is free to corrupt, as well as r0-r3)
Code:

.global Div
.arm
.align 4
Div:
cmp r1, #0
rsblt r1, r1, #0
mov r12, r1, LSR #16
sub r1, r1, r12, LSL #16
ldr r3, =rcpTable
ldr r3, [r3] @this is because my rcpTable is a pointer, so I can generate it in RAM. Remove this line if your table is a precomputed array in ROM
ldr r2, [r3, r12, LSL #2]!
ldr r3, [r3, #4]
sub r3, r3, r2 @this was the most important thing I forgot
mov r12, r2, LSL #16
mov r2, r2, LSR #16 @now r12:r2 is a 64-bit number, shifted 16 extra bits for the multiply by r1 (the difference) (I used r1:r2 in mine, but it only does rcp, not the full div, so I had to change some regs around for the extra arg)
smlal r12, r2, r1, r3 @(table[denom >>16]<<16) + table[(denom >> 16) + 1] * (denom - int(denom)) <-this last term is 16.16, hence the shift by 16 on the first term
mov r1, r12, LSR #16
orr r1, r1, r2, LSL #16
smull r2, r3, r0, r1 @numer * rcp(denom)
mov r0, r2, LSR #20
orr r0, r0, r3, LSL #12
rsblt r0, r0, #0 @flags are still set from cmp at start
bx lr

That should work, unless I made a typo or something. And all without having to store/load any regs^^
Also, you can get more accuracy if you use a 30-bit rcp table. Just change those last shifts to 30 and 2, instead of 20 and 12.

#8089 - Lupin - Tue Jul 01, 2003 9:28 pm

cool, this function works pretty good!

How did you design your lookup table? For me only 16/16 works for the last 2 shifts and I think I have 30bit accuracy, since the highest value (second one) is 0x40000000.

I create my table like this:
((1 / y<<16) * 1073741824) * 65536)

If we would change 16 to 24 or so we would be able to cover an much higher range of numbers with an less sized table, if you want to cover all integer numbers you need an array with 32768 entries...

I didn't fully understand your code, but it looks very good to me! I'm still at learning asm :(

good job! Thank you!

*edit* i forgot about that, you said:
Quote:
...but don't forget to store/load r4-r7 at the start/end of that function

why do I do this? And how do I do this? Why did you left that out?

#8092 - DekuTree64 - Wed Jul 02, 2003 12:00 am

I meant cause you were using r4-r7 in your original one, and for an ASM function to work together with C functions, you need to follow the ARM procedure calling standard, which says you can corrupt r0-r3, and r12, but all other registers have to be the same as they were when the function was called. So if you're writing a function that needs anything more than r0-r3 and r12, you need to store them on the stack. Normally that would be done with stmfd sp!, {r4-r7} at the start, and ldmfd sp!, {r4-r7} at the end (sp is another name for r13, which means stack pointer, and the fd after rthe ldm/stm means a full descending stack. 'Full' means the value it currently points to is used (so you shouldn't change it), and 'descending' means the stack grows downward). Be sure to store lr (r14) too, if you plan to call other functions from an ASM function. And since functions are allowed to corrupt r0-r3 and r12, either store them before calling a function, or carefully plan out your function to where they're not being used when the function is called.

As for my table, I just used
Code:

for(i = 0; i < 256; i++)
   rcpTable[i] = div(1 << 30, i);

since the screen is only 240 pixels wide, so I don't need any more than that. You could just keep looping to 32768 if you want, it shouldn't make any difference.
The first shift near the end of that gets it back to 30 bits of fraction after the 30 bit table * 16 bit difference multiplication. Then you multiply by the 16 bit numerator, so then you're back to 16+30=46 bits of fraction again, so you shift 30 off, and you're back to 16, like the original numbers.
Just keep writing in ASM and you'll get better at it. That function is pretty confusing, but that's just how it goes when you're trying to conserve registers and cycles at the same time.

#8133 - Lupin - Wed Jul 02, 2003 5:09 pm

Did you try to divide 64/64 for example? It results in 0...

32k entries is a bit much... too much for iwram though :)

I want to cover all 32bit numbers, because I wanted to use the divide for fixed point too, but I'm starting to dislike fixed point math and I think I will look for ways to live without it.

#8137 - DekuTree64 - Wed Jul 02, 2003 6:38 pm

That's strange, I tried dividing all the powers of 2 up to 128 by themselves and got 1 every time. You do mean (64 << 16) / (64 << 16), and not just 64 / 64, right? Numbers smaller than 1 << 16 get pretty inacurate, as you can see on a graph of 1/x, which goes quickly from 1 to infinity as it approaches 0 from 1, so for this method, picture a straight line drawn from the point (1, 1) to the point (0, infinity). Actually that would be pretty hard to picture, so just say some really big number, so the line will be going up at a really sharp angle. Numbers close to 0 won't be so bad, since the real graph of 1/x is going up at a really sharp angle there, but close to 1 will be pretty far off, since it's still curving upward more slowly. But the larger x gets, the slower the curve changes direction, so our linear interpolation gets more accurate.
Which is how you can divide by larger numbers without too big of a table. I just came up with this, so I don't know exactly how it would go in code (or how well I can explain it), but here's the theory: Every time you hit a power of 2, you double the step size of your table. Now for the explanation...

Make (for instance) the range 128-255 just like the current one, where the step size is 1. Then 256-511 would have a step size of 2, so like one table entry for 256, one for 258, one for 260, and you interpolate between those for numbers like 257, 259.9, just like the normal table, but taking the diffrence as denom-(denom>>17<<17), instead of >>16<<16, so you're cutting off one more bit of resolution. Then from 512 to 1024, you'd go 4 at a time, so 512, 516, 520, 524, and the difference would be denom-(denom>>18<<18), from 1024-2047, go 8 at a time, and so on. Then that also means for numbers less than 128, the step size will decrease, so 64-127 would have one entries for 64, 64.5, 65, 65.5, and so on. Then 32-63 would be 32, 32.25, 32.5, 32.75, 33... So you get more resolution as you get closer to 0, and by the time you get there, it will be really accurate. The table will still get really big though especially in the smaller numbers, so maybe it would be better to have from 2 to 127 with step size 1, and then 0-2 with 8 or so.
But that means for the range 256-511, you'd have 128 entries, since you're doing half as many, and then for 512-1023, you'd have 1/4 as many, so 128 again, then 1024-2047 1/8 as many, so 128 again. So with 7 bits (actually 8, but one is the sign) past the special-cased 0-255, that would mean 7 128 entry tables, or 896 entries to go all the way to 32767. Much better than 32768.
If that made any sense, and you want to try it out, write it in C first. If you try to do complicated stuff like that straight in assembly, you'll end up running around in circles until you give up. C is pretty easy to translate into ASM, so once you know exactly what you're trying to do, ASM is pure joy to write^_^