gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > faster normalizer

#171888 - ritz - Thu Dec 31, 2009 3:37 pm

Using the reciprocal instead of division is well known, but I thought I'd post my DS version of it for those who are interested. Compared to the libnds normalizef32(), the code below is more than 2x faster on the hardware:

Code:
STATIC_INL
void v_fnormalizef32 (int32* a)
{
   int32 m;

   m = sqrtf32(dotf32(a,a));

   // get the reciprocal without calling divf32()
   // DIV_32_32 takes half the clks of DIV_64_32, and the 1.0 numerator fits in 32 bits

   REG_DIVCNT = DIV_32_32;
   while (REG_DIVCNT & DIV_BUSY) ;
   REG_DIV_NUMER_L = 4096 << 12;
   REG_DIV_DENOM_L = m;
   while (REG_DIVCNT & DIV_BUSY) ;

   m = REG_DIV_RESULT_L;

   // multiply reciprocal instead of 3 divides

   a[0] = mulf32(a[0],m);
   a[1] = mulf32(a[1],m);
   a[2] = mulf32(a[2],m);
}

I'm posting here instead of supplying a patch because there is a very tiny precision hit using this method.
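
If anyone wants to measure the precision hit themselves, a quick-and-dirty comparison against normalizef32() would look something like this (untested sketch; the helper name, loop count and random ranges are just made up for illustration):

Code:
#include <nds.h>
#include <stdlib.h>

// returns the largest per-component difference, in raw .12 units,
// between normalizef32() and v_fnormalizef32() over random vectors
int32 v_precision_check (void)
{
   int32 v1[3], v2[3];
   int32 worst = 0;
   int   i, j;

   for (i = 0; i < 1000; i++)
   {
      // random .12 components, roughly [-8.0, 8.0)
      v1[0] = v2[0] = (rand() & 0xFFFF) - 0x8000;
      v1[1] = v2[1] = (rand() & 0xFFFF) - 0x8000;
      v1[2] = v2[2] = (rand() & 0xFFFF) - 0x8000;

      if (dotf32(v1,v1) == 0) continue;   // skip degenerate vectors

      normalizef32(v1);      // libnds divide version
      v_fnormalizef32(v2);   // reciprocal version above

      for (j = 0; j < 3; j++)
      {
         int32 d = abs(v1[j] - v2[j]);
         if (d > worst) worst = d;
      }
   }

   return worst;
}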

I actually played around with that magic inverse sqrt hack but it was only a tad faster. The hardware sqrt on the DS is very, very good. Here's my (unused) inverse sqrt code for fun:

Code:
STATIC_INL
int32 v_finvsqrtf32 (int32 f32x)
{
   union { float f; int32 i; } u;

   // initial guess via the float bit hack
   u.f = f32tofloat(f32x);
   u.i = 0x5f375a86 - (u.i >> 1);

   // one Newton-Raphson step back in .12 fixed point: y*(1.5 - 0.5*x*y*y)
   int32 uff32 = floattof32(u.f);
   return mulf32(uff32,(6144-mulf32((f32x>>1),mulf32(uff32,uff32))));
}

This would be a lot faster if I had the time/smarts to figure out using fixed instead of floats for this (there's actually code in Clutter that figured this out to some degree). Obviously, f32tofloat() is dragging. Even so, this code is really quite fast.
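
For what it's worth, using it to normalize a vector looks something like this (untested sketch, hypothetical name; same shape as the divider version above, just with the reciprocal coming from the inverse sqrt):

Code:
STATIC_INL
void v_fnormalize_invsqrt (int32* a)
{
   // reciprocal length straight from the inverse sqrt, no divide at all
   int32 m = v_finvsqrtf32(dotf32(a,a));

   a[0] = mulf32(a[0],m);
   a[1] = mulf32(a[1],m);
   a[2] = mulf32(a[2],m);
}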

Anyway, happy new year to all :) Don't drink and drive!

#171891 - DiscoStew - Thu Dec 31, 2009 5:55 pm

To add to this: because the hardware sqrt and division units work independently of the CPU, you can streamline normalization by letting those operations run at the same time as other CPU work. For this to work, though, you need your list of vectors ready for normalizing before you start.

I'll have to look through my backups for the actual code, but I used this technique back when I was trying to reduce the processing time for calculating normals on a smooth-skinned model on the NDS.
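
Roughly the idea looks something like this (just an untested sketch of the overlap, not my actual code; the batch helper name is made up):

Code:
STATIC_INL
void v_fnormalizef32_batch (int32 (*v)[3], int count)
{
   int32 len, r;
   int   i;

   if (count <= 0)
      return;

   // length of the first vector (the sqrt unit is separate from the divider)
   len = sqrtf32(dotf32(v[0], v[0]));

   for (i = 0; i < count; i++)
   {
      // kick off the reciprocal divide for vector i
      REG_DIVCNT = DIV_32_32;
      while (REG_DIVCNT & DIV_BUSY) ;
      REG_DIV_NUMER_L = 4096 << 12;
      REG_DIV_DENOM_L = len;

      // overlap: while the divider works on vector i, the CPU and the
      // sqrt unit already compute the length of vector i+1
      if (i + 1 < count)
         len = sqrtf32(dotf32(v[i + 1], v[i + 1]));

      while (REG_DIVCNT & DIV_BUSY) ;
      r = REG_DIV_RESULT_L;

      v[i][0] = mulf32(v[i][0], r);
      v[i][1] = mulf32(v[i][1], r);
      v[i][2] = mulf32(v[i][2], r);
   }
}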
_________________
DS - It's all about DiscoStew

#171895 - sajiimori - Thu Dec 31, 2009 9:27 pm

You can be slightly more parallel by setting up most of the div while the sqrt is busy.
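
E.g. something like this (off the top of my head, untested; it drives the sqrt registers directly instead of calling sqrtf32(), which is what lets the div setup happen while the sqrt unit is still busy):

Code:
STATIC_INL
void v_fnormalizef32_par (int32* a)
{
   int32 m;

   // start the hardware sqrt of the squared length
   // (this mirrors what sqrtf32() does internally, minus the final wait)
   REG_SQRTCNT = SQRT_64;
   while (REG_SQRTCNT & SQRT_BUSY) ;
   REG_SQRT_PARAM = ((int64)dotf32(a,a)) << 12;

   // set up most of the divide while the sqrt unit is busy
   REG_DIVCNT = DIV_32_32;
   while (REG_DIVCNT & DIV_BUSY) ;
   REG_DIV_NUMER_L = 4096 << 12;

   // only the denominator has to wait for the sqrt result
   while (REG_SQRTCNT & SQRT_BUSY) ;
   REG_DIV_DENOM_L = REG_SQRT_RESULT;
   while (REG_DIVCNT & DIV_BUSY) ;
   m = REG_DIV_RESULT_L;

   a[0] = mulf32(a[0],m);
   a[1] = mulf32(a[1],m);
   a[2] = mulf32(a[2],m);
}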

Since the code is non-trivial, I'd suggest un-inlining it to reduce cache misses in the average case. (A cache miss -- loading 8 ARM instructions -- takes longer than a divide.) For the special case where it's called in a loop, a separate inline version can be provided, e.g. v_fnormalizef32_inl.