gbadev.org forum archive

This is something I've put off for a while now. As of right now, I've got the MODE 7 perspective working for my demo, but a lot of processing time is spent on the "lambda" being calculated for each scanline (not during each scanline), more specifically, the divide being used for it. Now, from the line of code I have from my demo....

lam = Asm_Divide(Cam_Pos.y << 12, BG_Cam_Incs.y);

...Where "Cam_Pos.y" is the height position of the camera, and "BG_Cam_Inc.y" is an incrementation of an initial value incremented/decremented with "BG_Cam_YAxis.y", which is affected by the cosine of theta. The bit-wise shift in it is merely for more precision. The only optimization I've done with it thus far is an added check if theta of the camera is 128. or, looking straight down, which in this case, would make the layer look as if MODE 7 wasn't being used at all, because at theta = 128, "BG_Cam_Incs.y" does not increment/decrement...

lam = Cam_Pos.y >> 3;

So far, I've tried other possible methods of optimization, like linker settings, putting the full code of the needed calculations into IWRAM, etc. The only other idea I've come up with for optimizing (though not implemented yet) is if "Cam_Pos.y" and theta do not change, a LUT can be processed in one frame, and then reused for following frames, but if either of these variables change, the LUT would have to be filled in again.

What I'm looking for is one single thing. To get rid of that divide per scanline. It isn't so bad if it is used a couple of times, but when used at max 160 times, it really limits anything else I can do, and even with the LUT idea, if any change is made to the camera's height or the camera's theta, the new lambda values would have to be calculated, and it could show among other things being processed.
The only other thing I can see is taking a different approach to calculating lambda using the same values being put into it. However, I have no idea where to begin, which is why I'm here to ask you all about it. The info I can give is that throughout the lambda calculation, the camera's height is constant, and "BG_Cam_Incs.y" starts off at a calculated value, of which then per scanline, it is incrementing/decrementing at a constant rate.

If you have any thoughts, or even a formula for a faster lambda calculation with minimum error, please fell free to post.
_________________
DS - It's all about DiscoStew

For Mode 7, you should only have to divide once per frame, by your zoom value.

Calculate (1 / z) before looping over your scanlines, then multiply by that value for every scanline.

There are a number of optimizations you can do on your division, but ultimately one division per frame shouldn't have a noticeable impact.

Dan.

Thanks for your suggestion poslundc, but I don't see where I can implement that though. My brain is pretty fried atm for just working on this.

To give a very close example of what I'm working with, my code is based off Cearn's Mode 7 example (not the one showing type A,B,C), of which he has lambda being calculated every scanline. Of course I've worked through a few optimizations, but the concept of how it works is still the same...

Code:

// calculates bg affine pameters for the next frame
// pa= lam*cos(phi) pc= lam*sin(phi)
//
// dx' = cam_pos + lam*C*b, where
// b = (L, T-ys, -D)
// dx'= (DX, 0, DY)
// get lambda from dx'.y, then DX=dx.x, DY= dx.z
//
// NOTE: heavy optimisations possible!
// Like compiling in ARM and putting it in IWRAM
void m7_aff_calc()
{
if(_m7_horizon >= VID_HEIGHT) // no visible floor, so wots the point?
   return;

int ii, ii0= _m7_horizon>=0 ? _m7_horizon : 0;

// lam*C*xp = r
// C*Xp: C= [cam.u cam.v cam.w]
// Xp = (xp, scanline (==ii), D)
// The multiplication can be done nicely with
// incremental offsets
FIXED dcx= _m7_cam.v.x, dcy= _m7_cam.v.y, dcz= _m7_cam.v.z;
FIXED lam= INT_MAX, cx, cy, cz;
cx= (_m7_cam.w.x<<M7_D_SH) + (ii0 - M7_TOP)*dcx;
cy= (_m7_cam.w.y<<M7_D_SH) + (ii0 - M7_TOP)*dcy;
cz= (_m7_cam.w.z<<M7_D_SH) + (ii0 - M7_TOP)*dcz;

BGAFF_EX *bga= &_m7_bgaff_ex[ii0];
FIXED pa, pc;

for(ii= ii0; ii<VID_HEIGHT; ii++)
{
   // I'm sure there's a faster way of getting lam
   // via trigonometry, but I'll have to check whether
   // it's as accurate.
   lam= DivSafe(_m7_cam.pos.y<<12, cy);// .12

   // having the scale might be useful later,
   // so store it somewhere for later retrieval.
   // pb will do, since it's unused anyway
   bga->pb= lam>>4;    // .8

   pa= (lam*_m7_cam.u.x)>>12;
   bga->pa= pa;
   bga->dx= _m7_cam.pos.x + M7_LEFT*pa - (lam*cx>>12);

   pc= (lam*_m7_cam.u.z)>>12;
   bga->pc= pc;
   bga->dy= _m7_cam.pos.z + M7_LEFT*pc - (lam*cz>>12);

   cx += dcx;    cy += dcy;    cz += dcz;
   bga++;
}
}

Heh, even he says in here about lam possibly being calculate by trig, but right now, I'm not even sure where to begin.
_________________
DS - It's all about DiscoStew

poslundc wrote:

For Mode 7, you should only have to divide once per frame, by your zoom value.

Calculate (1 / z) before looping over your scanlines, then multiply by that value for every scanline.

But unless you're looking straight down, the zoom changes non-linearly for every scanline.

DiscoStew wrote:

Heh, even he says in here about lam possibly being calculate by trig, but right now, I'm not even sure where to begin.

That would be eq 16 right below the code :P. You'd need a small arctan lut for the angles of each scanline and a high-resolution 1/sin(x) lut to make it work accurately. That way you'd to get rid off all divisions, save the one for focus length D.
Another thing that would cut down on the code is to fill in all the sines and cosines for the matrix elements, or use better division code, stuff like that.

Shoot, you guys are right; it's been too long since I've done Mode 7. I was thinking about sprite-placement, where you only need to calculate the reciprocal once for both rotational components.

I've looked at my old Mode 7 code, and what I did was create a secant lookup table, which gave me the reciprocal of the cosine of the angle I was viewing at, essentially handling two of the operations I needed to perform with a single LUT, and since the secant function is periodic I could easily decide the precision of the domain it needed to operate on. This may work for you as well.

Dan.

Cearn wrote:

That would be eq 16 right below the code :P. You'd need a small arctan lut for the angles of each scanline and a high-resolution 1/sin(x) lut to make it work accurately. That way you'd to get rid off all divisions, save the one for focus length D.
Another thing that would cut down on the code is to fill in all the sines and cosines for the matrix elements, or use better division code, stuff like that.

Ok, I'm beginning to work on this using your method and such (sorry, but a lot of it took me a while to take into concept. I'm good at math, but lousy at interpreting these advanced calculations). About the small arctan lut, I'm still a little unsure about how i goes. As you said here...

Code:

If β is the angle between (0, yp, -D) and (0, 0, -D), then tan β = yp/D.

...I am to make an LUT of 160 entries for the scanlines. Looking at it, if theta = 0 (of which I'm assuming where this default expression is concerned), and that the camera is looking at scanline 80, the top 80 entries would assume their β < 0, and the bottom 79 entires would assume their β > 0. If that is so, would the first 80 entries be just a mirror of the next 80 entries, but negative? I'm just trying to understand where you are coming from on this. What I figure, using your Excellut, is this formula (my D is 128, and x is 0 to 159 for scanlines)...

Code:

ATAN((x - 80) / 128) * (256 / pi)

Would that be correct?

With that, I'd have my β for each scanline, but what about using them for the COS() and 1/SIN() parts? My current COS() LUT is the same one made by your Excellut (interpreted using the SIN() LUT) that is 512/circle, but would I need to make a new one for just this, or can I keep it? As for the 1/SIN(), you propose that using 1024/circle is best, so a new LUT would need to be made, but with using β made for 512/circle, I can just do a (<< 1) to the final angle to use with that LUT, or would other adjustments need to be made?

Having 160 divides for MODE 7 is not something I want, even if the Asm divide I'm using is fast, and I want to make sure I'm going through this process of alterations carefully and correctly. Once I get this general method down, I can make some extra tuning whereever needed.

EDIT:
Eww, another problem. Got this formula for 1/SIN() for Excellut...

Code:

=1 / SIN(x * pi / 512)

...but the x=0 is not pretty. (1 / 0)
Can't divide by 0, and can't export the LUT with that problem for some reason.
_________________
DS - It's all about DiscoStew

DiscoStew wrote:

...I am to make an LUT of 160 entries for the scanlines. Looking at it, if theta = 0 (of which I'm assuming where this default expression is concerned), and that the camera is looking at scanline 80, the top 80 entries would assume their β < 0, and the bottom 79 entires would assume their β > 0. If that is so, would the first 80 entries be just a mirror of the next 80 entries, but negative? I'm just trying to understand where you are coming from on this. What I figure, using your Excellut, is this formula (my D is 128, and x is 0 to 159 for scanlines)...

Code:

ATAN((x - 80) / 128) * (256 / pi)

Would that be correct?

Pretty much. At least under the current conditions. The key point here is that you need looking angles for each scanline, which means arctan. The fact that they are antisymmetrical around line 80 is just a side effect of the configuration. Unless you intend to get these angles dynamically, it's really not important.
By the way, the symmetry is centered around line 80, not around the center of the screen (the border between lines 79 and 80) This is one of those round-off errors you can find in discretisation (which also occurs for affine objects). If you want to be completely correct, you should use 79.5, but I don't think anyone would really notice.

DiscoStew wrote:

With that, I'd have my β for each scanline, but what about using them for the COS() and 1/SIN() parts? My current COS() LUT is the same one made by your Excellut (interpreted using the SIN() LUT) that is 512/circle, but would I need to make a new one for just this, or can I keep it? As for the 1/SIN(), you propose that using 1024/circle is best, so a new LUT would need to be made, but with using β made for 512/circle, I can just do a (<< 1) to the final angle to use with that LUT, or would other adjustments need to be made?

You'll see that the β-lut will almost be linear with a slope of about 0.6. The fractional part is the problem, as you can't really use it as lut indices: you'd get a lot of aliased values that way. That was why I suggested a bigger lut: the higher resolution would correct for the truncation. However, now I'm thinking that you could also do linear interpolations between lut entries, using β and θ-&946; as fixed-point indices.

Code:

// General lerp for point xm between xa and xb
// ya= f(xa) ; yb= f(xb)
// w= (xm-xa)/(xb-xa)
// -> ym = ya + (yb-ya)*w

//! Example of LUT linear interpolation, with fixed point lut-indices (untested)
//! \param lut The look-up table
//! \param xm Fixed point indec
//! \param fshift Fixed-point shift
static inline int lut_lerp(const int lut[], u32 xm, int fshift)
{
int xa= xm>>fshift;
int ya= lut[xa], yb= lut[xa+1];
return ya + (yb-ya)*(xm-(xa<<fshift))>>fshift;
}

Something like that. You could use this for both the cos(β) and 1/sin(φ-β) lookups, as that's where the fractional nature of β is used. This does go at the expense of a few cycles, but inlining and modifying it for a specific purpose should still be a lot faster than all the divisions.

I suppose with this,the 512/circle might still be accurate enough and you wouldn't need a completely new cosine lut. You do need to be careful where you wrap around though: lerping between entries 511 and 512 would be bad. Or, as β is always constant for each scanline, you could create a 160 cos(β) lut and be done with the whole thing.

You still need a 1/sine lut, because that's how you get rid of the divisions.

DiscoStew wrote:

Having 160 divides for MODE 7 is not something I want, even if the Asm divide I'm using is fast, and I want to make sure I'm going through this process of alterations carefully and correctly. Once I get this general method down, I can make some extra tuning whereever needed.

EDIT:
Eww, another problem. Got this formula for 1/SIN() for Excellut...

Code:

=1 / SIN(x * pi / 512)

...but the x=0 is not pretty. (1 / 0)
Can't divide by 0, and can't export the LUT with that problem for some reason.

Heh, oops. VB isn't so forgiving when it comes to 1/0 as C, that's where this is coming from. Or with overflow problems btw, as I found out a little while ago. You could try inputting a bogus value for it and disabling 'calc @ export' so it doesn't refill all the cells, but I'll see what I can do about it.

Oh, and as a general note: never blindly use someone else's trig equations. With slightly different definitions for the angles (in both direction and origin) sines turn into cosines and pluses into minuses and the whole thing goes pearshaped. They are an accident waiting to happen. I don't care if you got them from people who can pull Nobel Prize winning theories out of their ass on cue, always check in a comfortable environment before actual use. This is why the page was done with matrices, where you don't have this problem as much.

I will admit, when directed to your example of figuring out lambda without a divide per scanline, I dove right into it, hoping (yes, hoping) that when I took the equations and functions and placed them into my library, I'd get results close to what you described. How sad that wasn't true. My entire Mode 7 went from looking very nice to trying to figure out what the heck was on the screen. I then began looking through the code, hoping to find something, but couldn't. I even went and within the program itself, wrote the correct lambda from the divide and the odd lambda from the trig functions into an unused space in the GBA memory. I saw the comparision of the values, and they were not pretty. So not pretty as to see that when the direction of one set of lambda values went one way, the trig lambda went in all directions.
So, I took your advice and took both, and began experimenting under Excel (also used Excellut for more than just making LUTs). When comparing the figures you displayed with code I had, it hit me. The code I displayed for figuring out the Arctan for each scanline was wrong, as I was thinking incorrectly as to the direction of an incrementing angle would be. I was thinking of going from looking at the horizon to looking straight at the graphics plane as incrementing, but your example of figuring out β had it incrementing the other direction. So I changed the equation to make the Arctan LUT from...

Code:

ATAN((x - 80) / 128) * (256 / pi)

...to...

Code:

ATAN((80 - x) / 128) * (256 / pi)

...and now, I began to actually see the Mode 7 effect to a degree. I was still using 512/circle for both figuring out Arctan and 1/Sin because I wasn't focusing on accuracy for that time, but I increased them to 1024/circle, and made correct adjustments to working with .8 and .16 Fixed numbers to reduce the error and accuracy of rounding off, and now it really looks great. Of course there are the side effects to working it (liek having the camera really close to the perspective plane but looking off to the horizon, it shears really bad), but I'm not too worried because the range of which I'm using this Mode 7 effect won't go close to these extreme ranges. I have yet to try lerp'ing, or even making a scanline COS LUT (which I should since it would be very easy to implement), but if needs be, they will be done.

Thanks again for your help.
_________________
DS - It's all about DiscoStew

For most common uses of rotation backgrounds on the GBA and DS, you don't need to mess with trig per scanline. Just use similar triangles (one division per scanline) to determine the distance from the camera to the center of each scanline, and then determine the texels per pixel and scanline texel origin from that.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

Hmm, somehow I missed this reply. Think it might have saved me a bit of trouble if I hadn't.

DiscoStew wrote:

The code I displayed for figuring out the Arctan for each scanline was wrong, as I was thinking incorrectly as to the direction of an incrementing angle would be. I was thinking of going from looking at the horizon to looking straight at the graphics plane as incrementing, but your example of figuring out β had it incrementing the other direction. ...

I noticed this too. The problem is that under my definition of β, it should have the same sign as θ in the cosecant (=1/sin) lookup. I messed up there, sorry about that. Well, I did warn you not to blindly trust untested equations :P Your solution of changing the definition of β works too, of course.

I've also ran some additional tests. I now have a 0.16fixed beta_lut[160] and a 20.12fixed csc_lut[513], which works very well once you apply the lerp to the cosecant lookup. It's not even necessary for the cos(β). Anyway, I went from an original 20k-56k cycles (depending on angle and altitude) to 9k-19k (depending on angle, independent of altitude).
Current code looks something like the snippet below. It uses lerps for both cos and csc, but these can be modified easily, and also filling in the camera matrix vectors, which allows for a few simplifications as well.

Code:

inline CODE_IN_IWRAM int trig_lam(u16 beta, u16 theta)
{
FIXED cosb, csc, ya, yb, dx;
u32 xa;

// get cos(beta) (no need for lerp here actually)
xa= beta>>7;
ya= lut_cos(xa); yb= lut_cos(xa+1);
dx= beta-(xa<<7);
cosb= ya + ((yb-ya)*dx>>7);

// get cosecant (remove lerp here and get crap (if you only have a small lut))
theta += beta;
xa= theta>>7;
ya= csc_lut[xa]; yb= csc_lut[xa+1];
dx= theta-(xa<<7);
csc= ya + ((yb-ya)*dx>>7);

return cosb*csc>>8;
}

// Lambda via trigonometry.
CODE_IN_IWRAM void m7_aff_calc_trig()
{
if(_m7_horizon >= VID_HEIGHT) // no visible floor, so wots the point?
return;

int ii, ii0= _m7_horizon>=0 ? _m7_horizon : 0;

FIXED xc= _m7_cam.pos.x, yc= _m7_cam.pos.y, zc=_m7_cam.pos.z;
FIXED ys0, lam;

BGAFF_EX *bga= &_m7_bgaff_ex[ii0];

FIXED cf, sf, ct, st, lcf, lsf;
// cos/sin of phi (f) and theta (t)
cf= _m7_cam.u.x; sf= _m7_cam.u.z;
ct= _m7_cam.v.y; st= _m7_cam.w.y;
for(ii= ii0; ii<VID_HEIGHT; ii++)
{
lam= yc*trig_lam(beta_lut[ii], _m7_cam.theta<<7)>>15;
lcf= lam*cf>>12; lsf= lam*sf>>12;
bga->pa= lcf; bga->pc= lsf;

ys0= (ii-M7_TOP)*st - M7_D*ct; // lambda?Rx?b
bga->dx= xc + lcf*M7_LEFT - (lsf*ys0>>8);
bga->dy= zc + lsf*M7_LEFT + (lcf*ys0>>8);

// hack that I need for fog. pb and pd are unused anyway
bga->pb= lam>>4;
bga++;
}
}

I'm sure you can get a few extra things from manual asm, but don't think it'll be that much.

gbadev.org forum archive

Coding > lambda optimizing....

#84537 - DiscoStew - Tue May 23, 2006 5:25 am

#84540 - poslundc - Tue May 23, 2006 5:51 am

#84550 - DiscoStew - Tue May 23, 2006 8:34 am

#84567 - Cearn - Tue May 23, 2006 12:49 pm

#84600 - poslundc - Tue May 23, 2006 4:50 pm

#84880 - DiscoStew - Thu May 25, 2006 8:17 am

#84923 - Cearn - Thu May 25, 2006 7:56 pm

#85239 - DiscoStew - Sun May 28, 2006 3:58 am

#85249 - tepples - Sun May 28, 2006 5:12 am

#85813 - Cearn - Fri Jun 02, 2006 1:19 pm