#135430 - DiscoStew - Mon Jul 23, 2007 8:39 am
This is something I only recently got into while working on my Maya exporter, but the code I adapted for calculating normals turns out to be so CPU intensive that it slows the entire program down, and I was wondering if anyone has any suggestions.
As far as calculating the normals, I basically used the method found here....
http://www.spacesimulator.net/tut5_vectors_and_lighting.html
....but I edited out all the floats, switched to 20.12 fixed-point math and functions (crossf32 and normalizef32 in libnds), added support for quads, and allowed polygons that share a vertex to use different normal values (imagine a box). Everything works out fine, except that it runs pretty slowly compared to doing no normal calculation at all.
At first I thought it could just be the emulator (I'm using no$gba), as not everything is emulated perfectly for full accuracy and speed, but just commenting out the normal calculations made a world of difference, from pretty slow to full speed, as far as the program goes. At the moment I don't have any way to test this on hardware, but it's pushing me towards making a purchase.
Any suggestions for improving the normal calculations?
_________________
DS - It's all about DiscoStew
#135431 - DekuTree64 - Mon Jul 23, 2007 8:56 am
I don't know any specific tricks to speed up the normal calculation process itself, but you can get better rendering speed if you use display lists with normals included. And since I'm assuming you're generating all the normals when loading the model on the DS anyway, then all you're saving compared to doing it in the exporter is a bit of ROM space.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#135437 - kusma - Mon Jul 23, 2007 10:51 am
I usually only generate normals the general way (your link, minus the bugs it has, that is) for static objects, and transform them. For dynamic objects, it depends a bit.
For skinned meshes, you can skin the normals in the same way as you skin the positions.
For other deformations, you *could* construct tangent-space information and deform a very small triangle built from position, position + binormal, and position + tangent, then just use that triangle's normal for the vertex. I'm not sure whether this is faster for NDS games, but it fits a programmable vertex shader a lot better ;)
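Purely as an illustration, here's a rough 20.12 fixed-point sketch of that idea; skinPoint() is a hypothetical stand-in for whatever routine runs a position through the joint transforms, and the small tangent/binormal steps are arbitrary:
Code: |
#include <nds.h>

// Deform a tiny tangent-space triangle through the skinning path and take
// its face normal. skinPoint() is hypothetical: it applies the same joint
// transform used for the mesh positions. All values are 20.12 (s32).
extern void skinPoint(const s32 src[3], s32 dst[3]);   // hypothetical

void deformedVertexNormal(const s32 pos[3], const s32 tangent[3],
                          const s32 binormal[3], s32 normal[3])
{
    s32 p0[3], pT[3], pB[3], e1[3], e2[3], tmp[3];
    int i;

    skinPoint(pos, p0);                                // deformed position
    for (i = 0; i < 3; i++) tmp[i] = pos[i] + (tangent[i] >> 4);
    skinPoint(tmp, pT);                                // deformed position + small tangent step
    for (i = 0; i < 3; i++) tmp[i] = pos[i] + (binormal[i] >> 4);
    skinPoint(tmp, pB);                                // deformed position + small binormal step

    for (i = 0; i < 3; i++) { e1[i] = pT[i] - p0[i]; e2[i] = pB[i] - p0[i]; }

    // Cross product; no >>12 here, since the magnitude is discarded by the
    // normalize that has to follow anyway.
    normal[0] = e1[1] * e2[2] - e1[2] * e2[1];
    normal[1] = e1[2] * e2[0] - e1[0] * e2[2];
    normal[2] = e1[0] * e2[1] - e1[1] * e2[0];
}
|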
#135450 - silent_code - Mon Jul 23, 2007 1:37 pm
I recommend doing it offline when exporting. I had a quite fast implementation (on PC) that turned out dead slow on the NDS. Too slow. I prefer sacrificing a few bytes and gaining shorter loading times in return (from tens of seconds to tenths of a second).
Good luck.
#135470 - Ant6n - Mon Jul 23, 2007 7:06 pm
For the dynamic stuff (i.e. runtime normals), assuming you do it exactly the way described by your link, you could try:
1) doing the normals in ARM instead of Thumb (long long multiplies are faster), maybe in ITCM (although with caching enabled by default this shouldn't do much)
2) instead of counting polygons and dividing to get the averaged normals, you could use a switch statement on the number of polygons and multiply by a constant, i.e.
case (numpolygons) of
3: normalx *= fixed(0.333), normaly *= fixed(0.333), normalz *= fixed(0.333)
4: normalx *= fixed(0.25) ... etc.
up to maybe 7, do a divide for the rest
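For illustration, here's a minimal C sketch of that switch in 20.12 fixed point; the constants are just 4096/n rounded, the function name is made up, and it assumes the accumulated values stay comfortably inside an s32 after the multiply:
Code: |
#include <nds.h>

// Divide an accumulated normal component by the polygon count without a
// real divide for the common counts; fall back to a divide otherwise.
static inline s32 divByPolyCount(s32 v, u32 numPolys)
{
    switch (numPolys) {
    case 1:  return v;
    case 2:  return v >> 1;
    case 3:  return (v * 1365) >> 12;   // ~1/3 in 20.12
    case 4:  return v >> 2;
    case 5:  return (v * 819)  >> 12;   // ~1/5
    case 6:  return (v * 683)  >> 12;   // ~1/6
    case 7:  return (v * 585)  >> 12;   // ~1/7
    default: return v / (s32)numPolys;  // rare cases
    }
}
|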
#135471 - DiscoStew - Mon Jul 23, 2007 7:30 pm
DekuTree64 wrote: |
I don't know any specific tricks to speed up the normal calculation process itself, but you can get better rendering speed if you use display lists with normals included. And since I'm assuming you're generating all the normals when loading the model on the DS anyway, then all you're saving compared to doing it in the exporter is a bit of ROM space. |
Hehe, I knew display lists were gonna be brought up. The fact is, I'm not using display lists. For the past few months I have been on and off creating my own exporter that takes the information of a 3D model (polygons, materials, joints, etc.), dumps it into a custom file, and then my code generates a real-time poseable model from it; all that is needed to pose it is to pass in an array containing joint data. The whole reason for generating normals each frame is that as the joints move, the polygons affected by those joints move with them.
Since I last posted, I made a slight modification to the step where the accumulated normals are divided by the number of polygons contributing to each one: I replaced the divide with a small reciprocal LUT. I barely saw any improvement, but an improvement nonetheless. The main time sink is normalizing the two edge vectors and then the overall normal for each polygon; with a 300-triangle model, that's roughly 900 sqrt calls and 2700 divide calls, not to mention everything else involved, like the other calculations and storing that data. Not very pretty if you ask me.
_________________
DS - It's all about DiscoStew
#135472 - sajiimori - Mon Jul 23, 2007 7:43 pm
Sounds like pretty intense stuff -- I might consider making some quality sacrifices. For instance, you could update only a half or a third of the normals on each frame, causing a slight delay in lighting updates.
Another sacrifice would be to animate at half frame rate (or less) and use linear interpolation to fill in the missing frames, using the extra cycles on the in-between frames to partially calculate the upcoming keyframe.
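A rough sketch of the staggered update, with a hypothetical rebuildNormal() standing in for whatever recomputes a single vertex normal:
Code: |
#include <nds.h>

extern void rebuildNormal(u32 index);   // hypothetical per-normal rebuild

// Rebuild only every third vertex normal each frame: lighting lags by at
// most two frames, but the per-frame cost drops to roughly a third.
void updateNormalsStaggered(u32 frame, u32 normalCount)
{
    u32 i;
    for (i = frame % 3; i < normalCount; i += 3)
        rebuildNormal(i);
}
|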
#135475 - Ant6n - Mon Jul 23, 2007 8:40 pm
"900 Sqrt calls and 2700 divide"
instead of calculating
k = sqrt(x^2+y^2+z^2)
x = x/k; y= y/k; z = z/k
try
k = 1/sqrt(x^2+y^2+z^2)
x *= k;y *= k;z *= k;
Do you know about fast inverse square root approximations?
Given some initial approximation x' of 1/sqrt(x), you can find a better approximation x" with
x" = x'*(1.5-(x/2)*x'*x');
The initial approximation could be found with a lookup table on clz(x), or maybe a small lookup table for each clz(x).
Again, there are tons of long long multiplies, and I read somewhere that long long multiplies in Thumb are expanded by gcc into something 'ugly', so try making these routines ARM.
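For what it's worth, here's a minimal sketch of the one-divide normalize in 20.12 fixed point, assuming libnds' mulf32/divf32/sqrtf32 fixed-point helpers (divf32 and sqrtf32 use the hardware divider and square-root unit):
Code: |
#include <nds.h>

// Normalize a 20.12 vector in place with one divide and three multiplies
// instead of three divides. Assumes a non-zero vector.
static inline void normalizeOneDiv(s32 v[3])
{
    s32 magSq = mulf32(v[0], v[0]) + mulf32(v[1], v[1]) + mulf32(v[2], v[2]);
    s32 k = divf32(1 << 12, sqrtf32(magSq));   // k = 1/|v| in 20.12
    v[0] = mulf32(v[0], k);
    v[1] = mulf32(v[1], k);
    v[2] = mulf32(v[2], k);
}
|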
#135512 - DiscoStew - Tue Jul 24, 2007 2:26 am
Ant6n wrote: |
"900 Sqrt calls and 2700 divide"
instead of calculating
k = sqrt(x^2+y^2+z^2)
x = x/k; y= y/k; z = z/k
try
k = 1/sqrt(x^2+y^2+z^2)
x *= k;y *= k;z *= k; |
Oh boy, do I feel dumb. I used the normalizef32 function provided by libnds, but didn't give a single thought to taking the basics of that function and optimizing it for this purpose. It's very similar to the reciprocal LUT I used to replace the final divide when combining normals from multiple polygons, but it never came to mind here. This will easily reduce the number of divisions from 9 per triangle to 3.
Ant6n wrote: |
Do you know about fast inverse square root approximations?
Given some initial approximation x' of 1/sqrt(x), you can find a better approximation x" with
x" = x'*(1.5-(x/2)*x'*x');
The initial approximation could be found with a lookup table on clz(x), or maybe a small lookup table for each clz(x).
Again, there are tons of long long multiplies, and I read somewhere that long long multiplies in Thumb are expanded by gcc into something 'ugly', so try making these routines ARM. |
Sadly, I don't know much about inverse square root approximations. Guess I've got some reading to do.
As for making the routines ARM-based, I haven't dived into that yet. I know that in my past GBA projects I edited the makefiles to build code as ARM if the source file had the ".iwram" extension, since IWRAM code was best as ARM and ROM code as Thumb. Because all the ARM9 code is put into RAM (am I correct on this?), is it best to make it all ARM-based? From the makefile I have for combining ARM7 and ARM9, I see this for C files in the ARM9 makefile...
Code: |
ARCH := -mthumb -mthumb-interwork |
Was thinking -marm would work instead, but it didn't work. This isn't the GBA :P
_________________
DS - It's all about DiscoStew
#135517 - Ant6n - Tue Jul 24, 2007 4:21 am
Code is in fact put into RAM, but the 4 MB main RAM is actually pretty slow; I found it takes 6 cycles for a 1-cycle ARM instruction. Thumb instructions should run about twice as fast, but considering Thumb can do less per instruction, it's really only about 60% faster.
Main RAM is cached with both an instruction cache (8K) and a data cache (4K), so running Thumb from main RAM is not only faster, but since the code is smaller, more of it can stay cached. Instructions that are in cache run at full speed (afaik). Thus your overall code should stay Thumb.
The internal working RAM of the DS is the TCM; again there is an instruction and a data TCM (ITCM and DTCM), and code there again runs at full speed. Instead of the iwram extension, try the ITCM/DTCM equivalents (ITCM for code, DTCM for data).
#135546 - kusma - Tue Jul 24, 2007 12:38 pm
Another thing to keep in mind: you can (and it's generally more correct, explanation later) generate non-normalized polygon normals, accumulate them, and normalize at the end. This has the advantage that a large polygon next to a small one contributes more to the normal (since the polygon's area is larger, its non-normalized normal is bigger).
Also, dividing by the number of polygons contributing to a vertex is wrong: you can (and will) end up with non-normalized vectors that way. Normalizing is the way to go, and once you normalize, any divides done beforehand only add rounding error.
edit: here's some pseudo-code:
Code: |
vector calcNormal(p)
{
return cross(...);
}
for (p in polygon list)
{
vector face_normal = calcNormal(p);
for (v in p)
{
v.normal += face_normal;
}
}
for (v in vertex list)
{
v.normal = normalize(v.normal);
}
|
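For reference, here's a compact fixed-point C rendering of that pseudo-code, assuming 20.12 vertices, libnds' crossf32/normalizef32, and hypothetical Vertex/Tri structs (triangles only, to keep it short):
Code: |
#include <nds.h>

typedef struct { s32 pos[3]; s32 normal[3]; } Vertex;   // hypothetical layout
typedef struct { u16 v[3]; } Tri;

void buildVertexNormals(Vertex *verts, u32 vertCount, const Tri *tris, u32 triCount)
{
    u32 i, j;

    // Clear the accumulators.
    for (i = 0; i < vertCount; i++)
        verts[i].normal[0] = verts[i].normal[1] = verts[i].normal[2] = 0;

    // Add each face's unnormalized normal into all of its vertices.
    for (i = 0; i < triCount; i++) {
        s32 e1[3], e2[3], n[3];
        Vertex *a = &verts[tris[i].v[0]];
        Vertex *b = &verts[tris[i].v[1]];
        Vertex *c = &verts[tris[i].v[2]];
        for (j = 0; j < 3; j++) {
            e1[j] = b->pos[j] - a->pos[j];
            e2[j] = c->pos[j] - a->pos[j];
        }
        crossf32(e1, e2, n);
        for (j = 0; j < 3; j++) {
            a->normal[j] += n[j];
            b->normal[j] += n[j];
            c->normal[j] += n[j];
        }
    }

    // One normalize per vertex at the very end.
    for (i = 0; i < vertCount; i++)
        normalizef32(verts[i].normal);
}
|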
#135571 - DiscoStew - Tue Jul 24, 2007 6:53 pm
Ant6n wrote: |
Code is in fact put into RAM, but the 4 MB main RAM is actually pretty slow; I found it takes 6 cycles for a 1-cycle ARM instruction. Thumb instructions should run about twice as fast, but considering Thumb can do less per instruction, it's really only about 60% faster.
Main RAM is cached with both an instruction cache (8K) and a data cache (4K), so running Thumb from main RAM is not only faster, but since the code is smaller, more of it can stay cached. Instructions that are in cache run at full speed (afaik). Thus your overall code should stay Thumb.
The internal working RAM of the DS is the TCM; again there is an instruction and a data TCM (ITCM and DTCM), and code there again runs at full speed. Instead of the iwram extension, try the ITCM/DTCM equivalents (ITCM for code, DTCM for data). |
Interesting, I guess I can see why the makefile was set to Thumb mode. Maybe after a little examination I'll edit the ARM9 makefile so the TCM can be taken advantage of. It's basically comparable to IWRAM on the GBA, just with more to it and a slightly different name?
kusma wrote: |
Another thing to keep in mind: you can (and it's generally more correct, explanation later) generate non-normalized polygon normals, accumulate them, and normalize at the end. This has the advantage that a large polygon next to a small one contributes more to the normal (since the polygon's area is larger, its non-normalized normal is bigger).
Also, dividing by the number of polygons contributing to a vertex is wrong: you can (and will) end up with non-normalized vectors that way. Normalizing is the way to go, and once you normalize, any divides done beforehand only add rounding error.
edit: here's some pseudo-code:
Code: |
vector calcNormal(p)
{
return cross(...);
}
for (p in polygon list)
{
vector face_normal = calcNormal(p);
for (v in p)
{
v.normal += face_normal;
}
}
for (v in vertex list)
{
v.normal = normalize(v.normal);
}
|
|
You've got my attention, not just because the normalizing is relocated to the end, but because the overall amount of code inside the polygon processing is reduced. After comparing your example to the "other guy's" method, I did a bit more searching around the internet on computing vertex normals, and many sources, if not all, do it the same way you've shown.
So, I gave your pseudo-code a whirl, and I must say, everything is running at least twice as fast because of that reduced calculation, and I haven't even applied Ant6n's suggestion yet. The normals themselves look a bit different compared to the previous method: a bit brighter, and in one of my test models, some of the small polygons that are surrounded by bigger polygons look like the light isn't really reaching them, making them dark (the light is from the viewer's direction). I guess I'm just used to the look of the previous method, but I'm sure I'll get over that phase :P
Thanks again for all the help given. I really appreciate it. Maybe one of these days soon I'll put out a small demonstration of what I've done so far.
_________________
DS - It's all about DiscoStew
#135573 - Ant6n - Tue Jul 24, 2007 7:30 pm
Oh right, adding and dividing is crap, didn't even think about it.
I.e. ((1,0) + (0,1))/2 gives (0.5, 0.5), which has a magnitude of only ~0.71, not 1.
If you think your answer is too far off, you can try an approximate normalization before adding the vectors, especially since you normalize for real later anyway.
For example, a very approximate magnitude (one that is surely the right order of magnitude ;-) is
|v| = x+y+z
or
|v| = max(x,y,z).
Then if you have 3 vectors a,b,c you could use
||(a/|a| + b/|b| + c/|c|)||
where ||v|| is the actual normalization. In order to get rid of the divisions, you could also write
||(a*|b|*|c| + b*|a|*|c| + c*|a|*|b|)||
The vector inside the || || differs from the above one only by a factor |a|*|b|*|c|, which gets sucked up by the normalization in the end.
If you do this the smart way, you also don't have to do that many multiplies, i.e.
Code: |
c = 1
for all vectors v[i]
    k = |v[i]|
    for all previous vectors v[j] with j<i
        v[j] = k*v[j]
    v[i] = c*v[i]
    c = k*c
|
#135576 - kusma - Tue Jul 24, 2007 7:39 pm
Ant6n wrote: |
If you think that your answer is too much off, you can try an approximate normalization before adding the vectors. |
But you shouldn't! The triangle's contribution to the vertex normal should be proportional to the triangle's area, and skipping the normalization does just this. If you are worried about the accumulation overflowing in fixed-point maths, you could always do something like tracking a separate exponent for each vector and do some cheap pseudo-float additions.
#135577 - Ant6n - Tue Jul 24, 2007 7:50 pm
kusma wrote: |
The triangle's contribution to the vertex normal should be proportional to the triangle's area |
Why?
#135581 - kusma - Tue Jul 24, 2007 8:21 pm
Ant6n wrote: |
kusma wrote: | The triangle's contribution to the vertex normal should be proportional to the triangle's area |
Why? |
Because when you Gouraud-shade your mesh afterwards, the bigger polygon is more visible, and the correctness of its lighting matters more. Play around with it a bit and you'll quickly see that it makes perfect sense.
#135690 - jonnyhilly - Wed Jul 25, 2007 5:00 pm
How about pre-calculating all the normals, sorting them into bone groups to match the polygon groups associated with each bone of the skeleton, and then just rotating them by your character's local bone matrices?
I haven't measured it, but it might be faster to just rotate them than to generate them each time.
You can probably skip normalization, so no square roots or divides are needed at all, just multiplies and adds.
Have pre-allocated space for the target normals, and pre-allocated pointers in the polygons pointing to the targets so they also don't need to be updated.
You can also time-slice which bones to update, based on how much each bone moves (this can also be pre-calculated in your animation data).
If a bone doesn't move at all (or much) in a frame, don't rotate the normals for that bone; just use the ones from the last frame (and the same lighting values from the last frame, skipping those lighting calculations too).
This assumes your lights aren't moving.
If you want to take it further...
If you are rendering multiple characters, time-slice the whole animation update to run less often when the characters are far away: once every 2 frames for moderate distance, once every 4 frames for far away. It's amazing what you can get away with, especially if they aren't moving terribly fast.
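A rough sketch of the rotate-instead-of-recalculate idea; the BoneGroup layout and the row-major 20.12 matrix are assumptions for illustration, not anything from DiscoStew's actual format:
Code: |
#include <nds.h>

typedef struct {
    const s32 *restNormals;   // rest-pose normals, 3 values each, 20.12
    s32       *outNormals;    // destination the polygons already point at
    u32        count;
    s32        rot[9];        // this bone's 3x3 rotation, row-major, 20.12
} BoneGroup;

// A pure rotation keeps the normals unit length, so there's no re-normalize,
// no sqrt and no divide -- just 9 multiplies and 6 adds per normal.
void rotateBoneNormals(const BoneGroup *g)
{
    u32 i;
    for (i = 0; i < g->count; i++) {
        const s32 *n = &g->restNormals[i * 3];
        s32       *o = &g->outNormals[i * 3];
        o[0] = (g->rot[0]*n[0] + g->rot[1]*n[1] + g->rot[2]*n[2]) >> 12;
        o[1] = (g->rot[3]*n[0] + g->rot[4]*n[1] + g->rot[5]*n[2]) >> 12;
        o[2] = (g->rot[6]*n[0] + g->rot[7]*n[1] + g->rot[8]*n[2]) >> 12;
    }
}
|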
#135731 - DiscoStew - Wed Jul 25, 2007 9:25 pm
As far as the inverse square root is concerned, I remember this bit of code....
Code: |
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i >> 1); // This line hides a LOT of math!
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
} |
...and wondered if there was a way to convert it for fixed-point inverse sqrts, but considering that the "special line" is tied to the float format, I doubt I'll ever find a good method this way. I had thought about taking the fixed-point value, converting it to float, doing this mix of stuff (other than the extra approximation), then converting back to fixed point and doing the extra approximation there, but I fear that just converting to float and back would cost too many cycles versus the hardware sqrt-and-divide method.
_________________
DS - It's all about DiscoStew
#135732 - sajiimori - Wed Jul 25, 2007 9:30 pm
Yeah, especially when you can do a sqrt, a divide, and other math all in parallel on the DS.
#135735 - DiscoStew - Wed Jul 25, 2007 11:20 pm
sajiimori wrote: |
Yeah, especially when you can do a sqrt, a divide, and other math all in parallel on the DS. |
Hmm, do you mean that while a hardware divide is processing, I could get a hardware sqrt going at the same time?
That could potentially save me some time: I could do the sqrt for the first normal, and while its divide is taking place, get started on the sqrt for the second normal, instead of waiting for everything for the first normal to finish before starting the second.
_________________
DS - It's all about DiscoStew
#135738 - DiscoStew - Wed Jul 25, 2007 11:31 pm
jonnyhilly wrote: |
How about pre-calculating all the normals, sorting them into bone groups to match the polygon groups associated with each bone of the skeleton, and then just rotating them by your character's local bone matrices?
I haven't measured it, but it might be faster to just rotate them than to generate them each time.
You can probably skip normalization, so no square roots or divides are needed at all, just multiplies and adds.
Have pre-allocated space for the target normals, and pre-allocated pointers in the polygons pointing to the targets so they also don't need to be updated. |
I'm trying to imagine this in my head, but wouldn't that only work for polygon groups that are influenced by a single bone, where the vertices of one poly group aren't used by another?
_________________
DS - It's all about DiscoStew
#135740 - Ant6n - Wed Jul 25, 2007 11:46 pm
I was referring to this 'trick' earlier.
As I was suggesting, one could use a lookup table on clz(x) instead of the magic float line. CLZ is the asm instruction that finds the position of the most significant 1 in an integer.
I.e. find out how many leading bits are zero, and then have a lookup table that gives some average/median of all the values in that range. One would only have one entry for each of 1, 1x, 1xx, 1xxx, 1xxxx, 1xxxxx, ..., but it should give a decent initial approximation, within +/- 50% or so.
Then doing 2 or 3 of the iteration steps should give a sufficient approximation.
One could also have a bunch of tiny lookup tables instead of one value, or two values that give a linear approximation.
Although at the end of the day, I don't know whether this can compete with a 36-cycle division and 26-cycle sqrt.
Maybe this kind of common math will make it into libnds in some optimized form.
EDIT: since 1/sqrt(x) and 1/sqrt(2*x) always differ by a factor of sqrt(2) ~ 1.41, the range of 1/sqrt values covered by any 1xx..xx pattern goes from some number a to 1.41*a, so by choosing 1.19*a, the initial approximation would only be about 19% off (18.9207115%), and that with only 32 lookup values.
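To make that concrete, here's a rough 20.12 sketch: the seed comes from the leading-zero count (with the ~1.19 centring factor from the edit above folded in), and two Newton steps of x" = x'*(1.5-(x/2)*x'*x') refine it. The constants and step count are illustrative guesses, not tuned or benchmarked against the hardware divider:
Code: |
#include <nds.h>

static inline s32 mulfx(s32 a, s32 b)       // (a*b) in 20.12
{
    return (s32)(((s64)a * b) >> 12);
}

// Approximate 1/sqrt(x) for x > 0; 20.12 in, 20.12 out. Precision falls
// apart for extremely large or tiny x, so keep the inputs in a sane range.
s32 invSqrtApprox(s32 x)
{
    // The real value is x/4096, so the answer is 2^18 / sqrt(raw x).
    int msb = 31 - __builtin_clz((u32)x);   // position of the leading 1
    s32 y   = 1 << (18 - msb / 2);          // power-of-two seed
    y = mulfx(y, (msb & 1) ? 2435 : 3444);  // centre the seed in its octave

    // Two Newton-Raphson steps: y = y*(1.5 - 0.5*x*y*y)
    y = mulfx(y, 6144 - (mulfx(x, mulfx(y, y)) >> 1));
    y = mulfx(y, 6144 - (mulfx(x, mulfx(y, y)) >> 1));
    return y;
}
|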
#135746 - jonnyhilly - Thu Jul 26, 2007 12:13 am
Yes, you are correct.
It would assume you are either not using weighted vertex groups, or you don't care that the lighting would be slightly off at the locations with multiple weights.
It might be difficult to see any difference, though.
DiscoStew wrote: |
jonnyhilly wrote: | How about pre-calculating all the normals, sorting them into bone groups to match the polygon groups associated with each bone of the skeleton, and then just rotating them by your character's local bone matrices?
I haven't measured it, but it might be faster to just rotate them than to generate them each time.
You can probably skip normalization, so no square roots or divides are needed at all, just multiplies and adds.
Have pre-allocated space for the target normals, and pre-allocated pointers in the polygons pointing to the targets so they also don't need to be updated. |
I'm trying to imagine this in my head, but wouldn't that only work for polygon groups that are influenced by a single bone, where the vertices of one poly group aren't used by another? |
#135753 - sajiimori - Thu Jul 26, 2007 1:11 am
Yeah DiscoStew, for bulk operations like this, you can probably save lots of time by carefully threading sqrts, divs, and other work so each unit will stay busy as often as possible.
#135783 - a128 - Thu Jul 26, 2007 8:18 am
DiscoStew wrote: |
As far as the inverse square root is concerned, I remember this bit of code....
|
I posted InvSin here instead of InvSqrt.
Anyway, here is a link to the origins of the above inverse sqrt code:
http://www.beyond3d.com/content/articles/8/
Last edited by a128 on Fri Jul 27, 2007 11:04 am; edited 1 time in total
#135835 - sajiimori - Thu Jul 26, 2007 7:04 pm
Uh... that says inverse sine.
#135861 - DiscoStew - Thu Jul 26, 2007 9:18 pm
I was about to say that myself. Didn't think that 1.0 - fValue with a sqrt function made much sense for finding the InvSqrt.
Anyway, after doing some of these other optimizations, things are going pretty well. One thing I completely forgot was a problem I had before making most of these adjustments: a normal that lies straight down an axis (or close to it), because one of the computed normal's components can turn out to be 1/-1, which the DS is unable to handle properly.
I've tried a few hacks to fix this, but the only one that really seemed to work was checking whether each component was >= 1 or <= -1 and adjusting it, and having to check all 3 values per normal felt like it was bogging down the whole process.
_________________
DS - It's all about DiscoStew
#135870 - Ant6n - Thu Jul 26, 2007 10:06 pm
If the maximum value of your components must not be bigger than, say, 0.98, you could fold that into your inverse square root,
i.e.
c = 0.98/sqrt(x^2+y^2+z^2)
x = c*x, etc.
But considering how many operations you have per normal anyway, and that multiplication is not blazing fast either, it could be faster to just check at the end, after converting to fixed point.
#135948 - DiscoStew - Fri Jul 27, 2007 7:57 pm
Well, after a lot of improvements to the normal code, this is what I have so far...
Code: |
// (s16*)VertStore -> The storage array for the vertices after alterations from the joints has been done
// (s32*)NormalStore -> The storage array for our processed normals
// (u16*)TriIndices -> The index list for each triangle (3 indexes each)
// (u16*)QuadIndices -> The index list for each quad (4 indexes each)
// (u16*)TriNormInfo -> The normal index list for each triangle (3 indexes each)
// (u16*)QuadNormInfo -> The normal index list for each quad (4 indexes each)
|
Code: |
// Does the model have even one normal in it?
//
if(NormalCount)
{
// Set up vector variables to process the poly normals
//
s32 vect_b1[3], vect_b2[3], vect_b3[3], normal[3];
u32 CurPoly;
// Start with the triangles
//
for(CurPoly = 0; CurPoly < TriCount; CurPoly++)
{
// Retrieve the indexes that our triangle references to
//
u32 PolyIndex = CurPoly * 3;
u32 TriIndex1 = *(TriIndices + PolyIndex) * 3;
u32 TriIndex2 = *(TriIndices + PolyIndex + 1) * 3;
u32 TriIndex3 = *(TriIndices + PolyIndex + 2) * 3;
// Create the two vectors of the triangle
//
vect_b1[0] = VertStore[TriIndex2] - VertStore[TriIndex1];
vect_b1[1] = VertStore[TriIndex2 + 1] - VertStore[TriIndex1 + 1];
vect_b1[2] = VertStore[TriIndex2 + 2] - VertStore[TriIndex1 + 2];
vect_b2[0] = VertStore[TriIndex3] - VertStore[TriIndex1];
vect_b2[1] = VertStore[TriIndex3 + 1] - VertStore[TriIndex1 + 1];
vect_b2[2] = VertStore[TriIndex3 + 2] - VertStore[TriIndex1 + 2];
// Get the cross product of the two vectors
//
normal[0] = ((vect_b1[1] * vect_b2[2]) - (vect_b2[1] * vect_b1[2])) >> 12;
normal[1] = ((vect_b1[2] * vect_b2[0]) - (vect_b2[2] * vect_b1[0])) >> 12; // long long used to be casted here
normal[2] = ((vect_b1[0] * vect_b2[1]) - (vect_b2[0] * vect_b1[1])) >> 12;
// Get the address of the indexes that the triangle's normals reference
//
u16 *TriNormal = (TriNormInfo + PolyIndex);
// Store the normal in the designated areas
//
u32 TriNormIndex = *TriNormal * 3;
NormalStore[TriNormIndex] += normal[0];
NormalStore[TriNormIndex + 1] += normal[1];
NormalStore[TriNormIndex + 2] += normal[2];
TriNormIndex = *(TriNormal + 1) * 3;
NormalStore[TriNormIndex] += normal[0];
NormalStore[TriNormIndex + 1] += normal[1];
NormalStore[TriNormIndex + 2] += normal[2];
TriNormIndex = *(TriNormal + 2) * 3;
NormalStore[TriNormIndex] += normal[0];
NormalStore[TriNormIndex + 1] += normal[1];
NormalStore[TriNormIndex + 2] += normal[2];
}
// Now with the quads
//
for(CurPoly = 0; CurPoly < QuadCount; CurPoly++)
{
// Retrieve the indexes that our quad references to
//
u32 PolyIndex = CurPoly * 4;
u32 QuadIndex1 = *(QuadIndices + PolyIndex) * 3;
u32 QuadIndex2 = *(QuadIndices + PolyIndex + 1) * 3;
u32 QuadIndex3 = *(QuadIndices + PolyIndex + 2) * 3;
u32 QuadIndex4 = *(QuadIndices + PolyIndex + 3) * 3;
// Create the three vectors of the quad
//
vect_b1[0] = VertStore[QuadIndex2] - VertStore[QuadIndex1];
vect_b1[1] = VertStore[QuadIndex2 + 1] - VertStore[QuadIndex1 + 1];
vect_b1[2] = VertStore[QuadIndex2 + 2] - VertStore[QuadIndex1 + 2];
vect_b2[0] = VertStore[QuadIndex3] - VertStore[QuadIndex1];
vect_b2[1] = VertStore[QuadIndex3 + 1] - VertStore[QuadIndex1 + 1];
vect_b2[2] = VertStore[QuadIndex3 + 2] - VertStore[QuadIndex1 + 2];
vect_b3[0] = VertStore[QuadIndex4] - VertStore[QuadIndex1];
vect_b3[1] = VertStore[QuadIndex4 + 1] - VertStore[QuadIndex1 + 1];
vect_b3[2] = VertStore[QuadIndex4 + 2] - VertStore[QuadIndex1 + 2];
// Get the cross product from the first two vectors of the quad
//
normal[0] = ((vect_b1[1] * vect_b2[2]) - (vect_b2[1] * vect_b1[2])) >> 12;
normal[1] = ((vect_b1[2] * vect_b2[0]) - (vect_b2[2] * vect_b1[0])) >> 12; // long long used to be casted here
normal[2] = ((vect_b1[0] * vect_b2[1]) - (vect_b2[0] * vect_b1[1])) >> 12;
// Get the address of the indexes that the quad's normals reference
//
u16 *QuadNormal = (QuadNormInfo + PolyIndex);
// Store the normal in the designated areas
//
u32 QuadNormIndex = *QuadNormal * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
QuadNormIndex = *(QuadNormal + 1) * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
QuadNormIndex = *(QuadNormal + 2) * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
// Get the cross product from the last two vectors of the quad
//
normal[0] = ((vect_b2[1] * vect_b3[2]) - (vect_b3[1] * vect_b2[2])) >> 12;
normal[1] = ((vect_b2[2] * vect_b3[0]) - (vect_b3[2] * vect_b2[0])) >> 12; // long long used to be casted here
normal[2] = ((vect_b2[0] * vect_b3[1]) - (vect_b3[0] * vect_b2[1])) >> 12;
// Store the normal in the designated areas
//
QuadNormIndex = *QuadNormal * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
QuadNormIndex = *(QuadNormal + 2) * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
QuadNormIndex = *(QuadNormal + 3) * 3;
NormalStore[QuadNormIndex] += normal[0];
NormalStore[QuadNormIndex + 1] += normal[1];
NormalStore[QuadNormIndex + 2] += normal[2];
}
// A few variables to work with for normalizing our......normals
//
u32 TotalNorm = NormalCount * 3, CurNormalIndex = 3, FinNormalIndex = 0;
s32 MagCalc = 0;
// Set up the divide and sqrt hardware, and get set up for the first sqrt while hardware gets ready
//
SQRT_CR = SQRT_64;
DIV_CR = DIV_64_32;
s32 Val1 = (NormalStore[0] * NormalStore[0]) >> 12;
s32 Val2 = (NormalStore[1] * NormalStore[1]) >> 12; // long long used to be casted here
s32 Val3 = (NormalStore[2] * NormalStore[2]) >> 12;
while(SQRT_CR & SQRT_BUSY);
while(DIV_CR & DIV_BUSY);
// Set in the numerator for the divide for all normals processed, and start our first sqrt calculation
//
DIV_NUMERATOR64 = ((s64)4096) << 12;
SQRT_PARAM64 = ((s64)Val1 + Val2 + Val3) << 12;
// Go through each of the normals (the last one is finished after the loop)
//
for(; CurNormalIndex < TotalNorm; CurNormalIndex += 3, FinNormalIndex += 3)
{
// Get set up for the next sqrt while the current one is being processed
//
Val1 = (NormalStore[CurNormalIndex] * NormalStore[CurNormalIndex]) >> 12;
Val2 = (NormalStore[CurNormalIndex + 1] * NormalStore[CurNormalIndex + 1]) >> 12; // long long used to be casted here
Val3 = (NormalStore[CurNormalIndex + 2] * NormalStore[CurNormalIndex + 2]) >> 12;
while(SQRT_CR & SQRT_BUSY);
// Spit out the sqrt result into the denominator of the divide to get it going, and start the next sqrt calc
//
DIV_DENOMINATOR32 = SQRT_RESULT32;
SQRT_PARAM64 = ((s64)Val1 + Val2 + Val3) << 12;
while(DIV_CR & DIV_BUSY);
// Retrieve the divide result, and multiply by .98, because the hardware can't handle values of 1/-1
// Multiply by the inverse magnitude, and shift the value into the correct range
//
MagCalc = (((s32)DIV_RESULT32) * 4014) >> 12;
NormalStore[FinNormalIndex] = (NormalStore[FinNormalIndex] * MagCalc) >> 15;
NormalStore[FinNormalIndex + 1] = (NormalStore[FinNormalIndex + 1] * MagCalc) >> 15;
NormalStore[FinNormalIndex + 2] = (NormalStore[FinNormalIndex + 2] * MagCalc) >> 15;
}
// Finish off the last normal with the same procedure as above
//
while(SQRT_CR & SQRT_BUSY);
DIV_DENOMINATOR32 = SQRT_RESULT32;
while(DIV_CR & DIV_BUSY);
MagCalc = (((s32)DIV_RESULT32) * 4014) >> 12;
NormalStore[FinNormalIndex] = (NormalStore[FinNormalIndex] * MagCalc) >> 15;
NormalStore[FinNormalIndex + 1] = (NormalStore[FinNormalIndex + 1] * MagCalc) >> 15;
NormalStore[FinNormalIndex + 2] = (NormalStore[FinNormalIndex + 2] * MagCalc) >> 15;
} |
I still know very little about TCM, but this code is inside my main model-processing function, which is marked with the ITCM_CODE attribute. The thing is, of the storage arrays only NormalStore is marked DTCM_DATA, because for some reason my model goes all crazy on me if I put VertStore in there, even with NormalStore taken out. As it is right now, each has 1500 elements, around 9KB of data in total, and the only thing my program is doing is loading the model and processing it. Nothing else is happening.
There are still a few tweaks that can be done, but those are mainly things like computing PolyIndex, which can be changed from a multiply to an addition.
If you notice, there are places where it says "long long used to be casted here". They used to be cast, until I found out that even in ITCM, removing those casts makes my program run much better. Unfortunately, removing them could cause problems "if" the meshes/polygons are large, or turn out large after the joints have manipulated them. For now, I'll keep the casts off.
So, here is my code. If by chance you see anything that could potentially increase performance, or you have any questions/information that could do the same, please let me know through this thread.
I still think one reason the program is running slower than intended is that I'm still using emulators for testing (no$gba, to be exact), and emulators are still not up to snuff. I was planning on getting the R4DS (or its twin, the M3 Simply) so that I could truly test this.
_________________
DS - It's all about DiscoStew
#135950 - sajiimori - Fri Jul 27, 2007 8:19 pm
Waiting for the division unit is painful. I'd suggest rearranging the loop to have the divide span more code. This may involve skipping ahead to calculating Val1-3 for the next normal while waiting for the divide to finish, then backtracking to finish writing to NormalStore.
I'd also suggest eliminating the 'while' loops that wait for the square root and divide units to finish. If they span enough code, they are guaranteed to be finished already.
#135957 - Ant6n - Fri Jul 27, 2007 9:48 pm
Isn't it dangerous to make these kinds of assumptions (i.e. about order of execution) in code that gets optimized by gcc at -O3?
I can't see from the code whether you took the advice and only run this code every couple of frames for each model.
#135958 - DiscoStew - Fri Jul 27, 2007 10:03 pm
Ant6n wrote: |
Isn't it dangerous to make these kinds of assumptions (i.e. about order of execution) in code that gets optimized by gcc at -O3?
I can't see from the code whether you took the advice and only run this code every couple of frames for each model. |
That is just the general code for a one-shot calculation of all the normals in a model. For your method, I'll have to evolve it to take those calculated results and store them with the model they were calculated from. Shouldn't be too hard, but for the moment I'm focusing on optimizing the code that calculates the normals.
EDIT:
This is gonna sound really silly, well, to me anyway, but considering I'm grabbing the poly info, which is stored as halfwords, wouldn't grabbing that same data in word-sized chunks be better? As it is right now, I'm basically just reading the information in its compiled format; if I were to make a loading function for the models and expand some of those halfword arrays into word arrays, would that increase overall performance?
_________________
DS - It's all about DiscoStew
#135965 - sajiimori - Sat Jul 28, 2007 12:39 am
If I understand what you're saying, no, LDRH instructions (to load 16 bits) are just as fast as LDR instructions (to load 32 bits).
Padding out the values with zeroes would hurt performance a lot due to increased cache misses. Compact data is very important.
#135970 - tepples - Sat Jul 28, 2007 1:58 am
sajiimori wrote: |
I'd also suggest eliminating the 'while' loops that wait for the square root and divide units to finish. If they span enough code, they are guaranteed to be finished already. |
But can you guarantee that GCC won't get too clever when it reorders your instructions?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#135975 - DekuTree64 - Sat Jul 28, 2007 2:57 am
tepples wrote: |
sajiimori wrote: | I'd also suggest eliminating the 'while' loops that wait for the square root and divide units to finish. If they span enough code, they are guaranteed to be finished already. |
But can you guarantee that GCC won't get too clever when it reorders your instructions? |
For a critical calculation, I would say wait for it always. But since a bad normal would just cause a slight visual glitch for one frame, it's probably safe to assume that execution time is consistent enough that if things have been reordered too much, it will be very obviously broken.
Another thing to watch out for with the divider/square root unit is using them in interrupt handlers. Even if you back up/restore the values, that restarts the calculation so it could appear to take longer than it should to outside code. Plus the interrupt could happen between checking the busy flag and fetching the result. So probably the best solution is just to wait for the calculations to finish before returning from the interrupt.
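For what it's worth, a minimal sketch of that "wait before returning" rule for a handler that borrows the divider; the handler itself is hypothetical, and the register names are the ones used elsewhere in this thread:
Code: |
void myIrqHandler(void)                    // hypothetical handler
{
    s64 savedNum = DIV_NUMERATOR64;        // back up the interrupted inputs
    s32 savedDen = DIV_DENOMINATOR32;      // (and DIV_CR, if the mode changes)

    // ... the handler's own divides go here ...

    DIV_NUMERATOR64   = savedNum;          // restoring the inputs restarts the divide
    DIV_DENOMINATOR32 = savedDen;
    while (DIV_CR & DIV_BUSY);             // let it finish before returning, so the
                                           // interrupted code still reads a valid result
}
|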
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#135990 - DiscoStew - Sat Jul 28, 2007 8:16 am
I was thinking of a little hack that might increase performance, though it deviates from the normal method somewhat, which may hinder further optimization. The same amount of processing would probably happen, but where it happens would change, to take advantage of the divide and sqrt running alongside everything else.
It would require many changes to both my exporter and my code, but the concept is basically this: besides recording which vertices go to which normal group, I also order those normal groups so that the normals connected to the fewest polygons come first and those with the most come last. On top of that, I add a bool marker to each vertex index of each polygon; the only vertices with a non-zero marker are those that make the last contribution to a normal group, at which point the normalization of that normal can begin, getting the sqrt and divide going. That way, the actual normalizing of the final normal values runs even while the cross products of the remaining polygons are still being processed, and any normals still left over are finished afterwards.
One problem, though, is when a polygon has more than two vertices that become ready for normalizing at once. Perhaps I could make a queue the size of the model's normal count, place normals that are ready into it, and after each polygon is processed, check the queue to see if another normal is ready and start it.
I hope I'm making sense with this idea. If you follow it, do you think it's a good one? It wouldn't really interfere with only processing the normals every few frames, and it could possibly be split so that a certain number of normals are processed per frame.
_________________
DS - It's all about DiscoStew
#136117 - sajiimori - Sun Jul 29, 2007 11:42 am
You can eliminate waits for divides and square roots even using your current scheme -- you don't need to go to great lengths for that purpose.
Moving the divides and square roots earlier just means they will overlap other work, without saving any time.
At this point, I'd focus on simplifying the triangle and quad loops. Try to eliminate unnecessary variables, and make sure the rest fit in registers.
#136128 - DiscoStew - Sun Jul 29, 2007 3:17 pm
Looking at my loops for the triangles and quads, I couldn't really see any more optimization I could do, until I reread what Ant6n said earlier.
Ant6n wrote: |
Code is in fact put into RAM, but the 4 MB main RAM is actually pretty slow; I found it takes 6 cycles for a 1-cycle ARM instruction. Thumb instructions should run about twice as fast, but considering Thumb can do less per instruction, it's really only about 60% faster.
Main RAM is cached with both an instruction cache (8K) and a data cache (4K), so running Thumb from main RAM is not only faster, but since the code is smaller, more of it can stay cached. Instructions that are in cache run at full speed (afaik). Thus your overall code should stay Thumb.
The internal working RAM of the DS is the TCM; again there is an instruction and a data TCM (ITCM and DTCM), and code there again runs at full speed. Instead of the iwram extension, try the ITCM/DTCM equivalents (ITCM for code, DTCM for data). |
If code executed from main RAM is that slow, does the same apply to reading and writing data in main RAM? If so, I think I see where I can optimize my code. Although my normal-processing code is in ITCM, VertStore is in main RAM for the moment because of a problem I couldn't figure out; after reading a little, it seems I can't use DMA to copy data from main RAM to DTCM, right?
As it is right now, if my model has joints, the vertices get processed through the joint code and stored into VertStore, in which case that array shouldn't have any problem living in DTCM, and any optimization needs to be focused on that part of my code rather than the normal-processing code. For models that don't have joints, I'll just have to rig up my exporter and my program code to put those models into a more display-list-friendly form, since nothing changes on them, so nothing needs to be processed, just executed.
I'll get working on that before making any other changes, but in any case, if I need to get an array or something into DTCM, is there a faster way than to copy one element at a time (or more, depending on the element size)?
_________________
DS - It's all about DiscoStew
#136137 - tepples - Sun Jul 29, 2007 4:46 pm
DiscoStew wrote: |
I'll get working on that before making any other changes, but in any case, if I need to get an array or something into DTCM, is there a faster way than to copy one element at a time (or more, depending on the element size)? |
memcpy() of an area at least 32 bytes should be fast. Profile it to be sure.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#136164 - sajiimori - Sun Jul 29, 2007 9:09 pm
One likely way to simplify your loops is to avoid calculating array indexes, and instead try to increment pointers as you go. This will decrease register usage because you don't need to remember the base address of the array in addition to an index.
For instance, if you find yourself accessing array[i], array[i+1], then array[i+2], try to eliminate i and do *ptr++, *ptr++, *ptr++ instead.
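As a tiny made-up illustration of that transformation (not the exact loop from the code above):
Code: |
#include <nds.h>

// Indexed version: the base pointer and the i*3 offsets all stay live.
void accumulateIndexed(s32 *dst, const s32 n[3], u32 count)
{
    u32 i;
    for (i = 0; i < count; i++) {
        dst[i * 3]     += n[0];
        dst[i * 3 + 1] += n[1];
        dst[i * 3 + 2] += n[2];
    }
}

// Pointer-walked version: one pointer register, post-increment addressing.
void accumulateWalked(s32 *dst, const s32 n[3], u32 count)
{
    u32 i;
    for (i = 0; i < count; i++) {
        *dst++ += n[0];
        *dst++ += n[1];
        *dst++ += n[2];
    }
}
|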
#136170 - tepples - Sun Jul 29, 2007 9:30 pm
sajiimori wrote: |
One likely way to simplify your loops is to avoid calculating array indexes, and instead try to increment pointers as you go. This will decrease register usage because you don't need to remember the base address of the array in addition to an index. |
So you're suggesting strength reduction by hand. Before you try this, try using arm-eabi-gcc -S (rest of flags) mysourcefile.c -o mysourcefile.s to get a disassembly and then seeing if the compiler isn't already doing this for you. (The -S instead of -c tells GCC not to pass the file to the assembler.)
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#136173 - sajiimori - Sun Jul 29, 2007 9:49 pm
In the context of this code, I strongly doubt the compiler will figure out how to eliminate all the indexing, but you never know.
And yes, I would suggest doing the optimization by hand, if you intend to make further low-level optimizations afterwards. If the code doesn't visually correspond to the work being done, it's very hard to find additional optimizations.
In essence, automatic code transformations add an extra level of indirection between you and the assembler output. When you're concerned about the assembler output, it's helpful to stay within reach of it.
It's not impossible to write code that's both fast and abstract, but it can be rather like trying to make a marionette control another marionette.
(Incidentally, I think cutting down on the indexing would also make the code easier to understand, but that's subjective.)
#136189 - tepples - Sun Jul 29, 2007 11:06 pm
sajiimori wrote: |
It's not impossible to write code that's both fast and abstract, but it can be rather like trying to make a marionette control another marionette. |
Geppetto has to teach his "son" a trade sometime, doesn't he?
If you need it explained, wait for me to show up on AIM.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#136338 - DiscoStew - Tue Jul 31, 2007 7:23 pm
tepples wrote: |
DiscoStew wrote: | I'll get working on that before making any other changes, but in any case, if I need to get an array or something into DTCM, is there a faster way than to copy one element at a time (or more, depending on the element size)? |
memcpy() of an area at least 32 bytes should be fast. Profile it to be sure. |
Hmm, I was trying to get memcpy into my project, but for some reason it can't seem to locate the <memory.h> header (under Visual Studio). <string.h> has it though, so I included it that way for now.
memcpy works fine, and I'll get around to profiling the main RAM to DTCM copy in time. A problem I now have involves some adjustments I've made for models that have no bone structure and can just have their contents copied to DTCM and accessed for rendering. While the normals that have to be calculated go into a signed-word array, a model that doesn't need any calculation grabs a premade array of normals from within itself. The problem is that those are in signed halfword form, not signed word, so I can't simply copy from one array to the other with functions like memcpy unless the destination array is treated as signed halfwords too. Because the calculated normals fit into signed halfword format anyway, I decided to make a pointer to the main normal array, but with the halfword type.
Code: |
DTCM_DATA s32 NormalStore[...];
s16 *NormalGrab = (s16*)NormalStore; |
But when I try this, I get this warning...
Code: |
warning: dereferencing type-punned pointer will break strict-aliasing rules |
I'm not quite sure what's going on, but I do know it isn't because of adding that attribute, because I removed it, and I still got the warning. Any thoughts?
EDIT:
Actually, after typing this, I did a little browsing, and found that I could use "-fno-strict-aliasing" to fix it. Good idea?
_________________
DS - It's all about DiscoStew
#136340 - tepples - Tue Jul 31, 2007 7:46 pm
Consider using a union, C's counterpart to C++'s reinterpret_cast:
Code: |
union {
s16 *s16_pointer;
s32 *s32_pointer;
} s;
s.s32_pointer = NormalStore;
NormalGrab = s.s16_pointer;
|
The result of this is defined by each application binary interface (ABI), but in ARM EABI, it should do what you expect as long as both areas are aligned to a 4-byte boundary.
The aliasing rules determine what optimizations the compiler is allowed to make on pointer dereferences. Stricter rules may lead to faster code.
This post to the NetBSD mailing list has more information.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.