Trigonometric Look-Up Tables Revisited

December 6, 2011

Introduction

In my spare time, I’m working on a 2D game that relies on a huge number of entities being updated and displayed every frame, at 60FPS. This requires keeping a constant eye on overall performance.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms per frame calling trigonometric functions, essentially sin and cos, to update the transformation matrices of the relevant entities (approximately 50K entities).

According to this document, the x86 instructions fsin and fcos take between 65 and 100 cycles to execute, so we just need to write some code that computes the sine function in fewer cycles than that.

Solution

The solution that came to mind was to use a look-up table (LUT) for sin and cos. It struck me as an old-school trick that isn’t used much anymore, but I decided to give it a try anyway, just to see.

Representing Angles
To avoid a float-to-integer conversion whenever I need to look up a sine, I quickly converted all the angles in the game to unsigned 16-bit integers, where [0..65535] is mapped to [0..2pi). The precision is quite good, and angles naturally wrap around to [0..2pi). Even when doing intermediate computations with larger integer types, we can simply perform a bitwise AND with 65535 to remap the values into this range.
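The post does not show how angles are converted to and from this 16-bit format, so the following is only a minimal sketch of one way to do it. The names RadiansToAngle16, Angle16ToRadians and AddAngles are hypothetical helpers and are not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers: the original post does not show its conversion code.
constexpr float kTwoPi = 6.28318530717958647692f;

// Convert radians (any finite value) to a 16-bit angle, 65536 units per full turn.
inline std::uint16_t RadiansToAngle16(float radians)
{
    assert(std::isfinite(radians));
    const float turns = radians / kTwoPi;
    const float frac = turns - std::floor(turns);  // fractional turn in [0, 1]
    // frac can round up to exactly 1.0f for tiny negative inputs, so mask the result.
    const std::uint32_t units = static_cast<std::uint32_t>(frac * 65536.0f);
    return static_cast<std::uint16_t>(units & 0xFFFFu);
}

// Convert a 16-bit angle back to radians in [0, 2*pi).
inline float Angle16ToRadians(std::uint16_t angle)
{
    return (static_cast<float>(angle) / 65536.0f) * kTwoPi;
}

// Add a signed offset using a wider type, then wrap with a bitwise AND with 65535.
inline std::uint16_t AddAngles(std::uint16_t angle, std::int32_t delta)
{
    const std::int32_t sum = static_cast<std::int32_t>(angle) + delta;
    return static_cast<std::uint16_t>(sum & 0xFFFF);
}
```

With these helpers, a quarter turn is 16384 and a half turn is 32768, and adding 1 to 65535 wraps back to 0 without any explicit range check.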

LUT Size
At first, I started with a 16384-entry LUT because I wanted to minimize errors. The sin function looked like this:

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);  // Map the angle to [0..16383]
    return SinTable[Index & 16383];   // Lookup angle
}

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

LUT Steps
I immediately saw a problem with this approach when running the game, though: as the LUT was 4 times smaller than the angle range, the resulting value changed only every 4 angle steps, creating a “staircase” effect (exaggerated to better illustrate the problem):


(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);                  // Map the angle to [0..16383]
    const tFloat A = SinTable[Index & 16383];         // Lookup angle
    const tFloat B = SinTable[(Index + 1) & 16383];   // Lookup angle+1
    const tFloat Weight = (Value & 3) / 4.0f;         // Compute weight
    return Weight * (B - A) + A;                      // Linear interpolation
}

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

LUT Size Optimization
With interpolation enabled, I decided to run some tests to see how small the LUT could be without introducing too much error. Here are the results:

As we can see, the error is quite low even for a 64-entry LUT.
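For anyone who wants to reproduce this kind of measurement, here is a rough sketch of how the error could be computed. It is not the test code used for the post: MaxSinLutError is a hypothetical helper that rebuilds an interpolated table of the given size and compares it against std::sin for all 65536 angles.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper, not from the original post: returns the largest absolute
// difference between a linearly interpolated sine LUT and std::sin.
inline double MaxSinLutError(std::uint32_t tableSize)
{
    // Assumed constraint: power of two, no larger than the 16-bit angle range.
    assert(tableSize >= 2 && tableSize <= 65536 && (tableSize & (tableSize - 1)) == 0);

    const double twoPi = 6.283185307179586476925;
    std::vector<float> table(tableSize);
    for (std::uint32_t i = 0; i < tableSize; ++i)
        table[i] = static_cast<float>(std::sin(twoPi * i / tableSize));

    const std::uint32_t divisor = 65536u / tableSize;  // angle units per table entry
    double maxError = 0.0;
    for (std::uint32_t angle = 0; angle < 65536u; ++angle)
    {
        // Same index and weight computation as the interpolated lookup.
        const std::uint32_t index = angle / divisor;
        const float a = table[index & (tableSize - 1)];
        const float b = table[(index + 1) & (tableSize - 1)];
        const float w = static_cast<float>(angle & (divisor - 1)) / static_cast<float>(divisor);
        const float approx = w * (b - a) + a;

        const double exact = std::sin(twoPi * angle / 65536.0);
        maxError = std::fmax(maxError, std::fabs(approx - exact));
    }
    return maxError;
}
```

Calling it for 64, 128, 256 and so on gives one error figure per table size. For linear interpolation of a smooth function, the worst-case error should shrink by roughly a factor of 4 each time the table size doubles, until single-precision rounding becomes the limit.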

Further Optimizations

Here are some other optimizations that I didn’t investigate. They could be interesting on target platforms that have little memory, or if we want to minimize cache misses.

Using Half-Precision Floats
Using half-precision floats would halve the table size without sacrificing too much performance, depending on the platform. In fact, this is true for any other type of memory-hungry data, such as animations.

Minimal LUT Encoding
It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.
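As an illustration of the idea, here is a minimal sketch of a first-quadrant-only table. It is untested in the game and uses several assumptions of my own: a 256-step quadrant, one extra entry so that sin(pi/2) is stored exactly, nearest-entry lookup without interpolation, and the hypothetical names QuarterSin and QuarterCos.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative sizes, not taken from the original post.
constexpr std::uint32_t kQuarterSize = 256;  // table steps per quadrant
constexpr std::uint32_t kQuarterShift = 6;   // 14-bit in-quadrant angle -> 8-bit index
static_assert((16384u >> kQuarterShift) == kQuarterSize, "shift must match table size");

struct QuarterTable
{
    float Values[kQuarterSize + 1];  // +1 so that sin(pi/2) has its own entry

    QuarterTable()
    {
        const double halfPi = 1.57079632679489661923;
        for (std::uint32_t i = 0; i <= kQuarterSize; ++i)
            Values[i] = static_cast<float>(std::sin(halfPi * i / kQuarterSize));
    }
};

inline float QuarterSin(std::uint16_t angle)
{
    static const QuarterTable table;  // built on first use

    const std::uint32_t quadrant = angle >> 14;                     // top 2 bits: 0..3
    const std::uint32_t pos = (angle & 0x3FFFu) >> kQuarterShift;   // 0..255

    // Quadrants 1 and 3 read the table backwards: sin(pi/2 + x) = sin(pi/2 - x).
    const std::uint32_t index = (quadrant & 1u) ? (kQuarterSize - pos) : pos;
    assert(index <= kQuarterSize);

    // Quadrants 2 and 3 are the negated first half: sin(pi + x) = -sin(x).
    const float value = table.Values[index];
    return (quadrant & 2u) ? -value : value;
}

inline float QuarterCos(std::uint16_t angle)
{
    // cos(x) = sin(x + pi/2); the cast wraps the angle back into 16 bits.
    return QuarterSin(static_cast<std::uint16_t>(angle + 16384u));
}
```

The saving comes at the cost of a few extra integer operations and two selects per lookup, so whether it is worth it depends on how much the smaller table helps the cache.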

Using Smooth Interpolation
Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would require a lot more cycles to execute.
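One possible variant, shown here only as an unprofiled sketch, is cubic Hermite interpolation. It needs the slope at each table entry, and since the derivative of sin is cos, the slope can be read from the same table a quarter turn ahead. The 64-entry size and the name HermiteSin are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative constants, not taken from the original post.
constexpr std::uint32_t kTableSize = 64;                       // power of two
constexpr std::uint32_t kDivisor = 65536u / kTableSize;        // angle units per entry
constexpr float kStep = 6.28318530717958647692f / kTableSize;  // radians per entry
static_assert((kTableSize & (kTableSize - 1)) == 0, "table size must be a power of two");

struct SmallSinTable
{
    float Values[kTableSize];

    SmallSinTable()
    {
        for (std::uint32_t i = 0; i < kTableSize; ++i)
            Values[i] = static_cast<float>(std::sin(6.283185307179586 * i / kTableSize));
    }
};

inline float HermiteSin(std::uint16_t angle)
{
    static const SmallSinTable table;  // built on first use
    const std::uint32_t mask = kTableSize - 1;

    const std::uint32_t i0 = (angle / kDivisor) & mask;
    const std::uint32_t i1 = (i0 + 1) & mask;
    const float t = static_cast<float>(angle & (kDivisor - 1)) / static_cast<float>(kDivisor);
    assert(t >= 0.0f && t < 1.0f);

    // End points of the interval.
    const float p0 = table.Values[i0];
    const float p1 = table.Values[i1];
    // Slopes: cos(x) = sin(x + pi/2), scaled by the interval width in radians.
    const float m0 = table.Values[(i0 + kTableSize / 4) & mask] * kStep;
    const float m1 = table.Values[(i1 + kTableSize / 4) & mask] * kStep;

    // Standard cubic Hermite basis functions.
    const float t2 = t * t;
    const float t3 = t2 * t;
    return (2.0f * t3 - 3.0f * t2 + 1.0f) * p0
         + (t3 - 2.0f * t2 + t) * m0
         + (-2.0f * t3 + 3.0f * t2) * p1
         + (t3 - t2) * m1;
}
```

This version does four table reads and noticeably more arithmetic per lookup than linear interpolation, which is the trade-off described above: a smaller table in exchange for more cycles.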

Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize> struct TrigLookup
{
    tFloat SinTable[SinTableSize];

    TrigLookup()
    {
        CHECK(IsPowerOfTwo(SinTableSize));

        for(tU32 i=0; i<SinTableSize; i++)
        {
            SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2);
        }
    }

    FORCEINLINE tFloat Lookup(tU16 Value)
    {
        const tU32 Divisor = (65536 / SinTableSize);
        const tU32 Index = Value / Divisor;
        const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)];
        const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)];
        const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor;
        return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA;
    }

    FORCEINLINE tFloat Sin(tU16 Value)
    {
        return Lookup(Value);
    }

    FORCEINLINE tFloat Cos(tU16 Value)
    {
        return Lookup(Value + 16384);
    }

    FORCEINLINE tFloat Tan(tU16 Value)
    {
        return Lookup(Value) / Lookup(Value + 16384);
    }
};

And here’s the assembly dump of the Lookup function, which takes roughly 30 cycles on x86 and could probably be hand-optimized by someone less lazy than me:

0126FA53  movzx       ecx,si  
0126FA56  mov         eax,ecx  
0126FA58  shr         eax,8  
0126FA5B  mov         edx,eax  
0126FA5D  inc         eax  
0126FA5E  and         edx,0FFh  
0126FA64  and         eax,0FFh  
0126FA69  fld         dword ptr [esp+edx*4+40h]  
0126FA6D  fld         dword ptr [esp+eax*4+40h]  
0126FA71  fsub        st,st(1)  
0126FA73  movzx       eax,cl  
0126FA76  mov         dword ptr [esp+3Ch],eax  
0126FA7A  fild        dword ptr [esp+3Ch]  
0126FA7E  fmul        dword ptr [__real@3b800000 (12D4FC8h)]  
0126FA84  fmulp       st(1),st  
0126FA86  faddp       st(1),st  