
Trigonometric Look-Up Tables Revisited

Introduction

In my spare time, I’m working on a 2D game that relies on a huge number of entities being updated and displayed every frame, at 60 FPS. This requires keeping a permanent eye on overall performance.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms per frame calling trigonometric functions, essentially sin and cos to update the transformation matrices of the relevant entities (approximately 50K entities).

According to this document, the x86 instructions fsin and fcos take between 65 and 100 cycles to execute, so we just need to write some code that computes the sine function in fewer cycles than that.

Solution

The solution that came to my mind was to use a look-up table (LUT) for sin and cos. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.

Representing Angles
To avoid a float-to-integer conversion whenever I need to look up sines, I quickly converted all my angles in the game to unsigned 16-bit integers, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles get naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a bitwise AND with 65535 to remap the values into this range.
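For illustration, the float-to-fixed-angle conversion might look like this (a minimal sketch; the helper names and the fmod-based wrapping are my own, not the game's actual code):

```cpp
#include <cmath>
#include <cstdint>

static const float kTwoPi = 6.28318530717958647692f;

// Map a radian angle onto [0..65535]; out-of-range inputs wrap naturally.
uint16_t AngleFromRadians(float Radians)
{
    float Wrapped = std::fmod(Radians, kTwoPi);
    if (Wrapped < 0.0f)
        Wrapped += kTwoPi;
    // Truncate into the 16-bit range; the AND guards against the edge case
    // where rounding pushes Wrapped / kTwoPi up to exactly 1.0.
    return (uint16_t)((uint32_t)(Wrapped / kTwoPi * 65536.0f) & 0xFFFFu);
}

// Inverse mapping, back to [0..2pi[.
float RadiansFromAngle(uint16_t Angle)
{
    return (float)Angle / 65536.0f * kTwoPi;
}
```

Once everything in the game stores angles this way, the conversion only happens at the boundaries (input, content loading), and all intermediate math stays in integers.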

LUT Size
At first, I started with a 16384-entry LUT because I wanted to minimize errors. The sin function looked like this:

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);  // Map the angle to [0..16383]
    return SinTable[Index & 16383];   // Lookup angle
}

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

LUT Steps
I immediately saw a problem with this approach when running the game, though: since the LUT is 4 times smaller than the angle range, the resulting value changes only every 4 angle steps, creating a "staircase" effect (exaggerated here to better illustrate the problem):


(This graph was generated with Inigo Quilez's excellent Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);                  // Map the angle to [0..16383]
    const tFloat A = SinTable[Index & 16383];         // Lookup angle
    const tFloat B = SinTable[(Index + 1) & 16383];   // Lookup angle+1
    const tFloat Weight = (Value & 3) / 4.0f;         // Compute weight
    return Weight * (B - A) + A;                      // Linear interpolation
}

With linear interpolation, the game ran smoothly with no noticeable difference, but it roughly doubled the function's execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

LUT Size Optimization
With interpolation enabled, I ran some tests to see how small the LUT could be without introducing too much error. Here are the results:

As we can see, the error is quite low even for a 64-entry LUT.
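Such a measurement can be reproduced with a small standalone harness (my own sketch, not the article's test code), computing the worst-case error of the interpolated LUT over all 65536 angles:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Measure the maximum absolute error of a linearly interpolated sine LUT
// of a given power-of-two size, using double precision as the reference.
double MaxLutError(int TableSize)
{
    const double kTwoPi = 6.283185307179586;
    std::vector<double> Table(TableSize);
    for (int i = 0; i < TableSize; i++)
        Table[i] = std::sin((double)i / TableSize * kTwoPi);

    const int Divisor = 65536 / TableSize;
    double MaxError = 0.0;
    for (int Angle = 0; Angle < 65536; Angle++)
    {
        const int Index = Angle / Divisor;
        const double A = Table[Index & (TableSize - 1)];
        const double B = Table[(Index + 1) & (TableSize - 1)];
        const double W = (Angle & (Divisor - 1)) / (double)Divisor;
        const double Approx = W * (B - A) + A;
        const double Exact = std::sin((double)Angle / 65536.0 * kTwoPi);
        MaxError = std::max(MaxError, std::fabs(Approx - Exact));
    }
    return MaxError;
}
```

For a 64-entry table, the worst case is bounded by the classic linear-interpolation estimate h²/8 · max|sin''| = (2π/64)²/8 ≈ 0.0012, which is why even such a tiny table is usable.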

Further Optimizations

Here are some other optimizations that I didn't investigate. They could be interesting on target platforms with little memory, or when we want to minimize cache misses.

Using Half-Precision Floats
Using half-precision floats can halve the table size without sacrificing too much performance, depending on the platform. In fact, the same holds for any other type of memory-hungry data, like animations, etc.

Minimal LUT Encoding
It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.
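A minimal sketch of that quarter-wave idea (my own, shown without the interpolation step): the top two bits of the 16-bit angle select the quadrant, and symmetry does the rest.

```cpp
#include <cmath>
#include <cstdint>

static const int kQuarterSize = 16384;        // entries covering one quadrant
static float QuarterTable[kQuarterSize + 1];  // +1 so index 16384 yields sin(pi/2) = 1

void InitQuarterTable()
{
    for (int i = 0; i <= kQuarterSize; i++)
        QuarterTable[i] = (float)std::sin((double)i / (4.0 * kQuarterSize) * 6.283185307179586);
}

float QuarterSin(uint16_t Angle)
{
    const uint16_t Quadrant = Angle >> 14;    // top two bits select the quadrant
    uint16_t Index = Angle & 16383;           // position within the quadrant
    if (Quadrant & 1)                         // quadrants 1 and 3 run backwards
        Index = (uint16_t)(16384 - Index);
    const float S = QuarterTable[Index];
    return (Quadrant & 2) ? -S : S;           // second half of the circle is negative
}
```

As discussed in the comments below, the quadrant selection adds branches (or branchless select tricks), so it trades cycles for memory.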

Using Smooth Interpolation
Another solution that I didn't investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would require a lot more cycles to execute.
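As one possible variant (a sketch under my own naming, not something I benchmarked), a Catmull-Rom cubic through four neighboring LUT samples gives a much smoother curve from the same 64-entry table:

```cpp
#include <cmath>
#include <cstdint>

static const int kSize = 64;   // small table; the spline compensates
static float Table[kSize];

void InitTable()
{
    for (int i = 0; i < kSize; i++)
        Table[i] = (float)std::sin((double)i / kSize * 6.283185307179586);
}

float CubicSin(uint16_t Angle)
{
    const int Divisor = 65536 / kSize;
    const int Index = Angle / Divisor;
    const float T = (Angle & (Divisor - 1)) / (float)Divisor;
    // Four neighboring samples, wrapped around the table.
    const float P0 = Table[(Index - 1) & (kSize - 1)];
    const float P1 = Table[Index & (kSize - 1)];
    const float P2 = Table[(Index + 1) & (kSize - 1)];
    const float P3 = Table[(Index + 2) & (kSize - 1)];
    // Catmull-Rom spline interpolating between P1 and P2.
    return 0.5f * ((2.0f * P1)
                 + (-P0 + P2) * T
                 + (2.0f * P0 - 5.0f * P1 + 4.0f * P2 - P3) * T * T
                 + (-P0 + 3.0f * P1 - 3.0f * P2 + P3) * T * T * T);
}
```

The two extra table reads and the cubic evaluation cost more cycles per call, which is exactly the trade-off mentioned above.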

Conclusion

Here's the source code I used for this post, in case anyone is interested in using it:

template<tU32 SinTableSize> struct TrigLookup
{
    tFloat SinTable[SinTableSize];

    TrigLookup()
    {
        CHECK(IsPowerOfTwo(SinTableSize));

        for(tU32 i=0; i<SinTableSize; i++)
        {
            SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2);
        }
    }

    FORCEINLINE tFloat Lookup(tU16 Value)
    {
        const tU32 Divisor = (65536 / SinTableSize);
        const tU32 Index = Value / Divisor;
        const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)];
        const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)];
        const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor;
        return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA;
    }

    FORCEINLINE tFloat Sin(tU16 Value)
    {
        return Lookup(Value);
    }

    FORCEINLINE tFloat Cos(tU16 Value)
    {
        return Lookup(Value + 16384);
    }

    FORCEINLINE tFloat Tan(tU16 Value)
    {
        return Lookup(Value) / Lookup(Value + 16384);
    }
};
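For anyone wanting to compile this directly, here's a self-contained equivalent with the house typedefs and macros replaced by plain C++ (my assumptions: tFloat = float, tU16/tU32 = uint16_t/uint32_t, Pi2 = 2π, CHECK as a compile-time assert; Tan omitted for brevity):

```cpp
#include <cmath>
#include <cstdint>

template<uint32_t SinTableSize> struct TrigLookupDemo
{
    float SinTable[SinTableSize];

    TrigLookupDemo()
    {
        static_assert((SinTableSize & (SinTableSize - 1)) == 0,
                      "table size must be a power of two");
        for (uint32_t i = 0; i < SinTableSize; i++)
            SinTable[i] = (float)std::sin((double)i / SinTableSize * 6.283185307179586);
    }

    float Lookup(uint16_t Value) const
    {
        const uint32_t Divisor = 65536 / SinTableSize;
        const uint32_t Index = Value / Divisor;
        const float A = SinTable[Index & (SinTableSize - 1)];
        const float B = SinTable[(Index + 1) & (SinTableSize - 1)];
        const float W = (Value & (Divisor - 1)) / (float)Divisor;
        return W * (B - A) + A;
    }

    float Sin(uint16_t Value) const { return Lookup(Value); }
    // cos(x) = sin(x + pi/2); the uint16_t cast wraps for free.
    float Cos(uint16_t Value) const { return Lookup((uint16_t)(Value + 16384)); }
};
```

With `TrigLookupDemo<256> Trig;`, `Trig.Sin(16384)` should return roughly 1.0f, since 16384 angle units map to π/2.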

And here's the assembly dump of the Lookup function, which takes roughly ~30 cycles on x86 and could be hand-optimized by someone less lazy than me:

0126FA53  movzx       ecx,si  
0126FA56  mov         eax,ecx  
0126FA58  shr         eax,8  
0126FA5B  mov         edx,eax  
0126FA5D  inc         eax  
0126FA5E  and         edx,0FFh  
0126FA64  and         eax,0FFh  
0126FA69  fld         dword ptr [esp+edx*4+40h]  
0126FA6D  fld         dword ptr [esp+eax*4+40h]  
0126FA71  fsub        st,st(1)  
0126FA73  movzx       eax,cl  
0126FA76  mov         dword ptr [esp+3Ch],eax  
0126FA7A  fild        dword ptr [esp+3Ch]  
0126FA7E  fmul        dword ptr [__real@3b800000 (12D4FC8h)]  
0126FA84  fmulp       st(1),st  
0126FA86  faddp       st(1),st  
  1. Steve
    December 6, 2011 at 11:43 am | #1

    My first thought is that lookup tables are often avoided for a reason – pulling the page of memory into the cache can be slow.

    On modern machines, a simple calculation is often faster than a lookup table, but a small test program can hide the problem because not enough is going on for the LUT to ever be discarded from the cache, so it never needs to be pulled back in again. In a larger program, the LUT will be repeatedly discarded from the cache and then pulled back in.

    The reason I know this – because I got shouted at a while ago when I suggested using a lookup table for another kind of calculation. Though that (Sudoku-related) calculation was all-integer, which has very different performance implications, of course.

    Still, the famous “fast inverse square root” – see http://en.wikipedia.org/wiki/Fast_inverse_square_root – is a bizarre calculation, but still a calculation – no lookup table anywhere. This is actually old – probably developed in the early 90s – but there was still a memory hierarchy back then, at least on some machines, and (although it’s just my personal guess) that’s maybe why this was preferred to a LUT.

    Based on that, you may get better performance with a polynomial (Taylor series or similar) approximation of sine. I haven’t tested it, though, and it obviously depends a lot on how many steps of the series you need to get enough accuracy – and how you deal with the modulo 2pi thing, limiting the angle to a reasonable range.

    Another option – avoid the need for sine and cosine completely. Store orientations as vectors. Apply rotations using matrix multiplications, with a few precalculated matrices. Again, this is just an idea – it seems likely to only work out for digital all-or-nothing controls.

    • Adam Hyland
      December 6, 2011 at 2:37 pm | #2

      Much of the motivation for the fast inverse square root trick was architecture specific. Not only were lookup tables generally less efficient (either slower or took up more memory doing nothing for more cycles) but floating point operations were slower than integer operations on 90s x86 machines *and* the routine needed to split out a floating point number. All of those specific conditions made the hack fortuitous. If any one of those conditions is violated, a more direct function or a LUT may be more efficient.

      “Another option – avoid the need for sine and cosine completely. Store orientations as vectors. Apply rotations using matrix multiplications, with a few precalculated matrices. Again, this is just an idea – it seems likely to only work out for digital all-or-nothing controls.”

      I agree with this.

    • Alex
      December 6, 2011 at 9:15 pm | #3

      What were you doing with Sudoku and a lookup table that would push things out of cache?

      • Steve
        December 6, 2011 at 10:26 pm | #4

        Not much that I recall – something to do with figuring out which cells related to which 3×3 blocks maybe. I’m not sure cache was the issue thinking about it – it was an all-integer calculation, and table lookup involves an integer calculation anyway, so that may have been it. If I find the old Stack Overflow question in was on, I’ll add a link.

        • Steve
          December 6, 2011 at 10:43 pm | #5

          I just went through my whole list of past Stack Overflow answers and can’t find it. Maybe I deleted it – it was rejected after all. Probably no great loss, anyway.

  2. gwenhwyfaer
    December 6, 2011 at 12:12 pm | #6

    You may find this article useful:

    http://www.coranac.com/2009/07/sines/

    Highlights: A 4th-order cosine approximation – a phase shifted sine – accurate to 12 bits can be computed in about a dozen (ARM9) cycles, without going anywhere near memory. A comparison is made with LUT-based methods – even in ARM9, where the CPU’s speed isn’t that different from RAM’s, the approximation is 3 times the speed. (It also turns out that the Taylor series is far from the best approach when it comes to approximating a sine wave – and that approximating a cosine is also a promising approach.)

    • Steven Pigeon
      December 27, 2011 at 10:46 pm | #7

      A good source of (not too complex) approximations can be found in older books such as Abramowitz and Stegun’s Handbook of Mathematical Functions (which you can find a part of here http://people.math.sfu.ca/~cbm/aands/ ) which was written for a time when “a computer” did not necessarily mean an electronic device. There is a whole section dedicated to approximations of all kinds of functions, including the dreaded gaussian (which would make the sine look simple).

  3. John J
    December 6, 2011 at 1:06 pm | #8

    If you have an optimization to shrink the LUT by a factor of 4, why bother with interpolation? Go back to the faster lookup, using a more precise LUT, but only one quadrant.

    • jfdube
      December 7, 2011 at 2:26 pm | #9

      Those optimizations also come with a cost; using halfs will probably double the execution time, and using only a quadrant will add branches. Since using a small LUT with interpolation fixes the problem I had, I simply chose to work on the game instead!

      • T.
        November 20, 2012 at 6:31 pm | #10

        On arm, you got conditional ops the compiler is aware of.
        On x86, you got cmovcc.
        Elsewhere, you can emulate conditional moves with carefully crafted cmp/shift/mask in C.
        Whether this will prove better on an OoO pipe remains to be determined.

        Other than that, switching to bit-oriented angles but keeping the result float seems a bit half-done to me :).
        I don’t think the sin result needs float dynamic range, especially in your case (I suspect coordinates of rotated quads). A scaled integer result would be suited, can only take less memory by construction and be faster afterwards, but you’d need to change the rest like you did for the angle.
        For 2D I can’t think of a good reason to use floats except for physics. Maybe.
        If you’re feeding the coords to a GPU, the hardware can rotate by itself.

        • T.
          November 20, 2012 at 7:01 pm | #11

          And yes, I know I’m a bit late to the party :)

  4. December 6, 2011 at 1:14 pm | #12

    Interesting findings JF. I wonder how this approach would stack up? http://lab.polygonal.de/2007/07/18/fast-and-accurate-sinecosine-approximation/

    Looking forward to the 2D game ;)

    • jfdube
      December 7, 2011 at 2:21 pm | #13

      I just tried it, and it’s ~0.05ms, and looking at the code I think it could be made branchless quite easily. The maximum error is 0.0010904001917918036, which compares to a 64-entries LUT with linear interpolation. Not bad, I like the idea!

  5. krish
    December 7, 2011 at 7:42 am | #14

    Am new to this. So this is probably a noob question: In that interpolation, shouldn’t it be a Value*3/4 ? How is it a (Value & 3)/4.0 . Thanks.

    • jfdube
      December 7, 2011 at 2:23 pm | #15

      That’s a good question! ANDing the angle with 3 gives 4 possibilities for every value: 0, 1, 2 and 3. Then, dividing the result by 4 gives 0.0, 0.25, 0.50 and 0.75, which is used to linearly interpolate between sample points, which are 4 values apart (since that particular example was using a 16384-entry LUT and I use 65536 angles, 65536/16384 = 4). Is it clearer in your mind?

      • krish
        December 8, 2011 at 9:41 pm | #16

        Thanks. Yes it is.

  6. jfdube
    December 7, 2011 at 2:55 pm | #17

    Adam Hyland :

    “Another option – avoid the need for sine and cosine completely. Store orientations as vectors. Apply rotations using matrix multiplications, with a few precalculated matrices. Again, this is just an idea – it seems likely to only work out for digital all-or-nothing controls.”

    I agree with this.

    Using angles is so convenient in my case, mainly for network replication: it’s far easier to replicate angles than vectors, and it keeps bandwidth low, especially when replicating thousands of entities.

  7. rabidcow
    December 8, 2011 at 2:01 am | #18

    You can get away with a table that has about half of the precision that you want to end up with: Split the angle in half by bits (big = angle >> half, small = angle & (1<<half – 1)) and use a lookup table for the top half. The bottom half is tiny, so you can approximate sin(x)=x and cos(x)=1. Then use the sum formulas: sin(x+y) = sin(x)*cos(y) + cos(x)*sin(y) or sin(big+small) ~= sin(big) + cos(big)*small. If you always need sin and cos of the same angles, this doesn't even add any extra table lookups.

  8. Bubbaccia
    December 10, 2011 at 4:46 am | #19

    Why are you still using the old and slow FPU unit and not the SSE instruction set?
    Take a look at http://gruntthepeon.free.fr/ssemath/

    • Steve
      December 11, 2011 at 6:10 am | #20

      Possibly because most people’s phones don’t have an x86-compatible processor? The most common platforms for simple games these days are not PCs, so it’s probably not a good idea to be too dependent on non-portable techniques.

    • jfdube
      December 11, 2011 at 3:45 pm | #21

      I turned off SSE2 code generation on purpose to make the generated code easier to understand for people without deep knowledge of assembler. The main purpose of the assembly dump was to show how much faster it could be compared to the ~100-cycle fsin instruction, without pretending to be the fastest existing method.

  9. GDemirev
    December 12, 2011 at 3:18 pm | #22

    The first thing that comes to my mind, when talking about coordination transformations is the Cordic method: http://en.wikipedia.org/wiki/Cordic.

    That is the way to go if you want speed and small footprint.

  10. Gearoid
    April 1, 2012 at 11:38 am | #23

    Thanks for sharing this, it’s a nice piece of work.

  11. December 8, 2013 at 11:09 pm | #24

    Absolutely amazing, the first complete implementation of a sin/cos LUT I found on the web that actually works AND works for wrapping AND for negative numbers. You are a genius and just increased the framerate in my current project from 15 to 30 FPS. WONDERFUL!

    • jfdube
      January 14, 2014 at 7:17 am | #25

      Glad it helped, Alexander!

  12. January 13, 2014 at 7:19 pm | #26

    There is an Intel approximate math library that you could use. Still I am interested in what LUT’s could do in terms of high speed AI.

