## Trigonometric Look-Up Tables Revisited

## Introduction

In my spare time, I’m working on a 2D game that relies on huge amount of entities being updated and displayed every frame, at 60FPS. This requires to have a permanent eye on overall performances.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms calling trigonometric functions per frame, essentially *sin* and *cos* to update the transformation matrix of the relevant entities (approximately 50K entities).

According to this document, x86 instructions *fsin* and *fcos* takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.

## Solution

The solution that came to my mind was to use a look-up table (LUT) for *sin* and *cos*. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.

**Representing Angles**

To avoid a float to integer conversion whenever I need to lookup sines, I quickly converted all my angles in the game to an unsigned 16-bits integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles gets naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a logical *and* with 65535 to remap the values into this range.

**LUT Size**

At first, I started with a 16384 entries LUT because I wanted to minimize errors. The *sin* function looked like this:

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] return SinTable[Index & 16383]; // Lookup angle }

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

**LUT Steps**

I immediately saw a problem with this approach when running the game though: as the LUT was 4 times smaller than the angles range, the resulting value would change only every 4 angle steps, creating a “stepcase” effect (exaggerated to better illustrate the problem):

(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] const tFloat A = SinTable[Index & 16383]; // Lookup angle const tFloat B = SinTable[(Index + 1) & 16383]; // Lookup angle+1 const Weight = (Value & 3) / 4.0f; // Compute weight return Weight * (B - A) + A; // Linear interpolation }

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

**LUT Size Optimization**

With interpolation enabled, I decided to make some tests to see how small the LUT could be without introducing too much errors. Here’s the results:

As we can see, the error is quite low even for a 64 entries LUT.

## Further Optimizations

Here’s some other optimizations that I didn’t investigate. They could be interesting on target platforms that have low memory or if we want to minimize cache misses.

**Using Half-Precision Floats**

Using half-precision floats can indeed reduce the table size by 2 without sacrificing too much performances, depending on the platform. In fact, this is true for any other types of memory-hungry data, like animations, etc.

**Minimal LUT Encoding**

It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.

**Using Smooth Interpolation**

Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would requires a lot more cycles to execute.

## Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize>; struct TrigLookup { tFloat SinTable[SinTableSize]; TrigLookup() { CHECK(IsPowerOfTwo(SinTableSize)); for(tU32 i=0; i<SinTableSize; i++) { SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2); } } FORCEINLINE tFloat Lookup(tU16 Value) { const tU32 Divisor = (65536 / SinTableSize); const tU32 Index = Value / Divisor; const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)]; const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)]; const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor; return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA; } FORCEINLINE tFloat Sin(tU16 Value) { return Lookup(Value); } FORCEINLINE tFloat Cos(tU16 Value) { return Lookup(Value + 16384); } FORCEINLINE tFloat Tan(tU16 Value) { return Lookup(Value) / Lookup(Value + 16384); } };

And here’s the assembly dump of the *Lookup* function, which is roughly ~30 cycles on x86 and might be hand-optimized, if someone is not as lazy as me:

0126FA53 movzx ecx,si 0126FA56 mov eax,ecx 0126FA58 shr eax,8 0126FA5B mov edx,eax 0126FA5D inc eax 0126FA5E and edx,0FFh 0126FA64 and eax,0FFh 0126FA69 fld dword ptr [esp+edx*4+40h] 0126FA6D fld dword ptr [esp+eax*4+40h] 0126FA71 fsub st,st(1) 0126FA73 movzx eax,cl 0126FA76 mov dword ptr [esp+3Ch],eax 0126FA7A fild dword ptr [esp+3Ch] 0126FA7E fmul dword ptr [__real@3b800000 (12D4FC8h)] 0126FA84 fmulp st(1),st 0126FA86 faddp st(1),st