## Trigonometric Look-Up Tables Revisited

## Introduction

In my spare time, I’m working on a 2D game that relies on huge amount of entities being updated and displayed every frame, at 60FPS. This requires to have a permanent eye on overall performances.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms calling trigonometric functions per frame, essentially *sin* and *cos* to update the transformation matrix of the relevant entities (approximately 50K entities).

According to this document, x86 instructions *fsin* and *fcos* takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.

## Solution

The solution that came to my mind was to use a look-up table (LUT) for *sin* and *cos*. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.

**Representing Angles**

To avoid a float to integer conversion whenever I need to lookup sines, I quickly converted all my angles in the game to an unsigned 16-bits integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles gets naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a logical *and* with 65535 to remap the values into this range.

**LUT Size**

At first, I started with a 16384 entries LUT because I wanted to minimize errors. The *sin* function looked like this:

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] return SinTable[Index & 16383]; // Lookup angle }

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

**LUT Steps**

I immediately saw a problem with this approach when running the game though: as the LUT was 4 times smaller than the angles range, the resulting value would change only every 4 angle steps, creating a “stepcase” effect (exaggerated to better illustrate the problem):

(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] const tFloat A = SinTable[Index & 16383]; // Lookup angle const tFloat B = SinTable[(Index + 1) & 16383]; // Lookup angle+1 const Weight = (Value & 3) / 4.0f; // Compute weight return Weight * (B - A) + A; // Linear interpolation }

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

**LUT Size Optimization**

With interpolation enabled, I decided to make some tests to see how small the LUT could be without introducing too much errors. Here’s the results:

As we can see, the error is quite low even for a 64 entries LUT.

## Further Optimizations

Here’s some other optimizations that I didn’t investigate. They could be interesting on target platforms that have low memory or if we want to minimize cache misses.

**Using Half-Precision Floats**

Using half-precision floats can indeed reduce the table size by 2 without sacrificing too much performances, depending on the platform. In fact, this is true for any other types of memory-hungry data, like animations, etc.

**Minimal LUT Encoding**

It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.

**Using Smooth Interpolation**

Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would requires a lot more cycles to execute.

## Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize>; struct TrigLookup { tFloat SinTable[SinTableSize]; TrigLookup() { CHECK(IsPowerOfTwo(SinTableSize)); for(tU32 i=0; i<SinTableSize; i++) { SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2); } } FORCEINLINE tFloat Lookup(tU16 Value) { const tU32 Divisor = (65536 / SinTableSize); const tU32 Index = Value / Divisor; const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)]; const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)]; const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor; return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA; } FORCEINLINE tFloat Sin(tU16 Value) { return Lookup(Value); } FORCEINLINE tFloat Cos(tU16 Value) { return Lookup(Value + 16384); } FORCEINLINE tFloat Tan(tU16 Value) { return Lookup(Value) / Lookup(Value + 16384); } };

And here’s the assembly dump of the *Lookup* function, which is roughly ~30 cycles on x86 and might be hand-optimized, if someone is not as lazy as me:

0126FA53 movzx ecx,si 0126FA56 mov eax,ecx 0126FA58 shr eax,8 0126FA5B mov edx,eax 0126FA5D inc eax 0126FA5E and edx,0FFh 0126FA64 and eax,0FFh 0126FA69 fld dword ptr [esp+edx*4+40h] 0126FA6D fld dword ptr [esp+eax*4+40h] 0126FA71 fsub st,st(1) 0126FA73 movzx eax,cl 0126FA76 mov dword ptr [esp+3Ch],eax 0126FA7A fild dword ptr [esp+3Ch] 0126FA7E fmul dword ptr [__real@3b800000 (12D4FC8h)] 0126FA84 fmulp st(1),st 0126FA86 faddp st(1),st

My first thought is that lookup tables are often avoided for a reason – pulling the page of memory into the cache can be slow.

On modern machines, a simple calculation is often faster than a lookup table, but a small test program can hide the problem because not enough is going on for the LUT to ever be discarded from the cache, so it never needs to be pulled back in again. In a larger program, the LUT will be repeatedly discarded from the cache and then pulled back in.

The reason I know this – because I got shouted at a while ago when I suggested using a lookup table for another kind of calculation. Though that (sodoku related) calculation was all-integer, which has very different performance implications, of course.

Still, the famous “fast inverse square root” – see http://en.wikipedia.org/wiki/Fast_inverse_square_root – is a bizarre calculation, but still a calculation – no lookup table anywhere. This is actually old – probably developed in the early 90s – but there was still a memory hierarchy back then, at least on some machines, and (although it’s just my personal guess) that’s maybe why this was preferred to a LUT.

Based on that, you may get better performance with a polynomial (Taylor series or similar) approximation of sine. I haven’t tested it, though, and it obviously depends a lot on how many steps of the series you need to get enough accuracy – and how you deal with the modulo 2pi thing, limiting the angle to a reasonable range.

Another option – avoid the need for sine and cosine completely. Store orientations as vectors. Apply rotations using matrix multiplications, with a few precalculated matrices. Again, this is just an idea – it seems likely to only work out for digital all-or-nothing controls.

Much of the motivation for the fast inverse square root trick was architecture specific. Not only were lookup tables generally less efficient (either slower or took up more memory doing nothing for more cycles) but floating point operations were slower than integer operations on 90s x86 machines *and* the routine needed to split out a floating point number. All of those specific conditions made the hack fortuitous. If any one of those conditions is violated, a more direct function or a LUT may be more efficient.

“Another option – avoid the need for sine and cosine completely. Store orientations as vectors. Apply rotations using matrix multiplications, with a few precalculated matrices. Again, this is just an idea – it seems likely to only work out for digital all-or-nothing controls.”

I agree with this.

What were you doing with Sudoku and a lookup table that would push things out of cache?

Not much that I recall – something to do with figuring out which cells related to which 3×3 blocks maybe. I’m not sure cache was the issue thinking about it – it was an all-integer calculation, and table lookup involves an integer calculation anyway, so that may have been it. If I find the old Stack Overflow question in was on, I’ll add a link.

I just went through my whole list of past Stack Overflow answers and can’t find it. Maybe I deleted it – it was rejected after all. Probably no great loss, anyway.

You may find this article useful:

http://www.coranac.com/2009/07/sines/

Highlights: A 4th-order cosine approximation – a phase shifted sine – accurate to 12 bits can be computed in about a dozen (ARM9) cycles, without going anywhere near memory. A comparison is made with LUT-based methods – even in ARM9, where the CPU’s speed isn’t that different from RAM’s, the approximation is 3 times the speed. (It also turns out that the Taylor series is far from the best approach when it comes to approximating a sine wave – and that approximating a cosine is also a promising approach.)

A good source of (not too complex) approximations can be found in older books such as Abramowitz and Stegun’s Handbook of Mathematical Functions (which you can find a part of here http://people.math.sfu.ca/~cbm/aands/ ) which was written for a time when “a computer” did not necessarily mean an electronic device. There is a whole section dedicated to approximations of all kinds of functions, including the dreaded gaussian (which would make the sine look simple).

If you have an optimization to shrink the LUT by a factor of 4, why bother with interpolation? Go back to the faster lookup, using a more precise LUT, but only one quadrant.

Those optimizations also come with a cost; using halfs will probably double the execution time, and using only a quadrant will add branches. Since using a small LUT with interpolation fixes the problem I had, I simply chose to work on the game instead!

On arm, you got conditional ops the compiler is aware of.

On x86, you got cmovcc.

Elsewhere, you can emulate conditional moves with carefully crafted cmp/shift/mask in C.

Wether this will prove better on an OoO pipe remains to be determined.

Other than that, switching to bit-oriented angles but keeping the result float seems a bit half-done to me🙂.

I don’t think sin result needs float dyna range, especially in your case (i suspect coordinates of rotated quads). Scaled integer result would be suited, can only take less mem by construction and be faster afterwards, but you’d need to change the rest like you did for the angle.

For 2D I cant think of a good reason to use floats except for physics. Maybe.

If you’re feeding the coords to a gpu, the hardware can rotate by itself.

And yes, I know I’m a bit late to the party🙂

Interesting findings JF. I wonder how this approach would stack up? http://lab.polygonal.de/2007/07/18/fast-and-accurate-sinecosine-approximation/

Looking forward to the 2D game😉

I just tried it, and it’s ~0.05ms, and looking at the code I think it could be made branchless quite easily. The maximum error is 0.0010904001917918036, which compares to a 64-entries LUT with linear interpolation. Not bad, I like the idea!

Am new to this. So this is probably a noob question: In that interpolation, shouldn’t it be a Value*3/4 ? How is it a (Value & 3)/4.0 . Thanks.

That’s a good question! ANDing the angle with 3 gives 4 possibilities for every value: 0, 1, 2 and 3. Then, dividing the result by 4 gives 0.0, 0.25, 0.50 and 0.75, which is used to linearly interpolate between sample points, which are 4 values appart (since that particular example was using a 16384 entries LUT and I use 65536 angles, 65536/16384=4). Is it clearer in your mind?

Thanks. Yes it is.

Using angles is so convenient in my case, mainly for network replication: it’s far easier to replicate angles than vectors, and it keeps bandwidth low, especially when replicating thousands of entities.

You can get away with a table that has about half of the precision that you want to end up with: Split the angle in half by bits (big = angle >> half, small = angle & (1<<half – 1)) and use a lookup table for the top half. The bottom half is tiny, so you can approximate sin(x)=x and cos(x)=1. Then use the sum formulas: sin(x+y) = sin(x)*cos(y) + cos(x)*sin(y) or sin(big+small) ~= sin(big) + cos(big)*small. If you always need sin and cos of the same angles, this doesn't even add any extra table lookups.

Why are you still using the old and slow FPU unit and not the SSE instruction set?

Take a look at http://gruntthepeon.free.fr/ssemath/

Possibly because most peoples phones don’t have an x86 compatible processor? The most common platforms for simple games these days are not PCs, so it’s probably not a good idea to be too dependent on non-portable techniques.

I turned off sse2 code generation on purpose to make the generated code easier to understand for people without deep knowledge of assembler. The main purpose of the assembly dump was to show how faster it could be compared to the ~100 cycles fsin instruction, without pretending to be the fastest existing method.

The first thing that comes to my mind, when talking about coordination transformations is the Cordic method: http://en.wikipedia.org/wiki/Cordic.

That is the way to go if you want speed and small footprint.

Thanks for sharing this, it’s a nice piece of work.

Absolutely amazing, the first complete implementation of a sin/cos LUT I found on the web that actually works AND works for wrapping AND for negative numbers. You are a genius and just increased the framerate in my current project from 15 to 30 FPS. WONDERFUL!

Glad it helped, Alexander!

There is an Intel approximate math library that you could use. Still I am interested in what LUT’s could do in terms of high speed AI.