Trigonometric Look-Up Tables Revisited
Introduction
In my spare time, I’m working on a 2D game that relies on huge amount of entities being updated and displayed every frame, at 60FPS. This requires to have a permanent eye on overall performances.
For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms calling trigonometric functions per frame, essentially sin and cos to update the transformation matrix of the relevant entities (approximately 50K entities).
According to this document, x86 instructions fsin and fcos takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.
Solution
The solution that came to my mind was to use a look-up table (LUT) for sin and cos. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.
Representing Angles
To avoid a float to integer conversion whenever I need to lookup sines, I quickly converted all my angles in the game to an unsigned 16-bits integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles gets naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a logical and with 65535 to remap the values into this range.
LUT Size
At first, I started with a 16384 entries LUT because I wanted to minimize errors. The sin function looked like this:
FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] return SinTable[Index & 16383]; // Lookup angle }
The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.
LUT Steps
I immediately saw a problem with this approach when running the game though: as the LUT was 4 times smaller than the angles range, the resulting value would change only every 4 angle steps, creating a “stepcase” effect (exaggerated to better illustrate the problem):
(This graph was generated with the excellent Inigo Quilez’s Graph Toy).
The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).
Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):
FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] const tFloat A = SinTable[Index & 16383]; // Lookup angle const tFloat B = SinTable[(Index + 1) & 16383]; // Lookup angle+1 const Weight = (Value & 3) / 4.0f; // Compute weight return Weight * (B - A) + A; // Linear interpolation }
With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!
LUT Size Optimization
With interpolation enabled, I decided to make some tests to see how small the LUT could be without introducing too much errors. Here’s the results:
As we can see, the error is quite low even for a 64 entries LUT.
Further Optimizations
Here’s some other optimizations that I didn’t investigate. They could be interesting on target platforms that have low memory or if we want to minimize cache misses.
Using Half-Precision Floats
Using half-precision floats can indeed reduce the table size by 2 without sacrificing too much performances, depending on the platform. In fact, this is true for any other types of memory-hungry data, like animations, etc.
Minimal LUT Encoding
It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.
Using Smooth Interpolation
Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would requires a lot more cycles to execute.
Conclusion
Here’s the source code I used for this post, if anyone’s interested in using it:
template<tU32 SinTableSize>; struct TrigLookup { tFloat SinTable[SinTableSize]; TrigLookup() { CHECK(IsPowerOfTwo(SinTableSize)); for(tU32 i=0; i<SinTableSize; i++) { SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2); } } FORCEINLINE tFloat Lookup(tU16 Value) { const tU32 Divisor = (65536 / SinTableSize); const tU32 Index = Value / Divisor; const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)]; const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)]; const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor; return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA; } FORCEINLINE tFloat Sin(tU16 Value) { return Lookup(Value); } FORCEINLINE tFloat Cos(tU16 Value) { return Lookup(Value + 16384); } FORCEINLINE tFloat Tan(tU16 Value) { return Lookup(Value) / Lookup(Value + 16384); } };
And here’s the assembly dump of the Lookup function, which is roughly ~30 cycles on x86 and might be hand-optimized, if someone is not as lazy as me:
0126FA53 movzx ecx,si 0126FA56 mov eax,ecx 0126FA58 shr eax,8 0126FA5B mov edx,eax 0126FA5D inc eax 0126FA5E and edx,0FFh 0126FA64 and eax,0FFh 0126FA69 fld dword ptr [esp+edx*4+40h] 0126FA6D fld dword ptr [esp+eax*4+40h] 0126FA71 fsub st,st(1) 0126FA73 movzx eax,cl 0126FA76 mov dword ptr [esp+3Ch],eax 0126FA7A fild dword ptr [esp+3Ch] 0126FA7E fmul dword ptr [__real@3b800000 (12D4FC8h)] 0126FA84 fmulp st(1),st 0126FA86 faddp st(1),st
Using the Moving Average Filter for Profiling Stats
Introduction
When displaying profiling stats or FPS counters or whatever value that varies a lot frame to frame, you don’t want to show the real values since they can vary too much and make the stats unreadable.
Displaying an average over a certain period of time is a good idea. But how to compute that average?
Simple Moving Average
A common solution is to keep the n last values, and then compute the average of these n values to display. This is called a simple moving average and is computed like this:
tFloat MeanValue = 0.0f; for(tU32 i=0; i<n; i++) MeanValue += RecentValues[i]; MeanValue /= n;
It works fine in some scenarios, like displaying a FPS counter. On the other hand, it’s not very appropriate when a lot of values needs to be computed, like for profiling stats, as the amount of data can become too large and the time spent updating the values can become too high.
Exponential Moving Average
Another solution that I’ve been using for years now is the exponential moving average. Essentially, it’s moving the current mean value toward the actual value every iteration, by a certain percentage. It’s computed like this:
MeanValue = MeanValue * 0.99f + CurrentValue * 0.01f;
The advantage of this solution is that you don’t need to keep the n last values, as opposed to the simple moving average.
Comparison
Here’s a graph comparing the two solutions:
As you can see at the beginning of the graph, the EMA takes some time to react to big changes. This is easily fixed by initializing the mean value to the first value, but wasn’t done here to show how it reacts to big changes in the source data.
Conslusion
Compared to the SMA, the EMA is closer to the real averaged value, has less variations and takes less memory and CPU, which makes it a good function to compute and display averaged values in a profiling system.
Generating primes
Here’s my implementation of the sieve of Eratosthenes in pyhton:
cur_num = 2 multiples = {2 : 2} while True: is_prime = True for v in multiples: nv = multiples[v] - 1 if nv == 0: is_prime = False multiples[v] = v else: multiples[v] = nv if is_prime == True: print cur_num multiples[cur_num] = cur_num cur_num = cur_num + 1
The difference with the pseudocode found on the wikipedia page is that it doesn’t try to reject numbers on a fixed range, it just compute every occurrence of multiples of known primes at every step. I based my implementation on the fact that every non-prime number is composed of n primes multiplied together where n>1, in a unique way. My previous post on prime numbers used that fact to implement a quick RTTI IsChildOf test.
How does it works? For every prime number encountered, a counter equal to that number is created (in the multiples map). Then, at each iteration, we decrement every counters. When a counter reaches zero, it means that the current number can be divided by this prime number, thus it is not prime. When all counters are different from zero, the current number must be a prime, since there’s no known prime divisor to compose it.
A nice property of this implementation is that when we encounter a non-prime number, all counters that are currently at zero represents all the primes that are used to compose the current non-prime number. Here’s the updated code:
cur_num = 2 multiples = {2 : 2} while True: is_prime = True for v in multiples: nv = multiples[v] - 1 if nv == 0: is_prime = False multiples[v] = v else: multiples[v] = nv if is_prime == True: print cur_num, "[prime]" multiples[cur_num] = cur_num else: primes = list() for v in multiples: if multiples[v] == v: primes.append(v) print cur_num, "[nonprime] ", primes cur_num = cur_num + 1
Which outputs this:
... 83 [prime] 84 [nonprime] [2, 3, 7] 85 [nonprime] [5, 17] 86 [nonprime] [2, 43] 87 [nonprime] [3, 29] 88 [nonprime] [2, 11] 89 [prime] 90 [nonprime] [2, 3, 5] 91 [nonprime] [7, 13] 92 [nonprime] [2, 23] 93 [nonprime] [3, 31] 94 [nonprime] [2, 47] 95 [nonprime] [5, 19] 96 [nonprime] [2, 3] 97 [prime] ...
This is not the best implementation for this algorithm, as it was mainly done to test the idea I had in mind and to learn python. The main problem is as we search for higher primes, speed decrease quite fast due to the number of prime counters to lookup. There are several things we could do, like skip even numbers and updates counters accordingly or write a C++ implementation.
Prime numbers
Prime numbers are a very interesting topic. I often find myself quite uncomfortable when setting my car’s radio volume on an even number, and I really feel better when it’s a prime number. But this is a complete different story.
So, what are the uses of prime numbers in programming? Cryptography, random numbers generation, better hashing functions, etc. One thing that I recently discovered about prime numbers is that if you multiply n different primes together, it creates a special composite number (if anyone knows the real name of those, please tell me) that has a quite interesting property: there’s a way to know if a certain prime p was used or not to compose it simply by dividing by p. If the result is an integer, then p was used to compose the number. Take this example:
- 3 * 7 * 13 * 53 * 227 = 3284463
To know if prime number 19 was used to compose it, we do:
- 3284463 / 19 = 172866.474
Since the result is not an integer, 19 wasn’t used to compose the number 3284463. But, if we take a look at:
- 3284463 / 3 = 1094821
3284463 / 7 = 469209
3284463 / 13 = 252651
3284463 / 53 = 61971
3284463 / 227 = 14469
A practical usage of this property is for a RTTI system. By assigning a different prime number to each class in the hierarchy, and by multiplying each child’s class value by it’s parent value, we can determine if a class A derives from class B simply by dividing its composite number by B‘s number, instead of going through all the parents chain:
Bool Class::IsChildOf(const Class* B) const { return ((Prime % B->Prime) == 0); }