Archive

Archive for the ‘Mathematics’ Category

Trigonometric Look-Up Tables Revisited

December 6, 2011 28 comments

Introduction

In my spare time, I’m working on a 2D game that relies on huge amount of entities being updated and displayed every frame, at 60FPS. This requires to have a permanent eye on overall performances.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms calling trigonometric functions per frame, essentially sin and cos to update the transformation matrix of the relevant entities (approximately 50K entities).

According to this document, x86 instructions fsin and fcos takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.

Solution

The solution that came to my mind was to use a look-up table (LUT) for sin and cos. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.

Representing Angles
To avoid a float to integer conversion whenever I need to lookup sines, I quickly converted all my angles in the game to an unsigned 16-bits integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles gets naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a logical and with 65535 to remap the values into this range.

LUT Size
At first, I started with a 16384 entries LUT because I wanted to minimize errors. The sin function looked like this:

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);  // Map the angle to [0..16383]
    return SinTable[Index & 16383];   // Lookup angle
}

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

LUT Steps
I immediately saw a problem with this approach when running the game though: as the LUT was 4 times smaller than the angles range, the resulting value would change only every 4 angle steps, creating a “stepcase” effect (exaggerated to better illustrate the problem):


(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);                  // Map the angle to [0..16383]
    const tFloat A = SinTable[Index & 16383];         // Lookup angle
    const tFloat B = SinTable[(Index + 1) & 16383];   // Lookup angle+1
    const Weight = (Value & 3) / 4.0f;                // Compute weight
    return Weight * (B - A) + A;                      // Linear interpolation
}

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

LUT Size Optimization
With interpolation enabled, I decided to make some tests to see how small the LUT could be without introducing too much errors. Here’s the results:

As we can see, the error is quite low even for a 64 entries LUT.

Further Optimizations

Here’s some other optimizations that I didn’t investigate. They could be interesting on target platforms that have low memory or if we want to minimize cache misses.

Using Half-Precision Floats
Using half-precision floats can indeed reduce the table size by 2 without sacrificing too much performances, depending on the platform. In fact, this is true for any other types of memory-hungry data, like animations, etc.

Minimal LUT Encoding
It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.

Using Smooth Interpolation
Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would requires a lot more cycles to execute.

Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize>; struct TrigLookup
{
    tFloat SinTable[SinTableSize];

    TrigLookup()
    {
        CHECK(IsPowerOfTwo(SinTableSize));

        for(tU32 i=0; i<SinTableSize; i++)
        {
            SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2);
        }
    }

    FORCEINLINE tFloat Lookup(tU16 Value)
    {
        const tU32 Divisor = (65536 / SinTableSize);
        const tU32 Index = Value / Divisor;
        const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)];
        const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)];
        const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor;
        return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA;
    }

    FORCEINLINE tFloat Sin(tU16 Value)
    {
        return Lookup(Value);
    }

    FORCEINLINE tFloat Cos(tU16 Value)
    {
        return Lookup(Value + 16384);
    }

    FORCEINLINE tFloat Tan(tU16 Value)
    {
        return Lookup(Value) / Lookup(Value + 16384);
    }
};

And here’s the assembly dump of the Lookup function, which is roughly ~30 cycles on x86 and might be hand-optimized, if someone is not as lazy as me:

0126FA53  movzx       ecx,si  
0126FA56  mov         eax,ecx  
0126FA58  shr         eax,8  
0126FA5B  mov         edx,eax  
0126FA5D  inc         eax  
0126FA5E  and         edx,0FFh  
0126FA64  and         eax,0FFh  
0126FA69  fld         dword ptr [esp+edx*4+40h]  
0126FA6D  fld         dword ptr [esp+eax*4+40h]  
0126FA71  fsub        st,st(1)  
0126FA73  movzx       eax,cl  
0126FA76  mov         dword ptr [esp+3Ch],eax  
0126FA7A  fild        dword ptr [esp+3Ch]  
0126FA7E  fmul        dword ptr [__real@3b800000 (12D4FC8h)]  
0126FA84  fmulp       st(1),st  
0126FA86  faddp       st(1),st  
Advertisements
Categories: Mathematics, Programming Tags:

Using the Moving Average Filter for Profiling Stats

October 19, 2011 Leave a comment

Introduction
When displaying profiling stats or FPS counters or whatever value that varies a lot frame to frame, you don’t want to show the real values since they can vary too much and make the stats unreadable.

Displaying an average over a certain period of time is a good idea. But how to compute that average?

Simple Moving Average
A common solution is to keep the n last values, and then compute the average of these n values to display. This is called a simple moving average and is computed like this:

tFloat MeanValue = 0.0f;
for(tU32 i=0; i<n; i++)
    MeanValue += RecentValues[i];
MeanValue /= n;

It works fine in some scenarios, like displaying a FPS counter. On the other hand, it’s not very appropriate when a lot of values needs to be computed, like for profiling stats, as the amount of data can become too large and the time spent updating the values can become too high.

Exponential Moving Average
Another solution that I’ve been using for years now is the exponential moving average. Essentially, it’s moving the current mean value toward the actual value every iteration, by a certain percentage. It’s computed like this:

MeanValue = MeanValue * 0.99f + CurrentValue * 0.01f;

The advantage of this solution is that you don’t need to keep the n last values, as opposed to the simple moving average.

Comparison
Here’s a graph comparing the two solutions:

20111019-133655.jpg

As you can see at the beginning of the graph, the EMA takes some time to react to big changes. This is easily fixed by initializing the mean value to the first value, but wasn’t done here to show how it reacts to big changes in the source data.

Conslusion
Compared to the SMA, the EMA is closer to the real averaged value, has less variations and takes less memory and CPU, which makes it a good function to compute and display averaged values in a profiling system.

Categories: Mathematics, Programming Tags:

Generating primes

September 27, 2011 1 comment

Here’s my implementation of the sieve of Eratosthenes in pyhton:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv

    if is_prime == True:
        print cur_num
        multiples[cur_num] = cur_num

    cur_num = cur_num + 1

The difference with the pseudocode found on the wikipedia page is that it doesn’t try to reject numbers on a fixed range, it just compute every occurrence of multiples of known primes at every step. I based my implementation on the fact that every non-prime number is composed of n primes multiplied together where n>1, in a unique way. My previous post on prime numbers used that fact to implement a quick RTTI IsChildOf test.

How does it works? For every prime number encountered, a counter equal to that number is created (in the multiples map). Then, at each iteration, we decrement every counters. When a counter reaches zero, it means that the current number can be divided by this prime number, thus it is not prime. When all counters are different from zero, the current number must be a prime, since there’s no known prime divisor to compose it.

A nice property of this implementation is that when we encounter a non-prime number, all counters that are currently at zero represents all the primes that are used to compose the current non-prime number. Here’s the updated code:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv
            
    if is_prime == True:
        print cur_num, "[prime]"
        multiples[cur_num] = cur_num
    else:
        primes = list()
        for v in multiples:
            if multiples[v] == v:
                primes.append(v)
        print cur_num, "[nonprime] ", primes

    cur_num = cur_num + 1

Which outputs this:

...
83 [prime]
84 [nonprime]  [2, 3, 7]
85 [nonprime]  [5, 17]
86 [nonprime]  [2, 43]
87 [nonprime]  [3, 29]
88 [nonprime]  [2, 11]
89 [prime]
90 [nonprime]  [2, 3, 5]
91 [nonprime]  [7, 13]
92 [nonprime]  [2, 23]
93 [nonprime]  [3, 31]
94 [nonprime]  [2, 47]
95 [nonprime]  [5, 19]
96 [nonprime]  [2, 3]
97 [prime]
...

This is not the best implementation for this algorithm, as it was mainly done to test the idea I had in mind and to learn python. The main problem is as we search for higher primes, speed decrease quite fast due to the number of prime counters to lookup. There are several things we could do, like skip even numbers and updates counters accordingly or write a C++ implementation.

Prime numbers

September 18, 2011 6 comments

Prime numbers are a very interesting topic. I often find myself quite uncomfortable when setting my car’s radio volume on an even number, and I really feel better when it’s a prime number. But this is a complete different story.

So, what are the uses of prime numbers in programming? Cryptography, random numbers generation, better hashing functions, etc. One thing that I recently discovered about prime numbers is that if you multiply n different primes together, it creates a special composite number (if anyone knows the real name of those, please tell me) that has a quite interesting property: there’s a way to know if a certain prime p was used or not to compose it simply by dividing by p. If the result is an integer, then p was used to compose the number. Take this example:

    3 * 7 * 13 * 53 * 227 = 3284463

To know if prime number 19 was used to compose it, we do:

    3284463 / 19 = 172866.474

Since the result is not an integer, 19 wasn’t used to compose the number 3284463. But, if we take a look at:

    3284463 / 3 = 1094821
    3284463 / 7 = 469209
    3284463 / 13 = 252651
    3284463 / 53 = 61971
    3284463 / 227 = 14469

A practical usage of this property is for a RTTI system. By assigning a different prime number to each class in the hierarchy, and by multiplying each child’s class value by it’s parent value, we can determine if a class A derives from class B simply by dividing its composite number by B‘s number, instead of going through all the parents chain:

Bool Class::IsChildOf(const Class* B) const
{
    return ((Prime % B->Prime) == 0);
}
Categories: Prime numbers Tags: ,
%d bloggers like this: