After reading a post on fast integer square roots on Robert Basler’s blog I got curious about performance of those algorithms on modern hardware. The most famous is of course the “Quake 3 invsqrt”. I’ve also tested two algorithms mentioned on Stack Overflow.
My microbenchmark approach was basically to compute sum of square roots from 1 to 1e9 into a UINT64 variable. loops is 1e9.
// math.h double sqrt
QueryPerformanceCounter(&pc1);
UINT64 uu = 0;
for (UINT32 i = 0; i < loops; i++)
{
uu += (UINT64)sqrt((double)i);
}
QueryPerformanceCounter(&pc2);
elapsed = (pc2.QuadPart-pc1.QuadPart)/(double)freq.QuadPart;
wprintf(L"Sqrt double: %.3fms (%u)\n", elapsed*1000.0, uu);
Results from my laptop (DualCore Intel Core i7 620M, 2666 MHz (20 x 133), times in milliseconds):
Long double sum of sqrts: 21081851051977.781000
32bit:
Sqrt float: 17741.012 (1652390483)
Sqrt double: 16131.129 (1651579237)
Fast uint sqrt: 20723.943 (1651579237)
Fast ulong sqrt: 93737.070 (1651579237)
Fast int sqrt: 27791.603 (2558293500)
Fast Q3 float sqrt: 20919.534 (3883084482)
Fast Q3 double sqrt: 21357.330 (3882261863)
64bit:
Sqrt float: 7135.535 (1652359905)
Sqrt double: 12613.416 (1651579237)
Fast uint sqrt: 23390.084 (1651579237)
Fast ulong sqrt: 32878.867 (1651579237)
Fast int sqrt: 24430.502 (2558293500)
Fast Q3 float sqrt: 5352.906 (3883079922)
Fast Q3 double sqrt: 15216.076 (3882257735)
Compiled as speed-optimized release using VC2010. Enabling SSE/SSE2 didn’t produce noticeable differences. Using fast floating-point model over precise or strict does improve built-in sqrt function performance (as well as Q3′s invsqrt) but I chose to stick to precise model since it’s the default. static_cast had the same performance as C-style casts. Numbers in parentheses are integer sqrt sums, as you can see they vary wildly due to different accuracy of the algorithms.
64bit executable is pretty much always faster, especially for floats. Casting to UINT64 is slower on 32bits of course, but using UINT32 doesn’t change much (tested that as well, minimal differences). Built-in sqrt is still fastest except for float version of Q3′s invsqrt, which is blazingly fast and still very accurate.
~ ~ ~
Also, a small quiz: why the times increase twofold on FP sqrts if I remove the (UINT64) cast in the inner loop (hint: look at the sum)?
(32bit, but it's similar for 64bit)
uu += sqrt((float)i);
Sqrt float: 41906.894 (1652612419)
~ ~ ~
To check how much casting the results to integer affects performance I also did test without it: sqrt functions returning double were summed to a double variable (same with floats). Results below:
Long double sum of sqrts: 21081851051977.781000
32bit:
Sqrt float: 12666.459 (549755813888.000000)
Sqrt double: 12647.830 (21081851051977.781000)
Fast uint sqrt: 23757.851 (1651579237)
Fast ulong sqrt: 92842.952 (1651579237)
Fast int sqrt: 28243.427 (2558293500)
Fast Q3 float sqrt: 11077.491 (549755813888.000000)
Fast Q3 double sqrt: 9390.088 (21062606916280.066000)
64bit:
Sqrt float: 7151.176 (549755813888.000000)
Sqrt double: 12596.555 (21081851051977.781000)
Fast uint sqrt: 20426.025 (1651579237)
Fast ulong sqrt: 35248.122 (1651579237)
Fast int sqrt: 24476.170 (2558293500)
Fast Q3 float sqrt: 3429.053 (549755813888.000000)
Fast Q3 double sqrt: 9192.112 (21062606916106.129000)
Integer sqrts are unaffected since they didn’t have casts anyway. Built-in sqrt is much faster now, but it’s outperformed by the invsqrt for doubles (and floats on x64)! Something to consider if you don’t need very accurate numbers (it’s still pretty damn accurate).

C++ sqrt performance comparison
32-native: 32bit executable with sqrts summed to native types (without casts)
32-uint64: 32bit executable with sqrts summed to UINT64
~ ~ ~
I wouldn’t be myself if I didn’t test the same algorithms in C#. Both “fast uint” and “fast ulong” algorithms translate without changes. Q3′s invsqrt just needs unsafe context to use pointers for bit conversions (using BitConverter is slooow). Results? Might surprise you
(using .NET 4.0)
32bit summing to UINT64:
Sqrt double: 42939 (21081351068005)
Fast uint sqrt: 26769 (21081851045129)
Fast ulong sqrt: 120578 (21081851045129)
Fast Q3 float sqrt: 58303 (21062106914641)
Fast Q3 double sqrt: 60314 (21061877534026)
64bit summing to UINT64:
Sqrt double: 12597 (21081351068005)
Fast uint sqrt: 32374 (21081851045129)
Fast ulong sqrt: 24520 (21081851045129)
Fast Q3 float sqrt: 17643 (21062107732210)
Fast Q3 double sqrt: 6810 (21061877534026)
32bit summing to native types:
Sqrt double: 25200 (21081851051977.8)
Fast uint sqrt: 26084 (2151556361)
Fast ulong sqrt: 120248 (21081851045129)
Fast Q3 float sqrt: 38560 (5.497558E+11)
Fast Q3 double sqrt: 42891 (21062377537121.3)
64bit summing to native types:
Sqrt double: 13952 (21081851051977.8)
Fast uint sqrt: 28848 (2151556361)
Fast ulong sqrt: 24573 (21081851045129)
Fast Q3 float sqrt: 17068 (5.497558E+11)
Fast Q3 double sqrt: 5147 (21062377537121.3)

C# sqrt performance comparison
~ ~ ~
And as an appendix, source of all the algorithms used:
uint32_t FastSqrtUint(uint32_t a_nInput) // c++
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}
uint64_t FastSqrtUlong(uint64_t a_nInput)
{
uint64_t op = a_nInput;
uint64_t res = 0;
uint64_t one = 4611686018427387904ul; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}
#define SQRT_MAGIC_F 0x5f3759df
#define SQRT_MAGIC_D 0x5fe6ec85e7de30da
inline float invSqrtF(const float x)
{
const float xhalf = 0.5f*x;
union // get bits for floating value
{
float x;
int i;
} u;
u.x = x;
u.i = SQRT_MAGIC_F - (u.i >> 1); // gives initial guess y0
return u.x*(1.5f - xhalf*u.x*u.x);// Newton step, repeating increases accuracy
}
inline float fastSqrt_Q3(const float x)
{
return x * invSqrtF(x);
}
inline double invSqrtD(const double x)
{
const double xhalf = 0.5F*x;
union // get bits for floating value
{
double x;
long i;
} u;
u.x = x;
u.i = SQRT_MAGIC_D - (u.i >> 1); // gives initial guess y0
return u.x*(1.5F - xhalf*u.x*u.x);// Newton step, repeating increases accuracy
}
inline double fastSqrt_Q3D(const double x)
{
return x * invSqrt(x);
}
int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};
int fisqrt(int val)
{
int cnt=0;
int t=val;
while (t)
{
cnt++;
t>>=1;
}
if (6>=cnt)
t=(val<<(6-cnt));
else
t=(val>>(cnt-6));
return (ftbl[cnt]*ftbl2[t&31])>>15;
}
static uint FastSqrtUint(uint a_nInput) // C#
{
unchecked
{
uint op = a_nInput;
uint res = 0;
uint one = 1 << 30;
// The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
/* Do arithmetic rounding to nearest integer */
if (op > res)
{
res++;
}
return res;
}
}
static ulong FastSqrtUlong(ulong a_nInput)
{
unchecked
{
ulong op = a_nInput;
ulong res = 0;
ulong one = 1 << 62;
// The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
/* Do arithmetic rounding to nearest integer */
if (op > res)
{
res++;
}
return res;
}
}
unsafe static float InvSqrtF(float x)
{
float xhalf = 0.5f * x;
int i = *(int*)&x;
i = 0x5f3759df - (i >> 1);
x = *(float*)&i;
x = x * (1.5f - xhalf * x * x);
return x;
}
public static float SqrtF(float x)
{
return x * InvSqrtF(x);
}
unsafe static double InvSqrtD(double x)
{
double xhalf = 0.5 * x;
long i = *(long*)&x;
i = 0x5fe6ec85e7de30da - (i >> 1);
x = *(double*)&i;
x = x * (1.5 - xhalf * x * x);
return x;
}
public static double SqrtD(double x)
{
return x * InvSqrtD(x);
}
Posted by omeg in C#, C++, code, performance