Why Does Changing 0.1f to 0 Slow Down Performance by 10x

Why does changing 0.1f to 0 slow down performance by 10x?

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point values can be tens to hundreds of times slower than on normalized ones. This is because many processors can't handle them directly and must trap and resolve them using microcode.
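To make the "extra values very close to zero" concrete, here is a minimal sketch that labels a float using only the standard std::fpclassify and std::numeric_limits facilities. The helper names (classify_float, smallest_normal, smallest_subnormal) are made up for this illustration, and the sketch assumes IEEE 754 single precision with no flush-to-zero mode enabled.

```cpp
#include <cmath>
#include <limits>
#include <string>

// Hypothetical helper: names the IEEE 754 category of a float.
std::string classify_float(float v) {
    switch (std::fpclassify(v)) {
        case FP_ZERO:      return "zero";
        case FP_SUBNORMAL: return "subnormal";
        case FP_NORMAL:    return "normal";
        case FP_INFINITE:  return "infinite";
        default:           return "nan";
    }
}

// Smallest positive normal float (about 1.18e-38 for IEEE 754 binary32).
inline float smallest_normal() {
    return std::numeric_limits<float>::min();
}

// Smallest positive subnormal float (about 1.4e-45 for IEEE 754 binary32).
inline float smallest_subnormal() {
    return std::numeric_limits<float>::denorm_min();
}
```

Halving smallest_normal() already lands in the subnormal range, which is the region the benchmark below drifts into.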

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

#include <cstdlib>   // system
#include <iostream>
#include <omp.h>     // omp_get_wtime; build with OpenMP enabled

using namespace std;

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            // Each step shrinks y[i] a little, since x[i] < z[i].
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007

//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.


To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669

// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Different float values in array impact performance by 10x - why?

This is because you are hitting denormal numbers (also see this question).

You can get rid of denormals like so:

#include <cmath>

// [...]

for (int i = 0; i < 5; i++) {
    damping[i] = damping[i] * modeDampingTermsExp2[i];
    if (std::fpclassify(damping[i]) == FP_SUBNORMAL) {
        damping[i] = 0; // Treat denormals as 0.
    }

    float cosT = 2 * damping[i];

    for (int m = 0; m < 5; m++) {
        curSample += cosT;
    }
}
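The same check can be pulled out into a small standalone helper. This is only a sketch: the names flush_subnormal and flush_subnormals are hypothetical and not part of any library, and unlike a hardware flush-to-zero mode this version always returns positive zero.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns 0 for subnormal inputs, the input otherwise.
inline float flush_subnormal(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL ? 0.0f : v;
}

// Applies flush_subnormal to every element of an array in place.
inline void flush_subnormals(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = flush_subnormal(data[i]);
    }
}
```

Calling it once per outer iteration keeps subnormals out of the inner loop, at the cost of one classification per element.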

Effect of date type precision on performance

The choice of precision in a chrono::duration is a trade-off between precision and range, and has no impact on performance.

The chrono-supplied clocks each have a "native precision" documented by their nested duration type, and that is what it is (can not be changed by the client). If you desire a time_point or duration different than that (after calling now()) the cost is a single multiplication or division to get your desired precision. And once you have your desired precision, there is no further cost in using that precision.

The higher the precision, generally the smaller your range. There is no overflow protection unless you are using a custom Rep which supplies such checking. You can check your range with the static duration::min()/max() member functions.

A source of run-time error can come about in converting a coarser duration with a very large but in-range value to a finer precision, which results in overflow at the finer precision. For example, if you have more than 292 years' worth of microseconds and convert that to nanoseconds, you will get overflow.
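The 292-year figure can be checked with duration::max(). The sketch below assumes std::chrono::nanoseconds uses a 64-bit signed representation, which is typical but not guaranteed. The years_t alias and both function names are invented for this example.

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>

// Hypothetical "years" unit: 365.2425 days, i.e. 31,556,952 seconds.
using years_t = std::chrono::duration<std::int64_t, std::ratio<31556952>>;

// Whole years representable by std::chrono::nanoseconds.
inline std::int64_t nanosecond_range_in_years() {
    using namespace std::chrono;
    return duration_cast<years_t>(nanoseconds::max()).count();
}

// Hypothetical guard: true if converting `us` to nanoseconds cannot overflow.
inline bool fits_in_nanoseconds(std::chrono::microseconds us) {
    using namespace std::chrono;
    // Truncating the nanosecond limit to microseconds gives a safe bound.
    const microseconds limit = duration_cast<microseconds>(nanoseconds::max());
    return us <= limit && us >= -limit;
}
```

A guard like this runs once at the conversion boundary. After the conversion, arithmetic at the finer precision costs the same as at any other precision.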

Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?

On most x86 systems, the cause of slowness is that denormal values trigger an FP_ASSIST which is very costly as it switches to a micro-code flow (very much like a fault).

See, for example:
https://software.intel.com/en-us/forums/intel-performance-bottleneck-analyzer/topic/487262

The reason why this is the case, is probably that the architects decided to optimize the HW for normal values by speculating that each value is normalized (which would be more common), and did not want to risk the performance of the frequent use case for the sake of rare corner cases. This speculation is usually true, so you only pay the penalty when you're wrong. These trade-offs are very common in CPU design since any investment in one case usually adds an overhead on the entire system.

In this case, if you were to design a system that tries to optimize all types of irregular FP values, you would have to either add HW to detect and record the state of each value after each operation (which would be multiplied by the number of physical FP registers, execution units, RS entries and so on, totaling a significant number of transistors and wires), or add some mechanism to check the value on read, which would slow you down when reading any FP value (even the normal ones).

Furthermore, based on the type, you would need to perform some correction or not - on x86 this is the purpose of the assist code, but if you did not make a speculation, you would have to perform this flow conditionally on each value, which would already add a large chunk of that overhead on the common path.

Why comparing a small floating-point number with zero yields random result?

Barring the undefined behavior, which can easily be fixed, you're seeing the effect of denormal numbers. They're extremely slow (see Why does changing 0.1f to 0 slow down performance by 10x?), so modern FPUs usually provide denormals-are-zero (DAZ) and flush-to-zero (FTZ) flags to control the denormal behavior. When DAZ is set, denormals compare equal to zero, which is what you observed.

Currently you'll need platform-specific code to disable it. Here's how it's done in x86:

#include <string.h>
#include <stdio.h>
#include <pmmintrin.h>

int main(void){
    int i = 12;
    float f;
    memcpy(&f, &i, sizeof i); // avoid UB

    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    printf("%e %s 0\n", f, (f == 0)? "=": "!=");

    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
    printf("%e %s 0\n", f, (f == 0)? "=": "!=");

    return 0;
}

Output:

0.000000e+00 = 0
1.681558e-44 != 0

Demo on Godbolt

See also:

  • flush-to-zero behavior in floating-point arithmetic
  • Disabling denormal floats at the code level
  • Setting the FTZ and DAZ Flags

Why is 0.1f's last binary bit rounded to 1?

Is the computer rounding the last bit based on what would be the next bit when the 23 bits of the mantissa are used?

Yes, of course. By default, the compiler and the floating-point arithmetic system try to give you correctly rounded results.

As an analogy, if I asked you to write 2/3 to three decimal places, would you answer with 0.666 or 0.667? It should be 0.667, because it's closer to the true answer.
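The rounding can be seen by inspecting the stored bits. This is a small sketch assuming a 32-bit IEEE 754 float. The helper name float_bits is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw bit pattern of a float (assumes a 32-bit IEEE 754 float).
inline std::uint32_t float_bits(float v) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);  // memcpy avoids aliasing problems
    return bits;
}

// float_bits(0.1f) is 0x3DCCCCCD. The binary expansion of 0.1 repeats
// ...1100 1100..., so plain truncation would have stored 0x3DCCCCCC;
// rounding to nearest bumps the last mantissa bit up to 1.
```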

C# Denormalized Floating Point: is zero literal 0.0f slow?

There's nothing special about denormals that makes them inherently slower than normalized floating point numbers. In fact, a FP system which only supported denormals would be plenty fast, because it would essentially only be doing integer operations.

The slowness comes from the relative difficulty of certain operations when performed on a mix of normals and denormals. Adding a normal to a denormal is much trickier than adding a normal to a normal, or adding a denormal to a denormal. The machinery of computation is simply more involved and requires more steps. Because most of the time you're only operating on normals, it makes sense to optimize for that common case, and drop into the slower and more generalized normal/denormal implementation only when that doesn't work.

The exception to denormals being unusual, of course, is 0.0, which is a denormal with a zero mantissa. Because 0 is the sort of thing one often finds and does operations on, and because an operation involving a 0 is trivial, those are handled as part of the fast common case.

I think you've misunderstood what's going on in the answer to the question you linked. The 0 isn't by itself making things slow: despite being technically a denormal, operations on it are fast. The denormals in question are the ones stored in the y array after a sufficient number of loop iterations. The advantage of the 0.1 over the 0 is that, in that particular code snippet, it prevents numbers from becoming nonzero denormals, not that it's faster to add 0.1 than 0.0 (it isn't).
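A rough way to see this is to run the benchmark's update on a single element and count how often the result is a nonzero denormal. The function name count_subnormal_steps is hypothetical, and the sketch assumes IEEE 754 floats, round-to-nearest, and no flush-to-zero mode or fast-math flags.

```cpp
#include <cmath>

// Runs y = y * x / z, then y = (y + c) - c, for `iters` iterations on one
// element and counts how many iterations end with a subnormal y.
inline int count_subnormal_steps(float x, float z, float c, int iters) {
    volatile float y = x;  // volatile: round to float after every step
    int subnormal_steps = 0;
    for (int j = 0; j < iters; ++j) {
        y = y * x;
        y = y / z;
        y = y + c;
        y = y - c;
        const float cur = y;
        if (std::fpclassify(cur) == FP_SUBNORMAL) {
            ++subnormal_steps;
        }
    }
    return subnormal_steps;
}
```

With x = 1.1f and z = 1.123f, passing c = 0.0f lets y decay into the denormal range and stay there. Passing c = 0.1f rounds every result to a multiple of the spacing between floats near 0.1, so y is either exactly zero or far above the denormal range.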

Why some arithmetic operations take more time than usual?

Denormal (or rather subnormal) numbers are often a performance hit. Slowly converging to 0, per your second example, will generate more subnormals. Read more here and here. For more serious reading, check out the oft-cited (and very dense) What Every Computer Scientist Should Know About Floating-Point Arithmetic.

From the second source:

Under IEEE-754, floating point numbers are represented in binary as:

Number = signbit * mantissa * 2^exponent

There are potentially multiple ways of representing the same number,
using decimal as an example, the number 0.1 could be represented as
1*10^-1 or 0.1*10^0 or even 0.01*10^1. The standard dictates that the
numbers are always stored with the first bit as a one. In decimal that
corresponds to the 1*10^-1 example.

Now suppose that the lowest exponent that can be represented is -100.
So the smallest number that can be represented in normal form is
1*10^-100. However, if we relax the constraint that the leading bit be
a one, then we can actually represent smaller numbers in the same
space. Taking a decimal example we could represent 0.1*10^-100. This
is called a subnormal number. The purpose of having subnormal numbers
is to smooth the gap between the smallest normal number and zero.

It is very important to realise that subnormal numbers are represented
with less precision than normal numbers. In fact, they are trading
reduced precision for their smaller size. Hence calculations that use
subnormal numbers are not going to have the same precision as
calculations on normal numbers. So an application which does
significant computation on subnormal numbers is probably worth
investigating to see if rescaling (i.e. multiplying the numbers by
some scaling factor) would yield fewer subnormals, and more accurate
results.
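The loss of precision is easy to observe. In this sketch, halving and then doubling a value is exact for an ordinary float but not for a tiny subnormal. The function names are invented for the example, and it assumes IEEE 754 binary32 with round-to-nearest-even and no flush-to-zero mode.

```cpp
#include <limits>

// Halves and then doubles a value, rounding to float after each step.
inline float halve_then_double(float v) {
    volatile float t = v / 2.0f;  // volatile: force rounding to float here
    t = t * 2.0f;
    return t;
}

// Three times the smallest positive subnormal: only two significant bits.
inline float three_denorm_min() {
    return 3.0f * std::numeric_limits<float>::denorm_min();
}

// halve_then_double(3.0f) returns 3.0f exactly. For three_denorm_min(),
// the half (1.5 units) is not representable and rounds to 2 units, so
// the round trip returns 4 units instead of 3.
```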

I was thinking about explaining it myself, but the explanation above is extremely well written and concise.

Why does MSVS not optimize away +0? [duplicate]

The compiler cannot eliminate the addition of a floating-point positive zero because it is not an identity operation. By IEEE 754 rules, the result of adding +0. to −0. is not −0.; it is +0.

The compiler may eliminate the subtraction of +0. or the addition of −0. because those are identity operations.

For example, when I compile this:

double foo(double x) { return x + 0.; }

with Apple GNU C 4.2.1 using -O3 on an Intel Mac, the resulting assembly code contains addsd LC0(%rip), %xmm0. When I compile this:

double foo(double x) { return x - 0.; }

there is no add instruction; the assembly merely returns its input.
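The IEEE 754 rule behind this difference can also be checked at run time. The sketch below uses hypothetical helper names and assumes the default round-to-nearest mode. The volatile variables are there only to keep the compiler from folding the arithmetic away.

```cpp
#include <cmath>

// True if v + (+0.0) has the same sign bit as v.
inline bool add_pos_zero_keeps_sign(double v) {
    volatile double x = v;
    volatile double r = x + 0.0;
    const double result = r;
    return std::signbit(result) == std::signbit(v);
}

// True if v - (+0.0) has the same sign bit as v.
inline bool sub_pos_zero_keeps_sign(double v) {
    volatile double x = v;
    volatile double r = x - 0.0;
    const double result = r;
    return std::signbit(result) == std::signbit(v);
}
```

Only the input -0.0 distinguishes the two: adding +0.0 turns it into +0.0, while subtracting +0.0 leaves it as -0.0.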

So, it is likely the code in the original question contained an add instruction for this statement:

y[i] = y[i] + 0;

but contained no instruction for this statement:

y[i] = y[i] - 0;

However, the first statement involved arithmetic with subnormal values in y[i], so it was sufficient to slow down the program.


