Denormals

Denormals (or subnormals) are floating-point numbers too small to be stored in the normalised format: for single precision, non-zero values with a magnitude below about 1.18e-38. Many CPUs, under many different conditions, slow down considerably when processing them. The slowdown can be as much as a factor of 100!
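To make the definition concrete, here is a minimal C sketch that tests whether a single-precision value is a denormal. The helper name `is_denormal` is made up for illustration; the sketch assumes a C99 compiler, whose `<math.h>` provides `fpclassify` and `FP_SUBNORMAL`.

```c
#include <math.h>   /* fpclassify, FP_SUBNORMAL (C99) */

/* Hypothetical helper: returns 1 if x is a denormal (subnormal), else 0.
 * Zero, normal numbers, infinities and NaNs all return 0. */
int is_denormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

For example, `is_denormal(1e-39f)` is true, while `is_denormal(1.0f)` and `is_denormal(0.0f)` are false.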

What can be done?

There are a few things we can do which may solve this problem. They vary in efficacy depending on the CPU.

fast-math	pass `-ffast-math` to GCC.
SSE	pass `-msse -mfpmath=sse` to GCC.
SSE DAZ	pass `-msse -mfpmath=sse` to GCC and switch the CPU to denormals-are-zero mode.
SSE FTZ	pass `-msse -mfpmath=sse` to GCC and switch the CPU to flush-to-zero mode.
SSE fast-math	pass `-ffast-math -msse -mfpmath=sse` to GCC.
SSE DAZ fast-math	do SSE fast-math and switch the CPU to denormals-are-zero mode.
SSE FTZ fast-math	do SSE fast-math and switch the CPU to flush-to-zero mode.

Finally, it is usually possible to fix the plugin code so that it does not generate denormals.
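One common way to do this, sketched here as an assumption rather than as a recipe from the tests on this page, is to snap any feedback state to zero once it decays below a tiny threshold. The filter, the function name `one_pole` and the threshold `TINY` are all hypothetical choices for illustration.

```c
#include <math.h>   /* fabsf */

/* Hypothetical threshold: far below anything audible, but far above
 * the single-precision denormal range (below about 1.18e-38). */
#define TINY 1e-20f

/* One-pole low-pass filter; *state holds the previous output.
 * With silent input the state decays geometrically, which is exactly
 * how feedback loops drift into the denormal range. */
float one_pole(float in, float coeff, float *state)
{
    float out = in + coeff * (*state - in);
    if (fabsf(out) < TINY)
        out = 0.0f;   /* snap to zero before it can become a denormal */
    *state = out;
    return out;
}
```

The cost is one comparison per sample. Whether that is cheaper than relying on CPU modes depends on the plugin and the target CPU.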

What works?

The following rules appear to hold:

All CPUs are slower when processing denormals with GCC’s default flags.
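Most CPUs get back to full speed with `-msse -mfpmath=sse -ffast-math`; in the results below, the 32-bit Xeon (built with GCC 3.3.2) and the P3 do not.
All CPUs on which DAZ could be tested get back to full speed with `-msse -mfpmath=sse` and DAZ set.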
Some CPUs get back to full speed with `-ffast-math` only, some do not.
Some CPUs get back to full speed with `-msse -mfpmath=sse` and FTZ set, some do not.
The VIA Nehemiah has no serious problem with denormals no matter what GCC flags or CPU modes are used.
The P3 always has problems with denormals no matter what GCC flags or CPU modes are used.
Fixing the plugin code will always work, if done right; but it may not be easy.

To summarise the summary

Pass `-msse -mfpmath=sse` to GCC when building your plugins for distribution, unless you want to support 1999-ish-era CPUs, in which case build and distribute two versions: one with SSE and one without.
Pass `-msse -mfpmath=sse -ffast-math` to GCC if you do not mind what `-ffast-math` does to your FP code.
If you are writing a host, set the CPU to denormals-are-zero mode if the CPU supports it.
If you want to get it right from the start, fix your plugin so it does not generate denormals in the first place. My plugin torture tester may help you with this.

Setting FTZ and DAZ

Andrew Belt reports that the following piece of code will disable denormals with Linux GCC and MinGW:

#include <xmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

and the following with clang on OS X:

#include <fenv.h>
fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
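The first snippet above sets only flush-to-zero mode, and both snippets are bare statements that need to live inside a function. The following is a minimal sketch of a helper that sets both FTZ and DAZ, assuming an x86 CPU with SSE and the GCC or clang intrinsic headers. The name `enable_ftz_daz` is hypothetical.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: sets the FTZ and DAZ bits in the MXCSR register.
 * MXCSR is per-thread state, so call this at the start of every thread
 * that does audio processing.
 * Assumption: the CPU supports DAZ. Writing the DAZ bit on a CPU that
 * lacks it is documented to raise a fault, so check support first if
 * very old CPUs matter to you. */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* denormal results become 0 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* denormal inputs are read as 0 */
}
```

These modes affect SSE arithmetic only, which is why they are paired with `-msse -mfpmath=sse` throughout this page (64-bit x86 builds use SSE for floating point by default).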

Are you sure about all this?

Well, maybe.

I wrote a small, simple test program. It fills a 256-sample buffer with 1s and then multiplies the buffer by some value x, repeating this 10 million times. It compares the time taken when x is 1 against the time taken when x is 1e-39. It performs this test using various GCC flags and other settings.
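The original test program is not reproduced here. The following is a rough sketch of what such a measurement might look like, assuming that each pass multiplies every sample of the buffer by x and stores the result in a second buffer. The function name `time_multiply` and the use of `clock()` are choices made for this sketch.

```c
#include <time.h>   /* clock, CLOCKS_PER_SEC */

#define BUF_LEN 256

/* Hypothetical benchmark: multiplies a buffer of 1s by x, `passes`
 * times over, and returns the CPU time used in seconds. */
double time_multiply(float x, long passes)
{
    /* volatile stops the compiler from hoisting or deleting the multiplies */
    volatile float in[BUF_LEN], out[BUF_LEN];
    for (int i = 0; i < BUF_LEN; i++)
        in[i] = 1.0f;

    clock_t start = clock();
    for (long p = 0; p < passes; p++)
        for (int i = 0; i < BUF_LEN; i++)
            out[i] = in[i] * x;   /* with x = 1e-39f, x and the result are denormal */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The slowdown factor would then be `time_multiply(1e-39f, n) / time_multiply(1.0f, n)`, with `n` set to 10000000 to match the description above. Build it once per flag combination to compare them.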

This program was run on several different platforms. The results are shown below. Numbers are the approximate factors by which the denormal test is slower than the normal test.

CPU	GCC	No flags	fast-math	SSE	SSE DAZ	SSE FTZ	SSE fast-math	SSE DAZ fast-math	SSE FTZ fast-math
64-bit Core i3	4.6.1	7	1	7	1	7	1	1	1
64-bit Phenom II X6 1090T	4.4.5	8	1	8	1	1	1	1	1
64-bit Phenom II X4 940	4.6.2	8	1	8	1	1	1	1	1
64-bit Atom	4.6.1	12	1	11	1	11	1	1	1
64-bit Athlon 64 X2	4.6.1	8	1	8	1	1	1	1	1
32-bit Xeon	3.3.2	113	110	57	1	54	57	1	55
32-bit Core 2 Duo	3.4.5	40	40	8	1	11	1	1	1
32-bit Athlon 64 X2	4.1.2	6	6	9	1	1	1	1	1
32-bit Pentium 3	4.1.2	12	12	6	N/A	4	4	N/A	4
32-bit VIA Nehemiah	4.1.2	1.2	1.3	1.5	N/A	1.4	1.4	N/A	1.4