Double vs. Float, Which is Better?
There are three floating point types in C and C++: float, double, and long double.
What the Standard Has to Say
There are exactly two guarantees provided by the standard:
- precision(float) <= precision(double) <= precision(long double). We are guaranteed that float has no more precision than double, and that double has no more precision than long double. It is possible for all three to be the same data type in a given implementation.
- The default floating point type is double; that is, an unsuffixed constant such as 3.3 has type double.
  - If you want a float constant, you must specify this with the f suffix.
  - Similarly, a long double constant must be specified with the L suffix.
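Both guarantees can be checked at compile time. The following is a minimal sketch, assuming a C++11 or later compiler; suffix_changes_value is a hypothetical helper name introduced only for this illustration.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Guarantee 1: precision never decreases from float to double to long double.
// digits is the number of significand digits in the type's radix.
static_assert(std::numeric_limits<float>::digits <= std::numeric_limits<double>::digits,
              "float has no more precision than double");
static_assert(std::numeric_limits<double>::digits <= std::numeric_limits<long double>::digits,
              "double has no more precision than long double");

// Guarantee 2: the suffix, or its absence, selects the type of a constant.
static_assert(std::is_same<decltype(3.3), double>::value, "unsuffixed constant is double");
static_assert(std::is_same<decltype(3.3f), float>::value, "f suffix gives float");
static_assert(std::is_same<decltype(3.3L), long double>::value, "L suffix gives long double");

// Hypothetical helper (not from the standard library): reports whether the
// float constant 3.3f holds a different value than the double constant 3.3.
bool suffix_changes_value()
{
    const double from_float = 3.3f;  // widening float to double is exact
    const double from_double = 3.3;
    // Sanity check: converting back to float must reproduce the original constant.
    assert(static_cast<float>(from_float) == 3.3f);
    return from_float != from_double;
}
```

If the file compiles, both guarantees hold on that implementation. On platforms where float is narrower than double, suffix_changes_value() should return true, because 3.3 has no exact binary representation and is rounded differently in each type.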
Guidance Provided by Stroustrup
"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best." [emphasis added]
Standard Library Implementation
Where there is only one version of a standard library floating point operation, the library defaults to working with double. This includes functions such as strtod. In C89 the only data type supported by all math.h functions was double.
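To illustrate the library's preference for double, here is a small sketch assuming C++11 or later. parse_and_sqrt is a hypothetical helper name; strtod, strtof, and sqrt are the real standard library functions. Note that strtof only arrived with C99 and C++11, while strtod has been there from the start.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <type_traits>

// strtod returns double; the float variant strtof was added later (C99 / C++11).
static_assert(std::is_same<decltype(std::strtod("3.3", nullptr)), double>::value,
              "strtod returns double");
static_assert(std::is_same<decltype(std::strtof("3.3", nullptr)), float>::value,
              "strtof returns float");

// Hypothetical helper: parse text as a double and return its square root,
// staying in double from parsing through the math call.
double parse_and_sqrt(const char* text)
{
    char* end = nullptr;
    const double value = std::strtod(text, &end);
    assert(end != text && "no number could be parsed");
    return std::sqrt(value);  // selects the double overload
}
```

Calling parse_and_sqrt("6.25") goes through the double versions of both functions without any conversion in between.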
On modern hardware, double performs at least as well as float in every case. At the higher optimization levels, long double even outperforms float.
The test code was compiled with the command line
Type name: f Size in bytes: 4 Summation time in s: 2.82 summed value: 6.71089e+07 // float
Type name: d Size in bytes: 8 Summation time in s: 2.78585 summed value: 6.6e+09 // double
Type name: e Size in bytes: 16 Summation time in s: 2.76812 summed value: 6.6e+09 // long double
The test code was:
#include <chrono>
#include <iostream>
#include <typeinfo>

template <typename T>
T sum(int num_times, T value)
{
    T val = 0;  // accumulator of the type under test
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_times; ++i)
        val += value;
    std::chrono::high_resolution_clock::duration d = std::chrono::high_resolution_clock::now() - t1;
    std::cout << "Type name: " << typeid(T).name() << " Size in bytes: " << sizeof(T) << " Summation time in s: " << std::chrono::duration_cast<std::chrono::duration<double>>(d).count();
    return val;
}

int main()
{
    // sum() prints its own timing text. Since C++17 the " summed value: " label is written before sum() runs; earlier standards leave the order unspecified.
    std::cout << " summed value: " << sum<float>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<double>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<long double>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<unsigned char>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<short>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<long>(2000000000, 3.3) << std::endl;
    std::cout << " summed value: " << sum<long long>(2000000000, 3.3) << std::endl;
}
double should be your preferred floating point type in nearly every situation.
- It's faster.
- It's the default in C and C++.
- It's more portable and the default across all C and C++ library functions.
- It has significantly higher precision. The float answer above is outright wrong (off by 2 orders of magnitude): my simple 3.3 summation example quickly loses precision. Even in a smaller case, summing the value 3.3 2000 times gives 6599.89 when using float instead of the correct answer of 6600.
- Stroustrup recommends it.
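The precision point is easy to reproduce at small scale. The sketch below is not the benchmark code; repeated_sum and error_after_2000_additions are hypothetical helper names, and it assumes float and double arithmetic are carried out in their own precision (as with SSE2 on x86-64).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: add value to an accumulator of type T, count times.
template <typename T>
T repeated_sum(int count, T value)
{
    assert(count >= 0);
    T total = 0;
    for (int i = 0; i < count; ++i)
        total += value;  // each addition rounds to the precision of T
    return total;
}

// Absolute distance between repeated_sum<T>(2000, 3.3) and the exact answer 6600.
template <typename T>
double error_after_2000_additions()
{
    const T total = repeated_sum<T>(2000, static_cast<T>(3.3));
    return std::fabs(static_cast<double>(total) - 6600.0);
}
```

Comparing error_after_2000_additions<float>() with error_after_2000_additions<double>() shows how much further the float accumulator drifts from 6600 after only 2000 additions.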
There is exactly one case where you should use float instead of double:
- It's smaller. On 64-bit hardware with a modern gcc, double is 8 bytes and float is 4 bytes.
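The size difference only matters once you store many values. As a rough sketch, with payload_bytes and bytes_saved_with_float as hypothetical helper names and the sizes taken from whatever sizeof reports on your platform:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes occupied by the elements of a vector
// (ignores the vector's own bookkeeping and any spare capacity).
template <typename T>
std::size_t payload_bytes(const std::vector<T>& samples)
{
    return samples.size() * sizeof(T);
}

// Hypothetical helper: bytes saved by storing count samples as float instead of double.
std::size_t bytes_saved_with_float(std::size_t count)
{
    const std::vector<float> narrow(count);
    const std::vector<double> wide(count);
    assert(payload_bytes(narrow) <= payload_bytes(wide));
    return payload_bytes(wide) - payload_bytes(narrow);
}
```

With 4-byte float and 8-byte double, one million samples would save about 4 MB, which is the kind of saving that has to be weighed against the precision loss described above.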
Similarly, there is exactly one time you would need to use long double:
- It's very high precision. You might have an application that needs the most precise floating point type you can get today.
Finally, and perhaps most importantly: 64-bit floating point has been the standard supported in Intel-compatible CPUs supporting the SSE2 instruction set since 2001. On most platforms double uses this same 64-bit SSE2 (IEEE 754 double-precision) type.
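Whether double really is the 64-bit IEEE type on a given platform can be checked rather than assumed. A minimal sketch, with double_is_binary64 and require_binary64 as hypothetical helper names:

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical helper: true when double looks like the 64-bit IEEE 754 format,
// that is, IEEE-conforming, 53 significand bits, and 64 bits of storage.
bool double_is_binary64()
{
    return std::numeric_limits<double>::is_iec559
        && std::numeric_limits<double>::digits == 53
        && sizeof(double) * CHAR_BIT == 64;
}

// Hypothetical startup check for code that depends on that layout.
void require_binary64()
{
    assert(double_is_binary64() && "double is not IEEE 754 64-bit on this platform");
}
```

On mainstream x86-64 and ARM64 toolchains I would expect double_is_binary64() to return true, but the point of the check is that you do not have to take that on faith.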